Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy via Spherical Harmonics¶
Conference: CVPR 2026 · arXiv: 2603.23227 · Code: https://github.com/zql-kk/E3Flow
Area: 3D Vision / Robot Manipulation
Keywords: SE(3) Equivariance, Spherical Harmonics, Rectified Flow, Robot Policy Learning, Multi-Modal Fusion
TL;DR¶
This paper proposes E3Flow, the first equivariant flow matching policy framework based on spherical harmonic representations. It introduces a Feature Enhancement Module (FEM) to dynamically fuse point cloud and image modalities, and combines rectified flow for efficient equivariant action generation. E3Flow achieves an average success rate 3.12% higher than the strongest baseline SDP across 8 MimicGen tasks while delivering a 7× speedup in inference.
Background & Motivation¶
Diffusion Policy has demonstrated strong performance in robot policy learning, yet faces two core challenges:
Low data efficiency: It relies on large amounts of high-quality expert demonstrations, which are costly to collect.
Bottlenecks in equivariant methods: Embedding symmetry priors can substantially improve data efficiency and generalization, but existing equivariant diffusion policies suffer from:
- Computational intensity: many iterative denoising steps compound the already high cost of equivariant networks.
- Unimodal dependency: relying solely on point clouds or images, lacking fine-grained visual detail.
- Incompatibility with fast sampling: directly applying one-step diffusion or flow matching to equivariant policies leads to instability and performance degradation.
Core gap: No existing method unifies the data efficiency of equivariant learning with the inference speed of flow matching.
As illustrated in Figure 1(a), when a tabletop object is rotated to an unseen pose, non-equivariant DP fails to grasp it, whereas the equivariant E3Flow succeeds by correspondingly rotating the original trajectory.
Method¶
Overall Architecture¶
The E3Flow pipeline (Figure 2) proceeds as follows:
- Multimodal encoding: Eye-in-hand camera images → ResNet extracts invariant features; single-view point cloud → EquiformerV2 extracts equivariant features.
- FEM fusion: Image semantics are injected into the spherical harmonic representation of the point cloud.
- Spherical harmonic conditioning: Fused visual features + proprioceptive state are mapped into spherical harmonic space.
- Rectified flow generation: An ODE solver efficiently generates equivariant action sequences.
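The four pipeline stages above can be sketched end to end. This is a minimal toy sketch of the data flow only, not the paper's implementation: the encoder and FEM stand-ins are random-feature placeholders, and all dimensions (`D_IMG`, `D_SCALAR`, `D_HIGH`, `D_ACT`) are illustrative assumptions. The one faithful detail is the Euler integration of a rectified-flow velocity field over 10 steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes, chosen only for illustration.
D_IMG, D_SCALAR, D_HIGH, D_ACT = 32, 16, 24, 10  # 10 = 3D pos + 6D rot + gripper

def encode_image(img):
    """Stand-in for the ResNet image encoder (invariant features)."""
    return rng.standard_normal(D_IMG)

def encode_pointcloud(pcd):
    """Stand-in for EquiformerV2: scalar (Type-0) + higher-order features."""
    return rng.standard_normal(D_SCALAR), rng.standard_normal(D_HIGH)

def fem_fuse(f_scalar, f_img, f_high):
    """Stand-in FEM: image semantics touch only the scalar channel."""
    gate = 1.0 / (1.0 + np.exp(-f_scalar.mean()))      # toy gating
    fused_scalar = f_scalar + gate * f_img[:D_SCALAR]  # toy injection
    return np.concatenate([fused_scalar, f_high])      # higher-order untouched

def rectified_flow_sample(cond, steps=10):
    """Euler integration of a toy conditional velocity field."""
    x = rng.standard_normal(D_ACT)   # x_0 ~ noise
    a = cond[:D_ACT]                 # toy action target derived from condition
    for k in range(steps):
        t = k / steps
        v = (a - x) / (1.0 - t)      # toy marginal velocity toward a
        x = x + v / steps
    return x

img, pcd = None, None                # placeholder inputs
f_img = encode_image(img)
f_scalar, f_high = encode_pointcloud(pcd)
cond = fem_fuse(f_scalar, f_img, f_high)
action = rectified_flow_sample(cond)
print(action.shape)  # (10,)
```

Even in this toy setting, the structural point survives: the image branch only ever touches the scalar slot of the condition, and sampling is a fixed small number of deterministic Euler steps rather than a long stochastic denoising chain.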
Key Designs¶
- Spherical harmonic visual representation: strict SO(3) equivariance is achieved via spherical harmonic functions.
  - Spherical harmonics \(Y_l^m(\theta,\phi)\) form an orthogonal basis on the unit sphere. Under a rotation \(R\), harmonics of the same order \(l\) mix linearly via the Wigner D-matrix: \(Y_l^m(R^{-1}\hat{\mathbf{r}}) = \sum_{m'} D_{mm'}^{(l)}(R) Y_l^{m'}(\hat{\mathbf{r}})\)
  - EquiformerV2 encodes point clouds into multi-order spherical harmonic features (scalar \(f_{pcd}^{(0)}\) plus higher-order \(f_{pcd}^{(>0)}\)), preserving fine-grained directional and rotational detail.
  - Compared with EquiBot's vector-neuron approach, EquiformerV2's higher-order coefficient encoding captures more precise directional information.
  - Compared with SDP's sparse-point-cloud-only input, this work introduces hybrid visual input.
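The mixing rule above can be verified numerically for \(l = 1\), where the real spherical harmonics are, up to a constant, just the components \((y, z, x)\) of the direction, so the Wigner D-matrix is the rotation matrix in a permuted basis. A small self-contained check (shown in the active convention \(Y(R\hat{\mathbf{r}}) = D^{(1)}(R)\,Y(\hat{\mathbf{r}})\), which matches the formula above up to the \(R\) vs. \(R^{-1}\) convention):

```python
import numpy as np

# Real l=1 spherical harmonics in (m=-1, 0, +1) ordering are, up to a
# constant, (y, z, x) of the unit direction.  P permutes (x, y, z) -> (y, z, x).
P = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
C1 = np.sqrt(3.0 / (4.0 * np.pi))  # standard l=1 normalization constant

def sh_l1(v):
    """Real degree-1 spherical harmonics of a unit vector v."""
    return C1 * (P @ v)

def rotation_z(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def rotation_x(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

R = rotation_z(0.7) @ rotation_x(0.3)     # an arbitrary composite rotation
v = np.array([0.3, -0.5, 0.8])
v /= np.linalg.norm(v)

D1 = P @ R @ P.T                 # Wigner D-matrix for l=1 (real basis)
lhs = sh_l1(R @ v)               # evaluate harmonics at the rotated direction
rhs = D1 @ sh_l1(v)              # mix the original harmonics with D^(1)
print(np.allclose(lhs, rhs))     # True: l=1 features rotate equivariantly
```

Higher orders behave the same way with \((2l+1)\times(2l+1)\) mixing blocks, which is exactly the block structure EquiformerV2's features inherit.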
- Feature Enhancement Module (FEM): cross-modal dynamic fusion.
  - Image features are injected exclusively into the scalar (Type-0) component, preserving the equivariance of the higher-order components.
  - Core formula: \(f_{fused} = \Pi[\Lambda(\mathcal{A}(f_{pcd}^{(0)}, f_{img}), f_{pcd}^{(0)}) \| f_{pcd}^{(>0)}]\)
    - \(\mathcal{A}\): cross-modal attention, using point-cloud scalar features as queries and image features as keys/values.
    - \(\Lambda\): a gating mechanism that adaptively balances the image contribution, in contrast to the performance drop caused by naive concatenation (79.00% → 72.36%).
    - \(\Pi\): projection back into spherical harmonic space.
  - Design motivation: naively concatenating features from different modalities disrupts the equivariant structure; FEM operates exclusively in the invariant (Type-0) subspace, elegantly preserving equivariance.
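A minimal numpy sketch of the FEM recipe, under stated assumptions: the attention, the sigmoid gate, and all weight shapes are illustrative choices, not the paper's exact parameterization (the projection \(\Pi\) is omitted for brevity). What the sketch does demonstrate faithfully is the key structural property: only the Type-0 slot is modified, so a transformation that acts on the higher-order features passes through the module untouched.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv, Wq, Wk, Wv):
    """A(.): point-cloud scalar tokens attend over image tokens."""
    Q, K, V = q @ Wq, kv @ Wk, kv @ Wv
    att = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return att @ V

def fem(f0, f_img, f_high, Wq, Wk, Wv, Wg):
    """Sketch of FEM: attend, gate, concatenate; only Type-0 is modified."""
    a = cross_attention(f0, f_img, Wq, Wk, Wv)
    gate = 1.0 / (1.0 + np.exp(-(np.concatenate([a, f0], axis=-1) @ Wg)))
    fused0 = gate * a + (1.0 - gate) * f0              # Lambda: adaptive balance
    return np.concatenate([fused0, f_high], axis=-1)   # "||" with f_pcd^(>0)

# Toy shapes: 4 point-cloud scalar tokens, 6 image tokens, feature dim 8.
d = 8
f0 = rng.standard_normal((4, d))      # Type-0 (rotation-invariant) features
f_img = rng.standard_normal((6, d))   # image features (invariant by design)
f_high = rng.standard_normal((4, d))  # stand-in higher-order features
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
Wg = rng.standard_normal((2 * d, d))

out = fem(f0, f_img, f_high, Wq, Wk, Wv, Wg)
# Rotating the scene transforms only f_high; f0 and f_img are invariant, so
# the fused Type-0 block is identical and the higher-order block transforms
# exactly as it did before fusion -- equivariance is preserved.
out_rot = fem(f0, f_img, 2.0 * f_high, Wq, Wk, Wv, Wg)  # toy "transformed" f_high
print(np.allclose(out[:, :d], out_rot[:, :d]))  # True: fused scalars unchanged
```

This is why naive concatenation fails while FEM does not: concatenation mixes invariant image channels into equivariant slots, whereas here the image pathway never touches anything outside the invariant subspace.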
- Equivariant rectified flow: efficient action generation.
  - Learns a straight-line interpolation path from the noise distribution to the action distribution: \(x_t = (1-t)x_0 + ta\)
  - Training loss: \(\mathcal{L}_{RF}(\theta) = \mathbb{E}_{t,x_0,a}[\|v_\theta(x_t,t,s,v) - (a-x_0)\|^2]\)
  - Since the velocity-field network is equivariant, it satisfies \(v_\theta(\rho*x_t, t, \rho*s, \rho*v) = \rho * v_\theta(x_t, t, s, v)\)
  - The training target \(a - x_0\) is a linear transformation of equivariant actions, and the loss is invariant under group actions; therefore an equivariant optimal solution exists.
  - The default 10-step sampling yields an inference time of 0.51 s, 7× faster than SDP (DDPM) at 3.73 s.
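The invariance claim above can be checked numerically. The snippet below is a toy verification on the 3D position component only (the actual policy acts on the full pose sequence): the regression target \(a - x_0\) transforms with the rotation, and because rotations are isometries, the squared-error loss is unchanged when every quantity is rotated.

```python
import numpy as np

rng = np.random.default_rng(2)

def rotation_z(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

R = rotation_z(1.1)
x0 = rng.standard_normal(3)       # noise sample
a = rng.standard_normal(3)        # expert action (position component)
t = 0.4
x_t = (1 - t) * x0 + t * a        # straight-line interpolation

# The interpolation commutes with rotation (linearity):
assert np.allclose((1 - t) * (R @ x0) + t * (R @ a), R @ x_t)

# The regression target a - x0 is itself equivariant:
target = a - x0
assert np.allclose(R @ a - R @ x0, R @ target)

# Rotations are isometries, so the flow-matching loss is invariant: an
# equivariant v_theta maps rotated inputs to R @ v_pred, leaving the loss
# value identical for the rotated episode.
v_pred = rng.standard_normal(3)   # an arbitrary network prediction
loss = np.sum((v_pred - target) ** 2)
loss_rot = np.sum((R @ v_pred - R @ target) ** 2)
print(np.isclose(loss, loss_rot))  # True
```

This is the whole compatibility argument in miniature: nothing about the rectified-flow objective breaks equivariance, so an equivariant network can minimize it without modification.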
Loss & Training¶
- Optimizer: AdamW, learning rate \(1 \times 10^{-4}\), batch size 64.
- EMA decay rate: 0.95.
- Trained for 500 epochs on a single NVIDIA H20 GPU.
- Evaluated every 20 epochs over 50 episodes; maximum success rate is reported.
- Action representation: 3D position + 6D rotation + 1D gripper state (position and rotation are equivariant quantities).
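The 6D rotation component is typically decoded back to a rotation matrix by Gram–Schmidt orthogonalization (the continuous representation of Zhou et al.); the paper does not spell out its decoder, so treat this as the standard choice rather than a confirmed detail:

```python
import numpy as np

def rot6d_to_matrix(r6):
    """Recover a rotation matrix from a 6D representation (the first two
    columns of R) via Gram-Schmidt, per Zhou et al., 'On the Continuity of
    Rotation Representations in Neural Networks'."""
    a1, a2 = r6[:3], r6[3:]
    b1 = a1 / np.linalg.norm(a1)          # first column: normalize
    b2 = a2 - np.dot(b1, a2) * b1         # second: remove b1 component
    b2 /= np.linalg.norm(b2)
    b3 = np.cross(b1, b2)                 # third: right-handed completion
    return np.stack([b1, b2, b3], axis=1)

R = rot6d_to_matrix(np.array([1.0, 0.2, -0.1, 0.0, 1.0, 0.3]))
print(np.allclose(R.T @ R, np.eye(3)))    # True: orthonormal
print(np.isclose(np.linalg.det(R), 1.0))  # True: proper rotation
```

The 6D form matters for equivariant policies because it is continuous (unlike quaternions or Euler angles) and, being two columns of \(R\), transforms linearly under rotation, so it slots directly into the equivariant action space.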
Key Experimental Results¶
Main Results¶
MimicGen 8-task success rates (Table 1, 100 expert demonstrations; four representative task columns shown, Avg. computed over all eight tasks)
| Method | Equivariance | Coffee_D2 | Nut_Asm | Square | Stack3 | Avg. |
|---|---|---|---|---|---|---|
| DP | None | 44 | 54 | 10 | 32 | 47.50 |
| EquiDiff (voxel) | SO(2) | 65 | 67 | 39 | 76 | 68.50 |
| SDP (DDPM) | SE(3) | 63 | 92 | 64 | 98 | 75.88 |
| E3Flow | SE(3) | 64 | 94 | 70 | 100 | 79.00 |
Inference time (Table 2)
| Method | Avg. Inference Time (s) | Relative to E3Flow |
|---|---|---|
| EquiBot | 2.03 | 4.0× |
| DP | 0.95 | 1.9× |
| EquiDiff (img) | 2.51 | 4.9× |
| EquiDiff (voxel) | 1.10 | 2.2× |
| SDP (DDPM) | 3.73 | 7.3× |
| SDP (DDIM) | 0.46 | 0.9× |
| E3Flow | 0.51 | 1.0× |
Note: Accelerating SDP with DDIM reduces its success rate by 6.13% (75.88 → 69.75), whereas E3Flow achieves higher success at comparable speed.
Ablation Study¶
Component analysis (Table 4)
| Input | Fusion | Generation | Avg. Success Rate |
|---|---|---|---|
| PCD | — | RF | 75.88 |
| PCD | — | Diffusion | 75.23 |
| PCD+Img | cat | RF | 72.36 |
| PCD+Img | FEM | Diffusion | 77.58 |
| PCD+Img | FEM | RF | 79.00 |
Flow method comparison (Table 5)
| Method | Steps | Inference Time | Avg. Success Rate |
|---|---|---|---|
| MeanFlow | 1 | 0.17s | 54.50 |
| AlphaFlow | 1 | 0.17s | 64.62 |
| RF-1 | 1 | 0.16s | 69.00 |
| RF-5 | 5 | 0.28s | 71.00 |
| RF-10 | 10 | 0.51s | 79.00 |
Key Findings¶
- Naive multimodal feature concatenation (cat) degrades performance (79.00% → 72.36%), underscoring the importance of modality alignment.
- FEM elegantly resolves the tension between equivariance and multimodal fusion by operating exclusively in the invariant subspace (Type-0 features).
- Equivariant learning offers a pronounced advantage on complex tasks: DP achieves only 10% on Square_D2, while E3Flow reaches 70%.
- One-step sampling is unsuitable for equivariant models (MeanFlow: 54.50%), as a single forward pass is insufficient for highly abstract equivariant features to guide fine-grained actions.
- SE(3) generalization experiments (Table 3): E3Flow comprehensively outperforms SDP in zero-shot tests with 10° tilting perturbations.
- Data efficiency: E3Flow with 100 demonstrations matches the performance of other methods using 200 demonstrations (Figure 5).
Highlights & Insights¶
- First successful unification of equivariant learning and flow matching: The work demonstrates that rectified flow naturally accommodates equivariant networks — since the training target is a linear transformation of equivariant actions, the loss is invariant under group actions.
- Elegant design of FEM: Image semantics are injected only into the Type-0 invariant subspace, leaving higher-order equivariant features intact and resolving the dilemma between multimodal fusion and equivariance preservation.
- In-depth analysis of one-step methods: The paper reveals why equivariant models require multi-step sampling — the highly abstract equivariant features require more steps to be decoded into fine-grained actions.
- End-to-end equivariance proof: The complete equivariant chain from input to output is supported by rigorous mathematical guarantees.
- Practical deployment potential: 0.51s inference time combined with 100-demonstration data efficiency makes the approach suitable for real-world robotic scenarios.
Limitations & Future Work¶
- Although a single EquiformerV2 forward pass is faster than ET-SEED, it remains the inference bottleneck.
- Equivariance is validated only for SE(3); more general symmetry groups (e.g., SIM(3) with scale) are unexplored.
- Downsampling point clouds to 1,024 points may discard geometric detail, potentially insufficient for precision assembly tasks.
- Real-world experiments cover only 4 tasks, limiting the scale of evaluation.
- The sim-to-real gap and domain randomization are not discussed.
- The image encoder (ResNet) in FEM does not use pretrained weights (e.g., CLIP), which may constrain semantic understanding.
Related Work & Insights¶
- E3Flow extends the SDP framework: SDP employs spherical harmonics + diffusion, while E3Flow replaces diffusion with rectified flow and adds image input.
- The comparison with EquiDiff highlights the difference between continuous equivariance (SO(3) spherical harmonics) and discrete equivariance (SO(2) convolution).
- The FEM design paradigm is generalizable to other settings that require injecting invariant information into equivariant representations.
- The application of rectified flow in robot policy learning warrants further investigation, including fewer sampling steps and distillation approaches.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The first unification of equivariant learning and flow matching is a significant contribution, though core components build upon existing methods.
- Experimental Thoroughness: ⭐⭐⭐⭐ — 8 simulation tasks + 4 real-world tasks, with extensive ablations and baseline comparisons.
- Writing Quality: ⭐⭐⭐⭐ — Mathematical derivations are clear and figures are professionally presented.
- Value: ⭐⭐⭐⭐⭐ — Addresses the inference efficiency bottleneck of equivariant policies, offering direct practical value to the robot learning community.