S3E: Self-Supervised State Estimation for Radar-Inertial System¶
Conference: ICCV 2025 · arXiv: 2509.25984 · Code: Not released · Area: 3D Vision · Keywords: millimeter-wave radar, IMU fusion, self-supervised learning, state estimation, radar spectrum
TL;DR¶
S3E is proposed as the first method to achieve complementary self-supervised state estimation from radar signal spectra and inertial data, leveraging a rotation-based cross-fusion technique to enhance spatial structural information under limited angular resolution.
Background & Motivation¶
State of the Field¶
Millimeter-wave radar offers unique reliability under adverse conditions (fog/rain/snow), yet existing approaches face the following challenges:
Sparse point clouds: Point clouds extracted by CFAR detectors are sparse and contaminated by ghost points.
Multipath effects: High reflective power generates false-positive "ghost points."
Limited angular resolution: Single-chip radars carry only a small number of antennas, yielding coarse angular resolution.
Limitations of Prior Work¶
Existing pipelines operate on sparse, noisy CFAR point clouds; the richer range-azimuth spectrum (RAS) should instead serve as the primary representation.
Root Cause¶
Radar provides exteroceptive sensing (compensating IMU drift), while IMU provides proprioceptive sensing (distilling motion-consistent landmarks).
Starting Point¶
The rotational component manifests as a linear shift along the azimuth axis in the RAS.
Method¶
Rotation-based Cross Fusion¶
A core finding is that the displacement of the dominant energy between consecutive RAS frames is determined by the rotational component. Shifting the peak power of frame \(k\) linearly along the azimuth by the rotation angle \(\vartheta\) aligns its peak with that of frame \(k+1\).
The inter-frame rotation \(\boldsymbol{q}_{k+1}^k\) is obtained via IMU pre-integration and used to augment the RAS by applying the corresponding shift along the azimuth axis.
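The core alignment step can be sketched as converting the IMU-derived rotation angle into an azimuth-bin shift and rolling the spectrum. This is a minimal numpy sketch; the bin mapping, interpolation, and field-of-view handling are assumptions, not the paper's exact implementation.

```python
import numpy as np

def shift_ras_by_rotation(ras_k, dtheta_rad, azimuth_fov_rad):
    """Shift a range-azimuth spectrum along the azimuth axis by the
    inter-frame rotation angle (sketch of the cross-fusion idea).

    ras_k           : (n_range, n_azimuth) power spectrum of frame k
    dtheta_rad      : yaw rotation between frames k and k+1 (IMU pre-integration)
    azimuth_fov_rad : total azimuth field of view spanned by the spectrum
    """
    n_az = ras_k.shape[1]
    bins_per_rad = n_az / azimuth_fov_rad
    shift_bins = int(round(dtheta_rad * bins_per_rad))
    # A pure rotation appears as a linear shift along the azimuth axis,
    # so aligning frame k to frame k+1 is a circular roll of the bins.
    return np.roll(ras_k, shift_bins, axis=1)

# Toy check: a peak at azimuth bin 10, rotated by exactly one bin's worth
# of angle, should land at bin 11.
ras = np.zeros((4, 64))
ras[2, 10] = 1.0
fov = np.deg2rad(120)
one_bin = fov / 64
shifted = shift_ras_by_rotation(ras, one_bin, fov)
```

With integer-bin shifts `np.roll` suffices; a real pipeline would likely interpolate fractional shifts.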
Consistent Landmark Extractor¶
A U-Net-based multi-head architecture is employed:

- Location head: outputs detection scores \(L \in \mathbb{R}^{H \times W}\) and extracts sub-pixel positions.
- Score head: produces normalized confidence weights to suppress the influence of ghost points.
- Descriptor head: generates 248-dimensional features for cross-frame landmark matching.
Sub-pixel positions are computed via Softmax weighting: \(u_k = \sum_{(i,j) \in \mathcal{U}_k} u_{ij}\,[\text{Softmax}(L_{\mathcal{U}_k})]_{ij}\)
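The Softmax-weighted expectation above can be sketched directly in numpy. The 3×3 window size and coordinate layout here are illustrative assumptions; only the weighting formula comes from the text.

```python
import numpy as np

def subpixel_position(L_patch, u_coords, v_coords):
    """Softmax-weighted sub-pixel landmark position over a local window
    U_k, i.e. u_k = sum_{(i,j)} u_ij * [Softmax(L)]_ij.

    L_patch            : (h, w) detection scores in the window
    u_coords, v_coords : (h, w) pixel coordinates of the window
    """
    w = np.exp(L_patch - L_patch.max())  # numerically stable softmax
    w /= w.sum()
    return (u_coords * w).sum(), (v_coords * w).sum()

# Toy example: a strongly peaked score map pulls the estimate
# toward the peak at (1, 1).
L = np.array([[0.0, 0.0, 0.0],
              [0.0, 8.0, 0.0],
              [0.0, 0.0, 0.0]])
uu, vv = np.meshgrid(np.arange(3), np.arange(3))
u_hat, v_hat = subpixel_position(L, uu, vv)
```

Because the weights sum to one, the result is a differentiable expectation over pixel coordinates, which is what allows gradients to flow back into the location head.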
Differentiable Velocity Estimation¶
An overdetermined system \(\mathbf{G}\mathbf{v} = \mathbf{B}\) is established from the cosine constraint between the Doppler velocities of static landmarks and the vehicle velocity, and solved via differentiable least squares: \(\mathbf{v} = (\mathbf{G}^\top\mathbf{G})^{-1}\mathbf{G}^\top\mathbf{B}\)
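The closed-form solve is itself built from differentiable operations, which is the key point. A minimal numpy sketch, assuming each row of \(\mathbf{G}\) is the unit direction to a static landmark and \(\mathbf{B}\) stacks the measured radial (Doppler) velocities; the optional confidence weighting is an assumption, not confirmed by the source.

```python
import numpy as np

def ego_velocity_lsq(dirs, doppler, weights=None):
    """Least-squares ego-velocity from Doppler measurements of static
    landmarks: v = (G^T G)^{-1} G^T B.

    dirs    : (n, 2) unit line-of-sight directions to landmarks
    doppler : (n,)   measured radial velocities
    weights : (n,)   optional per-landmark confidences (assumption)
    """
    G = np.asarray(dirs, dtype=float)
    B = np.asarray(doppler, dtype=float)
    if weights is not None:
        w = np.sqrt(np.asarray(weights, dtype=float))[:, None]
        G, B = G * w, B * w[:, 0]
    # Normal equations; every step is differentiable w.r.t. the inputs.
    return np.linalg.solve(G.T @ G, G.T @ B)

# Toy check: a static landmark's Doppler is the projection of the
# ego-velocity onto the line of sight (sign convention is an assumption).
angles = np.deg2rad([-40.0, -10.0, 15.0, 35.0])
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
v_true = np.array([2.0, 0.0])
doppler = dirs @ v_true
v_est = ego_velocity_lsq(dirs, doppler)
```

With more than two well-spread landmarks the system is overdetermined and the normal equations recover the ego-velocity exactly in the noise-free case.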
Self-Supervised Loss¶
Three constraints are jointly optimized:

- Geometric constraint \(\mathcal{L}_1\): landmark pairs satisfy the IMU transformation matrix.
- Kinematic constraint \(\mathcal{L}_2\): static landmarks satisfy the Doppler cosine constraint.
- Velocity alignment \(\mathcal{L}_3\): velocities observed via IMU and radar are consistent with the transformation-derived velocity.
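The three constraints can be sketched as residual terms in one scalar objective. The specific norms, the equal weighting, and the 2D setting below are assumptions for illustration; only the three constraint types come from the source.

```python
import numpy as np

def self_supervised_loss(p_k, p_k1, R, t, dirs, doppler, v_radar, v_imu,
                         w=(1.0, 1.0, 1.0)):
    """Sketch of the three jointly optimized self-supervised terms.

    L1 geometric : matched landmarks satisfy the IMU transform (R, t)
    L2 kinematic : static landmarks satisfy the Doppler cosine constraint
    L3 alignment : radar- and IMU-derived velocities agree
    """
    L1 = np.mean(np.linalg.norm(p_k1 - (p_k @ R.T + t), axis=1))
    L2 = np.mean(np.abs(doppler - dirs @ v_radar))
    L3 = np.linalg.norm(v_radar - v_imu)
    return w[0] * L1 + w[1] * L2 + w[2] * L3

# Toy check: a perfectly consistent frame pair gives zero loss.
p_k = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]])
R, t = np.eye(2), np.zeros(2)
p_k1 = p_k @ R.T + t
angles = np.deg2rad([-30.0, 0.0, 25.0])
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
v = np.array([1.5, 0.0])
loss = self_supervised_loss(p_k, p_k1, R, t, dirs, dirs @ v, v, v)
```

No localization ground truth appears anywhere in the objective: every residual is built from the radar and IMU streams themselves.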
Key Experimental Results¶
ColoRadar Dataset¶
Main Results¶
Translation error (Trans.↓; lower is better):

| Method | longboard | edgar_classroom | outdoors |
|---|---|---|---|
| EKF-RIO | - | 5.32 | 4.65 |
| PG-RIO | - | 2.61 | 7.67 |
| Milliego | 9.14 | 2.32 | 2.02 |
| S3E | 5.69 | best | best |
S3E achieves state-of-the-art or highly competitive performance across most scenarios.
Generalization on Self-Collected Dataset¶
Evaluated on unseen scenes, the self-supervised S3E generalizes better than the supervised Milliego.
Highlights & Insights¶
- First radar spectrum + IMU fusion: Bypasses sparse point clouds by directly utilizing the information-rich RAS.
- Self-supervised without localization ground truth: Supervision is derived from complementary radar–IMU constraints.
- Rotation–spectrum correspondence: Elegantly exploits the physical property that rotation manifests as a linear shift in the RAS.
- Differentiable velocity estimation: End-to-end training is enabled by a differentiable solver from Doppler observations to ego-velocity.
Limitations & Future Work¶
- Applicable only to single-chip low-resolution radar; 4D imaging radar may not require such complex processing.
- IMU pre-integration accuracy degrades in high-speed scenarios.
- The static landmark assumption may be insufficient in highly dynamic environments.
- Large-scale long-term SLAM evaluation has not been conducted.
Related Work & Insights¶
- EKF-RIO, PG-RIO: Model-based radar odometry methods.
- Milliego: Supervised learning-based radar–IMU fusion.
- RadarHD: Radar spectrum enhancement.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ (first self-supervised radar spectrum + IMU fusion)
- Technical Depth: ⭐⭐⭐⭐⭐ (cross-fusion + differentiable velocity estimation + triple self-supervision)
- Experimental Thoroughness: ⭐⭐⭐⭐ (public + self-collected datasets)
- Value: ⭐⭐⭐⭐ (practical solution for navigation under adverse conditions)