TARS: Traffic-Aware Radar Scene Flow Estimation¶
- Conference: ICCV 2025
- arXiv: 2503.10210
- Code: To be confirmed
- Area: Autonomous Driving
- Keywords: Radar scene flow, traffic vector field, point cloud motion estimation, multi-task learning, autonomous driving perception
TL;DR¶
This paper proposes TARS, a traffic-aware radar scene flow estimation method that constructs a Traffic Vector Field (TVF) via joint object detection, capturing rigid-body motion at the traffic level rather than the instance level. TARS surpasses the state of the art by 15% and 23% on the VOD and a proprietary dataset, respectively.
Background & Motivation¶
- Scene flow provides critical motion information for autonomous driving, describing per-point displacement vectors between two consecutive point cloud frames.
- Existing LiDAR scene flow methods exploit an instance-level rigid-body motion assumption: the scene is decomposed into multiple rigidly moving objects and a static background.
- However, instance-level approaches are ill-suited for radar point clouds for three reasons:
- Extreme sparsity: Radar point clouds are an order of magnitude sparser than LiDAR (VOD dataset: ~256 points per frame).
- Lack of shape information: Reliable instance-level matching is infeasible.
- Inter-frame "deformation": Sparsity causes significant variation in the point distribution of the same object across consecutive frames.
- Radar advantages: greater robustness to adverse weather and an order-of-magnitude lower cost.
- Core problem: How to preserve the rigid-body motion assumption while accommodating radar sparsity?
- Proposed solution: Elevate the rigid-body motion assumption from the instance level to the traffic level.
Method¶
Overall Architecture¶
TARS adopts a hierarchical architecture with \(L\) layers, combining two branches:
1. Scene Flow Branch: Hierarchical coarse-to-fine scene flow estimation.
2. Object Detection (OD) Branch: Provides feature maps encoding traffic-level contextual information.
Both branches are jointly trained; feature maps from the OD branch supply traffic-level priors to the scene flow branch.
Input and Encoding¶
- Two input point clouds \(P \in \mathbb{R}^{N \times 5}\) and \(Q \in \mathbb{R}^{M \times 5}\).
- Each point has 5-dimensional features: x, y, z coordinates + Relative Radial Velocity (RRV) + Radar Cross Section (RCS).
- A multi-scale point encoder (PointNet) extracts features.
- Farthest point sampling downsamples the point sets at each layer, yielding multi-scale point set pairs.
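Farthest point sampling (FPS) is the standard greedy procedure: repeatedly add the point that lies farthest from everything chosen so far. A minimal numpy sketch (the starting index and array shapes are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set.

    points: (N, 3) array; returns (n_samples, 3) array of selected points.
    """
    n = points.shape[0]
    chosen = np.zeros(n_samples, dtype=int)
    dist = np.full(n, np.inf)          # distance to nearest chosen point
    chosen[0] = 0                      # start from an arbitrary point
    for i in range(1, n_samples):
        d = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
        dist = np.minimum(dist, d)     # update nearest-chosen distances
        chosen[i] = int(np.argmax(dist))
    return points[chosen]
```

Applied per layer, this yields the progressively sparser multi-scale point set pairs the hierarchy operates on.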
Two-Level Motion Understanding¶
1. Point-Level Motion Understanding¶
- A dual-attention mechanism extracts motion cues from neighboring points (replacing unstable MLPs).
- Cross-attention: Computed between point \(p_i\) and its \(K\) nearest neighbors in \(Q\), yielding matching embeddings.
- Self-attention: Combines matching embeddings with neighborhood information in \(P\), producing point-level flow embeddings.
- A coarse flow from the previous layer is used to warp and align points, reducing the search range.
- Unlike HALFlow, direction vectors are removed to alleviate inter-point distance issues, and heterogeneous key/value pairs are adopted.
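The warp-then-match step can be sketched in simplified form. Here the learned cross-attention is replaced by distance-based softmax weights over each warped point's neighbors in \(Q\) (an assumption for illustration; the real module learns query/key/value projections):

```python
import numpy as np

def knn_matching_embeddings(P, Q, coarse_flow, feat_P, feat_Q, k=4):
    """Warp P by the coarse flow, then aggregate the features of each
    warped point's k nearest neighbours in Q with softmax weights
    (a distance-based stand-in for learned cross-attention)."""
    warped = P + coarse_flow                      # warping shrinks the search range
    d = np.linalg.norm(warped[:, None, :] - Q[None, :, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, :k]             # k nearest neighbours in Q
    out = np.empty_like(feat_P)
    for i in range(P.shape[0]):
        w = np.exp(-d[i, nn[i]])                  # closer neighbours weigh more
        w /= w.sum()
        out[i] = (w[:, None] * feat_Q[nn[i]]).sum(axis=0)
    return out
```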
2. Traffic-Level Scene Understanding¶
The core innovation: modeling traffic-level motion consistency via a TVF (Traffic Vector Field).
TVF Definition: A discrete grid map encoding traffic information of road participants and the environment, where each cell contains a motion vector. A coarse grid (e.g., 2 m × 2 m) is used to capture high-level understanding rather than per-point details.
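Mapping a point to its TVF cell is a simple quantization. A sketch with an assumed grid origin and extent (the actual BEV range is dataset-dependent):

```python
def tvf_cell_index(xy, cell=2.0, origin=(-50.0, -50.0)):
    """Map a point's (x, y) to the index of its 2 m x 2 m TVF cell.
    The grid origin and extent here are illustrative assumptions."""
    ix = int((xy[0] - origin[0]) // cell)
    iy = int((xy[1] - origin[1]) // cell)
    return ix, iy
```

With a 2 m cell, all points on the same vehicle typically fall into one or two cells, which is what lets the TVF reason about whole road users rather than individual returns.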
TVF Encoder¶
TVF construction proceeds in two stages:
Scene Update:
- A GRU updates the TVF across hierarchy levels: the TVF serves as the hidden state, and OD feature maps (adapted to the TVF shape via a CNN and pooling) serve as the input.
- The scene representation is thus progressively refined from layer to layer.
Flow Painting:
- Flow embeddings and point features from the previous layer are projected onto the coarse grid.
- Since each grid cell may contain multiple points with different motion patterns, point-to-grid self-attention adaptively extracts motion features.
- Traffic features and motion features are fused via spatial attention.
- Axial attention (\(\omega\) blocks) provides a global receptive field, modeling rigid-body motion dependencies in traffic (e.g., motion patterns of vehicles in the same lane).
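The point-to-grid projection can be sketched as a scatter with per-cell softmax pooling. The per-point scores standing in for learned attention logits are an assumption for illustration:

```python
import numpy as np

def flow_painting(xy, feats, scores, grid_hw=(4, 4), cell=2.0):
    """Scatter per-point motion features onto a coarse grid. Points that
    share a cell are fused with softmax weights over per-point scores --
    a simplified stand-in for point-to-grid self-attention."""
    H, W = grid_hw
    C = feats.shape[1]
    grid = np.zeros((H, W, C))
    occupied = {(int(x // cell), int(y // cell)) for x, y in xy}
    for h, w in occupied:
        mask = (xy[:, 0] // cell == h) & (xy[:, 1] // cell == w)
        s = np.exp(scores[mask] - scores[mask].max())  # stable softmax
        s /= s.sum()
        grid[h, w] = (s[:, None] * feats[mask]).sum(axis=0)
    return grid
```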
TVF Decoder¶
- Perceives latent rigid-body motion within spatial context.
- For each point \(p_i\), grid-to-point cross-attention is applied by querying the surrounding \(\mathcal{N}_{TVF}\) TVF cells.
- The attention receptive field is restricted to the local region of each point, focusing on relevant local rigid-body motion.
- Query: previous-layer flow embedding + point features; Key/Value: TVF grid cells.
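Gathering the key/value set for this cross-attention amounts to collecting the TVF cells around each point. A sketch assuming the 3 × 3 neighborhood (\(\mathcal{N}_{TVF} = 9\), the setting the ablation later finds best):

```python
import numpy as np

def query_tvf(grid, xy, cell=2.0):
    """For one point, gather the 3x3 neighbourhood of TVF cells around
    its own cell -- the key/value set for grid-to-point cross-attention.
    Cells outside the grid boundary are simply skipped."""
    H, W, C = grid.shape
    cx, cy = int(xy[0] // cell), int(xy[1] // cell)
    cells = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            h, w = cx + dx, cy + dy
            if 0 <= h < H and 0 <= w < W:
                cells.append(grid[h, w])
    return np.stack(cells)
```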
Scene Flow Prediction¶
- Point-level and traffic-level flow embeddings are concatenated with the interpolated embedding from the previous layer: \(\mathbf{e}^l = \text{Concat}(\mathbf{e}_\text{point}, \mathbf{e}_\text{traffic}, \text{Interp}(\mathbf{e}^{l-1}))\)
- A self-attention layer is applied before predicting the final scene flow \(F^l\).
Temporal Update Module¶
- A PointGRU layer exploits multi-frame temporal information (distinct from the cross-layer GRU in the TVF encoder).
- Point features at time \(t-2\) initialize the hidden state.
- During training, \(T\)-frame mini-clips are sampled as sequences.
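The per-point recurrent update follows the standard GRU equations applied feature-wise. A minimal sketch (weight shapes and the concatenated-input parameterization are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def point_gru_step(h, x, Wz, Wr, Wh):
    """One GRU update per point: the hidden state h (carried over frames,
    initialised from features at t-2) is refreshed with current-frame
    point features x. Each weight matrix acts on the concatenation [h, x]."""
    hx = np.concatenate([h, x], axis=1)
    z = sigmoid(hx @ Wz)                                   # update gate
    r = sigmoid(hx @ Wr)                                   # reset gate
    h_tilde = np.tanh(np.concatenate([r * h, x], axis=1) @ Wh)
    return (1 - z) * h + z * h_tilde                       # new hidden state
```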
Loss & Training¶
Weakly supervised training without scene flow ground truth annotations; a composite loss is used:
1. Soft Chamfer Loss \(\mathcal{L}_{sc}\): Aligns the warped \(P\) with \(Q\).
2. Spatial Smoothness Loss \(\mathcal{L}_{ss}\): Encourages neighboring points to have similar flow vectors.
3. Radial Displacement Loss \(\mathcal{L}_{rd}\): Constrains the radial flow component using radar RRV measurements.
4. Foreground Loss \(\mathcal{L}_{fg}\): Uses pseudo ground truth from a LiDAR multi-object tracking model.
5. Background Loss \(\mathcal{L}_{bg}\): Uses ego-motion transformation as pseudo ground truth for static points (\(\lambda = 0.5\)).
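The radial displacement term is the most radar-specific of these: the flow projected onto each point's line of sight should match the displacement implied by the measured RRV. A sketch assuming an L1 penalty and a frame interval `dt` (both illustrative choices):

```python
import numpy as np

def radial_displacement_loss(points, flow, rrv, dt=0.1):
    """Penalise the mismatch between the predicted flow's radial
    component and the displacement implied by the measured relative
    radial velocity (rrv * dt). dt is an assumed frame interval."""
    radial_dir = points / np.linalg.norm(points, axis=1, keepdims=True)
    radial_flow = (flow * radial_dir).sum(axis=1)  # project onto line of sight
    return np.abs(radial_flow - rrv * dt).mean()
```

Because RRV only constrains the radial component, the tangential component must come from the other loss terms and the traffic-level context.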
Ego-Motion Handling¶
- TARS-ego: An additional ego-motion head is trained (for fair comparison with CMFlow).
- TARS-superego: Ego-motion is provided as a known input for compensation (simulating real autonomous driving conditions).
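With known ego-motion, the flow a static point appears to have is just its rigid transform minus its original position, which can then be subtracted out. A minimal sketch:

```python
import numpy as np

def static_flow_from_ego_motion(points, R, t):
    """Flow induced on static points by ego-motion (rotation R,
    translation t) between two frames: the rigid transform of each
    point minus the point itself. With ego-motion given as input,
    this component can be compensated before flow estimation."""
    return points @ R.T + t - points
```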
Key Experimental Results¶
VOD Dataset¶
| Method | EPE↓(m) | AccS↑(%) | AccR↑(%) | RNE↓ | MRNE↓ | SRNE↓ |
|---|---|---|---|---|---|---|
| RaFlow | 0.226 | 19.0 | 39.0 | 0.090 | 0.114 | 0.087 |
| CMFlow (SOTA) | 0.130 | 22.8 | 53.9 | 0.052 | 0.072 | 0.049 |
| TARS-ego | 0.092 | 39.0 | 69.1 | 0.037 | 0.061 | 0.034 |
| TARS-superego | 0.048 | 76.6 | 86.4 | 0.019 | 0.057 | 0.014 |
TARS-ego reduces EPE from 0.130 m to 0.092 m, making it the first method to push EPE below the 0.1 m threshold used by AccR, and improves AccS and AccR by 16.2 and 15.2 percentage points, respectively.
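For reference, EPE, AccS, and AccR follow the standard scene flow convention (introduced with FlowNet3D, not defined in these notes): EPE is the mean per-point endpoint error, while AccS/AccR count points whose error is below an absolute or relative threshold. A sketch of that convention:

```python
import numpy as np

def scene_flow_metrics(pred, gt):
    """Standard scene-flow metrics: EPE is the mean per-point L2 error;
    AccS/AccR count points whose error is under an absolute (5 cm /
    10 cm) or relative (5% / 10%) threshold."""
    err = np.linalg.norm(pred - gt, axis=1)
    mag = np.linalg.norm(gt, axis=1) + 1e-8     # avoid divide-by-zero
    rel = err / mag
    acc_s = np.mean((err < 0.05) | (rel < 0.05))
    acc_r = np.mean((err < 0.10) | (rel < 0.10))
    return err.mean(), acc_s, acc_r
```

RNE-style metrics in the table additionally normalize the error by radar-vs-LiDAR resolution; that normalization is omitted here.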
Proprietary Dataset (High-Resolution Radar)¶
| Method | MEPE↓(m) | MagE↓ | DirE↓(rad) | AccS↑(%) | AccR↑(%) |
|---|---|---|---|---|---|
| PointPWC-Net+GRU | 0.213 | 0.178 | 0.762 | 49.0 | 60.5 |
| HALFlow+GRU | 0.170 | 0.135 | 0.721 | 50.9 | 63.8 |
| TARS | 0.069 | 0.059 | 0.599 | 69.8 | 86.8 |
MEPE is reduced from 0.170 m to 0.069 m (−59%), with AccS and AccR improving by 18.9 and 23.0 percentage points, respectively.
Ablation Study (Proprietary Dataset)¶
| Configuration | MEPE↓ | AccS↑ | AccR↑ |
|---|---|---|---|
| Point-level only | 0.178 | 47.9 | 61.6 |
| + Traffic-level (w/o OD feature map) | 0.144 | 45.0 | 63.3 |
| + OD feature map (fine grid) | 0.104 | 51.4 | 69.9 |
| + Coarse grid (w/o global attention) | 0.074 | 65.6 | 84.2 |
| + Global attention (full TARS) | 0.069 | 69.8 | 86.8 |
Key Findings:
- The coarse grid (2 m × 2 m vs. 1 m × 1 m) is critical for traffic-level understanding.
- Global attention (vs. local convolution) improves AccS by 4.2 percentage points.
- \(\mathcal{N}_{TVF} = 9\) (the surrounding 3 × 3 neighborhood) in the TVF decoder achieves the best performance.
Loss Function Ablation (VOD Dataset)¶
- The background loss \(\mathcal{L}_{bg}\) yields a significant overall improvement: AccR increases from 62.4% to 69.1%.
- Weight \(\lambda = 0.5\) achieves a balance between accuracy on moving points and overall accuracy.
Highlights & Insights¶
Strengths:
- The first work to elevate the rigid-body motion assumption from the instance level to the traffic level, effectively accommodating radar sparsity.
- The coarse-grid design of the TVF avoids over-fitting to point-level details.
- Traffic context is obtained from OD branch feature maps (rather than detection outputs), reducing dependence on detection accuracy.
- Substantially outperforms the state of the art on two datasets (by 15% and 23%).
- Effectively mitigates radar's inherent limitation in measuring tangential velocity.
Limitations:
- Still relies on a LiDAR multi-object tracking model to generate foreground pseudo ground truth (not fully LiDAR-free).
- On the VOD dataset, point clouds are extremely sparse (~256 points/frame), making the effect of \(\mathcal{N}_{TVF}\) in the TVF decoder less pronounced.
- End-to-end joint optimization of the OD and scene flow branches remains unexplored.
Personal Reflections¶
- The framing of "instance-level vs. traffic-level" is precise and directly addresses the core challenge in radar scene flow.
- The TVF design philosophy is inspiring: when data is too sparse for fine-grained matching, raising the level of abstraction is an effective strategy.
- The combination of coarse grid and global attention is well-motivated: the coarse grid preserves high-level semantics, while global attention models lane-level motion correlations.
- Leveraging joint detection feature maps rather than detection results is more robust, as feature maps carry richer information than bounding box outputs.
- The PointGRU temporal module and the cross-layer GRU in the TVF encoder serve distinct roles: the former accumulates temporal information, while the latter updates the scene representation across scales.
Rating¶
- Novelty: TBD
- Experimental Thoroughness: TBD
- Writing Quality: TBD
- Value: TBD