
TARS: Traffic-Aware Radar Scene Flow Estimation

Conference: ICCV 2025 | arXiv: 2503.10210 | Code: To be confirmed | Area: Autonomous Driving | Keywords: Radar scene flow, traffic vector field, point cloud motion estimation, multi-task learning, autonomous driving perception

TL;DR

This paper proposes TARS, a traffic-aware radar scene flow estimation method that constructs a Traffic Vector Field (TVF) via joint object detection, capturing rigid-body motion at the traffic level rather than the instance level. TARS surpasses the state of the art by 15% and 23% on the VOD and a proprietary dataset, respectively.

Background & Motivation

  • Scene flow provides critical motion information for autonomous driving, describing per-point displacement vectors between two consecutive point cloud frames.
  • Existing LiDAR scene flow methods exploit an instance-level rigid-body motion assumption: the scene is decomposed into multiple rigidly moving objects and a static background.
  • However, instance-level approaches are ill-suited for radar point clouds for three reasons:
    1. Extreme sparsity: Radar point clouds are an order of magnitude sparser than LiDAR (VOD dataset: ~256 points per frame).
    2. Lack of shape information: Reliable instance-level matching is infeasible.
    3. Inter-frame "deformation": Sparsity causes significant variation in the point distribution of the same object across consecutive frames.
  • Radar advantages: greater robustness to adverse weather and an order-of-magnitude lower cost.
  • Core problem: How to preserve the rigid-body motion assumption while accommodating radar sparsity?
  • Proposed solution: Elevate the rigid-body motion assumption from the instance level to the traffic level.

Method

Overall Architecture

TARS adopts a hierarchical architecture with \(L\) layers, combining two branches:

  1. Scene Flow Branch: Hierarchical coarse-to-fine scene flow estimation.
  2. Object Detection (OD) Branch: Provides feature maps encoding traffic-level contextual information.

Both branches are jointly trained; feature maps from the OD branch supply traffic-level priors to the scene flow branch.

Input and Encoding

  • Two input point clouds \(P \in \mathbb{R}^{N \times 5}\) and \(Q \in \mathbb{R}^{M \times 5}\).
  • Each point has 5-dimensional features: x, y, z coordinates + Relative Radial Velocity (RRV) + Radar Cross Section (RCS).
  • A multi-scale point encoder (PointNet) extracts features.
  • Farthest point sampling downsamples the point sets at each layer, yielding multi-scale point set pairs (a minimal sketch follows this list).
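
A minimal sketch of this step, assuming illustrative layer sizes and a basic farthest-point-sampling loop (PyTorch is used for all sketches in this note; none of the names come from the paper):

```python
# Minimal sketch (not the authors' code): building the multi-scale point set
# pyramid from a 5-D radar point cloud via farthest point sampling (FPS).
# Layer count and per-layer sizes are illustrative assumptions.
import torch

def farthest_point_sampling(points: torch.Tensor, n_samples: int) -> torch.Tensor:
    """points: (N, 3) xyz coordinates; returns indices of n_samples spread-out points."""
    n = points.shape[0]
    selected = torch.zeros(n_samples, dtype=torch.long)
    dist = torch.full((n,), float("inf"))
    farthest = torch.randint(0, n, (1,)).item()
    for i in range(n_samples):
        selected[i] = farthest
        # Update each point's distance to the nearest already-selected point.
        d = torch.sum((points - points[farthest]) ** 2, dim=-1)
        dist = torch.minimum(dist, d)
        farthest = torch.argmax(dist).item()
    return selected

# A radar frame: x, y, z, RRV, RCS per point (~256 points per frame on VOD).
P = torch.randn(256, 5)
sizes = [256, 128, 64, 32]            # assumed sizes for an L-layer hierarchy
pyramid = [P]
for n_l in sizes[1:]:
    idx = farthest_point_sampling(pyramid[-1][:, :3], n_l)
    pyramid.append(pyramid[-1][idx])
```

In the full model each scale is additionally encoded by a PointNet-style per-point encoder; only the sampling pyramid is shown here.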

Two-Level Motion Understanding

1. Point-Level Motion Understanding

  • A dual-attention mechanism extracts motion cues from neighboring points (replacing unstable MLPs).
  • Cross-attention: Computed between point \(p_i\) and its \(K\) nearest neighbors in \(Q\), yielding matching embeddings.
  • Self-attention: Combines matching embeddings with neighborhood information in \(P\), producing point-level flow embeddings.
  • A coarse flow from the previous layer is used to warp and align points, reducing the search range.
  • Unlike HALFlow, direction vectors are removed to alleviate issues caused by large inter-point distances in sparse radar data, and heterogeneous key/value pairs are adopted (a rough sketch of the dual attention follows this list).
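
A minimal sketch of the dual-attention idea under assumed shapes: each point of the warped \(P\) cross-attends to its \(K\) nearest neighbors in \(Q\) to form matching embeddings, then self-attends over its neighborhood in \(P\) with heterogeneous keys/values. The scaled dot-product attention and all dimensions are simplifications, not the paper's exact operators:

```python
# Minimal sketch (assumed shapes and generic attention, not the paper's formulation).
import torch

def knn(query_xyz, ref_xyz, k):
    # (N, 3), (M, 3) -> (N, k) indices of nearest neighbours in ref.
    d = torch.cdist(query_xyz, ref_xyz)          # (N, M) pairwise distances
    return d.topk(k, largest=False).indices

def attend(q, k, v):
    # q: (N, 1, C), k/v: (N, K, C) -> (N, C) scaled dot-product attention
    w = torch.softmax((q @ k.transpose(1, 2)) / q.shape[-1] ** 0.5, dim=-1)
    return (w @ v).squeeze(1)

N, M, K, C = 256, 240, 8, 64
p_xyz, q_xyz = torch.randn(N, 3), torch.randn(M, 3)
p_feat, q_feat = torch.randn(N, C), torch.randn(M, C)
coarse_flow = torch.zeros(N, 3)                  # coarse flow from the previous layer

# 1) Warp P with the coarse flow, then cross-attend to K neighbours in Q.
idx_q = knn(p_xyz + coarse_flow, q_xyz, K)       # (N, K)
match_emb = attend(p_feat[:, None], q_feat[idx_q], q_feat[idx_q])

# 2) Self-attention over each point's neighbourhood in P fuses the matching
#    embeddings with local context (heterogeneous key/value), giving
#    point-level flow embeddings.
idx_p = knn(p_xyz, p_xyz, K)
flow_emb = attend(match_emb[:, None], match_emb[idx_p], p_feat[idx_p])
```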

2. Traffic-Level Scene Understanding

The core innovation: modeling traffic-level motion consistency via a TVF (Traffic Vector Field).

TVF Definition: A discrete grid map encoding traffic information of road participants and the environment, where each cell contains a motion vector. A coarse grid (e.g., 2 m × 2 m) is used to capture high-level understanding rather than per-point details.

TVF Encoder

TVF construction proceeds in two stages:

Scene Update:

  • A GRU updates the TVF across layers, taking OD feature maps (adapted to the TVF shape via a CNN and pooling) as input.
  • The TVF serves as the GRU hidden state, with the OD features as the input.
  • The scene representation is progressively refined across hierarchy levels.
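
The paper only states that a GRU refines the TVF across layers with OD features as input; a plausible minimal sketch using a convolutional GRU cell (the cell design, channel counts, and grid size are assumptions):

```python
# Minimal ConvGRU sketch: the TVF is the hidden state, the adapted OD feature
# map is the input. Exact cell architecture is an assumption.
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, hid_ch, in_ch):
        super().__init__()
        self.gates = nn.Conv2d(hid_ch + in_ch, 2 * hid_ch, 3, padding=1)
        self.cand = nn.Conv2d(hid_ch + in_ch, hid_ch, 3, padding=1)

    def forward(self, h, x):
        # h: TVF hidden state (B, hid_ch, H, W); x: OD feature map adapted to the TVF shape.
        zr = torch.sigmoid(self.gates(torch.cat([h, x], dim=1)))
        z, r = zr.chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * h_tilde

tvf = torch.zeros(1, 64, 32, 32)      # coarse 2 m x 2 m grid, 64-channel cells
od_feat = torch.randn(1, 64, 32, 32)  # OD feature map resized to the TVF shape
tvf = ConvGRUCell(64, 64)(tvf, od_feat)
```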

Flow Painting:

  • Flow embeddings and point features from the previous layer are projected onto the coarse grid.
  • Since each grid cell may contain multiple points with different motion patterns, point-to-grid self-attention adaptively extracts motion features.
  • Traffic features and motion features are fused via spatial attention.
  • Axial attention (\(\omega\) blocks) provides a global receptive field, modeling rigid-body motion dependencies in traffic (e.g., motion patterns of vehicles in the same lane).
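
A rough sketch of the flow-painting step under assumed shapes: point features are scattered into 2 m × 2 m cells, and within each occupied cell a simple attention pools points with different motion patterns into the cell's traffic feature. The spatial- and axial-attention blocks are omitted for brevity:

```python
# Minimal sketch of "flow painting" (assumed shapes and a simplified cell-wise
# attention; the real module also applies spatial and axial attention).
import torch

cell_size, grid = 2.0, 32                     # 2 m x 2 m cells on a 32 x 32 map
xyz = torch.rand(256, 3) * 60 - 30            # points covering a 60 m x 60 m area
feat = torch.randn(256, 64)                   # flow embedding + point feature per point
tvf = torch.zeros(grid, grid, 64)             # traffic features from the GRU update

ij = ((xyz[:, :2] + 30) / cell_size).long().clamp(0, grid - 1)   # (N, 2) cell indices
cell_id = ij[:, 0] * grid + ij[:, 1]

for c in cell_id.unique():
    pts = feat[cell_id == c]                  # points landing in this cell
    q = tvf.view(-1, 64)[c]                   # cell's traffic feature acts as the query
    w = torch.softmax(pts @ q / 64 ** 0.5, dim=0)
    painted = (w[:, None] * pts).sum(0)       # attention-pooled motion feature
    tvf.view(-1, 64)[c] = q + painted         # fuse traffic and motion features
```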

TVF Decoder

  • Perceives latent rigid-body motion within spatial context.
  • For each point \(p_i\), grid-to-point cross-attention is applied by querying the surrounding \(\mathcal{N}_{TVF}\) TVF cells.
  • The attention receptive field is restricted to the local region of each point, focusing on relevant local rigid-body motion.
  • Query: previous-layer flow embedding + point features; Key/Value: TVF grid cells (sketched below).
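
A minimal sketch of the grid-to-point cross-attention, assuming \(\mathcal{N}_{TVF} = 9\) (the 3 × 3 cells around each point) and generic scaled dot-product attention:

```python
# Minimal sketch (assumed tensor layout): each point queries its 3 x 3
# neighbourhood of TVF cells to obtain a traffic-level flow embedding.
import torch

grid, C = 32, 64
tvf = torch.randn(grid, grid, C)              # encoded traffic vector field
point_query = torch.randn(256, C)             # previous-layer flow embedding + point feature
ij = torch.randint(1, grid - 1, (256, 2))     # each point's cell index (away from borders)

offsets = torch.tensor([(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)])
nbr = ij[:, None, :] + offsets[None]          # (N, 9, 2) surrounding cell indices
kv = tvf[nbr[..., 0], nbr[..., 1]]            # (N, 9, C) local TVF features (key/value)

w = torch.softmax((point_query[:, None] @ kv.transpose(1, 2)) / C ** 0.5, dim=-1)
traffic_emb = (w @ kv).squeeze(1)             # (N, C) traffic-level flow embedding
```

Restricting the keys/values to the local 3 × 3 neighborhood is what keeps the decoder focused on locally relevant rigid-body motion.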

Scene Flow Prediction

  • Point-level and traffic-level flow embeddings are concatenated: \(\mathbf{e}^l = \text{Concat}(\mathbf{e}_\text{point}, \mathbf{e}_\text{traffic}, \text{Interp}(\mathbf{e}^{l-1}))\)
  • A self-attention layer is applied before predicting the final scene flow \(F^l\) (sketched below).
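
A compact sketch of the per-layer flow head with assumed dimensions; the concatenation follows the formula above, while the attention and regression layers are generic stand-ins:

```python
# Minimal sketch of the per-layer flow head (assumed dimensions and layers).
import torch
import torch.nn as nn

N, C = 256, 64
e_point, e_traffic = torch.randn(N, C), torch.randn(N, C)
e_prev = torch.randn(N, C)                    # Interp(e^{l-1}) upsampled to this layer

e_l = torch.cat([e_point, e_traffic, e_prev], dim=-1)          # (N, 3C)
attn = nn.MultiheadAttention(3 * C, num_heads=4, batch_first=True)
e_l, _ = attn(e_l[None], e_l[None], e_l[None])                 # self-attention over points
flow = nn.Linear(3 * C, 3)(e_l.squeeze(0))                     # F^l: (N, 3) flow vectors
```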

Temporal Update Module

  • A PointGRU layer exploits multi-frame temporal information (distinct from the cross-layer GRU in the TVF encoder).
  • Point features at time \(t-2\) initialize the hidden state.
  • During training, \(T\)-frame mini-clips are sampled as sequences (a minimal sketch follows this list).
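
A minimal PointGRU sketch; the exact cell and the point association across frames are not detailed in this note, so a standard GRUCell over per-point features is used as a stand-in:

```python
# Minimal sketch (assumed cell and clip length): a per-point GRU carries features
# across frames, with the t-2 features initialising the hidden state.
import torch
import torch.nn as nn

C, N, T = 64, 256, 4                             # T-frame mini-clip (T is assumed)
gru = nn.GRUCell(input_size=C, hidden_size=C)

frames = [torch.randn(N, C) for _ in range(T)]   # per-frame point features
h = frames[0]                                    # hidden state initialised from time t-2
for x in frames[1:]:
    # The full model associates points across frames before updating; here we
    # simply assume aligned point ordering for illustration.
    h = gru(x, h)
```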

Loss & Training

Weakly supervised training without scene flow ground truth annotations; a composite loss is used:

  1. Soft Chamfer Loss \(\mathcal{L}_{sc}\): Aligns the warped \(P\) with \(Q\).
  2. Spatial Smoothness Loss \(\mathcal{L}_{ss}\): Encourages neighboring points to have similar flow vectors.
  3. Radial Displacement Loss \(\mathcal{L}_{rd}\): Constrains the radial flow component using radar RRV measurements.
  4. Foreground Loss \(\mathcal{L}_{fg}\): Uses pseudo ground truth from a LiDAR multi-object tracking model.
  5. Background Loss \(\mathcal{L}_{bg}\): Uses the ego-motion transformation as pseudo ground truth for static points (\(\lambda = 0.5\)).
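
Minimal sketches of two of these terms, \(\mathcal{L}_{rd}\) and \(\mathcal{L}_{ss}\), under assumed conventions (the frame interval dt and the 0.5 weight are illustrative; the soft Chamfer, foreground, and background terms are omitted):

```python
# Minimal sketches of the radial displacement and spatial smoothness terms
# (conventions, dt, and weights are assumptions, not the paper's values).
import torch

def radial_displacement_loss(xyz, flow, rrv, dt=0.1):
    """Penalise mismatch between the flow's radial component and the RRV measurement."""
    radial_dir = xyz / (xyz.norm(dim=-1, keepdim=True) + 1e-6)      # unit radial directions
    radial_flow = (flow * radial_dir).sum(-1)                       # radial displacement (m)
    return (radial_flow - rrv * dt).abs().mean()

def spatial_smoothness_loss(xyz, flow, k=8):
    """Encourage nearby points to share similar flow vectors."""
    idx = torch.cdist(xyz, xyz).topk(k + 1, largest=False).indices[:, 1:]  # drop self
    return (flow[:, None, :] - flow[idx]).norm(dim=-1).mean()

xyz, flow = torch.rand(256, 3) * 30, torch.randn(256, 3) * 0.1
rrv = torch.randn(256)                                              # measured radial velocities
loss = radial_displacement_loss(xyz, flow, rrv) + 0.5 * spatial_smoothness_loss(xyz, flow)
```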

Ego-Motion Handling

  • TARS-ego: An additional ego-motion head is trained (for fair comparison with CMFlow).
  • TARS-superego: Ego-motion is provided as a known input for compensation (simulating real autonomous driving conditions); see the sketch below.
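
A small sketch of the TARS-superego setting, assuming ego-motion is given as a 4 × 4 SE(3) transform between the two frames; the same transform also yields the rigid-flow pseudo ground truth that \(\mathcal{L}_{bg}\) uses for static points:

```python
# Minimal sketch (assumed convention): with ego-motion known, static points can be
# compensated before flow estimation, so the network only explains object motion.
import torch

def compensate_ego_motion(xyz: torch.Tensor, T_ego: torch.Tensor) -> torch.Tensor:
    """xyz: (N, 3) points in the earlier frame; T_ego: (4, 4) earlier-to-later transform."""
    xyz_h = torch.cat([xyz, torch.ones(xyz.shape[0], 1)], dim=-1)   # homogeneous coordinates
    return (xyz_h @ T_ego.T)[:, :3]

xyz = torch.rand(256, 3) * 30
T_ego = torch.eye(4)
T_ego[0, 3] = 0.5                              # e.g. 0.5 m forward ego translation
xyz_comp = compensate_ego_motion(xyz, T_ego)
rigid_flow = xyz_comp - xyz                    # pseudo ground-truth flow for static points
```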

Key Experimental Results

VOD Dataset

| Method | EPE↓ (m) | AccS↑ (%) | AccR↑ (%) | RNE↓ | MRNE↓ | SRNE↓ |
| --- | --- | --- | --- | --- | --- | --- |
| RaFlow | 0.226 | 19.0 | 39.0 | 0.090 | 0.114 | 0.087 |
| CMFlow (SOTA) | 0.130 | 22.8 | 53.9 | 0.052 | 0.072 | 0.049 |
| TARS-ego | 0.092 | 39.0 | 69.1 | 0.037 | 0.061 | 0.034 |
| TARS-superego | 0.048 | 76.6 | 86.4 | 0.019 | 0.057 | 0.014 |

TARS-ego reduces EPE from 0.130 m to 0.092 m (the first method to push EPE below the 0.1 m threshold used for AccR), improving AccS and AccR by 16.2 and 15.2 percentage points, respectively.

Proprietary Dataset (High-Resolution Radar)

| Method | MEPE↓ (m) | MagE↓ | DirE↓ (rad) | AccS↑ (%) | AccR↑ (%) |
| --- | --- | --- | --- | --- | --- |
| PointPWC-Net+GRU | 0.213 | 0.178 | 0.762 | 49.0 | 60.5 |
| HALFlow+GRU | 0.170 | 0.135 | 0.721 | 50.9 | 63.8 |
| TARS | 0.069 | 0.059 | 0.599 | 69.8 | 86.8 |

MEPE is reduced from 0.170 m to 0.069 m (−59%), with AccS and AccR improving by 18.9 and 23.0 percentage points, respectively.

Ablation Study (Proprietary Dataset)

| Configuration | MEPE↓ | AccS↑ | AccR↑ |
| --- | --- | --- | --- |
| Point-level only | 0.178 | 47.9 | 61.6 |
| + Traffic-level (w/o OD feature map) | 0.144 | 45.0 | 63.3 |
| + OD feature map (fine grid) | 0.104 | 51.4 | 69.9 |
| + Coarse grid (w/o global attention) | 0.074 | 65.6 | 84.2 |
| + Global attention (full TARS) | 0.069 | 69.8 | 86.8 |

Key Findings:

  • The coarse grid (2 m × 2 m vs. 1 m × 1 m) is critical for traffic-level understanding.
  • Global attention (vs. local convolution) improves AccS by 4.2 percentage points.
  • \(\mathcal{N}_{TVF} = 9\) (the 3 × 3 surrounding neighborhood) in the TVF decoder achieves the best performance.

Loss Function Ablation (VOD Dataset)

  • The background loss \(\mathcal{L}_{bg}\) yields a significant overall improvement: AccR increases from 62.4% to 69.1%.
  • Weight \(\lambda = 0.5\) achieves a balance between accuracy on moving points and overall accuracy.

Highlights & Insights

Strengths:

  • The first work to elevate the rigid-body motion assumption from the instance level to the traffic level, effectively accommodating radar sparsity.
  • The coarse-grid design of the TVF avoids over-fitting to point-level details.
  • Traffic context is obtained via OD-branch feature maps (rather than detection outputs), reducing dependence on detection accuracy.
  • Substantially outperforms the state of the art on two datasets (15% and 23%).
  • Effectively mitigates radar's inherent limitation in measuring tangential velocity.

Limitations:

  • Still relies on a LiDAR multi-object tracking model to generate foreground pseudo ground truth (not fully LiDAR-free).
  • On the VOD dataset, point clouds are extremely sparse (~256 points/frame), making the effect of \(\mathcal{N}_{TVF}\) in the TVF decoder less pronounced.
  • End-to-end joint optimization of the OD and scene flow branches remains unexplored.

Personal Reflections

  • The framing of "instance-level vs. traffic-level" is precise and directly addresses the core challenge in radar scene flow.
  • The TVF design philosophy is inspiring: when data is too sparse for fine-grained matching, raising the level of abstraction is an effective strategy.
  • The combination of coarse grid and global attention is well-motivated: the coarse grid preserves high-level semantics, while global attention models lane-level motion correlations.
  • Leveraging joint detection feature maps rather than detection results is more robust, as feature maps carry richer information than bounding box outputs.
  • The PointGRU temporal module and the cross-layer GRU in the TVF encoder serve distinct roles: the former accumulates temporal information, while the latter updates the scene representation across scales.

Rating

  • Novelty: TBD
  • Experimental Thoroughness: TBD
  • Writing Quality: TBD
  • Value: TBD