CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception

Conference: ICCV 2025 | arXiv: 2507.19239 | Code: GitHub | Area: Autonomous Driving | Keywords: Cooperative Perception, 3D Multi-Object Tracking, End-to-End Learning, V2X, Instance-Level Fusion

TL;DR

This paper proposes CoopTrack, the first fully instance-level end-to-end cooperative 3D multi-object tracking framework. It achieves cross-agent instance matching and fusion via a learnable graph attention association module and multi-dimensional feature extraction, reaching state-of-the-art performance on V2X-Seq.

Background & Motivation

Single-vehicle perception is inherently limited by a single viewpoint—occlusions and restricted sensing range are common challenges. V2X (Vehicle-to-Everything) communication enables multi-agent cooperative perception. However, existing cooperative perception research has focused primarily on single-frame tasks (3D detection), while the more challenging cooperative sequential perception task (e.g., cooperative 3D multi-object tracking, MOT) remains largely underexplored.

Limitations of Prior Work:

Tracking-by-Cooperative-Detection (TBCD): First performs cooperative detection, then applies a conventional tracker (e.g., AB3DMOT) as post-processing. The tracker cannot leverage fusion information, and the decoupled optimization of detection and tracking leads to suboptimal results.

Existing end-to-end approach (UniV2X): While it pioneered an end-to-end cooperative tracking framework, it suffers from two design issues: (1) it relies on rule-based association (e.g., Euclidean distance matching), which carries limited information and is susceptible to pose noise; and (2) it adopts a fusion-before-decoding pipeline, fusing queries from both agents before decoding with ego features, which leads to ambiguity and conflicts.

Core Idea: Design a fusion-after-decoding pipeline: each agent decodes independently to obtain instance-level features, which are then associated and fused across agents. Association shifts from rule-driven matching to learnable graph attention, leveraging multi-dimensional semantic and motion features for richer matching cues.

Method

Overall Architecture

CoopTrack comprises two sub-systems: a vehicle-side and a road-side agent. Each independently performs: image feature extraction → Transformer decoding → Multi-Dimensional Feature Extraction (MDFE). The road-side instance-level features are transmitted to the vehicle side via V2X communication at extremely low bandwidth. On the vehicle side, the received features are processed sequentially through: Cross-Agent Alignment (CAA) → Graph-Based Association (GBA) → feature aggregation → FFN for final output.
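
A minimal schematic of this dataflow, with every module reduced to a stub (the shapes, names, and instance count are illustrative assumptions, not the authors' code):

```python
import torch

D, N = 256, 50  # illustrative feature dimension and per-agent instance count

def decode_and_extract(images):
    """Stand-in for one agent's pipeline: image backbone -> Transformer decoder
    -> MDFE. Returns instance-level semantic features, motion features, and 3D
    reference points (all shapes are assumptions)."""
    return torch.randn(N, D), torch.randn(N, D), torch.randn(N, 3)

# Each agent decodes independently ("fusion after decoding").
veh_sem, veh_mot, veh_ref = decode_and_extract(images=None)
inf_sem, inf_mot, inf_ref = decode_and_extract(images=None)

# Only the road-side instance-level features cross the V2X link.
payload = {"sem": inf_sem, "mot": inf_mot, "ref": inf_ref}

# Vehicle side, in order: Cross-Agent Alignment -> Graph-Based Association
# -> feature aggregation -> FFN heads. Each stage is sketched in code after
# the "Key Designs" list below.
```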

Key Designs

  1. Multi-Dimensional Feature Extraction (MDFE):

    • Decoupled semantic and motion features: Existing query-based methods implicitly couple the two, causing decoding ambiguity.
    • Semantic features: extracted from query features via MLP.
    • Motion features: the relative coordinates of 8 corners of the coarse 3D bounding box are encoded into motion features via PointNet (4-layer MLP + max pooling).
    • Temporal enhancement: A dedicated temporal transformer block (2-layer decoder) with sinusoidal positional encoding captures temporal dependencies; short sequences are padded with zeros and handled with binary masks.
    • Historical features are updated in a sliding window via FIFO.
  2. Cross-Agent Alignment (CAA):

    • Differences in sensors, viewpoints, and spatial positions between the vehicle-side and road-side agents create a feature domain gap.
    • Core Idea: The domain gap can be approximated as a linear transformation, analogous to the rigid-body transformation of spatial coordinates.
    • Explicit spatial transformation: \(\tilde{\mathcal{P}}^I = \mathcal{P}^I \cdot \mathbf{R}^\top + \mathbf{t}\)
    • Implicit feature transformation: \(\tilde{\mathcal{M}}^I = \mathcal{M}^I \cdot \hat{\mathbf{R}}^\top + \hat{\mathbf{t}}\)
    • The latent rotation matrix \(\hat{\mathbf{R}} \in \mathbb{R}^{d \times d}\) and translation \(\hat{\mathbf{t}} \in \mathbb{R}^{1 \times d}\) are predicted by two MLPs from explicit pose parameters.
    • A 6D continuous rotation representation with piecewise mapping is used to reduce parameter count.
  3. Graph-Based Association (GBA):

    • A fully connected association graph \(\mathcal{G} = \{\mathcal{N}, \mathcal{E}\}\) is constructed between vehicle-side and road-side instances.
    • Node features: extracted via MLP from the concatenation of motion and semantic features.
    • Edge features: Euclidean distances between the two agents' reference points are encoded via MLP.
    • Graph attention computes the affinity matrix: \(\hat{A} = \frac{(\mathcal{N}^V W^V)(\mathcal{N}^I W^I)^\top}{\sqrt{d}} + \mathcal{E} W^E\)
    • FFN + sigmoid generates the final affinity matrix \(A\).
    • The Hungarian algorithm is applied to \(1 - A\) to obtain matched pairs (see the combined code sketch after this list).
  4. Feature Aggregation and Tracking Propagation:

    • Matched instance pairs fuse multi-dimensional features into a single representation (eliminating duplicate detections).
    • Unmatched instances are retained directly (extending the observation range).
    • Semantic features of active instances are propagated as query features to the next frame.
    • Reference points are predicted for the next frame using a constant velocity assumption.
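
To make the three key designs concrete, here is a minimal PyTorch-style sketch (not the authors' implementation; layer sizes, tensor shapes, the pose encoding, and helper names are assumptions made for illustration). It covers the PointNet-style motion encoder from MDFE, the implicit linear alignment of CAA predicted from pose parameters, and the graph-attention affinity of GBA followed by Hungarian matching on \(1 - A\):

```python
import torch
import torch.nn as nn
from scipy.optimize import linear_sum_assignment

D = 256  # instance feature dimension (illustrative)

# (a) MDFE: motion features from the 8 corners of each coarse 3D box via a
#     PointNet-style encoder (shared point-wise MLP + max pooling).
class MotionEncoder(nn.Module):
    def __init__(self, dim=D):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, dim), nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, corners):                 # corners: (N, 8, 3) relative coords
        return self.point_mlp(corners).max(dim=1).values   # (N, dim)

# (b) CAA: a latent "rotation" and "translation", predicted from the explicit
#     relative pose, align road-side features to the vehicle-side domain.
#     (The paper reduces parameters with a 6D rotation representation; here the
#     d x d matrix is predicted directly for brevity.)
class CrossAgentAlignment(nn.Module):
    def __init__(self, dim=D, pose_dim=12):     # pose: flattened R (9) + t (3)
        super().__init__()
        self.rot_head = nn.Linear(pose_dim, dim * dim)
        self.trans_head = nn.Linear(pose_dim, dim)

    def forward(self, feats, pose):             # feats: (N, dim), pose: (pose_dim,)
        r_hat = self.rot_head(pose).view(feats.size(1), feats.size(1))
        t_hat = self.trans_head(pose)
        return feats @ r_hat.T + t_hat          # mirrors M @ R_hat.T + t_hat

# (c) GBA: graph attention over the fully connected vehicle/road-side graph.
class GraphAssociation(nn.Module):
    def __init__(self, dim=D):
        super().__init__()
        self.w_v = nn.Linear(dim, dim, bias=False)    # vehicle-node projection
        self.w_i = nn.Linear(dim, dim, bias=False)    # road-side-node projection
        self.edge_mlp = nn.Linear(1, 1)               # encodes reference-point distances
        self.ffn = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, nodes_v, nodes_i, refs_v, refs_i):
        edge = self.edge_mlp(torch.cdist(refs_v, refs_i).unsqueeze(-1)).squeeze(-1)
        attn = self.w_v(nodes_v) @ self.w_i(nodes_i).T / D ** 0.5
        return torch.sigmoid(self.ffn((attn + edge).unsqueeze(-1))).squeeze(-1)

# Usage: align 5 road-side instances to the vehicle domain, then associate
# them with 6 vehicle-side instances. Node features are assumed to already be
# MLP(concat(semantic, motion)) per instance.
nodes_v, nodes_i = torch.randn(6, D), torch.randn(5, D)
refs_v, refs_i = torch.rand(6, 3), torch.rand(5, 3)
pose = torch.randn(12)                                          # relative pose (assumed encoding)
nodes_i = CrossAgentAlignment()(nodes_i, pose)                  # CAA: implicit feature alignment
A = GraphAssociation()(nodes_v, nodes_i, refs_v, refs_i)        # GBA: affinity in [0, 1]
rows, cols = linear_sum_assignment((1 - A).detach().numpy())    # Hungarian on 1 - A
print(list(zip(rows.tolist(), cols.tolist())))                  # matched (vehicle, road-side) pairs
```

Matched pairs would then have their multi-dimensional features fused into a single representation, while unmatched instances are retained, as described in item 4 above.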

Loss & Training

Two-stage training:

  • Stage 1: Vehicle-side and road-side end-to-end tracking models are trained independently.

    • \(\mathcal{L}_{\text{stage1}} = 0.25 \cdot \mathcal{L}_{\text{bbx}} + 2.0 \cdot \mathcal{L}_{\text{cls}}\)
    • Classification uses Focal Loss (\(\alpha=0.25, \gamma=2.0\)); regression uses L1 Loss.
  • Stage 2: End-to-end cooperative tracking with association training.

    • \(\mathcal{L}_{\text{stage2}} = 0.25 \cdot \mathcal{L}_{\text{bbx}} + 2.0 \cdot \mathcal{L}_{\text{cls}} + 10.0 \cdot \mathcal{L}_{\text{asso}}\)
    • Association labels are automatically generated by matching predictions to GT via the Hungarian algorithm.
    • Association loss uses Focal Loss (\(\alpha=0.5, \gamma=1.0\)).
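
As a rough sketch of how the stage-2 objective could be assembled from the reported weights and focal-loss settings (torchvision's sigmoid_focal_loss stands in for the paper's focal loss; tensor shapes are toy values, and the Hungarian matching of predictions to ground truth is assumed to have already produced the targets):

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def stage2_loss(box_pred, box_gt, cls_logit, cls_gt, asso_logit, asso_gt):
    """Weighted sum matching the reported stage-2 objective:
    0.25 * L_bbx + 2.0 * L_cls + 10.0 * L_asso."""
    l_bbx = F.l1_loss(box_pred, box_gt)                               # L1 box regression
    l_cls = sigmoid_focal_loss(cls_logit, cls_gt,                     # classification focal loss
                               alpha=0.25, gamma=2.0, reduction="mean")
    l_asso = sigmoid_focal_loss(asso_logit, asso_gt,                  # association focal loss
                                alpha=0.5, gamma=1.0, reduction="mean")
    return 0.25 * l_bbx + 2.0 * l_cls + 10.0 * l_asso

# Toy shapes for illustration only: 10 matched instances, 3 classes,
# and a 6 x 5 vehicle/road-side association matrix.
loss = stage2_loss(torch.randn(10, 7), torch.randn(10, 7),
                   torch.randn(10, 3), torch.randint(0, 2, (10, 3)).float(),
                   torch.randn(6, 5), torch.zeros(6, 5))
print(loss.item())
```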

Key Experimental Results

Main Results

Comparison with cooperative perception state-of-the-art on V2X-Seq (ResNet101 backbone):

| Method | Paradigm | mAP↑ | AMOTA↑ | Transmission↓ (bytes/s) |
|---|---|---|---|---|
| V2X-ViT | TBCD | 0.268 | 0.287 | 2.56×10⁶ |
| Where2comm | TBCD | 0.162 | 0.106 | 5.40×10⁵ |
| Late Fusion | TBCD | 0.196 | 0.263 | 6.60×10² |
| UniV2X | E2EC | 0.295 | 0.239 | 6.96×10⁴ |
| CoopTrack | E2EC | 0.390 | 0.328 | 5.64×10⁴ |

(TBCD: tracking-by-cooperative-detection; E2EC: end-to-end cooperative tracking.)

CoopTrack outperforms UniV2X by +9.5% mAP and +8.9% AMOTA while requiring lower transmission volume.

Ablation Study

Incremental contribution of each module (ResNet50 backbone):

| Configuration | mAP↑ | AMOTA↑ |
|---|---|---|
| Fusion-before-decoding baseline | 0.310 | 0.266 |
| Fusion-after-decoding pipeline | 0.337 | 0.277 |
| Pipeline + subset of {MDFE, CAA, GBA} | 0.345 | 0.283 |
| Pipeline + subset of {MDFE, CAA, GBA} | 0.354 | 0.304 |
| Pipeline + subset of {MDFE, CAA, GBA} | 0.355 | 0.332 |
| Full model (Pipeline + MDFE + CAA + GBA) | 0.356 | 0.346 |

Effect of historical frame count: AMOTA improves from 0.100 (0 frames) to 0.346 (4 frames), validating the value of temporal modeling.

Key Findings

  • The fusion-after-decoding pipeline alone yields +2.7% mAP and +1.1% AMOTA over fusion-before-decoding.
  • The CAA module learns implicit information: adding rotation noise only to the alignment module causes minor degradation, whereas global noise injection leads to significant performance drops.
  • Higher frame rates benefit tracking: at 10 Hz, CoopTrack's AMOTA is 10.7% higher than at 2 Hz.
  • The method generalizes effectively to the Griffin dataset (aerial-ground cooperation), demonstrating its broader applicability.

Highlights & Insights

  • Instance-level feature transmission incurs extremely low bandwidth overhead (5.64×10⁴ bytes/s), approximately one-thousandth of BEV feature fusion.
  • Learnable association is more robust than rule-based matching: even when reference point positions are imprecise, semantic and motion features enable correct association.
  • Decoupling multi-dimensional features (semantic vs. motion) resolves the decoding ambiguity caused by implicit coupling in query-based methods.
  • The automatic generation of association labels cleverly leverages the prediction capability of the Stage 1 model.

Limitations & Future Work

  • The two-stage training pipeline is relatively complex; future work may explore single-stage end-to-end training.
  • Validation is limited to vehicle-road (V2I) scenarios; multi-vehicle V2V settings remain to be explored.
  • The communication delay compensation module, while effective, lacks fine granularity; performance still degrades under long delays.
  • Pose noise has a considerable impact on the system (global noise), underscoring the importance of robust pose estimation.
  • LiDAR input is not explored; only camera images are used.

Related Work Connections

  • The temporal query propagation and prediction mechanism of PF-Track is extended in this work.
  • ADA-Track's differentiable association module shares conceptual similarity with GBA, though applied at a different level.
  • QUEST's instance-level feature fusion paradigm aligns with the instance-level transmission approach adopted here.
  • UniV2X's pioneering work provides both the baseline and the direction for improvement in this paper.

Rating

  • Novelty: ⭐⭐⭐⭐ — First fully end-to-end learnable association framework for cooperative tracking.
  • Technical Depth: ⭐⭐⭐⭐ — Well-motivated multi-module design with solid theoretical grounding.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Two datasets, comprehensive ablations, and qualitative analysis.
  • Value: ⭐⭐⭐⭐ — Low bandwidth with high performance; strong practical prospects.
  • Overall Recommendation: ⭐⭐⭐⭐