# CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception
Conference: ICCV 2025 · arXiv: 2507.19239 · Code: GitHub · Area: Autonomous Driving · Keywords: Cooperative Perception, 3D Multi-Object Tracking, End-to-End Learning, V2X, Instance-Level Fusion
## TL;DR
This paper proposes CoopTrack, the first fully instance-level end-to-end cooperative 3D multi-object tracking framework. It achieves cross-agent instance matching and fusion via a learnable graph attention association module and multi-dimensional feature extraction, reaching state-of-the-art performance on V2X-Seq.
## Background & Motivation
Single-vehicle perception is inherently limited by a single viewpoint—occlusions and restricted sensing range are common challenges. V2X (Vehicle-to-Everything) communication enables multi-agent cooperative perception. However, existing cooperative perception research has focused primarily on single-frame tasks (3D detection), while the more challenging cooperative sequential perception task (e.g., cooperative 3D multi-object tracking, MOT) remains largely underexplored.
Limitations of Prior Work:
Tracking-by-Cooperative-Detection (TBCD): First performs cooperative detection, then applies a conventional tracker (e.g., AB3DMOT) as post-processing. The tracker cannot leverage fusion information, and the decoupled optimization of detection and tracking leads to suboptimal results.
Existing end-to-end approach (UniV2X): while pioneering an end-to-end cooperative tracking framework, it suffers from two design issues:
- Rule-based association (e.g., Euclidean-distance matching), which carries limited information and is susceptible to pose noise.
- A fusion-before-decoding pipeline that fuses queries from both agents before decoding with ego features, leading to ambiguity and conflicts.
Core Idea: Design a fusion-after-decoding pipeline: each agent decodes independently to obtain instance-level features, followed by cross-agent association and fusion. Association shifts from rule-based matching to learnable graph attention, leveraging multi-dimensional semantic and motion features for richer matching signals.
## Method
### Overall Architecture
CoopTrack comprises two sub-systems: a vehicle-side and a road-side agent. Each independently performs: image feature extraction → Transformer decoding → Multi-Dimensional Feature Extraction (MDFE). The road-side instance-level features are transmitted to the vehicle side via V2X communication at extremely low bandwidth. On the vehicle side, the received features are processed sequentially through: Cross-Agent Alignment (CAA) → Graph-Based Association (GBA) → feature aggregation → FFN for final output.
### Key Designs
- Multi-Dimensional Feature Extraction (MDFE):
- Decoupled semantic and motion features: Existing query-based methods implicitly couple the two, causing decoding ambiguity.
- Semantic features: extracted from query features via MLP.
- Motion features: the relative coordinates of 8 corners of the coarse 3D bounding box are encoded into motion features via PointNet (4-layer MLP + max pooling).
- Temporal enhancement: A dedicated temporal transformer block (2-layer decoder) with sinusoidal positional encoding captures temporal dependencies; short sequences are padded with zeros and handled with binary masks.
- Historical features are updated in a sliding window via FIFO.
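A minimal PyTorch sketch of the two MDFE branches described above. The class name, `d_model`, and the layer widths are assumptions for illustration, not the paper's exact configuration; only the structure (MLP semantic branch, 4-layer shared MLP + max pooling over 8 corners) follows the description:

```python
import torch
import torch.nn as nn

class MDFE(nn.Module):
    """Sketch of Multi-Dimensional Feature Extraction (illustrative only)."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        # Semantic branch: MLP over decoded query features.
        self.semantic_mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        # Motion branch: PointNet-style 4-layer shared MLP applied per corner.
        self.corner_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, queries: torch.Tensor, corners: torch.Tensor):
        # queries: (N, d_model) decoded instance queries
        # corners: (N, 8, 3) coarse-box corners
        rel = corners - corners.mean(dim=1, keepdim=True)   # center-relative coords
        semantic = self.semantic_mlp(queries)               # (N, d)
        motion = self.corner_mlp(rel).max(dim=1).values     # (N, d) max-pool over corners
        return semantic, motion
```

The temporal transformer block would then attend over a FIFO window of these per-frame features, with zero-padding and binary masks for short histories.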
- Cross-Agent Alignment (CAA):
- Differences in sensors, viewpoints, and spatial positions between the vehicle-side and road-side agents create a feature domain gap.
- Core Idea: The domain gap can be approximated as a linear transformation, analogous to the rigid-body transformation of spatial coordinates.
- Explicit spatial transformation: \(\tilde{\mathcal{P}}^I = \mathcal{P}^I \cdot \mathbf{R}^\top + \mathbf{t}\)
- Implicit feature transformation: \(\tilde{\mathcal{M}}^I = \mathcal{M}^I \cdot \hat{\mathbf{R}}^\top + \hat{\mathbf{t}}\)
- The latent rotation matrix \(\hat{\mathbf{R}} \in \mathbb{R}^{d \times d}\) and translation \(\hat{\mathbf{t}} \in \mathbb{R}^{1 \times d}\) are predicted by two MLPs from explicit pose parameters.
- A 6D continuous rotation representation with piecewise mapping is used to reduce parameter count.
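The implicit transformation can be sketched as below. This naive version predicts the full \(d \times d\) latent rotation directly from the relative pose, which is exactly the parameter blow-up the paper's 6D representation with piecewise mapping is meant to avoid; `pose_dim` and the MLP shapes are assumptions:

```python
import torch
import torch.nn as nn

class CrossAgentAlignment(nn.Module):
    """Sketch of CAA: predict a latent rotation R_hat and translation t_hat
    from explicit pose parameters, then align road-side features as
    M_tilde = M @ R_hat^T + t_hat (mirroring the explicit spatial transform)."""
    def __init__(self, d_model: int = 256, pose_dim: int = 9):
        super().__init__()
        # pose_dim = 6 (continuous rotation rep.) + 3 (translation): an assumption.
        self.rot_mlp = nn.Sequential(
            nn.Linear(pose_dim, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model * d_model),
        )
        self.trans_mlp = nn.Sequential(
            nn.Linear(pose_dim, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        self.d = d_model

    def forward(self, feats: torch.Tensor, pose: torch.Tensor):
        # feats: (N, d) road-side instance features; pose: (pose_dim,)
        R_hat = self.rot_mlp(pose).view(self.d, self.d)  # latent rotation (d, d)
        t_hat = self.trans_mlp(pose)                     # latent translation (d,)
        return feats @ R_hat.T + t_hat                   # aligned features (N, d)
```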
- Graph-Based Association (GBA):
- A fully connected association graph \(\mathcal{G} = \{\mathcal{N}, \mathcal{E}\}\) is constructed between vehicle-side and road-side instances.
- Node features: extracted via MLP from the concatenation of motion and semantic features.
- Edge features: Euclidean distances between vehicle-side and road-side reference points are encoded via MLP.
- Graph attention computes the raw affinity: \(\hat{A} = \frac{(\mathcal{N}^V W^V)(\mathcal{N}^I W^I)^\top}{\sqrt{d}} + \mathcal{E} W^E\)
- FFN + sigmoid generates the final affinity matrix \(A\).
- The Hungarian algorithm is applied to \(1 - A\) to obtain matched pairs.
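A hedged sketch of the GBA affinity computation and matching, with SciPy's Hungarian solver standing in for the assignment step. Edge features are assumed to be per-pair vectors of width `d`; a real system would likely also discard matches below an affinity threshold:

```python
import torch
import torch.nn as nn
from scipy.optimize import linear_sum_assignment

class GraphAssociation(nn.Module):
    """Sketch of Graph-Based Association between vehicle (V) and
    infrastructure (I) instance nodes, following the affinity formula above."""
    def __init__(self, d: int = 256):
        super().__init__()
        self.Wv = nn.Linear(d, d, bias=False)   # projects vehicle nodes
        self.Wi = nn.Linear(d, d, bias=False)   # projects road-side nodes
        self.We = nn.Linear(d, 1, bias=False)   # projects edge features to a bias
        self.ffn = nn.Sequential(nn.Linear(1, d), nn.ReLU(), nn.Linear(d, 1))
        self.d = d

    def forward(self, nodes_v, nodes_i, edges):
        # nodes_v: (Nv, d), nodes_i: (Ni, d), edges: (Nv, Ni, d)
        logits = self.Wv(nodes_v) @ self.Wi(nodes_i).T / self.d ** 0.5
        logits = logits + self.We(edges).squeeze(-1)          # scaled dot product + edge bias
        A = torch.sigmoid(self.ffn(logits.unsqueeze(-1))).squeeze(-1)  # (Nv, Ni) in [0, 1]
        # Hungarian matching on cost = 1 - affinity.
        rows, cols = linear_sum_assignment((1 - A).detach().cpu().numpy())
        return A, list(zip(rows.tolist(), cols.tolist()))
```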
- Feature Aggregation and Tracking Propagation:
- Matched instance pairs fuse multi-dimensional features into a single representation (eliminating duplicate detections).
- Unmatched instances are retained directly (extending the observation range).
- Semantic features of active instances are propagated as query features to the next frame.
- Reference points are predicted for the next frame using a constant velocity assumption.
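A toy sketch of the aggregation and propagation logic. The simple feature average and the fixed frame interval `dt` are stand-ins for the paper's learned fusion and the actual frame rate, and the tensor shapes are assumptions:

```python
import torch

def fuse_and_propagate(sem_v, sem_i, matches, refs, vels, dt=0.1):
    """Toy sketch of feature aggregation and track propagation.

    sem_v: (Nv, d) vehicle-side semantic features
    sem_i: (Ni, d) aligned road-side semantic features
    matches: list of (v_idx, i_idx) pairs from the association step
    refs, vels: (M, 3) reference points / velocities of the fused set
    dt: frame interval in seconds (e.g., 0.1 at 10 Hz)
    """
    fused = sem_v.clone()
    for v_idx, i_idx in matches:
        # Average as a stand-in for the learned fusion of matched pairs
        # (eliminates duplicate detections of the same object).
        fused[v_idx] = 0.5 * (sem_v[v_idx] + sem_i[i_idx])
    matched_i = {i for _, i in matches}
    keep_i = [i for i in range(sem_i.shape[0]) if i not in matched_i]
    # Unmatched road-side instances are retained, extending the observation range.
    fused = torch.cat([fused, sem_i[keep_i]], dim=0)
    # Constant-velocity prediction of next-frame reference points.
    refs_next = refs + vels * dt
    return fused, refs_next
```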
## Loss & Training
Two-stage training:
- Stage 1: Vehicle-side and road-side end-to-end tracking models are trained independently.
- \(\mathcal{L}_{\text{stage1}} = 0.25 \cdot \mathcal{L}_{\text{bbx}} + 2.0 \cdot \mathcal{L}_{\text{cls}}\)
- Classification uses Focal Loss (\(\alpha=0.25, \gamma=2.0\)); regression uses L1 Loss.
- Stage 2: End-to-end cooperative tracking with association training.
- \(\mathcal{L}_{\text{stage2}} = 0.25 \cdot \mathcal{L}_{\text{bbx}} + 2.0 \cdot \mathcal{L}_{\text{cls}} + 10.0 \cdot \mathcal{L}_{\text{asso}}\)
- Association labels are automatically generated by matching predictions to GT via the Hungarian algorithm.
- Association loss uses Focal Loss (\(\alpha=0.5, \gamma=1.0\)).
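With the weights and focal-loss parameters quoted above, the stage-2 objective can be sketched using torchvision's focal loss. Matching predictions to ground truth (via the Hungarian algorithm) is assumed done upstream:

```python
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def stage2_loss(pred_boxes, gt_boxes, cls_logits, cls_targets,
                asso_logits, asso_targets):
    """Sketch of L_stage2 = 0.25 * L_bbx + 2.0 * L_cls + 10.0 * L_asso.
    Targets are float tensors with the same shape as the logits."""
    l_bbx = F.l1_loss(pred_boxes, gt_boxes)                          # box regression
    l_cls = sigmoid_focal_loss(cls_logits, cls_targets,
                               alpha=0.25, gamma=2.0, reduction="mean")
    l_asso = sigmoid_focal_loss(asso_logits, asso_targets,
                                alpha=0.5, gamma=1.0, reduction="mean")
    return 0.25 * l_bbx + 2.0 * l_cls + 10.0 * l_asso
```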
## Key Experimental Results
### Main Results
Comparison with cooperative perception state-of-the-art on V2X-Seq (ResNet101 backbone):
| Method | Paradigm | mAP↑ | AMOTA↑ | Transmission (bytes/s)↓ |
|---|---|---|---|---|
| V2X-ViT | TBCD | 0.268 | 0.287 | 2.56×10⁶ |
| Where2comm | TBCD | 0.162 | 0.106 | 5.40×10⁵ |
| Late Fusion | TBCD | 0.196 | 0.263 | 6.60×10² |
| UniV2X | E2EC | 0.295 | 0.239 | 6.96×10⁴ |
| CoopTrack | E2EC | 0.390 | 0.328 | 5.64×10⁴ |
CoopTrack outperforms UniV2X by +9.5% mAP and +8.9% AMOTA while requiring lower transmission volume.
### Ablation Study
Incremental contribution of each module (ResNet50 backbone; "Pipeline" denotes switching from fusion-before-decoding to the fusion-after-decoding pipeline):
| Pipeline | MDFE | CAA | GBA | mAP↑ | AMOTA↑ |
|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | ✗ | 0.310 | 0.266 |
| ✓ | ✗ | ✗ | ✗ | 0.337 | 0.277 |
| ✓ | ✓ | ✗ | ✗ | 0.345 | 0.283 |
| ✓ | ✗ | ✓ | ✗ | 0.354 | 0.304 |
| ✓ | ✓ | ✓ | ✗ | 0.355 | 0.332 |
| ✓ | ✓ | ✓ | ✓ | 0.356 | 0.346 |
Effect of historical frame count: AMOTA improves from 0.100 (0 frames) to 0.346 (4 frames), validating the value of temporal modeling.
### Key Findings
- The fusion-after-decoding pipeline alone yields +2.7% mAP and +1.1% AMOTA over fusion-before-decoding.
- The CAA module learns implicit information: adding rotation noise only to the alignment module causes minor degradation, whereas global noise injection leads to significant performance drops.
- Higher frame rates benefit tracking: at 10 Hz, CoopTrack's AMOTA is 10.7% higher than at 2 Hz.
- The method generalizes effectively to the Griffin dataset (aerial-ground cooperation), demonstrating its broader applicability.
## Highlights & Insights
- Instance-level feature transmission incurs extremely low bandwidth overhead (5.64×10⁴ bytes/s): roughly 45× less than V2X-ViT's BEV feature fusion and an order of magnitude below Where2comm (see the main results table).
- Learnable association is more robust than rule-based matching: even when reference point positions are imprecise, semantic and motion features enable correct association.
- Decoupling multi-dimensional features (semantic vs. motion) resolves the decoding ambiguity caused by implicit coupling in query-based methods.
- The automatic generation of association labels cleverly leverages the prediction capability of the Stage 1 model.
## Limitations & Future Work
- The two-stage training pipeline is relatively complex; future work may explore single-stage end-to-end training.
- Validation is limited to vehicle-road (V2I) scenarios; multi-vehicle V2V settings remain to be explored.
- The communication delay compensation module, while effective, lacks fine granularity; performance still degrades under long delays.
- Pose noise has a considerable impact on the system (global noise), underscoring the importance of robust pose estimation.
- LiDAR input is not explored; only camera images are used.
## Related Work & Insights
- The temporal query propagation and prediction mechanism of PF-Track is extended in this work.
- ADA-Track's differentiable association module shares conceptual similarity with GBA, though applied at a different level.
- QUEST's instance-level feature fusion paradigm aligns with the instance-level transmission approach adopted here.
- UniV2X's pioneering work provides both the baseline and the direction for improvement in this paper.
## Rating
- Novelty: ⭐⭐⭐⭐ — First fully end-to-end learnable association framework for cooperative tracking.
- Technical Depth: ⭐⭐⭐⭐ — Well-motivated multi-module design with solid theoretical grounding.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Two datasets, comprehensive ablations, and qualitative analysis.
- Value: ⭐⭐⭐⭐ — Low bandwidth with high performance; strong practical prospects.
- Overall Recommendation: ⭐⭐⭐⭐