# CooperTrim: Adaptive Data Selection for Uncertainty-Aware Cooperative Perception
Conference: ICLR 2026 · arXiv: 2602.13287 · Code: https://cisl.ucr.edu/CooperTrim · Area: 3D Vision · Keywords: cooperative perception, bandwidth optimization, temporal uncertainty, feature selection, conformal prediction
## TL;DR
CooperTrim is an adaptive feature selection framework that evaluates feature relevance via conformal temporal uncertainty estimation and dynamically determines the sharing volume through a data-driven mechanism. It achieves 80.28% bandwidth reduction with comparable performance on cooperative semantic segmentation, and is the first to apply selective sharing to cooperative segmentation tasks.
## Background & Motivation
Background: Cooperative perception enables autonomous vehicles to share encoded feature representations for enhanced situational awareness. Intermediate fusion is the dominant paradigm, yet the volume of transmitted features (typically ~40 Mbps) still strains wireless links. Existing bandwidth-optimization strategies fall into compression (lossy), selection (fixed-threshold), and hybrid approaches.
Limitations of Prior Work: (a) Where2Comm employs confidence maps with fixed thresholds for feature selection, ignoring temporal context, resulting in persistently high bandwidth (~39.6 Mbps); (b) SwissCheese applies fixed-threshold channel/spatial selection without environmental adaptability; (c) all existing methods make per-frame decisions independently, repeatedly transmitting static information.
Key Challenge: The fundamental tension between limited bandwidth and rich sensory data — existing methods merely transmit less per frame rather than leveraging temporal continuity for demand-driven sharing.
Goal: (a) Utilize temporal context to identify dynamic features that genuinely require updating; (b) adaptively adjust the sharing volume according to scene complexity.
Key Insight: The ego vehicle can leverage its own temporal memory to determine which features carry "new information" (high temporal uncertainty) and request only those that have changed — transmitting less in simple scenes and more in complex ones.
Core Idea: Measure feature relevance through temporal uncertainty rather than static confidence scores, enabling environment-adaptive, demand-driven sharing.
## Method
### Overall Architecture
The ego vehicle computes conformal temporal uncertainty from the current-frame features \(F_t\) and the previously fused features \(F_{t-1}^{\text{fused}}\). A learnable quantile threshold \(q\) and an attention mask threshold \(\tau\) determine the subset of features to request. A request vector is broadcast, and the selected features received from collaborative vehicles are then fused.
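Below is a minimal, hedged sketch of this selection pipeline in PyTorch, assuming BEV feature maps of shape (C, H, W). The helper names and the sigmoid relevance score standing in for the paper's cross-attention weighting are illustrative assumptions, and the learnable thresholds \(q\) and \(\tau\) appear here as fixed scalars:

```python
import torch

def temporal_uncertainty(f_t: torch.Tensor, f_prev_fused: torch.Tensor) -> torch.Tensor:
    """Per-element L1 change score: S_t = |F_t - F_{t-1}^fused|."""
    return (f_t - f_prev_fused).abs()

def select_features(f_t, f_prev_fused, q_level=0.8, tau=0.5):
    """Build a binary request mask over feature locations.

    q_level and tau stand in for the paper's *learnable* quantile
    threshold q and attention-mask threshold tau (fixed scalars here).
    """
    s_t = temporal_uncertainty(f_t, f_prev_fused)      # (C, H, W) change scores
    q = torch.quantile(s_t.flatten(), q_level)         # conformal-style cutoff
    uncertain = s_t > q                                # features carrying "new information"
    # Stand-in for cross-attention relevance weighting over uncertain features:
    relevance = torch.sigmoid(s_t - s_t.mean())        # scores in (0, 1)
    return uncertain & (relevance > tau)               # adaptive sharing volume

# Toy usage: in a mostly static scene the requested fraction stays small.
f_t = torch.randn(64, 32, 32)
f_prev_fused = f_t + 0.1 * torch.randn_like(f_t)
mask = select_features(f_t, f_prev_fused)
print(f"requested fraction: {mask.float().mean().item():.2%}")
```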
### Key Designs
- Conformal Temporal Uncertainty:
  - Function: Quantifies the degree of change in each feature channel relative to its temporal context.
  - Mechanism: Computes the L1 distance between the current frame and the previous fused frame, \(S_t = |F_t - F_{t-1}^{\text{fused}}|\), and applies gating via a learnable quantile threshold \(q\) (inspired by conformal prediction), retaining only features whose change exceeds \(q\) as "uncertain."
  - Design Motivation: In static scenes, most features remain unchanged across frames and need not be retransmitted.
- Adaptive Volume Determination:
  - Function: Dynamically adjusts the number of shared features according to scene complexity.
  - Mechanism: Cross-attention weighting is applied to the uncertain features, followed by truncation via a learnable mask threshold \(\tau\); complex scenes (e.g., multiple intersections) yield high relevance scores, so more features exceed the threshold and more are transmitted.
  - Design Motivation: Realizes the adaptive behavior of transmitting less in simple scenes and more in complex ones.
- \(\epsilon\)-Greedy Training Strategy (see the training-step sketch after this list):
  - Function: Balances training on full features versus training on the selected subset.
  - Mechanism: With probability \(\epsilon\), all features are used (exploration); with probability \(1-\epsilon\), only the selected features are used (exploitation). The paper shows theoretically that this reduces both the bias and the variance of the gradient estimator.
  - Design Motivation: Training exclusively on partial features can introduce large gradient noise and unstable convergence.
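A minimal training-step sketch of the \(\epsilon\)-greedy strategy, assuming a generic fusion model and an MSE placeholder for the actual segmentation loss; `mask_fn` would be a selector such as `select_features` from the sketch above:

```python
import random
import torch
import torch.nn.functional as F

def epsilon_greedy_step(model, f_t, f_prev_fused, target, mask_fn, eps=0.1):
    """With probability eps, fuse ALL features (exploration); otherwise
    fuse only the selected subset (exploitation)."""
    if random.random() < eps:
        shared = f_t                                   # full features
    else:
        mask = mask_fn(f_t, f_prev_fused)              # e.g. select_features(...)
        shared = f_t * mask.float()                    # zero out unselected features
    loss = F.mse_loss(model(shared), target)           # placeholder task loss
    loss.backward()
    return loss
```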
### Loss & Training
Lagrangian-constrained optimization: \(\theta^* = \arg\min_\theta L(C(\theta)) + \lambda \cdot (P(C(\theta)) - C_{1.6})\), which minimizes the task loss \(L\) subject to keeping the bandwidth cost \(P\) below the budget \(C_{1.6} = 1.6\) Mbps; the multiplier \(\lambda\) is adjusted dynamically during training.
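The paper states only that \(\lambda\) is adjusted dynamically; a common realization is a dual-ascent update, sketched below under that assumption (step size and projection are illustrative, not the authors' exact schedule):

```python
import torch

BUDGET_MBPS = 1.6     # the bandwidth budget C_{1.6}
lam, lam_lr = 0.0, 0.01

def constrained_loss(task_loss: torch.Tensor, bandwidth_mbps: float) -> torch.Tensor:
    """Return L + lambda * (P - budget) and update lambda by dual ascent:
    lambda grows while the budget is violated, shrinks (never below 0) otherwise."""
    global lam
    violation = bandwidth_mbps - BUDGET_MBPS
    loss = task_loss + lam * violation
    lam = max(0.0, lam + lam_lr * violation)   # projected dual-ascent step
    return loss
```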
## Key Experimental Results
### Main Results
Cooperative semantic segmentation (OPV2V dataset, applied to CoBEVT / AttFuse / DiscoNet):
| Configuration | Dynamic IoU | Bandwidth Usage | Bandwidth Reduction |
|---|---|---|---|
| CoBEVT (original) | Baseline | 100% (40 Mbps) | — |
| CooperTrim-CoBEVT | Comparable | 27.9% | 72.1% |
| CooperTrim-AttFuse | Comparable | 21.07% | 78.93% |
| CooperTrim-DiscoNet | Comparable | 10.18% | 89.82% |
Comparison with other selection strategies:
| Method | Dynamic IoU (%) | Bandwidth (Mbps) |
|---|---|---|
| Where2Comm | 8.62 | 39.6 |
| SwissCheese | 35.71 | 10.0 |
| CooperTrim | 54.03 | 11.16 |
### Ablation Study
| Analysis | Key Finding |
|---|---|
| + Compression (32×) | Bandwidth reduced to 1.46% with no IoU degradation |
| Localization error robustness | Performance degrades gracefully under positional noise |
| Communication latency robustness | Remains stable under transmission delays |
| Frame-level analysis | Dynamic scenes automatically allocated more bandwidth; static scenes exhibit very low bandwidth usage |
### Key Findings
- Average bandwidth reduction of 80.28% (segmentation) and 72.52% (detection) with comparable performance.
- CooperTrim outperforms Where2Comm by 45.41% in IoU while using 72% less bandwidth.
- Orthogonal to compression methods — combining both reduces bandwidth to 1.46%.
- Qualitative analysis confirms adaptive behavior: bandwidth usage increases when vehicles traverse intersections and decreases during straight-road travel.
## Highlights & Insights
- Elegant exploitation of temporal information: Using inter-frame variation directly as an uncertainty measure is simple yet effective, avoiding complex uncertainty modeling.
- First selective perception for cooperative segmentation: Segmentation demands pixel-level precision, posing greater bandwidth challenges than detection — achieving 80%+ reduction is highly impressive.
- Orthogonality to compression: Combining selection with compression achieves 1.46% bandwidth usage, demonstrating the complementarity of the two strategies.
- Theoretical guarantees for \(\epsilon\)-Greedy training: A rigorous scaling analysis of gradient bias for sparse feature training is provided.
## Limitations & Future Work
- Assumes accurate pose estimation — in practice, GPS/localization errors may affect spatial transformation.
- Validated on only two datasets (OPV2V and V2V4Real), limiting scene diversity.
- The conformal temporal uncertainty relies solely on L1 distance, without modeling semantic-level changes.
- Learnable thresholds \(q\) and \(\tau\) may require re-tuning under domain shift.
- Multi-hop communication and heterogeneous sensor configurations are not addressed.
## Related Work & Insights
- vs. Where2Comm: Where2Comm uses static confidence maps with fixed thresholds, ignoring temporal context. CooperTrim uses temporal uncertainty with adaptive thresholds, achieving 45%+ higher IoU and 72% lower bandwidth.
- vs. SwissCheese: SwissCheese applies fixed-threshold channel/spatial selection. CooperTrim's adaptive mechanism achieves 18%+ higher IoU at comparable bandwidth.
- vs. UniSense: UniSense employs uncertainty-driven selection but makes per-frame independent decisions. CooperTrim uses temporal contrast to reduce redundant transmission.
- Implications for edge AI: The paradigm of demand-driven transmission guided by temporal differences is transferable to any bandwidth-constrained distributed perception scenario.
## Rating
- Novelty: ⭐⭐⭐⭐ The combination of temporal uncertainty and adaptive volume is novel, though individual components are not entirely new.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multi-model / multi-task / multi-strategy comparisons, compression compatibility, and robustness analyses.
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear, though some equations could be more concise.
- Value: ⭐⭐⭐⭐ Substantially advances the practical deployment of cooperative perception systems.