NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception¶
Conference: NeurIPS 2025 | arXiv: 2510.27647 | Code: None
Area: Multimodal VLM / Collaborative Perception
Keywords: Collaborative Perception, Heterogeneity, Common Representation, Domain Adaptation, Autonomous Driving
TL;DR¶
This paper proposes NegoCollab, a framework that introduces a Negotiator module to negotiate a common representation from the local representations of heterogeneous multimodal agents during training. The negotiated representation bridges the domain gaps between heterogeneous collaborators and enables low-cost collaborative perception.
Background & Motivation¶
Background: Multi-agent collaborative perception expands perception range and overcomes blind spots through feature sharing, making it a key direction for V2X communication.
Limitations of Prior Work: Agents may be equipped with different, often fixed, perception models, so their intermediate features exhibit domain gaps. Pairwise adaptation methods (MPDA/PnPDA) must train an adapter for every pair of agent types, so training cost scales quadratically with the number of types.
Key Challenge: Designating a single agent's representation as the common representation introduces bias, making alignment difficult for modalities that differ significantly from that agent.
Key Insight: The common representation should not be fixed to any single agent's representation; instead, it should be negotiated from the local representations of agents across modalities.
Core Idea: Multi-dimensional alignment (distributional + structural + pragmatic) combined with cyclic consistency to negotiate a neutral common representation from multimodal features.
Method¶
Overall Architecture¶
The system comprises \(N\) agents spanning \(M\) modalities. The pipeline proceeds as: Local Representation → Sender → Common Representation (Negotiator) → Receiver → Local Representation.
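The dataflow above can be sketched as a toy round trip. This is a minimal illustration only: features are flat lists of floats, the per-modality "projection" is a scalar scale, and the Negotiator is reduced to element-wise averaging; none of these stand in for the paper's actual ConvNeXt/axial-attention modules.

```python
# Toy dataflow for Local -> Sender -> Negotiator -> Receiver -> Local.
# All function bodies are illustrative placeholders, not the authors' models.

def sender(local_feat, scale):
    # Project a modality's local feature into the shared (common) space.
    return [x * scale for x in local_feat]

def negotiator(projected_feats):
    # Fuse the per-modality projections into one common representation
    # (element-wise mean as a stand-in for the paper's FPN-based fusion).
    n = len(projected_feats)
    return [sum(col) / n for col in zip(*projected_feats)]

def receiver(common_feat, scale):
    # Map the common representation back into a modality's local space.
    return [x / scale for x in common_feat]

lidar, camera = [1.0, 2.0], [3.0, 4.0]
common = negotiator([sender(lidar, 2.0), sender(camera, 1.0)])
recovered = receiver(common, 2.0)
```

The cyclic-consistency loss (Section 3.2.3) penalizes the gap between `recovered` and the original local feature, which is what pushes the Sender/Receiver pair toward an information-preserving round trip.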
Key Designs¶
-
Sender (Feature → Common Representation)
- Function: Maps local features into the common representation space.
- Mechanism: A dual-module design — Recombiner (ConvNeXt architecture to enhance local features and adjust dimensions) + Aligner (fusion axis attention to capture global and local dependencies).
- Design Motivation: Both dimensional and semantic alignment must be addressed simultaneously.
-
Negotiator (Negotiating the Common Representation)
- Function: Negotiates a unified common representation from the outputs of multimodal Senders.
- Mechanism: Feature Pyramid Network (FPN)-based fusion strategy \(P = \bigoplus_{l,m} (u_l(P^{(m)}_l) \odot \text{norm}(P^{(m)}_l))\).
- Design Motivation: Explicitly learns to generate the common representation \(P\) rather than designating any single modality, eliminating bias.
-
Receiver (Common → Local)
- Function: Transforms the common representation back into the local modality space.
- Mechanism: Converter (fusion axis attention + local guidance, with Query derived from Recombiner output) + Recombiner.
- Design Motivation: The common representation contains multimodal fused information and requires targeted conversion for each modality.
-
Multi-Dimensional Alignment Loss (Section 3.2.3)
- Distributional alignment: Matches mean and standard deviation \(\mathcal{L}_{uni-dis}^{(m)} = \|P^{(m)} - P\|_2^2 + \alpha\|Std(P^{(m)}) - Std(P)\|_2^2\)
- Structural alignment: Maintains consistency of feature similarity matrices at 9 keypoints.
- Pragmatic alignment: Ensures consistent organization of foreground information \(\mathcal{L}_{uni-pragma}^{(m)} = L_{focal}(\mathcal{N}(P^{(m)}), Y)\)
- Cyclic consistency: \(\mathcal{L}_{cycle}^{(m)} = \|F^{(m)} - L^{(m)}\|_2^2\), where \(F^{(m)}\) is the original local feature and \(L^{(m)}\) is its reconstruction after the forward (Sender) and backward (Receiver) transformations, minimizing information loss over the round trip.
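The distributional alignment term above can be made concrete with a small sketch. For simplicity this treats \(P^{(m)}\) and \(P\) as flat vectors and \(Std(\cdot)\) as a scalar population standard deviation; the function name `uni_dis_loss` and the default `alpha` are illustrative, not from the paper's code.

```python
# Sketch of L_uni-dis = ||P^(m) - P||^2 + alpha * ||Std(P^(m)) - Std(P)||^2,
# with features flattened to 1-D lists. Names and defaults are hypothetical.
import math

def _std(xs):
    # Population standard deviation of a flat feature vector.
    mu = sum(xs) / len(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))

def uni_dis_loss(p_m, p, alpha=1.0):
    # Squared L2 distance between the modality feature and the common one...
    sq = sum((a - b) ** 2 for a, b in zip(p_m, p))
    # ...plus a penalty on the mismatch of their spreads.
    return sq + alpha * (_std(p_m) - _std(p)) ** 2

# Identical features incur zero loss; any gap in values or spread is penalized.
zero = uni_dis_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

The structural and pragmatic terms would be added on top of this: the former compares pairwise feature-similarity matrices, the latter runs a detection head \(\mathcal{N}\) on \(P^{(m)}\) against the labels \(Y\) with a focal loss.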
Loss & Training¶
Three-stage training: Stage 1 trains the Sender/Receiver with multi-dimensional alignment and cyclic consistency losses; Stage 2 jointly trains the Negotiator; Stage 3 performs end-to-end fine-tuning.
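The three-stage schedule can be written down as a small configuration sketch. The module and loss names below paraphrase the description above; the dictionary layout is illustrative, not the authors' released code.

```python
# Illustrative encoding of NegoCollab's three-stage training schedule.
# Stage contents paraphrase the paper's description; keys are hypothetical.

def training_plan():
    return [
        # Stage 1: train Senders/Receivers with alignment + cycle losses.
        ("stage1", {"train": ["sender", "receiver"],
                    "losses": ["multi_dim_alignment", "cycle_consistency"]}),
        # Stage 2: bring the Negotiator into joint training.
        ("stage2", {"train": ["sender", "receiver", "negotiator"],
                    "losses": ["multi_dim_alignment", "cycle_consistency"]}),
        # Stage 3: end-to-end fine-tuning of the full pipeline.
        ("stage3", {"train": ["all"], "losses": ["end_to_end"]}),
    ]
```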
Key Experimental Results¶
Main Results (OPV2V-H Dataset)¶
| Method | Agent Types | AP@0.5 | AP@0.7 | Notes |
|---|---|---|---|---|
| No Fusion | m1, m2 | 0.482 | 0.350 | Single-agent baseline |
| MPDA (pairwise) | m1, m2 | 0.815 | 0.692 | Pairwise adaptation |
| PnPDA | m2, m4 | 0.532 | 0.331 | Degrades under large cross-modal gap |
| NegoCollab | m1, m2 | 0.872 | 0.911 | Common representation |
| NegoCollab | m1, m3 | 0.949 | 0.854 | New agent onboarding |
Ablation Study¶
| Alignment | AP@0.5 | Gain (relative) | Notes |
|---|---|---|---|
| Distributional only | 0.812 | Baseline | Conventional method |
| + Structural | 0.841 | +3.6% | Spatial relationships |
| + Pragmatic | 0.858 | +5.7% | Foreground consistency |
| Full three-dimensional | 0.872 | +7.4% | Comprehensive constraints |
Key Findings¶
- Training cost reduced by 60% compared to pairwise adaptation.
- The common representation natively supports new agent onboarding without retraining the Negotiator.
- Over 40% improvement on real-world datasets V2V4Real and DAIR-V2X.
Highlights & Insights¶
- Negotiation Framework: Breaks the limitation of "designation" by generating a more neutral and informative common representation. This paradigm is transferable to other multimodal fusion scenarios.
- Multi-Dimensional Alignment: Goes beyond conventional distributional alignment by incorporating structural and pragmatic constraints, forming a more complete alignment mechanism.
- Cost-Performance Balance: New agents can be onboarded by training only new Sender/Receiver modules — \(O(M)\) complexity rather than \(O(M^2)\).
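The \(O(M)\) vs. \(O(M^2)\) claim is easy to verify with back-of-envelope counts. The helper names below are hypothetical; the counts assume one adapter per ordered pair of agent types for pairwise adaptation, and one Sender plus one Receiver per type for NegoCollab.

```python
# Back-of-envelope module counts for M heterogeneous agent types.
# Function names are illustrative, not from either paper's code.

def pairwise_adapters(m):
    # Pairwise adaptation: one adapter per ordered (source, target) pair.
    return m * (m - 1)

def negocollab_modules(m):
    # NegoCollab: one Sender and one Receiver per agent type.
    return 2 * m

for m in (2, 4, 8):
    print(f"M={m}: pairwise={pairwise_adapters(m)}, negocollab={negocollab_modules(m)}")
```

The gap widens quickly: at M=8 agent types, pairwise adaptation needs 56 adapters versus 16 Sender/Receiver modules, and onboarding a new type adds only one Sender/Receiver pair rather than adapters to every existing type.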
Limitations & Future Work¶
- Experiments are limited to LiDAR+Camera dual-modality settings; generalization to three or more modalities remains unverified.
- The paper does not discuss strategies for compressing the common representation to reduce communication bandwidth.
- The synchronization assumption across agents may not hold in real-world network environments.
- The additional computation of the Negotiator may become a bottleneck on edge devices.
Related Work & Insights¶
- vs. MPDA: MPDA requires training adapters for each modality pair at \(O(M^2)\) cost; NegoCollab requires only \(O(M)\).
- vs. PnPDA: PnPDA performs poorly under large cross-modal gaps (AP@0.7 of only 0.331); NegoCollab's negotiation mechanism is more robust.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Introduction of negotiated common representation
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple collaboration scenarios with real-world validation
- Writing Quality: ⭐⭐⭐⭐ — Clear framework presentation with rigorous formulations
- Value: ⭐⭐⭐⭐⭐ — High practical deployment value for V2X scenarios