Linking Modality Isolation in Heterogeneous Collaborative Perception¶

Conference: CVPR 2026
arXiv: 2603.00609
Code: cxliu0314/CodeAlign
Area: Pretraining
Keywords: Collaborative Perception, Heterogeneous Alignment, Modality Isolation, Codebook, Cross-Modal Translation

TL;DR¶

CodeAlign builds a discrete code space from per-modality codebooks and performs cross-modal Feature-Code-Feature (FCF) translation. It is presented as the first framework to solve the "modality isolation" problem in heterogeneous collaborative perception, where different modalities never co-occur in the training data. It trains only 8% as many parameters as HEAL, cuts communication by 1024x, and reports SOTA perception performance.

Background & Motivation¶

Value of collaborative perception: Multi-agent systems (e.g., connected autonomous vehicles) can build more comprehensive environmental understanding by sharing perception information, compensating for single-vehicle blind spots and occlusions
Heterogeneity problem: In practice, vehicles from different manufacturers carry different sensor types (LiDAR/Camera), configurations (64-beam/32-beam), and perception models, creating significant domain gaps for feature-level fusion
Ubiquity of modality isolation: Different institutions collect data at different locations and times, causing many modality pairs to never co-occur in the same scene—e.g., Institution A has only LiDAR data, Institution B has only Camera data, with no spatially overlapping observations
Dependencies and limitations of existing methods: HEAL requires expensive encoder retraining; STAMP/GT-Space depend on co-occurrence data with spatial correspondence supervision or shared FoV; HMViT and Pyramid Fusion require joint training and degrade severely under modality isolation (Pyramid Fusion's AP70 drops by 15.21 points)
Efficiency bottleneck: Intermediate fusion methods transmit dense feature maps, incurring massive communication overhead (32MB per exchange), limiting practical deployment
Privacy constraints: Data from different institutions is subject to privacy regulations, preventing direct sharing of raw data and further complicating cross-modal alignment

Method¶

Overall Architecture: CodeAlign¶

CodeAlign comprises two training stages and one inference pipeline:

Stage 1: Code Space Construction

A lightweight adapter (4-layer ResNet blocks, 3×3 convolutions) and a learnable codebook (size \(D=16\)) are inserted between each modality's encoder and backend
Encoders and backends are frozen; only adapters and codebooks are trained
Each spatial position in the BEV feature map is mapped to codebook indices via nearest-neighbor quantization: \(I_{[h,w]} = \arg\min_\ell \| (\mathcal{P}(F))_{[h,w]} - C[\ell] \|_2^2\)
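Communication transmits only codebook index maps (\(H \times W \times \log_2(D)\) bits, i.e. 4 bits per BEV cell for \(D=16\)) instead of dense features (\(H \times W \times C\) values, where \(C\) here denotes the channel count rather than the codebook), achieving ~1024x compression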
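The quantization step and the compression figure above can be illustrated with a minimal NumPy sketch. The function names are hypothetical, and the channel count of 128 and 32-bit feature values are assumptions that are not stated in these notes; they are simply one combination that reproduces the quoted ~1024x ratio.

```python
import numpy as np

def quantize_bev(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each BEV cell to the index of its nearest codebook entry.

    features: (H, W, C) adapter output, i.e. P(F) in the text.
    codebook: (D, C) code vectors, i.e. C[l] in the text.
    Returns an (H, W) integer index map.
    """
    # Squared L2 distance between every cell and every code: shape (H, W, D).
    diff = features[:, :, None, :] - codebook[None, None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    return np.argmin(dist, axis=-1)

def compression_ratio(channels: int, codebook_size: int, bits_per_value: int = 32) -> float:
    """Bits per cell for dense features divided by bits per cell for indices."""
    return channels * bits_per_value / np.log2(codebook_size)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 128))        # D = 16 from the text; C = 128 is assumed
true_idx = rng.integers(0, 16, size=(4, 6))  # toy 4 x 6 BEV grid
features = codebook[true_idx]                # every cell sits exactly on one code
indices = quantize_bev(features, codebook)   # recovers true_idx
ratio = compression_ratio(channels=128, codebook_size=16)
```

Under these assumptions each cell shrinks from 128 × 32 bits to 4 bits. With a different channel count or feature precision the ratio would change accordingly.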

Group Code Space Construction: Non-isolated modalities share a single codebook, enabling features from different modalities representing the same object to map to the same codebook embedding, achieving natural alignment while reducing the number of cross-modal translators needed.

Stage 2: FCF Translation (Feature-Code-Feature)

Feature→Code: Cross-modal translator \(T_{m_i \to m_j}\) maps the source modality's dense features to index maps in the target modality's codebook
Code→Feature: Target modality's reconstructor \(R_{m_j}\) decodes index maps back into dense features that naturally reside in the target modality's feature space
The dense-to-code scheme is chosen (rather than dense-to-dense or code-to-code), balancing reconstruction accuracy with communication efficiency
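The two FCF steps can be sketched as follows. This is a simplified illustration rather than the paper's implementation: a single linear layer stands in for the translator, a plain codebook lookup stands in for the reconstructor \(R_{m_j}\), and all function names and tensor sizes are hypothetical.

```python
import numpy as np

def translate_to_codes(src_features, weight, bias):
    """Feature -> Code: per-cell logits over the target codebook, then argmax.

    A linear layer is an assumed stand-in for the translator network.
    """
    logits = src_features @ weight + bias   # (H, W, D)
    return np.argmax(logits, axis=-1)       # (H, W) index map

def reconstruct_from_codes(index_map, target_codebook):
    """Code -> Feature: look up each index in the target modality's codebook."""
    return target_codebook[index_map]       # (H, W, C_tgt)

rng = np.random.default_rng(0)
H, W, C_SRC, C_TGT, D = 4, 6, 64, 128, 16   # toy sizes; only D = 16 comes from the text
src_features = rng.normal(size=(H, W, C_SRC))
weight = rng.normal(size=(C_SRC, D))
bias = np.zeros(D)
target_codebook = rng.normal(size=(D, C_TGT))

index_map = translate_to_codes(src_features, weight, bias)
tgt_features = reconstruct_from_codes(index_map, target_codebook)
payload_bits = index_map.size * int(np.log2(D))  # size of the index map in bits
```

In this sketch every reconstructed cell is, by construction, a vector from the target codebook, which is the sense in which the output "resides in the target modality's feature space". Only the compact index map would need to be exchanged.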

One-to-Many Code Translator:

Shared backbone (stacked ConvNeXt blocks) + modality-specific multi-head outputs
Training parameters grow linearly with the number of modalities \(n\) (about \(0.5\text{M} \cdot n\)), avoiding the quadratic growth of one-to-one approaches
Data balancing strategy: dynamically adjusts training data ratios based on loss changes across different targets
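The scaling argument can be made concrete with a small counting sketch. It assumes that a one-to-one design needs one translator per ordered (source, target) pair; the notes do not spell this out, and the helper names are hypothetical.

```python
def one_to_one_translators(n: int) -> int:
    """Translators needed if every ordered (source, target) pair gets its own (assumed)."""
    return n * (n - 1)

def one_to_many_params_m(n: int, per_modality_m: float = 0.5) -> float:
    """Training parameters in millions, using the (0.5M) * n figure from the text."""
    return per_modality_m * n

# How the two designs grow as modalities are added.
growth = {n: (one_to_one_translators(n), one_to_many_params_m(n)) for n in (2, 4, 7)}
```

For seven modalities this gives 42 pairwise translators under the one-to-one assumption, against about 3.5M training parameters in total for the one-to-many design.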

Loss & Training¶

\[L = L_{\text{det}}(\hat{\mathcal{O}}_i, \mathcal{O}_i^0) + L_{\text{pyramid}} + \lambda \sum_{k,j \in \mathcal{G}_s, m_k \neq m_j} L_{\text{sim}}(F_{k \to i}, F_{j \to i})\]

\(L_{\text{det}}\): Detection loss
\(L_{\text{pyramid}}\): Pyramid fusion loss (from HEAL)
\(L_{\text{sim}}\): Smooth L1 feature similarity loss (\(\lambda=0.1\)), encouraging cross-modal feature consistency
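A minimal sketch of how the three terms combine is given below. The Smooth L1 threshold `beta=1.0`, the mean reduction, and the summation over ordered pairs are assumptions; the detection and pyramid losses are passed in as precomputed scalars, and the function names are hypothetical.

```python
import numpy as np

def smooth_l1(a, b, beta=1.0):
    """Smooth L1 distance, averaged over all elements (beta is an assumed default)."""
    diff = np.abs(a - b)
    per_elem = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return float(per_elem.mean())

def total_loss(l_det, l_pyramid, translated, lam=0.1):
    """L = L_det + L_pyramid + lam * sum of L_sim over pairs with different modalities.

    translated: list of (modality_name, feature) pairs, each feature already
    mapped into the ego agent's feature space (F_{k->i} in the text).
    """
    l_sim = 0.0
    for m_k, f_k in translated:
        for m_j, f_j in translated:
            if m_k != m_j:  # only cross-modal pairs contribute
                l_sim += smooth_l1(f_k, f_j)
    return l_det + l_pyramid + lam * l_sim

feats = [("m1", np.zeros((2, 2))), ("m2", np.full((2, 2), 2.0))]
loss = total_loss(l_det=1.0, l_pyramid=0.5, translated=feats)
```

In the toy call, each of the two ordered cross-modal pairs contributes a Smooth L1 value of 1.5, so the similarity term adds 0.1 × 3.0 to the other two losses.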

Local Data Training Protocol¶

Only the source modality's local data is used. The training path is source modality encoding → translator → target backend, supervised by the target backend's detection loss, so no cross-institution data transfer is required and data privacy requirements are respected.

Key Experimental Results¶

OPV2V Dataset (Simulation, Multi-Vehicle V2V)¶

Method	m1+m7+m2 AP30	m1+m7+m2 AP50	m1+m7+m2 AP70	Params (M)	Comm.
No Collaboration	81.18	79.44	68.26	0	0
Late Fusion	88.24	85.02	68.45	0	0.5KB
Pyramid Fusion	83.95	82.93	68.91	21.4	32MB
HEAL	87.80	86.98	79.89	16.0	32MB
CodeAlign	89.77	88.59	77.73	1.3	0.03MB

CodeAlign surpasses HEAL by 1.97/1.61 points in AP30/AP50 in the three-modality scenario, but trails it by 2.16 points in AP70 (77.73 vs 79.89)
Training parameters are only 8% of HEAL's (1.3M vs 16.0M)
Communication is reduced by 1024x (0.03MB vs 32MB)
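The headline numbers can be re-derived directly from the table rows. The short check below uses only the values listed above; the variable names are arbitrary.

```python
heal = {"AP30": 87.80, "AP50": 86.98, "AP70": 79.89, "params_m": 16.0, "comm_mb": 32.0}
codealign = {"AP30": 89.77, "AP50": 88.59, "AP70": 77.73, "params_m": 1.3, "comm_mb": 0.03}

# AP differences (CodeAlign minus HEAL), in points.
gaps = {k: round(codealign[k] - heal[k], 2) for k in ("AP30", "AP50", "AP70")}

# Fraction of HEAL's training parameters.
param_fraction = codealign["params_m"] / heal["params_m"]

# Communication volume implied by an exact 1024x reduction of 32MB.
comm_at_1024x_mb = heal["comm_mb"] / 1024
```

The parameter fraction comes out at about 8.1%, and an exact 1024x reduction of 32MB is 0.03125MB, which matches the 0.03MB shown in the table after rounding.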

DAIR-V2X Dataset (Real World)¶

Method	m1+m2 AP30	m1+m2 AP50	m1+m2 AP70
HEAL	73.70	67.21	44.76
CodeAlign	82.03	77.37	57.84

CodeAlign surpasses HEAL by 13.08 points in AP70 on real-world data, demonstrating stronger generalization

Ablation Study¶

Modality isolation impact: Pyramid Fusion's AP70 drops from 80.88% to 65.67% (a 15.21-point drop) under modality isolation
Group code space vs FCF translation: For non-isolated modalities, group code space construction outperforms FCF translation by 6.71 points in AP70
Translator structure: The multi-head translator loses only 0.10 points of AP50 compared to one-to-one translators, while the parameter count drops from \(O(n^2)\) to \(O(n)\)
Codebook + frozen encoder + adapter + similarity loss: Progressively introducing each component recovers AP70 from 77.87% to 79.63%
Pose error robustness: CodeAlign consistently outperforms HEAL under pose perturbations, while Late Fusion rapidly degrades below the no-collaboration baseline

Highlights & Insights¶

First co-occurrence-free alignment framework: Solves modality isolation fundamentally through representation consistency instead of spatial correspondence
Extreme efficiency: 8% training parameters + 1024x communication compression, friendly for large-scale deployment
Privacy-preserving: Local data training protocol avoids cross-institution data transfer
Strong scalability: One-to-many translator reduces new modality onboarding cost from \(O(n^2)\) to \(O(n)\)
Plug-and-play design: Freezes original encoders and backends, training only lightweight inserted modules

Limitations & Future Work¶

Information loss from codebook quantization causes AP70 to be slightly lower than HEAL in some scenarios (e.g., m1+m2: 85.56 vs 86.18)
Codebook size is fixed at 16; a codebook this small may not sufficiently represent complex scenes
Evaluation is limited by modality diversity in existing datasets; not validated on large-scale multi-modality (>7 types) scenarios
Online codebook updating and adaptation mechanisms in dynamic scenes are unexplored
BEV spatial range is set to ±102.4m; applicability to ultra-long-range scenarios is unverified

Comparison with related methods:

Method	Supports Modality Isolation	Training	Comm. Efficiency	Core Mechanism
HMViT	✗	Joint end-to-end	Low (32MB)	Cross-modal attention
CodeFilling	✗	Shared codebook E2E	High (0.03MB)	Single shared codebook
STAMP	✗	Contrastive learning	Low (32MB)	Protocol network reference
GT-Space	✗	GT feature alignment	Low	Ground-truth anchors
HEAL	△ (requires encoder retraining)	Backward alignment	Low (32MB)	Encoder retraining
CodeAlign	✓	Local data training	High (0.03MB)	FCF translation + codebook

Rating¶

Novelty: ⭐⭐⭐⭐⭐ — First to define and systematically solve the modality isolation problem; the FCF translation approach is novel
Experimental Thoroughness: ⭐⭐⭐⭐ — Simulation + real-world datasets with comprehensive multi-scenario ablations; modality variety is limited
Writing Quality: ⭐⭐⭐⭐ — Clear problem definition and systematic method exposition; some symbols are redundantly defined
Value: ⭐⭐⭐⭐⭐ — Addresses practical deployment pain points (privacy, efficiency, scalability) with significant engineering impact