V-LynX: Token Interface Alignment for VideoX LLMs
Conference: ICML 2026
arXiv: 2606.00508
Code: To be confirmed
Area: Multimodal VLM / Modality Adaptation
Keywords: Video LLM, Modality Adaptation, Lightweight Adaptation, Token Interface, Multimodal Alignment
TL;DR
V-LynX integrates new modalities (audio, 3D, high-frame-rate video) into pretrained Video LLMs by exploiting their internal continuous token interface (a manifold): a geometric prior, carved out by the visual encoder and projection layer, that is compatible with the LLM's internal operating space. Using only lightweight LoRA (68.7M parameters) and unpaired unimodal data, the 0.5B model reaches CIDEr 145.7 on AVSD versus 134.5 for PAVE-0.5B, with 46% fewer added parameters.
Background & Motivation
Background: Video LLMs exhibit excellent performance in RGB video understanding, but most only support RGB + text, lacking support for other sensory signals such as audio, 3D geometry, and high frame rate video. Existing expansion methods (e.g., PAVE) require designing heavy modality-specific encoders, complex fusion mechanisms, and paired multimodal supervision for each new modality, leading to parameter bloat and increased architectural complexity.
Limitations of Prior Work:
- Training large modality-specific encoders for each new modality leads to linear growth in parameter costs.
- Alignment relies on paired multimodal data (e.g., audio-video-text triplets), which is difficult and expensive to acquire.
- Retraining encoders easily triggers catastrophic forgetting, undermining existing video-language alignment.
Key Challenge: The visual pathway (encoder + projection layer) of a Video LLM does not just map images to a fixed vocabulary; it has learned a geometric prior compatible with the LLM's internal operating space. How can this prior be utilized to adapt to new modalities without rebuilding the entire pathway?
Goal: To answer a fundamental question—how to effectively reuse the internalized visual pathway of a Video LLM to adapt to new modalities while avoiding catastrophic forgetting and data bottlenecks.
Key Insight: The authors found that the visual encoder and projection layer of Video LLMs actually carve out a continuous geometric space (termed the token interface). This space acts as a bridge between perception and fixed vocabulary constraints, allowing the LLM to process continuous visual signals as independent non-symbolic entities. New modality inputs only need to be mapped into this existing token interface.
Core Idea: Seamlessly adapt new modality representations to the video-induced token interface using a lightweight parallel LoRA pathway and a distribution alignment strategy (attention response alignment + statistical distribution regularization), using only unpaired unimodal data.
Method
Overall Architecture
Three stages:
1. Interface Guidance Extraction: Run a batch of reference videos through the pretrained Video LLM and record how it processes video tokens (attention Key/Value means, plus the mean and variance of the token distribution after the projection layer). These statistics serve as target anchors for new-modality adaptation.
2. Encoder-side Adaptation: Attach a lightweight parallel LoRA (\(\Delta\psi\)) to the frozen visual encoder. Attention response alignment and distribution regularization on new-modality inputs make their internal activations compatible with the video modality's attention behavior, so the encoder produces distribution-compatible tokens.
3. LLM-side Instruction Tuning: Add LoRA (\(\Delta\phi\)) to the LLM and perform instruction tuning so that the LLM can reason over the new-modality tokens.
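The first stage can be illustrated with a minimal NumPy sketch. It assumes the encoder Keys/Values and the post-projection tokens of the reference videos have already been recorded as arrays; the array layouts, the function name `extract_interface_guidance`, and the choice to average Keys/Values over videos position by position are assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np

def extract_interface_guidance(ref_keys, ref_values, ref_tokens):
    """Average reference-video activations into target anchors.

    ref_keys, ref_values: (num_videos, num_layers, num_tokens, dim) encoder K/V
    ref_tokens:           (num_videos, num_tokens, llm_dim) post-projection tokens
    These layouts are assumptions made for this sketch.
    """
    # Mean Key/Value per layer, averaged over the reference videos.
    k_ref = ref_keys.mean(axis=0)
    v_ref = ref_values.mean(axis=0)
    # Per-dimension mean and variance over all projected video tokens.
    flat = ref_tokens.reshape(-1, ref_tokens.shape[-1])
    mu_v = flat.mean(axis=0)
    var_v = flat.var(axis=0)
    return k_ref, v_ref, mu_v, var_v

# Toy stand-ins for activations recorded from a real model.
rng = np.random.default_rng(0)
ref_keys = rng.normal(size=(8, 2, 6, 16))
ref_values = rng.normal(size=(8, 2, 6, 16))
ref_tokens = rng.normal(loc=0.5, scale=2.0, size=(8, 6, 32))

k_ref, v_ref, mu_v, var_v = extract_interface_guidance(ref_keys, ref_values, ref_tokens)
print(k_ref.shape, v_ref.shape, mu_v.shape, var_v.shape)
```

The anchors are computed once and then held fixed, which is what lets the later stages train on unimodal data only.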
Key Designs
- Lightweight Parallel LoRA Path:
    - Function: Introduces learnable parameters for new modalities while keeping the main visual pathway frozen, achieving parameter-efficient modality adaptation.
    - Mechanism: For a new-modality input \(\mathbf{X}_m\), the tokens are computed as \(\mathbf{Z}_m = p_\theta(g_{\psi + \Delta\psi}(\mathbf{X}_m))\), where \(g\) is the visual encoder and \(p_\theta\) the projection layer; \(\psi\) is frozen and only \(\Delta\psi\) is tuned via LoRA. This reuses video pre-training knowledge (\(\psi\)) while flexibly adapting to new-modality characteristics (\(\Delta\psi\)), with only 68.7M parameters (vs. PAVE's 127-475M).
    - Design Motivation: Prevent catastrophic forgetting (full retraining would break video alignment) while remaining parameter-efficient (only a small number of parameters per new modality).
- Attention Response Alignment:
    - Function: Ensures that new-modality tokens trigger attention behavior inside the encoder that is compatible with the video modality, by reusing the same Key-Value priors.
    - Mechanism: Given a Query \(Q_m^{(l)}\) from a new-modality input, a reference attention response is computed with the video-guided reference Key-Value \((K_v^{(l)}, V_v^{(l)})\) instead of the native Key-Value: \(\tilde{O}_m^{(l)} = \text{Attn}(Q_m^{(l)}, K_v^{(l)}, V_v^{(l)})\). The actual attention response of the new modality, \(O_m^{(l)} = \text{Attn}(Q_m^{(l)}, K_m^{(l)}, V_m^{(l)})\), is pulled toward \(\tilde{O}_m^{(l)}\) by the loss \(\mathcal{L}_{\text{attn}} = \sum_l \|O_m^{(l)} - \tilde{O}_m^{(l)}\|_1\).
    - Design Motivation: Keeping the video-derived Key-Values fixed achieves cross-modal alignment without paired sequences: the new modality only needs to learn how to "query" the video space. Aligning at the function level preserves the original video semantics better than directly aligning feature similarity.
- Distribution Regularization:
    - Function: Ensures that the statistical distribution (mean and variance) of new-modality tokens after the projection layer matches that of video tokens, so that the LLM processes them correctly.
    - Mechanism: Pre-compute the video token statistics \(\mu_v = \mathbb{E}[\mathbf{Z}_v]\) and \(\sigma_v^2 = \mathbb{E}[(\mathbf{Z}_v - \mu_v)^2]\), then constrain the new-modality statistics \((\mu_m, \sigma_m^2)\) to approximate them via the loss \(\mathcal{L}_{\text{stat}} = \|\mu_v - \mu_m\|_2 + \|\sigma_v^2 - \sigma_m^2\|_2\).
    - Design Motivation: The token distribution after the projection layer is what the LLM directly "sees", so a distribution mismatch causes abnormal LLM output. Excessively strong feature alignment, however, erases modality-specific traits; matching only the first two moments is a compromise between the two.
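The first two designs above can be sketched together for a single layer and a single attention head. This is a minimal NumPy illustration, not the authors' implementation: which encoder projections receive LoRA, the `scale` factor, the small LoRA rank, and all tensor shapes are assumptions, and `K_ref`/`V_ref` stand in for the video-guided reference Key-Values.

```python
import numpy as np

def lora_linear(x, W, A, B, scale=1.0):
    """Frozen weight W plus a parallel low-rank update: x W^T + scale * (x A^T) B^T."""
    return x @ W.T + scale * (x @ A.T) @ B.T

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def attn(Q, K, V):
    """Single-head scaled dot-product attention."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def attn_alignment_loss(Q_m, K_m, V_m, K_v, V_v):
    """L1 distance between the native response and the video-guided response."""
    O_m = attn(Q_m, K_m, V_m)      # actual response of the new modality
    O_ref = attn(Q_m, K_v, V_v)    # response against video reference K/V
    return np.abs(O_m - O_ref).sum()

rng = np.random.default_rng(0)
d, r, n_m, n_v = 16, 4, 5, 7              # toy sizes (assumed)
H_m = rng.normal(size=(n_m, d))           # hypothetical new-modality hidden states
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))  # frozen weights

# LoRA factors (trainable); B starts at zero so training begins at the frozen model.
A_k, B_k = rng.normal(size=(r, d)) * 0.01, np.zeros((d, r))
A_v, B_v = rng.normal(size=(r, d)) * 0.01, np.zeros((d, r))

K_ref = rng.normal(size=(n_v, d))         # stand-in for video reference Keys
V_ref = rng.normal(size=(n_v, d))         # stand-in for video reference Values

Q_m = H_m @ W_q.T
K_m = lora_linear(H_m, W_k, A_k, B_k)
V_m = lora_linear(H_m, W_v, A_v, B_v)
loss = attn_alignment_loss(Q_m, K_m, V_m, K_ref, V_ref)
print(f"L_attn (one layer): {loss:.4f}")
```

Because the reference Key-Values come from video only, the loss never needs a paired video for the new-modality sample; in the full objective it would be summed over encoder layers.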
Loss & Training
Encoder-side adaptation minimizes \(\mathcal{L}_{V\text{-LynX}} = \mathcal{L}_{\text{attn}} + \beta \cdot \mathcal{L}_{\text{stat}}\), where \(\beta\) weights the distribution regularization term. The LoRA rank is \(r = 64\). The LLM-side LoRA is then trained with standard supervised fine-tuning (instruction tuning).
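As a rough illustration of how the distribution term and the combined objective fit together, the sketch below computes \(\mathcal{L}_{\text{stat}}\) from per-dimension token statistics and adds it to a given attention loss. The value of \(\beta\) and the attention-loss value are hypothetical placeholders (the summary does not state them), and per-dimension statistics over all tokens are an assumption.

```python
import numpy as np

def stat_loss(z_m, mu_v, var_v):
    """L_stat = ||mu_v - mu_m||_2 + ||var_v - var_m||_2 over token dimensions."""
    mu_m = z_m.mean(axis=0)
    var_m = z_m.var(axis=0)
    return np.linalg.norm(mu_v - mu_m) + np.linalg.norm(var_v - var_m)

def vlynx_loss(l_attn, l_stat, beta):
    """Combined encoder-side objective: L_attn + beta * L_stat."""
    return l_attn + beta * l_stat

rng = np.random.default_rng(0)
z_v = rng.normal(size=(200, 8))           # toy video tokens (num_tokens, dim)
mu_v, var_v = z_v.mean(axis=0), z_v.var(axis=0)

# Toy new-modality tokens: same spread as video tokens, mean shifted by 1.
z_m = z_v + 1.0
l_stat = stat_loss(z_m, mu_v, var_v)      # only the mean term contributes here

# beta and l_attn are hypothetical values chosen for the example.
total = vlynx_loss(l_attn=0.5, l_stat=l_stat, beta=0.1)
print(f"L_stat = {l_stat:.4f}, total = {total:.4f}")
```

In this toy case every dimension's mean is off by exactly 1 while the variance is unchanged, so \(\mathcal{L}_{\text{stat}}\) reduces to the mean term, \(\sqrt{8}\).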
Key Experimental Results
Main Results
| Task | Dataset | Metric | PAVE-0.5B | V-LynX-0.5B | PAVE-7B | V-LynX-7B | Parameter Reduction |
|---|---|---|---|---|---|---|---|
| Audio-Visual QA | AVSD | CIDEr | 134.5 | 145.7 (+8.3%) | 152.9 | 163.0 (+6.6%) | -46% vs PAVE-0.5B |
| Audio-Visual QA | AVQA | Acc. | 90.4 | 93.1 | 93.8 | 94.2 | |
| 3D Reasoning | ScanQA | CIDEr | 84.2 | 87.1 | 103.4 | 107.4 | -80% vs PAVE-0.5B |
| 3D Reasoning | ScanQA | EM@1 | 23.1 | 26.4 (+14.3%) | 29.1 | 29.7 | |
| High Frame Rate Video | VideoMME | Avg. | 46.0 | 52.8 (+14.8%) | 59.9 | 62.7 | -81% vs PAVE-0.5B |
Ablation Study (ScanQA)
| Configuration | CIDEr | BLEU-4 | EM@1 | Note |
|---|---|---|---|---|
| V-LynX (Full) | 87.1 | 14.3 | 26.4 | Full model |
| w/o Attention Alignment | 81.0 | 11.8 | 23.5 | -6.1 CIDEr (key component) |
| w/o Distribution Regularization | 86.2 | 13.4 | 25.6 | -0.9 CIDEr (auxiliary stability) |
| w/o Interface Adaptation | 77.3 | 10.9 | 22.4 | -9.8 CIDEr (most critical) |
Key Findings
- Attention alignment is the primary driver: Removing attention alignment lowers ScanQA CIDEr from 87.1 to 81.0, a far larger drop than removing distribution regularization (86.2), which highlights the importance of alignment within the encoder. Dropping interface adaptation altogether costs the most (77.3).
- Robustness to LoRA rank: Even with rank=8 the model reaches 86.1 CIDEr, and rank=64 is best at 87.1, so low-rank adaptation is sufficient to capture the key modality adaptation information.
- Reference videos do not need to be in-distribution: With audio-related videos (57k) as the reference set, the model reaches 87.7 CIDEr. The interface is robust and does not require a strictly in-distribution reference set, which significantly reduces deployment difficulty.
- Parameter-efficient scalability: The 0.5B model, with 68.7M additional parameters, outperforms PAVE-0.5B on every benchmark in the main results; V-LynX-7B (195.0M) uses 59% fewer parameters than PAVE-7B (475.0M) while performing better.
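The relative gains and parameter reductions quoted in this section can be re-derived from the reported numbers with a few lines of arithmetic. The scores are copied from the main results table; treating 127M (the low end of PAVE's quoted 127-475M range) as the PAVE-0.5B configuration behind the 46% figure is an assumption.

```python
def rel_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return 100.0 * (new - old) / old

# Scores from the main results table: (PAVE, V-LynX).
scores = {
    "AVSD CIDEr, 0.5B": (134.5, 145.7),
    "AVSD CIDEr, 7B": (152.9, 163.0),
    "ScanQA EM@1, 0.5B": (23.1, 26.4),
    "VideoMME Avg., 0.5B": (46.0, 52.8),
}
gains = {name: round(rel_change(ours, pave), 1)
         for name, (pave, ours) in scores.items()}

# Added parameters in millions: (PAVE, V-LynX).
# Assumption: 127M is the PAVE-0.5B configuration behind the 46% figure.
params = {"0.5B": (127.0, 68.7), "7B": (475.0, 195.0)}
cuts = {name: round(rel_change(ours, pave))
        for name, (pave, ours) in params.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain}%")
for name, cut in cuts.items():
    print(f"{name} added parameters: {cut}%")
```

The computed values agree with the percentages shown in the table (+8.3%, +6.6%, +14.3%, +14.8%) and with the 46% and 59% parameter reductions.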
Highlights & Insights
- Discovery of the Token Interface: The core contribution is the formalization of this continuous manifold inside Video LLMs. Visualization using t-SNE of frame and vocabulary embeddings reveals this "soft token" space, establishing a theoretical foundation for multimodal expansion beyond just video (e.g., image-text, 3D-text).
- Clever Design for Unpaired Adaptation: Traditional multimodal alignment requires paired data (A-B-C triplets). V-LynX achieves alignment using unimodal data by treating video Key-Values as "reference anchors" at the attention level, drastically reducing data costs and enabling transfer to any cross-modal scenario lacking aligned data.
- Trade-off between Distribution and Features: Compared with direct feature similarity alignment, distribution regularization is a looser constraint: it matches token statistics while leaving the new modality free to keep its own specialized structure, which avoids the semantic loss caused by over-alignment.
Limitations & Future Work
- All experiments use a single backbone, LLaVA-OV; generalizability to other Video LLMs (e.g., VideoChat3, Qwen-Video) has not been verified.
- Although the reference videos do not need to be strictly in-distribution, they are still selected manually. Automatically selecting the most informative reference set could further reduce costs.
- Multimodal fusion is still additive (training LoRA for audio and 3D separately); the potential for interference when fusing more than three modalities simultaneously has not been explored.
- Future Work: Validate on multiple backbones to establish a universal token interface theory; use active learning for adaptive reference video selection; explore Pareto optimality for simultaneous multi-modality alignment.
Related Work & Insights
- vs. PAVE (Liu et al. 2025): PAVE uses independent modality encoders and cross-attention fusion, requiring massive paired data and extra parameters. V-LynX reuses the frozen main encoder and uses distribution alignment, requiring 59% fewer parameters (81% fewer at 0.5B scale) with better performance—essentially replacing complex encoder design with geometric constraints.
- vs. Video-LLaMA / VideoLLaMA2: These support multiple modalities by adding audio encoders or integrating off-the-shelf encoders (ImageBind), following the "encoder stacking" paradigm. V-LynX's innovation lies in "reuse instead of addition."
- vs. Parameter-Efficient Prompting (Li & Liang 2021): Inspired by soft tokens, but V-LynX moves beyond parameter efficiency at the prompt level to multimodal adaptation at the representation level—a natural evolution of the concept.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Discovery of the Token Interface and the simple design of distribution alignment are innovative; achieving cross-modal alignment with unimodal data is a breakthrough in the Multimodal LLM field.
- Experimental Thoroughness: ⭐⭐⭐⭐ Four task dimensions (Audio / 3D / High Frame Rate / Multi-view) with sufficient ablation and baselines, though limited to the LLaVA-OV backbone.
- Writing Quality: ⭐⭐⭐⭐ Clear logic and detailed method descriptions, with occasional redundant paragraphs.
- Value: ⭐⭐⭐⭐⭐ Efficient parameterization, practical non-aligned data solution, and a generalizable theoretical perspective make significant contributions to both the engineering and science of Multimodal LLMs.