L2V-CoT: Cross-Modal Transfer of Chain-of-Thought Reasoning via Latent Intervention¶
Conference: AAAI 2026 arXiv: 2511.17910 Code: None Area: Multimodal VLM / Reasoning Enhancement Keywords: CoT reasoning transfer, activation engineering, frequency-domain analysis, LLM to VLM, training-free
TL;DR¶
Through Linear Artificial Tomography (LAT) analysis, this paper shows that the low-frequency CoT directional representations of LLMs and VLMs share similar distributions. It proposes L2V-CoT: extract CoT directional representations from an LLM → apply low-pass filtering → frequency-domain resampling for dimension alignment → inject into VLM hidden layers. This training-free approach transfers LLM reasoning capabilities to VLMs, achieving an average improvement of 3.7% and a maximum gain of 8.6%.
Background & Motivation¶
Limitations of Prior Work¶
Background: CoT reasoning has significantly enhanced LLM capabilities, yet VLMs continue to struggle with multi-step reasoning tasks, primarily due to the scarcity of multimodal reasoning data. Existing transfer methods fall into three categories: (1) approaches such as Virgo fine-tune VLMs on textual CoT data, but incur high training costs; (2) model merging integrates LLM parameters into VLMs, but requires architectural compatibility (the LLM backbone of the VLM must match the source LLM); (3) activation engineering methods (e.g., RoT) activate reasoning neurons within the VLM itself, but are bounded by the VLM's inherent reasoning capacity.

Key Challenge: How can the reasoning capabilities of a stronger LLM be transferred to a VLM across different architectures?
Starting Point¶
Goal: Although LLMs and VLMs differ in architecture and even in hidden dimension size, do they share transferable reasoning representations? If so, how can cross-modal, cross-architecture reasoning transfer be achieved without any training?
Method¶
Overall Architecture¶
L2V-CoT consists of two steps: (1) extracting low-pass CoT pattern representations from an LLM (DeepSeek-R1-Distill-Qwen-32B); and (2) injecting these representations into intermediate VLM layers at inference time to implicitly enhance reasoning capability.
Key Designs¶
- Key Findings from LAT Analysis: Contrastive inputs ("Let's think step by step" vs. "Answer directly") are passed through LLMs/VLMs to obtain CoT/Non-CoT hidden states, and the directional representation is computed as \(u = h(c) - h(d)\). Key observations: (a) CoT representations of VLMs and LLMs cluster in distinct regions of the latent space; (b) the dispersion of CoT directional representations in VLMs is substantially higher than in LLMs (1117.8 vs. 176.7), because representation drift induced by multimodal training concentrates in high-frequency components; (c) after applying low-pass filtering to VLM representations, dispersion drops to 197.7, approaching the LLM value of 176.7; (d) low-frequency components preserve CoT information — injecting low-frequency components activates reasoning, whereas injecting high-frequency components has no effect.
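The contrastive extraction above can be sketched in a few lines of numpy. This is a toy illustration, not the paper's code: hidden states are stubbed with random vectors (in practice they would be layer activations for the two prompt variants), and `dispersion` (mean squared distance to the centroid) is an assumed reading of the paper's dispersion metric.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, d = 100, 4096  # 100 contrastive prompt pairs, hidden size d (illustrative)

h_cot = rng.normal(size=(n_pairs, d))     # h(c): hidden states for "Let's think step by step"
h_direct = rng.normal(size=(n_pairs, d))  # h(d): hidden states for "Answer directly"

u = h_cot - h_direct   # per-sample CoT directional representation u = h(c) - h(d)
v = u.mean(axis=0)     # mean CoT pattern representation over the 100 pairs

def dispersion(x: np.ndarray) -> float:
    """Mean squared distance of the directional representations to their centroid."""
    return float(((x - x.mean(axis=0)) ** 2).sum(axis=1).mean())

print(v.shape, dispersion(u))
```

With real activations, this `dispersion` is the quantity the paper contrasts between VLMs and LLMs (1117.8 vs. 176.7 before filtering).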
- Frequency-Domain Low-Pass Filtering + Resampling: The CoT pattern representation \(v(l_L)\) from the LLM (the mean directional representation over 100 samples) undergoes FFT → low-pass filtering (retaining the \(k\) lowest-frequency components) → LMN frequency-domain resampling (aligning the LLM's hidden dimension to the VLM's) → IFFT → normalization. Crucially, resampling is performed in the frequency domain rather than via direct interpolation, which preserves more CoT information (ablations show that interpolation severely degrades performance).
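A minimal numpy sketch of the FFT → low-pass → frequency-domain resample → IFFT → normalize pipeline. The paper's LMN resampling is not publicly specified (no code release), so this implements generic spectrum-resizing resampling as an assumed approximation; the hidden sizes and cutoff `k` are illustrative.

```python
import numpy as np

def lowpass_resample(v: np.ndarray, d_target: int, k: int) -> np.ndarray:
    """FFT the vector, keep only the k lowest frequency bins, resize the
    spectrum to the target dimension, invert, and L2-normalize."""
    spec = np.fft.rfft(v)
    spec[k:] = 0.0                          # low-pass: zero out high-frequency bins
    n_bins = d_target // 2 + 1              # rfft spectrum length for the target dim
    resized = np.zeros(n_bins, dtype=complex)
    m = min(len(spec), n_bins)
    resized[:m] = spec[:m]                  # frequency-domain resampling (truncate/pad)
    out = np.fft.irfft(resized, n=d_target) # back to the "spatial" (dimension) domain
    return out / np.linalg.norm(out)        # normalization step

v_llm = np.random.default_rng(1).normal(size=5120)    # stand-in for the LLM's CoT vector
v_hat = lowpass_resample(v_llm, d_target=4096, k=64)  # align to an assumed VLM hidden size
print(v_hat.shape)
```

Because the low-pass keeps only slowly varying structure, resizing the spectrum changes the vector's length without distorting that structure, which is plausibly why this outperforms naive interpolation in the ablations.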
- Latent Space Injection: The low-pass CoT pattern representation is injected into intermediate VLM layers as \(\hat{h}_V = h_V + \alpha \cdot \hat{v}_{LPF}\), after which the updated activations are rescaled to their original norm to preserve the scale of the representation space. The injection coefficient \(\alpha\) must be moderate — too small yields no effect, while too large disrupts the original semantics.
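The injection step reduces to a vector addition plus a norm rescaling, applied per token at the chosen layers (in practice via a forward hook on the VLM). A minimal numpy sketch, where the per-token norm rescaling is my reading of the paper's normalization and \(\alpha\) is illustrative:

```python
import numpy as np

def inject(h_v: np.ndarray, v_lpf: np.ndarray, alpha: float) -> np.ndarray:
    """h_hat = h_V + alpha * v_LPF, then rescale each token's activation
    back to its original norm so the representation-space scale is preserved."""
    h_hat = h_v + alpha * v_lpf
    orig_norm = np.linalg.norm(h_v, axis=-1, keepdims=True)
    new_norm = np.linalg.norm(h_hat, axis=-1, keepdims=True)
    return h_hat * (orig_norm / new_norm)

rng = np.random.default_rng(2)
h = rng.normal(size=(7, 4096))            # toy hidden states: 7 tokens x 4096 dims
v = rng.normal(size=4096)
v /= np.linalg.norm(v)                    # unit-norm low-pass CoT direction
h_injected = inject(h, v, alpha=6.0)      # alpha is a hypothetical value
print(h_injected.shape)
```

Note how the rescaling makes \(\alpha\) control only the *direction* of the shift, not the activation magnitude, which matches the observation that an overly large \(\alpha\) harms semantics rather than simply saturating them.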
Loss & Training¶
The method is entirely training-free. Only 100 CoT/Non-CoT sample pairs are needed to extract representations from the LLM (a one-time offline process); at inference time, a vector addition is performed at designated layers for each token.
Key Experimental Results¶
| VLM | Method | MathVista-All | MathVerse | MMStar-All |
|---|---|---|---|---|
| LLaVA-8B | Non-CoT | 35.2 | 20.9 | 22.9 |
| LLaVA-8B | Finetuned CoT | 39.9 | 24.1 | 25.8 |
| LLaVA-8B | L2V-CoT | 41.8 | 25.5 | 26.9 |
| QwenVL-7B | Non-CoT | 60.5 | 26.9 | 33.8 |
| QwenVL-7B | Finetuned CoT | 63.7 | 32.8 | 35.3 |
| QwenVL-7B | L2V-CoT | 64.2 | 35.5 | 35.9 |
| InternVL-8B | Non-CoT | 59.3 | 29.9 | 30.5 |
| InternVL-8B | L2V-CoT | 61.6 | 33.3 | 33.7 |
Ablation Study¶
- Replacing LMN frequency-domain resampling with interpolation causes a sharp performance drop (LLaVA MathVista: 41.8 → 31.1), confirming the information-preservation advantage of the frequency-domain approach.
- Replacing the LLM representation with the VLM's own low-pass directional representation yields limited gains (36.3 vs. 41.8), confirming that the LLM possesses stronger reasoning capabilities.
- Stronger LLMs yield larger improvements: DeepSeek-R1 7B → 14B → 32B raises LLaVA MathVista from 38.6 → 38.9 → 41.8.
- Layer-wise injection analysis shows that middle layers are optimal; shallow layers interfere with perception, while deep layers leave insufficient remaining layers to process the injected information.
- The method is complementary to explicit reasoning methods (e.g., Mulberry/MCTS), enabling further gains when combined.
Highlights & Insights¶
- Profound Frequency-Domain Perspective: The paper discovers that CoT representations of VLMs and LLMs are highly consistent in their low-frequency components, with high-frequency differences arising from noise introduced by multimodal training. This finding is itself a significant scientific contribution — suggesting that neural networks with different architectures may share similar internal representational structures at the level of abstract reasoning.
- True Cross-Architecture Training-Free Transfer: The method does not require VLMs and LLMs to share a backbone, elegantly resolving dimension mismatches via frequency-domain resampling. This means the strongest available LLMs (e.g., DeepSeek-R1) can be used to enhance the reasoning capabilities of arbitrary VLMs.
- Surpassing Supervised Methods: L2V-CoT (training-free) outperforms Finetuned CoT (supervised) on multiple benchmarks, suggesting that directly manipulating hidden representations may be more efficient than fine-tuning — because fine-tuning modifies both useful and irrelevant parameters, whereas activation injection targets only the critical reasoning representations.
- Plug-and-Play and Complementary to Explicit Methods: The method can be combined with explicit reasoning search approaches such as MCTS, achieving dual gains from implicit reasoning enhancement and explicit reasoning search.
- Methodological Contribution of LAT Analysis: The framework of using Linear Artificial Tomography to analyze cross-modal reasoning representations is generalizable to other capability transfer research.
Limitations & Future Work¶
- Injection layer and injection strength require per-task tuning (Tables S.2/S.3 show different hyperparameters for different tasks), and an adaptive selection mechanism is lacking.
- Validation is limited to mathematical reasoning benchmarks; the method has not been tested on general VQA or broader reasoning tasks (e.g., spatial or causal reasoning).
- CoT samples are drawn from the STILL-2 dataset (mathematics, physics, chemistry, biology), and domain bias may affect generalization to other domains (e.g., legal or financial reasoning).
- Frequency-domain resampling assumes that CoT information is concentrated in low-frequency components — this holds for mathematical reasoning, but it is unclear whether it remains valid for tasks requiring fine-grained symbolic manipulation.
- The choice of low-pass filter cutoff frequency lacks theoretical grounding and currently relies on empirical tuning.
- The interpretability of hidden representation injection is limited — although the method is effective, it is difficult to explain precisely how the VLM's internal reasoning process changes after injection.
- It is unclear whether the method imposes requirements on VLM model scale; performance may differ for very small (<1B) or very large (>70B) VLMs.
Related Work & Insights¶
- vs. Virgo (Finetuned CoT): Virgo requires SFT on the VLM, incurring high training costs; L2V-CoT requires no training and achieves superior performance.
- vs. Model Merging: Model merging requires architectural compatibility, whereas L2V-CoT supports transfer across different architectures such as LLaMA and Qwen.
- vs. RoT: RoT extracts directional representations from CoT prompts and injects them into VLMs, but is bounded by the VLM's own reasoning capacity; L2V-CoT extracts representations from a stronger external LLM.
- vs. MathNeuro: MathNeuro activates reasoning neurons via pruning/scaling but neglects the coordination among neurons.
- "Low-frequency components encode core capabilities, while high-frequency components carry noise and modality-specific information" — this observation may apply to other cross-modal transfer scenarios.
- Frequency-domain resampling as a dimension alignment technique outperforms simple interpolation or linear projection, and merits broader exploration in other activation engineering tasks.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First work to use frequency-domain analysis to reveal cross-architecture consistency in LLM/VLM reasoning representations; the method is novel and elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers 4 VLMs × 5 benchmarks with complete ablations and multi-scale LLM validation.
- Writing Quality: ⭐⭐⭐⭐ The logical flow from empirical analysis to method design is clear, and frequency-domain visualizations are highly convincing.
- Value: ⭐⭐⭐⭐ Provides a general cross-architecture reasoning transfer framework with practical significance for VLM reasoning enhancement; the frequency-domain analysis framework is broadly applicable.