# ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding
Conference: ICLR 2026 | arXiv: 2602.23306 | Code: https://1ranguan.github.io/thinkomni | Area: Multimodal VLM | Keywords: Omni-modal reasoning, guidance decoding, LRM, training-free, contrastive scaling
## TL;DR
ThinkOmni is a training-free framework that leverages a text-only large reasoning model (LRM) to guide an omni-modal LLM (OLLM) during decoding via Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals. The method achieves 70.2% on MathVista and 75.5% on MMAU, matching or surpassing RFT-based approaches.
## Background & Motivation
Background: Large reasoning models (LRMs) such as DeepSeek-R1 and o1 demonstrate remarkable performance on textual reasoning tasks but are limited to text-only inputs. Omni-modal LLMs (OLLMs) such as Qwen2.5-Omni can process text, audio, images, and video, yet still exhibit weaknesses in complex reasoning tasks.
Limitations of Prior Work: Existing approaches to enhancing OLLM reasoning face several challenges:

- Data scarcity: SFT requires large quantities of high-quality multimodal reasoning samples, which are costly to obtain.
- Training cost: RFT (reinforcement fine-tuning) demands substantial GPU resources (8×40G for 7B models; 16×80G for 32B models).
- Task specialization: Existing enhancement methods (e.g., Omni-R1, HumanOmniV2) are limited to specific downstream tasks and lack generalizability.
- Modality limitation: Most prior work focuses on a single modality (image or audio) and does not achieve true cross-modal reasoning.
Key Challenge: LRMs possess strong reasoning capabilities but cannot process non-textual inputs; OLLMs handle multimodal inputs but lack sufficient reasoning capacity. The two are complementary, yet how to combine them in a training-free manner at inference time remains the central challenge.
Goal: To transfer the textual reasoning capability of LRMs to omni-modal scenarios without additional training data or fine-tuning.
Key Insight: Inference-time guidance decoding is adopted, treating the LRM as a decoding-time "advisor" for the OLLM and fusing their signals at the logits level.
Core Idea: The textual reasoning signal produced by the LRM guides the OLLM's omni-modal decoding at the logits level, with Stepwise Contrastive Scaling adaptively regulating the perception–reasoning balance.
## Method
### Overall Architecture
The ThinkOmni framework consists of two core components: (1) LRM-as-a-Guide, which contrastively fuses the output logits of the OLLM and the LRM to form an enhanced decoding distribution; and (2) Stepwise Contrastive Scaling, which automatically computes the contribution magnitudes of perception and reasoning at each decoding step and dynamically adjusts fusion weights without manual tuning.
### Key Designs
- LRM-as-a-Guide:
    - Function: At each decoding step, three sets of logits are obtained—from the OLLM (full omni-modal input), the OLLM (text-only input), and the LRM (text-only input)—to construct a contrastive signal.
    - Mechanism: Base logits \(z^{\text{base}} = M_O(x_{<t}, O)\), negative logits \(z^- = M_O(x_{<t})\) (multimodal input removed), and positive logits \(z^+ = M_R(x_{<t})\). The fusion formula is \(\hat{P} = \text{Softmax}[z^{\text{base}} + \alpha \cdot (z^+ - z^-)]\). The contrastive term \((z^+ - z^-)\) encodes the incremental reasoning preference of the LRM relative to the OLLM in text-only mode.
    - Design Motivation: Analogous to a differential amplifier, \(z^+ - z^-\) amplifies the LRM's reasoning signal while suppressing linguistic noise shared by both models. Although the LRM cannot perceive multimodal content, the already-generated textual context implicitly encodes multimodal information as decoding progresses.
- Stepwise Contrastive Scaling:
    - Function: Dynamically computes a reasoning weight \(\alpha_r\) and a perception weight \(\alpha_p\) at each decoding step, replacing the fixed scalar \(\alpha\).
    - Mechanism: Jensen–Shannon divergence quantifies the discrepancy between the three distributions: \(D_R = \text{JS}(P_R \| P)\) reflects the reasoning contribution, and \(D_P = \text{JS}(P_O \| P)\) reflects the perception contribution. Subject to \(\alpha_r + \alpha_p = 1\), the weights are allocated in proportion to the two divergences, i.e., \(\alpha_r = D_R / (D_R + D_P)\) and \(\alpha_p = D_P / (D_R + D_P)\). A warmup mechanism limits reasoning intervention during the initial decoding phase.
    - Design Motivation: Different tasks and different decoding steps require varying degrees of reasoning versus perception: mathematical problems need a larger \(\alpha_r\), while audio perception tasks need a larger \(\alpha_p\). A fixed \(\alpha\) cannot adapt to all scenarios, and experiments show that the optimal \(\alpha\) varies considerably across tasks.
- Extended Formula (see the decoding-step sketch after this list):
    - Function: The complete fusion formula incorporates two contrastive terms.
    - Mechanism: \(\hat{P} = \text{Softmax}[M_O(x_{<t}, O) + \alpha_r \cdot (M_R(x_{<t}) - M_O(x_{<t})) + \alpha_p \cdot (M_O(x_{<t}, O) - M_O(x_{<t}))]\). The second contrastive term is an aggressive form of visual contrastive decoding that directly enhances perception by differencing the outputs with and without multimodal input.
    - Design Motivation: The dual contrastive terms independently and simultaneously enhance reasoning and perception.
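A minimal sketch of one ThinkOmni decoding step is given below, assuming PyTorch and treating `ollm` and `lrm` as callables that return next-token logits over a shared vocabulary. The section writes \(D_R = \text{JS}(P_R \| P)\) and \(D_P = \text{JS}(P_O \| P)\) without fully pinning down \(P\); the sketch takes \(P\) to be the text-only OLLM distribution, and the linear warmup schedule is likewise an assumption rather than the paper's exact choice.

```python
import torch
import torch.nn.functional as F


def js_divergence(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Jensen-Shannon divergence between two probability vectors over the vocabulary."""
    m = 0.5 * (p + q)
    kl_pm = torch.sum(p * (torch.log(p + eps) - torch.log(m + eps)))
    kl_qm = torch.sum(q * (torch.log(q + eps) - torch.log(m + eps)))
    return 0.5 * (kl_pm + kl_qm)


@torch.no_grad()
def thinkomni_step(ollm, lrm, text_ids, omni_inputs, step, warmup_steps=8, eps=1e-8):
    """One decoding step of the dual-contrastive fusion with Stepwise Contrastive Scaling."""
    z_base = ollm(text_ids, omni_inputs)   # OLLM with full omni-modal input
    z_neg = ollm(text_ids, None)           # OLLM with text-only input
    z_pos = lrm(text_ids)                  # LRM with text-only input

    p_base = F.softmax(z_base, dim=-1)
    p_neg = F.softmax(z_neg, dim=-1)
    p_pos = F.softmax(z_pos, dim=-1)

    # JS divergences from the text-only baseline estimate how much the reasoning
    # (LRM) and perception (omni-modal OLLM) signals each contribute at this step;
    # the weights are then split proportionally under alpha_r + alpha_p = 1.
    d_r = js_divergence(p_pos, p_neg)
    d_p = js_divergence(p_base, p_neg)
    alpha_r = d_r / (d_r + d_p + eps)
    alpha_p = 1.0 - alpha_r

    # Warmup (assumed schedule): damp the reasoning intervention early in decoding.
    if step < warmup_steps:
        alpha_r = alpha_r * (step / warmup_steps)
        alpha_p = 1.0 - alpha_r

    # Extended fusion: reasoning-contrastive term plus perception-contrastive term.
    fused = z_base + alpha_r * (z_pos - z_neg) + alpha_p * (z_base - z_neg)
    return F.softmax(fused, dim=-1)
```

In an actual generation loop, one would sample or take the argmax from the returned distribution, append the chosen token to the shared prefix for all three forward passes, and repeat; these three passes per step are what produce the generation overhead reported in the efficiency analysis below.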
### Loss & Training
The method is entirely training-free. It requires the OLLM and LRM to share a vocabulary (e.g., both from the Qwen family). Three forward passes are required per decoding step.
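Because the fusion happens index-by-index over the logits, the two models' tokenizers must agree on the token-to-id mapping. A quick sanity check along these lines might look as follows; the model IDs are illustrative, and the paper's actual pairing stays within the Qwen family.

```python
from transformers import AutoTokenizer

# Illustrative model IDs (assumed); any OLLM/LRM pair intended for logits-level
# fusion should be checked the same way.
ollm_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Omni-7B")
lrm_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

ollm_vocab, lrm_vocab = ollm_tok.get_vocab(), lrm_tok.get_vocab()
# Tokens that map to the same id in both vocabularies can be fused directly.
aligned = {tok for tok, idx in ollm_vocab.items() if lrm_vocab.get(tok) == idx}

print(f"OLLM vocab: {len(ollm_vocab)}, LRM vocab: {len(lrm_vocab)}, aligned: {len(aligned)}")
```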
## Key Experimental Results
### Main Results
| Model | MathVista | MathVision | MathVerse | MMAU | DailyOmni | OmniBench |
|---|---|---|---|---|---|---|
| GPT-4o | 63.8 | 30.4 | 50.8 | 62.5 | 56.5 | - |
| Gemini-2.0-Flash | 73.1 | 41.3 | 59.3 | 70.5 | 67.8 | - |
| Qwen2.5-Omni-7B | 66.8 | 25.0 | 40.2 | 71.5 | 57.9 | 42.1 |
| +DeepSeek Guide | 68.8(+2.0) | 28.2(+3.2) | 42.0(+1.8) | 73.8(+2.3) | 59.8(+1.9) | 43.2(+1.1) |
| +Qwen3 Guide | 70.2(+3.4) | 32.9(+7.9) | 45.1(+4.9) | 75.5(+4.0) | 59.5(+1.6) | 43.6(+1.5) |
| Omni-R1 (RFT) | 64.7 | 25.4 | 39.8 | 70.5 | 59.6 | 43.0 |
| +Qwen3 Guide | 71.3(+6.6) | 31.5(+6.1) | 45.2(+5.4) | 75.4(+4.9) | 59.8(+0.2) | 43.4(+0.4) |
### Ablation Study — Comparison with Other Training-Free Methods (based on Qwen2.5-Omni-7B)
| Method | MathVista | MMAU | OmniBench |
|---|---|---|---|
| Base Model | 66.8 | 71.5 | 42.1 |
| Average Logits Fusion | 55.0(−11.8) | 55.7(−15.8) | 36.1(−6.0) |
| Caption-then-Answer | 61.0(−5.8) | 59.7(−11.8) | 32.3(−9.8) |
| VCD | 66.5(−0.3) | 72.2(+0.7) | 43.1(+1.0) |
| ThinkOmni | 68.8(+2.0) | 73.8(+2.3) | 43.2(+1.1) |
### Key Findings
- Applying ThinkOmni on top of the RFT-trained Omni-R1 still yields substantial gains (MathVista +6.6), demonstrating that the method is complementary to RFT.
- Stronger LRMs (Qwen3 > DeepSeek-R1-Distill) produce larger improvements, validating that guidance quality determines the magnitude of gains.
- The largest improvements occur on mathematical and scientific tasks (MathVision +7.9), while audio and general tasks see smaller gains, consistent with the expectation that LRMs are predominantly trained on math and science data.
- Simple logits averaging severely degrades performance (−11.8), underscoring the necessity of contrastive fusion.
- Efficiency analysis: Under the 7B+7B configuration, generation latency is 2.88× that of the base model, while prefill latency is only 1.38×, since the two extra forward passes are text-only and therefore process a much shorter prefix than the full omni-modal input.
## Highlights & Insights
- Training-free framework surpasses trained methods: Using Qwen2.5-Omni-7B + Qwen3, ThinkOmni matches or exceeds RFT-based methods such as Omni-R1 and HumanOmniV2 on multiple benchmarks.
- Stepwise Contrastive Scaling is elegant and practical: JS divergence automatically estimates the demand for reasoning versus perception at each step, eliminating the burden of manual hyperparameter tuning.
- Plug-and-play and scalable: As stronger LRMs emerge (LRM development typically outpaces multimodal variants), ThinkOmni can benefit automatically.
- Rich qualitative analysis: Token-level visualizations of LRM contributions show that logical connectives and key technical terms are primarily guided by the LRM, while content words are contributed by the OLLM.
## Limitations & Future Work
- The requirement that the OLLM and LRM share a vocabulary restricts the flexibility of model pairing (e.g., a LLaMA-family LRM cannot guide a Qwen-family OLLM).
- Three forward passes per step incur approximately 2.88× inference overhead relative to the base model, which poses challenges for latency-sensitive deployments.
- Gains on audio and general omni-modal tasks remain limited (DailyOmni +1.6 only), indicating that the method is less effective for perception-intensive tasks.
- When multimodal inputs contain contradictory information (e.g., labels that conflict with visual content), the LRM may misdirect reasoning.
## Related Work & Insights
- The key distinction from ProxyTuning (which belongs to the same guidance decoding paradigm) is that ThinkOmni enables cross-modal guidance without requiring the LRM to perceive multimodal inputs.
- ThinkOmni is complementary to VCD (Visual Contrastive Decoding): VCD enhances perception, while ThinkOmni enhances reasoning.
- The work introduces a new paradigm for "reasoning capability transfer": rather than fine-tuning the model, capabilities are grafted via logits fusion at inference time.
## Rating
- Novelty: ⭐⭐⭐⭐ Cross-modal guidance decoding is a novel idea; Stepwise Contrastive Scaling is an elegant design.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Six benchmarks, three OLLMs, multiple LRMs, comprehensive ablations, and efficiency analysis.
- Writing Quality: ⭐⭐⭐⭐ Clear structure, thorough theoretical analysis, and rich qualitative visualizations.
- Value: ⭐⭐⭐⭐⭐ A training-free method that surpasses RFT approaches is highly practical and offers an important paradigm-level contribution to the community.