Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

Conference: ICCV 2025 | arXiv: 2412.03704 | Code: None | Area: Multimodal VLM / Inference-Time Search | Keywords: inference-time scaling, value model, hallucination reduction, self-training, vision-language model

TL;DR

This paper proposes the Vision Value Model (VisVM), a value network trained via temporal-difference (TD) learning to predict the long-term value of sentences generated by a VLM. VisVM guides sentence-level beam search at inference time, producing image descriptions with fewer hallucinations and richer detail. The high-quality captions produced by VisVM-guided search are then used for self-training, yielding an average improvement of 10.8% over LLaVA-Next across 9 benchmarks.

Background & Motivation

Background: The LLM community has demonstrated that inference-time compute scaling is an effective strategy for improving output quality (e.g., OpenAI o1), and process reward models (PRMs) can guide search toward higher-quality responses. However, effective inference-time search methods for VLMs remain lacking.

Limitations of Prior Work: Inference-time search for VLMs faces unique challenges. Unlike math or coding tasks, image captioning lacks clear outcome metrics; captions consist of multiple sentences that must form coherent paragraphs, and each sentence must be both locally accurate and globally consistent. Using CLIP directly as a process reward signal can only assess the quality of the current sentence and cannot anticipate hallucinations that may arise in subsequent sentences.

Key Challenge: PRMs only consider the immediate reward at the current step. During image caption generation, however, the choice of the current sentence affects the quality and coherence of subsequent sentences. A sentence that appears acceptable at present may cause extensive hallucinations later.

Goal: Train a visual value model that predicts long-term value (rather than only immediate reward) and use it to guide VLM inference-time search.

Key Insight: VLM text generation is formulated as an MDP, where each step generates one sentence as an action. A value function is trained via TD learning (rather than using CLIP solely as a PRM) to predict the cumulative reward over all future sentences.

Core Idea: Replace the immediate reward model with a TD-learning-trained value model to guide sentence-level VLM search, enabling forward-looking quality assessment and reducing hallucinations.

Method

Overall Architecture

VLM text generation is modeled as an MDP: state = generated sentences so far + image; action = the sentence generated at the current step; reward = CLIP similarity score. A VisVM is trained as a value function to estimate the long-term cumulative reward from each state. At inference time, the VLM generates multiple candidate sentences per step using multiple temperatures; VisVM evaluates the long-term value of each candidate, and the sentence with the highest value is selected to continue generation.
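In MDP terms (a paraphrase of the setup above, with \(r_{s_i}\) the CLIP reward at step \(i\), \(T\) the total number of sentences, and \(\gamma\) the discount factor used in the loss below), the quantity VisVM estimates is the discounted return rather than the one-step reward:

\[
V_\rho(y_i, I) \;\approx\; \mathbb{E}\left[\sum_{k=0}^{T-i} \gamma^{k}\, r_{s_{i+k}}\right],
\]

so a candidate sentence is scored by the expected quality of the entire remaining description, not just its own image-text similarity.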

Key Designs

  1. VisVM Training — TD Learning:

    • Function: Train a value network that takes (current sentence, image) as input and outputs a scalar \(V_\rho(y_i, I)\), estimating the long-term cumulative reward from the current state onward.
    • Mechanism: Temporal-difference learning is applied with the loss \(L(\rho) = \mathbb{E}_{(y_i, y_{i+1}, I) \sim \mathcal{D}} \big[\big(r_{s_i} + \gamma V_\rho(y_{i+1}, I) - V_\rho(y_i, I)\big)^2\big]\), which enforces that the current value equals the immediate reward plus the discounted value of the next state. The discount factor is \(\gamma = 0.9\).
    • Design Motivation: A PRM relies solely on the CLIP score of the current sentence (immediate reward) and cannot anticipate future hallucinations. TD learning endows VisVM with look-ahead capability: even if a sentence has a slightly lower CLIP score, VisVM will assign it a higher value if it leads to higher-quality subsequent sentences. (A minimal training sketch follows this list.)
  2. Training Data Construction:

    • 9,215 images are sampled from the COCO 2017 training set, paired with 9 description prompts from LLaVA-150K.
    • For each (image, prompt) pair, the VLM generates 5 diverse descriptions at different temperatures.
    • Descriptions are decomposed into (current sentence, next sentence, image) triples, yielding 378K samples in total.
    • Design Motivation: Diverse responses allow VisVM to learn the differing future trajectories that result from different sentence choices.
  3. Self-Rewarding PRM Design:

    • The VLM's own visual encoder (CLIP-ViT for LLaVA-Next; SigLIP for LLaVA-OV) is used as the PRM to compute image-text similarity.
    • No external models or human annotations are required, making the pipeline entirely self-contained.
    • Experiments show that a stronger PRM (e.g., replacing CLIP-ViT with SigLIP) further improves VisVM performance.
  4. Inference-Time Search Strategy:

    • At each step, 6 candidate sentences are generated using 5 temperatures (0.1–0.9) plus greedy decoding.
    • VisVM evaluates the long-term value of each candidate, and the highest-scoring sentence is selected.
    • This process iterates sentence by sentence until the full response is generated.
    • This approach is approximately 7× more efficient than MCTS, since the VisVM value function generalizes to new prompt-image pairs, whereas MCTS must search from scratch for each input. (A sketch of the search loop also follows below.)
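The training side can be sketched in a few lines of PyTorch. Everything here is illustrative rather than the authors' code: the paper initializes VisVM from the VLM itself rather than using a lone linear head, `emb_cur`/`emb_next` stand in for whatever joint (sentence, image) representation the model produces, and the Hugging Face CLIP model is a stand-in for the VLM's own visual encoder; only the discount factor \(\gamma = 0.9\) and the TD objective come from the paper.

```python
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor  # stand-in for the VLM's own encoder

GAMMA = 0.9  # discount factor reported in the paper

class ValueHead(nn.Module):
    """Scalar value V_rho(y_i, I) on top of a joint (sentence, image) embedding."""
    def __init__(self, hidden_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.proj(emb).squeeze(-1)

def clip_reward(sentence, image, clip_model: CLIPModel, processor: CLIPProcessor) -> float:
    """Self-rewarding PRM: image-text similarity from a CLIP-style encoder."""
    inputs = processor(text=[sentence], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        return clip_model(**inputs).logits_per_image.item()

def td_loss(value_model: ValueHead,
            emb_cur: torch.Tensor,      # embedding of (y_i, I)
            emb_next: torch.Tensor,     # embedding of (y_{i+1}, I)
            reward: torch.Tensor,       # CLIP similarity r_{s_i}
            is_terminal: torch.Tensor   # 1.0 where y_i ends the description
            ) -> torch.Tensor:
    """Squared TD(0) error: V(y_i, I) should match r_i + gamma * V(y_{i+1}, I)."""
    v_cur = value_model(emb_cur)
    with torch.no_grad():  # bootstrap target; no gradient through the next state
        v_next = value_model(emb_next)
        target = reward + GAMMA * (1.0 - is_terminal) * v_next
    return torch.mean((target - v_cur) ** 2)
```

Training iterates this loss over the 378K (current sentence, next sentence, image) triples described above.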
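The inference-time loop, under the same caveats: `vlm.next_sentence(image, prompt, prefix, temperature)` is a hypothetical interface that samples one sentence continuation, and `embed` is assumed to match the training-time joint embedding.

```python
TEMPS = (0.1, 0.3, 0.5, 0.7, 0.9)  # 5 sampling temperatures; the 6th candidate is greedy

def visvm_search(vlm, value_model, embed, image, prompt, max_steps: int = 16) -> str:
    """Sentence-level search: at each step, keep the candidate with the highest
    predicted long-term value, not the highest immediate CLIP similarity."""
    response = ""
    for _ in range(max_steps):
        candidates = [vlm.next_sentence(image, prompt, response, t) for t in TEMPS]
        candidates.append(vlm.next_sentence(image, prompt, response, 0.0))  # greedy
        scored = [(value_model(embed(c, image)).item(), c) for c in candidates]
        _, best = max(scored)
        if not best:  # empty continuation: the VLM has finished the description
            break
        response = (response + " " + best).strip()
    return response
```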

Self-Training Pipeline

VisVM-guided search generates high-quality captions for 9,215 COCO images; these captions then serve as SFT data for full-parameter fine-tuning of the original VLM (3 epochs, lr = 1e-6). The entire pipeline requires no external models or human annotations.
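A hypothetical sketch of the data-generation pass (`coco_pairs`, `load_image`, and the `visvm_search` helper above are illustrative names, not the paper's code):

```python
import json

records = []
for image_path, prompt in coco_pairs:  # 9,215 COCO 2017 images with description prompts
    caption = visvm_search(vlm, value_model, embed, load_image(image_path), prompt)
    records.append({"image": image_path, "prompt": prompt, "response": caption})

with open("visvm_sft_data.json", "w") as f:
    json.dump(records, f)
# The original VLM is then fully fine-tuned on these records (3 epochs, lr = 1e-6).
```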

Key Experimental Results

Inference-Time Search Comparison

| Search Method | CHAIRs↓ | CHAIRi↓ | MMHal↑ | MMHal rate↓ | AMBER Cov↑ |
|---|---|---|---|---|---|
| Greedy (default) | 32.4 | 5.9 | 2.94 | 0.52 | 63.9 |
| BoN (30) | 27.1 | 5.2 | 3.06 | 0.45 | 65.3 |
| CLIP-PRM | 28.4 | 5.5 | 2.96 | 0.49 | 66.1 |
| MCTS | 25.9 | 4.7 | 3.24 | 0.37 | 67.3 |
| VisVM | 26.2 | 4.6 | 3.30 | 0.39 | 66.8 |

Multi-Benchmark Performance After Self-Training (LLaVA-Next-7B)

| Data Source | MM-Vet | MMBench | MMMU | MathVista | CVBench | LLaVA-Wild | MMStar | CHAIRs↓ | Avg Gain |
|---|---|---|---|---|---|---|---|---|---|
| Base Model | 45.2 | 74.9 | 34.2 | 38.5 | 65.8 | 76.9 | 36.0 | 32.4 | – |
| Greedy SFT | 43.5 | 74.6 | 34.9 | 37.8 | 66.2 | 75.1 | 36.7 | 33.2 | −1.6% |
| BoN SFT | 47.1 | 76.1 | 35.4 | 40.9 | 67.9 | 77.3 | 36.9 | 30.0 | +4.9% |
| CLIP-PRM SFT | 46.1 | 75.8 | 35.8 | 39.6 | 68.5 | 78.1 | 36.6 | 26.0 | +4.6% |
| VisVM SFT | 48.3 | 76.7 | 36.1 | 42.3 | 69.8 | 78.4 | 38.0 | 22.6 | +10.8% |

Ablation Study

| Configuration | CHAIRs↓ | CHAIRi↓ | MMHal↑ | AMBER Cov↑ |
|---|---|---|---|---|
| Greedy | 32.4 | 5.9 | 2.94 | 63.9 |
| CLIP-VisVM | 26.2 | 4.6 | 3.30 | 66.8 |
| SigLIP-VisVM (stronger PRM) | 25.6 | 4.4 | 3.31 | 67.5 |

Key Findings

  • VisVM achieves hallucination reduction comparable to MCTS at approximately 1/7 of the computational cost.
  • More inference-time computation yields monotonic gains: as the step size (number of candidates per step) increases from 2 to 16, CHAIRs continues to decline. VisVM is 2× more search-efficient than CLIP-PRM (VisVM at step size 8 ≈ CLIP-PRM at step size 16).
  • SFT on greedy-decoded captions degrades performance (−1.6%), underscoring the critical importance of self-training data quality.
  • VisVM self-training also benefits Qwen2-VL-7B with an average improvement of 7.3%, demonstrating cross-model generalizability.
  • When selecting from the same candidate set, sentences chosen by VisVM lead to fewer hallucinations under subsequent greedy decoding (30.9 vs. 31.6 CHAIRs), validating the effectiveness of forward-looking value prediction.

Highlights & Insights

  • Elegant Application of MDP + TD Learning: Formulating VLM text generation as an MDP and training the value function via TD learning is an elegant and effective approach. Long-term value prediction avoids the myopic sentence selection inherent to immediate reward signals. This framework is readily extensible to other VLM generation tasks.
  • Fully Self-Contained Self-Training Loop: The PRM is derived from the VLM's own visual encoder, VisVM is initialized from the VLM, and SFT data is generated by the VLM+VisVM pipeline. The entire process requires no external models or human annotations, constituting a genuine self-improvement loop.
  • Efficiency–Effectiveness Trade-off at Inference Time: VisVM is 7× more efficient than MCTS with comparable performance. The key advantage lies in the generalizability of VisVM's value function as a neural network, in contrast to MCTS which must search from scratch for every new input.

Limitations & Future Work

  • Validation is limited to descriptive captioning; extension to other VLM tasks such as VQA and reasoning has not been explored.
  • The search granularity is fixed at the sentence level; finer-grained (token-level) or coarser-grained (paragraph-level) search may be preferable in different settings.
  • VisVM training data comprises only 9K images; scaling up the training set may yield further improvements.
  • Only a single round of self-training is performed; the potential gains from iterative multi-round self-training remain to be explored.
  • Inference-time search introduces additional computational overhead (~6 candidates per step), which may be prohibitive in latency-sensitive applications.

Comparison with Baseline Search Methods

  • vs. CLIP-PRM: VisVM converts CLIP's immediate reward into a long-term value estimate via TD learning, consistently outperforming direct CLIP scoring under the same search budget.
  • vs. MCTS: MCTS also enables look-ahead search but incurs approximately 7× the computational cost of VisVM, as it does not reuse a learned value function.
  • vs. BoN: BoN generates 30 complete responses and selects the best, which is inefficient and provides no step-wise guidance. VisVM with step size 6 (6 candidates per step) already surpasses BoN with 30 candidates.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First to introduce RL-style value functions into VLM inference-time search; the MDP + TD learning framework is elegant and effective.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated across both inference-time search and self-training dimensions, 9 benchmarks, GPT + human evaluation, multiple VLM backbones, and comprehensive ablations.
  • Writing Quality: ⭐⭐⭐⭐ Clear and systematic presentation; qualitative case studies are intuitive.
  • Value: ⭐⭐⭐⭐⭐ Opens a new direction for inference-time compute scaling in VLMs; the self-training pipeline has strong practical utility.