# Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
- Conference: ICCV 2025
- arXiv: 2412.03704
- Code: None
- Area: Multimodal VLM / Inference-Time Search
- Keywords: inference-time scaling, value model, hallucination reduction, self-training, vision-language model
## TL;DR
This paper proposes the Vision Value Model (VisVM), a value network trained via TD learning to predict the long-term value of sentences generated by a VLM. VisVM guides sentence-level beam search at inference time to produce image descriptions with fewer hallucinations and richer detail. High-quality captions generated by VisVM are further used for self-training, achieving an average improvement of 10.8% over LLaVA-Next across 9 benchmarks.
## Background & Motivation
Background: The LLM community has demonstrated that inference-time compute scaling is an effective strategy for improving output quality (e.g., OpenAI o1), and that process reward models (PRMs) can guide search toward higher-quality responses. Effective inference-time search methods for VLMs, however, remain lacking.
Limitations of Prior Work: Inference-time search for VLMs faces unique challenges. Unlike math or coding tasks, image captioning lacks clear outcome metrics; captions consist of multiple sentences that must form coherent paragraphs, and each sentence must be both locally accurate and globally consistent. Using CLIP directly as a process reward signal can only assess the quality of the current sentence and cannot anticipate hallucinations that may arise in subsequent sentences.
Key Challenge: PRMs only consider the immediate reward at the current step. During image caption generation, however, the choice of the current sentence affects the quality and coherence of subsequent sentences. A sentence that appears acceptable at present may cause extensive hallucinations later.
Goal: Train a visual value model that predicts long-term value (rather than only immediate reward) and use it to guide VLM inference-time search.
Key Insight: VLM text generation is formulated as an MDP, where each step generates one sentence as an action. A value function is trained via TD learning (rather than using CLIP solely as a PRM) to predict the cumulative reward over all future sentences.
Core Idea: Replace the immediate reward model with a TD-learning-trained value model to guide sentence-level VLM search, enabling forward-looking quality assessment and reducing hallucinations.
## Method

### Overall Architecture
VLM text generation is modeled as an MDP: state = generated sentences so far + image; action = the sentence generated at the current step; reward = CLIP similarity score. A VisVM is trained as a value function to estimate the long-term cumulative reward from each state. At inference time, the VLM generates multiple candidate sentences per step using multiple temperatures; VisVM evaluates the long-term value of each candidate, and the sentence with the highest value is selected to continue generation.
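This search loop can be sketched in a few lines of pure Python. The stand-ins below (`make_fake_generator`, `fake_value`, and the toy length-based scoring rule) are illustrative assumptions, not the paper's implementation; in the real system the candidates come from VLM sampling and the scores from the trained VisVM.

```python
def visvm_guided_search(generate_candidates, value, image, max_steps=4):
    """Sentence-level search: at each step the VLM proposes candidate next
    sentences, and the one the value model scores highest (estimated
    long-term value, not just immediate reward) is appended."""
    response = []
    for _ in range(max_steps):
        context = " ".join(response)
        candidates = generate_candidates(context, image)
        if not candidates:  # generator signals end of the description
            break
        best = max(candidates, key=lambda s: value(context, s, image))
        response.append(best)
    return " ".join(response)

def make_fake_generator(script):
    """Stand-in for VLM sampling at several temperatures plus greedy."""
    steps = iter(script)
    return lambda context, image: next(steps, [])

# Toy scorer standing in for VisVM: prefers the more detailed sentence.
fake_value = lambda context, sentence, image: len(sentence)

gen = make_fake_generator([
    ["A dog.", "A brown dog sits on the grass."],
    ["It is day.", "Sunlight falls across the park."],
    [],  # no more candidates: stop
])
result = visvm_guided_search(gen, fake_value, image=None)
print(result)  # "A brown dog sits on the grass. Sunlight falls across the park."
```

The same skeleton works unchanged whether the scorer is an immediate-reward PRM or a learned value function; only `value` differs, which is why the paper can compare CLIP-PRM and VisVM under the same search budget.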
### Key Designs
- VisVM Training via TD Learning:
  - Function: Train a value network that takes (current sentence, image) as input and outputs a scalar \(V_\rho(y_i, I)\), estimating the long-term cumulative reward from the current state onward.
  - Mechanism: Temporal-difference learning is applied with the loss \(L(\rho) = \mathbb{E}_{(y_i, y_{i+1}, I) \sim \mathcal{D}} \big(r_{s_i} + \gamma V_\rho(y_{i+1}, I) - V_\rho(y_i, I)\big)^2\), which enforces that the current value equal the immediate reward plus the discounted value of the next state. The discount factor is \(\gamma = 0.9\).
  - Design Motivation: A PRM relies solely on the CLIP score of the current sentence (immediate reward) and cannot anticipate future hallucinations. TD learning gives VisVM a look-ahead capability: even if a sentence has a slightly lower CLIP score, VisVM assigns it a higher value when it leads to higher-quality subsequent sentences.
- Training Data Construction:
  - 9,215 images are sampled from the COCO 2017 training set, paired with 9 description prompts from LLaVA-150K.
  - For each (image, prompt) pair, the VLM generates 5 diverse descriptions at different temperatures.
  - Descriptions are decomposed into (current sentence, next sentence, image) triples, yielding 378K samples in total.
  - Design Motivation: Diverse responses allow VisVM to learn the differing future trajectories that result from different sentence choices.
- Self-Rewarding PRM Design:
  - The VLM's own visual encoder (CLIP-ViT for LLaVA-Next; SigLIP for LLaVA-OV) is used as the PRM to compute image-text similarity.
  - No external models or human annotations are required, making the pipeline entirely self-contained.
  - Experiments show that a stronger PRM (e.g., replacing CLIP-ViT with SigLIP) further improves VisVM performance.
- Inference-Time Search Strategy:
  - At each step, 6 candidate sentences are generated: 5 sampled at temperatures from 0.1 to 0.9, plus greedy decoding.
  - VisVM evaluates the long-term value of each candidate, and the highest-scoring sentence is selected.
  - This process iterates sentence by sentence until the full response is generated.
  - This approach is approximately 7× more efficient than MCTS: the VisVM value function generalizes to new prompt-image pairs, whereas MCTS must search from scratch each time.
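The TD objective underlying VisVM training can be illustrated with a tabular toy (a stand-in for the neural value network: sentence identifiers serve as states, and hand-picked numbers replace the CLIP similarity rewards):

```python
def td_update(values, state, reward, next_state=None, gamma=0.9, lr=1.0):
    """One TD(0) step: move V(state) toward reward + gamma * V(next_state).
    next_state=None marks the final sentence of a description."""
    v_next = 0.0 if next_state is None else values.get(next_state, 0.0)
    td_error = reward + gamma * v_next - values.get(state, 0.0)
    values[state] = values.get(state, 0.0) + lr * td_error
    return td_error

values = {}
# Two-sentence caption s1 -> s2 (terminal). Rewards stand in for each
# sentence's CLIP similarity with the image.
td_update(values, "s2", reward=1.0)                   # V(s2) = 1.0
td_update(values, "s1", reward=0.5, next_state="s2")  # V(s1) = 0.5 + 0.9 * 1.0
print(values)  # V(s1) = 1.4: s1's value includes the discounted future reward
```

The `td_error` term mirrors the bracketed quantity in the paper's loss, \(r_{s_i} + \gamma V_\rho(y_{i+1}, I) - V_\rho(y_i, I)\): a sentence's value reflects not only its own reward but also what it leads to, which is exactly the look-ahead property the search exploits.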
### Self-Training Pipeline
VisVM-guided search generates high-quality captions for the 9,215 COCO images; these captions then serve as SFT data for full-parameter fine-tuning of the original VLM (3 epochs, lr = 1e-6). The entire pipeline requires no external models or human annotations.
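A minimal sketch of packaging the searched captions as SFT records follows. The conversation schema is a guess at a LLaVA-style format, and `search_caption` is a hypothetical stub standing in for VisVM-guided decoding; neither is specified in the paper.

```python
def build_sft_records(search_caption, images, prompts):
    """Package value-guided captions into instruction-tuning records
    (conversation schema is an assumed LLaVA-style layout)."""
    records = []
    for image in images:
        for prompt in prompts:
            records.append({
                "image": image,
                "conversations": [
                    {"from": "human", "value": prompt},
                    {"from": "gpt", "value": search_caption(image, prompt)},
                ],
            })
    return records

# Stub search function in place of the full VisVM-guided search.
records = build_sft_records(
    lambda img, p: f"searched caption for {img}",
    images=["coco_0001.jpg", "coco_0002.jpg"],
    prompts=["Describe this image in detail."],
)
print(len(records))  # 2: one record per (image, prompt) pair
```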
## Key Experimental Results

### Hallucination Evaluation (LLaVA-Next-7B, Inference-Time Search)
| Search Method | CHAIRs↓ | CHAIRi↓ | MMHal↑ | MMHal rate↓ | AMBER Cov↑ |
|---|---|---|---|---|---|
| Greedy (default) | 32.4 | 5.9 | 2.94 | 0.52 | 63.9 |
| BoN (30) | 27.1 | 5.2 | 3.06 | 0.45 | 65.3 |
| CLIP-PRM | 28.4 | 5.5 | 2.96 | 0.49 | 66.1 |
| MCTS | 25.9 | 4.7 | 3.24 | 0.37 | 67.3 |
| VisVM | 26.2 | 4.6 | 3.30 | 0.39 | 66.8 |
### Multi-Benchmark Performance After Self-Training (LLaVA-Next-7B)
| Data Source | MM-Vet | MMBench | MMMU | MathVista | CVBench | LLaVA-Wild | MMStar | CHAIRs↓ | Avg Gain |
|---|---|---|---|---|---|---|---|---|---|
| Base Model | 45.2 | 74.9 | 34.2 | 38.5 | 65.8 | 76.9 | 36.0 | 32.4 | — |
| Greedy SFT | 43.5 | 74.6 | 34.9 | 37.8 | 66.2 | 75.1 | 36.7 | 33.2 | -1.6% |
| BoN SFT | 47.1 | 76.1 | 35.4 | 40.9 | 67.9 | 77.3 | 36.9 | 30.0 | +4.9% |
| CLIP-PRM SFT | 46.1 | 75.8 | 35.8 | 39.6 | 68.5 | 78.1 | 36.6 | 26.0 | +4.6% |
| VisVM SFT | 48.3 | 76.7 | 36.1 | 42.3 | 69.8 | 78.4 | 38.0 | 22.6 | +10.8% |
### Ablation Study
| Configuration | CHAIRs↓ | CHAIRi↓ | MMHal↑ | AMBER Cov↑ |
|---|---|---|---|---|
| Greedy | 32.4 | 5.9 | 2.94 | 63.9 |
| CLIP-VisVM | 26.2 | 4.6 | 3.30 | 66.8 |
| SigLIP-VisVM (stronger PRM) | 25.6 | 4.4 | 3.31 | 67.5 |
## Key Findings
- VisVM achieves hallucination reduction comparable to MCTS at approximately 1/7 of the computational cost.
- More inference-time computation yields monotonic gains: as the number of candidates per step increases from 2 to 16, CHAIRs continues to decline. VisVM is 2× more search-efficient than CLIP-PRM (VisVM with 8 candidates per step ≈ CLIP-PRM with 16).
- SFT on greedy-decoded captions degrades performance (−1.6%), underscoring the critical importance of self-training data quality.
- VisVM self-training also benefits Qwen2-VL-7B with an average improvement of 7.3%, demonstrating cross-model generalizability.
- When selecting from the same candidate set, sentences chosen by VisVM lead to fewer hallucinations under subsequent greedy decoding (30.9 vs. 31.6 CHAIRs), validating the effectiveness of forward-looking value prediction.
## Highlights & Insights
- Elegant Application of MDP + TD Learning: Formulating VLM text generation as an MDP and training the value function via TD learning is an elegant and effective approach. Long-term value prediction avoids the myopic sentence selection inherent to immediate reward signals. This framework is readily extensible to other VLM generation tasks.
- Fully Self-Contained Self-Training Loop: The PRM is derived from the VLM's own visual encoder, VisVM is initialized from the VLM, and SFT data is generated by the VLM+VisVM pipeline. The entire process requires no external models or human annotations, constituting a genuine self-improvement loop.
- Efficiency–Effectiveness Trade-off at Inference Time: VisVM is 7× more efficient than MCTS with comparable performance. The key advantage lies in the generalizability of VisVM's value function as a neural network, in contrast to MCTS which must search from scratch for every new input.
## Limitations & Future Work
- Validation is limited to descriptive captioning; extension to other VLM tasks such as VQA and reasoning has not been explored.
- The search granularity is fixed at the sentence level; finer-grained (token-level) or coarser-grained (paragraph-level) search may be preferable in different settings.
- VisVM training data comprises only 9K images; scaling up the training set may yield further improvements.
- Only a single round of self-training is performed; the potential gains from iterative multi-round self-training remain to be explored.
- Inference-time search introduces additional computational overhead (~6 candidates per step), which may be prohibitive in latency-sensitive applications.
## Related Work & Insights
- vs. CLIP-PRM: VisVM converts CLIP's immediate reward into a long-term value estimate via TD learning, consistently outperforming direct CLIP scoring under the same search budget.
- vs. MCTS: MCTS also enables look-ahead search but incurs approximately 7× the computational cost of VisVM, as it does not reuse a learned value function.
- vs. BoN: BoN generates 30 complete responses and selects the best, which is inefficient and does not provide step-wise guidance. VisVM with step size 6 (6 candidates per step) already surpasses BoN with 30 candidates.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ First to introduce RL-style value functions into VLM inference-time search; the MDP + TD learning framework is elegant and effective.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated across both inference-time search and self-training dimensions, 9 benchmarks, GPT + human evaluation, multiple VLM backbones, and comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐ Clear and systematic presentation; qualitative case studies are intuitive.
- Value: ⭐⭐⭐⭐⭐ Opens a new direction for inference-time compute scaling in VLMs; the self-training pipeline has strong practical utility.