Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

Conference: ICCV 2025 | arXiv: 2412.03704 | Code: None | Area: Multimodal VLM / Inference-Time Search | Keywords: inference-time scaling, value model, hallucination reduction, self-training, vision-language model

TL;DR

This paper proposes the Vision Value Model (VisVM), a value network trained via temporal-difference (TD) learning to predict the long-term value of sentences generated by a VLM. VisVM guides sentence-level beam search at inference time, producing image descriptions with fewer hallucinations and richer detail. The high-quality captions produced by VisVM-guided search are then used for self-training, yielding an average improvement of 10.8% over LLaVA-Next across 9 benchmarks.

Background & Motivation

Background: The LLM community has demonstrated that inference-time compute scaling is an effective strategy for improving output quality (e.g., OpenAI o1), and process reward models (PRMs) can guide search toward higher-quality responses. However, effective inference-time search methods for VLMs remain lacking.

Limitations of Prior Work: Inference-time search for VLMs faces unique challenges. Unlike math or coding tasks, image captioning lacks clear outcome metrics; captions consist of multiple sentences that must form coherent paragraphs, and each sentence must be both locally accurate and globally consistent. Using CLIP directly as a process reward signal can only assess the quality of the current sentence and cannot anticipate hallucinations that may arise in subsequent sentences.

Key Challenge: PRMs only consider the immediate reward at the current step. During image caption generation, however, the choice of the current sentence affects the quality and coherence of subsequent sentences. A sentence that appears acceptable at present may cause extensive hallucinations later.

Goal: Train a visual value model that predicts long-term value (rather than only immediate reward) and use it to guide VLM inference-time search.

Key Insight: VLM text generation is formulated as an MDP, where each step generates one sentence as an action. A value function is trained via TD learning (rather than using CLIP solely as a PRM) to predict the cumulative reward over all future sentences.

Core Idea: Replace the immediate reward model with a TD-learning-trained value model to guide sentence-level VLM search, enabling forward-looking quality assessment and reducing hallucinations.

Method

Overall Architecture

VLM text generation is modeled as an MDP: state = generated sentences so far + image; action = the sentence generated at the current step; reward = CLIP similarity score. A VisVM is trained as a value function to estimate the long-term cumulative reward from each state. At inference time, the VLM generates multiple candidate sentences per step using multiple temperatures; VisVM evaluates the long-term value of each candidate, and the sentence with the highest value is selected to continue generation.
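In MDP terms (a paraphrase of the setup above, with \(r_{s_i}\) the CLIP reward at step \(i\), \(T\) the total number of sentences, and \(\gamma\) the discount factor used in the loss below), the quantity VisVM estimates is the discounted return rather than the one-step reward:

\[
V_\rho(y_i, I) \;\approx\; \mathbb{E}\left[\sum_{k=0}^{T-i} \gamma^{k}\, r_{s_{i+k}}\right],
\]

so a candidate sentence is scored by the expected quality of the entire remaining description, not just its own image-text similarity.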

Key Designs

  1. VisVM Training — TD Learning:

    • Function: Train a value network that takes (current sentence, image) as input and outputs a scalar \(V_\rho(y_i, I)\), estimating the long-term cumulative reward from the current state onward.
    • Mechanism: Temporal-difference learning is applied with the loss \(L(\rho) = \mathbb{E}_{(y_i, y_{i+1}, I) \sim \mathcal{D}} \big[\big(r_{s_i} + \gamma V_\rho(y_{i+1}, I) - V_\rho(y_i, I)\big)^2\big]\), which enforces that the current value equals the immediate reward plus the discounted value of the next state. The discount factor is \(\gamma = 0.9\).
    • Design Motivation: A PRM relies solely on the CLIP score of the current sentence (immediate reward) and cannot anticipate future hallucinations. TD learning endows VisVM with look-ahead capability: even if a sentence has a slightly lower CLIP score, VisVM will assign it a higher value if it leads to higher-quality subsequent sentences. (A minimal training sketch follows this list.)
  2. Training Data Construction:

    • 9,215 images are sampled from the COCO 2017 training set, paired with 9 description prompts from LLaVA-150K.
    • For each (image, prompt) pair, the VLM generates 5 diverse descriptions at different temperatures.
    • Descriptions are decomposed into (current sentence, next sentence, image) triples, yielding 378K samples in total.
    • Design Motivation: Diverse responses allow VisVM to learn the differing future trajectories that result from different sentence choices.
  3. Self-Rewarding PRM Design:

    • The VLM's own visual encoder (CLIP-ViT for LLaVA-Next; SigLIP for LLaVA-OV) is used as the PRM to compute image-text similarity.
    • No external models or human annotations are required, making the pipeline entirely self-contained.
    • Experiments show that a stronger PRM (e.g., replacing CLIP-ViT with SigLIP) further improves VisVM performance.
  4. Inference-Time Search Strategy:

    • At each step, 6 candidate sentences are generated using 5 temperatures (0.1–0.9) plus greedy decoding.
    • VisVM evaluates the long-term value of each candidate, and the highest-scoring sentence is selected.
    • This process iterates sentence by sentence until the full response is generated.
    • This approach is approximately 7× more efficient than MCTS, since the VisVM value function generalizes to new prompt-image pairs, whereas MCTS must search from scratch for each input. (A sketch of the search loop also follows below.)
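The training side can be sketched in a few lines of PyTorch. Everything here is illustrative rather than the authors' code: the paper initializes VisVM from the VLM itself rather than using a lone linear head, `emb_cur`/`emb_next` stand in for whatever joint (sentence, image) representation the model produces, and the Hugging Face CLIP model is a stand-in for the VLM's own visual encoder; only the discount factor \(\gamma = 0.9\) and the TD objective come from the paper.

```python
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor  # stand-in for the VLM's own encoder

GAMMA = 0.9  # discount factor reported in the paper

class ValueHead(nn.Module):
    """Scalar value V_rho(y_i, I) on top of a joint (sentence, image) embedding."""
    def __init__(self, hidden_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.proj(emb).squeeze(-1)

def clip_reward(sentence, image, clip_model: CLIPModel, processor: CLIPProcessor) -> float:
    """Self-rewarding PRM: image-text similarity from a CLIP-style encoder."""
    inputs = processor(text=[sentence], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        return clip_model(**inputs).logits_per_image.item()

def td_loss(value_model: ValueHead,
            emb_cur: torch.Tensor,      # embedding of (y_i, I)
            emb_next: torch.Tensor,     # embedding of (y_{i+1}, I)
            reward: torch.Tensor,       # CLIP similarity r_{s_i}
            is_terminal: torch.Tensor   # 1.0 where y_i ends the description
            ) -> torch.Tensor:
    """Squared TD(0) error: V(y_i, I) should match r_i + gamma * V(y_{i+1}, I)."""
    v_cur = value_model(emb_cur)
    with torch.no_grad():  # bootstrap target; no gradient through the next state
        v_next = value_model(emb_next)
        target = reward + GAMMA * (1.0 - is_terminal) * v_next
    return torch.mean((target - v_cur) ** 2)
```

Training iterates this loss over the 378K (current sentence, next sentence, image) triples described above.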
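The inference-time loop, under the same caveats: `vlm.next_sentence(image, prompt, prefix, temperature)` is a hypothetical interface that samples one sentence continuation, and `embed` is assumed to match the training-time joint embedding.

```python
TEMPS = (0.1, 0.3, 0.5, 0.7, 0.9)  # 5 sampling temperatures; the 6th candidate is greedy

def visvm_search(vlm, value_model, embed, image, prompt, max_steps: int = 16) -> str:
    """Sentence-level search: at each step, keep the candidate with the highest
    predicted long-term value, not the highest immediate CLIP similarity."""
    response = ""
    for _ in range(max_steps):
        candidates = [vlm.next_sentence(image, prompt, response, t) for t in TEMPS]
        candidates.append(vlm.next_sentence(image, prompt, response, 0.0))  # greedy
        scored = [(value_model(embed(c, image)).item(), c) for c in candidates]
        _, best = max(scored)
        if not best:  # empty continuation: the VLM has finished the description
            break
        response = (response + " " + best).strip()
    return response
```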

Self-Training Pipeline

VisVM-guided search generates high-quality captions for 9,215 COCO images; these captions then serve as SFT data for full-parameter fine-tuning of the original VLM (3 epochs, lr = 1e-6). The entire pipeline requires no external models or human annotations.
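A hypothetical sketch of the data-generation pass (`coco_pairs`, `load_image`, and the `visvm_search` helper above are illustrative names, not the paper's code):

```python
import json

records = []
for image_path, prompt in coco_pairs:  # 9,215 COCO 2017 images with description prompts
    caption = visvm_search(vlm, value_model, embed, load_image(image_path), prompt)
    records.append({"image": image_path, "prompt": prompt, "response": caption})

with open("visvm_sft_data.json", "w") as f:
    json.dump(records, f)
# The original VLM is then fully fine-tuned on these records (3 epochs, lr = 1e-6).
```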

Key Experimental Results

Inference-Time Search Comparison

| Search Method | CHAIRs↓ | CHAIRi↓ | MMHal↑ | MMHal rate↓ | AMBER Cov↑ |
|---|---|---|---|---|---|
| Greedy (default) | 32.4 | 5.9 | 2.94 | 0.52 | 63.9 |
| BoN (30) | 27.1 | 5.2 | 3.06 | 0.45 | 65.3 |
| CLIP-PRM | 28.4 | 5.5 | 2.96 | 0.49 | 66.1 |
| MCTS | 25.9 | 4.7 | 3.24 | 0.37 | 67.3 |
| VisVM | 26.2 | 4.6 | 3.30 | 0.39 | 66.8 |

Multi-Benchmark Performance After Self-Training (LLaVA-Next-7B)

| Data Source | MM-Vet | MMBench | MMMU | MathVista | CVBench | LLaVA-Wild | MMStar | CHAIRs↓ | Avg Gain |
|---|---|---|---|---|---|---|---|---|---|
| Base Model | 45.2 | 74.9 | 34.2 | 38.5 | 65.8 | 76.9 | 36.0 | 32.4 | – |
| Greedy SFT | 43.5 | 74.6 | 34.9 | 37.8 | 66.2 | 75.1 | 36.7 | 33.2 | −1.6% |
| BoN SFT | 47.1 | 76.1 | 35.4 | 40.9 | 67.9 | 77.3 | 36.9 | 30.0 | +4.9% |
| CLIP-PRM SFT | 46.1 | 75.8 | 35.8 | 39.6 | 68.5 | 78.1 | 36.6 | 26.0 | +4.6% |
| VisVM SFT | 48.3 | 76.7 | 36.1 | 42.3 | 69.8 | 78.4 | 38.0 | 22.6 | +10.8% |

Ablation Study

| Configuration | CHAIRs↓ | CHAIRi↓ | MMHal↑ | AMBER Cov↑ |
|---|---|---|---|---|
| Greedy | 32.4 | 5.9 | 2.94 | 63.9 |
| CLIP-VisVM | 26.2 | 4.6 | 3.30 | 66.8 |
| SigLIP-VisVM (stronger PRM) | 25.6 | 4.4 | 3.31 | 67.5 |

Key Findings

  • VisVM achieves hallucination reduction comparable to MCTS at approximately 1/7 of the computational cost.
  • More inference-time computation yields monotonic gains: as the step size (number of candidates per step) increases from 2 to 16, CHAIRs continues to decline. VisVM is 2× more search-efficient than CLIP-PRM (VisVM at step size 8 ≈ CLIP-PRM at step size 16).
  • SFT on greedy-decoded captions degrades performance (−1.6%), underscoring the critical importance of self-training data quality.
  • VisVM self-training also benefits Qwen2-VL-7B with an average improvement of 7.3%, demonstrating cross-model generalizability.
  • When selecting from the same candidate set, sentences chosen by VisVM lead to fewer hallucinations under subsequent greedy decoding (30.9 vs. 31.6 CHAIRs), validating the effectiveness of forward-looking value prediction.

Highlights & Insights

  • Elegant Application of MDP + TD Learning: Formulating VLM text generation as an MDP and training the value function via TD learning is an elegant and effective approach. Long-term value prediction avoids the myopic sentence selection inherent to immediate reward signals. This framework is readily extensible to other VLM generation tasks.
  • Fully Self-Contained Self-Training Loop: The PRM is derived from the VLM's own visual encoder, VisVM is initialized from the VLM, and SFT data is generated by the VLM+VisVM pipeline. The entire process requires no external models or human annotations, constituting a genuine self-improvement loop.
  • Efficiency–Effectiveness Trade-off at Inference Time: VisVM is 7× more efficient than MCTS with comparable performance. The key advantage lies in the generalizability of VisVM's value function as a neural network, in contrast to MCTS which must search from scratch for every new input.

Limitations & Future Work

  • Validation is limited to descriptive captioning; extension to other VLM tasks such as VQA and reasoning has not been explored.
  • The search granularity is fixed at the sentence level; finer-grained (token-level) or coarser-grained (paragraph-level) search may be preferable in different settings.
  • VisVM training data comprises only 9K images; scaling up the training set may yield further improvements.
  • Only a single round of self-training is performed; the potential gains from iterative multi-round self-training remain to be explored.
  • Inference-time search introduces additional computational overhead (~6 candidates per step), which may be prohibitive in latency-sensitive applications.

Comparison with Baseline Search Methods

  • vs. CLIP-PRM: VisVM converts CLIP's immediate reward into a long-term value estimate via TD learning, consistently outperforming direct CLIP scoring under the same search budget.
  • vs. MCTS: MCTS also enables look-ahead search but incurs approximately 7× the computational cost of VisVM, as it does not reuse a learned value function.
  • vs. BoN: BoN generates 30 complete responses and selects the best, which is inefficient and provides no step-wise guidance. VisVM with step size 6 (6 candidates per step) already surpasses BoN with 30 candidates.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First to introduce RL-style value functions into VLM inference-time search; the MDP + TD learning framework is elegant and effective.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated across both inference-time search and self-training dimensions, 9 benchmarks, GPT + human evaluation, multiple VLM backbones, and comprehensive ablations.
  • Writing Quality: ⭐⭐⭐⭐ Clear and systematic presentation; qualitative case studies are intuitive.
  • Value: ⭐⭐⭐⭐⭐ Opens a new direction for inference-time compute scaling in VLMs; the self-training pipeline has strong practical utility.