Scaling Test-Time Robustness of Vision-Language Models via Self-Critical Inference Framework

Conference: CVPR 2026 · arXiv: 2603.07659 · Code: https://github.com/KaihuaTang/Self-Critical-Inference-Framework · Area: Multimodal VLM · Keywords: LVLM robustness, counterfactual reasoning, language bias, language sensitivity, test-time scaling

TL;DR

This paper proposes the Self-Critical Inference (SCI) framework, which simultaneously addresses language bias and language sensitivity in LVLMs via multi-round textual and visual counterfactual logit aggregation. A dynamic robustness benchmark, DRBench, is introduced to evaluate robustness in a model-specific manner. Increasing the number of counterfactual inference rounds yields consistent robustness gains, opening a new direction for test-time scaling.

Background & Motivation

Background: LVLMs achieve strong vision-language capabilities by combining visual encoders with pretrained LLMs through joint fine-tuning.

Limitations of Prior Work:

  • Language Bias: Models rely on language priors rather than visual inputs to answer questions, leading to object hallucinations (e.g., generating non-existent content).
  • Language Sensitivity: Semantically equivalent but lexically distinct prompt variations elicit different responses, undermining consistency and reliability.
  • Methods such as VCD address only visual counterfactuals (bias), entirely neglecting textual counterfactuals (sensitivity).

Key Challenge: VCD is fundamentally a reweighting of the original logits by TIE logits along a single dimension (visual); however, the robustness challenges of LVLMs are two-dimensional.

Goal: Simultaneously mitigate language bias and language sensitivity, and demonstrate that increasing inference rounds improves robustness.

Key Insight: Unifying VCD within the causal analysis framework of CF-VQA to reveal the physical interpretation of \(\alpha\) (a temperature parameter for TIE), and naturally extending this formulation to the textual counterfactual dimension.

Core Idea: Since VCD reduces to TIE reweighting, both Textual Counterfactual (TC) and Visual Counterfactual (VC) can be applied simultaneously, enabling test-time robustness scaling via multi-round logit aggregation.

Method

Overall Architecture

Given the original input \((v^0, q^0)\), \(N\) textual variants \(\{q^i\}\) and \(M\) visual variants \(\{v^j\}\) are generated. TC and VC logits are computed separately and combined via weighted multiplication to yield the final prediction: \(p_{SCI}(y) \propto \exp(TC/\tau_1) \cdot \exp(VC/\tau_2)\).
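The aggregation described above can be sketched for a single token position as follows. This is a minimal illustration, not the authors' released code; the function name `sci_predict` and the inclusion of the original prompt in the TC max are assumptions based on the formulas in this note.

```python
import numpy as np

def sci_predict(z_orig, z_text_variants, z_visual_variants,
                tau1=1.0, tau2=1.0):
    """Sketch of SCI logit aggregation for one token position.

    z_orig:            (V,) logits for the original pair (v^0, q^0)
    z_text_variants:   (N, V) logits for textual variants (v^0, q^i)
    z_visual_variants: (M, V) logits for visual counterfactuals (v^j, q^0)
    """
    # Textual Counterfactual: element-wise max over the original prompt
    # and its paraphrases keeps predictions consistent across wordings.
    tc = np.max(np.vstack([z_orig[None, :], z_text_variants]), axis=0)
    # Visual Counterfactual: subtract the average logits over the
    # content-removed images to cancel language-prior bias.
    vc = z_orig - z_visual_variants.mean(axis=0)
    # Combine in the exp domain with separate temperatures
    # (numerically stabilized softmax).
    scores = tc / tau1 + vc / tau2
    p = np.exp(scores - scores.max())
    return p / p.sum()
```

With N textual and M visual variants this costs N + M + 1 forward passes, matching the SCI3/SCI5/SCI7 configurations described below.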

Key Designs

  1. Unified Understanding of VCD and CF-VQA:

    • VCD logit: \(Z_{vcd} = (1+\alpha)Z(v,q) - \alpha Z(v^*,q)\)
    • Expanding in the exp domain: \(p(y) \propto \exp(Z(v,q)) \cdot \exp(TIE/\tau)\)
    • This reveals that VCD is essentially a vocabulary-level reweighting term using TIE logits, where \(\tau = 1/\alpha\) serves as the temperature parameter.
    • This analysis bridges VCD and CF-VQA, providing a theoretical foundation for extension to the textual dimension.
  2. Textual Counterfactual (TC):

    • Semantically equivalent but lexically diverse prompt variants \(\{q^i\}\) are generated.
    • For each token position \(k\), the element-wise maximum over all variant logits is taken: \(TC_k = \max_i(Z_k(v^0, q^i))\)
    • Effect: eliminates logit biases induced by specific phrasings while retaining predictions that are consistent across wordings.
    • Design Motivation: If a model produces different answers to semantically identical but lexically different prompts, taking the maximum selects the most stable prediction.
  3. Visual Counterfactual (VC):

    • VCD is extended to multiple counterfactual images: \(VC = Z(v^0, q^0) - \mathbb{E}[Z(v^j, q^0)]\)
    • The average logit over multiple content-removed images replaces a single noise image.
    • This yields a more stable estimate of language bias.
  4. SCI3 / SCI5 / SCI7 Configurations:

    • SCI3: \(M=N=1\) (3 inference passes); SCI5: \(M=N=2\) (5 passes); SCI7: \(M=N=3\) (7 passes).
    • Increasing inference rounds consistently improves robustness at linearly growing computational cost.
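The equivalence stated in Key Design 1 follows from a one-line rearrangement of the VCD logit:

\[
Z_{vcd} = (1+\alpha)Z(v,q) - \alpha Z(v^*,q) = Z(v,q) + \alpha\bigl(Z(v,q) - Z(v^*,q)\bigr) = Z(v,q) + \alpha\,\mathrm{TIE},
\]

so \(p(y) \propto \exp(Z(v,q)) \cdot \exp(\mathrm{TIE}/\tau)\) with \(\tau = 1/\alpha\): the counterfactual pass does nothing more than reweight the original distribution by temperature-scaled TIE logits, which is exactly the form CF-VQA derives causally.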

Loss & Training

SCI is a purely inference-time method requiring no training. The temperature parameters \(\tau_1\) and \(\tau_2\) for TC and VC are tuned on a validation set.
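The paper does not detail the tuning procedure; a minimal sketch of one plausible approach is an exhaustive grid search, assuming a hypothetical validation scorer `score_fn(tau1, tau2)` that returns a metric where higher is better.

```python
import itertools

def select_temperatures(score_fn, grid=(0.5, 1.0, 2.0)):
    """Hypothetical grid search over (tau1, tau2) pairs.

    score_fn: callable mapping (tau1, tau2) to a validation metric
              (higher is better). Returns the best-scoring pair.
    """
    return max(itertools.product(grid, grid),
               key=lambda pair: score_fn(*pair))
```

Each candidate pair requires only re-aggregating cached logits, not new forward passes, so the search itself is cheap.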

Key Experimental Results

Main Results (DRBench BS Subset Overall)

| Method   | LLaVA-NeXT BS↑ | Qwen2-VL BS↑ |
|----------|----------------|--------------|
| Baseline | 18.75          | 14.52        |
| TIE      | 27.31          | -            |
| VCD      | 27.89          | -            |
| M3ID     | 29.05          | -            |
| SCI3     | 32.72          | -            |
| SCI5     | 34.19          | -            |
| SCI7     | 34.92          | -            |

Ablation Study

| Configuration   | Effect                                 | Notes                              |
|-----------------|----------------------------------------|------------------------------------|
| VC only (≈VCD)  | Bias improves; sensitivity unchanged   | Addresses only half the problem    |
| TC only         | Sensitivity improves; bias unchanged   | Addresses the other half           |
| VC + TC (SCI)   | Both problems simultaneously improved  | Advantage of the unified framework |
| SCI3→SCI5→SCI7  | Consistent 1–2% gains                  | Test-time scaling is effective     |

Key Findings

  • Minimal overlap of hard samples across models: Only 7.34% of the 24.68% hard samples identified for LLaVA-NeXT are shared with Qwen2-VL, confirming that robustness failures are model-specific.
  • Qwen2-VL is generally more robust but more susceptible to bias; LLaVA-NeXT exhibits more pronounced sensitivity issues.
  • Monotonic improvement from SCI3 to SCI7 suggests that the potential of test-time robustness scaling remains largely unexplored.
  • TC and VC address distinct types of robustness failures and are mutually indispensable.

Highlights & Insights

  • Unification of VCD and CF-VQA: Revealing that VCD is equivalent to temperature-scaled TIE reweighting is a contribution of independent value.
  • Test-time robustness scaling: Unlike conventional test-time scaling (extending intermediate token length), robustness is improved by increasing counterfactual inference rounds—a direction orthogonal to CoT scaling.
  • DRBench design philosophy: A dynamic, model-specific benchmark that can be automatically constructed from any dataset, addressing the risk of fixed benchmarks being included in subsequent model training data.
  • The method is model-agnostic and can be directly integrated into any LVLM inference pipeline.

Limitations & Future Work

  • Inference cost scales linearly: SCI7 requires 7 forward passes.
  • The strategies for generating textual and visual variants are relatively straightforward; more sophisticated counterfactual generation may yield further improvements.
  • Temperature parameters \(\tau_1\) and \(\tau_2\) require manual tuning.
  • DRBench relies on specific counterfactual generation methods to construct bias and sensitivity subsets.

Comparison with Prior Methods
  • vs. VCD: VCD is a special case of SCI (\(N=0, M=1\)); SCI extends the counterfactual dimensions and introduces test-time scaling.
  • vs. CF-VQA / TDE: Causal analysis for debiasing in conventional VQA has been established in prior work; this paper demonstrates that the same principles apply to LVLMs and extend naturally to language sensitivity.

Rating

  • Novelty: ⭐⭐⭐⭐ — Unified analysis is insightful; test-time scaling direction is novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Six datasets, two models, and a well-designed DRBench.
  • Writing Quality: ⭐⭐⭐⭐⭐ — Theoretical analysis is elegant; the derivation from VCD to SCI is natural.
  • Value: ⭐⭐⭐⭐ — A practical inference-time robustness enhancement method.