Scaling Test-Time Robustness of Vision-Language Models via Self-Critical Inference Framework¶
Conference: CVPR 2026 arXiv: 2603.07659 Code: https://github.com/KaihuaTang/Self-Critical-Inference-Framework Area: Multimodal VLM Keywords: LVLM robustness, counterfactual reasoning, language bias, language sensitivity, test-time scaling
TL;DR¶
This paper proposes the Self-Critical Inference (SCI) framework, which simultaneously addresses language bias and language sensitivity in LVLMs via multi-round textual and visual counterfactual logit aggregation. A dynamic robustness benchmark, DRBench, is introduced to evaluate robustness in a model-specific manner. Increasing the number of counterfactual inference rounds yields consistent robustness gains, opening a new direction for test-time scaling.
Background & Motivation¶
Background: LVLMs achieve strong vision-language capabilities by combining visual encoders with pretrained LLMs through joint fine-tuning.
Limitations of Prior Work:
- Language Bias: Models rely on language priors rather than visual inputs to answer questions, leading to object hallucinations (e.g., generating non-existent content).
- Language Sensitivity: Semantically equivalent but lexically distinct prompt variations elicit different responses, undermining consistency and reliability.
- Methods such as VCD address only visual counterfactuals (bias), entirely neglecting textual counterfactuals (sensitivity).
Key Challenge: VCD is fundamentally a reweighting of the original logits by TIE logits along a single dimension (visual); however, the robustness challenges of LVLMs are two-dimensional.
Goal: Simultaneously mitigate language bias and language sensitivity, and demonstrate that increasing inference rounds improves robustness.
Key Insight: Unifying VCD within the causal analysis framework of CF-VQA to reveal the physical interpretation of \(\alpha\) (a temperature parameter for TIE), and naturally extending this formulation to the textual counterfactual dimension.
Core Idea: Since VCD reduces to TIE reweighting, both Textual Counterfactual (TC) and Visual Counterfactual (VC) can be applied simultaneously, enabling test-time robustness scaling via multi-round logit aggregation.
Method¶
Overall Architecture¶
Given the original input \((v^0, q^0)\), \(N\) textual variants \(\{q^i\}\) and \(M\) visual variants \(\{v^j\}\) are generated. TC and VC logits are computed separately and combined via weighted multiplication to yield the final prediction: \(p_{SCI}(y) \propto \exp(TC/\tau_1) \cdot \exp(VC/\tau_2)\).
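The final aggregation amounts to a softmax over temperature-scaled summed logits, since \(\exp(TC/\tau_1) \cdot \exp(VC/\tau_2) = \exp(TC/\tau_1 + VC/\tau_2)\). A minimal sketch with toy logit values (the numbers and default temperatures are illustrative, not from the paper):

```python
import numpy as np

def sci_combine(tc_logits, vc_logits, tau1=1.0, tau2=1.0):
    """p_SCI(y) ∝ exp(TC/tau1) * exp(VC/tau2), i.e. a softmax over
    the temperature-scaled sum of the TC and VC logits."""
    z = tc_logits / tau1 + vc_logits / tau2
    z = z - z.max()                 # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Toy vocabulary of 3 candidate answer tokens (illustrative values).
tc = np.array([2.0, 1.0, 0.5])
vc = np.array([0.5, 1.0, 0.0])
p = sci_combine(tc, vc)             # a proper distribution over tokens
```

The two temperatures let the textual and visual counterfactual signals be weighted independently before the product is taken.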
Key Designs¶
- Unified Understanding of VCD and CF-VQA:
  - VCD logit: \(Z_{vcd} = (1+\alpha)Z(v,q) - \alpha Z(v^*,q)\)
  - Expanding in the exp domain: \(p(y) \propto \exp(Z(v,q)) \cdot \exp(TIE/\tau)\)
  - This reveals that VCD is essentially a vocabulary-level reweighting term using TIE logits, where \(\tau = 1/\alpha\) serves as the temperature parameter.
  - This analysis bridges VCD and CF-VQA, providing a theoretical foundation for the extension to the textual dimension.
- Textual Counterfactual (TC):
  - Semantically equivalent but lexically diverse prompt variants \(\{q^i\}\) are generated.
  - For each token position \(k\), the element-wise maximum over all variant logits is taken: \(TC_k = \max_i(Z_k(v^0, q^i))\)
  - Effect: eliminates logit biases induced by specific phrasings while retaining predictions that are consistent across wordings.
  - Design motivation: if a model produces different answers to semantically identical but lexically different prompts, taking the maximum selects the most stable prediction.
- Visual Counterfactual (VC):
  - VCD is extended to multiple counterfactual images: \(VC = Z(v^0, q^0) - \mathbb{E}[Z(v^j, q^0)]\)
  - The average logit over multiple content-removed images replaces a single noise image.
  - This yields a more stable estimate of language bias.
- SCI3 / SCI5 / SCI7 Configurations:
  - SCI3: \(M=N=1\) (3 inference passes); SCI5: \(M=N=2\) (5 passes); SCI7: \(M=N=3\) (7 passes).
  - Increasing the number of inference rounds consistently improves robustness at linearly growing computational cost.
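The claimed equivalence between VCD's contrastive form and TIE reweighting with \(\tau = 1/\alpha\) can be verified numerically; with \(TIE = Z(v,q) - Z(v^*,q)\), both expressions reduce to \(Z + \alpha \cdot TIE\). A quick check with toy logits (values are illustrative):

```python
import numpy as np

alpha = 1.0
z = np.array([1.0, 0.2, -0.5])      # Z(v, q): logits with the real image
z_cf = np.array([0.8, 0.6, -0.1])   # Z(v*, q): logits with the counterfactual image

# VCD's contrastive form: (1 + alpha) * Z - alpha * Z*
z_vcd = (1 + alpha) * z - alpha * z_cf

# TIE-reweighting form: Z + TIE / tau, with TIE = Z - Z* and tau = 1 / alpha
tie = z - z_cf
tau = 1.0 / alpha
z_rw = z + tie / tau

assert np.allclose(z_vcd, z_rw)     # the two forms give identical logits
```

Since softmax is invariant to which algebraic form produced the logits, the decoding distributions also match, which is what licenses reading \(\alpha\) as an inverse temperature on the TIE term.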
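The two aggregation rules above are simple axis reductions over stacked logits. A minimal sketch with toy values (the numbers are illustrative, not from the paper):

```python
import numpy as np

# TC: element-wise max over logits from N prompt rephrasings,
# TC_k = max_i Z_k(v^0, q^i).
variant_logits = np.array([
    [2.0, 0.1, 0.3],   # phrasing 1 (toy values)
    [0.4, 0.2, 0.3],   # phrasing 2
    [1.8, 0.0, 0.5],   # phrasing 3
])
tc = variant_logits.max(axis=0)            # -> [2.0, 0.2, 0.5]

# VC: original logits minus the mean over M content-removed images,
# VC = Z(v^0, q^0) - E[Z(v^j, q^0)].
orig = np.array([1.0, 0.5, -0.2])
cf_image_logits = np.array([
    [0.6, 0.5, -0.3],  # counterfactual image 1 (toy values)
    [0.8, 0.3, -0.1],  # counterfactual image 2
])
vc = orig - cf_image_logits.mean(axis=0)   # -> [0.3, 0.1, 0.0]
```

The max keeps, per token, the strongest evidence any phrasing produced, while the mean subtraction cancels whatever the model would have predicted without the visual content.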
Loss & Training¶
SCI is a purely inference-time method requiring no training. The temperature parameters \(\tau_1\) and \(\tau_2\) for TC and VC are tuned on a validation set.
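The paper does not spell out the tuning procedure; since there are only two scalars, a plain grid search over a validation metric is one plausible sketch (`validation_score` is a user-supplied placeholder, not part of the paper's method):

```python
import itertools

def tune_temperatures(validation_score, grid=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """Pick the (tau1, tau2) pair that maximizes a validation metric.
    `validation_score` is any callable (tau1, tau2) -> float, e.g. one
    that runs SCI decoding on a held-out set and returns accuracy."""
    return max(itertools.product(grid, grid),
               key=lambda taus: validation_score(*taus))

# Hypothetical score with a known optimum at tau1=1.0, tau2=2.0,
# standing in for an expensive validation run.
score = lambda t1, t2: -((t1 - 1.0) ** 2 + (t2 - 2.0) ** 2)
best = tune_temperatures(score)    # -> (1.0, 2.0)
```

In practice each grid cell would require a full validation pass, so a coarse logarithmic grid like the one above keeps the cost manageable.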
Key Experimental Results¶
Main Results (DRBench BS Subset Overall)¶
| Method | LLaVA-NeXT BS↑ | Qwen2-VL BS↑ |
|---|---|---|
| Baseline | 18.75 | 14.52 |
| TIE | 27.31 | - |
| VCD | 27.89 | - |
| M3ID | 29.05 | - |
| SCI3 | 32.72 | - |
| SCI5 | 34.19 | - |
| SCI7 | 34.92 | - |
Ablation Study¶
| Configuration | Effect | Notes |
|---|---|---|
| VC only (≈VCD) | Bias improves; sensitivity unchanged | Addresses only half the problem |
| TC only | Sensitivity improves; bias unchanged | Addresses the other half |
| VC + TC (SCI) | Both problems simultaneously improved | Advantage of the unified framework |
| SCI3→SCI5→SCI7 | Consistent 1–2% gains | Test-time scaling is effective |
Key Findings¶
- Minimal overlap of hard samples across models: only 7.34% of the 24.68% of samples identified as hard for LLaVA-NeXT are also hard for Qwen2-VL, confirming that robustness failures are model-specific.
- Qwen2-VL is generally more robust but more susceptible to bias; LLaVA-NeXT exhibits more pronounced sensitivity issues.
- Monotonic improvement from SCI3 to SCI7 suggests that the potential of test-time robustness scaling remains largely unexplored.
- TC and VC address distinct types of robustness failures and are mutually indispensable.
Highlights & Insights¶
- Unification of VCD and CF-VQA: Revealing that VCD is equivalent to temperature-scaled TIE reweighting is a contribution of independent value.
- Test-time robustness scaling: Unlike conventional test-time scaling (extending intermediate token length), robustness is improved by increasing counterfactual inference rounds—a direction orthogonal to CoT scaling.
- DRBench design philosophy: A dynamic, model-specific benchmark that can be automatically constructed from any dataset, addressing the risk of fixed benchmarks being included in subsequent model training data.
- The method is model-agnostic and can be directly integrated into any LVLM inference pipeline.
Limitations & Future Work¶
- Inference cost scales linearly: SCI7 requires 7 forward passes.
- The strategies for generating textual and visual variants are relatively straightforward; more sophisticated counterfactual generation may yield further improvements.
- Temperature parameters \(\tau_1\) and \(\tau_2\) require manual tuning.
- DRBench relies on specific counterfactual generation methods to construct bias and sensitivity subsets.
Related Work & Insights¶
- vs. VCD: VCD is a special case of SCI (\(N=0, M=1\)); SCI extends the counterfactual dimensions and introduces test-time scaling.
- vs. CF-VQA / TDE: Causal analysis for debiasing in conventional VQA has been established in prior work; this paper demonstrates that the same principles apply to LVLMs and extend naturally to language sensitivity.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Unified analysis is insightful; test-time scaling direction is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Six datasets, two models, and a well-designed DRBench.
- Writing Quality: ⭐⭐⭐⭐⭐ — Theoretical analysis is elegant; the derivation from VCD to SCI is natural.
- Value: ⭐⭐⭐⭐ — A practical inference-time robustness enhancement method.