
Efficient Test-Time Scaling for Small Vision-Language Models

Conference: ICLR 2026 · arXiv: 2510.03574 · Code: GitHub · Area: LLM Reasoning / VLM Efficiency
Keywords: test-time scaling, vision-language models, test-time augmentation, test-time adaptation, token-level aggregation

TL;DR

This paper proposes two efficient test-time scaling strategies for small VLMs: TTAug (applying diverse input augmentations and aggregating output probability distributions at the token level) and TTAdapt (adapting model parameters using pseudo-labels generated by TTAug). Both methods consistently improve performance across 9 benchmarks while achieving substantially better computational efficiency than existing sampling-based test-time scaling approaches.

Background & Motivation

Small vision-language models (e.g., SmolVLM2-2.2B) offer computationally efficient alternatives to large models but suffer from weaker generalization and downstream task performance. Test-Time Scaling (TTS) techniques can compensate for limited model capacity by investing additional computation at inference time; however, existing methods face a fundamental tension:

Self-Consistency: Generates multiple candidate answers via temperature sampling and aggregates them through majority voting. However, repeated sampling is computationally expensive, and aggregation at the answer level discards fine-grained token-level information.

Self-Selector / Sample-and-Rank: Selects the best response using the model itself or log probabilities, but still relies on multiple independent sampling passes.

Self-Synthesizer: Synthesizes multiple responses into a final answer, incurring additional overhead for the synthesis step.

Root Cause: The computational cost of existing TTS methods conflicts with the resource-efficiency design goals of small models.

The paper identifies two design choices that simultaneously improve both effectiveness and efficiency:

  • Replace temperature sampling with input augmentation to induce diversity: semantics-preserving augmentations produce higher-quality diverse candidates than temperature sampling.
  • Aggregate at the token level rather than the answer level: this captures finer-grained confidence signals.

Method

Overall Architecture

Two complementary pipelines:

Pipeline 1 — TTAug (Test-Time Augmentation): Input image + text prompt → generate multiple augmented variants → process each variant with the VLM → aggregate next-token probability distributions at each decoding step → greedily decode the final answer.

Pipeline 2 — TTAdapt (Test-Time Adaptation): TTAug generates pseudo-labels → fine-tune VLM parameters on pseudo-labels → repeat until convergence or budget exhaustion.
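The TTAdapt loop above can be sketched with a deliberately toy, self-contained example (all names and the "model" are mine, not the authors' code): the model is reduced to a logit vector over three candidate answers, "TTAug" averages noisy copies of those logits to reach a consensus, and adaptation nudges the consensus answer's logit upward each round.

```python
import random

def ttaug_consensus(logits, n_augs=8, seed=0):
    """Stand-in for TTAug: average n_augs noisy views of the logits, pick argmax."""
    rng = random.Random(seed)
    avg = [0.0] * len(logits)
    for _ in range(n_augs):
        noisy = [l + rng.gauss(0, 0.1) for l in logits]  # simulated augmented views
        for i, l in enumerate(noisy):
            avg[i] += l / n_augs
    return max(range(len(logits)), key=lambda i: avg[i])

def ttadapt(logits, n_rounds=3, lr=0.5):
    """Stand-in for TTAdapt: repeat (pseudo-label via TTAug, lightweight update)."""
    logits = list(logits)
    for _ in range(n_rounds):
        pseudo = ttaug_consensus(logits)  # 1) TTAug generates a pseudo-label
        logits[pseudo] += lr              # 2) lightweight "parameter" update
    return ttaug_consensus(logits)        # final TTAug pass with the adapted model

print(ttadapt([2.0, 0.5, 0.2]))
```

The structure, not the arithmetic, is the point: each round the model's own consensus output becomes the training signal for the next round.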

Key Designs

  1. Dual-channel input augmentation:

     • Image augmentation: applies semantics-preserving transformations to the input image (e.g., slight rotations, crops, color jitter) to generate multiple visual variants.
     • Text augmentation: introduces mild perturbations to the text prompt, injecting typos and tokenization noise (e.g., "Which country" → "Wh ich cou ntry"), while appending the original intact question as a reference (e.g., "In other words, ...").
     • Design motivation: image augmentation provides viewpoint diversity, while text augmentation forces the model to focus on core semantics rather than surface form.

  2. Token-level aggregation (Key Insight #2):

     • At each step of autoregressive decoding, the next-token probability distributions produced by all augmented inputs are collected.
     • These distributions are averaged (or weighted-averaged), and the token with the highest probability is greedily selected.
     • Compared to answer-level aggregation (e.g., majority voting), token-level aggregation exploits local confidence signals.
     • Example: if 6 out of 10 augmentations predict "Germany," 3 predict "France," and 1 predicts "UK" at a given position, token-level aggregation can leverage this 60% consensus signal.

  3. Consensus pseudo-label adaptation (TTAdapt):

     • Uses TTAug outputs as high-quality pseudo-labels.
     • Performs lightweight parameter adaptation (potentially via LoRA or direct fine-tuning of a subset of layers).
     • The adapted model can subsequently re-run TTAug, forming an iterative optimization loop.
     • Compared to the parameter-free TTAug, TTAdapt further internalizes the corrective signals introduced by augmentation.

  4. Augmentation diversity vs. temperature sampling (Key Insight #1):

     • Experiments demonstrate that input augmentation produces higher-quality answer diversity than temperature sampling.
     • Temperature sampling only alters decoding stochasticity, whereas input augmentation changes the information received by the model, inducing more fundamental perspective variation.
     • Under both Self-Consistency and Self-Selector TTS strategies, augmentation consistently outperforms temperature sampling.
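The text-augmentation design above can be illustrated with a small sketch (the function and its parameters are hypothetical, not the paper's implementation): split a couple of random words to simulate typo/tokenization noise, then append the intact question after an "In other words," marker.

```python
import random

def augment_prompt(question, n_splits=2, seed=0):
    """Inject tokenization noise into random words, then append the intact question."""
    rng = random.Random(seed)
    words = question.split()
    for _ in range(n_splits):
        i = rng.randrange(len(words))
        if len(words[i]) > 2:
            cut = rng.randrange(1, len(words[i]))
            words[i] = words[i][:cut] + " " + words[i][cut:]  # e.g. "country" -> "cou ntry"
    return " ".join(words) + " In other words, " + question

print(augment_prompt("Which country has the highest value?"))
```

Because the intact question is always appended, every augmented variant remains answerable; the noise only penalizes over-reliance on surface form.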
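Token-level aggregation at a single decoding step can be written in a few lines (an assumed interface, not the authors' code; the real method repeats this at every autoregressive step over the full vocabulary): average the per-augmentation next-token distributions, then greedily take the argmax.

```python
def aggregate_next_token(distributions, weights=None):
    """Average next-token distributions from all augmented inputs; return (argmax, avg)."""
    n = len(distributions)
    vocab = len(distributions[0])
    weights = weights or [1.0 / n] * n  # uniform average by default
    avg = [0.0] * vocab
    for dist, w in zip(distributions, weights):
        for tok in range(vocab):
            avg[tok] += w * dist[tok]
    return max(range(vocab), key=lambda t: avg[t]), avg

# Toy vocab {0: "Germany", 1: "France", 2: "UK"}, mirroring the 60% consensus example:
dists = [[0.9, 0.05, 0.05]] * 6 + [[0.1, 0.85, 0.05]] * 3 + [[0.1, 0.1, 0.8]]
token, avg = aggregate_next_token(dists)
print(token)  # index of "Germany"
```

Note that unlike majority voting, the averaged distribution also preserves how confident each augmented view was, not just which token it ranked first.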

Loss & Training

TTAug requires no training and is a purely inference-time method.

TTAdapt uses:

  • Pseudo-labels derived from TTAug consensus outputs.
  • Standard next-token cross-entropy loss.
  • Lightweight parameter updates to avoid overfitting to individual test samples.
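The objective is the standard next-token cross-entropy, only with TTAug consensus outputs in place of ground-truth targets. A minimal pure-Python sketch (illustrative only; a real setup would use the framework's loss over the model's logits):

```python
import math

def pseudo_label_ce(logits_per_step, pseudo_tokens):
    """Mean cross-entropy of per-step logits against TTAug pseudo-label token ids."""
    total = 0.0
    for logits, tok in zip(logits_per_step, pseudo_tokens):
        m = max(logits)  # subtract max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[tok]  # -log p(pseudo-label token)
    return total / len(pseudo_tokens)
```

For uniform logits over a 3-token vocabulary this gives log 3, and it decreases as the model's confidence in the pseudo-label grows, which is exactly the signal the lightweight update follows.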

Key Experimental Results

Main Results

Evaluated on 9 benchmarks using SmolVLM2-2.2B as the primary model:

  • VQA: GQA, TextVQA, OCRVQA
  • Multiple-choice / binary judgment: AI2D, MME-RealWorld, AMBER
  • Chart understanding: ChartQA, OCRBench
  • Image captioning: COCO Captions

Radar chart results show that TTAug and TTAdapt yield consistent improvements across all 9 benchmarks, with TTAdapt > TTAug > Baseline.

Comparison with Other TTS Methods

Method            Diversity Source      Aggregation Level  Efficiency
Self-Consistency  Temperature sampling  Answer-level       Low (multiple full decoding passes)
Self-Selector     Temperature sampling  Selection          Low
Sample-and-Rank   Temperature sampling  Selection          Low
Self-Synthesizer  Temperature sampling  LLM synthesis      Lowest
TTAug (Ours)      Input augmentation    Token-level        High

Scaling Behavior

  • Average performance increases monotonically with the number of augmentations and eventually saturates.
  • A small number of augmentations (e.g., 4–8) captures most of the performance gain.

Cross-Model Generalization

Although hyperparameters are tuned for SmolVLM2-2.2B, consistent improvements are observed across models of different scales and architectures, demonstrating the generality of the approach.

Ablation Study

Configuration             Finding
Aggregation strategy      Token-level outperforms answer-level
Aggregation position      Exploration of the optimal aggregation layer
Augmentation combination  Image + text dual augmentation is optimal
Adaptation objective      Comparison of different TTAdapt loss formulations

Qualitative Results

Multiple case studies illustrate the error-correction capability of augmentation and aggregation:

  • ChartQA: Baseline "France" → TTAug/TTAdapt "Germany" ✓
  • OCRBench: Baseline "100.00" → TTAug/TTAdapt "71.10" ✓
  • OCRVQA: Baseline "Brushy" → TTAug/TTAdapt "Brush Dance" ✓
  • GQA: Baseline "Blinds" → TTAug/TTAdapt "Desk" ✓

Highlights & Insights

  1. Two key insights constitute the core contribution: augmentation > temperature sampling and token-level > answer-level aggregation; these findings can be applied independently to other TTS methods.
  2. Clever design of text augmentation: The strategy of injecting spelling noise while appending the original question as a reference is both simple and effective, compelling the model to ignore surface-level noise and attend to semantics.
  3. Iterative mechanism of TTAdapt: Using the model's own consensus outputs as pseudo-labels for self-adaptation creates a positive feedback loop.
  4. Practical deployment orientation: Small models are explicitly targeted, and efficiency constraints are maintained throughout the design.
  5. Systematic analysis: The paper provides comprehensive ablation studies, scaling behavior analysis, and cross-model generalization experiments.

Limitations & Future Work

  1. Augmentation count vs. latency trade-off: Although more efficient than sampling-based methods, processing multiple augmented inputs still incurs linearly growing latency.
  2. Generality of text augmentation: Spelling noise injection may not be appropriate for certain tasks (e.g., code generation).
  3. Risk of overfitting in TTAdapt: Adapting parameters on individual test samples may lead to overfitting, particularly on out-of-distribution data.
  4. Hyperparameter transfer: While cross-model generalization is demonstrated, the paper acknowledges that task-specific tuning yields better results.
  5. Evaluation scope: Evaluation is conducted primarily on discriminative tasks; effectiveness on open-ended generative tasks remains to be validated.
Related Work & Connections

  • Test-Time Training (TTT): a classical test-time adaptation approach; this paper redesigns it for the VLM setting.
  • Self-Consistency (Wang et al.): seminal work on majority-voting TTS; this paper improves upon it in both the source of diversity and the granularity of aggregation.
  • Test-Time Augmentation (classical CV): TTA is widely used in traditional computer vision; this paper extends it to autoregressive generation in VLMs.
  • Token Merging / Pruning: a complementary direction for efficiency optimization, potentially composable with TTAug.

The broader lesson for VLM inference-efficiency research: performance gains can be achieved not only through model compression but also through more principled inference strategies.

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐