
Efficient Test-Time Scaling for Small Vision-Language Models

Conference: ICLR 2026 · arXiv: 2510.03574 · Code: GitHub · Area: LLM Reasoning / VLM Efficiency
Keywords: test-time scaling, vision-language models, test-time augmentation, test-time adaptation, token-level aggregation

TL;DR

This paper proposes two efficient test-time scaling strategies for small VLMs: TTAug (applying diverse input augmentations and aggregating output probability distributions at the token level) and TTAdapt (adapting model parameters using pseudo-labels generated by TTAug). Both methods consistently improve performance across 9 benchmarks while achieving substantially better computational efficiency than existing sampling-based test-time scaling approaches.

Background & Motivation

Small vision-language models (e.g., SmolVLM2-2.2B) offer computationally efficient alternatives to large models but suffer from weaker generalization and downstream task performance. Test-Time Scaling (TTS) techniques can compensate for limited model capacity by investing additional computation at inference time; however, existing methods face a fundamental tension:

Self-Consistency: Generates multiple candidate answers via temperature sampling and aggregates them through majority voting. However, repeated sampling is computationally expensive, and aggregation at the answer level discards fine-grained token-level information.

Self-Selector / Sample-and-Rank: Selects the best response using the model itself or log probabilities, but still relies on multiple independent sampling passes.

Self-Synthesizer: Synthesizes multiple responses into a final answer, incurring additional overhead for the synthesis step.

Root Cause: The computational cost of existing TTS methods conflicts with the resource-efficiency design goals of small models.

The paper identifies two design choices that simultaneously improve both effectiveness and efficiency:

  • Replace temperature sampling with input augmentation to induce diversity: semantics-preserving augmentations produce higher-quality diverse candidates than temperature sampling.
  • Aggregate at the token level rather than the answer level: this captures finer-grained confidence signals.

Method

Overall Architecture

Two complementary pipelines:

Pipeline 1 — TTAug (Test-Time Augmentation): Input image + text prompt → generate multiple augmented variants → process each variant with the VLM → aggregate next-token probability distributions at each decoding step → greedily decode the final answer.

Pipeline 2 — TTAdapt (Test-Time Adaptation): TTAug generates pseudo-labels → fine-tune VLM parameters on pseudo-labels → repeat until convergence or budget exhaustion.
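The TTAdapt loop above can be sketched with a deliberately toy, self-contained example (all names and the "model" are mine, not the authors' code): the model is reduced to a logit vector over three candidate answers, "TTAug" averages noisy copies of those logits to reach a consensus, and adaptation nudges the consensus answer's logit upward each round.

```python
import random

def ttaug_consensus(logits, n_augs=8, seed=0):
    """Stand-in for TTAug: average n_augs noisy views of the logits, pick argmax."""
    rng = random.Random(seed)
    avg = [0.0] * len(logits)
    for _ in range(n_augs):
        noisy = [l + rng.gauss(0, 0.1) for l in logits]  # simulated augmented views
        for i, l in enumerate(noisy):
            avg[i] += l / n_augs
    return max(range(len(logits)), key=lambda i: avg[i])

def ttadapt(logits, n_rounds=3, lr=0.5):
    """Stand-in for TTAdapt: repeat (pseudo-label via TTAug, lightweight update)."""
    logits = list(logits)
    for _ in range(n_rounds):
        pseudo = ttaug_consensus(logits)  # 1) TTAug generates a pseudo-label
        logits[pseudo] += lr              # 2) lightweight "parameter" update
    return ttaug_consensus(logits)        # final TTAug pass with the adapted model

print(ttadapt([2.0, 0.5, 0.2]))
```

The structure, not the arithmetic, is the point: each round the model's own consensus output becomes the training signal for the next round.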

Key Designs

  1. Dual-channel input augmentation:

     • Image augmentation: applies semantics-preserving transformations to the input image (e.g., slight rotations, crops, color jitter) to generate multiple visual variants.
     • Text augmentation: introduces mild perturbations to the text prompt, injecting typos and tokenization noise (e.g., "Which country" → "Wh ich cou ntry"), while appending the original intact question as a reference (e.g., "In other words, ...").
     • Design motivation: image augmentation provides viewpoint diversity, while text augmentation forces the model to focus on core semantics rather than surface form.

  2. Token-level aggregation (Key Insight #2):

     • At each step of autoregressive decoding, the next-token probability distributions produced by all augmented inputs are collected.
     • These distributions are averaged (or weighted-averaged), and the token with the highest probability is greedily selected.
     • Compared to answer-level aggregation (e.g., majority voting), token-level aggregation exploits local confidence signals.
     • Example: if 6 out of 10 augmentations predict "Germany," 3 predict "France," and 1 predicts "UK" at a given position, token-level aggregation can leverage this 60% consensus signal.

  3. Consensus pseudo-label adaptation (TTAdapt):

     • Uses TTAug outputs as high-quality pseudo-labels.
     • Performs lightweight parameter adaptation (potentially via LoRA or direct fine-tuning of a subset of layers).
     • The adapted model can subsequently re-run TTAug, forming an iterative optimization loop.
     • Compared to the parameter-free TTAug, TTAdapt further internalizes the corrective signals introduced by augmentation.

  4. Augmentation diversity vs. temperature sampling (Key Insight #1):

     • Experiments demonstrate that input augmentation produces higher-quality answer diversity than temperature sampling.
     • Temperature sampling only alters decoding stochasticity, whereas input augmentation changes the information received by the model, inducing more fundamental perspective variation.
     • Under both Self-Consistency and Self-Selector TTS strategies, augmentation consistently outperforms temperature sampling.
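The text-augmentation design above can be illustrated with a small sketch (the function and its parameters are hypothetical, not the paper's implementation): split a couple of random words to simulate typo/tokenization noise, then append the intact question after an "In other words," marker.

```python
import random

def augment_prompt(question, n_splits=2, seed=0):
    """Inject tokenization noise into random words, then append the intact question."""
    rng = random.Random(seed)
    words = question.split()
    for _ in range(n_splits):
        i = rng.randrange(len(words))
        if len(words[i]) > 2:
            cut = rng.randrange(1, len(words[i]))
            words[i] = words[i][:cut] + " " + words[i][cut:]  # e.g. "country" -> "cou ntry"
    return " ".join(words) + " In other words, " + question

print(augment_prompt("Which country has the highest value?"))
```

Because the intact question is always appended, every augmented variant remains answerable; the noise only penalizes over-reliance on surface form.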
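Token-level aggregation at a single decoding step can be written in a few lines (an assumed interface, not the authors' code; the real method repeats this at every autoregressive step over the full vocabulary): average the per-augmentation next-token distributions, then greedily take the argmax.

```python
def aggregate_next_token(distributions, weights=None):
    """Average next-token distributions from all augmented inputs; return (argmax, avg)."""
    n = len(distributions)
    vocab = len(distributions[0])
    weights = weights or [1.0 / n] * n  # uniform average by default
    avg = [0.0] * vocab
    for dist, w in zip(distributions, weights):
        for tok in range(vocab):
            avg[tok] += w * dist[tok]
    return max(range(vocab), key=lambda t: avg[t]), avg

# Toy vocab {0: "Germany", 1: "France", 2: "UK"}, mirroring the 60% consensus example:
dists = [[0.9, 0.05, 0.05]] * 6 + [[0.1, 0.85, 0.05]] * 3 + [[0.1, 0.1, 0.8]]
token, avg = aggregate_next_token(dists)
print(token)  # index of "Germany"
```

Note that unlike majority voting, the averaged distribution also preserves how confident each augmented view was, not just which token it ranked first.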

Loss & Training

TTAug requires no training and is a purely inference-time method.

TTAdapt uses:

  • Pseudo-labels derived from TTAug consensus outputs.
  • Standard next-token cross-entropy loss.
  • Lightweight parameter updates to avoid overfitting to individual test samples.
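The objective is the standard next-token cross-entropy, only with TTAug consensus outputs in place of ground-truth targets. A minimal pure-Python sketch (illustrative only; a real setup would use the framework's loss over the model's logits):

```python
import math

def pseudo_label_ce(logits_per_step, pseudo_tokens):
    """Mean cross-entropy of per-step logits against TTAug pseudo-label token ids."""
    total = 0.0
    for logits, tok in zip(logits_per_step, pseudo_tokens):
        m = max(logits)  # subtract max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[tok]  # -log p(pseudo-label token)
    return total / len(pseudo_tokens)
```

For uniform logits over a 3-token vocabulary this gives log 3, and it decreases as the model's confidence in the pseudo-label grows, which is exactly the signal the lightweight update follows.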

Key Experimental Results

Main Results

Evaluated on 9 benchmarks using SmolVLM2-2.2B as the primary model:

  • VQA: GQA, TextVQA, OCRVQA
  • Multiple-choice / binary judgment: AI2D, MME-RealWorld, AMBER
  • Chart understanding: ChartQA, OCRBench
  • Image captioning: COCO Captions

Radar chart results show that TTAug and TTAdapt yield consistent improvements across all 9 benchmarks, with TTAdapt > TTAug > Baseline.

Comparison with Other TTS Methods

Method            Diversity Source      Aggregation Level  Efficiency
Self-Consistency  Temperature sampling  Answer-level       Low (multiple full decoding passes)
Self-Selector     Temperature sampling  Selection          Low
Sample-and-Rank   Temperature sampling  Selection          Low
Self-Synthesizer  Temperature sampling  LLM synthesis      Lowest
TTAug (Ours)      Input augmentation    Token-level        High

Scaling Behavior

  • Average performance increases monotonically with the number of augmentations and eventually saturates.
  • A small number of augmentations (e.g., 4–8) captures most of the performance gain.

Cross-Model Generalization

Although hyperparameters are tuned for SmolVLM2-2.2B, consistent improvements are observed across models of different scales and architectures, demonstrating the generality of the approach.

Ablation Study

Configuration             Finding
Aggregation strategy      Token-level outperforms answer-level
Aggregation position      Exploration of the optimal aggregation layer
Augmentation combination  Image + text dual augmentation is optimal
Adaptation objective      Comparison of different TTAdapt loss formulations

Qualitative Results

Multiple case studies illustrate the error-correction capability of augmentation and aggregation:

  • ChartQA: Baseline "France" → TTAug/TTAdapt "Germany" ✓
  • OCRBench: Baseline "100.00" → TTAug/TTAdapt "71.10" ✓
  • OCRVQA: Baseline "Brushy" → TTAug/TTAdapt "Brush Dance" ✓
  • GQA: Baseline "Blinds" → TTAug/TTAdapt "Desk" ✓

Highlights & Insights

  1. Two key insights constitute the core contribution: augmentation > temperature sampling and token-level > answer-level aggregation; these findings can be applied independently to other TTS methods.
  2. Clever design of text augmentation: The strategy of injecting spelling noise while appending the original question as a reference is both simple and effective, compelling the model to ignore surface-level noise and attend to semantics.
  3. Iterative mechanism of TTAdapt: Using the model's own consensus outputs as pseudo-labels for self-adaptation creates a positive feedback loop.
  4. Practical deployment orientation: Small models are explicitly targeted, and efficiency constraints are maintained throughout the design.
  5. Systematic analysis: The paper provides comprehensive ablation studies, scaling behavior analysis, and cross-model generalization experiments.

Limitations & Future Work

  1. Augmentation count vs. latency trade-off: Although more efficient than sampling-based methods, processing multiple augmented inputs still incurs linearly growing latency.
  2. Generality of text augmentation: Spelling noise injection may not be appropriate for certain tasks (e.g., code generation).
  3. Risk of overfitting in TTAdapt: Adapting parameters on individual test samples may lead to overfitting, particularly on out-of-distribution data.
  4. Hyperparameter transfer: While cross-model generalization is demonstrated, the paper acknowledges that task-specific tuning yields better results.
  5. Evaluation scope: Evaluation is conducted primarily on discriminative tasks; effectiveness on open-ended generative tasks remains to be validated.
Related Work & Connections

  • Test-Time Training (TTT): a classical test-time adaptation approach; this paper redesigns it for the VLM setting.
  • Self-Consistency (Wang et al.): seminal work on majority-voting TTS; this paper improves upon it in both the source of diversity and the granularity of aggregation.
  • Test-Time Augmentation (classical CV): TTA is widely used in traditional computer vision; this paper extends it to autoregressive generation in VLMs.
  • Token Merging / Pruning: a complementary direction for efficiency optimization, potentially composable with TTAug.

The broader lesson for VLM inference-efficiency research: performance gains can be achieved not only through model compression but also through more principled inference strategies.

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐