Context Informs Pragmatic Interpretation in Vision-Language Models¶
Conference: NeurIPS 2025 · arXiv: 2511.03908 · Code: GitHub · Area: Multimodal VLM · Keywords: Pragmatic reasoning, reference games, context sensitivity, cognitive evaluation of VLMs, abstract visual reasoning
TL;DR¶
This work systematically evaluates the pragmatic reasoning capabilities of VLMs using iterated reference games. Models perform substantially worse than humans in the absence of context, but can rapidly leverage relevant dialogue history to achieve approximately 80% accuracy, revealing a strong dependence on contextual information.
Background & Motivation¶
Multi-turn dialogue is a central feature of human communication—shared conversational history underpins the formation of semantic conventions. Iterated reference games are a classical paradigm for studying this phenomenon: a director must describe a target figure in language so that a matcher can correctly select it from among multiple options. As the game progresses, participants develop concise, conventionalized expressions.
This capability is critical for AI dialogue systems and requires two behavioral properties:
- Pragmatic interpretation: Understanding the intended meaning of an utterance in context
- Context sensitivity: Leveraging prior interaction history to guide current comprehension
Nevertheless, reference games involving abstract figures (tangrams) remain highly challenging for AI systems, particularly in few-shot settings. This study is the first systematic comparison of human and state-of-the-art open-weight VLM performance on pragmatic reasoning in iterated reference games.
Method¶
Overall Architecture¶
Data: The study uses the iterated reference game dataset from Boyce et al. In each game, players view a grid of 12 tangram figures; the director describes a highlighted target, and the matcher selects it. Each game involves 2–6 players across 6 rounds totaling 72 trials.
Evaluation setup: Ten games are selected, and four open-weight VLMs are evaluated:
- Qwen 2.5 VL 32B
- Gemma 3 27B
- Llama 3.2 11B
- Kimi VL A3B
Each model receives a system prompt, a concatenated image of 12 labeled figures, and the dialogue history of prior trials (as chat history), and outputs log probabilities over letters A–L. Accuracy is measured as the normalized probability assigned to the correct target.
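As a concrete illustration of this scoring scheme, here is a minimal sketch assuming a PyTorch model that exposes next-token logits; the helper `letter_accuracy`, the dummy token ids, and the single-token-letter assumption are illustrative, not the authors' released code:

```python
# A sketch of the scoring step, assuming single-token letter options.
import torch

LETTERS = [chr(ord("A") + i) for i in range(12)]  # the 12 answer options A-L

def letter_accuracy(next_token_logits: torch.Tensor,
                    letter_token_ids: list[int],
                    target_index: int) -> float:
    """Normalized probability assigned to the correct target letter.

    next_token_logits: logits over the full vocabulary at the answer position.
    letter_token_ids:  vocabulary ids of "A".."L" (assumed single tokens).
    target_index:      index (0-11) of the highlighted target figure.
    """
    letter_logits = next_token_logits[letter_token_ids]  # restrict to the 12 options
    letter_probs = torch.softmax(letter_logits, dim=-1)  # renormalize over A-L only
    return letter_probs[target_index].item()

# Toy usage with dummy logits (vocab of 100; letters assumed at ids 10..21):
logits = torch.randn(100)
print(letter_accuracy(logits, letter_token_ids=list(range(10, 22)), target_index=3))
```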
Key Designs¶
Eight control conditions systematically manipulate the quantity, order, and relevance of context (a construction sketch follows the table):
| Condition | Context Source | Same Tangram Set | Trial Order | Target Visible in History |
|---|---|---|---|---|
| Yoked | Same game | ✓ | Original order | ✓ |
| Shuffled | Same game | ✓ | Randomly shuffled | ✓ |
| Backward | Same game | ✓ | Reversed | ✓ |
| Ablated | Same game | ✓ | Original order | ✗ |
| Other-within | Different game (single) | ✓ | Original order | ✓ |
| Other-across | Different game (multiple) | ✗ | Original order | ✓ |
| Random | Different game (multiple) | ✗ | Randomly shuffled | ✓ |
| No context | None | ✗ | N/A | ✗ |
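The table above can be read as a recipe for assembling the dialogue history shown to a model. Below is a sketch under assumed trial records; `build_context` and the `"target"` field are hypothetical names, not the paper's code:

```python
# A sketch of per-condition context assembly; each trial is assumed to be
# a dict with a "target" field naming the figure it refers to.
import random

def build_context(same_game_trials, condition,
                  other_game_trials=None, current_target=None):
    """Return the dialogue-history trials to prepend before the current trial."""
    if condition == "no_context":
        return []
    if condition in ("other_within", "other_across", "random"):
        history = list(other_game_trials)   # context drawn from different game(s)
    else:
        history = list(same_game_trials)    # context from the same game
    if condition in ("shuffled", "random"):
        random.shuffle(history)             # destroy the original trial order
    elif condition == "backward":
        history = history[::-1]             # present trials in reverse order
    elif condition == "ablated":
        # remove earlier trials whose descriptions referred to the current target
        history = [t for t in history if t["target"] != current_target]
    return history
```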
Human Baselines¶
- Original players: Interactive participants from the original games
- Naïve matchers: Human participants who read only the dialogue transcripts without having participated in the original game
- Naïve-matcher conditions (sample sizes): Yoked (\(N=99\)), Shuffled (\(N=97\)), Backward (\(N=89\)), Random (\(N=107\))
Key Experimental Results¶
Main Results¶
Model vs. human accuracy across conditions (steady-state estimates):
| Condition | Human (Original) | Human (Naïve) | Gemma 3 | Qwen 2.5 | Llama 3.2 | Kimi VL |
|---|---|---|---|---|---|---|
| No context | ~0.75 | — | ~0.15 | ~0.15 | ~0.12 | ~0.12 |
| Yoked (late rounds) | ~0.95 | ~0.75 | ~0.80 | ~0.40 | ~0.75 | ~0.75 |
| Other-within | — | — | ~0.40 | ~0.35 | ~0.30 | ~0.35 |
Model–human correlation (trial-level; a computation sketch follows the table):
| Condition | Human Split-Half | Qwen 2.5 | Gemma 3 | Llama 3.2 | Kimi VL |
|---|---|---|---|---|---|
| Yoked | .42 [.32, .50] | .10 | .20 | .25 | .27 |
| Backward | .48 [.40, .56] | .23 | .31 | .40 | .35 |
| Random | — | .61 | .58 | .55 | .57 |
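For reference, here is a sketch of how trial-level model–human correlation and a human split-half estimate could be computed from per-trial accuracy vectors; `pearson` and `split_half` are illustrative helpers, and the confidence intervals reported above suggest the paper uses a bootstrapped or corrected estimator rather than the single naive split shown here:

```python
# A sketch of the correlation analysis on per-trial accuracy vectors.
import numpy as np

def pearson(x, y):
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

def split_half(per_participant, seed=0):
    """Correlate trial-wise mean accuracy of two random halves of participants.

    per_participant: array of shape (participants, trials) with 0/1 correctness.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(per_participant.shape[0])
    half = len(idx) // 2
    a = per_participant[idx[:half]].mean(axis=0)
    b = per_participant[idx[half:]].mean(axis=0)
    return pearson(a, b)

# Toy usage: correlate a model's per-trial accuracy with mean human accuracy.
model_acc = np.random.rand(72)   # 72 trials per game
human_acc = np.random.rand(72)
print(pearson(model_acc, human_acc))
```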
Ablation Study¶
Context relevance analysis (core systematic manipulation):
| Comparison | Finding |
|---|---|
| Yoked vs. Shuffled | Shuffling order degrades performance, mirroring human behavior |
| Yoked vs. Backward | Under reversed order, models remain above their shuffled baseline, unlike humans |
| Yoked vs. Other-within | Cross-game context substantially reduces accuracy (to roughly 0.3–0.5) |
| Yoked vs. Ablated | Removing prior descriptions of the target figure causes a large performance drop |
Key Findings¶
- Models substantially underperform humans without context: Human accuracy is approximately 0.75 upon first exposure to a description; models perform only marginally above chance (1/12 ≈ 0.083).
- Models can rapidly exploit relevant context: Under the Yoked condition, most models reach approximately 0.80 accuracy in later rounds.
- Context must originate from the same game: Cross-game context provides little benefit (accuracy of roughly 0.3–0.5), indicating that conventions are game-specific.
- Model and human error patterns diverge: Trial-level correlations are weak (\(r = .10\)–\(.40\)), far below inter-human consistency.
Highlights & Insights¶
- Elegant experimental design: Eight control conditions systematically isolate the effects of context quantity, order, and relevance.
- Cognitive science perspective: Pragmatic reasoning is situated within a classical paradigm from human communication research rather than treated as a purely model-comparison exercise.
- Revealing fundamental model differences: Humans reach ~0.75 accuracy with no context at all, whereas VLMs are heavily reliant on explicit dialogue history.
- Intriguing finding in the Backward condition: Models are better than humans at back-inferring early-game expressions from conventionalized ones, suggesting qualitatively different reasoning mechanisms.
- Methodological contribution: The study provides a reproducible framework for evaluating context sensitivity and pragmatic reasoning in VLMs.
Limitations & Future Work¶
- Only abstract tangram figures are tested; generalization to naturalistic images remains unverified.
- Models receive richer feedback than humans (models observe the correct answer, whereas human participants receive only binary correctness feedback).
- Only comprehension is evaluated; generating appropriate descriptions constitutes a harder and unaddressed task.
- Only four open-weight models are assessed.
- As a workshop paper, it is limited in scope, and certain analyses lack depth.
Related Work & Insights¶
- Clark (1996): Common ground theory provides the cognitive foundation for this work.
- Hawkins et al. (2021): A hierarchical model of convention formation from partner-specific to community-level representations.
- Gul et al. (2024) CoGen: Attempts to train models on the production side of reference games, with still-limited success.
- Implication: VLM "comprehension" may reflect pattern matching rather than genuine pragmatic reasoning; future work should distinguish in-context learning from true pragmatic adaptation.
Rating¶
- ⭐ Novelty: 4/5 — First systematic evaluation of VLMs on pragmatic reasoning in iterated reference games, with a well-designed experimental framework.
- ⭐ Experimental Thoroughness: 3/5 — Rich control conditions, but limited in number of models and stimulus types.
- ⭐ Writing Quality: 4/5 — Clear logic, intuitive figures, and adequate exposition of the cognitive science background.
- ⭐ Value: 3/5 — Moderate depth as a workshop paper, but opens a meaningful direction for evaluating pragmatic capabilities in VLMs.