Can Vision Language Models Understand Mimed Actions?

Metadata	Content
Conference	ACL 2025
arXiv	2506.21586
Code	justin-cho.com/mime
Area	Multimodal VLM
Keywords	Mime recognition, VLM evaluation, Action understanding, Non-verbal communication, Video QA

TL;DR

This paper proposes the Mime benchmark (86 mimed actions × 10 variations = 860 samples), built with motion capture and 3D rendering so that character, background, and viewpoint can be controlled independently. Humans keep near-100% multiple-choice accuracy under every perturbation reported here, whereas the strongest VLM reaches only 52.3% (multiple-choice) / 19.8% (free-form) on the base setting, indicating that VLMs rely heavily on scene context cues rather than on the action itself.

Background & Motivation

Research Question: Can Vision-Language Models (VLMs) reliably recognize mimed actions—a subset of non-verbal communication that conveys intent solely through body movements while removing crucial object contexts?

Core Argument: Mimed actions are a distinctive subset of non-verbal communication (NVC). Unlike other gestures, they are interpreted with extremely high consistency among humans and map directly onto physical actions. Understanding pantomime is therefore a prerequisite for VLMs to advance toward broader NVC understanding.

Limitations of Prior Work: VLMs perform exceptionally well on standard action recognition benchmarks, but actions in these benchmarks are always accompanied by complete contextual cues (e.g., weightlifting in a gym with barbells and sportswear). Such benchmarks cannot separate recognizing the motion from recognizing the scene; stripping the cues away tests whether a VLM understands the action itself.

Method

Overall Architecture

The benchmark is constructed using motion-capture (MoCap) and 3D computer graphics software (Blender), allowing flexible control over characters, backgrounds, and viewpoints to systematically evaluate the robustness of VLMs in mimed action recognition:

Data Construction Pipeline: (1) Vicon MoCap stage capture \(\rightarrow\) (2) 3D character retargeting in Blender \(\rightarrow\) (3) Transparent background frame rendering \(\rightarrow\) (4) Overlaying onto specified backgrounds.
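
Step (4) of the pipeline can be illustrated with standard "over" alpha compositing. The sketch below is not the authors' rendering code; it assumes each rendered frame is a grid of RGBA pixels and each background a same-sized grid of RGB pixels, and the function names `composite_pixel` and `overlay_frame` are hypothetical.

```python
from typing import List, Tuple

RGBA = Tuple[int, int, int, int]
RGB = Tuple[int, int, int]


def composite_pixel(fg: RGBA, bg: RGB) -> RGB:
    """Blend one foreground pixel over one background pixel ("over" operator)."""
    alpha = fg[3] / 255.0  # 0.0 = fully transparent, 1.0 = fully opaque
    return tuple(round(alpha * f + (1.0 - alpha) * b) for f, b in zip(fg[:3], bg))


def overlay_frame(frame: List[List[RGBA]], background: List[List[RGB]]) -> List[List[RGB]]:
    """Overlay a transparent-background frame onto a background of the same size."""
    same_rows = len(frame) == len(background)
    if not same_rows or any(len(fr) != len(br) for fr, br in zip(frame, background)):
        raise ValueError("frame and background must have identical dimensions")
    return [
        [composite_pixel(f, b) for f, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


# Toy 1x2 frame: an opaque red pixel (character) and a transparent pixel (empty space).
frame = [[(255, 0, 0, 255), (255, 0, 0, 0)]]
white_background = [[(255, 255, 255), (255, 255, 255)]]
composited = overlay_frame(frame, white_background)
```

Because the character is rendered once with a transparent background, the same frames can be reused over a blank, aligned, or adversarial background without re-rendering the motion.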

Key Designs

1. Motion Capture Data Collection:
- Brainstormed 75 candidate mimed actions (actions performed without their critical object context, such as playing a violin without a violin or swimming without water).
- Captured by 2 actors (1 amateur male, 1 professional female), with 3 takes each.
- Retained a sample only if at least 2 of the 3 authors could correctly identify the action.
- The final set contains 47 action types and 86 mime samples.
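
The retention rule above amounts to a simple vote filter. The following is a minimal sketch under the assumption that each recording carries one correct/incorrect judgment per author; the sample names and the helper `keep_sample` are hypothetical and not taken from the paper.

```python
from typing import Dict, List


def keep_sample(author_judgments: List[bool], min_correct: int = 2) -> bool:
    """Keep a recording if at least `min_correct` authors identified the action."""
    return sum(1 for correct in author_judgments if correct) >= min_correct


# Hypothetical judgments from the 3 authors for two recordings.
judgments: Dict[str, List[bool]] = {
    "play_violin_take1": [True, True, False],  # 2 of 3 correct -> kept
    "swim_take2": [True, False, False],        # 1 of 3 correct -> dropped
}
retained = [name for name, votes in judgments.items() if keep_sample(votes)]
```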

2. 10 Variations (Per Action):

Variation	Character	Background	Viewpoint
Base	Male human	Blank white	Frontal
Aligned Background	Male human	Matching action (e.g., basketball court)	Frontal
Adversarial Background	Male human	Mismatched (e.g., living room)	Frontal
Adversarial Character	Astronaut suit character	Blank white	Frontal
Female Character	Female human	Blank white	Frontal
90°/180°/270°	Male human	Blank white	Rotated
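
One way to picture the controlled design is as a small configuration grid crossed with the action list. The sketch below encodes only the rows listed in the table above (which expand to 8 configurations, while the summary states 10 variations; the remaining two are not specified here and are not guessed). The class `Variation`, its field names, and the string labels are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Variation:
    name: str
    character: str
    background: str
    viewpoint_deg: int  # camera rotation around the actor; 0 = frontal


# Only the configurations listed in the table; labels are hypothetical.
VARIATIONS: List[Variation] = [
    Variation("base", "male_human", "blank_white", 0),
    Variation("aligned_background", "male_human", "aligned", 0),
    Variation("adversarial_background", "male_human", "adversarial", 0),
    Variation("adversarial_character", "astronaut", "blank_white", 0),
    Variation("female_character", "female_human", "blank_white", 0),
] + [
    Variation(f"rotated_{deg}", "male_human", "blank_white", deg)
    for deg in (90, 180, 270)
]


def render_jobs(action_ids: Iterable[int]) -> List[Tuple[int, Variation]]:
    """Cross every action with every variation to enumerate render jobs."""
    return [(action_id, variation) for action_id in action_ids for variation in VARIATIONS]


jobs = render_jobs(range(86))  # 86 mime samples, as stated in the summary
```

Each variation changes exactly one factor relative to the base row, which is what lets a change in accuracy be attributed to that factor.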

3. Dual-Format Evaluation:
- Multiple-Choice (MC): 4 options; distractors are chosen to exclude actions semantically similar to the correct answer.
- Free-Form (FF): No options are given; a response counts as correct when the cosine similarity between the sentence embeddings of the response and the reference answer reaches the 0.5 threshold.
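
Both formats can be sketched with one cosine-similarity helper. This is an illustration, not the paper's evaluation code: embeddings are passed in as plain vectors (the sentence-embedding model is not specified in this summary), treating "reaches the threshold" as `>=` is an assumption, and filtering distractors by an embedding-similarity cutoff (`max_similarity`) is a stand-in for however the authors actually excluded similar actions. The toy 2-D embeddings are made up.

```python
import math
import random
from typing import Dict, List, Optional, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def is_free_form_correct(pred_emb: Sequence[float], gold_emb: Sequence[float],
                         threshold: float = 0.5) -> bool:
    """Free-form scoring: correct if the similarity reaches the threshold (assumed >=)."""
    return cosine_similarity(pred_emb, gold_emb) >= threshold


def sample_distractors(gold: str, embeddings: Dict[str, Sequence[float]], k: int = 3,
                       max_similarity: float = 0.5,
                       rng: Optional[random.Random] = None) -> List[str]:
    """Pick k distractors, skipping actions too similar to the gold label (assumed rule)."""
    rng = rng or random.Random(0)
    pool = [
        label for label, emb in embeddings.items()
        if label != gold and cosine_similarity(emb, embeddings[gold]) < max_similarity
    ]
    if len(pool) < k:
        raise ValueError("not enough dissimilar actions to build the options")
    return rng.sample(pool, k)


# Made-up 2-D embeddings purely for illustration.
toy_embeddings = {
    "playing violin": (1.0, 0.0),
    "playing cello": (0.9, 0.1),      # near-duplicate of the gold label
    "swimming": (0.0, 1.0),
    "climbing a ladder": (-1.0, 0.0),
    "shooting a basketball": (0.0, -1.0),
}
distractors = sample_distractors("playing violin", toy_embeddings)
```

In this toy setup "playing cello" is never offered as a distractor for "playing violin", which mirrors the stated intent of keeping near-synonyms out of the options.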

Experiments

Main Results: Mime vs Real

Model	Mime MC	Mime FF	Real MC	Real FF
Gemini 1.5 Flash	52.3%	19.8%	~100%	~95%
GPT-4o Mini	41.9%	11.6%	~99%	~92%
Qwen 2.5 VL (7B)	39.5%	5.8%	~97%	~85%
InternVL2.5 (8B)	31.4%	2.3%	~96%	~80%
Human	99.6%	89.5%	~100%	~95%
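
To put the Mime MC column in perspective, the percentages can be converted back to implied counts and compared with random guessing. The sketch assumes each model score is computed over the 86 base samples and that guessing among 4 options yields 25% in expectation; neither assumption is stated in the table itself.

```python
N_SAMPLES = 86          # samples per variation, per the summary
CHANCE_MC_PCT = 25.0    # uniform guess over 4 options (assumption)

mime_mc_pct = {
    "Gemini 1.5 Flash": 52.3,
    "GPT-4o Mini": 41.9,
    "Qwen 2.5 VL (7B)": 39.5,
    "InternVL2.5 (8B)": 31.4,
}


def implied_correct(accuracy_pct: float, n: int = N_SAMPLES) -> int:
    """Number of correct answers implied by a rounded accuracy percentage."""
    return round(accuracy_pct / 100.0 * n)


implied_counts = {model: implied_correct(acc) for model, acc in mime_mc_pct.items()}
margin_over_chance = {
    model: round(acc - CHANCE_MC_PCT, 1) for model, acc in mime_mc_pct.items()
}

for model in mime_mc_pct:
    print(f"{model}: {implied_counts[model]}/{N_SAMPLES} correct, "
          f"{margin_over_chance[model]:+.1f} points over chance")
```

Under these assumptions the weakest model in the table sits only a few points above the guessing baseline, while the human MC score is near the ceiling.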

Ablation on Background Perturbations

Model	Base (Blank)	Aligned Background	Adversarial Background
Gemini 1.5 Flash MC	52.3%	68.6%	37.2%
GPT-4o Mini MC	41.9%	66.3%	37.2%
Qwen 2.5 VL (7B) MC	39.5%	68.6%	32.6%
Human MC	99.6%	98.5%	99.2%
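
The effect of each background condition is easiest to read as a shift relative to the blank base. The short sketch below recomputes those shifts, in percentage points, from the table rows above; the helper name `background_shifts` is hypothetical.

```python
from typing import Dict, Tuple

# (base, aligned, adversarial) MC accuracy in percent, copied from the table.
background_mc: Dict[str, Tuple[float, float, float]] = {
    "Gemini 1.5 Flash": (52.3, 68.6, 37.2),
    "GPT-4o Mini": (41.9, 66.3, 37.2),
    "Qwen 2.5 VL (7B)": (39.5, 68.6, 32.6),
    "Human": (99.6, 98.5, 99.2),
}


def background_shifts(base: float, aligned: float, adversarial: float) -> Tuple[float, float]:
    """Return (aligned - base, adversarial - base) in percentage points."""
    return round(aligned - base, 1), round(adversarial - base, 1)


shifts = {name: background_shifts(*row) for name, row in background_mc.items()}

for name, (aligned_shift, adversarial_shift) in shifts.items():
    print(f"{name}: aligned {aligned_shift:+.1f}, adversarial {adversarial_shift:+.1f}")
```

Every model gains at least 16 points from an aligned background and loses accuracy under an adversarial one, while the human shifts stay within about one point in either direction.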

Ablation on Viewpoint Perturbations

Model	0°	90°	180°	270°	Std Dev ↓
Gemini 1.5 Flash MC	52.3%	47.7%	52.3%	53.5%	2.2
GPT-4o Mini MC	41.9%	47.7%	43.0%	47.7%	2.6
Human MC	99.6%	98.8%	98.8%	98.7%	0.4
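
The "Std Dev" column can be approximately reproduced from the four viewpoint accuracies. The summary does not say which estimator was used; the sketch below assumes the population standard deviation (`statistics.pstdev`), which lands within about 0.1 of the reported values.

```python
from statistics import pstdev
from typing import Dict, List

# MC accuracy in percent at 0, 90, 180, and 270 degrees, copied from the table.
viewpoint_mc: Dict[str, List[float]] = {
    "Gemini 1.5 Flash": [52.3, 47.7, 52.3, 53.5],
    "GPT-4o Mini": [41.9, 47.7, 43.0, 47.7],
    "Human": [99.6, 98.8, 98.8, 98.7],
}

# Population standard deviation across viewpoints (assumed estimator).
viewpoint_spread = {name: pstdev(scores) for name, scores in viewpoint_mc.items()}

for name, spread in viewpoint_spread.items():
    print(f"{name}: std dev across viewpoints = {spread:.2f} points")
```

A lower spread means accuracy depends less on camera angle, which is why the column is marked with a down arrow.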

Key Findings

Massive gap between VLMs and humans: Humans maintain ~99% MC accuracy across all variations, whereas the strongest VLM (Gemini 1.5 Flash) achieves only 52.3%, a gap of roughly 47 percentage points.
VLMs rely heavily on scene cues: The aligned background raises Gemini 1.5 Flash's MC accuracy from 52.3% to 68.6% (+16.3 percentage points) while human accuracy stays essentially flat (99.6% to 98.5%), suggesting that VLMs guess the action from the background rather than from the motion itself.
Adversarial backgrounds seriously mislead VLMs: Gemini 1.5 Flash drops from 52.3% to 37.2% (-15.1 percentage points), while human accuracy is essentially unchanged (99.6% to 99.2%).
Humans are highly robust to viewpoint and character changes: VLMs fluctuate more across viewpoints, with a standard deviation of 2.2 points for Gemini 1.5 Flash and 2.6 for GPT-4o Mini versus 0.4 for humans.
Chain of Thought offers no obvious help: Manual inspection of Gemini's CoT outputs reveals that 80% of errors stem from incorrect motion observation, whereas only 15% arise from incorrect reasoning over correct descriptions.
Few-shot prompting offers only marginal help for closed-source models: Even with in-context examples, performance remains far below that of humans.

Highlights & Insights

Elegant experimental design: Achieves complete decoupling of action, character, background, and viewpoint through MoCap and 3D rendering, enabling systematic ablation studies.
Clearly exposes the fundamental flaw in VLM action understanding: It is not a matter of "poor performance" but rather a lack of "actual understanding of human actions."
Constructs a "Real" control dataset to precisely quantify the impact of having versus lacking context cues on VLMs.
Rigorous human evaluation design: Involves 60 participants spanning 8 nationalities and diverse backgrounds.

Limitations & Future Work

The scale of 86 mimed actions is relatively limited and may not fully cover all types of everyday actions.
The 3D-rendered characters lack detailed facial expressions, which are critical components of non-verbal communication.
The motion capture data of only 2 actors might not sufficiently capture individual performance variations.
The multiple-choice distractors exclude semantically similar options, which might make the evaluation relatively easy.
The effects of factors such as video length and frame rate on VLM performance have not been fully explored.
Only single-action recognition is considered, without addressing action sequence understanding.

Related Work

Action Recognition Benchmarks: Traditional benchmarks provide complete contextual cues, complementing Mime.
VLM Video Understanding: Models like Qwen-VL, InternVL, Gemini, and GPT-4o Mini perform exceptionally well on standard video QA.
Non-verbal Communication Studies: Mehrabian (1972) and Poyatos (1983) laid the foundations for NVC research.
Human Pantomime Cognition: O'Reilly (1995) and Little & Firestone (2021) demonstrate that humans show extremely high consistency in recognizing mimed actions.
Contrastive Evaluation Paradigm: Isolates specific capabilities by controlling variables, such as controlling context cues to test action understanding in this paper.

Rating

Dimension	Rating
Novelty	⭐⭐⭐⭐⭐
Technical Depth	⭐⭐⭐⭐
Experimental Thoroughness	⭐⭐⭐⭐⭐
Writing Quality	⭐⭐⭐⭐
Overall Score	8.5/10