Can Vision Language Models Understand Mimed Actions?


ACL 2025 Findings

Evaluating VLMs' Understanding of Human Gestures through Mimed Actions:

A Foundational Prerequisite for Nonverbal Communication Understanding

Nonverbal communication (NVC) plays an integral role in human language, but studying NVC in general is challenging because of its broad scope and high variance in interpretation among individuals and cultures. However, mime - the theatrical technique of suggesting intent using only gesture, expression, and movement - is a subset of NVC that consists of explicit and embodied actions with much lower human interpretation variance. We argue that a solid understanding of mimed actions is a crucial prerequisite for vision-language models capable of interpreting and commanding more subtle aspects of NVC.

Mime Identification Multimodal Evaluation (MIME)

MIME Overview

Hence, we propose Mime Identification Multimodal Evaluation (MIME), a novel video-based question answering benchmark comprising 86 mimed actions.

The figure on the left shows a simplified illustration of a MIME sample: a single frame from a video of a 3D male character miming a basketball shot. We evaluate with both multiple-choice (MC) and free-form short answer (FF) formats. The former provides a list of options, which in effect supplies contextual information, while the latter requires the model to produce a short answer without such context and is therefore more challenging.
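To make the difference between the two formats concrete, here is a minimal sketch in Python of how each might be posed to a VLM. The prompt wording and option list are illustrative placeholders, not the exact prompts used in the paper.

```python
# Hypothetical sketch of the two MIME evaluation formats (not the paper's exact prompts).

def build_prompts(options: list[str]) -> dict[str, str]:
    """Return example multiple-choice (MC) and free-form (FF) prompts."""
    mc_options = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return {
        # MC: the option list itself leaks contextual information about the action.
        "mc": (
            "Watch the video and choose the action the character is miming.\n"
            f"{mc_options}\n"
            "Answer with a single letter."
        ),
        # FF: no options are given, so the model must identify the action unaided.
        "ff": "Watch the video and state, in a few words, what action the character is miming.",
    }


prompts = build_prompts([
    "shooting a basketball",
    "swinging a golf club",
    "rowing a boat",
    "climbing a ladder",
])
print(prompts["mc"])
print(prompts["ff"])
```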

Spoiler alert: Humans achieve near-perfect accuracy at identifying mimed actions regardless of evaluation format, adversarial perturbations, or the absence of salient context (e.g., a basketball, court, or basketball uniform), while VLMs struggle without salient context.

Sample MIME Videos

MIME Pipeline

MIME pipeline

We illustrate the pipeline for constructing MIME above. (1) We first collect motion capture data of a mimed action on a Vicon stage. (2) Then, a 3D character is retargeted to our motion capture data in Blender, a 3D computer graphics application. (3) Next, we render frames of the animation with a transparent background. (4) Because the frames have transparent backgrounds, we can easily overlay them onto images of our choice.
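Step (4) amounts to standard alpha compositing. Below is a minimal sketch using Pillow; the file names are illustrative and this is not the actual code used to build MIME.

```python
# Minimal alpha-compositing sketch with Pillow (illustrative, not the paper's pipeline code).
# A frame rendered with a transparent background (RGBA) is composited over an
# arbitrary background image using the frame's own alpha channel.
from PIL import Image

def overlay_frame(frame_path: str, background_path: str, out_path: str) -> None:
    frame = Image.open(frame_path).convert("RGBA")        # transparent render
    background = Image.open(background_path).convert("RGBA")
    background = background.resize(frame.size)            # match resolutions
    composite = Image.alpha_composite(background, frame)  # frame over background
    composite.convert("RGB").save(out_path)

# Example: place the same rendered frame over aligned vs. adversarial backgrounds.
# overlay_frame("frame_0001.png", "basketball_court.jpg", "aligned_0001.jpg")
# overlay_frame("frame_0001.png", "living_room.jpg", "adversarial_0001.jpg")
```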

MIME Variants

MIME variants

The main benefit of our setup is that we can flexibly permute different configurations for each action to ablate the robustness of a VLM's understanding of mimed actions.

(a, b, f, g) show the same animation with changes only to the camera angle; different body parts become occluded depending on the angle. (c) and (h) change only the character from (a): (c) is a female human character, while (h) is an adversarial character 😈 in a sci-fi spacesuit. (d) and (i) are variants of (a) and (h), respectively, with aligned backgrounds (=background, e.g., a basketball court for a basketball-related action), while (e) and (j) have adversarial backgrounds (≠background, e.g., a living room).
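Because each factor is controlled independently in the rendering setup, generating variants can be thought of as taking a Cartesian product over camera angles, characters, and backgrounds. The sketch below illustrates the idea with hypothetical factor values rather than the paper's exact configuration.

```python
# Illustrative enumeration of MIME-style variants (hypothetical factor values).
from itertools import product

camera_angles = ["front", "back", "left", "right"]        # e.g., panels (a, b, f, g)
characters = ["male", "female", "adversarial_spacesuit"]  # e.g., panels (a), (c), (h)
backgrounds = ["none", "aligned", "adversarial"]          # e.g., panels (a), (d/i), (e/j)

variants = [
    {"camera": cam, "character": char, "background": bg}
    for cam, char, bg in product(camera_angles, characters, backgrounds)
]
print(len(variants), "configurations per action")  # 4 * 3 * 3 = 36 in this sketch
```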

MIME Results

MIME main results

We evaluate various open-weight and API-based vision-language models on MIME. We find that humans are robust to all variations, whereas VLM performance drops under adversarial perturbations and improves only when the background provides signals aligned with the action! This motivates further research on instilling a more robust understanding of human gestures in VLMs.

Check out the paper for more details and other interesting findings!

We share interesting findings from attempts to improve performance on MIME, such as Chain-of-Thought (CoT) prompting, few-shot in-context learning, and fine-tuning! We also analyze the failure modes of CoT to identify why current VLMs fail so miserably on MIME.

BibTeX


        TBD