Can Vision Language Models Understand Mimed Actions?


ACL 2025 Findings

Evaluating VLMs' Understanding of Human Gestures through Mimed Actions:

A Foundational Prerequisite for Nonverbal Communication Understanding

Nonverbal communication (NVC) plays an integral role in human language, but studying NVC in general is challenging because of its broad scope and high variance in interpretation among individuals and cultures. However, mime - the theatrical technique of suggesting intent using only gesture, expression, and movement - is a subset of NVC that consists of explicit and embodied actions with much lower human interpretation variance. We argue that a solid understanding of mimed actions is a crucial prerequisite for vision-language models capable of interpreting and commanding more subtle aspects of NVC.

Mime Identification Multimodal Evaluation (MIME)

MIME Overview

Hence, we propose Mime Identification Multimodal Evaluation (MIME), a novel video-based question answering benchmark comprising 86 mimed actions.

The figure on the left shows a simplified illustration of a MIME sample: a single frame from a video of a 3D male character miming a basketball shot. We evaluate with both multiple-choice (MC) and free-form short answer (FF) formats. The former provides a list of options, which in effect supplies contextual information, while the latter requires the model to produce a short answer without such context and is therefore more challenging.
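To make the difference between the two formats concrete, here is a minimal sketch in Python of how each might be posed to a VLM. The prompt wording and option list are illustrative placeholders, not the exact prompts used in the paper.

```python
# Hypothetical sketch of the two MIME evaluation formats (not the paper's exact prompts).

def build_prompts(options: list[str]) -> dict[str, str]:
    """Return example multiple-choice (MC) and free-form (FF) prompts."""
    mc_options = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return {
        # MC: the option list itself leaks contextual information about the action.
        "mc": (
            "Watch the video and choose the action the character is miming.\n"
            f"{mc_options}\n"
            "Answer with a single letter."
        ),
        # FF: no options are given, so the model must identify the action unaided.
        "ff": "Watch the video and state, in a few words, what action the character is miming.",
    }


prompts = build_prompts([
    "shooting a basketball",
    "swinging a golf club",
    "rowing a boat",
    "climbing a ladder",
])
print(prompts["mc"])
print(prompts["ff"])
```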

Spoiler alert: Humans achieve near-perfect accuracy at identifying mimed actions regardless of evaluation format, adversarial perturbations, or the absence of salient context (e.g., a basketball, court, or basketball uniform), while VLMs struggle without salient context.

Sample MIME Videos

MIME Pipeline

MIME pipeline

We illustrate the pipeline for constructing MIME above. (1) We first collect motion capture data of a mimed action on a Vicon stage. (2) Then, a 3D character is retargeted to our motion capture data in Blender, a 3D computer graphics application. (3) Next, we render frames of the animation with a transparent background. (4) Because the frames have transparent backgrounds, we can easily overlay them onto images of our choice.
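Step (4) amounts to standard alpha compositing. Below is a minimal sketch using Pillow; the file names are illustrative and this is not the actual code used to build MIME.

```python
# Minimal alpha-compositing sketch with Pillow (illustrative, not the paper's pipeline code).
# A frame rendered with a transparent background (RGBA) is composited over an
# arbitrary background image using the frame's own alpha channel.
from PIL import Image

def overlay_frame(frame_path: str, background_path: str, out_path: str) -> None:
    frame = Image.open(frame_path).convert("RGBA")        # transparent render
    background = Image.open(background_path).convert("RGBA")
    background = background.resize(frame.size)            # match resolutions
    composite = Image.alpha_composite(background, frame)  # frame over background
    composite.convert("RGB").save(out_path)

# Example: place the same rendered frame over aligned vs. adversarial backgrounds.
# overlay_frame("frame_0001.png", "basketball_court.jpg", "aligned_0001.jpg")
# overlay_frame("frame_0001.png", "living_room.jpg", "adversarial_0001.jpg")
```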

MIME Variants

MIME variants

The main benefit of our setup is that we can flexibly permute different configurations for each action to ablate the robustness of a VLM's understanding of mimed actions.

(a, b, f, g) show the same animation with changes only to the camera angle; different body parts become occluded depending on the angle. (c) and (h) change only the character from (a): (c) is a female human character, while (h) is an adversarial character 😈 in a sci-fi spacesuit. (d) and (i) are variants of (a) and (h), respectively, with aligned backgrounds (=background, e.g., a basketball court for a basketball-related action), while (e) and (j) have adversarial backgrounds (≠background, e.g., a living room).
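Because each factor is controlled independently in the rendering setup, generating variants can be thought of as taking a Cartesian product over camera angles, characters, and backgrounds. The sketch below illustrates the idea with hypothetical factor values rather than the paper's exact configuration.

```python
# Illustrative enumeration of MIME-style variants (hypothetical factor values).
from itertools import product

camera_angles = ["front", "back", "left", "right"]        # e.g., panels (a, b, f, g)
characters = ["male", "female", "adversarial_spacesuit"]  # e.g., panels (a), (c), (h)
backgrounds = ["none", "aligned", "adversarial"]          # e.g., panels (a), (d/i), (e/j)

variants = [
    {"camera": cam, "character": char, "background": bg}
    for cam, char, bg in product(camera_angles, characters, backgrounds)
]
print(len(variants), "configurations per action")  # 4 * 3 * 3 = 36 in this sketch
```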

MIME Results

MIME main results

We evaluate various open-weight and API-based vision-language models on MIME. We find that humans are robust to all variations, whereas VLM performance drops under adversarial perturbations and improves only when the background provides signals aligned with the action! This motivates further research on instilling a more robust understanding of human gestures in VLMs.

Check out the paper for more details and other interesting findings!

We share interesting findings from attempts to improve performance on MIME, such as Chain-of-Thought (CoT) prompting, few-shot in-context learning, and fine-tuning! We also analyze the failure modes of CoT to identify why current VLMs fail so miserably on MIME.

BibTeX


        TBD