Hence, we propose Mime Identification Multimodal Evaluation (MIME), a novel video-based question-answering benchmark comprising 86 mimed actions.
The figure on the left presents a simplified illustration of a MIME sample: a single frame from a video of a 3D male character miming a basketball shot.
We evaluate with both multiple-choice (MC) and free-form short-answer (FF) formats: the former provides a list of options, which in effect supplies contextual information, while the latter requires the model to produce a short answer without such context and is therefore more challenging (see the sketch below).
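To make the distinction between the two formats concrete, here is a minimal sketch of how a MIME-style sample might be posed to a VLM under each; the sample fields, distractor options, and `query_vlm` helper are hypothetical illustrations under assumed conventions, not the paper's actual evaluation harness.

```python
# Minimal sketch of the MC vs. FF evaluation formats (hypothetical
# sample fields and query_vlm() helper; not the paper's actual harness).

sample = {
    "video": "mime_basketball_shot.mp4",   # 3D character miming, no props
    "answer": "shooting a basketball",
    "options": [                           # used only in the MC format
        "shooting a basketball",
        "throwing a dart",
        "swatting a fly",
        "waving goodbye",
    ],
}

# Multiple-choice (MC): the option list itself leaks contextual hints.
mc_prompt = (
    "What action is the character miming? Choose one:\n"
    + "\n".join(f"({chr(65 + i)}) {opt}"
                for i, opt in enumerate(sample["options"]))
)

# Free-form (FF): no options, so the model receives no such context.
ff_prompt = "In a few words, what action is the character miming?"

# mc_answer = query_vlm(sample["video"], mc_prompt)  # hypothetical helper
# ff_answer = query_vlm(sample["video"], ff_prompt)
```

The MC format is easier precisely because the candidate answers narrow the hypothesis space before the model inspects the video, whereas the FF format forces recognition from motion alone.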
Spoiler alert: humans achieve near-perfect accuracy at identifying mimed actions regardless of evaluation format, adversarial perturbations, or the absence of salient context (e.g., basketball, court, basketball uniform), whereas VLMs struggle without such context.