Video Models Start to Solve Chess, Maze, Sudoku, Mental Rotation, and Raven's Matrices

VMEvalKit Team

Abstract

We introduce a benchmark that evaluates the reasoning abilities of video models across multiple tasks. We find that Sora-2 achieves a success rate above 60%. Our experiments establish a clear performance hierarchy across different model architectures and training methods. We develop a robust and scalable evaluation paradigm, together with a unified and modular VMEvalKit framework that supports diverse task generation and highly reliable assessments. The framework provides clear success signals, creating opportunities for future reinforcement learning (RL) fine-tuning to further improve reasoning consistency.

Framework Overview

Supported Models

40+ video generation models across 11 provider families

Commercial APIs

OpenAI Sora: Sora-2, Sora-2-Pro
Google Veo: Veo 2.0, 3.0, 3.1
Runway ML: Gen-3A, Gen-4 Turbo
Luma: Ray-2, Ray-Flash-2
WaveSpeed: WAN 2.1, 2.2 variants

Open Source

HunyuanVideo: 720p I2V
VideoCrafter: Text-guided synthesis
DynamiCrafter: 256p-1024p animation
Stable Video Diffusion
LTX-Video · Morphic

Supported Tasks

9 cognitive task types with procedural generation

Chess: Strategic thinking with mate-in-1 puzzles

Maze: Path-finding through mazes generated with Kruskal's algorithm

Sudoku: Logical constraint satisfaction

Mental Rotation: 3D object transformation

Raven's Matrices: Abstract pattern reasoning

Object Subtraction: Selective removal reasoning

Clock: Time-based reasoning

Mirror Clock: Reflected time reasoning
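
As a rough illustration of what "reflected time reasoning" involves, the sketch below computes the time read from an analog clock seen in a mirror. It assumes a 12-hour dial reflected left-to-right, in which case a time t maps to 12:00 minus t. The function name `mirror_time` is hypothetical and is not taken from VMEvalKit; the framework's own task generator may define the reflection differently.

```python
def mirror_time(hour: int, minute: int) -> tuple[int, int]:
    """Return the time read from an analog clock face mirrored left-to-right.

    Assumption: a 12-hour dial, so the reflection maps t to 12:00 - t.
    """
    total = (hour % 12) * 60 + minute          # minutes past 12:00
    mirrored = (12 * 60 - total) % (12 * 60)   # reflect across the 12-6 axis
    h, m = divmod(mirrored, 60)
    return (12 if h == 0 else h, m)            # show hour 0 as 12
```

For example, `mirror_time(3, 0)` gives `(9, 0)` under this assumption, since the hour hand pointing at 3 appears to point at 9 in the mirror.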

Task Pair Structure

Each Task Pair consists of three core components:

Initial State (first_frame.png): Starting point or problem setup

Final State (final_frame.png): Goal state or expected solution

Text Prompt (prompt.txt): Instructions for video model
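
The three components above can be read from disk with a few lines of Python. This is a minimal sketch that assumes the three files sit together in one directory per task pair; the `TaskPair` class and `load_task_pair` function are hypothetical names for illustration, not VMEvalKit's actual API.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TaskPair:
    first_frame: Path   # initial state image
    final_frame: Path   # goal state image
    prompt: str         # text instructions for the video model


def load_task_pair(task_dir: str) -> TaskPair:
    """Load one task pair from a directory holding the three files (assumed layout)."""
    root = Path(task_dir)
    first = root / "first_frame.png"
    final = root / "final_frame.png"
    prompt = root / "prompt.txt"
    # Fail early and name every missing file.
    missing = [p.name for p in (first, final, prompt) if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"missing task pair files: {missing}")
    return TaskPair(first, final, prompt.read_text(encoding="utf-8").strip())
```

Checking for all three files up front keeps an incomplete task pair from reaching a model call.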

Task Examples

Chess Task
Video models must generate legal chess moves and visualize complete game progression. This tests spatial reasoning, rule understanding, and sequential planning abilities.

Maze Task
Models navigate from start to goal through complex maze structures. This evaluates path-finding capabilities and spatial memory in video generation.
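
To make the maze setup concrete, here is a hedged sketch of how a maze could be generated with randomized Kruskal's algorithm and how a reference solution path could be computed with breadth-first search. The grid representation, the function names `kruskal_maze` and `shortest_path`, and the use of BFS for the solution are all assumptions for illustration; VMEvalKit's own generator and scoring may differ.

```python
import random
from collections import deque


def kruskal_maze(rows, cols, seed=0):
    """Return the open passages (pairs of adjacent cells) of a perfect maze."""
    rng = random.Random(seed)
    parent = {(r, c): (r, c) for r in range(rows) for c in range(cols)}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Every wall between two neighbouring cells is a candidate edge.
    walls = [((r, c), (r + dr, c + dc))
             for r in range(rows) for c in range(cols)
             for dr, dc in ((0, 1), (1, 0))
             if r + dr < rows and c + dc < cols]
    rng.shuffle(walls)

    passages = set()
    for a, b in walls:
        ra, rb = find(a), find(b)
        if ra != rb:              # removing this wall joins two separate regions
            parent[ra] = rb
            passages.add((a, b))
    return passages


def shortest_path(passages, start, goal):
    """Breadth-first search over open passages; returns the list of cells visited."""
    adj = {}
    for a, b in passages:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    prev = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        for nxt in adj.get(cell, []):
            if nxt not in prev:
                prev[nxt] = cell
                queue.append(nxt)
    if goal not in prev:
        return []
    path, cell = [], goal
    while cell is not None:       # walk back from goal to start
        path.append(cell)
        cell = prev[cell]
    return path[::-1]
```

Because Kruskal's algorithm yields a spanning tree of the grid, every pair of cells is connected by exactly one path, so a generated video can in principle be checked against a single reference route.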

Model Rankings

Model Performance Rankings: performance comparison across different video generation models on reasoning tasks.

BibTeX

@misc{VMEvalKit,
  author       = {VMEvalKit Team},
  title        = {VMEvalKit: A framework for evaluating reasoning abilities in foundational video models},
  year         = {2025},
  howpublished = {\url{https://github.com/Video-Reason/VMEvalKit}}
}