Video Models Start to Solve Chess, Maze, Sudoku, Mental Rotation, and Raven's Matrices

VMEvalKit Team · 2025

Abstract

We introduce a benchmark that evaluates the reasoning abilities of video models across multiple cognitive tasks. We find that Sora-2 achieves a success rate above 60%, and our experiments establish a clear performance hierarchy across model architectures and training methods. We develop a robust, scalable evaluation paradigm together with VMEvalKit, a unified and modular framework that supports diverse task generation and highly reliable assessment. Because the framework provides clear success signals, it also creates opportunities for future reinforcement learning (RL) fine-tuning to further improve reasoning consistency.

Framework Overview

Figure: VMEvalKit framework overview.

Supported Models

40+ video generation models across 11 provider families

Commercial APIs

  • OpenAI Sora: Sora-2, Sora-2-Pro
  • Google Veo: Veo 2.0, 3.0, 3.1
  • Runway ML: Gen-3 Alpha, Gen-4 Turbo
  • Luma: Ray-2, Ray-Flash-2
  • WaveSpeed: WAN 2.1 and 2.2 variants

Open Source

  • HunyuanVideo: 720p image-to-video (I2V)
  • VideoCrafter: Text-guided synthesis
  • DynamiCrafter: 256p–1024p animation
  • Stable Video Diffusion
  • LTX-Video · Morphic
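Because these providers differ in APIs and conditioning inputs, a modular framework typically hides them behind a single interface. Below is a minimal sketch of such an adapter layer; all class and registry names are hypothetical illustrations, not VMEvalKit's actual code.

from abc import ABC, abstractmethod

class VideoModel(ABC):
    """Provider-agnostic interface (hypothetical sketch, not VMEvalKit's
    actual classes). Each provider family contributes one adapter."""

    @abstractmethod
    def generate(self, first_frame: str, prompt: str) -> str:
        """Generate a video conditioned on an initial frame and a text
        prompt; return the path to the resulting video file."""

class Sora2Adapter(VideoModel):
    """Placeholder adapter; a real one would call the provider's API."""
    def generate(self, first_frame: str, prompt: str) -> str:
        raise NotImplementedError

# Registry mapping model names to adapters, one entry per supported model.
MODEL_REGISTRY = {"sora-2": Sora2Adapter}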

Supported Tasks

Eight cognitive task types with procedural generation

  • Chess: Strategic thinking with mate-in-1 puzzles
  • Maze: Path-finding through mazes generated with Kruskal's algorithm (see the sketch after this list)
  • Sudoku: Logical constraint satisfaction
  • Mental Rotation: 3D object transformation
  • Raven's Matrices: Abstract pattern reasoning
  • Object Subtraction: Selective removal reasoning
  • Clock: Time-based reasoning
  • Mirror Clock: Reflected time reasoning
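For the maze task, a perfect maze (exactly one path between any two cells) can be generated with randomized Kruskal's algorithm: treat every wall between adjacent cells as a candidate edge, shuffle, and remove a wall whenever it connects two previously unconnected cells. A minimal sketch follows; the function name and grid representation are illustrative, not VMEvalKit's actual generator.

import random

def kruskal_maze(width, height, seed=None):
    """Generate a perfect maze with randomized Kruskal's algorithm.
    Cells are grid coordinates; the result is the set of removed walls,
    i.e. the passages between adjacent cells. (Illustrative sketch.)"""
    rng = random.Random(seed)
    # Union-find over cells: each cell starts in its own set.
    parent = {(x, y): (x, y) for x in range(width) for y in range(height)}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path compression
            c = parent[c]
        return c

    # Every wall between horizontally or vertically adjacent cells.
    walls = [((x, y), (x + 1, y)) for x in range(width - 1) for y in range(height)]
    walls += [((x, y), (x, y + 1)) for x in range(width) for y in range(height - 1)]
    rng.shuffle(walls)

    passages = set()
    for a, b in walls:
        ra, rb = find(a), find(b)
        if ra != rb:              # cells not yet connected:
            parent[ra] = rb       # merge their sets and
            passages.add((a, b))  # knock down the wall between them
    return passages  # spanning tree: exactly one path between any two cells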

Task Pair Structure


Each Task Pair consists of three core components:

  • Initial State (first_frame.png): The starting point or problem setup
  • Final State (final_frame.png): The goal state or expected solution
  • Text Prompt (prompt.txt): Instructions for the video model
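A task pair on disk is thus just three files. The following sketch shows one way to load them; the file names match the layout above, while the class and function names are assumptions for illustration rather than VMEvalKit's actual data model.

from dataclasses import dataclass
from pathlib import Path

@dataclass
class TaskPair:
    """One benchmark instance: initial state, goal state, and prompt."""
    first_frame: Path  # initial state image
    final_frame: Path  # goal state image
    prompt: str        # instructions for the video model

def load_task_pair(task_dir: str) -> TaskPair:
    d = Path(task_dir)
    return TaskPair(
        first_frame=d / "first_frame.png",
        final_frame=d / "final_frame.png",
        prompt=(d / "prompt.txt").read_text().strip(),
    )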

Task Examples

Chess Task

Video models must generate a legal chess move and visualize the complete move progression on the board. This tests spatial reasoning, rule understanding, and sequential planning.
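Success on this task is mechanically checkable. Assuming the board position and the played move have already been extracted from the generated video (that extraction step is outside this sketch), the python-chess library can verify legality and mate-in-1 in a few lines; the function name is illustrative, not VMEvalKit's actual scorer.

import chess  # pip install python-chess

def is_mate_in_one(fen: str, move_uci: str) -> bool:
    """Check that the move played from the given position is legal and
    delivers immediate checkmate. (Illustrative sketch.)"""
    board = chess.Board(fen)
    move = chess.Move.from_uci(move_uci)
    if move not in board.legal_moves:
        return False  # illegal move: automatic failure
    board.push(move)
    return board.is_checkmate()

# Example: back-rank mate. White plays Ra8#.
print(is_mate_in_one("6k1/5ppp/8/8/8/8/8/R5K1 w - - 0 1", "a1a8"))  # True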

Maze Task

Models navigate from start to goal through complex maze structures. This evaluates path-finding capabilities and spatial memory in video generation.
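Ground truth for the maze task is equally cheap to compute: breadth-first search over the maze's passage graph yields the shortest start-to-goal path, against which a generated trajectory can be checked. A sketch, reusing the passages representation from the Kruskal example above (an illustrative format, not VMEvalKit's actual one):

from collections import deque

def shortest_path_length(passages, start, goal):
    """BFS over the maze's passage graph; returns the number of steps on
    the path from start to goal, or None if the goal is unreachable."""
    adj = {}
    for a, b in passages:  # passages are undirected
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    frontier, dist = deque([start]), {start: 0}
    while frontier:
        cell = frontier.popleft()
        if cell == goal:
            return dist[cell]
        for nxt in adj.get(cell, ()):
            if nxt not in dist:
                dist[nxt] = dist[cell] + 1
                frontier.append(nxt)
    return None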

Model Rankings

Figure: Performance comparison across different video generation models on reasoning tasks.

BibTeX

@misc{VMEvalKit,
  author       = {VMEvalKit Team},
  title        = {VMEvalKit: A framework for evaluating reasoning abilities in foundational video models},
  year         = {2025},
  howpublished = {\url{https://github.com/Video-Reason/VMEvalKit}}
}