YoCausal: How Far is Video Generation from World Model? A Causality Perspective

Abstract

YoCausal asks whether video diffusion models (VDMs) truly understand causality or merely overfit to statistical temporal patterns. The authors present a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. Temporally reversing real-world videos yields counterfactual samples at zero cost, which makes the evaluation protocol arbitrarily extensible. Level 1 uses the Reverse Surprise Index (RSI) to quantify arrow-of-time perception via denoising loss; Level 2 uses the Causality Cognition Index (CCI), which leverages a VLM to separate causal from non-causal subsets. An evaluation of 13 state-of-the-art VDMs reveals a significant gap relative to human-level causal reasoning.

Key Contributions

- YoCausal benchmark: Two-level evaluation protocol using real-world video reversal as natural counterfactuals
- Reverse Surprise Index (RSI): Quantifies arrow-of-time perception via denoising loss differences between forward and reversed videos
- Causality Cognition Index (CCI): Uses a VLM to stratify datasets into causal and non-causal subsets, disentangling causal reasoning from temporal bias
- Finding: Perceiving the arrow of time does not imply understanding causality

Method Details

- Level 1 (RSI): Computes the denoising loss difference between original and temporally reversed videos. A high RSI indicates the model can distinguish forward from reversed time flow.
- Level 2 (CCI): Uses a VLM judge to classify video pairs as causal vs. non-causal, then measures how well VDMs assign higher likelihood to causal sequences.
- Data: Real-world videos are temporally reversed to create zero-cost counterfactual samples, avoiding the sim-to-real gap of synthetic benchmarks.
- Models evaluated: 13 state-of-the-art VDMs
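
The two levels can be illustrated with a minimal sketch. It is not the paper's implementation. The sign convention (reversed loss minus forward loss), the single noise level `sigma`, the paired-noise trick, and the toy denoisers are all assumptions, and `stratified_gap` is only one plausible way to contrast causal and non-causal subsets, not the published CCI formula. The causal labels are taken as given; in the benchmark they would come from a VLM judge.

```python
import random


def denoising_loss(denoiser, video, noise, sigma):
    """Mean squared error between the injected noise and the denoiser's estimate.

    `video` and `noise` are lists of frames; each frame is a flat list of floats.
    """
    noisy = [[x + sigma * n for x, n in zip(f, nf)] for f, nf in zip(video, noise)]
    predicted = denoiser(noisy, sigma)
    errors = [(p - n) ** 2
              for pf, nf in zip(predicted, noise)
              for p, n in zip(pf, nf)]
    return sum(errors) / len(errors)


def reverse_surprise(denoiser, video, sigma=0.5, n_draws=4, seed=0):
    """Average loss on the time-reversed clip minus loss on the original clip.

    Assumed sign convention: positive means the reversed clip is more surprising.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_draws):
        noise = [[rng.gauss(0.0, 1.0) for _ in frame] for frame in video]
        forward = denoising_loss(denoiser, video, noise, sigma)
        # Reverse the frame order of clip and noise together (paired noise).
        backward = denoising_loss(denoiser, video[::-1], noise[::-1], sigma)
        gaps.append(backward - forward)
    return sum(gaps) / len(gaps)


def stratified_gap(scores, is_causal):
    """Mean score on the causal subset minus mean score on the non-causal subset."""
    causal = [s for s, c in zip(scores, is_causal) if c]
    other = [s for s, c in zip(scores, is_causal) if not c]
    return sum(causal) / len(causal) - sum(other) / len(other)


# Toy stand-ins (hypothetical): an 8-frame clip whose brightness ramps up over
# time, and a "denoiser" that has memorised that forward ramp as its prior.
T = 8
RAMP = [[t / (T - 1)] * 16 for t in range(T)]


def ramp_denoiser(noisy, sigma):
    # Treats everything that deviates from the forward ramp as noise.
    return [[(x - r) / sigma for x, r in zip(f, rf)] for f, rf in zip(noisy, RAMP)]


def orderless_denoiser(noisy, sigma):
    # Predicts zero noise everywhere, so frame order cannot affect its loss.
    return [[0.0 for _ in f] for f in noisy]


rsi_directional = reverse_surprise(ramp_denoiser, RAMP)
rsi_symmetric = reverse_surprise(orderless_denoiser, RAMP)
gap = stratified_gap([0.9, 0.8, 0.1, 0.2], [True, True, False, False])
```

In this toy setup the ramp-memorising denoiser is strongly surprised by the reversed clip, while the order-agnostic one shows essentially no gap. With a real VDM, `denoiser` would wrap the model's noise-prediction call, and the per-video scores passed to `stratified_gap` would be the reverse-surprise values of the clips in each VLM-labelled subset.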

Key Results

- 35% of correct answers from mid-tier models are backed by physically invalid traces (a figure from related work on VLM auditing)
- Perceiving the arrow of time does not imply understanding causality
- A significant gap persists relative to human-level causal cognition
- 13 VDMs evaluated, including SOTA models

Limitations and Future Work

- CCI depends on the quality of the VLM judge, which may carry its own biases
- The current benchmark focuses on physical causality; it could be extended to other types of causality
- The gap between VDMs and human causal reasoning remains substantial

Relevance to Patrick's Research

YoCausal directly addresses a fundamental question for world models: whether video generation models billed as "world models" truly understand the causal structure of the physical world or only capture statistical patterns. The finding that correctly perceiving the arrow of time does not imply causal understanding matters when judging whether current video generation models deserve the "world model" label. The benchmark design, which uses temporally reversed real videos as counterfactuals, is methodologically innovative.