YoCausal asks whether video diffusion models (VDMs) truly understand causality or merely overfit to statistical temporal patterns. The authors present a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. Temporally reversing real-world videos yields counterfactual samples at zero cost, which makes the evaluation protocol arbitrarily extensible. Level 1 uses the Reverse Surprise Index (RSI) to quantify arrow-of-time perception via denoising loss; Level 2 uses the Causality Cognition Index (CCI), which leverages a VLM to distinguish causal from non-causal subsets. Evaluation of 13 state-of-the-art VDMs reveals a significant gap relative to human-level causal reasoning.
- YoCausal benchmark: Two-level evaluation protocol using real-world video reversal as natural counterfactuals
- Level 1 (RSI): Computes the difference in denoising loss between original and temporally reversed videos. A high RSI indicates the model can distinguish forward from reversed time flow.
- Related finding from work on VLM auditing: 35% of correct answers from mid-tier models are backed by physically invalid traces
- Limitation: CCI depends on the quality of the VLM judge, which could introduce its own biases
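
The Level 1 idea (compare denoising loss on a clip and on its time-reversed copy) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the function names, the toy denoisers, the fixed noise level, and in particular the choice to score the raw difference `reversed loss - forward loss`; the paper's exact RSI formula and normalization are not given in these notes. A real evaluation would replace the toy denoisers with a VDM's noise-prediction network.

```python
import numpy as np

def reverse_time(video):
    """Flip a video of shape (T, H, W, C) along the time axis."""
    return video[::-1]

def denoising_loss(denoise_fn, video, noise_level, rng):
    """MSE between the injected noise and the noise predicted by denoise_fn."""
    noise = rng.standard_normal(video.shape)
    noisy = video + noise_level * noise
    predicted = denoise_fn(noisy, noise_level)
    return float(np.mean((predicted - noise) ** 2))

def reverse_surprise(denoise_fn, video, noise_level=0.5, n_samples=8, seed=0):
    """Average of (reversed loss - forward loss); assumed scoring, not the paper's formula."""
    diffs = []
    for i in range(n_samples):
        # Same seed for both directions, so both clips receive identical noise.
        fwd = denoising_loss(denoise_fn, video, noise_level,
                             np.random.default_rng(seed + i))
        rev = denoising_loss(denoise_fn, reverse_time(video), noise_level,
                             np.random.default_rng(seed + i))
        diffs.append(rev - fwd)
    return float(np.mean(diffs))

def ramp_denoiser(noisy, noise_level):
    """Toy time-aware model: assumes frame t has constant intensity t."""
    prior = np.arange(noisy.shape[0], dtype=float).reshape(-1, 1, 1, 1)
    return (noisy - prior) / noise_level

def blind_denoiser(noisy, noise_level):
    """Toy model with no temporal knowledge: always predicts zero noise."""
    return np.zeros_like(noisy)

# Synthetic "video": 6 frames of 4x4 RGB whose brightness grows with time.
video = np.arange(6, dtype=float).reshape(-1, 1, 1, 1) * np.ones((6, 4, 4, 3))

score_aware = reverse_surprise(ramp_denoiser, video)
score_blind = reverse_surprise(blind_denoiser, video)
```

In this toy setup the time-aware denoiser is penalized on the reversed clip and gets a positive score, while the blind denoiser scores exactly zero because its loss is the same in both directions. Sharing the noise sample between the two directions is a design choice of this sketch that keeps the comparison from being dominated by sampling variance.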
YoCausal directly addresses a fundamental question for world models: whether video generation models that claim to be "world models" truly understand the causal structure of the physical world, or only capture statistical patterns. The finding that even correctly predicting the arrow of time does not imply causal understanding matters for judging whether current video generation models deserve the "world model" label. The benchmark design, which uses temporally reversed videos as counterfactuals, is methodologically innovative.