Abstract

YoCausal asks whether video diffusion models (VDMs) truly understand causality or merely overfit to statistical temporal patterns. The authors present a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. Temporally reversing real-world videos yields counterfactual samples at zero cost, so the evaluation protocol can be extended to arbitrary video collections. Level 1 uses the Reverse Surprise Index (RSI) to quantify arrow-of-time perception via denoising loss; Level 2 uses the Causality Cognition Index (CCI), which relies on a vision-language model (VLM) to separate causal from non-causal subsets. An evaluation of 13 state-of-the-art VDMs reveals a significant gap relative to human-level causal reasoning.

Key Contributions

- YoCausal benchmark: Two-level evaluation protocol using real-world video reversal as natural counterfactuals
- Reverse Surprise Index (RSI): Quantifies arrow-of-time perception via denoising loss differences between forward and reversed videos
- Causality Cognition Index (CCI): Uses a VLM to stratify datasets into causal and non-causal subsets, disentangling causal reasoning from temporal bias
- Finding: Perceiving the arrow of time does not imply understanding causality
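
The reversal and denoising-loss comparison can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the score is the plain difference between the denoising loss on the reversed clip and on the original clip, and the paper's exact RSI definition or normalization may differ. The names `reverse_video`, `denoising_loss`, `reverse_surprise`, and `toy_denoiser` are hypothetical; the toy denoiser stands in for a real VDM.

```python
import numpy as np

def reverse_video(video: np.ndarray) -> np.ndarray:
    """Flip a (T, H, W, C) clip along the time axis."""
    return video[::-1].copy()

def denoising_loss(video, denoiser, sigma=0.5, n_samples=8, seed=0):
    """Mean squared error between true and predicted noise, averaged over draws."""
    rng = np.random.default_rng(seed)
    losses = []
    for _ in range(n_samples):
        noise = rng.standard_normal(video.shape)
        noisy = video + sigma * noise
        pred = denoiser(noisy, sigma)
        losses.append(np.mean((pred - noise) ** 2))
    return float(np.mean(losses))

def reverse_surprise(video, denoiser, **kwargs):
    """Assumed RSI-style score: loss on the reversed clip minus loss on the original."""
    forward = denoising_loss(video, denoiser, **kwargs)
    backward = denoising_loss(reverse_video(video), denoiser, **kwargs)
    return backward - forward

def toy_denoiser(noisy, sigma):
    """Stand-in for a VDM (assumption): its prior is that brightness ramps up over time."""
    t = noisy.shape[0]
    ramp = np.linspace(0.0, 1.0, t).reshape(t, 1, 1, 1)
    return (noisy - ramp) / sigma

# A clip whose brightness increases over time matches the toy prior.
clip = np.linspace(0.0, 1.0, 8).reshape(8, 1, 1, 1) * np.ones((8, 4, 4, 3))
score = reverse_surprise(clip, toy_denoiser)
print(score)  # positive: the reversed clip violates the toy prior
```

Under this assumed definition, a score near zero means the model is indifferent to playback direction, while a larger positive score means the reversed clip is harder to denoise than the original.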

Method Details

- Level 1 (RSI): Computes denoising loss difference between original and temporally reversed videos. High RSI indicates the model can distinguish forward from reversed time flow.
- Level 2 (CCI): Uses a VLM judge to classify video pairs as causal vs. non-causal, then measures how well VDMs assign higher likelihood to causal sequences.
- Data: Real-world videos temporally reversed to create zero-cost counterfactual samples, avoiding the sim-to-real gap of synthetic benchmarks.
- Models evaluated: 13 state-of-the-art VDMs
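
The Level 2 stratification can be illustrated as follows. This is only a sketch under stated assumptions: the summary does not give the CCI formula, so the example simply compares the mean per-video score on the causal subset with the mean on the non-causal subset. The function name `stratified_gap` and the example numbers are hypothetical; in practice the boolean labels would come from the VLM judge and the scores from the VDM under test.

```python
import numpy as np

def stratified_gap(scores, is_causal):
    """Compare mean per-video scores on causal vs. non-causal subsets.

    scores    : per-video reverse-surprise scores (any real numbers)
    is_causal : per-video booleans, assumed to come from a VLM judge
    """
    scores = np.asarray(scores, dtype=float)
    is_causal = np.asarray(is_causal, dtype=bool)
    if scores.shape != is_causal.shape:
        raise ValueError("scores and labels must have the same length")
    if is_causal.all() or not is_causal.any():
        raise ValueError("need at least one video in each subset")
    causal_mean = float(scores[is_causal].mean())
    non_causal_mean = float(scores[~is_causal].mean())
    return {
        "causal": causal_mean,
        "non_causal": non_causal_mean,
        # Assumed summary statistic, not the paper's CCI definition.
        "gap": causal_mean - non_causal_mean,
    }

# Made-up scores for four videos, two labeled causal and two non-causal.
result = stratified_gap([0.8, 0.6, 0.1, 0.3], [True, True, False, False])
print(result)
```

Separating the two subsets is what lets a score that only reflects generic temporal bias be told apart from one that responds specifically to causal structure.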

Key Results

- Perceiving the arrow of time does not imply understanding causality
- A significant gap persists relative to human-level causal cognition
- 13 VDMs were evaluated, including SOTA models
- Context from related work on VLM auditing: 35% of correct answers from mid-tier models are backed by physically invalid traces

Limitations and Future Work

- CCI depends on the quality of the VLM judge, which may carry its own biases
- The current benchmark focuses on physical causality; it could be extended to other types of causality
- The gap between VDMs and human causal reasoning remains substantial

Relevance to Patrick's Research

YoCausal directly addresses a fundamental question for world models: whether video generation models described as "world models" truly understand the causal structure of the physical world or only capture statistical patterns. The finding that correctly perceiving the arrow of time does not imply causal understanding matters for judging whether current video generation models deserve the "world model" label. The benchmark design, which uses temporally reversed real videos as counterfactuals, is methodologically innovative.