ImageTime: A Benchmark for Probing Visual World Modeling Through Spatiotemporal Consistency

Abstract

ImageTime is a diagnostic benchmark that tests whether image generation models can coherently "imagine time" — preserve identities, objects, spatial relations, and causal order across multiple visual states. Given an action instruction (and optionally a reference image), a model must produce one image containing four ordered key states: initial state → action onset → transition state → final state. This four-keyframe protocol is more temporally demanding than single-image generation but avoids the confounds of dense video dynamics. Tasks are organized in a progressive capability hierarchy with stage-wise state predicates, cross-frame temporal constraints, and forbidden causal violations, scored by GPT-5.5 under a VLM-as-judge protocol.

Key Contributions

Four-keyframe protocol — one image with four ordered states (initial, onset, transition, final), positioned between single-image generation and dense video
Progressive capability hierarchy — tasks structured so each level demands more temporal reasoning
Structured evaluation — stage-wise state predicates + cross-frame temporal constraints + forbidden causal violations
GPT-5.5 VLM-as-judge — interpretable capability scores, diagnostic subscores, and failure labels
Multi-family benchmark — reveals where current image generation systems succeed, fail, and drift on coherent visual world modeling

Method Details

Benchmark design:

Task formulation: given an action instruction a (and optionally a reference image I_ref specifying the initial state), generate an image containing four ordered keyframes — initial state s_0, action onset s_1, transition state s_2, final state s_3
Progressive capability hierarchy: tasks are organized into levels that demand progressively more sophisticated temporal reasoning (preservation of identity, spatial relations, causal order, etc.)
Per-task decomposition: each scenario decomposes into:

Stage-wise state predicates — what must be true in each keyframe
Cross-frame temporal constraints — what must be preserved across the sequence
Forbidden causal violations — what physical/causal events must not occur
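
The three-part decomposition above can be pictured as a small data structure. The following is a minimal sketch, not the benchmark's actual schema: the class name `TaskSpec`, its field names, the stage labels, and the pouring scenario are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# The four ordered keyframes s_0 .. s_3 (labels are illustrative assumptions).
STAGES = ("initial", "onset", "transition", "final")


@dataclass
class TaskSpec:
    """Hypothetical container for one scenario's decomposition."""
    instruction: str
    # stage name -> statements that must be true in that keyframe
    stage_predicates: dict[str, list[str]]
    # statements that must be preserved across the whole sequence
    temporal_constraints: list[str] = field(default_factory=list)
    # physical/causal events that must not occur
    forbidden_violations: list[str] = field(default_factory=list)

    def validate(self) -> None:
        """Raise if any of the four stages has no predicate."""
        missing = [s for s in STAGES if not self.stage_predicates.get(s)]
        if missing:
            raise ValueError(f"no predicates for stages: {missing}")

    def checklist(self) -> list[tuple[str, str]]:
        """Flatten the spec into (kind, statement) items, in stage order."""
        items = [(f"stage:{s}", p) for s in STAGES for p in self.stage_predicates[s]]
        items += [("temporal", c) for c in self.temporal_constraints]
        items += [("forbidden", v) for v in self.forbidden_violations]
        return items


# Made-up scenario, not taken from the benchmark.
example = TaskSpec(
    instruction="pour water from the jug into the empty glass",
    stage_predicates={
        "initial": ["glass is empty", "jug is upright"],
        "onset": ["jug is tilted toward the glass"],
        "transition": ["glass is partly filled"],
        "final": ["glass is full", "jug is upright"],
    },
    temporal_constraints=["same glass and jug appear in all four keyframes"],
    forbidden_violations=["water level in the glass decreases between keyframes"],
)
example.validate()
```

Flattening the spec into a checklist keeps each judged item tied to its kind (stage, temporal, or forbidden), which is what makes per-kind diagnostics possible downstream.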

VLM-as-judge (GPT-5.5): a structured scoring protocol produces:

Capability scores — overall temporal reasoning ability
Diagnostic subscores — per-capability breakdowns
Failure labels — categorical failure modes for error analysis
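
To make the three judge outputs concrete, here is a minimal sketch of how per-item verdicts could be rolled up into them. The aggregation rule (pass rate per capability, then an unweighted mean) is an assumption made for illustration, not the paper's scoring formula; `Verdict`, `summarize`, the capability names, and the failure label are likewise hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Verdict:
    """One judged checklist item (hypothetical structure)."""
    kind: str             # "stage", "temporal", or "forbidden"
    capability: str       # hypothetical subscore name, e.g. "identity"
    passed: bool          # for "forbidden": True means the violation did NOT occur
    failure_label: str = ""


def summarize(verdicts):
    """Roll verdicts up into a capability score, subscores, and failure labels."""
    per_capability = defaultdict(list)
    for v in verdicts:
        per_capability[v.capability].append(v.passed)
    # Diagnostic subscore = pass rate within each capability (assumed rule).
    subscores = {c: sum(p) / len(p) for c, p in per_capability.items()}
    # Capability score = unweighted mean of subscores (assumed rule).
    score = sum(subscores.values()) / len(subscores) if subscores else 0.0
    # Failure labels = distinct labels attached to failed items.
    labels = sorted({v.failure_label for v in verdicts if not v.passed and v.failure_label})
    return {"capability_score": score, "subscores": subscores, "failure_labels": labels}


# Made-up verdicts for a single generated image.
report = summarize([
    Verdict("stage", "causal_order", True),
    Verdict("stage", "causal_order", False, "skipped_transition"),
    Verdict("temporal", "identity", True),
    Verdict("forbidden", "causal_order", True),
])
```

A weighted or hierarchical rule would slot into `summarize` without changing the verdict structure.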

Multi-family evaluation: results span multiple image-generation model families to map where each succeeds, fails, and drifts
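
One simple way to read a multi-family comparison is to average subscores per family and flag each family's weakest capability. The sketch below assumes rows of (family, capability, score); the family names, capability names, and numbers are placeholders, not results from the paper.

```python
from collections import defaultdict
from statistics import mean


def family_profile(rows):
    """rows: iterable of (family, capability, score in [0, 1])."""
    grouped = defaultdict(lambda: defaultdict(list))
    for family, capability, score in rows:
        grouped[family][capability].append(score)
    profile = {}
    for family, caps in grouped.items():
        means = {c: mean(s) for c, s in caps.items()}
        # The weakest capability is the one with the lowest mean subscore.
        profile[family] = {"means": means, "weakest": min(means, key=means.get)}
    return profile


# Placeholder rows for illustration only.
rows = [
    ("family_A", "identity", 0.9),
    ("family_A", "identity", 0.7),
    ("family_A", "causal_order", 0.5),
    ("family_B", "identity", 0.6),
    ("family_B", "causal_order", 0.8),
]
profile = family_profile(rows)
```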

The design deliberately sits between single-image generation (no time) and dense video generation (confounded by motion/physics modeling) — it isolates temporal coherence as a property of the image model itself.

Key Results

ImageTime exposes failure modes of current image generation models on temporally-ordered visual states that single-image benchmarks miss
The four-keyframe protocol is more temporally demanding than single-image generation while avoiding dense video's physics confound
The abstract does not enumerate numerical results; per-family numbers are reported in the paper body. The headline claim is that no tested model fully solves the progressive capability hierarchy
Failure labels and diagnostic subscores give actionable error analysis per model family

Limitations and Future Work

A four-keyframe image is a proxy for temporal reasoning — it tests discrete state transitions but not continuous dynamics
The VLM-as-judge protocol inherits any biases of its judge model (GPT-5.5); agreement with human raters needs separate validation
The benchmark focuses on visual coherence; semantic plausibility and physical correctness beyond "forbidden causal violations" are not exhaustively covered
Extension to longer horizons (≥8 keyframes) and to reference-guided editing / previsualization workflows is open
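
For the judge-versus-human validation noted above, one standard chance-corrected agreement statistic is Cohen's kappa. This is a minimal sketch assuming one categorical label per item from each rater; the paper does not say which agreement measure, if any, it uses, and the labels below are made up.

```python
from collections import Counter


def cohens_kappa(judge, human):
    """Cohen's kappa between two equally long sequences of categorical labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(judge)
    # Observed agreement: fraction of items where both raters give the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    jc, hc = Counter(judge), Counter(human)
    expected = sum(jc[k] * hc[k] for k in jc.keys() | hc.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels: the raters agree on 3 of 4 items.
kappa = cohens_kappa(
    ["pass", "pass", "fail", "fail"],
    ["pass", "fail", "fail", "fail"],
)
```

Kappa of 1 means perfect agreement and 0 means agreement no better than chance, so it is more informative than raw percent agreement when one label dominates.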

Relevance to Patrick's Research

ImageTime is a sharp empirical probe of a question that matters for world modeling: can a static image model reason about discrete time? For Patrick's tracking, it sits in the evaluation stream (alongside WorldScore, MIRA-Bench, etc.) and complements video-world-model benchmarks by isolating temporal coherence from motion dynamics. The four-keyframe protocol is also operationally simple — easy to run on new image models as they ship. The VLM-as-judge methodology with structured subscores is a template Patrick could adapt for other world-model evaluations.