Abstract
ImageTime is a diagnostic benchmark that tests whether image generation models can coherently "imagine time" — preserve identities, objects, spatial relations, and causal order across multiple visual states. Given an action instruction (and optionally a reference image), a model must produce one image containing four ordered key states: initial state → action onset → transition state → final state. This four-keyframe protocol is more temporally demanding than single-image generation but avoids the confounds of dense video dynamics. Tasks are organized in a progressive capability hierarchy with stage-wise state predicates, cross-frame temporal constraints, and forbidden causal violations, scored by GPT-5.5 under a VLM-as-judge protocol.
Key Contributions
- Four-keyframe protocol — one image with four ordered states (initial, onset, transition, final), positioned between single-image generation and dense video
- Progressive capability hierarchy — tasks structured so each level demands more temporal reasoning
- Structured evaluation — stage-wise state predicates + cross-frame temporal constraints + forbidden causal violations
- GPT-5.5 VLM-as-judge — interpretable capability scores, diagnostic subscores, and failure labels
- Multi-family benchmark — reveals where current image generation systems succeed, fail, and drift on coherent visual world modeling
Method Details
Benchmark design:
- Task formulation: given an action instruction a (and optionally a reference image I_ref specifying the initial state), generate an image containing four ordered keyframes — initial state s_0, action onset s_1, transition state s_2, final state s_3
- Progressive capability hierarchy: tasks are organized into levels that demand progressively more sophisticated temporal reasoning (preservation of identity, spatial relations, causal order, etc.)
- Per-task decomposition: each scenario decomposes into:
  - Stage-wise state predicates: what must be true in each keyframe
  - Cross-frame temporal constraints: what must be preserved across the sequence
  - Forbidden causal violations: what physical/causal events must not occur
- VLM-as-judge (GPT-5.5): a structured scoring protocol produces:
  - Capability scores: overall temporal reasoning ability
  - Diagnostic subscores: per-capability breakdowns
  - Failure labels: categorical failure modes for error analysis
- Multi-family evaluation: results span multiple image-generation model families to map where each succeeds, fails, or drifts
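The per-task decomposition and the judge outputs listed above can be pictured as two small records plus an aggregation step. The following is a minimal sketch, not the paper's implementation: the class names `TaskSpec` and `JudgeVerdict`, the function `summarize`, the example scenario, the label strings, and the equal-weight averaging are all assumptions introduced for illustration. The summary does not say how the actual benchmark weights or combines its subscores.

```python
from dataclasses import dataclass, field

# The four ordered keyframes s_0..s_3 from the task formulation.
STAGES = ("initial", "onset", "transition", "final")


@dataclass
class TaskSpec:
    """Hypothetical container for one benchmark scenario."""
    instruction: str
    stage_predicates: dict[str, list[str]]  # what must be true in each keyframe
    temporal_constraints: list[str] = field(default_factory=list)
    forbidden_violations: list[str] = field(default_factory=list)


@dataclass
class JudgeVerdict:
    """Hypothetical judge output: one boolean per checked item."""
    stage_passed: dict[str, list[bool]]
    constraints_passed: list[bool]
    violations_detected: list[bool]


def _fraction(flags: list[bool]) -> float:
    # An empty checklist counts as fully satisfied.
    return sum(flags) / len(flags) if flags else 1.0


def summarize(spec: TaskSpec, verdict: JudgeVerdict) -> dict:
    """Turn raw verdicts into subscores, failure labels, and an overall score."""
    subscores: dict[str, float] = {}
    failure_labels: list[str] = []
    for stage in STAGES:
        predicates = spec.stage_predicates[stage]
        passed = verdict.stage_passed[stage]
        if len(predicates) != len(passed):
            raise ValueError(f"verdict does not cover every predicate for {stage!r}")
        subscores[stage] = _fraction(passed)
        failure_labels += [f"stage:{stage}:{p}" for p, ok in zip(predicates, passed) if not ok]
    if len(spec.temporal_constraints) != len(verdict.constraints_passed):
        raise ValueError("verdict does not cover every temporal constraint")
    if len(spec.forbidden_violations) != len(verdict.violations_detected):
        raise ValueError("verdict does not cover every forbidden violation")
    subscores["temporal"] = _fraction(verdict.constraints_passed)
    failure_labels += [
        f"temporal:{c}"
        for c, ok in zip(spec.temporal_constraints, verdict.constraints_passed)
        if not ok
    ]
    # The causal subscore is the share of forbidden violations that did NOT occur.
    subscores["causal"] = _fraction([not hit for hit in verdict.violations_detected])
    failure_labels += [
        f"violation:{v}"
        for v, hit in zip(spec.forbidden_violations, verdict.violations_detected)
        if hit
    ]
    # Assumption: equal weighting across subscores.
    overall = sum(subscores.values()) / len(subscores)
    return {"overall": overall, "subscores": subscores, "failure_labels": failure_labels}


# Illustrative scenario (invented for this sketch, not taken from the benchmark).
spec = TaskSpec(
    instruction="pour water from a jug into an empty glass",
    stage_predicates={
        "initial": ["glass is empty", "jug is upright"],
        "onset": ["jug is tilted toward the glass"],
        "transition": ["glass is partly filled"],
        "final": ["glass is full", "jug is upright"],
    },
    temporal_constraints=["same glass in all frames", "water level never decreases"],
    forbidden_violations=["glass fills before the jug is tilted"],
)
verdict = JudgeVerdict(
    stage_passed={
        "initial": [True, True],
        "onset": [True],
        "transition": [True],
        "final": [True, False],
    },
    constraints_passed=[True, False],
    violations_detected=[False],
)
report = summarize(spec, verdict)
```

Keeping the failure labels next to the numeric subscores mirrors the stated goal of the protocol: a score says how well a model did, and the labels say which predicate or constraint it broke.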
The design deliberately sits between single-image generation (no time) and dense video generation (confounded by motion/physics modeling) — it isolates temporal coherence as a property of the image model itself.
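Because all four states live in one image, any pipeline that inspects keyframes individually first has to locate each panel. The sketch below computes crop boxes for the four panels in reading order. The panel layout is an assumption: this summary does not say whether the benchmark arranges keyframes as a single row, a 2x2 grid, or something else, and the function name `keyframe_boxes` is hypothetical. The boxes use the `(left, upper, right, lower)` convention, which is the tuple order Pillow's `Image.crop` accepts.

```python
def keyframe_boxes(width: int, height: int, layout: str = "row") -> list[tuple[int, int, int, int]]:
    """Return (left, upper, right, lower) boxes for keyframes s_0..s_3 in reading order.

    layout="row"  assumes four panels side by side.
    layout="grid" assumes a 2x2 grid read left to right, top to bottom.
    """
    if layout == "row":
        cols, rows = 4, 1
    elif layout == "grid":
        cols, rows = 2, 2
    else:
        raise ValueError(f"unknown layout: {layout!r}")

    boxes = []
    for index in range(4):
        row, col = divmod(index, cols)
        # Integer arithmetic keeps adjacent panels flush even when the
        # image size is not an exact multiple of the panel count.
        left = col * width // cols
        right = (col + 1) * width // cols
        upper = row * height // rows
        lower = (row + 1) * height // rows
        boxes.append((left, upper, right, lower))
    return boxes


row_boxes = keyframe_boxes(1024, 256, layout="row")
grid_boxes = keyframe_boxes(512, 512, layout="grid")
```

A judge that sees the whole composite image may not need this step at all; it matters mainly for per-panel checks such as the stage-wise state predicates.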
Key Results
- ImageTime exposes failure modes of current image generation models on temporally-ordered visual states that single-image benchmarks miss
- The four-keyframe protocol is more temporally demanding than single-image generation while avoiding dense video's physics confound
- Specific numerical results across model families are reported in the paper body; the abstract does not enumerate them. The headline claim is that no tested model fully solves the progressive capability hierarchy
- Failure labels and diagnostic subscores give actionable error analysis per model family
Limitations and Future Work
- A four-keyframe image is a proxy for temporal reasoning — it tests discrete state transitions but not continuous dynamics
- VLM-as-judge (GPT-5.5) inherits any biases in the judge model; agreement with human raters needs separate validation
- The benchmark focuses on visual coherence; semantic plausibility and physical correctness beyond "forbidden causal violations" are not exhaustively covered
- Extension to longer horizons (≥8 keyframes) and to reference-guided editing / previsualization workflows is open
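One common way to run the judge-versus-human validation mentioned above is a chance-corrected agreement statistic such as Cohen's kappa. The sketch below is a plain-Python illustration of that statistic, not a procedure described in the paper. The label lists are invented, and the function name `cohens_kappa` is hypothetical.

```python
from collections import Counter


def cohens_kappa(judge_labels: list[str], human_labels: list[str]) -> float:
    """Cohen's kappa between two raters labeling the same items."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(judge_labels)

    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Expected agreement under independence, from each rater's label frequencies.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    expected = sum(judge_counts[label] * human_counts[label] for label in judge_counts) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented pass/fail verdicts on six generated images.
judge = ["pass", "pass", "fail", "fail", "pass", "fail"]
human = ["pass", "fail", "fail", "fail", "pass", "pass"]
kappa = cohens_kappa(judge, human)
```

Here the raters agree on four of six items (observed agreement 2/3) while chance agreement is 1/2, giving a kappa of 1/3. Raw percent agreement alone would overstate how well the judge tracks the human rater.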
Relevance to Patrick's Research
ImageTime is a sharp empirical probe of a question that matters for world modeling: can a static image model reason about discrete time? For Patrick's tracking, it sits in the evaluation stream (alongside WorldScore, MIRA-Bench, etc.) and complements video-world-model benchmarks by isolating temporal coherence from motion dynamics. The four-keyframe protocol is also operationally simple — easy to run on new image models as they ship. The VLM-as-judge methodology with structured subscores is a template Patrick could adapt for other world-model evaluations.