Abstract

ImageTime is a diagnostic benchmark that tests whether image generation models can coherently "imagine time" — preserve identities, objects, spatial relations, and causal order across multiple visual states. Given an action instruction (and optionally a reference image), a model must produce one image containing four ordered key states: initial state → action onset → transition state → final state. This four-keyframe protocol is more temporally demanding than single-image generation but avoids the confounds of dense video dynamics. Tasks are organized in a progressive capability hierarchy with stage-wise state predicates, cross-frame temporal constraints, and forbidden causal violations, scored by GPT-5.5 under a VLM-as-judge protocol.

Key Contributions

Method Details

Benchmark design:

  1. Task formulation: given an action instruction a (and optionally a reference image I_ref specifying the initial state), generate an image containing four ordered keyframes — initial state s_0, action onset s_1, transition state s_2, final state s_3
  2. Progressive capability hierarchy: tasks are organized into levels that demand progressively more sophisticated temporal reasoning (preservation of identity, spatial relations, causal order, etc.)
  3. Per-task decomposition: each scenario decomposes into stage-wise state predicates, cross-frame temporal constraints, and forbidden causal violations
  4. VLM-as-judge (GPT-5.5): a scoring protocol that produces structured subscores
  5. Multi-family evaluation: results span multiple image-generation model families to map where each succeeds and where it drifts
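
To make the per-task decomposition concrete, here is a minimal Python sketch of how one scenario and its judge verdicts might be represented. Everything in it is an assumption for illustration, not the benchmark's actual schema or scoring rule: the names TaskSpec and score_task, the stage labels, the example scenario, and the aggregation (fraction of checks satisfied, with the causal subscore dropping to zero if any forbidden violation is observed) are all hypothetical. The verdicts dictionary stands in for whatever the VLM judge returns.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical labels for the four ordered key states s_0 .. s_3.
STAGES = ("initial", "onset", "transition", "final")


@dataclass
class TaskSpec:
    """One scenario: an instruction plus the checks a judge is asked to verify."""
    instruction: str
    stage_predicates: dict            # stage label -> list of predicate strings
    temporal_constraints: list = field(default_factory=list)
    forbidden_violations: list = field(default_factory=list)
    reference_image: Optional[str] = None  # optional path to I_ref

    def __post_init__(self):
        # Every scenario must describe exactly the four ordered stages.
        if set(self.stage_predicates) != set(STAGES):
            raise ValueError(f"stage_predicates must have keys {STAGES}")


def _fraction(checks, verdicts):
    """Fraction of checks judged True; an empty list counts as fully satisfied.

    A check missing from verdicts raises KeyError rather than being guessed.
    """
    if not checks:
        return 1.0
    return sum(verdicts[c] for c in checks) / len(checks)


def score_task(spec, verdicts):
    """Aggregate boolean judge verdicts into three subscores (assumed scheme)."""
    per_stage = {s: _fraction(spec.stage_predicates[s], verdicts) for s in STAGES}
    # For forbidden violations, a True verdict means the violation was observed.
    violated = any(verdicts[v] for v in spec.forbidden_violations)
    return {
        "per_stage": per_stage,
        "state": sum(per_stage.values()) / len(STAGES),
        "temporal": _fraction(spec.temporal_constraints, verdicts),
        "causal": 0.0 if violated else 1.0,
    }


# Illustrative scenario (invented for this sketch).
spec = TaskSpec(
    instruction="Pour water from the jug into the empty glass",
    stage_predicates={
        "initial": ["glass is empty"],
        "onset": ["jug is tilted over the glass"],
        "transition": ["glass is partly filled"],
        "final": ["glass is full"],
    },
    temporal_constraints=[
        "same glass appears in all four frames",
        "water level never decreases",
    ],
    forbidden_violations=["glass is full before pouring starts"],
)

# Stand-in for judge output: the final state is wrong, everything else holds.
verdicts = {
    "glass is empty": True,
    "jug is tilted over the glass": True,
    "glass is partly filled": True,
    "glass is full": False,
    "same glass appears in all four frames": True,
    "water level never decreases": True,
    "glass is full before pouring starts": False,
}

result = score_task(spec, verdicts)
print(result["state"], result["temporal"], result["causal"])
```

Keeping the three kinds of check separate means a model that renders each frame plausibly but breaks identity or ordering across frames shows up as a low temporal or causal subscore rather than being averaged away.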

The design deliberately sits between single-image generation (no time) and dense video generation (confounded by motion/physics modeling) — it isolates temporal coherence as a property of the image model itself.

Key Results

Limitations and Future Work

Relevance to Patrick's Research

ImageTime is a sharp empirical probe of a question that matters for world modeling: can a static image model reason about discrete time? For Patrick's tracking, it sits in the evaluation stream (alongside WorldScore, MIRA-Bench, etc.) and complements video-world-model benchmarks by isolating temporal coherence from motion dynamics. The four-keyframe protocol is also operationally simple — easy to run on new image models as they ship. The VLM-as-judge methodology with structured subscores is a template Patrick could adapt for other world-model evaluations.