MiraBench is a hierarchical benchmark that defines *action-conditioned reliability* as the core evaluation target for robotic world models. The benchmark evaluates three progressively demanding levels: Physics Adherence (reference-free physical consistency), Action-Following Fidelity (predictions respecting task-relevant action inputs), and Optimism Bias Detection (calibration when actions should not succeed). Over 16,000 human-annotated judgments across 12 model configurations reveal that visual fidelity poorly predicts action fidelity, model scale does not reliably improve action following, and optimism bias is pervasive.
- First benchmark to evaluate *action-conditioned reliability* (not just visual fidelity) in robotic world models
Architecture: MiraBench does not propose a new world model — it evaluates existing ones. The benchmark framework decomposes action-conditioned reliability into three levels:
1. Physics Adherence: Evaluates reference-free physical consistency — whether predicted futures obey physical laws (no object penetration, realistic collisions, gravity) without a ground-truth reference.
2. Action-Following Fidelity: Measures whether predictions correctly respect task-relevant action inputs. Uses paired action-conditioning tests where the same scene is generated under different action commands.
3. Optimism Bias Detection: Probes the systematic tendency to predict success even under failure-inducing actions. Tests calibration: if an action should fail, does the model predict failure, or does it overconfidently predict success?
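
The three levels above can be read as three per-model rates computed over human judgments. The following is a minimal sketch of that reading, not the benchmark's actual scoring code: the `Judgment` schema, its field names, and the `score` function are hypothetical, and the source does not specify how annotations are aggregated.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Judgment:
    """One human annotation of a predicted rollout (hypothetical schema)."""
    physics_ok: bool          # Level 1: rollout looks physically plausible
    follows_action: bool      # Level 2: rollout reflects the commanded action
    action_should_fail: bool  # Level 3: the commanded action is failure-inducing
    predicts_success: bool    # Level 3: rollout depicts the task succeeding


def rate(flags: Iterable[bool]) -> float:
    """Fraction of True flags; NaN when there is nothing to average."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else float("nan")


def score(judgments: list[Judgment]) -> dict[str, float]:
    """Aggregate judgments for one model into three level-wise rates."""
    return {
        # Level 1: share of rollouts judged physically consistent.
        "physics_adherence": rate(j.physics_ok for j in judgments),
        # Level 2: share of rollouts that respect the action input.
        "action_following": rate(j.follows_action for j in judgments),
        # Level 3: among actions that should fail, how often the model
        # still depicts success. Higher means more optimism bias.
        "optimism_bias": rate(
            j.predicts_success for j in judgments if j.action_should_fail
        ),
    }


# Toy annotations for illustration only; these are not benchmark data.
toy = [
    Judgment(True, True, False, True),
    Judgment(True, False, False, True),
    Judgment(True, True, True, True),
    Judgment(False, False, True, False),
]
print(score(toy))
```

Restricting the optimism-bias rate to failure-inducing actions is the design choice that matters here: a model that always depicts success would look well calibrated on feasible actions and only be exposed on the infeasible ones.
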
Models evaluated include vector-conditioned robotic world models, text-conditioned generative world models, open-weight systems (e.g., based on diffusion transformers), and closed-source systems across multiple scales.
The authors acknowledge that MiraBench reveals systemic gaps in current world models but does not itself provide solutions. Future work should focus on developing world models that explicitly optimize for action-conditioned reliability rather than visual fidelity. The benchmark primarily covers simulated/robotic domains; extension to real-world robot deployment remains open.
Directly relevant to evaluating whether world models (e.g., video diffusion models, JEPA-style architectures) produce *actionable* predictions for robotics. The finding that visual fidelity ≠ action fidelity is critical for anyone building world models for embodied AI. The benchmark offers a methodology for judging world models on more than visual quality.
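
One way to check the "visual fidelity ≠ action fidelity" claim on your own set of models is a rank correlation between a per-model visual score and a per-model action-following score. The sketch below uses Spearman's coefficient in plain Python. This is an assumed analysis, not necessarily the one the authors ran, and the function names are hypothetical.

```python
def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5
```

Called as `spearman(visual_scores, action_scores)` with one entry per model, a value near zero would be consistent with visual quality carrying little information about action following, while a value near one would contradict it.
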

---

*Source: arXiv:2605.29360 | PDF: 2605.29360.pdf*