MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models

Abstract

MiraBench is a hierarchical benchmark that defines *action-conditioned reliability* as the core evaluation target for robotic world models. The benchmark evaluates three progressively demanding levels: Physics Adherence (reference-free physical consistency), Action-Following Fidelity (predictions respecting task-relevant action inputs), and Optimism Bias Detection (calibration when actions should not succeed). Over 16,000 human-annotated judgments across 12 model configurations reveal that visual fidelity poorly predicts action fidelity, model scale does not reliably improve action following, and optimism bias is pervasive.

Key Contributions

- First benchmark to evaluate *action-conditioned reliability* (not just visual fidelity) in robotic world models
- Hierarchical 3-level evaluation framework: Physics Adherence → Action-Following Fidelity → Optimism Bias Detection
- Comprehensive evaluation corpus: 16,000+ human judgments across tasks, failure categories, and model types
- 12 model configurations evaluated, spanning vector-conditioned, text-conditioned, open-weight, and closed-source systems at multiple scales

Method Details

Architecture: MiraBench does not propose a new world model — it evaluates existing ones. The benchmark framework decomposes action-conditioned reliability into three levels:

1. Physics Adherence: Evaluates reference-free physical consistency — whether predicted futures obey physical laws (no object penetration, realistic collisions, gravity) without a ground-truth reference.

2. Action-Following Fidelity: Measures whether predictions correctly respect task-relevant action inputs. Uses paired action-conditioning tests where the same scene is generated under different action commands.

3. Optimism Bias Detection: Probes the systematic tendency to predict success even under failure-inducing actions. Tests calibration: if an action should fail, does the model predict failure, or does it overconfidently predict success?
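As a rough illustration of the third level, the sketch below scores optimism bias from binary human judgments. The `Judgment` schema, the function names, and the rate definitions are assumptions made for this example; the summary does not state how MiraBench actually aggregates its annotations, and the toy data is invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One human-annotated judgment of a predicted rollout (hypothetical schema)."""
    action_should_succeed: bool  # ground truth: would this action accomplish the task?
    predicted_success: bool      # annotator's reading: does the rollout depict success?


def optimism_bias_rate(judgments):
    """Fraction of failure-inducing actions for which the model still depicts success."""
    failing = [j for j in judgments if not j.action_should_succeed]
    if not failing:
        raise ValueError("need at least one failure-inducing action")
    return sum(j.predicted_success for j in failing) / len(failing)


def pessimism_rate(judgments):
    """Fraction of valid actions for which the model depicts failure (the mirror error)."""
    valid = [j for j in judgments if j.action_should_succeed]
    if not valid:
        raise ValueError("need at least one action that should succeed")
    return sum(not j.predicted_success for j in valid) / len(valid)


# Toy data for illustration only; not taken from the benchmark.
toy = [
    Judgment(action_should_succeed=False, predicted_success=True),
    Judgment(action_should_succeed=False, predicted_success=True),
    Judgment(action_should_succeed=False, predicted_success=False),
    Judgment(action_should_succeed=False, predicted_success=True),
    Judgment(action_should_succeed=True, predicted_success=True),
    Judgment(action_should_succeed=True, predicted_success=False),
]
print(optimism_bias_rate(toy))  # 0.75
print(pessimism_rate(toy))      # 0.5
```

Reporting both rates side by side helps separate a model that is optimistic specifically under failure-inducing actions from one that is simply inaccurate in both directions.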

Models evaluated include vector-conditioned robotic world models, text-conditioned generative world models, open-weight systems (e.g., based on diffusion transformers), and closed-source systems across multiple scales.

Key Results

| Finding | Detail |
|---------|--------|
| Visual fidelity ≠ Action fidelity | Visual quality scores poorly correlate with action-following scores across all models |
| Scale doesn't help action following | Increasing model parameters does not reliably improve action fidelity |
| Optimism bias pervasive | All evaluated models show systematic tendency to predict success even under failure conditions |
| Dataset scale | 16,000+ human judgments collected |
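One way to check a claim like "visual quality poorly correlates with action following" is a rank correlation over per-model scores. The sketch below implements Spearman's rho in plain Python; the choice of Spearman, the function names, and the per-model scores are assumptions for illustration, not the paper's reported statistic or data.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over any run of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-model scores, invented for illustration only.
visual_quality = [0.9, 0.8, 0.7, 0.6, 0.5]
action_following = [0.4, 0.6, 0.3, 0.7, 0.5]
print(round(spearman(visual_quality, action_following), 3))  # -0.3
```

A rho near zero (or negative, as in this toy case) would indicate that ranking models by visual quality says little about how they rank on action following.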

Limitations and Future Work

The authors acknowledge that MiraBench reveals systemic gaps in current world models but does not itself provide solutions. Future work should focus on developing world models that explicitly optimize for action-conditioned reliability rather than visual fidelity. The benchmark primarily covers simulated/robotic domains; extension to real-world robot deployment remains open.

Relevance to Patrick's Research

Directly relevant to evaluating whether world models (e.g., video diffusion models, JEPA-style architectures) produce *actionable* predictions for robotics. The finding that visual fidelity ≠ action fidelity is critical for anyone building world models for embodied AI. This benchmark provides a methodology for evaluating world models on more than aesthetic quality.

---

*Source: arXiv:2605.29360 | PDF: 2605.29360.pdf*