Vision-language-action (VLA) policies operate in a closed loop in real-world robot tasks: a robot observes the scene, executes an action chunk, and conditions its next decision on the resulting observation. However, most existing world models for robot action evaluation are limited to open-loop prediction along pre-collected action trajectories. This prevents them from supporting closed-loop VLA evaluation, where each action chunk must be conditioned on the observation generated by the previous execution. To address this gap, we propose PiL-World, a chunk-wise world model designed for policy-in-the-loop VLA evaluation. Given the current observation and the action trajectory rolled out by a VLA policy, PiL-World generates multi-view future observations that are consistent with the VLA rollout and match the image inputs required by the policy. By alternating between VLA inference and world-model prediction, PiL-World enables closed-loop evaluation without real robot execution at every step. To improve rollout fidelity, PiL-World conditions video generation on action-derived visual control from head-view robot motion and latent histories that encode task execution context, while jointly predicting complementary multi-view observations. Beyond successful teleoperated demonstrations, it also learns from failed execution trajectories, helping the imagined rollouts better match the distribution of real policy executions. We evaluate PiL-World on three real dual-arm manipulation tasks. PiL-World generates imagined rollouts that are highly consistent with real robot executions. More importantly, compared with the baseline, it reduces the error between VLA success rates measured in real-world rollouts and those estimated through closed-loop world-model evaluation from 63.2% to 12.0%.
- Closed-Loop VLA Evaluation: A chunk-wise world model designed for policy-in-the-loop evaluation, alternating between VLA inference and world-model prediction.
Core Problem: Most existing world models predict along pre-collected action trajectories (open-loop), but VLA policies run closed-loop: each action chunk is conditioned on the observation that results from executing the previous one.
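PiL-World Approach:
- Chunk-wise prediction: Given the current observation and the action trajectory rolled out by the VLA policy, PiL-World generates multi-view future observations that match the image inputs the policy requires.
- Policy in the loop: VLA inference and world-model prediction alternate, so evaluation proceeds closed-loop without real robot execution at every step.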
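Technical Details:
- Conditioning: Video generation is conditioned on action-derived visual control from head-view robot motion and on latent histories that encode task execution context.
- Multi-view output: Complementary multi-view observations are predicted jointly.
- Training data: Beyond successful teleoperated demonstrations, the model also learns from failed execution trajectories, which helps imagined rollouts match the distribution of real policy executions.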
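The alternation between VLA inference and world-model prediction can be illustrated with a minimal sketch. This is not the authors' implementation: `closed_loop_rollout`, `toy_policy`, `toy_world_model`, the camera names, the chunk length, and the image sizes are all hypothetical stand-ins chosen so the loop runs without a real VLA policy or video model.

```python
from typing import Callable, Dict, List

import numpy as np

Observation = Dict[str, np.ndarray]  # camera name -> image array (H, W, 3)
ActionChunk = np.ndarray             # shape (chunk_len, action_dim)


def closed_loop_rollout(
    policy: Callable[[Observation], ActionChunk],
    world_model: Callable[[Observation, ActionChunk], List[Observation]],
    initial_obs: Observation,
    num_chunks: int,
) -> List[Observation]:
    """Alternate policy inference and world-model prediction for num_chunks chunks."""
    obs = initial_obs
    trajectory = [obs]
    for _ in range(num_chunks):
        actions = policy(obs)                   # policy sees the latest imagined observation
        predicted = world_model(obs, actions)   # one multi-view observation per action step
        trajectory.extend(predicted)
        obs = predicted[-1]                     # feed the last imagined frame back to the policy
    return trajectory


# Hypothetical stand-in for a VLA policy (assumption, not a real model).
def toy_policy(obs: Observation) -> ActionChunk:
    # Emits a 4-step chunk of 2-D actions that shrinks as the head view brightens.
    brightness = float(obs["head"].mean())
    return np.full((4, 2), 0.1 * (1.0 - brightness))


# Hypothetical stand-in for the world model (assumption, not a video generator).
def toy_world_model(obs: Observation, actions: ActionChunk) -> List[Observation]:
    # "Renders" each step by brightening every view in proportion to the action.
    frames, current = [], obs
    for action in actions:
        current = {
            view: np.clip(img + action.mean(), 0.0, 1.0)
            for view, img in current.items()
        }
        frames.append(current)
    return frames


# Assumed camera names; the source only states that outputs are multi-view.
start = {view: np.zeros((8, 8, 3)) for view in ("head", "left_wrist", "right_wrist")}
rollout = closed_loop_rollout(toy_policy, toy_world_model, start, num_chunks=3)
print(len(rollout))  # 1 initial observation + 3 chunks * 4 steps = 13
```

The point of the sketch is the data dependency: each call to `policy` receives an observation produced by the world model, not one read from a pre-collected trajectory, which is what separates closed-loop from open-loop evaluation.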
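Evaluated on three real dual-arm manipulation tasks, PiL-World generates imagined rollouts that are highly consistent with real robot executions. Compared with the baseline, it reduces the error between VLA success rates measured in real-world rollouts and those estimated through closed-loop world-model evaluation from 63.2% to 12.0%.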
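As a rough illustration of how a success-rate error of this kind could be computed, the sketch below takes the mean absolute difference between real and world-model-estimated success rates across tasks. The source does not specify the exact metric (absolute or relative, or how tasks are aggregated), so this is one plausible reading. The function names, task names, and outcome lists are placeholders, not data from the paper.

```python
from typing import Dict, List


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of rollouts that succeeded."""
    return sum(outcomes) / len(outcomes)


def mean_success_rate_gap(
    real: Dict[str, List[bool]],
    imagined: Dict[str, List[bool]],
) -> float:
    """Mean absolute per-task gap between real and imagined success rates, in percentage points."""
    if real.keys() != imagined.keys():
        raise ValueError("real and imagined results must cover the same tasks")
    gaps = [abs(success_rate(real[t]) - success_rate(imagined[t])) for t in real]
    return 100.0 * sum(gaps) / len(gaps)


# Placeholder outcomes (True = success); illustrative only, NOT results from the paper.
real_runs = {
    "task_a": [True, True, False, True],
    "task_b": [False, False, True, False],
}
imagined_runs = {
    "task_a": [True, True, True, True],
    "task_b": [False, True, True, False],
}

gap = mean_success_rate_gap(real_runs, imagined_runs)
print(f"{gap:.1f} percentage points")
```

A lower gap means the world model's estimate of policy success tracks real-robot evaluation more closely.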
The approach requires access to the VLA policy to roll out action trajectories, and it may be computationally expensive for very long-horizon tasks. Future work could extend PiL-World to deformable-object manipulation and to dynamic environments with external perturbations.
PiL-World addresses a key challenge in world-model research: evaluating VLA policies without expensive real-world deployment. Its chunk-wise, closed-loop evaluation paradigm matters because it mirrors how the policy actually runs on the robot, where each decision depends on the outcome of the previous action chunk. The work is relevant to anyone working on world models for robotics, particularly on VLA policy evaluation.