PiL-World: A Chunk-Wise World Model for VLA Policy-in-the-Loop Evaluation

Abstract

Vision-language-action (VLA) policies operate in a closed loop in real-world robot tasks: a robot observes the scene, executes an action chunk, and conditions its next decision on the resulting observation. However, most existing world models for robot action evaluation are limited to open-loop prediction along pre-collected action trajectories. This prevents them from supporting closed-loop VLA evaluation, where each action chunk must be conditioned on the observation generated by the previous execution. To address this gap, we propose PiL-World, a chunk-wise world model designed for policy-in-the-loop VLA evaluation. Given the current observation and the action trajectory rolled out by a VLA policy, PiL-World generates multi-view future observations that are consistent with the VLA rollout and match the image inputs required by the policy. By alternating between VLA inference and world-model prediction, PiL-World enables closed-loop evaluation without real robot execution at every step. To improve rollout fidelity, PiL-World conditions video generation on action-derived visual control from head-view robot motion and latent histories that encode task execution context, while jointly predicting complementary multi-view observations. Beyond successful teleoperated demonstrations, it also learns from failed execution trajectories, helping the imagined rollouts better match the distribution of real policy executions. We evaluate PiL-World on three real dual-arm manipulation tasks. PiL-World generates imagined rollouts that are highly consistent with real robot executions. More importantly, compared with the baseline, it reduces the error between VLA success rates measured in real-world rollouts and those estimated through closed-loop world-model evaluation from 63.2% to 12.0%.

Key Contributions

- Closed-Loop VLA Evaluation: A world model designed for policy-in-the-loop evaluation, alternating between VLA inference and world-model prediction.
- Chunk-Wise Generation: Generates multi-view future observations chunk by chunk, conditioned on the current observation and the VLA action trajectory, with chunks matching the VLA action chunk size.
- Action-Derived Visual Control: Conditions video generation on visual control derived from head-view robot motion and on latent histories that encode task execution context.
- Learning from Failures: Trains on failed execution trajectories as well as successful teleoperated demonstrations, so that imagined rollouts better match the distribution of real policy executions.
- Major Reduction in Sim-Real Gap: Reduces the error between VLA success rates measured in real-world rollouts and those estimated through closed-loop world-model evaluation from 63.2% to 12.0%.
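
To make the chunk-wise idea concrete, here is a minimal sketch of splitting an action trajectory into fixed-size chunks. The function name `split_into_chunks` and the list-based action representation are assumptions made for illustration; the source does not describe how PiL-World stores or segments actions.

```python
from typing import List, Sequence, TypeVar

A = TypeVar("A")  # one action; its real type (e.g. a joint command) is not specified in the source


def split_into_chunks(actions: Sequence[A], chunk_size: int) -> List[List[A]]:
    """Split a trajectory into consecutive chunks of at most chunk_size actions."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # The final chunk may be shorter when the trajectory length is not a multiple of chunk_size.
    return [list(actions[i:i + chunk_size]) for i in range(0, len(actions), chunk_size)]
```

In a closed-loop setting each chunk would come from a fresh policy call rather than from a pre-collected trajectory, so this helper is mainly useful for preparing training data or open-loop comparisons.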

Method Details

Core Problem: Most existing world models for robot action evaluation predict along pre-collected action trajectories (open-loop), but VLA policies run in a closed loop: each action chunk is conditioned on the observation produced by executing the previous chunk.

PiL-World Approach:

1. Take as input the current observation and the action trajectory rolled out by the VLA policy under evaluation.
2. Generate multi-view future observations consistent with the VLA rollout.
3. Ensure the generated observations match the image inputs required by the VLA policy.
4. Alternate between VLA inference and world-model prediction.
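
The alternation described in the steps above can be sketched as a short loop. This is a toy illustration under stated assumptions, not the authors' implementation: `closed_loop_rollout`, `toy_policy`, and `toy_world_model` are hypothetical names, and the observation is reduced to a single float where the real system would use multi-view images.

```python
from typing import Callable, List, Tuple

# Toy stand-ins: a real observation would be a set of multi-view images,
# and a real action would be a robot command.
Observation = float
ActionChunk = List[float]


def closed_loop_rollout(
    policy: Callable[[Observation], ActionChunk],
    world_model: Callable[[Observation, ActionChunk], Observation],
    initial_obs: Observation,
    num_chunks: int,
) -> Tuple[List[Observation], List[ActionChunk]]:
    """Alternate policy inference and world-model prediction for num_chunks steps."""
    obs = initial_obs
    observations = [obs]
    chunks: List[ActionChunk] = []
    for _ in range(num_chunks):
        chunk = policy(obs)            # policy inference on the latest observation
        obs = world_model(obs, chunk)  # imagined observation after executing the chunk
        chunks.append(chunk)
        observations.append(obs)
    return observations, chunks


# Hypothetical toy components, for illustration only.
def toy_policy(obs: Observation) -> ActionChunk:
    # Two equal steps per chunk; stop once the 1-D state reaches the goal at 1.0.
    return [0.25, 0.25] if obs < 1.0 else [0.0, 0.0]


def toy_world_model(obs: Observation, chunk: ActionChunk) -> Observation:
    # The next state is the current state plus the summed actions.
    return obs + sum(chunk)
```

The point of the sketch is that the policy's third chunk differs from its first two because it sees the observation the world model generated, which is what an open-loop replay of a pre-collected trajectory cannot capture.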

Technical Details:

- Action-derived visual control from head-view robot motion
- Latent histories encoding task execution context
- Joint multi-view observation prediction
- Trained on both successful and failed trajectories

Key Results

| Metric | Value |
|--------|-------|
| Sim-Real Success Rate Gap (baseline) | 63.2% |
| Sim-Real Success Rate Gap (PiL-World) | 12.0% |
| Reduction in Error | 51.2 percentage points |

Evaluated on three real dual-arm manipulation tasks. PiL-World generates imagined rollouts highly consistent with real robot executions.
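
One plausible way to compute a success-rate gap of this kind is sketched below. The source does not state the exact formula, so the choice of a mean absolute difference across tasks, the function names, and the per-task boolean outcome lists are all assumptions.

```python
from typing import Dict, List


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of rollouts that succeeded."""
    if not outcomes:
        raise ValueError("need at least one rollout")
    return sum(outcomes) / len(outcomes)


def mean_abs_gap_points(
    real: Dict[str, List[bool]],
    imagined: Dict[str, List[bool]],
) -> float:
    """Mean absolute success-rate difference across tasks, in percentage points.

    Assumed metric: the source reports an error between real and
    world-model-estimated success rates but does not give the formula.
    """
    if real.keys() != imagined.keys():
        raise ValueError("both inputs must cover the same tasks")
    gaps = [abs(success_rate(real[t]) - success_rate(imagined[t])) for t in real]
    return 100.0 * sum(gaps) / len(gaps)
```

Under this definition, a task where the policy succeeds in half of the real rollouts but in every imagined rollout contributes a 50-point gap, while a task with identical rates contributes zero.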

Limitations and Future Work

The approach requires access to the VLA policy in order to roll out action trajectories. It may be computationally expensive for very long-horizon tasks. Future work could explore extending PiL-World to deformable object manipulation and to dynamic environments with external perturbations.

Relevance to Patrick's Research

PiL-World directly addresses a key challenge in world model research: how to evaluate VLA policies without expensive real-world deployment. The chunk-wise, closed-loop evaluation paradigm is essential for accurate policy assessment. This is highly relevant to anyone working on world models for robotics, particularly the intersection of world models and VLA policy evaluation.