PiL-World: A Chunk-Wise World Model for VLA Policy-in-the-Loop Evaluation

Abstract

PiL-World is a chunk-wise world model designed for policy-in-the-loop VLA evaluation. Unlike existing world models limited to open-loop prediction, PiL-World generates multi-view future observations conditioned on both the current observation and the action trajectory rolled out by a VLA policy. By alternating between VLA inference and world-model prediction, it enables closed-loop evaluation without real robot execution at every step.

Key Contributions

- Closed-loop VLA evaluation: First world model to support closed-loop VLA evaluation where each action chunk is conditioned on the observation generated by previous execution

- Action-derived visual control: Conditions video generation on action-derived visual control from head-view robot motion and latent histories encoding task execution context

- Learning from failures: Learns from both successful teleoperated demonstrations and failed execution trajectories, improving imagined rollout fidelity

Method Details

PiL-World takes two inputs at each step:

1. The current observation (robot's view)
2. The action trajectory rolled out by the VLA policy

The model generates multi-view future observations that:

- Are consistent with the VLA rollout

- Match the image inputs the policy requires for its next decision
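The alternation between VLA inference and world-model prediction can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: `closed_loop_rollout`, `toy_policy`, `toy_world_model`, the `"wrist"` view name, and the list-of-floats stand-in for images are all hypothetical placeholders chosen so the example runs on its own.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins: a real system would use image tensors and robot actions.
Observation = Dict[str, List[float]]   # view name -> toy "image"
ActionChunk = List[List[float]]        # sequence of per-step action vectors


def closed_loop_rollout(
    policy: Callable[[Observation], ActionChunk],
    world_model: Callable[[Observation, ActionChunk], Observation],
    initial_obs: Observation,
    num_chunks: int,
) -> List[Observation]:
    """Alternate policy inference and world-model prediction for num_chunks chunks."""
    obs = initial_obs
    history = [obs]
    for _ in range(num_chunks):
        chunk = policy(obs)             # the policy only ever sees the latest observation
        obs = world_model(obs, chunk)   # imagined multi-view observation after the chunk
        history.append(obs)
    return history


def toy_policy(obs: Observation) -> ActionChunk:
    # Assumption: a fixed two-step chunk, regardless of the observation.
    return [[1.0], [1.0]]


def toy_world_model(obs: Observation, chunk: ActionChunk) -> Observation:
    # Assumption: every view shifts by the summed action, so all views stay consistent.
    delta = sum(step[0] for step in chunk)
    return {view: [pixel + delta for pixel in image] for view, image in obs.items()}


start = {"head": [0.0], "wrist": [0.0]}
history = closed_loop_rollout(toy_policy, toy_world_model, start, num_chunks=3)
```

The point of the sketch is the data flow: the observation passed to the policy at chunk k is the world model's output from chunk k-1, so the predicted views must cover whatever image inputs the policy expects.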

Key architectural elements:

- Action-derived visual control: Head-view robot motion controls the video generation

- Latent histories: Encode task execution context across timesteps

- Joint multi-view prediction: Predicts complementary views simultaneously

- Failed trajectory learning: Trains on failed executions to better match the distribution of real policy rollouts
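One way to picture training on both successful demonstrations and failed executions is a batch sampler that draws from both pools. The sketch below is only an illustration under stated assumptions: the summary does not say how the two data sources are mixed, so `sample_mixed_batch`, the `Trajectory` record, and the `failure_fraction` parameter are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Trajectory:
    episode_id: str
    succeeded: bool


def sample_mixed_batch(
    demos: Sequence[Trajectory],
    failures: Sequence[Trajectory],
    batch_size: int,
    failure_fraction: float,   # hypothetical knob; the mixing ratio is not given in the source
    rng: random.Random,
) -> List[Trajectory]:
    """Draw a batch containing a fixed share of failed trajectories (with replacement)."""
    if not 0.0 <= failure_fraction <= 1.0:
        raise ValueError("failure_fraction must be in [0, 1]")
    n_fail = int(round(batch_size * failure_fraction))
    n_demo = batch_size - n_fail
    batch = rng.choices(demos, k=n_demo) + rng.choices(failures, k=n_fail)
    rng.shuffle(batch)
    return batch


demos = [Trajectory(f"demo_{i}", True) for i in range(5)]
failures = [Trajectory(f"fail_{i}", False) for i in range(3)]
batch = sample_mixed_batch(demos, failures, batch_size=8,
                           failure_fraction=0.25, rng=random.Random(0))
```

Including failures matters for evaluation because a policy under test will often leave the states covered by clean demonstrations, and the world model has to render those states plausibly too.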

Key Results

Evaluated on three real dual-arm manipulation tasks:

| Metric | Baseline | PiL-World |
|--------|----------|-----------|
| Real vs. estimated success rate error | 63.2% | 12.0% |

PiL-World's imagined rollouts track real robot executions closely: the error between real and estimated success rates falls from 63.2% with the baseline to 12.0%, narrowing the gap between real-world and world-model-based evaluation of VLA policies.
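To make the metric concrete, the sketch below computes a success rate estimation error under one plausible definition: the mean absolute difference, in percentage points, between the real and world-model-estimated success rates across tasks. The exact definition used in the paper is not stated in this summary, so treat this as an assumption; the function names, task names, and outcomes are made up for illustration and are not the paper's data.

```python
from typing import Dict, List


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of trials that succeeded."""
    return sum(outcomes) / len(outcomes)


def success_rate_error(
    real: Dict[str, List[bool]],
    estimated: Dict[str, List[bool]],
) -> float:
    """Mean absolute gap between real and estimated success rates, in percentage points."""
    gaps = [
        abs(success_rate(real[task]) - success_rate(estimated[task])) * 100.0
        for task in real
    ]
    return sum(gaps) / len(gaps)


# Illustrative outcomes only (hypothetical tasks, not results from the paper).
real_trials = {
    "task_a": [True, True, False, False],    # 50% real success
    "task_b": [True, False, False, False],   # 25% real success
}
imagined_trials = {
    "task_a": [True, True, True, False],     # 75% estimated success
    "task_b": [True, False, False, False],   # 25% estimated success
}
error_pp = success_rate_error(real_trials, imagined_trials)  # (25 + 0) / 2 = 12.5
```

A lower value means the world model's verdicts about a policy agree more closely with what happens on the real robot.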

Limitations and Future Work

The paper focuses on manipulation tasks with dual-arm robots; applicability to navigation or other embodied domains is not explored. Future work could extend to more complex task hierarchies and longer-horizon evaluations.

Relevance to Patrick's Research

PiL-World addresses a critical gap in world model evaluation: most world models are evaluated open-loop, but real VLA deployment is closed-loop. This work provides a methodology for assessing world models as VLA evaluation tools, not just as video generators. The roughly 51 percentage point reduction in success rate estimation error (from 63.2% to 12.0%) suggests that world models can serve as effective VLA evaluation proxies when properly designed for closed-loop conditioning.