PiL-World: A Chunk-Wise World Model for VLA Policy-in-the-Loop Evaluation

Abstract

PiL-World is a chunk-wise world model designed for policy-in-the-loop VLA evaluation. Unlike existing world models limited to open-loop prediction, PiL-World generates multi-view future observations conditioned on both the current observation and the action trajectory rolled out by a VLA policy. By alternating between VLA inference and world-model prediction, it enables closed-loop evaluation without real robot execution at every step.

Key Contributions

- Closed-loop VLA evaluation: Presented as the first world model to support closed-loop VLA evaluation, in which each action chunk is conditioned on the observation generated by executing the previous chunk
- Action-derived visual control: Conditions video generation on action-derived visual control signals based on head-view robot motion, together with latent histories that encode task execution context
- Learning from failures: Learns from both successful teleoperated demonstrations and failed execution trajectories, improving imagined rollout fidelity

Method Details

PiL-World takes two inputs at each step:

1. The current observation (robot's view)
2. The action trajectory rolled out by the VLA policy

The model generates multi-view future observations that:

- Are consistent with the VLA rollout
- Match the image inputs required by the policy for its next decision
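The alternation between policy inference and world-model prediction described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `imagined_rollout`, `policy`, `world_model`, and `is_success`, the dictionary layout of an observation, and the `max_chunks` budget are all hypothetical. In particular, the source does not say how success is judged in an imagined rollout, so `is_success` is a placeholder.

```python
from typing import Any, Callable, Dict, List, Sequence

Observation = Dict[str, Any]   # assumed layout: view name -> image
ActionChunk = Sequence[Any]    # one chunk of actions produced by the policy


def imagined_rollout(
    policy: Callable[[Observation], ActionChunk],
    world_model: Callable[[Observation, ActionChunk], List[Observation]],
    initial_obs: Observation,
    is_success: Callable[[Observation], bool],
    max_chunks: int = 20,
) -> bool:
    """Alternate policy inference and world-model prediction for one episode."""
    obs = initial_obs
    for _ in range(max_chunks):
        chunk = policy(obs)               # policy acts on the current (possibly imagined) observation
        frames = world_model(obs, chunk)  # predicted future observations for this chunk
        obs = frames[-1]                  # last predicted frame feeds the next decision
        if is_success(obs):               # hypothetical success check
            return True
    return False


# Toy stand-ins so the sketch runs end to end; real models would replace these.
def toy_policy(obs: Observation) -> ActionChunk:
    return [1, 1]


def toy_world_model(obs: Observation, chunk: ActionChunk) -> List[Observation]:
    frames = []
    position = obs["head"]
    for action in chunk:
        position += action
        frames.append({"head": position})
    return frames


def toy_success(obs: Observation) -> bool:
    return obs["head"] >= 6


print(imagined_rollout(toy_policy, toy_world_model, {"head": 0}, toy_success))  # True
```

The key property is that after the first step the policy never sees a real image: every later decision is made on an observation the world model generated, which is what distinguishes this from open-loop prediction.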

Key architectural elements:

- Action-derived visual control: Head-view robot motion controls the video generation
- Latent histories: Encode task execution context across timesteps
- Joint multi-view prediction: Predicts complementary views simultaneously
- Failed trajectory learning: Trains on failed executions to better match the distribution of real policy rollouts
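One simple way to train on both successful and failed trajectories is to draw each batch with a fixed share of failures. The sketch below illustrates that idea only. The source does not describe how the two data sources are combined, so the function `sample_training_batch` and its `failure_fraction` parameter are assumptions made for illustration.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_training_batch(
    successes: Sequence[T],
    failures: Sequence[T],
    batch_size: int,
    failure_fraction: float,
    rng: random.Random,
) -> List[T]:
    """Draw a batch with a fixed share of failed trajectories (sampling with replacement)."""
    if not 0.0 <= failure_fraction <= 1.0:
        raise ValueError("failure_fraction must lie in [0, 1]")
    n_fail = round(batch_size * failure_fraction)  # assumed mixing rule
    batch = rng.choices(failures, k=n_fail) + rng.choices(successes, k=batch_size - n_fail)
    rng.shuffle(batch)  # avoid a fixed ordering of failures and successes
    return batch


# Hypothetical trajectory identifiers, used only to show the call.
batch = sample_training_batch(
    successes=["demo_a", "demo_b"],
    failures=["fail_a"],
    batch_size=8,
    failure_fraction=0.25,
    rng=random.Random(0),
)
print(len(batch), sum(item.startswith("fail") for item in batch))  # 8 2
```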

Key Results

Evaluated on three real dual-arm manipulation tasks:

| Metric | Baseline | PiL-World |
|--------|----------|-----------|
| Real vs. estimated success rate error | 63.2% | 12.0% |

PiL-World generates imagined rollouts that closely track real robot executions: the error between real and estimated success rates falls from 63.2% for the baseline to 12.0%, narrowing the gap between real-world and simulated evaluation of VLA policies.
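To make the metric concrete, the sketch below computes a success rate error as the absolute gap, in percentage points, between real and estimated success rates, then works through the arithmetic on the two reported figures. The helper names are hypothetical, and the exact definition used in the paper (for example, how the error is aggregated across the three tasks) is not given in this summary, so the absolute-gap form is an assumption.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Share of successful episodes, in percent."""
    if not outcomes:
        raise ValueError("need at least one episode")
    return 100.0 * sum(outcomes) / len(outcomes)


def success_rate_error(real: Sequence[bool], imagined: Sequence[bool]) -> float:
    """Absolute gap, in percentage points, between real and estimated success rates."""
    return abs(success_rate(real) - success_rate(imagined))


# Arithmetic on the two figures reported in the results table.
baseline_error = 63.2
pil_world_error = 12.0
absolute_reduction = baseline_error - pil_world_error             # percentage points
relative_reduction = 100.0 * absolute_reduction / baseline_error  # percent of the baseline error

print(f"{absolute_reduction:.1f} pp absolute, {relative_reduction:.1f}% relative")
# 51.2 pp absolute, 81.0% relative
```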

Limitations and Future Work

The paper focuses on manipulation tasks with dual-arm robots; applicability to navigation or other embodied domains is not explored. Future work could extend to more complex task hierarchies and longer-horizon evaluations.

Relevance to Patrick's Research

PiL-World addresses a critical gap in world model evaluation: most world models are evaluated open-loop, but real VLA deployment is closed-loop. This work provides a methodology for assessing world models as VLA evaluation tools, not just as video generators. The 51.2 percentage point reduction in success rate estimation error (from 63.2% to 12.0%) suggests that world models can serve as effective VLA evaluation proxies when designed for closed-loop conditioning.