- arXiv: 2606.01027
- Authors: Pengfei Zhou, Shengcong Chen, Di Chen, Jiaxu Wang, Rongjun Jin, Bingwen Zhu, Yike Pan, Songen Gu, Kuanning Wang, Shufeng Nan, Xingyu Qiu, Chenhao Qiu, Pu Yang, Yunuo Cai, Jianxiong Gao, Yifan Li, Yanwei Fu, Xiangyu Yue, Zhi Chen, Jianlan Luo
- Affiliation: Shanghai Innovation Institute, AGIBOT Finch
- Submitted: 31 May 2026

Robotic manipulation requires models that generate executable actions while anticipating and evaluating their future consequences before physical execution. τ₀-WM is a unified video-action world model that integrates policy learning, video prediction, and action evaluation within a single future-predictive framework. Built on a shared video diffusion backbone, τ₀-WM provides two complementary interfaces: a Video Action Model (VAM) that jointly predicts future visual latents and continuous action chunks, and an Action-Conditioned Video Simulator (ACVS) that rolls out candidate action chunks into multi-view futures and predicts dense task-progress scores. The model is trained on ~27,300 hours of heterogeneous data (real-robot teleoperation, UMI-style demonstrations, egocentric human videos, and rollout/failure trajectories). At inference, it uses test-time computation to sample, rank, and rectify actions.

1. Unified video-action architecture: Shares a predictive representation across policy learning (VAM) and action-conditioned simulation (ACVS), both built on a Wan2.2-TI2V-5B video diffusion backbone
2. Heterogeneous data training: Jointly trains on real-robot teleoperation (17.8K hours on AGIBOT-G01, ARX, dual-arm Franka), UMI-style demonstrations (6.5K hours), and egocentric human videos (3.0K hours) using modality-specific supervision masks
3. Test-time proposal-evaluation-revision: Samples multiple action candidates, ranks them with Re-denoising Consistency Score (RCS), and invokes ACVS for simulator-based rectification of low-quality candidates
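
As a rough illustration of item 2, modality-specific supervision masks can be read as per-sample switches that decide which loss terms a given data source contributes to. The sketch below assumes that egocentric human videos carry no robot action labels and therefore supervise only the video term; that assumption, the `SUPERVISION` table, and the function name `masked_joint_loss` are hypothetical and are not taken from the paper.

```python
import numpy as np

# Hypothetical mask table: which loss terms each data source supervises.
# These assignments are illustrative assumptions, not the paper's configuration.
SUPERVISION = {
    "teleop": {"video": 1.0, "action": 1.0},
    "umi": {"video": 1.0, "action": 1.0},
    "human_video": {"video": 1.0, "action": 0.0},  # assumed: no robot action labels
}


def masked_joint_loss(video_loss, action_loss, sources):
    """Average per-sample losses, counting each term only where it is supervised."""
    video_loss = np.asarray(video_loss, dtype=float)
    action_loss = np.asarray(action_loss, dtype=float)
    v_mask = np.array([SUPERVISION[s]["video"] for s in sources])
    a_mask = np.array([SUPERVISION[s]["action"] for s in sources])
    # Normalize by the number of supervised samples so that unlabeled data
    # does not dilute the action term; guard against an all-zero mask.
    v_term = (v_mask * video_loss).sum() / max(v_mask.sum(), 1.0)
    a_term = (a_mask * action_loss).sum() / max(a_mask.sum(), 1.0)
    return v_term + a_term
```

Normalizing each term by its own mask sum is one possible design choice: it keeps the action loss on the same scale regardless of how many action-free samples land in a batch.
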
- Base: Wan2.2-TI2V-5B video diffusion backbone (5B video DiT)
- Objective: Joint flow-matching for both video latents and action chunks
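
For intuition on the objective, here is a minimal NumPy sketch of a joint flow-matching loss under the common linear-interpolation formulation. The note does not say whether τ₀-WM uses exactly this interpolation path, a shared timestep for both modalities, or this loss weighting, so those choices, the `action_weight` parameter, and the `predict_velocity` callable (a stand-in for the shared backbone) are all assumptions.

```python
import numpy as np


def flow_matching_pair(data, noise, t):
    """Return the noisy sample x_t and the velocity target for a timestep t in [0, 1].

    Convention assumed here: t=0 is pure noise, t=1 is clean data.
    """
    x_t = (1.0 - t) * noise + t * data
    target_velocity = data - noise
    return x_t, target_velocity


def joint_flow_matching_loss(predict_velocity, video_latents, actions, rng,
                             action_weight=1.0):
    """Mean-squared velocity error summed over the video and action streams."""
    t = rng.uniform()  # assumed: one shared timestep for both modalities
    v_noise = rng.standard_normal(video_latents.shape)
    a_noise = rng.standard_normal(actions.shape)
    v_t, v_target = flow_matching_pair(video_latents, v_noise, t)
    a_t, a_target = flow_matching_pair(actions, a_noise, t)
    # predict_velocity stands in for the shared backbone (hypothetical callable).
    v_pred, a_pred = predict_velocity(v_t, a_t, t)
    video_term = np.mean((v_pred - v_target) ** 2)
    action_term = np.mean((a_pred - a_target) ** 2)
    return video_term + action_weight * action_term
```

A predictor that recovers the exact velocity for both streams drives this loss to zero, which is the property the training objective is meant to reward.
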
Coarse-to-fine strategy:
1. Re-denoising Consistency Score (RCS): Samples N action candidates, re-noises each at K random timesteps, and measures the average re-denoising error, where a lower error indicates higher consistency. Selects the candidate maximizing RCS (most consistent with the learned conditional action manifold)
2. Low-quality Action Rectification (LAR): When RCS falls below threshold γ, ACVS is invoked to evaluate all candidates. For each candidate:
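
The per-candidate ACVS procedure is not reproduced in this note. The sketch below therefore covers only the RCS ranking from step 1 and the γ gate from step 2. It assumes the score is the negated mean squared re-denoising error (so that higher is better) and that the gate compares the best candidate's score against γ; both are assumptions, as are the `redenoise` callable, the timestep range, and the function names.

```python
import numpy as np


def rcs(candidate, redenoise, rng, k=4):
    """Negative mean re-denoising error of one action chunk over k random timesteps.

    Negating the error is an assumption made here so that higher means more
    consistent; the note only says the error is averaged.
    """
    errors = []
    for _ in range(k):
        t = rng.uniform(0.1, 0.9)  # hypothetical timestep range
        noise = rng.standard_normal(candidate.shape)
        noisy = (1.0 - t) * noise + t * candidate  # re-noise the candidate
        recovered = redenoise(noisy, t)  # hypothetical: model's clean-action estimate
        errors.append(np.mean((recovered - candidate) ** 2))
    return -float(np.mean(errors))


def select_action(candidates, redenoise, gamma, rng, k=4):
    """Return (best_index, scores, needs_rectification)."""
    scores = [rcs(c, redenoise, rng, k) for c in candidates]
    best = int(np.argmax(scores))
    # Assumed gate: compare the best candidate's score against gamma.
    needs_rectification = scores[best] < gamma
    return best, scores, needs_rectification
```

With this sign convention every score is at most zero, so γ would be a negative number. When `needs_rectification` is true, control would pass to the simulator-based rectification described in step 2.
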
τ₀-WM achieves the highest average success rate across AGIBOT-G01, ARX, and dual-arm Franka on long-horizon, fine-grained manipulation tasks (Toolbox, School Bag, Faucet, Badminton).
Heterogeneous pre-training improves zero-shot average success from 14% to 55%.
Test-time computation (TTC) improves single-attempt success from 43% to 60%.
1. Tactile sensing: Many dexterous tasks require tactile feedback beyond vision; incorporating tactile sensing may improve contact-rich interactions (insertion, fastening, deformable objects)
2. Uncertainty estimation: Better uncertainty estimation and longer-horizon evaluation could further improve action selection in difficult states
3. Longer horizons: Extending predictive modeling to longer temporal horizons for more complex manipulation scenarios
4. Faucet task: Remains challenging for all methods because of strict alignment constraints; the task is far from saturated
- Directly advances video diffusion world models for robotics (relevant to Sora/VDM, DeepMind's Genie/Genesis, Voyager)
- Wan2.2-TI2V-5B: Wan video generation models
---
*Note generated from arXiv:2606.01027 (31 May 2026)*