τ₀-WM: A Unified Video-Action World Model for Robotic Manipulation

arXiv: 2606.01027

Authors: Pengfei Zhou, Shengcong Chen, Di Chen, Jiaxu Wang, Rongjun Jin, Bingwen Zhu, Yike Pan, Songen Gu, Kuanning Wang, Shufeng Nan, Xingyu Qiu, Chenhao Qiu, Pu Yang, Yunuo Cai, Jianxiong Gao, Yifan Li, Yanwei Fu, Xiangyu Yue, Zhi Chen, Jianlan Luo

Affiliation: Shanghai Innovation Institute, AGIBOT Finch

Submitted: 31 May 2026

Abstract

Robotic manipulation requires models that generate executable actions while anticipating and evaluating their future consequences before physical execution. τ₀-WM is a unified video-action world model that integrates policy learning, video prediction, and action evaluation within a single future-predictive framework. Built on a shared video diffusion backbone, τ₀-WM provides two complementary interfaces: a Video Action Model (VAM) that jointly predicts future visual latents and continuous action chunks, and an Action-Conditioned Video Simulator (ACVS) that rolls out candidate action chunks into multi-view futures and predicts dense task-progress scores. The model is trained on ~27,300 hours of heterogeneous data spanning real-robot teleoperation, UMI-style demonstrations, egocentric human videos, and rollout/failure trajectories. At inference, it uses test-time computation to sample, rank, and rectify actions.

Key Contributions

1. Unified video-action architecture: Shares a predictive representation across policy learning (VAM) and action-conditioned simulation (ACVS), both built on a Wan2.2-TI2V-5B video diffusion backbone
2. Heterogeneous data training: Jointly trains on real-robot teleoperation (17.8K hours on AGIBOT-G01, ARX, dual-arm Franka), UMI-style demonstrations (6.5K hours), and egocentric human videos (3.0K hours) using modality-specific supervision masks
3. Test-time proposal-evaluation-revision: Samples multiple action candidates, ranks them with Re-denoising Consistency Score (RCS), and invokes ACVS for simulator-based rectification of low-quality candidates

Method

Architecture

- Base: Wan2.2-TI2V-5B video diffusion backbone (5B video DiT)

VAM (Video Action Model): 5.5B parameters total (5B video + 0.5B action decoder DiT). Consists of:

- Wan VAE encoding multi-view observations into latent canvas
- Video branch predicting future visual latents through conditional denoising
- Action branch (0.5B DiT-style decoder) coupled to video transformer via cross-attention at matched stages

ACVS (Action-Conditioned Video Simulator): Reuses the video backbone without the action decoder. It conditions future latent prediction on candidate action chunks: action blocks are projected through MLPs and injected into the diffusion-time and AdaLN modulation embeddings.

Parameter counts: ~5.5B for VAM and ~5B for ACVS, which shares the video backbone
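
The action-conditioning path described for ACVS can be illustrated with a toy, pure-Python sketch. It is not the paper's implementation: the layer sizes, the two-layer MLP, the choice to sum the action embedding with the timestep embedding, and every function name here are assumptions made for illustration. The `(1 + scale) * norm(x) + shift` form is the usual DiT-style AdaLN modulation.

```python
import math
import random

def linear(x, w, b):
    """Dense layer: y[j] = sum_i x[i] * w[i][j] + b[j]."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) + b[j] for j in range(len(b))]

def silu(v):
    return [a / (1.0 + math.exp(-a)) for a in v]

def layer_norm(v, eps=1e-6):
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def init(rng, n_in, n_out, scale=0.3):
    """Random weights and zero bias for a toy dense layer."""
    w = [[rng.gauss(0.0, scale) for _ in range(n_out)] for _ in range(n_in)]
    return w, [0.0] * n_out

def embed_actions(chunk, mlp):
    """Flatten an action chunk and project it through a two-layer MLP."""
    (w1, b1), (w2, b2) = mlp
    flat = [a for step in chunk for a in step]
    return linear(silu(linear(flat, w1, b1)), w2, b2)

def adaln_modulate(token, cond, w_mod, b_mod):
    """LayerNorm the token, then apply the shift and scale predicted from cond."""
    d = len(token)
    mod = linear(silu(cond), w_mod, b_mod)  # 2*d values: shift first, then scale
    shift, scale = mod[:d], mod[d:]
    return [n * (1.0 + s) + sh for n, s, sh in zip(layer_norm(token), scale, shift)]

rng = random.Random(0)
D, ACT_DIM, CHUNK = 8, 3, 4  # hypothetical sizes, far smaller than the real model
mlp = (init(rng, ACT_DIM * CHUNK, D), init(rng, D, D))
w_mod, b_mod = init(rng, D, 2 * D)

time_emb = [math.sin(0.5 * (i + 1)) for i in range(D)]  # stand-in timestep embedding
token = [rng.gauss(0.0, 1.0) for _ in range(D)]          # one video latent token

def condition(chunk):
    # Assumption: the action embedding is summed with the timestep embedding.
    return [t + a for t, a in zip(time_emb, embed_actions(chunk, mlp))]

out_a = adaln_modulate(token, condition([[0.1, 0.0, -0.2]] * CHUNK), w_mod, b_mod)
out_b = adaln_modulate(token, condition([[-0.3, 0.4, 0.2]] * CHUNK), w_mod, b_mod)
print(out_a != out_b)  # compares the same token modulated by two different chunks
```

Because the action chunk only enters through the modulation vector, the same backbone weights can serve both interfaces; swapping the candidate chunk changes the predicted future without touching the token pathway.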

Training

- Objective: Joint flow-matching for both video latents and action chunks

Supervision masks: Heterogeneous data sources contribute only to losses supported by their modalities (robot data → both video + action; egocentric video → video only)

Data composition:

- 17.8K hours real-robot teleoperation (AGIBOT-G01, ARX, dual-arm Franka)
- 6.5K hours UMI-style demonstrations (Gen-DAS Grippers)
- 3.0K hours egocentric human videos (EgoDEX, EgoVerse, Xperience-10m)

Failure data: Rollout and failure trajectories are used to construct rewards, teaching ACVS to identify unsuccessful outcomes
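
A minimal sketch of how the joint flow-matching objective and the supervision masks could fit together is given below. It assumes the common linear-path convention (noise at t=0, data at t=1, velocity target `data - noise`) and one shared noise level per sample; neither detail is confirmed by this note. The `MASKS` table, `joint_loss`, and `zero_model` are hypothetical names, and the UMI mask is omitted because the note does not state it.

```python
import random

# Loss masks per data source. The robot and egocentric rows follow the note;
# UMI is left out because the note does not state its mask.
MASKS = {
    "robot": {"video": 1.0, "action": 1.0},
    "ego": {"video": 1.0, "action": 0.0},
}

def interpolate(noise, data, t):
    """Linear path from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]

def target_velocity(noise, data):
    """Velocity of the linear path, constant in t."""
    return [d - n for n, d in zip(noise, data)]

def mse(pred, target):
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

def joint_loss(sample, predict, rng):
    """Per-branch flow-matching losses, zeroed where the source lacks the modality."""
    mask = MASKS[sample["source"]]
    t = rng.random()  # one shared noise level for both branches (an assumption)
    terms = {}
    for branch in ("video", "action"):
        if mask[branch] == 0.0:
            terms[branch] = 0.0  # unsupported modality contributes nothing
            continue
        data = sample[branch]
        noise = [rng.gauss(0.0, 1.0) for _ in data]
        v_pred = predict(branch, interpolate(noise, data, t), t)
        terms[branch] = mask[branch] * mse(v_pred, target_velocity(noise, data))
    terms["total"] = terms["video"] + terms["action"]
    return terms

def zero_model(branch, x_t, t):
    """Placeholder network that always predicts zero velocity."""
    return [0.0] * len(x_t)

rng = random.Random(0)
robot = {"source": "robot", "video": [0.3, -0.1, 0.8, 0.2], "action": [0.05, -0.02]}
ego = {"source": "ego", "video": [0.1, 0.4, -0.6, 0.0]}  # no action labels

robot_terms = joint_loss(robot, zero_model, rng)
ego_terms = joint_loss(ego, zero_model, rng)
print(robot_terms)
print(ego_terms)
```

The egocentric sample never touches the action branch, so video-only data can be mixed into the same batch without inventing action labels.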

Inference: Test-Time Computation

Coarse-to-fine strategy:

1. Re-denoising Consistency Score (RCS): Samples N action candidates, re-noises each at K random timesteps, and measures the average re-denoising error. The candidate with the highest RCS is selected (a higher score corresponds to a lower re-denoising error), i.e. the one most consistent with the learned conditional action manifold

2. Low-quality Action Rectification (LAR): When RCS falls below threshold γ, ACVS is invoked to evaluate all candidates:
   - For each candidate, predicts an imagined future rollout and a dense reward trajectory
   - For each candidate, computes the rollout value J = max_q r̂_{t+q}
   - Selects the highest-value rollout and converts it to future conditioning
   - Re-queries VAM conditioned on the selected future to produce a refined action
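
The coarse-to-fine procedure can be sketched end to end with toy stand-ins. Everything below is illustrative: `toy_denoise`, `toy_acvs`, and `toy_vam` are hypothetical placeholders for the real networks, the score is taken as the negative mean re-denoising error (an assumption consistent with "higher is better"), and the noise-level range, N, K, and γ are arbitrary.

```python
import random

MU = [0.5, -0.2]  # toy "expert" action standing in for the learned action manifold

def toy_denoise(x_t, t, s2=0.05):
    """Posterior-mean denoiser for a Gaussian toy prior N(MU, s2); not the real VAM."""
    gain = t * s2 / (t * t * s2 + (1.0 - t) ** 2)
    return [m + gain * (x - t * m) for x, m in zip(x_t, MU)]

def rcs(action, denoise, rng, k=8):
    """Negative mean re-denoising error over k random noise levels."""
    total = 0.0
    for _ in range(k):
        t = rng.uniform(0.2, 0.8)
        noise = [rng.gauss(0.0, 1.0) for _ in action]
        x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]  # re-noise
        recon = denoise(x_t, t)                                       # re-denoise
        total += sum((r - a) ** 2 for r, a in zip(recon, action)) / len(action)
    return -total / k

def toy_acvs(action, horizon=5):
    """Stand-in simulator: returns (imagined future, dense progress scores)."""
    dist = sum((a - m) ** 2 for a, m in zip(action, MU)) ** 0.5
    peak = 1.0 / (1.0 + dist)
    return {"from_action": action}, [peak * (q + 1) / horizon for q in range(horizon)]

def toy_vam(rng, future=None, spread=1.0):
    """Stand-in policy: broad samples without a future, refined ones with it."""
    if future is None:
        return [m + rng.gauss(0.0, spread) for m in MU]
    # Purely illustrative refinement: pull the imagined action halfway toward MU.
    return [0.5 * (a + m) for a, m in zip(future["from_action"], MU)]

def rollout_value(rewards):
    """J = max over the predicted progress scores of one rollout."""
    return max(rewards)

def select_action(rng, n=6, gamma=-0.05):
    candidates = [toy_vam(rng) for _ in range(n)]
    scores = [rcs(a, toy_denoise, rng) for a in candidates]
    best = max(range(n), key=scores.__getitem__)
    if scores[best] >= gamma:
        return candidates[best], "rcs"  # coarse stage is confident enough
    # LAR: simulate every candidate, keep the best imagined future, re-query the policy.
    rollouts = [toy_acvs(a) for a in candidates]
    values = [rollout_value(rewards) for _, rewards in rollouts]
    top = max(range(n), key=values.__getitem__)
    return toy_vam(rng, future=rollouts[top][0]), "lar"

action, stage = select_action(random.Random(0))
print(stage, action)
```

The design point the sketch tries to capture is cost: RCS only needs extra passes through the action branch, while the more expensive simulator rollouts are reserved for states where no candidate scores above γ.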

Key Results

Main Results (4 Tasks, 3 Embodiments)

τ₀-WM achieves the highest average success rate across AGIBOT-G01, ARX, and dual-arm Franka on long-horizon, fine-grained manipulation tasks (Toolbox, School Bag, Faucet, Badminton).

Pre-training Data Ablation (Table I)

| Setting | Zero-shot, Pen→Holder (clean / clutter / avg) | SFT, Object-wipe-place (clean / clutter / avg) |
|---------|------------------------|-------------------------|
| Robot only | 0.22 / 0.06 / 0.14 | 0.85 / 0.55 / 0.70 |
| Robot+UMI+Ego | 0.56 / 0.53 / 0.55 | 0.90 / 0.75 / 0.83 |

Heterogeneous pre-training raises average zero-shot success from 14% to 55%, and average SFT success from 70% to 83%.
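
As a quick arithmetic check on Table I, the snippet below recomputes the averages from the clean and clutter columns and also derives the clean-to-clutter drop, which the table implies but does not list. Half-up rounding to two decimals is an assumption; it happens to reproduce the reported averages. The dictionary layout and helper names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# (clean, clutter) success rates copied from Table I.
TABLE_I = {
    ("Robot only", "zero-shot"): ("0.22", "0.06"),
    ("Robot only", "SFT"): ("0.85", "0.55"),
    ("Robot+UMI+Ego", "zero-shot"): ("0.56", "0.53"),
    ("Robot+UMI+Ego", "SFT"): ("0.90", "0.75"),
}

def average(clean, clutter):
    """Mean of the two scene types, rounded half-up to two decimals."""
    mean = (Decimal(clean) + Decimal(clutter)) / 2
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

averages = {key: average(*rates) for key, rates in TABLE_I.items()}
clutter_drop = {key: Decimal(c) - Decimal(cl) for key, (c, cl) in TABLE_I.items()}

for key in TABLE_I:
    print(key, "avg:", averages[key], "clean-to-clutter drop:", clutter_drop[key])
```

Read this way, the zero-shot gain is largely a clutter-robustness gain: the robot-only model loses 0.16 going from clean to cluttered scenes, while the heterogeneous model loses 0.03.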

Test-Time Computation (Table II)

| Variant | Tissue→Box | Pen→Box | Avg |
|---------|------------|---------|-----|
| w/o TTC | 0.55 | 0.30 | 0.43 |
| w. CFG | 0.25 | 0.15 | 0.20 |
| w. ACG | 0.40 | 0.35 | 0.38 |
| w. RCS | 0.65 | 0.35 | 0.50 |
| w. RCS + LAR | 0.70 | 0.50 | 0.60 |

The full TTC pipeline (RCS + LAR) improves average single-attempt success from 43% to 60%.
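
The Avg column of Table II and each variant's change relative to the no-TTC baseline can be recomputed from the two task columns. This is only a bookkeeping sketch: the half-up rounding rule is assumed (it matches the reported values), and the variable names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# (Tissue→Box, Pen→Box) success rates copied from Table II.
TABLE_II = {
    "w/o TTC": ("0.55", "0.30"),
    "w. CFG": ("0.25", "0.15"),
    "w. ACG": ("0.40", "0.35"),
    "w. RCS": ("0.65", "0.35"),
    "w. RCS + LAR": ("0.70", "0.50"),
}

def mean2(a, b):
    """Two-task mean, rounded half-up to two decimals like the reported Avg column."""
    return ((Decimal(a) + Decimal(b)) / 2).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

avg = {name: mean2(*rates) for name, rates in TABLE_II.items()}
gain = {name: value - avg["w/o TTC"] for name, value in avg.items()}

for name in TABLE_II:
    print(f"{name:14s} avg={avg[name]}  vs. baseline={gain[name]:+}")
```

On these numbers, the two guidance baselines (CFG, ACG) fall below the no-TTC average, RCS alone adds 0.07, and adding LAR brings the total gain to 0.17, most of it on Pen→Box.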

Limitations and Future Work

1. Tactile sensing: Many dexterous tasks require tactile feedback beyond vision; incorporating tactile sensing may improve contact-rich interactions (insertion, fastening, deformable objects)
2. Uncertainty estimation: Better uncertainty estimation and longer-horizon evaluation could further improve action selection in difficult states
3. Longer horizons: Extending predictive modeling to longer temporal horizons for more complex manipulation scenarios
4. Faucet task remains challenging for all methods (strict alignment constraints); the task is far from saturated

Relevance to Patrick's Research

- Directly advances video diffusion world models for robotics (relevant to Sora/VDM, DeepMind's Genie/Genesis, Voyager)
- Novel architecture unifying action generation + video prediction + action evaluation in a single framework
- Strong emphasis on test-time reasoning: the proposal-evaluation-rectification loop is a key pattern for world model deployment
- Training on heterogeneous data (robot + UMI + egocentric video) demonstrates a data diversity strategy for world models
- Authors from AGIBOT Finch (Chinese robotics lab); homepage: https://finch.agibot.com/research/tau0-wm

References

- Wan2.2-TI2V-5B: Wan video generation models
- Cosmos World Foundation Model (NVIDIA)
- π₀.5: Vision-language-action model with open-world generalization
- Fast-WAM: Do world action models need test-time future imagination?
- ACG: Action coherence guidance for flow-based VLA models

---

*Note generated from arXiv:2606.01027 (31 May 2026)*