VERA: Turning Video Models into Generalist Robot Policies

Authors: Sizhe Lester Li, Evan Kim, Xingjian Bai, Tong Zhao, Tao Pang, Max Simchowitz, Vincent Sitzmann (MIT CSAIL)
arXiv: 2605.27817
Submitted: 26 May 2026
Conference: CoRL 2026
Categories: cs.RO, cs.AI, cs.CV, cs.LG
Project: vera.csail.mit.edu

Abstract

Video generative models have emerged as a promising robotics backbone. Rather than jointly fine-tuning video models with action-labeled data (as in robot foundation models), this paper decouples video planning from action prediction: the video planner is left unchanged while an embodiment-specific inverse dynamics model (IDM) is trained separately. The approach, called VERA (Video-to-Embodied Robot Action Model), combines an action-free video world model with a Jacobian-based IDM that maps pixel motion to actions. VERA achieves strong performance across simulated and real-world benchmarks including zero-shot Panda arm manipulation and 16-DoF Allegro-hand dexterous cube re-orientation.

Key Contributions

- Decoupled video planning + IDM architecture: Unlike robot foundation models that jointly predict observations and actions, VERA keeps the video world model frozen and learns a separate embodiment-specific inverse dynamics model (IDM) based on the robot embodiment Jacobian.

- Jacobian-IDM (J-IDM) design: Predicts actions by inverting a learned tangent map between action perturbations and pixel motion. The J-IDM is data-efficient (works with limited action data) and scales to high-dimensional action spaces (16-DoF dexterous hands).

- Cross-embodiment generalization: The same frozen video planner can be paired with different embodiment-specific J-IDMs to control different robots, without retraining the video model.

Method Details

Video World Model

The video planner is treated as a frozen, action-free world model that generates future video frames from current observations. The paper does not name the video model used, but Appendix A.1 describes the architecture: a video tokenizer and decoder, a diffusion-forcing training objective, and multi-view formatting with temporal context and look-ahead.

Jacobian Inverse Dynamics Model (J-IDM)

The core technical contribution is the J-IDM, which maps predicted pixel motions to robot actions:

1. Embodiment Jacobian: Models the relationship between action perturbations (joint velocity changes) and resulting pixel motion in the image space.
2. Image-space Jacobian field: Learned function that predicts how pixel regions will change given action perturbations.
3. Forward-to-Inverse conversion: Instead of training an IDM directly on action labels, the method first trains a forward model (action → pixel motion), then analytically inverts it via the Jacobian to get actions from observed pixel motion.
4. Training objective: Joint forward-inverse training with consistency losses ensuring the forward and inverse mappings are coherent.
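
The forward-to-inverse step can be illustrated with a minimal NumPy sketch. It assumes a locally linear model in which the pixel motion at each location is approximately the Jacobian field applied to an action perturbation, and it recovers the perturbation by damped least squares. The function name `invert_jacobian_field`, the array shapes, and the damping term are illustrative assumptions, not details taken from the paper, whose actual inversion procedure may differ.

```python
import numpy as np


def invert_jacobian_field(jacobian, pixel_flow, damping=1e-3):
    """Recover an action perturbation from per-pixel motion.

    Hypothetical shapes (assumptions for this sketch):
      jacobian:   (H, W, 2, A) sensitivity of each pixel's 2D motion to each action dim
      pixel_flow: (H, W, 2)    pixel motion extracted from predicted video frames
    Returns an array of shape (A,).
    """
    num_actions = jacobian.shape[-1]
    # Stack every pixel's 2D motion equation into one tall linear system J @ a = f.
    J = jacobian.reshape(-1, num_actions)
    f = pixel_flow.reshape(-1)
    # Damped normal equations: (J^T J + damping * I) a = J^T f.
    JtJ = J.T @ J + damping * np.eye(num_actions)
    return np.linalg.solve(JtJ, J.T @ f)


# Usage on synthetic data: generate flow from a known action, then invert it.
rng = np.random.default_rng(0)
jacobian = rng.normal(size=(8, 8, 2, 16))      # 16 action dims, e.g. a dexterous hand
true_action = rng.normal(size=16)
flow = np.einsum("hwca,a->hwc", jacobian, true_action)
recovered = invert_jacobian_field(jacobian, flow, damping=1e-9)
print("max abs error:", np.abs(recovered - true_action).max())
```

Because every pixel contributes two equations, the system is heavily overdetermined even for a 16-dimensional action, which is one plausible reading of why a Jacobian-based inverse could remain well conditioned in high-dimensional action spaces.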

Closed-Loop Policy

VERA operates in closed loop: the video model generates future frames, the J-IDM extracts actions from the predicted pixel motion, and the robot executes those actions and then re-observes the scene, so the planner replans from the new observation at every step of test-time execution.
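
The plan, invert, execute, re-observe cycle can be sketched as a short Python loop. Everything here is a hypothetical stand-in: `run_closed_loop`, its callable arguments, and the one-dimensional `ToyRobot` are assumptions made for illustration, with scalars playing the role of video frames. It is not the authors' implementation.

```python
def run_closed_loop(observe, plan_video, estimate_motion, idm, execute, num_steps):
    """Replan from a fresh observation at every step (all callables are hypothetical)."""
    executed = []
    for _ in range(num_steps):
        obs = observe()                                  # current observation
        predicted = plan_video(obs)                      # frozen planner: future "frames"
        motion = estimate_motion(obs, predicted[0])      # motion toward the next frame
        action = idm(obs, motion)                        # embodiment-specific inverse dynamics
        execute(action)                                  # act, then loop back and re-observe
        executed.append(action)
    return executed


class ToyRobot:
    """1D stand-in for a robot: position changes by gain * action."""

    def __init__(self, gain=2.0):
        self.position = 0.0
        self.gain = gain

    def observe(self):
        return self.position

    def execute(self, action):
        self.position += self.gain * action


robot = ToyRobot()
goal = 1.0
actions = run_closed_loop(
    observe=robot.observe,
    plan_video=lambda obs: [obs + 0.5 * (goal - obs)],   # planner moves halfway to the goal
    estimate_motion=lambda obs, nxt: nxt - obs,          # "pixel motion" between frames
    idm=lambda obs, motion: motion / robot.gain,         # invert the known forward gain
    execute=robot.execute,
    num_steps=10,
)
print(robot.position)
```

Swapping `idm` and `execute` for a different embodiment while leaving `plan_video` untouched mirrors, in miniature, the cross-embodiment setup described above.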

Key Results

| Benchmark | VERA Performance | Baseline (UniPi*) |
|-----------|-----------------|-------------------|
| Allegro-Sim (16-DoF dexterous) | High success rate | 0.0% success |
| Panda-Sim MimicGen (7-DoF) | High success rate | 0.0% success |
| PushT-Sim | High success rate | 0.0% success |

EgoBench results (from the paper's broader benchmarking context): The best video-MLLM agent achieves 30.62% accuracy in the best-performing scenario, averaging 19.43% across four scenarios. The paper focuses on VERA's performance relative to UniPi*-style direct inverse-dynamics baselines, which VERA significantly outperforms.

VERA achieves zero-shot Panda arm manipulation (unseen task prompts at test time) and generalizes across embodiments by swapping the J-IDM while keeping the video planner fixed.

Limitations and Future Work

- Video model dependency: VERA's performance is bounded by the video planner's quality; if the video model generates physically implausible futures, the J-IDM has no way to correct them.

- IDM per-embodiment requirement: While the video planner is shared, a new J-IDM must be trained for each robot embodiment, requiring some action-labeled data for each.

- Occlusion sensitivity: The J-IDM relies on pixel-level correspondence, making it potentially sensitive to occlusions and viewpoint changes not seen during training.

- Limited to manipulation: The method is evaluated on manipulation tasks; generalization to navigation or locomotion is not demonstrated.

- Future work: Scaling to more embodiments, combining with language-conditioned video planners, and improving robustness under long-horizon occlusions.

Relevance to Patrick's Research

This paper is directly relevant to the Voyager/NVIDIA and DeepMind Genie/Genesis lines in Patrick's world model tracking. VERA demonstrates that a decoupled video world model plus IDM is a viable alternative to end-to-end robot foundation models, which is an important architectural finding. The Jacobian-inversion approach to mapping video predictions to actions is novel, and the cross-embodiment generalization result (same video planner, different IDMs) is architecturally interesting. The zero-shot manipulation result on a real Panda arm is a strong sim-to-real demonstration.