Authors: Sizhe Lester Li, Evan Kim, Xingjian Bai, Tong Zhao, Tao Pang, Max Simchowitz, Vincent Sitzmann (MIT CSAIL)
arXiv: 2605.27817
Submitted: 26 May 2026
Conference: CoRL 2026
Categories: cs.RO, cs.AI, cs.CV, cs.LG
Project: vera.csail.mit.edu
Video generative models have emerged as a promising robotics backbone. Rather than jointly fine-tuning video models with action-labeled data (as in robot foundation models), this paper decouples video planning from action prediction: the video planner is left unchanged while an embodiment-specific inverse dynamics model (IDM) is trained separately. The approach, called VERA (Video-to-Embodied Robot Action Model), combines an action-free video world model with a Jacobian-based IDM that maps pixel motion to actions. VERA achieves strong performance across simulated and real-world benchmarks including zero-shot Panda arm manipulation and 16-DoF Allegro-hand dexterous cube re-orientation.
- Decoupled video planning + IDM architecture: Unlike robot foundation models that jointly predict observations and actions, VERA keeps the video world model frozen and learns a separate embodiment-specific inverse dynamics model (IDM) based on the robot embodiment Jacobian.
1. Embodiment Jacobian: Models the relationship between action perturbations (joint velocity changes) and resulting pixel motion in the image space.
2. Image-space Jacobian field: Learned function that predicts how pixel regions will change given action perturbations.
3. Forward-to-Inverse conversion: Instead of training an IDM directly on action labels, the method first trains a forward model (action → pixel motion), then analytically inverts it via the Jacobian to get actions from observed pixel motion.
4. Training objective: Joint forward-inverse training with consistency losses ensuring the forward and inverse mappings are coherent.
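The forward-to-inverse conversion can be illustrated with a minimal sketch. It assumes the Jacobian is already available as an explicit matrix with one row per flow component and one column per action dimension; in the paper it comes from a learned image-space field, which is not modeled here. The function names, the damping term, and the toy numbers are all hypothetical choices for illustration, not the authors' implementation.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def predict_flow(jacobian: Matrix, action: Vector) -> Vector:
    """Forward model: pixel motion = J @ action (one row per flow component)."""
    return [sum(j * a for j, a in zip(row, action)) for row in jacobian]


def solve_linear(a: Matrix, b: Vector) -> Vector:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system; increase damping")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def invert_flow(jacobian: Matrix, flow: Vector, damping: float = 1e-9) -> Vector:
    """Inverse model: damped least squares, (J^T J + damping * I) a = J^T flow.

    The damping term is an assumption added for numerical safety.
    """
    n = len(jacobian[0])
    jtj = [
        [sum(row[i] * row[j] for row in jacobian) + (damping if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    jtf = [sum(row[i] * f for row, f in zip(jacobian, flow)) for i in range(n)]
    return solve_linear(jtj, jtf)


# Toy example: two pixels (x and y flow each) driven by a 2-DoF action.
J = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [0.5, -1.0]]
true_action = [0.3, -0.2]
observed_flow = predict_flow(J, true_action)
recovered_action = invert_flow(J, observed_flow)
```

Because the toy flow is generated by the same linear forward model, the recovered action matches the true action up to the small bias introduced by the damping term. With real video predictions the flow would be noisy, and the least-squares solution would be the best fit rather than an exact inverse.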
EgoBench results (from the paper's broader benchmarking context): The best video-MLLM agent achieves 30.62% accuracy in the best-performing scenario, averaging 19.43% across four scenarios. The paper focuses on VERA's performance relative to UniPi*-style direct inverse-dynamics baselines, which VERA significantly outperforms.
VERA achieves zero-shot Panda arm manipulation (unseen task prompts at test time) and generalizes across embodiments by swapping the J-IDM while keeping the video planner fixed.
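The swap-the-IDM idea can be sketched as a small interface: one planner function shared across embodiments, and one inverse dynamics function per embodiment. Everything below is a hypothetical stand-in (the type aliases, `toy_planner`, `arm_idm`, `hand_idm`, and the list-of-floats "frames"); it only illustrates the decoupling, not VERA's actual models.

```python
from typing import Callable, List

Frame = List[float]   # stand-in for an image: coordinates of tracked points
Action = List[float]

VideoPlanner = Callable[[Frame, str], List[Frame]]
InverseDynamics = Callable[[Frame, Frame], Action]


def rollout(planner: VideoPlanner, idm: InverseDynamics,
            first_frame: Frame, prompt: str) -> List[Action]:
    """Plan future frames once, then decode an action for each frame pair."""
    frames = [first_frame] + planner(first_frame, prompt)
    return [idm(prev, nxt) for prev, nxt in zip(frames, frames[1:])]


def toy_planner(frame: Frame, prompt: str) -> List[Frame]:
    """Hypothetical planner: shifts every coordinate by +1 per step, 3 steps."""
    return [[x + step for x in frame] for step in (1, 2, 3)]


def arm_idm(prev: Frame, nxt: Frame) -> Action:
    """Hypothetical 1-DoF embodiment: action is the mean displacement."""
    deltas = [b - a for a, b in zip(prev, nxt)]
    return [sum(deltas) / len(deltas)]


def hand_idm(prev: Frame, nxt: Frame) -> Action:
    """Hypothetical 2-DoF embodiment: one joint per coordinate, gain 0.5."""
    return [(b - a) / 0.5 for a, b in zip(prev, nxt)]


# Same planner and prompt, two different embodiments.
arm_actions = rollout(toy_planner, arm_idm, [0.0, 0.0], "push cube")
hand_actions = rollout(toy_planner, hand_idm, [0.0, 0.0], "push cube")
# arm_actions  == [[1.0], [1.0], [1.0]]
# hand_actions == [[2.0, 2.0], [2.0, 2.0], [2.0, 2.0]]
```

The planner is never touched when the embodiment changes; only the function passed as `idm` differs, which mirrors the fixed-planner, swappable-IDM arrangement the summary describes.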
- Video model dependency: VERA's performance is bounded by the video planner's quality; if the video model generates physically implausible futures, the J-IDM has no way to correct them.
This paper is directly relevant to the Voyager/NVIDIA and DeepMind Genie/Genesis lines in Patrick's world model tracking. VERA demonstrates that a decoupled video world model plus IDM is a viable alternative to end-to-end robot foundation models, which is an important architectural finding. The Jacobian-inversion approach to mapping video predictions to actions is novel, and the cross-embodiment generalization result (same video planner, different IDMs) is architecturally interesting. The zero-shot manipulation result on a real Panda arm is a strong sim-to-real demonstration.