Robots Need More than VLA and World Models

arXiv: 2606.06556
Authors: Elis Karcini, Faisal Mehrban, Quang Nguyen, Mac Schwager, Arash Ajoudani, Cesar Cadena, Jan Peters, Marco Hutter, Haitham Bou-Ammar
Submitted: 4 June 2026
Categories: cs.RO, cs.AI

Abstract

Generalist robot intelligence is often framed as a policy-scaling problem: collect more robot demonstrations, train larger Vision-Language-Action (VLA) models, and expect broader generalisation. This position paper argues this framing is incomplete. The central bottleneck is not only policy learning, but the absence of mechanisms that convert the world's abundant unstructured behavioural data into grounded robot supervision.

Key Contributions

- Identifies four missing components for next-generation robotics:
  1. Data interfaces for autolabelling unstructured behaviour
  2. Embodiment interfaces for retargeting human motion to robot actions
  3. World-model interfaces for physics-grounded 3D reasoning
  4. Reward interfaces for inferring task progress and success from video and language

- Surveys recent progress in robot foundation models, cross-embodiment datasets, learning from video, world models, and reward modelling

- Proposes a research agenda for building robotics systems that can learn from the broader physical world, not just robot demonstrations
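
One way to read the four components is as typed interfaces wrapped around a raw, unlabelled clip. The Python sketch below is illustrative only and is not taken from the paper: every class, method, and field name is hypothetical, the stubs return fixed placeholder values, and treating the world-model interface as a simple feasibility filter is an assumption made to keep the example short.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence


@dataclass
class Clip:
    """Unstructured behavioural data (hypothetical type); strings stand in for frames."""
    frames: Sequence[str]


@dataclass
class LabelledStep:
    """One step of grounded robot supervision (hypothetical type)."""
    task: str                # task semantics from the data interface
    action: Sequence[float]  # embodiment-specific action from retargeting
    reward: float            # task progress from the reward interface


class DataInterface(Protocol):
    def autolabel(self, clip: Clip) -> str: ...


class EmbodimentInterface(Protocol):
    def retarget(self, clip: Clip, t: int) -> Sequence[float]: ...


class WorldModelInterface(Protocol):
    def is_feasible(self, action: Sequence[float]) -> bool: ...


class RewardInterface(Protocol):
    def progress(self, clip: Clip, t: int, task: str) -> float: ...


def to_supervision(
    clip: Clip,
    data: DataInterface,
    body: EmbodimentInterface,
    world: WorldModelInterface,
    reward: RewardInterface,
) -> List[LabelledStep]:
    """Turn an unlabelled clip into labelled steps, dropping infeasible actions."""
    task = data.autolabel(clip)
    steps: List[LabelledStep] = []
    for t in range(len(clip.frames)):
        action = body.retarget(clip, t)
        if not world.is_feasible(action):  # assumed role of the world model
            continue
        steps.append(LabelledStep(task, action, reward.progress(clip, t, task)))
    return steps


# Placeholder stubs; a real system would put learned models behind each interface.
class StubData:
    def autolabel(self, clip: Clip) -> str:
        return "pick up cup"


class StubBody:
    def retarget(self, clip: Clip, t: int) -> Sequence[float]:
        return [0.1 * t, 0.0]


class StubWorld:
    def is_feasible(self, action: Sequence[float]) -> bool:
        return action[0] < 0.15  # arbitrary reach limit for the demo


class StubReward:
    def progress(self, clip: Clip, t: int, task: str) -> float:
        return (t + 1) / len(clip.frames)


if __name__ == "__main__":
    clip = Clip(frames=["f0", "f1", "f2"])
    for step in to_supervision(clip, StubData(), StubBody(), StubWorld(), StubReward()):
        print(step)
```

Keeping each component behind its own interface mirrors the paper's framing that these are separate gaps: any one of them could be swapped for a learned model without touching the others.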

Method Details

This is a position paper that synthesizes existing work rather than proposing a new model. The authors analyze why current VLA and world-model approaches are insufficient for generalist robot intelligence. They observe that human motion, internet video, simulation rollouts, and interactive demonstrations contain rich task information but lack:

- Embodiment-specific action labels

- Task semantics

- Reward structure

The paper reviews the state of the art in four areas and articulates what each area is missing.

Key Findings

As a position/survey paper, this work does not report experimental metrics. It provides a research agenda and framework for understanding the gaps in current robot learning approaches.

Limitations and Future Work

The paper explicitly calls out the need for future work in:

- Developing data interfaces that can autolabel unstructured behavioural data

- Creating embodiment interfaces that can retarget human motions to diverse robot morphologies

- Building world-model interfaces for physics-grounded 3D reasoning (beyond 2D video)

- Designing reward interfaces that can infer task progress from video and language without explicit reward functions

Relevance to Patrick's Research

This position paper is highly relevant to world-model research because it argues explicitly that current world models are missing key components needed for robotics. The authors identify world-model interfaces as one of four critical gaps, stating that existing world models lack the physics-grounded 3D reasoning needed for reliable robot manipulation. This aligns with the broader theme that world models need to incorporate physical reasoning and embodiment constraints.