arXiv: 2606.06556
Authors: Elis Karcini, Faisal Mehrban, Quang Nguyen, Mac Schwager, Arash Ajoudani, Cesar Cadena, Jan Peters, Marco Hutter, Haitham Bou-Ammar
Submitted: 4 June 2026
Categories: cs.RO, cs.AI

Generalist robot intelligence is often framed as a policy-scaling problem: collect more robot demonstrations, train larger Vision-Language-Action (VLA) models, and expect broader generalisation. This position paper argues this framing is incomplete. The central bottleneck is not only policy learning, but the absence of mechanisms that convert the world's abundant unstructured behavioural data into grounded robot supervision.
- Identifies four missing components for next-generation robotics:
  1. Data interfaces for autolabelling unstructured behaviour
  2. Embodiment interfaces for retargeting human motion to robot actions
  3. World-model interfaces for physics-grounded 3D reasoning
  4. Reward interfaces for inferring task progress and success from video and language
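
To make the four components concrete, the following is a minimal sketch of how they might compose into a pipeline that turns one unstructured clip into a robot supervision sample. It is an illustration only and is not taken from the paper: every name here (`Clip`, `SupervisionSample`, the four `Protocol` classes, `to_supervision`, and the `Toy*` stand-ins) is hypothetical, and the toy implementations are placeholders for what would in practice be learned models or simulators.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass
class Clip:
    """Unstructured behavioural data (hypothetical container)."""
    frames: Sequence[str]           # placeholder frame identifiers
    human_motion: Sequence[float]   # placeholder 1-D human motion trace


@dataclass
class SupervisionSample:
    """Grounded robot supervision derived from one clip (hypothetical)."""
    task_label: str
    robot_actions: list[float]
    physically_plausible: bool
    reward: float


# One Protocol per interface named in the list above (all hypothetical).
class DataInterface(Protocol):
    def autolabel(self, clip: Clip) -> str: ...


class EmbodimentInterface(Protocol):
    def retarget(self, human_motion: Sequence[float]) -> list[float]: ...


class WorldModelInterface(Protocol):
    def is_plausible(self, robot_actions: Sequence[float]) -> bool: ...


class RewardInterface(Protocol):
    def score(self, clip: Clip, task_label: str) -> float: ...


def to_supervision(
    clip: Clip,
    data: DataInterface,
    body: EmbodimentInterface,
    world: WorldModelInterface,
    reward: RewardInterface,
) -> SupervisionSample:
    """Chain the four interfaces: label, retarget, physics check, reward."""
    label = data.autolabel(clip)
    actions = body.retarget(clip.human_motion)
    return SupervisionSample(
        task_label=label,
        robot_actions=actions,
        physically_plausible=world.is_plausible(actions),
        reward=reward.score(clip, label),
    )


# Toy stand-ins so the sketch runs end to end; none of this is a real model.
class ToyLabeller:
    def autolabel(self, clip: Clip) -> str:
        return "pick up cup" if clip.frames else "unknown"


class ToyRetargeter:
    def __init__(self, scale: float = 0.5) -> None:
        self.scale = scale  # assumed human-to-robot motion scale

    def retarget(self, human_motion: Sequence[float]) -> list[float]:
        return [self.scale * x for x in human_motion]


class ToyWorldModel:
    def __init__(self, limit: float = 1.0) -> None:
        self.limit = limit  # assumed per-step action magnitude limit

    def is_plausible(self, robot_actions: Sequence[float]) -> bool:
        return all(abs(a) <= self.limit for a in robot_actions)


class ToyReward:
    def score(self, clip: Clip, task_label: str) -> float:
        return 0.0 if task_label == "unknown" else 1.0


clip = Clip(frames=["f0", "f1"], human_motion=[0.5, 1.0, -2.0])
sample = to_supervision(clip, ToyLabeller(), ToyRetargeter(), ToyWorldModel(), ToyReward())
print(sample)
```

Keeping each interface behind its own `Protocol` means any one stage can be swapped (for example, a different retargeting method) without touching the others, which mirrors the summary's framing of the four components as separable gaps.
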
This is a position paper that synthesizes existing work rather than proposing a new model. The authors analyze why current VLA and world-model approaches are insufficient for generalist robot intelligence. They argue that human motion, internet video, simulation rollouts, and interactive demonstrations contain rich task information but lack the mechanisms needed to convert that information into grounded robot supervision.
The paper reviews the state of the art in each of the four areas and articulates what each is still missing.
As a position/survey paper, this work does not report experimental metrics. It provides a research agenda and framework for understanding the gaps in current robot learning approaches.
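The paper explicitly calls out the need for future work in each of the four interface areas it identifies: data, embodiment, world-model, and reward interfaces.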
This position paper is highly relevant to world-model research because it explicitly argues that current world models are missing key components needed for robotics. The authors identify world-model interfaces as one of four critical gaps, stating that existing world models lack the physics-grounded 3D reasoning capabilities needed for reliable robot manipulation. This aligns with the broader view that world models need to incorporate physical reasoning and embodiment constraints.