arXiv: 2606.05979
Authors: Yi Yang, Zhihong Liu, Siqi Kou, Yiyang Chen, Yanzhe Hu, Jianbo Zhou, Boyuan Zhao, Zhijie Wei, Xiao Xia, Xueqi Li, Pengfei Liu, Zhijie Deng
Submitted: 4 June 2026
Categories: cs.RO, cs.AI, cs.CL
WLA models are a new class of embodied foundation models that take textual instructions, images, and robot states as inputs and jointly predict textual subtasks, subgoal images, and robot actions. WLA combines a world-modeling interface, which lets it learn from extensive egocentric videos (as in WAMs), with the language reasoning capacity needed for complex long-horizon tasks (as in VLA models).
- Proposes WLA (World-Language-Action), a unified model that jointly predicts:
  - Textual subtasks
  - Subgoal images
  - Robot actions
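The input/output interface described above can be sketched as plain data types. This is only an illustration of the data flow: the class names `Observation` and `Prediction`, the field layouts, and `dummy_policy` are hypothetical and do not come from the paper, and the images are nested lists standing in for real tensors.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    """Hypothetical container for the three input modalities."""
    instruction: str              # textual instruction
    image: List[List[float]]      # camera image (placeholder nested list)
    robot_state: List[float]      # e.g. joint positions (assumed layout)


@dataclass
class Prediction:
    """Hypothetical container for the three jointly predicted outputs."""
    subtask: str                      # textual subtask
    subgoal_image: List[List[float]]  # predicted subgoal image (placeholder)
    action: List[float]               # robot action vector (assumed layout)


def dummy_policy(obs: Observation) -> Prediction:
    """Stand-in for a trained model; it only demonstrates the interface."""
    return Prediction(
        subtask=f"first step of: {obs.instruction}",
        subgoal_image=[row[:] for row in obs.image],  # copy, not a real forecast
        action=[0.0] * len(obs.robot_state),          # zero action as a placeholder
    )


obs = Observation(
    instruction="put the cup on the shelf",
    image=[[0.0, 0.0], [0.0, 0.0]],
    robot_state=[0.1, 0.2, 0.3],
)
pred = dummy_policy(obs)
```

A real model would replace `dummy_policy` with a learned network; the point here is only that one call takes all three inputs and returns all three outputs together.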
Architecture: Autoregressive Transformer backbone, in contrast to the diffusion-based backbones used by WAMs
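For readers less familiar with the term, autoregressive generation emits one token at a time, each conditioned on the whole prefix generated so far. The sketch below shows generic greedy decoding with a toy next-token function; `generate` and `toy_next_token` are hypothetical names, and nothing here reflects WLA's actual tokenization or decoder.

```python
from typing import Callable, List


def generate(next_token: Callable[[List[int]], int], prompt: List[int],
             max_new_tokens: int, eos: int) -> List[int]:
    """Greedy autoregressive decoding: each new token depends on the full prefix."""
    seq = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(seq)   # in a real model, a Transformer forward pass
        seq.append(tok)
        if tok == eos:          # stop once the end-of-sequence token appears
            break
    return seq


def toy_next_token(seq: List[int]) -> int:
    """Toy stand-in for a model: last token plus one, capped at 5."""
    return min(seq[-1] + 1, 5)


tokens = generate(toy_next_token, prompt=[1, 2], max_new_tokens=10, eos=5)
```

In a unified model of this kind, text, image, and action outputs would all have to be expressed as tokens in one such sequence; how WLA does that is not covered by this summary.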
Core Components:
Training: The model predicts "next state" comprising:
Inference:
Scale: WLA-0 has 2B active parameters.
WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities. It also shows promise in learning novel tasks directly from cross-embodiment robot videos without action annotations.
The paper does not explicitly discuss limitations. Potential future directions:
WLA represents an important architectural innovation: unifying world modeling, language reasoning, and action synthesis in a single autoregressive model. The key design choices are an autoregressive (AR) backbone in place of diffusion and a meta-query mechanism that allows world prediction to be toggled. The test-time scaling capability is particularly interesting for world model research.