We propose world-language-action (WLA) models as a new class of embodied foundation models. WLA takes textual instructions, images, and robot states as inputs and jointly predicts textual subtasks, subgoal images, and robot actions. It combines the *world modeling interface* of world-action models (WAMs), which allows learning from extensive egocentric videos, with the *language reasoning* capacities of vision-language-action (VLA) models, which are needed to solve complex long-horizon tasks. At the core of WLA lies an *autoregressive (AR) Transformer backbone*, instead of the bidirectional diffusion Transformer used in WAMs, which predicts the *next state*, comprising the *semantic-level* textual intention and the complementary *fine-grained* physical dynamics. The physical dynamics are supervised by a world modeling objective through a dedicated World Expert, and are leveraged to ease the characterization of the state-action correlation for the Action Expert. WLA uses meta-queries so that the world prediction influences action generation only *implicitly*, which allows the world prediction to be disabled during inference. The world prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, achieves 40 ms per inference on an NVIDIA RTX 5090. Evaluations across simulated and real-world environments demonstrate that WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities, e.g., a 92.94% success rate on RoboTwin2.0 Clean and a 56.5% success rate on RMBench. WLA-0 also shows promise for learning novel tasks directly from *cross-embodiment robot videos* without action annotations.
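The data flow described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the names `WLAInput`, `WLAOutput`, `ToyWLA`, and `predict_world`, as well as the arithmetic standing in for the backbone and the two experts, are hypothetical and are not taken from the WLA paper or any released code. The only property it tries to mirror is that the action path reads the meta-query features rather than the decoded subgoal image, so the world prediction can be skipped without changing the actions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class WLAInput:
    instruction: str            # textual task instruction
    images: List[List[float]]   # camera frames, reduced to small feature vectors here
    robot_state: List[float]    # e.g. joint positions


@dataclass
class WLAOutput:
    subtask: str                          # predicted textual subtask
    actions: List[List[float]]            # predicted action chunk
    subgoal_image: Optional[List[float]]  # None when world prediction is disabled


class ToyWLA:
    """Hypothetical stand-in for the WLA data flow; it has no learned weights."""

    def __init__(self, num_meta_queries: int = 4, action_dim: int = 2, horizon: int = 3) -> None:
        self.num_meta_queries = num_meta_queries
        self.action_dim = action_dim
        self.horizon = horizon

    def _backbone(self, x: WLAInput) -> Tuple[str, List[float]]:
        # Stand-in for the AR backbone: emits a subtask string and meta-query features.
        pixels = [v for image in x.images for v in image]
        pooled = sum(pixels) / len(pixels) if pixels else 0.0
        subtask = f"next subtask for: {x.instruction}"
        meta = [pooled + 0.1 * i for i in range(self.num_meta_queries)]
        return subtask, meta

    def _action_expert(self, meta: List[float], robot_state: List[float]) -> List[List[float]]:
        # Actions depend on the meta-query features and robot state only.
        context = sum(meta) / len(meta)
        base = robot_state[: self.action_dim]
        return [[s + context * (t + 1) for s in base] for t in range(self.horizon)]

    def _world_expert(self, meta: List[float]) -> List[float]:
        # Stand-in for decoding a subgoal image from the same features.
        return [2.0 * m for m in meta]

    def predict(self, x: WLAInput, predict_world: bool = False) -> WLAOutput:
        subtask, meta = self._backbone(x)
        actions = self._action_expert(meta, x.robot_state)
        subgoal = self._world_expert(meta) if predict_world else None
        return WLAOutput(subtask, actions, subgoal)


model = ToyWLA()
obs = WLAInput("stack the red block", [[0.2, 0.4], [0.6, 0.8]], [0.0, 1.0])
fast = model.predict(obs)                      # world prediction disabled
full = model.predict(obs, predict_world=True)  # world prediction activated
```

In this sketch `fast` and `full` carry the same action chunk, and only `full` includes a subgoal image, which is the behaviour the abstract attributes to the meta-query design.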
- Unified WLA Architecture: Jointly predicts textual subtasks, subgoal images, and robot actions in a single model, combining world modeling, language reasoning, and action synthesis.
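- Architecture: An autoregressive Transformer backbone with a dedicated World Expert and Action Expert; meta-queries let the world prediction influence action generation implicitly, so it can be disabled during inference.
- Scaling: The world prediction can be activated at inference to enable test-time scaling for improved robot control.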
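The summary does not say how the activated world prediction is turned into better control. One generic pattern is best-of-N selection: sample several candidate actions, predict the outcome of each, and keep the candidate whose predicted outcome lies closest to the predicted subgoal. The sketch below shows that pattern only and is not presented as the WLA method; `best_of_n`, `toy_sample_action`, `toy_predict_outcome`, and the additive toy dynamics are all hypothetical.

```python
import random
from typing import Callable, List, Optional

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_of_n(
    sample_action: Callable[[random.Random], Vector],
    predict_outcome: Callable[[Vector], Vector],
    subgoal: Vector,
    n: int,
    rng: random.Random,
) -> Vector:
    """Return the sampled action whose predicted outcome is closest to the subgoal."""
    if n < 1:
        raise ValueError("n must be at least 1")
    best_action: Optional[Vector] = None
    best_score = float("inf")
    for _ in range(n):
        action = sample_action(rng)
        score = squared_distance(predict_outcome(action), subgoal)
        if score < best_score:
            best_action, best_score = action, score
    assert best_action is not None  # guaranteed because n >= 1
    return best_action


# Toy stand-ins (assumptions): the state is a 2-D point and an action is a displacement.
CURRENT_STATE = [0.0, 0.0]


def toy_sample_action(rng: random.Random) -> Vector:
    return [rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)]


def toy_predict_outcome(action: Vector) -> Vector:
    return [s + a for s, a in zip(CURRENT_STATE, action)]


subgoal = [1.0, 1.0]
single = best_of_n(toy_sample_action, toy_predict_outcome, subgoal, 1, random.Random(0))
scaled = best_of_n(toy_sample_action, toy_predict_outcome, subgoal, 32, random.Random(0))
```

With the same seed, the first candidate is identical in both calls, so the 32-sample result can never be further from the subgoal than the single-sample result. The extra compute spent on candidates is what "scaling" refers to in this sketch.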
WLA-0 demonstrates state-of-the-art multi-task and long-horizon learning abilities across simulated and real-world environments, e.g., a 92.94% success rate on RoboTwin2.0 Clean and a 56.5% success rate on RMBench.
The model achieves 56.5% on RMBench (long-horizon tasks), indicating room for improvement on complex multi-step reasoning. Future work could explore larger WLA variants and longer-horizon test-time scaling. The cross-embodiment capability is promising but needs further validation across more diverse robot morphologies.
WLA is directly relevant because it proposes a unified architecture that combines world modeling with language reasoning and action synthesis. Its explicit split into a "World Expert" and an "Action Expert" with distinct supervision objectives provides a template for modular world model design. The meta-query mechanism, which lets the world prediction be disabled or activated at test time, is an interesting approach to balancing imagination and reactive control.