World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

Abstract

We propose world-language-action (WLA) models as a new class of embodied foundation models. WLA takes textual instructions, images, and robot states as inputs to jointly predict textual subtasks, subgoal images, and robot actions, conjoining the *world modeling interface* to learn from extensive egocentric videos as in the world-action model (WAM) and the *language reasoning* capacities to solve complex long-horizon tasks as in vision-language-action (VLA) models. At the core of WLA lies an *autoregressive (AR) Transformer backbone*, instead of a bidirectional diffusion Transformer as in WAMs, to predict the *next state*, comprising the *semantic-level* textual intention and complementary *fine-grained* physical dynamics. The physical dynamics are supervised by the world modeling objective based on a dedicated World Expert, and are leveraged to ease the characterization of the state-action correlation for the Action Expert. WLA leverages meta-queries to make the world prediction *implicitly* impact the action generation so that the former can be disabled during inference. The world prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, achieves 40 ms per inference on an NVIDIA RTX 5090. Evaluations across simulated and real-world environments demonstrate that WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities, e.g., 92.94% success rate on RoboTwin2.0 Clean and 56.5% success rate on RMBench. WLA-0 also holds the promise to learn novel tasks directly from *cross-embodiment robot videos* without action annotations.

Key Contributions

- Unified WLA Architecture: Jointly predicts textual subtasks, subgoal images, and robot actions in a single model, combining world modeling, language reasoning, and action synthesis.
- Autoregressive Transformer Backbone: Uses an AR Transformer, rather than the bidirectional diffusion Transformer used in WAMs, to predict the next state, which comprises both the semantic-level textual intention and the fine-grained physical dynamics.
- World Expert + Action Expert: A dedicated World Expert supervises the physical dynamics through the world modeling objective; the Action Expert uses those dynamics to ease the characterization of the state-action correlation.
- Meta-Query Mechanism: World prediction influences action generation implicitly through meta-queries, so it can be disabled at inference for efficiency or activated to enable test-time scaling.
- Cross-Embodiment Learning: Shows promise for learning novel tasks directly from cross-embodiment robot videos without action annotations.

Method Details

Architecture:

Backbone: Autoregressive Transformer, in contrast to the bidirectional diffusion Transformer used in WAMs

Inputs: Textual instructions, images, robot states

Outputs: Textual subtasks, subgoal images, robot actions
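The input/output contract above can be sketched as a pair of Python dataclasses. This is only an illustration of the interface: the names `WLAInput`, `WLAOutput`, and `dummy_wla_step` are hypothetical, images are flattened lists for simplicity, and the placeholder function does no learning.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WLAInput:
    instruction: str            # textual instruction
    images: List[List[float]]   # camera observations (flattened, assumption)
    robot_state: List[float]    # proprioceptive state, e.g. joint positions


@dataclass
class WLAOutput:
    subtask: str                                  # textual subtask
    actions: List[List[float]]                    # predicted robot actions
    subgoal_image: Optional[List[float]] = None   # None if world prediction is off


def dummy_wla_step(inp: WLAInput, predict_world: bool = False) -> WLAOutput:
    """Placeholder that only illustrates the I/O contract (hypothetical)."""
    subtask = f"next step for: {inp.instruction}"
    # Trivial "action": hold the current robot state.
    actions = [list(inp.robot_state)]
    # Trivial "subgoal": echo the first image, only when requested.
    subgoal = list(inp.images[0]) if predict_world else None
    return WLAOutput(subtask=subtask, actions=actions, subgoal_image=subgoal)
```

Making `subgoal_image` optional mirrors the point that world prediction can be switched off at inference while subtasks and actions are still produced.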

Dual Expert Design:

- World Expert: Supervises physical dynamics prediction via world modeling objective
- Action Expert: Uses physical dynamics to ease state-action correlation learning
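One way to picture how the two experts could be trained together, including on videos that lack action annotations, is a combined loss in which the action term is simply skipped when no action label exists. The sketch below is an assumption, not the paper's objective: mean squared error stands in for both the world modeling and action losses, and `wla_loss`, `w_world`, and `w_action` are hypothetical names.

```python
from typing import List, Optional


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def wla_loss(
    text_nll: float,                       # language loss on the textual subtask
    pred_frame: List[float],               # World Expert prediction
    target_frame: List[float],             # observed future frame
    pred_action: List[float],              # Action Expert prediction
    target_action: Optional[List[float]],  # None for action-free video
    w_world: float = 1.0,                  # hypothetical loss weights
    w_action: float = 1.0,
) -> float:
    world = mse(pred_frame, target_frame)
    # Action-free samples still contribute text and world supervision.
    action = mse(pred_action, target_action) if target_action is not None else 0.0
    return text_nll + w_world * world + w_action * action
```

Under this assumption, an unlabeled video clip still trains the backbone and World Expert, which is the intuition behind learning from videos without action annotations.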

Meta-queries: Let world prediction influence action generation implicitly, so explicit world prediction can be disabled or activated at inference
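The idea that world prediction affects actions only implicitly can be illustrated with a toy pipeline in which the Action Expert reads meta-query hidden states and never the decoded image. Everything here is a simplified assumption: `backbone`, `action_expert`, `world_expert`, and `step` are hypothetical stand-ins written with plain lists, not the paper's modules.

```python
from typing import List, Optional, Tuple


def backbone(obs: List[float], num_meta_queries: int = 2) -> List[List[float]]:
    """Toy stand-in for the AR backbone: one hidden vector per meta-query."""
    return [[x * (q + 1) for x in obs] for q in range(num_meta_queries)]


def action_expert(meta_hidden: List[List[float]]) -> List[float]:
    """Reads only the meta-query hidden states, never a decoded image."""
    dim = len(meta_hidden[0])
    return [sum(h[i] for h in meta_hidden) / len(meta_hidden) for i in range(dim)]


def world_expert(meta_hidden: List[List[float]]) -> List[float]:
    """Decodes the same hidden states into a toy subgoal image."""
    return [v + 1.0 for h in meta_hidden for v in h]


def step(
    obs: List[float], decode_world: bool = False
) -> Tuple[List[float], Optional[List[float]]]:
    hidden = backbone(obs)
    action = action_expert(hidden)
    # Decoding the subgoal is optional and does not feed back into the action.
    subgoal = world_expert(hidden) if decode_world else None
    return action, subgoal
```

In this sketch the action is identical whether or not `decode_world` is set, which is why skipping the decode saves compute without changing the control output.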

Scaling: Test-time scaling via activated world prediction for improved control.
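The summary does not say how activated world prediction is used for test-time scaling. One common pattern, shown here purely as an assumption, is best-of-N selection: sample several candidate actions, predict each outcome with a world model, and keep the candidate whose predicted outcome is closest to the subgoal. The names `best_of_n` and `toy_dynamics` are hypothetical, and the additive dynamics are a toy.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]


def toy_dynamics(state: Vector, action: Vector) -> Vector:
    """Toy world model: the next state is the state plus the action."""
    return [s + a for s, a in zip(state, action)]


def best_of_n(
    state: Vector,
    subgoal: Vector,
    predict_next: Callable[[Vector, Vector], Vector],
    n: int = 8,
    noise: float = 0.5,
    seed: int = 0,
) -> Tuple[Vector, float]:
    """Pick the sampled action whose predicted outcome best matches the subgoal."""
    rng = random.Random(seed)
    best_action: Vector = []
    best_err = float("inf")
    for _ in range(n):
        # Random proposals stand in for samples from a learned policy.
        action = [rng.uniform(-noise, noise) for _ in state]
        predicted = predict_next(state, action)
        err = sum((p - g) ** 2 for p, g in zip(predicted, subgoal))
        if err < best_err:
            best_action, best_err = action, err
    return best_action, best_err
```

Raising `n` spends more inference compute for a candidate that is at least as good under the world model's score, which is the trade-off test-time scaling refers to.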

Key Results

| Metric | Value |
|--------|-------|
| Model size | 2B active parameters |
| Inference latency | 40 ms per inference (NVIDIA RTX 5090) |
| RoboTwin2.0 Clean success rate | 92.94% |
| RMBench success rate | 56.5% |
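To put the latency figure in context, 40 ms per inference corresponds to 25 inferences per second. The short calculation below also shows how the achievable control rate would scale if each inference emitted several actions; the chunk size of 8 is a hypothetical value, since the summary does not state how many actions are produced per inference.

```python
def inference_rate_hz(latency_ms: float) -> float:
    """Inferences per second for a given per-inference latency."""
    return 1000.0 / latency_ms


def max_control_rate_hz(latency_ms: float, actions_per_inference: int) -> float:
    """Upper bound on actions per second if each inference yields a chunk."""
    return actions_per_inference * inference_rate_hz(latency_ms)


print(inference_rate_hz(40.0))        # 25.0
print(max_control_rate_hz(40.0, 8))   # 200.0 (chunk size 8 is an assumption)
```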

WLA-0 demonstrates state-of-the-art multi-task and long-horizon learning abilities across simulated and real-world environments.

Limitations and Future Work

The 56.5% success rate on RMBench (long-horizon tasks) leaves clear room for improvement on complex multi-step reasoning. Future work could explore larger WLA variants and longer-horizon test-time scaling. The cross-embodiment capability is promising but still needs validation across more diverse robot morphologies.

Relevance to Patrick's Research

WLA is directly relevant as it proposes a unified world model architecture that combines world modeling with language reasoning and action synthesis. The explicit use of a "World Expert" and "Action Expert" with distinct supervision objectives provides a template for modular world model design. The meta-query mechanism for test-time control of world prediction involvement is an interesting approach to balancing imagination and reactive control.