World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

Abstract

We propose world-language-action (WLA) models as a new class of embodied foundation models. WLA takes textual instructions, images, and robot states as inputs to jointly predict textual subtasks, subgoal images, and robot actions, conjoining the *world modeling interface* to learn from extensive egocentric videos as in the world-action model (WAM) and the *language reasoning* capacities to solve complex long-horizon tasks as in vision-language-action (VLA) models. At the core of WLA lies an *autoregressive (AR) Transformer backbone*, instead of a bidirectional diffusion Transformer as in WAMs, to predict the *next state*, comprising the *semantic-level* textual intention and complementary *fine-grained* physical dynamics. The physical dynamics are supervised by the world modeling objective based on a dedicated World Expert, and are leveraged to ease the characterization of the state-action correlation for the Action Expert. WLA leverages meta-queries to make the world prediction *implicitly* impact the action generation so that the former can be disabled during inference. The world prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, achieves 40 ms per inference on an NVIDIA RTX 5090. Evaluations across simulated and real-world environments demonstrate that WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities, e.g., 92.94% success rate on RoboTwin2.0 Clean and 56.5% success rate on RMBench. WLA-0 also holds the promise to learn novel tasks directly from *cross-embodiment robot videos* without action annotations.

Key Contributions

  • Unified WLA Architecture: A single model jointly predicts textual subtasks, subgoal images, and robot actions, combining world modeling, language reasoning, and action synthesis.
  • Autoregressive Transformer Backbone: Uses an AR Transformer, rather than the bidirectional diffusion Transformer of WAMs, to predict the next state, which comprises the semantic-level textual intention and the fine-grained physical dynamics.
  • World Expert + Action Expert: A dedicated World Expert supervises the physical dynamics through the world modeling objective; the Action Expert uses those dynamics to ease the characterization of the state-action correlation.
  • Meta-Query Mechanism: World prediction influences action generation only implicitly, so it can be disabled at inference for efficiency or activated for test-time scaling.
  • Cross-Embodiment Learning: Shows promise for learning novel tasks directly from cross-embodiment robot videos without action annotations.

Method Details

Architecture:

  • Backbone: Autoregressive Transformer (not a bidirectional diffusion Transformer as in WAMs)
  • Inputs: Textual instructions, images, robot states
  • Outputs: Textual subtasks, subgoal images, robot actions
  • Dual Expert Design:
    - World Expert: Supervises physical dynamics prediction via the world modeling objective
    - Action Expert: Uses the physical dynamics to ease state-action correlation learning
  • Meta-queries: Let world prediction influence action generation implicitly, so world prediction can be disabled during inference
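
The interface listed above, together with the optional world-prediction path, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Observation`, `Prediction`, `ToyWLA`, and `predict_world` are hypothetical, and the three private methods are placeholder stand-ins for the AR backbone, the World Expert, and the Action Expert. The sketch also assumes that the Action Expert reads only the meta-query features and never the decoded subgoal image, which is one way the world branch could be skipped without changing the action.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    instruction: str            # textual instruction
    image: List[List[float]]    # placeholder for a camera image
    robot_state: List[float]    # placeholder robot state (layout is assumed)


@dataclass
class Prediction:
    subtask: str                                       # textual subtask
    action: List[float]                                # robot action
    subgoal_image: Optional[List[List[float]]] = None  # set only if world prediction runs


class ToyWLA:
    """Hypothetical stand-in for a WLA model; every computation is a placeholder."""

    def _backbone(self, obs: Observation):
        # Stand-in for the AR backbone: emits a subtask and meta-query features.
        subtask = f"next step for: {obs.instruction}"
        meta_query_features = [sum(obs.robot_state)]
        return subtask, meta_query_features

    def _world_expert(self, obs: Observation, features: List[float]):
        # Stand-in for the World Expert: decodes a subgoal image from the features.
        return [[px + features[0] for px in row] for row in obs.image]

    def _action_expert(self, obs: Observation, features: List[float]):
        # Assumption: the action depends on the meta-query features only,
        # not on the decoded subgoal image.
        return [s + 0.1 * features[0] for s in obs.robot_state]

    def step(self, obs: Observation, predict_world: bool = False) -> Prediction:
        subtask, features = self._backbone(obs)
        action = self._action_expert(obs, features)
        # The world branch is optional and can be skipped at inference.
        subgoal = self._world_expert(obs, features) if predict_world else None
        return Prediction(subtask, action, subgoal)
```

Calling `ToyWLA().step(obs)` skips the world branch, while `step(obs, predict_world=True)` also returns a subgoal image. In this toy version the action is identical in both modes; whether that holds for WLA-0 is not stated in the summary.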

Scaling: Activating world prediction at inference enables test-time scaling for improved robot control.
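
The summary does not say how the activated world prediction is used for test-time scaling. One plausible reading, shown here purely as an assumption, is best-of-N selection: sample several candidate actions, predict the outcome of each, and keep the candidate whose predicted outcome lies closest to the goal. In the sketch below, `best_of_n`, `propose_action`, and `predict_outcome` are hypothetical placeholders operating on a one-dimensional toy state, not components of WLA-0.

```python
import random
from typing import Callable, Tuple


def best_of_n(
    state: float,
    goal: float,
    propose_action: Callable[[float, random.Random], float],
    predict_outcome: Callable[[float, float], float],
    n: int = 8,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return the sampled action with the smallest predicted goal error."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    best_action, best_error = 0.0, float("inf")
    for _ in range(n):
        action = propose_action(state, rng)          # stand-in for action sampling
        predicted = predict_outcome(state, action)   # stand-in for world prediction
        error = abs(goal - predicted)
        if error < best_error:
            best_action, best_error = action, error
    return best_action, best_error


# Toy placeholders: random actions and additive 1-D dynamics (both assumed).
def propose_action(state: float, rng: random.Random) -> float:
    return rng.uniform(-1.0, 1.0)


def predict_outcome(state: float, action: float) -> float:
    return state + action
```

With a fixed seed, raising `n` can only keep or reduce the best predicted error, since the larger candidate set contains the smaller one. That is the sense in which extra inference compute buys better control in this toy setting.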

Key Results

| Metric | Value |
|--------|-------|
| Model Size | 2B active parameters |
| Inference Speed | 40 ms per inference (NVIDIA RTX 5090) |
| RoboTwin2.0 Clean Success Rate | 92.94% |
| RMBench Success Rate | 56.5% |

Across simulated and real-world evaluations, WLA-0 is reported to achieve state-of-the-art multi-task and long-horizon learning abilities.
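
As a back-of-the-envelope reading of the latency figure, 40 ms per inference corresponds to at most 25 inferences per second when calls run back to back. The helper below is illustrative only; `actions_per_inference` is a hypothetical parameter, since the summary does not state whether WLA-0 emits a single action or a chunk of actions per inference.

```python
def max_inference_rate_hz(latency_ms: float) -> float:
    """Upper bound on inferences per second for back-to-back calls."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return 1000.0 / latency_ms


def max_action_rate_hz(latency_ms: float, actions_per_inference: int = 1) -> float:
    """Upper bound on actions per second; the chunk size is an assumed parameter."""
    return max_inference_rate_hz(latency_ms) * actions_per_inference


print(max_inference_rate_hz(40.0))  # 25.0
```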

Limitations and Future Work

The 56.5% success rate on RMBench (long-horizon tasks) leaves room for improvement on complex multi-step reasoning. Future work could explore larger WLA variants and test-time scaling over longer horizons. The cross-embodiment capability is promising but needs validation across more diverse robot morphologies.

Relevance to Patrick's Research

WLA is directly relevant because it proposes a unified architecture that combines world modeling with language reasoning and action synthesis. The explicit split into a "World Expert" and an "Action Expert" with distinct supervision objectives provides a template for modular world model design. The meta-query mechanism, which lets world prediction be switched on or off at test time, is an interesting approach to balancing imagination and reactive control.