World-Language-Action (WLA) Model for Unified World Modeling, Language Reasoning, and Action Synthesis

arXiv: 2606.05979
Authors: Yi Yang, Zhihong Liu, Siqi Kou, Yiyang Chen, Yanzhe Hu, Jianbo Zhou, Boyuan Zhao, Zhijie Wei, Xiao Xia, Xueqi Li, Pengfei Liu, Zhijie Deng
Submitted: 4 June 2026
Categories: cs.RO, cs.AI, cs.CL

Abstract

WLA models are a new class of embodied foundation models that take textual instructions, images, and robot states as inputs and jointly predict textual subtasks, subgoal images, and robot actions. WLA combines the world modeling interface of WAMs, which allows learning from extensive egocentric videos, with the language reasoning capacity of VLA models, which supports complex long-horizon tasks.

Key Contributions

- Proposes WLA (World-Language-Action), a unified model that jointly predicts:
  - Textual subtasks
  - Subgoal images
  - Robot actions
- Introduces an autoregressive (AR) Transformer backbone in place of the bidirectional diffusion Transformer used in WAMs
- Uses meta-queries so that world prediction implicitly shapes action generation (world prediction can be disabled at inference)
- Enables test-time scaling for improved robot control

Method Details

Architecture: Autoregressive Transformer backbone (not diffusion-based like WAMs)

Core Components:

1. World Expert: Supervises physical dynamics prediction through a world modeling objective
2. Action Expert: Leverages the predicted physical dynamics to make the state-action correlation easier to model
3. Meta-queries: Enable world prediction to implicitly influence action generation without requiring world prediction at inference time
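
The interplay of the three components can be illustrated with a toy sketch. It is not the paper's architecture: the class `ToyWLA`, the additive "backbone", and the mean-pooling heads are all hypothetical stand-ins chosen so the example runs with the standard library only. The one idea it tries to show is that the action expert reads the hidden states at the meta-query positions, while the world expert is an optional decoder over those same states and can be skipped.

```python
from dataclasses import dataclass
from typing import List, Optional

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


@dataclass
class Output:
    action: Vector
    subgoal: Optional[Vector]  # None when world prediction is disabled


class ToyWLA:
    """Hypothetical toy model; every layer here is an illustrative assumption."""

    def __init__(self, meta_queries: List[Vector]):
        # In a real model these would be learnable query tokens.
        self.meta_queries = meta_queries

    def backbone(self, context: List[Vector]) -> List[Vector]:
        # Stand-in for the AR Transformer: each meta-query absorbs
        # a pooled summary of the context tokens.
        summary = mean_pool(context)
        return [[q + s for q, s in zip(query, summary)]
                for query in self.meta_queries]

    def action_expert(self, hidden: List[Vector]) -> Vector:
        # Actions depend only on the meta-query hidden states.
        return mean_pool(hidden)

    def world_expert(self, hidden: List[Vector]) -> Vector:
        # Optional decoder; training with it would shape the hidden states.
        return [2.0 * h for h in mean_pool(hidden)]

    def forward(self, context: List[Vector], predict_world: bool = False) -> Output:
        hidden = self.backbone(context)
        action = self.action_expert(hidden)
        subgoal = self.world_expert(hidden) if predict_world else None
        return Output(action=action, subgoal=subgoal)


model = ToyWLA(meta_queries=[[0.0, 1.0], [1.0, 0.0]])
context = [[1.0, 1.0], [3.0, 3.0]]
fast = model.forward(context)                      # world prediction off
full = model.forward(context, predict_world=True)  # world prediction on
```

In this toy the action is identical with the world expert on or off, which reflects only the "can be disabled at inference" property. How explicitly predicted subgoals feed back into control in the real model is not specified in this summary.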

Training: The model predicts a "next state" comprising:

- Semantic-level textual intention
- Fine-grained physical dynamics
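
One way to picture a training signal over such a "next state" is a weighted sum of per-output losses. The sketch below is an assumption throughout: the summary does not give the paper's loss, so the choice of cross-entropy for the textual intention, mean squared error for dynamics and actions, and the weights `w_text`, `w_world`, and `w_action` are hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], target: int) -> float:
    """Negative log-likelihood of the target token under predicted probabilities."""
    return -math.log(probs[target])


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def next_state_loss(
    text_probs: Sequence[float],
    text_target: int,
    dynamics_pred: Sequence[float],
    dynamics_target: Sequence[float],
    action_pred: Sequence[float],
    action_target: Sequence[float],
    w_text: float = 1.0,    # hypothetical weight
    w_world: float = 1.0,   # hypothetical weight
    w_action: float = 1.0,  # hypothetical weight
) -> float:
    """Assumed joint objective: textual intention + physical dynamics + action."""
    return (
        w_text * cross_entropy(text_probs, text_target)
        + w_world * mse(dynamics_pred, dynamics_target)
        + w_action * mse(action_pred, action_target)
    )


loss = next_state_loss(
    text_probs=[0.5, 0.5], text_target=0,
    dynamics_pred=[1.0, 2.0], dynamics_target=[1.0, 4.0],
    action_pred=[0.0], action_target=[1.0],
    w_world=0.5,
)
```

Setting `w_world` to zero would recover a model trained without the world modeling objective, which is one way to ablate its contribution in an experiment of this shape.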

Inference:

- WLA-0 runs at 40 ms per inference on an NVIDIA RTX 5090
- World prediction can be disabled during inference when it is not needed
- World prediction can be activated at test time for improved control (test-time scaling)
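
The summary does not say how test-time scaling is carried out. A common pattern, shown here purely as an assumption, is best-of-N selection: sample several candidate actions, use world prediction to imagine the outcome of each, and keep the candidate whose predicted outcome is closest to the goal. The function `best_of_n` and the toy world model below are hypothetical.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_of_n(
    candidates: List[Vector],
    predict_outcome: Callable[[Vector], Vector],
    goal: Vector,
) -> Tuple[int, Vector]:
    """Return the index and value of the candidate whose predicted outcome is nearest the goal."""
    scored = [
        (squared_distance(predict_outcome(action), goal), index)
        for index, action in enumerate(candidates)
    ]
    _, best_index = min(scored)  # lowest distance wins; ties go to the lower index
    return best_index, candidates[best_index]


# Toy world model (assumption): the next state is the current state plus the action.
state = [0.0, 0.0]


def toy_world_model(action: Vector) -> Vector:
    return [s + a for s, a in zip(state, action)]


candidates = [[0.2, 0.0], [0.9, 0.1], [1.5, 0.0]]
index, action = best_of_n(candidates, toy_world_model, goal=[1.0, 0.0])
```

Each extra candidate costs additional world-prediction compute, which is the trade that makes this a test-time scaling knob: more samples, more latency, and potentially better control.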

Scale: WLA-0 has 2B active parameters.
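
The reported figures translate into a rough deployment budget with simple arithmetic. The sketch below assumes 2 bytes per parameter (for example bf16 weights) and counts only the 2B active parameters; the actual precision, total parameter count, and activation memory are not stated in this summary.

```python
def inference_rate_hz(latency_ms: float) -> float:
    """Number of model inferences per second at a given per-inference latency."""
    return 1000.0 / latency_ms


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


rate = inference_rate_hz(40.0)       # 40 ms per inference -> 25 inferences/s
memory = weight_memory_gb(2e9, 2)    # 2B params at an assumed 2 bytes each -> 4 GB
```

At 40 ms per inference the model can be queried about 25 times per second, and under the stated assumptions the active weights alone occupy roughly 4 GB.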

Key Results

| Benchmark | WLA-0 Result |
|-----------|-------------|
| RoboTwin2.0 Clean success rate | 92.94% |
| RMBench success rate | 56.5% |

WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities. It also shows promise in learning novel tasks directly from cross-embodiment robot videos without action annotations.

Limitations and Future Work

The paper does not explicitly discuss limitations. Potential future directions:

- Scaling WLA to larger parameter counts
- Improving RMBench performance (the 56.5% success rate leaves substantial headroom)
- Extending to more diverse robot embodiments
- Improving zero-shot cross-embodiment transfer

Relevance to Patrick's Research

WLA represents an important architectural innovation: it unifies world modeling, language reasoning, and action synthesis in a single autoregressive model. The key insights are the use of an AR backbone instead of a diffusion backbone and the meta-query mechanism, which allows world prediction to be toggled on or off. The test-time scaling capability is particularly interesting for world model research.