World-Language-Action (WLA) Model for Unified World Modeling, Language Reasoning, and Action Synthesis

arXiv: 2606.05979

Authors: Yi Yang, Zhihong Liu, Siqi Kou, Yiyang Chen, Yanzhe Hu, Jianbo Zhou, Boyuan Zhao, Zhijie Wei, Xiao Xia, Xueqi Li, Pengfei Liu, Zhijie Deng

Submitted: 4 June 2026

Categories: cs.RO, cs.AI, cs.CL

Abstract

WLA models are a new class of embodied foundation models that take textual instructions, images, and robot states as inputs and jointly predict textual subtasks, subgoal images, and robot actions. WLA combines a world modeling interface, which lets it learn from extensive egocentric videos (as in WAM), with the language reasoning capacity needed for complex long-horizon tasks (as in VLA models).

Key Contributions

- Proposes WLA (World-Language-Action), a unified model that jointly predicts:
  - Textual subtasks
  - Subgoal images
  - Robot actions
- Introduces an autoregressive (AR) Transformer backbone in place of the bidirectional diffusion Transformer used in WAMs
- Uses meta-queries so that world prediction implicitly influences action generation (world prediction can be disabled at inference)
- Enables test-time scaling for improved robot control

Method Details

Architecture: Autoregressive Transformer backbone (unlike WAMs, which use a diffusion Transformer)

Core Components:

1. World Expert: Supervises physical dynamics prediction through a world modeling objective
2. Action Expert: Leverages the predicted physical dynamics to make the correlation between states and actions easier to model
3. Meta-queries: Enable world prediction to implicitly influence action generation without requiring world prediction at inference time
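The data flow behind the meta-query idea can be illustrated with a toy sketch. It is not the paper's implementation: the running-mean "backbone", the two heads, and all names (`causal_states`, `step`, `StepOutput`) are hypothetical stand-ins, chosen only to show how an action head can read meta-query states while the world head stays optional.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepOutput:
    action: List[float]
    subgoal: Optional[List[float]]  # None when world prediction is skipped


def causal_states(tokens: List[float]) -> List[float]:
    """Stand-in for a causal Transformer: state i is the mean of tokens[0..i]."""
    states, running = [], 0.0
    for i, tok in enumerate(tokens):
        running += tok
        states.append(running / (i + 1))
    return states


def step(context: List[float], meta_queries: List[float],
         predict_world: bool) -> StepOutput:
    # Meta-queries are appended after the context, so they can read all of it.
    states = causal_states(context + meta_queries)
    query_states = states[len(context):]

    # Action head (hypothetical): consumes only the meta-query states.
    action = [2.0 * s for s in query_states]

    # World head (hypothetical): decodes the same states, but is optional.
    subgoal = [s + 1.0 for s in query_states] if predict_world else None
    return StepOutput(action=action, subgoal=subgoal)


context = [0.2, 0.4, 0.6]   # toy instruction/image/state tokens
meta_queries = [0.0, 0.0]   # toy learned query tokens

fast = step(context, meta_queries, predict_world=False)
full = step(context, meta_queries, predict_world=True)
```

In this toy setup the action is identical whether or not the world head runs, which mirrors the claim that world prediction can be switched off at inference. How the real model shapes the meta-query states during training is not described in these notes.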

Training: The model predicts a "next state" comprising:

- Semantic-level textual intention
- Fine-grained physical dynamics
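These notes do not record the training loss. As a rough sketch only, one common way to train on several prediction targets at once is a weighted sum of per-target losses. The function name, the weights, and the inclusion of a separate action term are all assumptions for illustration, not details from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    # Hypothetical weights; the source does not specify any.
    text: float = 1.0
    world: float = 1.0
    action: float = 1.0


def joint_loss(text_loss: float, world_loss: float, action_loss: float,
               w: LossWeights = LossWeights()) -> float:
    """Weighted sum of per-target losses (an assumed form, not the paper's)."""
    return w.text * text_loss + w.world * world_loss + w.action * action_loss


# Toy per-target losses for a single training step.
total = joint_loss(text_loss=0.5, world_loss=0.25, action_loss=0.125)
```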

Inference:

- WLA-0 runs at 40 ms per inference on an NVIDIA RTX 5090
- World prediction can be disabled during inference when it is not needed
- World prediction can be activated at test time for improved control (test-time scaling)
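The 40 ms figure translates directly into an upper bound on how often the policy can be queried. The sketch below does that arithmetic. The action-chunk size is a hypothetical parameter, since these notes do not say how many actions each inference call produces.

```python
def max_inference_rate_hz(latency_ms: float) -> float:
    """Upper bound on inference calls per second for a given latency."""
    return 1000.0 / latency_ms


def max_action_rate_hz(latency_ms: float, actions_per_call: int) -> float:
    """Actions per second if each call emits `actions_per_call` actions.

    `actions_per_call` is a hypothetical chunk size, not a reported number.
    """
    return max_inference_rate_hz(latency_ms) * actions_per_call


rate = max_inference_rate_hz(40.0)      # 25.0 calls per second
chunked = max_action_rate_hz(40.0, 8)   # 200.0 with an assumed chunk of 8
```

These notes do not state whether the 40 ms measurement includes world prediction, so the cost of enabling it at test time cannot be derived from this number alone.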

Scale: WLA-0 has 2B active parameters.

Key Results

| Benchmark | WLA-0 Result |
|-----------|-------------|
| RoboTwin2.0 Clean success rate | 92.94% |
| RMBench success rate | 56.5% |

WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities. It also shows promise in learning novel tasks directly from cross-embodiment robot videos without action annotations.

Limitations and Future Work

The paper does not explicitly discuss limitations. Potential future directions:

- Scaling WLA to larger parameter counts
- Improving RMBench performance (the 56.5% success rate leaves substantial headroom)
- Extending to more diverse robot embodiments
- Improving zero-shot cross-embodiment transfer

Relevance to Patrick's Research

WLA represents an important architectural innovation: it unifies world modeling, language reasoning, and action synthesis in a single autoregressive model. The two key ideas are the use of an AR backbone instead of a diffusion one and the meta-query mechanism, which allows world prediction to be toggled on or off. The test-time scaling capability is particularly interesting for world model research.