Light-WAM: Efficient World Action Models with State-Fusion Action Decoding

title: "Light-WAM: Efficient World Action Models with State-Fusion Action Decoding"

arxiv_id: "2606.08242"

date: "2026-06-06"

tags: [world-model, robot-learning, action-model, efficient-inference, manipulation]

---

Abstract

Light-WAM is a lightweight World Action Model (WAM) for robot manipulation that retains the future-prediction benefits of large WAMs while sharply reducing trainable parameters and inference latency. It combines a compact video backbone with a StateFusionActionExpert that reads adapted states from multiple backbone layers and predicts action chunks in a single forward pass, with no heavy generative action head.

Key Contributions

Lightweight WAM architecture: 0.44B trainable parameters — orders of magnitude smaller than typical WAMs
StateFusionActionExpert: A new action head that fuses adapted states from multiple backbone layers through learned-query pooling and predicts action chunks in one forward pass
Downsampled latent-space video supervision: Future-video co-training is performed in a compact latent space, cutting training cost without losing the representation-learning benefit of video prediction
Efficiency profile: 72.03 ms inference latency and 4.1 GiB peak GPU memory, plus improved training throughput relative to larger WAMs

Method Details

The model has two coupled components:

Compact video backbone: A small video transformer (≈0.44B parameters) encodes the current observation and a short history of frames. Future-video supervision is applied in a downsampled latent space: the backbone predicts compressed future latents rather than raw pixels, which keeps the auxiliary loss cheap while still forcing the backbone to learn temporally structured representations.
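To make the idea of downsampled latent supervision concrete, here is a minimal NumPy sketch. It assumes the target latents form a (frames, height, width, channels) grid, that downsampling is plain average pooling, and that the loss is mean squared error. None of these choices is stated in this note; the function names and shapes are hypothetical and only illustrate why the target gets cheaper to predict.

```python
import numpy as np

def downsample_latents(latents: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool a (T, H, W, C) latent grid by `factor` along H and W.

    Average pooling is an assumption; the actual downsampling operator may differ.
    """
    t, h, w, c = latents.shape
    assert h % factor == 0 and w % factor == 0, "H and W must divide evenly"
    # Split H and W into (blocks, factor) pairs, then average inside each block.
    blocks = latents.reshape(t, h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(2, 4))

def future_latent_loss(predicted: np.ndarray, target_latents: np.ndarray, factor: int) -> float:
    """MSE between predicted compact latents and the downsampled targets."""
    target = downsample_latents(target_latents, factor)
    return float(np.mean((predicted - target) ** 2))

rng = np.random.default_rng(0)
target_latents = rng.normal(size=(4, 16, 16, 8))  # 4 future frames, 16x16 grid, 8 channels
perfect_prediction = downsample_latents(target_latents, factor=4)

print(perfect_prediction.shape)  # (4, 4, 4, 8)
print(future_latent_loss(perfect_prediction, target_latents, factor=4))  # 0.0
```

With a factor of 4, the prediction target shrinks from 16x16 to 4x4 positions per frame, so the auxiliary head predicts 16 times fewer values per frame in this toy setup.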

StateFusionActionExpert (action head):

Multi-layer state adaptation: From each of K backbone layers, a learned adapter projects the layer's token sequence to a "state" representation. This captures both low-level visual features and high-level task structure.
Learned-query pooling: A small set of learned query tokens attends across the multi-layer states (cross-attention) and pools them into a fixed-size representation.
Action chunk prediction: A single MLP head reads the pooled representation and outputs a chunk of future actions in one forward pass — autoregressive decoding is not used.
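The three steps above can be sketched end to end in NumPy. This is an illustrative reading of the description, not the paper's implementation: the linear adapters, single-head attention, ReLU MLP, random untrained weights, absence of a batch dimension, and every dimension below are assumptions, and the class name StateFusionSketch is hypothetical.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class StateFusionSketch:
    """Toy action head: per-layer adapters -> learned-query pooling -> MLP."""

    def __init__(self, num_layers, backbone_dim, state_dim,
                 num_queries, chunk_len, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: rng.normal(0.0, 0.02, shape)
        # Step 1: one linear adapter per tapped backbone layer (assumed linear).
        self.adapters = [init(backbone_dim, state_dim) for _ in range(num_layers)]
        # Step 2: learned queries plus key/value projections (single head assumed).
        self.queries = init(num_queries, state_dim)
        self.w_key = init(state_dim, state_dim)
        self.w_value = init(state_dim, state_dim)
        # Step 3: two-layer MLP mapping pooled features to a flat action chunk.
        hidden_dim = 4 * state_dim
        self.w1 = init(num_queries * state_dim, hidden_dim)
        self.w2 = init(hidden_dim, chunk_len * action_dim)
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def __call__(self, layer_tokens):
        # layer_tokens: list of K arrays, each (num_tokens, backbone_dim).
        assert len(layer_tokens) == len(self.adapters)
        states = np.concatenate(
            [tokens @ adapter for tokens, adapter in zip(layer_tokens, self.adapters)],
            axis=0,
        )  # (K * num_tokens, state_dim)
        keys, values = states @ self.w_key, states @ self.w_value
        scores = self.queries @ keys.T / np.sqrt(keys.shape[-1])
        pooled = softmax(scores, axis=-1) @ values  # (num_queries, state_dim)
        hidden = np.maximum(pooled.reshape(-1) @ self.w1, 0.0)  # ReLU
        # The whole chunk comes out of one pass; there is no decoding loop.
        return (hidden @ self.w2).reshape(self.chunk_len, self.action_dim)

expert = StateFusionSketch(num_layers=3, backbone_dim=64, state_dim=32,
                           num_queries=4, chunk_len=8, action_dim=7)
rng = np.random.default_rng(1)
layer_tokens = [rng.normal(size=(10, 64)) for _ in range(3)]
actions = expert(layer_tokens)
print(actions.shape)  # (8, 7)
```

The pooled representation has a fixed size (num_queries x state_dim) regardless of how many tokens each layer contributes, which is what lets a plain MLP emit the full chunk at once.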

Training objective: Joint loss = action-prediction loss (behavior cloning) + future-latent prediction loss (WAM-style auxiliary). The video loss is weighted to act as a regularizer on the backbone, not as the primary signal.
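Written as a formula, the objective described above might look as follows. The symbol \lambda for the video-loss weight is introduced here for illustration; the note does not give its name or value, only that the video term acts as a regularizer rather than the primary signal.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{action}}
  + \lambda \, \mathcal{L}_{\text{video}},
\qquad \lambda > 0
```

Here \mathcal{L}_{\text{action}} is the behavior-cloning loss on predicted action chunks and \mathcal{L}_{\text{video}} is the future-latent prediction loss.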

Key Results

LIBERO benchmark: Maintains strong performance, matching or exceeding larger WAM baselines
RoboTwin 2.0: Usable multi-task performance with significantly smaller model size
Efficiency: 0.44B trainable parameters, 72.03 ms inference latency, 4.1 GiB peak GPU memory
Improved training throughput versus larger WAMs
PDF: https://arxiv.org/pdf/2606.08242

Limitations and Future Work

Future-video supervision is in latent space, so the backbone's generative quality (e.g., for visual planning) is not directly evaluated
Action chunk prediction is monolithic — no explicit mechanism for variable-length horizons or replanning
Evaluated on manipulation benchmarks; transfer to locomotion, mobile manipulation, or long-horizon tasks is open

Relevance to Patrick's Research

Light-WAM is relevant for two reasons: (1) it provides a concrete data point on the efficiency frontier of world-action models — useful if Patrick is tracking when WAMs become deployable in real-time control loops; (2) the StateFusionActionExpert's design (multi-layer state fusion, learned-query pooling) is a clean architectural pattern that could transfer to other multi-modal policies. The 72 ms latency number is a useful benchmark for "what does it take to close the loop with a WAM at 10+ Hz."
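As a quick check on the "10+ Hz" framing, the arithmetic below converts the reported latency into an upper bound on control rate. It assumes one inference call per control step and ignores sensing, communication, and actuation delays, as well as any gain from executing several actions per predicted chunk; the function names are hypothetical.

```python
def max_control_rate_hz(latency_ms: float) -> float:
    """Upper bound on loop rate if every control step waits for one inference call."""
    return 1000.0 / latency_ms

def within_budget(latency_ms: float, target_hz: float) -> bool:
    """True if one inference call fits inside the per-step time budget."""
    return latency_ms <= 1000.0 / target_hz

rate = max_control_rate_hz(72.03)
print(f"{rate:.2f} Hz")            # 13.88 Hz
print(within_budget(72.03, 10.0))  # True: 72.03 ms fits in a 100 ms budget
print(within_budget(72.03, 20.0))  # False: 20 Hz would need 50 ms or less
```

Under these assumptions the reported latency supports roughly 13.9 Hz, which clears 10 Hz but falls short of 20 Hz without chunked execution or further speedups.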