---
title: "Light-WAM: Efficient World Action Models with State-Fusion Action Decoding"
arxiv_id: "2606.08242"
date: "2026-06-06"
tags: [world-model, robot-learning, action-model, efficient-inference, manipulation]
---
Abstract
Light-WAM is a lightweight World Action Model (WAM) for robot manipulation that retains the future-prediction benefits of large WAMs while drastically reducing trainable parameters and inference latency. It combines a compact video backbone with a new action head, StateFusionActionExpert, which reads adapted states from multiple backbone layers and predicts action chunks in a single forward pass, without a heavy generative action head.
Key Contributions
- Lightweight WAM architecture: 0.44B trainable parameters, substantially fewer than typical large WAMs
- StateFusionActionExpert: A new action head that fuses adapted states from multiple backbone layers through learned-query pooling and predicts action chunks in one forward pass
- Downsampled latent-space video supervision: Future-video co-training is performed in a compact latent space, cutting training cost without losing the representation-learning benefit of video prediction
- Inference profile: 72.03 ms latency and 4.1 GiB peak GPU memory, along with higher training throughput than larger WAMs
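To make the downsampled latent-space supervision concrete, here is a minimal sketch of how a future-latent target could be shrunk before the auxiliary loss is computed. The summary does not say how Light-WAM downsamples, so the average pooling, the mean-squared-error loss, the `(T, H, W, C)` layout, and every name and size below are assumptions chosen for illustration.

```python
import numpy as np

def downsample_latents(latents: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool a (T, H, W, C) latent video over factor x factor spatial blocks.

    Average pooling is an assumption; the source only says "downsampled".
    """
    t, h, w, c = latents.shape
    if h % factor or w % factor:
        raise ValueError("spatial size must be divisible by factor")
    # Split H and W into (num_blocks, factor) and average inside each block.
    blocks = latents.reshape(t, h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(2, 4))

def future_latent_loss(predicted: np.ndarray, future_latents: np.ndarray, factor: int) -> float:
    """MSE between predicted compact latents and the downsampled future latents."""
    target = downsample_latents(future_latents, factor)
    if predicted.shape != target.shape:
        raise ValueError("prediction must match the downsampled target shape")
    return float(np.mean((predicted - target) ** 2))

# Hypothetical sizes: 4 future frames of 16x16 latents with 8 channels.
rng = np.random.default_rng(0)
future = rng.normal(size=(4, 16, 16, 8))
compact = downsample_latents(future, factor=4)
print(compact.shape)                # (4, 4, 4, 8)
print(future.size // compact.size)  # 16
```

With a spatial factor of 4, the loss is computed over 16 times fewer target values per frame, which is the kind of saving the contribution describes.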
Method Details
The model has two coupled components, trained with a joint objective:
- Compact video backbone: A small video transformer (≈0.44B params) encodes the current observation and a short history of frames. Future-video supervision is applied in a downsampled latent space: the backbone predicts compressed future latents rather than raw pixels, which keeps the auxiliary loss cheap while still forcing the backbone to learn temporally structured representations.
- StateFusionActionExpert (action head):
  - Multi-layer state adaptation: From each of K backbone layers, a learned adapter projects the layer's token sequence to a "state" representation. Tapping several depths exposes both low-level visual features and high-level task structure to the action head.
  - Learned-query pooling: A small set of learned query tokens cross-attends over the multi-layer states and pools them into a fixed-size representation.
  - Action chunk prediction: A single MLP head reads the pooled representation and outputs a chunk of future actions in one forward pass, with no autoregressive decoding.
- Training objective: Joint loss = action-prediction loss (behavior cloning) + future-latent prediction loss (WAM-style auxiliary). The video loss is weighted to act as a regularizer on the backbone, not as the primary signal.
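The three stages of the action head and the joint objective can be sketched in a few lines of NumPy. This is an illustrative stand-in under stated assumptions, not the paper's implementation: the linear adapters, single-head attention, concatenation of states along the token axis, two-layer ReLU MLP, MSE losses, the `video_weight` value, and all dimensions are hypothetical, and the weights are random rather than trained. The input is unbatched to keep the shapes easy to follow.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class StateFusionHeadSketch:
    """Hypothetical stand-in for StateFusionActionExpert (names and sizes assumed)."""

    def __init__(self, num_layers, d_model, d_state, num_queries,
                 chunk_len, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: rng.normal(scale=shape[0] ** -0.5, size=shape)
        # One linear adapter per tapped backbone layer (assumed form of "adapter").
        self.adapters = [init(d_model, d_state) for _ in range(num_layers)]
        self.queries = init(num_queries, d_state)           # learned query tokens
        self.w_hidden = init(num_queries * d_state, 4 * d_state)
        self.w_out = init(4 * d_state, chunk_len * action_dim)
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def __call__(self, layer_tokens):
        # 1) Multi-layer state adaptation: project each layer's (N, d_model) tokens.
        states = [tokens @ w for tokens, w in zip(layer_tokens, self.adapters)]
        memory = np.concatenate(states, axis=0)              # (K * N, d_state)
        # 2) Learned-query pooling: queries cross-attend over all adapted states.
        scores = self.queries @ memory.T / np.sqrt(memory.shape[1])
        pooled = softmax(scores, axis=-1) @ memory           # (Q, d_state), fixed size
        # 3) Action chunk prediction: one MLP pass emits the whole chunk at once.
        hidden = np.maximum(pooled.reshape(-1) @ self.w_hidden, 0.0)
        return (hidden @ self.w_out).reshape(self.chunk_len, self.action_dim)

def joint_loss(pred_actions, expert_actions, pred_latents, target_latents,
               video_weight=0.1):
    """Behavior-cloning loss plus a down-weighted future-latent loss (both MSE here)."""
    action_loss = np.mean((pred_actions - expert_actions) ** 2)
    video_loss = np.mean((pred_latents - target_latents) ** 2)
    return float(action_loss + video_weight * video_loss)

# Hypothetical sizes: 3 tapped layers, 20 tokens each, 8 actions of dimension 7.
rng = np.random.default_rng(1)
head = StateFusionHeadSketch(num_layers=3, d_model=32, d_state=16,
                             num_queries=4, chunk_len=8, action_dim=7)
layer_tokens = [rng.normal(size=(20, 32)) for _ in range(3)]
actions = head(layer_tokens)
print(actions.shape)  # (8, 7)
```

The pooled representation has shape `(num_queries, d_state)` regardless of how many tokens or layers feed it, which is what lets a fixed-size MLP produce the chunk in one pass. Keeping `video_weight` well below 1 mirrors the description of the video loss as a regularizer rather than the primary signal.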
Key Results
- LIBERO benchmark: Matches or exceeds larger WAM baselines
- RoboTwin 2.0: Usable multi-task performance at a significantly smaller model size
- 0.44B parameters, 72.03 ms inference latency, 4.1 GiB peak GPU memory
- Improved training throughput versus larger WAMs
- PDF: https://arxiv.org/pdf/2606.08242
Limitations and Future Work
- Future-video supervision is in latent space, so the backbone's generative quality (e.g., for visual planning) is not directly evaluated
- Action chunk prediction is monolithic — no explicit mechanism for variable-length horizons or replanning
- Evaluated on manipulation benchmarks; transfer to locomotion, mobile manipulation, or long-horizon tasks is open
Relevance to Patrick's Research
Light-WAM is relevant for two reasons: (1) it provides a concrete data point on the efficiency frontier of world-action models, useful if Patrick is tracking when WAMs become deployable in real-time control loops; (2) the StateFusionActionExpert's design (multi-layer state fusion, learned-query pooling) is a clean architectural pattern that could transfer to other multi-modal policies. The 72 ms latency number, which corresponds to roughly 13.9 forward passes per second if inference is the only cost, is a useful benchmark for "what does it take to close the loop with a WAM at 10+ Hz."
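The link between the reported latency and the 10+ Hz target is simple arithmetic, sketched below. It assumes the 72.03 ms figure covers one full forward pass that yields an action chunk, and it ignores sensing, communication, and actuation delays. The chunk length of 8 is a hypothetical value, since the summary does not state one.

```python
LATENCY_MS = 72.03  # reported inference latency

def inference_rate_hz(latency_ms: float) -> float:
    """Forward passes per second if inference is the only cost in the loop."""
    return 1000.0 / latency_ms

def action_rate_hz(latency_ms: float, chunk_len: int) -> float:
    """Upper bound on actions per second when every action in a chunk is
    executed before the next inference (open loop within the chunk)."""
    return chunk_len * inference_rate_hz(latency_ms)

rate = inference_rate_hz(LATENCY_MS)
print(f"{rate:.2f} Hz")  # 13.88 Hz

# Hypothetical chunk length; the real value is not given in this summary.
chunk_rate = action_rate_hz(LATENCY_MS, chunk_len=8)
print(chunk_rate > rate)  # True
```

Replanning on every step is therefore bounded by the per-pass rate, while executing whole chunks raises the action rate at the cost of running open loop between inferences.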