title: "Light-WAM: Efficient World Action Models with State-Fusion Action Decoding"

arxiv_id: "2606.08242"

date: "2026-06-06"

tags: [world-model, robot-learning, action-model, efficient-inference, manipulation]

---

Abstract

Light-WAM is a lightweight World Action Model (WAM) for robot manipulation that retains the future-prediction benefits of large WAMs while drastically reducing trainable parameters and inference latency. It combines a compact video backbone with a novel StateFusionActionExpert that reads adapted states from multiple backbone layers and predicts action chunks in a single forward pass, with no heavy generative action head.

Key Contributions

Method Details

The model has two coupled components, trained with a joint objective:

  1. Compact video backbone: A small video transformer (≈0.44B params) encodes the current observation and a short history of frames. Future-video supervision is applied in a downsampled latent space: the backbone predicts compressed future latents, not raw pixels, which keeps the auxiliary loss cheap while still forcing the backbone to learn temporally structured representations.
  2. StateFusionActionExpert (action head):
  3. Training objective: Joint loss = action-prediction loss (behavior cloning) + future-latent prediction loss (WAM-style auxiliary). The video loss is weighted to act as a regularizer on the backbone, not as the primary signal.
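
The joint objective above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the summary does not say which loss functions are used, so mean squared error for both terms is an assumption, and the names `mse`, `joint_loss`, and `video_weight` (with its default of 0.1) are hypothetical.

```python
def mse(pred, target):
    """Mean squared error over two equal-length flat sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(action_pred, action_target, latent_pred, latent_target,
               video_weight=0.1):
    """Behavior-cloning loss plus a down-weighted future-latent loss.

    ASSUMPTION: both terms are MSE and video_weight=0.1 is an arbitrary
    illustrative value; the source only says the video loss is weighted
    so that it regularizes the backbone rather than dominating.
    """
    action_loss = mse(action_pred, action_target)   # primary signal
    video_loss = mse(latent_pred, latent_target)    # auxiliary regularizer
    total = action_loss + video_weight * video_loss
    return total, action_loss, video_loss


# Toy example: a 2-value action chunk and a 2-value compressed future latent.
total, action_loss, video_loss = joint_loss(
    action_pred=[1.0, 2.0], action_target=[1.0, 4.0],
    latent_pred=[0.0, 0.0], latent_target=[2.0, 2.0],
)
print(f"total={total:.2f} action={action_loss:.2f} video={video_loss:.2f}")
```

Keeping `video_weight` well below 1 is what makes the future-latent term behave as a regularizer: the gradient from the action loss dominates, while the video term still pushes the backbone toward representations that predict the future.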

Key Results

Limitations and Future Work

Relevance to Patrick's Research

Light-WAM is relevant for two reasons: (1) it provides a concrete data point on the efficiency frontier of world-action models — useful if Patrick is tracking when WAMs become deployable in real-time control loops; (2) the StateFusionActionExpert's design (multi-layer state fusion, learned-query pooling) is a clean architectural pattern that could transfer to other multi-modal policies. The 72 ms latency number is a useful benchmark for "what does it take to close the loop with a WAM at 10+ Hz."
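
As a quick check on the "10+ Hz" framing, the arithmetic can be written out. This sketch assumes the simplest possible control loop, where each control step waits for exactly one inference call and nothing else adds delay; the function names are hypothetical, and only the 72 ms figure comes from the text above.

```python
def max_control_rate_hz(latency_ms: float) -> float:
    """Upper bound on loop rate if every step waits for one inference call."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return 1000.0 / latency_ms


def latency_budget_ms(target_hz: float) -> float:
    """Time available per control step at a given target rate."""
    if target_hz <= 0:
        raise ValueError("target_hz must be positive")
    return 1000.0 / target_hz


inference_ms = 72.0                        # latency quoted in the summary
rate = max_control_rate_hz(inference_ms)   # 1000 / 72
budget = latency_budget_ms(10.0)           # 100 ms per step at 10 Hz
print(f"{rate:.1f} Hz, headroom {budget - inference_ms:.0f} ms")
# prints "13.9 Hz, headroom 28 ms"
```

Under these assumptions, 72 ms per inference leaves roughly 28 ms per step for sensing, communication, and actuation at 10 Hz. Because the model predicts action chunks, executing several actions per inference call could raise the effective command rate further, but the chunk length is not stated here.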