DriveWAM is a driving world-action model that adapts a pretrained video diffusion transformer into an autoregressive video-action policy for end-to-end autonomous driving. It unifies video and action streams into a joint temporal token sequence trained under flow matching. A frozen vision-language model provides scene-evolving driving guidance, while selective KV memory maintains bounded modality-aware memory pools for long-horizon rollout. Experiments on NAVSIM and PhysicalAI-Autonomous-Vehicles benchmarks demonstrate strong planning performance, and scaling from 4k to 100k driving clips confirms the approach's data scaling potential.
- First work to adapt a pretrained video diffusion transformer into an autoregressive video-action policy for end-to-end driving
Architecture: DriveWAM adapts a pretrained video diffusion transformer (VDM) into a world-action model through four components:
1. Unified Temporal Token Sequence: Video frames and action commands (e.g., steering, acceleration) are organized into a single temporal token stream, treating actions as additional "modality tokens" alongside visual tokens.
2. Joint Flow Matching: Instead of separate video prediction and action prediction heads, DriveWAM trains under a joint flow-matching objective that predicts the combined video-action trajectory. This preserves the pretrained video-generation architecture while learning action generation.
3. Scene-Evolving Driving Guidance: A frozen vision-language model (VLM) processes the video context and produces chunk-specific semantic intent (e.g., "merge onto highway", "yield at intersection"). These high-level instructions guide the video-action generation without fine-tuning the VLM.
4. Selective KV Memory: For long-horizon rollout, DriveWAM maintains bounded modality-aware memory pools. At each step, a relevance-redundancy selection decides which video and action tokens stay in the KV cache, so memory stays bounded while useful context is preserved.
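The relevance-redundancy selection in item 4 can be illustrated with a minimal sketch. The summary does not describe the actual scoring rule, so the code below substitutes a greedy, MMR-style criterion (relevance to a query minus similarity to entries already kept) as an assumption. The names `select_entries`, `update_pools`, `redundancy_weight`, and the per-pool `budgets` are hypothetical and are not taken from DriveWAM.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def select_entries(keys, query, budget, redundancy_weight=1.0):
    """Greedily keep at most `budget` cache entries (hypothetical scoring rule).

    score(i) = relevance(key_i, query)
               - redundancy_weight * max similarity to entries already kept
    Returns the kept indices in their original temporal order.
    """
    kept = []
    candidates = list(range(len(keys)))
    while candidates and len(kept) < budget:
        def score(i):
            relevance = cosine(keys[i], query)
            redundancy = max((cosine(keys[i], keys[j]) for j in kept), default=0.0)
            return relevance - redundancy_weight * redundancy

        best = max(candidates, key=score)
        kept.append(best)
        candidates.remove(best)
    return sorted(kept)


def update_pools(pools, budgets, query):
    """Trim each modality pool (e.g. "video", "action") to its own budget."""
    return {
        name: [keys[i] for i in select_entries(keys, query, budgets[name])]
        for name, keys in pools.items()
    }


# Two identical video keys and one distinct key: the duplicate is dropped.
video_keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
kept = select_entries(video_keys, query=[1.0, 0.2], budget=2)
```

Because each pool is trimmed against its own budget, a burst of visual tokens cannot evict action tokens, which is one plausible reading of "modality-aware" in the description above.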
Pretrained Backbone: DriveWAM builds on a video diffusion transformer pretrained on large-scale video data. The backbone brings temporal dynamics and motion priors learned during pretraining, an advantage over vision-language models pretrained only on static image-text pairs.
The data scaling study from 4k to 100k driving clips shows monotonically improving performance, confirming that world-action modeling benefits from larger datasets.
- Reliance on the quality of the pretrained video diffusion transformer: if the backbone has biases, DriveWAM inherits them
DriveWAM directly demonstrates how video generative priors (Sora/VDM-style models) can be leveraged for robotics and control, in this case autonomous driving. The flow-matching approach to joint video-action prediction is architecturally similar to how world models might be used in robotics more broadly. The selective KV memory mechanism for bounded inference is a practical solution for anyone doing long-horizon world model rollouts. This connects world-model and embodied-agent work such as Genesis, DeepMind's Genie, and NVIDIA's Voyager to real-world control applications.
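To make the joint flow-matching idea concrete, here is a toy sketch in plain Python. It assumes the common linear (rectified-flow style) path x_t = (1 - t) * x0 + t * x1 with Gaussian noise x0 and target velocity x1 - x0, and a single noise level t shared by video and action tokens. Neither choice is confirmed by the summary, and `build_joint_sequence`, `flow_matching_pair`, `joint_flow_matching_loss`, and the `predict_velocity` callback are hypothetical names used only for illustration.

```python
import random


def build_joint_sequence(video_tokens, action_tokens):
    """Interleave per-step video and action tokens into one temporal stream.

    video_tokens[k] / action_tokens[k] are lists of token vectors for step k.
    Returns a flat list of (modality, step, vector) triples.
    """
    if len(video_tokens) != len(action_tokens):
        raise ValueError("video and action streams must cover the same steps")
    sequence = []
    for step, (frames, actions) in enumerate(zip(video_tokens, action_tokens)):
        sequence += [("video", step, tok) for tok in frames]
        sequence += [("action", step, tok) for tok in actions]
    return sequence


def flow_matching_pair(x1, t, rng):
    """Sample x0 ~ N(0, I); return the noisy point x_t and the target velocity."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def joint_flow_matching_loss(sequence, predict_velocity, rng):
    """Mean squared velocity error over all tokens, video and action alike."""
    t = rng.random()  # one noise level shared by both modalities (assumption)
    total, count = 0.0, 0
    for modality, step, x1 in sequence:
        x_t, target = flow_matching_pair(x1, t, rng)
        pred = predict_velocity(modality, step, x_t, t)
        total += sum((p - q) ** 2 for p, q in zip(pred, target))
        count += len(x1)
    return total / count


rng = random.Random(0)
seq = build_joint_sequence(
    video_tokens=[[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6]]],
    action_tokens=[[[1.0, 0.0]], [[0.0, 1.0]]],
)
# A trivial stand-in model that always predicts zero velocity.
loss = joint_flow_matching_loss(seq, lambda m, s, x, t: [0.0] * len(x), rng)
```

In a real system `predict_velocity` would be the transformer itself, and a single loss over the interleaved sequence is what lets one network serve as both video predictor and action policy.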
---