DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving

Abstract

DriveWAM is a driving world-action model that adapts a pretrained video diffusion transformer into an autoregressive video-action policy for end-to-end autonomous driving. It unifies video and action streams into a joint temporal token sequence trained under flow matching. A frozen vision-language model provides scene-evolving driving guidance, while selective KV memory maintains bounded modality-aware memory pools for long-horizon rollout. Experiments on NAVSIM and PhysicalAI-Autonomous-Vehicles benchmarks demonstrate strong planning performance, and scaling from 4k to 100k driving clips confirms the approach's data scaling potential.

Key Contributions

- First work to adapt a pretrained video diffusion transformer into an autoregressive video-action policy for end-to-end driving
- Joint video-action flow-matching objective that preserves the pretrained video-generation architecture
- Scene-evolving driving guidance from a frozen VLM that supplies high-level semantic intent
- Selective KV memory with relevance-redundancy cache selection for bounded long-horizon inference
- Data scaling study (4k → 100k clips) demonstrating the scaling potential of world-action modeling

Method Details

Architecture: DriveWAM adapts a pretrained video diffusion transformer (a video diffusion model, VDM) into a world-action model through four components:

1. Unified Temporal Token Sequence: Video frames and action commands (e.g., steering, acceleration) are organized into a single temporal token stream, with actions treated as additional "modality tokens" alongside visual tokens.

2. Joint Flow Matching: Instead of using separate video-prediction and action-prediction heads, DriveWAM trains under a single joint flow-matching objective that predicts the combined video-action trajectory. This preserves the pretrained video-generation architecture while adding action generation.

3. Scene-Evolving Driving Guidance: A frozen vision-language model (VLM) processes the video context and produces chunk-specific semantic intent (e.g., "merge onto highway", "yield at intersection"). These high-level instructions guide video-action generation without fine-tuning the VLM.

4. Selective KV Memory: For long-horizon rollout, DriveWAM maintains bounded, modality-aware memory pools. At each step, it performs relevance-redundancy cache selection to decide which video and action tokens stay in the KV cache, so memory remains bounded while useful context is preserved.
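The relevance-redundancy selection in step 4 can be illustrated with a small sketch. The summary above does not give DriveWAM's actual scoring rule, so the code below substitutes a greedy maximal-marginal-relevance (MMR) heuristic: each cached key is scored by its cosine similarity to the current query, minus a penalty for resembling entries already kept. The function names `select_cache` and `prune_pools`, the `redundancy_weight` parameter, and the per-modality `budgets` dictionary are all hypothetical and chosen for illustration only.

```python
import numpy as np

def _unit(x):
    # Normalize along the last axis so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def select_cache(keys, query, budget, redundancy_weight=0.5):
    """Greedily choose up to `budget` cache entries (MMR-style; an assumption,
    not the paper's rule).

    keys:  (n, d) cached key vectors for one modality pool
    query: (d,)   query vector for the current rollout step
    Returns the sorted indices of the entries to keep.
    """
    n = keys.shape[0]
    if n <= budget:
        return np.arange(n)
    k = _unit(keys)
    relevance = k @ _unit(query)        # similarity of each entry to the query
    sim = k @ k.T                       # pairwise similarity between entries
    max_sim = np.zeros(n)               # similarity to the closest kept entry
    available = np.ones(n, dtype=bool)
    selected = []
    for _ in range(budget):
        score = relevance - redundancy_weight * max_sim
        score[~available] = -np.inf     # never pick the same entry twice
        i = int(np.argmax(score))
        selected.append(i)
        available[i] = False
        max_sim = np.maximum(max_sim, sim[i])
    return np.sort(np.array(selected))

def prune_pools(pools, query, budgets):
    """Apply the selection separately to each modality pool.

    pools:   {"video": (keys, values), "action": (keys, values)}
    budgets: {"video": int, "action": int}  (hypothetical per-pool limits)
    """
    pruned = {}
    for name, (k, v) in pools.items():
        keep = select_cache(k, query, budgets[name])
        pruned[name] = (k[keep], v[keep])
    return pruned
```

Keeping one pool per modality means a large number of video tokens cannot crowd out the much smaller set of action tokens, which is one plausible reading of "modality-aware" in the description above.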

Pretrained Backbone: DriveWAM builds on a video diffusion transformer pretrained on large-scale video data, so it inherits the temporal dynamics and motion priors learned during pretraining. This is an advantage over vision-language models pretrained only on static image-text pairs.
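The unified token sequence and joint flow-matching objective described under Method Details can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the common linear-interpolation (rectified-flow) form of flow matching, assumes action commands have already been embedded to the same width `d` as the video tokens, and uses hypothetical names (`build_joint_sequence`, `flow_matching_loss`, and a `model(x_t, t)` callable standing in for the adapted backbone).

```python
import numpy as np

def build_joint_sequence(video_tokens, action_tokens):
    """Place each frame's action tokens right after its video tokens.

    video_tokens:  (T, Nv, d) video latent tokens per frame
    action_tokens: (T, Na, d) embedded action tokens per frame (assumed shape)
    Returns a single temporal sequence of shape (T * (Nv + Na), d).
    """
    joint = np.concatenate([video_tokens, action_tokens], axis=1)
    return joint.reshape(-1, joint.shape[-1])

def flow_matching_loss(model, x1, rng):
    """One-sample flow-matching loss on a joint video-action sequence.

    x1 is the clean joint sequence. A noise sample x0 and a time t are drawn,
    the model sees the interpolated point x_t, and it is regressed onto the
    constant velocity x1 - x0 of the straight path from noise to data.
    """
    t = rng.uniform()                       # time in [0, 1)
    x0 = rng.standard_normal(x1.shape)      # Gaussian noise sample
    xt = (1.0 - t) * x0 + t * x1            # point on the straight path
    target = x1 - x0                        # velocity of that path
    pred = model(xt, t)                     # one network covers both modalities
    return float(np.mean((pred - target) ** 2))
```

Because video and action tokens share one sequence and one regression target, no separate action head is needed in this sketch, which mirrors the stated goal of preserving the pretrained video-generation architecture.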

Key Results

| Experiment | Result |
|------------|--------|
| NAVSIM benchmark | Strong planning performance |
| PhysicalAI-AV benchmark | Strong planning performance |
| Data scaling (4k → 100k clips) | Consistent improvement with more data |

The data scaling study from 4k to 100k driving clips shows monotonically improving performance, confirming that world-action modeling benefits from larger datasets, an important finding for the field's scaling trajectory.

Limitations and Future Work

- Reliance on pretrained video diffusion transformer quality: if the backbone has biases, DriveWAM inherits them
- Scene-evolving guidance from the frozen VLM may not always be aligned with the optimal driving strategy
- Evaluation primarily on simulated benchmarks; real-world deployment generalization unclear
- Long-tail safety-critical scenarios may require additional training data beyond 100k clips
- Action discretization: continuous action spaces may lose fidelity in tokenization

Relevance to Patrick's Research

DriveWAM directly demonstrates how video generative priors (Sora/VDM-style models) can be leveraged for robotics and control, in this case autonomous driving. The flow-matching approach for joint video-action prediction is architecturally similar to how world models might be used in robotics more broadly. The selective KV memory mechanism for bounded inference is a practical solution for anyone doing long-horizon world model rollouts. This connects DeepMind's Genesis/Genie work and NVIDIA's Voyager to real-world control applications.

---

*Source: arXiv:2605.28544 | PDF: 2605.28544.pdf*