Abstract

minWM addresses the challenge of converting video diffusion foundation models into real-time interactive video world models. The authors present a full-stack open-source framework spanning data construction, controllable fine-tuning, autoregressive training, few-step distillation, and streaming inference. Interactive world models require controllable, causal, and low-latency rollout, and existing video generation models do not natively satisfy these demands.

Key Contributions

- minWM full-stack framework: Open-source framework covering the entire pipeline from raw video data to streaming interactive world model inference
- Controllable fine-tuning with AR training: Enables video diffusion models to respond to controls while maintaining temporal consistency
- Few-step distillation: Reduces inference latency to real-time levels while preserving generation quality
- Streaming inference: Supports continuous, low-latency rollout for interactive applications

Method Details

The pipeline consists of five stages:

1. Data construction: Curating video data with action/control annotations suitable for world model training
2. Controllable fine-tuning: Adapting pre-trained video diffusion models to accept action/control signals while retaining generative quality
3. Autoregressive (AR) training: Training the model to generate in a causal, sequential manner suitable for interactive rollout, where each frame conditions on previous frames and the current action
4. Few-step distillation: Compressing the multi-step diffusion denoising process into 4-8 steps without significant quality degradation
5. Streaming inference engine: Optimized CUDA kernels and KV-cache management for continuous low-latency generation

The architecture builds on existing video diffusion foundation models (e.g., those based on DiT/UViT architectures) and adds action conditioning pathways.
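To make the interaction between causal rollout, few-step sampling, and a bounded context cache concrete, here is a minimal pure-Python sketch. It is not minWM's code: `toy_denoiser`, `sample_frame`, `rollout`, and the `max_context` parameter are hypothetical names, frames are short lists of floats rather than video latents, and the "network" is a hand-written rule standing in for a learned action-conditioned model. The sampler takes a fixed number of deterministic Euler steps from noise toward the predicted clean frame, which is one common few-step formulation and is assumed here rather than taken from the paper.

```python
import random
from collections import deque


def toy_denoiser(noisy_frame, context, action, t):
    """Hypothetical stand-in for the action-conditioned network.

    Predicts the clean next frame as "last cached frame shifted by the
    action". A real model would also use noisy_frame and the timestep t.
    """
    last = context[-1]
    return [value + action for value in last]


def sample_frame(context, action, num_steps, rng):
    """Few-step deterministic sampler (assumed Euler-style update)."""
    dim = len(context[-1])
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from pure noise
    # Noise levels from 1.0 (pure noise) down to 0.0 (clean frame).
    times = [1.0 - i / num_steps for i in range(num_steps + 1)]
    for t, t_next in zip(times[:-1], times[1:]):
        pred = toy_denoiser(x, context, action, t)
        ratio = t_next / t  # fraction of the current noise that is kept
        x = [ratio * xi + (1.0 - ratio) * pi for xi, pi in zip(x, pred)]
    return x


def rollout(first_frame, actions, num_steps=4, max_context=8, seed=0):
    """Causal streaming rollout: one frame per action, yielded as produced."""
    rng = random.Random(seed)
    # Bounded cache of past frames, loosely analogous to a rolling KV cache.
    context = deque([first_frame], maxlen=max_context)
    for action in actions:
        frame = sample_frame(context, action, num_steps, rng)
        context.append(frame)  # only past frames are ever visible
        yield frame


frames = list(rollout([0.0, 0.0], actions=[1.0, 1.0, -0.5]))
print(frames)
```

Because `rollout` is a generator, a caller can supply each action only after seeing the previous frame, which is the property that separates interactive rollout from offline clip generation. The `deque` with `maxlen` keeps memory constant over arbitrarily long sessions, at the cost of forgetting frames older than the window.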

Key Results

- The framework enables video world models with controllable, causal, low-latency rollout, the three core requirements for interactive deployment
- Few-step distillation reduces denoising steps from 50+ to 4-8 steps while maintaining visual quality
- Open-source release with full training and inference code
- Project page: https://github.com/minwm/minwm
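As a rough back-of-the-envelope illustration of what the step reduction listed above buys, the sketch below assumes that per-frame cost is dominated by denoiser forward passes, so cost scales linearly with the number of steps. The function names and the 10 frames-per-second target are hypothetical and chosen only for illustration; the summary does not report a frame rate.

```python
def step_reduction(baseline_steps, distilled_steps):
    """Factor by which the number of denoiser calls per frame shrinks."""
    return baseline_steps / distilled_steps


def per_step_budget_ms(target_fps, steps):
    """Time available per denoiser call if one frame must fit in 1/fps seconds."""
    return 1000.0 / (target_fps * steps)


for steps in (4, 8):
    reduction = step_reduction(50, steps)
    budget = per_step_budget_ms(10, steps)  # 10 fps is an assumed target
    print(f"{steps} steps: {reduction:.2f}x fewer calls, {budget:.1f} ms per call")
```

Under these assumptions, going from 50 steps to 4-8 steps cuts denoiser calls by roughly 6x to 12x. Actual speedups depend on everything else in the loop (decoding, cache management, data transfer), which this estimate ignores.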

Limitations and Future Work

- Performance depends heavily on quality and coverage of training data/action annotations
- Real-time performance requires GPU hardware; edge deployment not yet supported
- Current framework targets video generation; integration with planning/control modules is future work

Relevance to Patrick's Research

minWM provides a practical, reproducible framework for building interactive video world models, which is directly relevant if Patrick is tracking the gap between Sora-style video generators and truly interactive world models. The open-source release makes it a strong baseline for comparing world model architectures. The emphasis on causal, controllable, low-latency rollout captures the three key dimensions that separate "video generation" from "world model."