minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

Abstract

minWM addresses the challenge of converting video diffusion foundation models into real-time interactive video world models. The authors present a full-stack open-source framework spanning data construction, controllable fine-tuning, autoregressive training, few-step distillation, and streaming inference. Interactive world models require controllable, causal, and low-latency rollout — demands that existing video generation models do not natively satisfy.

Key Contributions

- minWM full-stack framework: Open-source framework covering the entire pipeline from raw video data to streaming interactive world model inference

- Controllable fine-tuning with AR training: Enables video diffusion models to respond to controls while maintaining temporal consistency

- Few-step distillation: Reduces inference latency to real-time levels while preserving generation quality

- Streaming inference: Supports continuous, low-latency rollout for interactive applications

Method Details

The pipeline consists of five stages:

1. Data construction: Curating video data with action/control annotations suitable for world model training

2. Controllable fine-tuning: Adapting pre-trained video diffusion models to accept action/control signals while retaining generative quality

3. Autoregressive (AR) training: Training the model to generate in a causal, sequential manner suitable for interactive rollout, where each frame is conditioned on previous frames and the current action

4. Few-step distillation: Compressing the multi-step diffusion denoising process into 4-8 steps without significant quality degradation

5. Streaming inference engine: Optimized CUDA kernels and KV-cache management for continuous low-latency generation
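To make stages 3 to 5 concrete, here is a minimal toy sketch of a causal, few-step, streaming rollout loop. It is not minWM's code: `toy_denoiser`, `rollout`, the bounded `deque` standing in for a KV cache, and the simple interpolation toward the predicted clean frame are all assumptions chosen for illustration.

```python
import random
from collections import deque


def toy_denoiser(noisy, context, action, t):
    """Hypothetical stand-in for a distilled, action-conditioned denoiser.

    Predicts a clean frame as the mean of the cached context frames,
    shifted by the action. A real model would also use `noisy` and `t`.
    """
    if context:
        base = [sum(px) / len(context) for px in zip(*context)]
    else:
        base = [0.0] * len(noisy)
    return [b + action for b in base]


def rollout(actions, frame_size=4, num_steps=4, context_len=3, seed=0):
    """Yield one frame per action, using only past frames and the current action."""
    rng = random.Random(seed)
    # Bounded history of past frames; plays the role of a KV cache here.
    context = deque(maxlen=context_len)
    for action in actions:
        # Each new frame starts from pure noise.
        x = [rng.gauss(0.0, 1.0) for _ in range(frame_size)]
        for i in range(num_steps):
            t = 1.0 - i / num_steps             # current noise level
            t_next = 1.0 - (i + 1) / num_steps  # next noise level (0 at the end)
            x0 = toy_denoiser(x, list(context), action, t)
            # Deterministic step: shrink the remaining noise by t_next / t.
            x = [p + (t_next / t) * (xi - p) for xi, p in zip(x, x0)]
        context.append(x)
        yield x  # the frame is available before the next action is read


if __name__ == "__main__":
    for frame in rollout([1.0, 0.0, -1.0]):
        print(frame)
```

Because `rollout` is a generator, each frame can be displayed before the next action arrives, which is the property an interactive loop needs. The number of denoiser calls per frame is exactly `num_steps`.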

The architecture builds on existing video diffusion foundation models (e.g., those based on DiT or U-ViT architectures) and adds action conditioning pathways.
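The summary does not say how the action conditioning pathways are implemented. One common option in DiT-style models is adaptive layer norm modulation, where a conditioning embedding produces a per-channel scale and shift. The sketch below assumes that design; `action_modulate` and all weight names are hypothetical and not taken from minWM.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x, weight, bias):
    """Dense layer; `weight` holds one row of input weights per output channel."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def action_modulate(token, action_emb, w_scale, b_scale, w_shift, b_shift):
    """Assumed adaLN-style pathway: the action sets a scale and shift per channel."""
    scale = linear(action_emb, w_scale, b_scale)
    shift = linear(action_emb, w_shift, b_shift)
    normed = layer_norm(token)
    return [n * (1.0 + s) + sh for n, s, sh in zip(normed, scale, shift)]


if __name__ == "__main__":
    token = [1.0, 2.0, 3.0]
    action_emb = [0.5, -0.5]
    zeros_w = [[0.0, 0.0] for _ in token]
    zeros_b = [0.0 for _ in token]
    # With zero-initialized projections the action has no effect yet.
    print(action_modulate(token, action_emb, zeros_w, zeros_b, zeros_w, zeros_b))
```

With zero-initialized projections the modulation is a no-op, so in this sketch the pre-trained features pass through unchanged at the start of fine-tuning and the action pathway is learned gradually.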

Key Results

- Framework enables video world models with controllable, causal, low-latency rollout — three core requirements for interactive deployment

- Few-step distillation reduces denoising steps from 50+ to 4-8 steps while maintaining visual quality

- Open-source release with full training and inference code
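As a rough back-of-the-envelope check on the step counts above, the sketch below computes the reduction in denoiser calls per frame. It assumes cost scales linearly with the number of denoiser calls and ignores everything else (VAE decoding, attention over cached context). The frame rate passed to `max_ms_per_call` is a hypothetical target, not a figure from the paper.

```python
def denoiser_call_reduction(teacher_steps, student_steps):
    """Factor by which denoiser calls per frame shrink after distillation."""
    return teacher_steps / student_steps


def max_ms_per_call(target_fps, steps):
    """Time budget per denoiser call if denoising were the only cost."""
    return 1000.0 / (target_fps * steps)


if __name__ == "__main__":
    for student_steps in (4, 8):
        factor = denoiser_call_reduction(50, student_steps)
        print(f"50 -> {student_steps} steps: {factor}x fewer denoiser calls")
    # Hypothetical target of 10 frames per second, for illustration only.
    print(max_ms_per_call(10, 4), "ms per call at 4 steps")
```

Going from 50 steps to 8 steps cuts denoiser calls by 6.25x, and going to 4 steps cuts them by 12.5x. Since the teacher uses 50 or more steps, these are lower bounds on the reduction.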

Project page: https://github.com/minwm/minwm

Limitations and Future Work

- Performance depends heavily on quality and coverage of training data/action annotations

- Real-time performance requires GPU hardware; edge deployment not yet supported

- Current framework targets video generation; integration with planning/control modules is future work

Relevance to Patrick's Research

minWM provides a practical, reproducible framework for building interactive video world models — directly relevant if Patrick is tracking the gap between Sora-style video generators and truly interactive world models. The open-source release makes it a strong baseline for comparing world model architectures. The emphasis on causal, controllable, low-latency rollout captures the three key dimensions that separate "video generation" from "world model."