minWM tackles the problem of converting video diffusion foundation models into real-time interactive video world models. The authors present a full-stack open-source framework spanning data construction, controllable fine-tuning, autoregressive training, few-step distillation, and streaming inference. Interactive world models require controllable, causal, and low-latency rollout, and existing video generation models do not natively satisfy these demands.
- minWM full-stack framework: an open-source pipeline covering every step from raw video data to streaming interactive world model inference
The pipeline consists of five stages:
1. Data construction: Curating video data with action/control annotations suitable for world model training
The architecture builds on existing video diffusion foundation models (e.g., DiT- or UViT-based architectures) and adds action-conditioning pathways.
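The summary does not say how the action-conditioning pathways are wired in. As an illustration only, here is a minimal sketch that assumes an additive per-frame action embedding, which is one common option and not confirmed as minWM's design. The names `make_embedding_table` and `condition_tokens` are hypothetical, and plain Python lists stand in for tensors.

```python
import random


def make_embedding_table(num_actions, dim, seed=0):
    """Build a small random lookup table: one vector per discrete action.

    Hypothetical helper; in a real model this would be a learned layer.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_actions)]


def condition_tokens(frame_tokens, action_id, table):
    """Add the chosen action's embedding to every token of one frame.

    Assumption: additive conditioning. Other designs (cross-attention,
    adaptive normalization) are equally plausible and not shown here.
    """
    emb = table[action_id]
    return [[t + e for t, e in zip(token, emb)] for token in frame_tokens]


if __name__ == "__main__":
    table = make_embedding_table(num_actions=4, dim=8)
    tokens = [[0.0] * 8 for _ in range(3)]  # 3 tokens for one frame
    conditioned = condition_tokens(tokens, action_id=2, table=table)
    print(len(conditioned), len(conditioned[0]))
```

Because the conditioning is applied per frame, each frame can receive a different action, which is what makes the rollout controllable at every step.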
- The framework enables video world models with controllable, causal, low-latency rollout, the three core requirements for interactive deployment
- Performance depends heavily on the quality and coverage of the training data and action annotations
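To make the three rollout requirements concrete, the sketch below shows one way a streaming loop could be structured. It is an assumption-laden toy, not the minWM implementation: `toy_predict` is a hypothetical stand-in for a few-step distilled denoiser, and `stream_rollout`, `context_len`, and `num_steps` are illustrative names. The loop is causal because each frame depends only on past frames and the current action, controllable because an action is consumed per frame, and bounded in per-frame cost because the context window and step count are fixed.

```python
from collections import deque


def toy_predict(context, action, num_steps):
    """Hypothetical stand-in for a few-step denoiser.

    Starts from the most recent frame and applies `num_steps` small
    updates in the direction given by `action`.
    """
    frame = list(context[-1])
    for _ in range(num_steps):
        frame = [x + 0.25 * action for x in frame]
    return frame


def stream_rollout(first_frame, actions, predict=toy_predict,
                   context_len=4, num_steps=4):
    """Yield frames one at a time as actions arrive.

    Only past frames are visible to `predict` (causal), and the deque
    drops the oldest frame once `context_len` is reached, so memory
    and per-frame work stay bounded.
    """
    context = deque([first_frame], maxlen=context_len)
    for action in actions:
        frame = predict(list(context), action, num_steps)
        context.append(frame)
        yield frame


if __name__ == "__main__":
    for frame in stream_rollout([0.0, 0.0], actions=[1, 0, -1]):
        print(frame)
```

Since `stream_rollout` is a generator, each frame is produced only when the consumer asks for it, which mirrors how an interactive client would request the next frame after sending an input.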
minWM provides a practical, reproducible framework for building interactive video world models, which is directly relevant if Patrick is tracking the gap between Sora-style video generators and truly interactive world models. The open-source release makes it a strong baseline for comparing world model architectures. The emphasis on causal, controllable, low-latency rollout captures the three key dimensions that separate "video generation" from a "world model."