Abstract
BiWM is presented as the first full-stack open-source framework for interactive video world models under the bidirectional autoregressive paradigm. Existing causal pipelines (e.g., minWM) need four training stages and still trail bidirectional models in quality because errors accumulate over the rollout, while bidirectional models like Yume-1.5 and Matrix-Game-3.0 self-correct but lack an open framework. BiWM collapses the pipeline to two training stages (control fine-tuning, then few-step DMD distillation), spans four backbones (Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, LTX-2.3-22B), and adds pluggable history compression plus an optional NVFP4 4-bit path for long rollouts.
Key Contributions
- First open-source bidirectional-autoregressive framework for interactive video world models (per the authors): fills the gap left by minWM (causal-only) and by bidirectional systems like Yume-1.5 / Matrix-Game-3.0 that lack an open framework
- Two-stage training recipe — control fine-tuning + few-step Distribution Matching Distillation (DMD), converging in "a few hundred steps on 8×H200 GPUs"
- Backbone-agnostic — single recipe ports to Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, and LTX-2.3-22B
- Pluggable history compression — supports FramePack-style and PackForcing-style modules for long rollouts
- Optional NVFP4 4-bit training/inference — quantization path for memory-constrained deployment
- Mode-covering regularizers — GAN loss + mass-covering forward-KL added to counter DMD's mode-seeking tendency and preserve scene dynamics
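The mode-seeking versus mass-covering distinction in the last bullet can be seen in a toy setting. The sketch below is not BiWM's loss and uses no video model; it fits a single Gaussian q to a bimodal target p by brute-force grid search, once under reverse KL (the divergence family DMD approximates) and once under forward KL. The target mixture, the grid ranges, and all function names are assumptions made for illustration.

```python
import numpy as np

def log_normal(x, mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# Integration grid and a bimodal target p(x):
# an equal mixture of N(-3, 0.7^2) and N(3, 0.7^2) (illustrative choice).
x = np.linspace(-12.0, 12.0, 4801)
dx = x[1] - x[0]
log_p = np.logaddexp(log_normal(x, -3.0, 0.7), log_normal(x, 3.0, 0.7)) + np.log(0.5)
p = np.exp(log_p)

def forward_kl(mu, sigma):
    """KL(p || q): mass-covering, penalizes q for missing any mode of p."""
    log_q = log_normal(x, mu, sigma)
    return float(np.sum(p * (log_p - log_q)) * dx)

def reverse_kl(mu, sigma):
    """KL(q || p): mode-seeking, penalizes q for placing mass where p is small."""
    log_q = log_normal(x, mu, sigma)
    return float(np.sum(np.exp(log_q) * (log_q - log_p)) * dx)

def grid_argmin(objective):
    """Brute-force search over (mu, sigma) for the single-Gaussian fit q."""
    mus = np.linspace(-4.0, 4.0, 81)
    sigmas = np.linspace(0.3, 4.0, 75)
    best = min((objective(m, s), m, s) for m in mus for s in sigmas)
    return float(best[1]), float(best[2])

mu_f, sigma_f = grid_argmin(forward_kl)
mu_r, sigma_r = grid_argmin(reverse_kl)
print(f"forward KL fit: mu={mu_f:+.2f}, sigma={sigma_f:.2f}")
print(f"reverse KL fit: mu={mu_r:+.2f}, sigma={sigma_r:.2f}")
```

The forward-KL fit sits between the two modes with a wide variance (it covers both), while the reverse-KL fit locks onto one mode with a narrow variance. That is the intuition behind adding a forward-KL term to a reverse-KL-style distillation objective; how BiWM weights the terms is not stated in the abstract.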
Method Details
Pipeline (two stages from a pretrained video backbone):
- Stage 1 — Control fine-tuning: inject camera and action conditioning into the pretrained video diffusion backbone, retaining its bidirectional generative quality
- Stage 2 — Few-step DMD: distill the multi-step diffusion sampler into a few-step generator conditioned on actions/camera; this yields the action/camera-controllable world model
Bidirectional autoregressive rollout at inference: each rollout chunk is generated bidirectionally, which keeps fidelity high and lets the model correct errors instead of propagating them, and chunks are stitched autoregressively for long horizons. The aim is the self-correcting quality of Yume-1.5 / Matrix-Game-3.0 combined with interactive control.
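The chunked rollout loop can be sketched as follows. This is a minimal illustration of the control flow only, assuming a hypothetical `generate_chunk(history, chunk_actions)` callable that stands in for the distilled few-step generator; the toy generator here just accumulates actions and does not model bidirectional denoising within a chunk. `chunk_len` and `max_history` are illustrative parameters, not values from the paper.

```python
from typing import Callable, List, Sequence
import numpy as np

Frame = np.ndarray
ChunkFn = Callable[[List[Frame], Sequence[float]], List[Frame]]

def rollout(generate_chunk: ChunkFn, first_frame: Frame,
            actions: Sequence[float], chunk_len: int = 4,
            max_history: int = 8) -> List[Frame]:
    """Generate frames chunk by chunk, conditioning each chunk on recent history."""
    frames: List[Frame] = [first_frame]
    for start in range(0, len(actions), chunk_len):
        chunk_actions = actions[start:start + chunk_len]
        history = frames[-max_history:]          # bounded context window
        chunk = generate_chunk(history, chunk_actions)
        if len(chunk) != len(chunk_actions):
            raise ValueError("generator must return one frame per action")
        frames.extend(chunk)                     # autoregressive stitching
    return frames

def toy_generator(history: List[Frame], chunk_actions: Sequence[float]) -> List[Frame]:
    """Hypothetical stand-in: each new frame is the last frame shifted by the actions so far."""
    base = history[-1]
    offsets = np.cumsum(np.asarray(chunk_actions, dtype=float))
    return [base + offset for offset in offsets]

video = rollout(toy_generator, np.zeros(2), [1.0] * 10, chunk_len=4)
print(len(video), video[-1])
```

In a real system the whole chunk would be denoised jointly, so every frame in the chunk can attend to every other; only the hand-off between chunks is autoregressive.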
Stabilizers:
- Adversarial (GAN) loss against the multi-step teacher
- Mass-covering forward-KL term to counter DMD's tendency toward mode collapse
- Plug-in history compression (FramePack / PackForcing) to bound KV-cache growth
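To see why history compression bounds context growth, consider a back-of-the-envelope token budget. The sketch below assumes a FramePack-style schedule in which a frame's token count is divided by a fixed rate for each step of age, so the total forms a geometric series; the rate of 2 and the 1536 tokens per frame are illustrative assumptions, and the actual BiWM and PackForcing modules may work differently.

```python
def uncompressed_tokens(num_frames: int, tokens_per_frame: int) -> int:
    """Context size when every history frame is kept at full resolution."""
    return num_frames * tokens_per_frame

def compressed_tokens(num_frames: int, tokens_per_frame: int, rate: int = 2) -> int:
    """Context size when a frame of age k keeps tokens_per_frame // rate**k tokens.

    Age 0 is the most recent frame. With rate 2 the sum stays below
    2 * tokens_per_frame regardless of how long the history is.
    """
    return sum(tokens_per_frame // rate ** age for age in range(num_frames))

TOKENS_PER_FRAME = 1536  # hypothetical value, chosen only for the arithmetic

for n in (1, 4, 16, 64):
    full = uncompressed_tokens(n, TOKENS_PER_FRAME)
    packed = compressed_tokens(n, TOKENS_PER_FRAME)
    print(f"{n:3d} history frames: full={full:6d} tokens, compressed={packed:5d} tokens")
```

The uncompressed context grows linearly with rollout length, while the compressed one saturates. Very old frames round down to zero tokens under this integer schedule; a practical scheme would need to decide whether to drop or merge them.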
Backbones covered: Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, LTX-2.3-22B (all open video diffusion models, spanning roughly 1B to 22B parameters).
Key Results
- Converges in a few hundred steps on 8×H200 GPUs, which the authors present as much cheaper than minWM's four-stage pipeline
- Real-world camera control in settings where minWM loses controllability (per the authors)
- Long-horizon rollout stability via self-correcting bidirectional chunks + history compression
- Open-source release for resource-constrained research and high-fidelity environment simulation
(The abstract reports no quantitative metrics; benchmarks are presumably in the paper body.)
Limitations and Future Work
- The abstract gives no quantitative benchmark numbers; it emphasizes the pipeline and qualitative stability
- The 4-bit NVFP4 path requires compatible hardware: native NVFP4 tensor-core support is a Blackwell-class NVIDIA feature, and Hopper GPUs such as the H200 have no FP4 tensor cores
- DMD mode-seeking is mitigated but not eliminated; very long rollouts may still drift
- Despite the "resource-constrained research" framing, the largest backbone (22B) likely still needs a multi-GPU setup
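A rough weight-memory estimate makes the last point concrete. The sketch below counts weight storage only (no activations, optimizer state, or layers kept in higher precision) and takes parameter counts from the backbone names. The NVFP4 figure assumes the commonly described layout of 4-bit values with one 8-bit scale per 16-element block, about 4.5 bits per weight; that layout is an assumption about the format, not something stated in the abstract.

```python
def weight_gigabytes(num_params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

# Parameter counts read off the model names (approximate).
BACKBONES = {
    "Wan2.1-1.3B": 1.3e9,
    "Wan2.2-5B": 5e9,
    "HunyuanVideo-1.5-8B": 8e9,
    "LTX-2.3-22B": 22e9,
}

BF16_BITS = 16
# Assumed NVFP4 layout: 4-bit value plus one 8-bit scale shared by 16 values.
NVFP4_BITS = 4 + 8 / 16

for name, params in BACKBONES.items():
    bf16 = weight_gigabytes(params, BF16_BITS)
    fp4 = weight_gigabytes(params, NVFP4_BITS)
    print(f"{name:22s} bf16={bf16:6.2f} GB  nvfp4={fp4:6.2f} GB")
```

Under these assumptions the 22B backbone needs about 44 GB for 16-bit weights and about 12.4 GB in the 4-bit layout. Training adds gradients, optimizer state, and activations on top, which is why multi-GPU setups remain likely for the largest model.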
Relevance to Patrick's Research
BiWM is a direct counterpoint to minWM and a practical open-source path to the Yume-1.5 / Matrix-Game-3.0 class of interactive world models. The two-stage recipe (control fine-tuning + DMD) is the kind of operational simplification that makes world-model research reproducible outside well-funded labs. The pluggable FramePack / PackForcing compression is directly relevant to long-horizon planning, where cache growth is the bottleneck. For Patrick's tracking, BiWM is the likely new open-source baseline to beat on camera-controllable interactive rollouts, pending the quantitative results in the paper body.