BiWM: Advancing Open-Source Interactive Video World Models with Bidirectional Autoregression

Abstract

BiWM is the first full-stack open-source framework for interactive video world models under the bidirectional autoregressive paradigm. Existing causal pipelines (e.g., minWM) need four training stages and still trail bidirectional models in quality due to error accumulation, while bidirectional-only models like Yume-1.5 and Matrix-Game-3.0 self-correct but lack open frameworks. BiWM collapses the pipeline to two training stages (control fine-tuning + few-step DMD distillation), spans four backbones (Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, LTX-2.3-22B), and adds pluggable history compression plus an optional NVFP4 4-bit path for long rollouts.

Key Contributions

First open-source bidirectional-autoregressive framework for interactive video world models — fills the gap left by minWM (causal-only) and closed-source systems like Yume-1.5 / Matrix-Game-3.0
Two-stage training recipe — control fine-tuning + few-step Distribution Matching Distillation (DMD), converging in "a few hundred steps on 8×H200 GPUs"
Backbone-agnostic — single recipe ports to Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, and LTX-2.3-22B
Pluggable history compression — supports FramePack-style and PackForcing-style modules for long rollouts
Optional NVFP4 4-bit training/inference — quantization path for memory-constrained deployment
Mode-covering regularizers — GAN loss + mass-covering forward-KL added to counter DMD's mode-seeking tendency and preserve scene dynamics

Method Details

Pipeline (two stages from a pretrained video backbone):

Stage 1 — Control fine-tuning: inject camera and action conditioning into the pretrained video diffusion backbone, retaining its bidirectional generative quality
Stage 2 — Few-step DMD: distill the multi-step diffusion sampler into a few-step generator conditioned on actions/camera; this yields the action/camera-controllable world model
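
For reference, the objective that DMD-style distillation is usually described as minimizing can be written as below. This is the generic reverse-KL form, not necessarily the exact loss BiWM uses; the symbols (generator G_\theta, noise z, control signal c, score functions s) are notation assumed here for illustration.

```latex
% Reverse KL between the few-step generator's distribution p_\theta
% and the multi-step teacher's distribution p_{\mathrm{real}},
% both conditioned on the control signal c (camera / action).
\mathcal{L}_{\mathrm{DMD}}(\theta)
  = D_{\mathrm{KL}}\big(p_\theta(\cdot \mid c) \,\|\, p_{\mathrm{real}}(\cdot \mid c)\big)
  = \mathbb{E}_{z}\big[\log p_\theta(x \mid c) - \log p_{\mathrm{real}}(x \mid c)\big],
  \qquad x = G_\theta(z, c)

% Its gradient depends only on the difference of the two scores,
% with s(x) = \nabla_x \log p(x \mid c).
\nabla_\theta \mathcal{L}_{\mathrm{DMD}}
  = \mathbb{E}_{z}\Big[\big(s_\theta(x) - s_{\mathrm{real}}(x)\big)\,
    \frac{\partial G_\theta(z, c)}{\partial \theta}\Big]
```

In practice both scores are estimated by diffusion models evaluated on noised samples, with a timestep-dependent weighting that is omitted here. Because the expectation is taken over the generator's own samples, this objective is mode-seeking.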

Bidirectional autoregressive rollout at inference: each rollout chunk is generated bidirectionally, which gives high fidelity and lets the model correct errors instead of propagating them. Chunks are then stitched autoregressively to reach long horizons, combining the strengths of Yume-1.5 / Matrix-Game-3.0 with interactive control.
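
The chunk-wise rollout can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not BiWM's implementation: `rollout`, `compress_history`, `toy_generator`, and the `ChunkGenerator` signature are hypothetical names, frames and actions are plain floats, and history compression is reduced to keeping the most recent frames.

```python
from typing import Callable, List, Sequence

Frame = float   # toy stand-in for a video frame or latent
Action = float  # toy stand-in for a camera/action control signal

# Hypothetical signature: (context frames, actions for this chunk) -> new frames.
ChunkGenerator = Callable[[Sequence[Frame], Sequence[Action]], List[Frame]]


def compress_history(history: Sequence[Frame], max_context: int) -> List[Frame]:
    """Placeholder compression: keep only the most recent frames.

    FramePack- or PackForcing-style modules are more elaborate; truncation is
    used here only so that the context size stays bounded.
    """
    return list(history[-max_context:])


def rollout(
    generate_chunk: ChunkGenerator,
    first_frame: Frame,
    actions: Sequence[Action],
    chunk_len: int,
    max_context: int,
) -> List[Frame]:
    """Generate a long sequence chunk by chunk, autoregressive across chunks."""
    frames: List[Frame] = [first_frame]
    for start in range(0, len(actions), chunk_len):
        chunk_actions = actions[start:start + chunk_len]
        context = compress_history(frames, max_context)
        # In a real model, all frames of the chunk would be denoised jointly
        # (bidirectionally); only the chunk as a whole depends on the past.
        frames.extend(generate_chunk(context, chunk_actions))
    return frames


def toy_generator(context: Sequence[Frame], chunk_actions: Sequence[Action]) -> List[Frame]:
    """Toy dynamics: each new frame is the previous frame plus its action."""
    last = context[-1]
    out: List[Frame] = []
    for action in chunk_actions:
        last = last + action
        out.append(last)
    return out


frames = rollout(toy_generator, 0.0, [1.0] * 10, chunk_len=4, max_context=3)
```

Here ten actions are consumed in chunks of 4, 4, and 2, and each chunk sees at most three frames of context, so the conditioning cost stays constant no matter how long the rollout runs.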

Stabilizers:

Adversarial (GAN) loss against the multi-step teacher
Mass-covering forward-KL term to counter DMD's mode-seeking tendency
Plug-in history compression (FramePack / PackForcing) to bound KV-cache growth
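
The difference between mode-seeking and mass-covering objectives can be seen in a toy discrete example. This is an illustration of the general property of reverse versus forward KL, not BiWM's loss; the distributions `target`, `one_mode`, and `covering` are made-up numbers chosen for the demonstration.

```python
import math
from typing import Dict, Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; q must be strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Bimodal "teacher" distribution over three outcomes (assumed toy values).
target = [0.49, 0.02, 0.49]

# Two candidate "student" distributions.
candidates: Dict[str, Sequence[float]] = {
    "one_mode": [0.96, 0.02, 0.02],   # commits to a single mode
    "covering": [1 / 3, 1 / 3, 1 / 3],  # spreads mass over everything
}

# Reverse KL: expectation under the student (the DMD-style direction).
reverse = {name: kl(q, target) for name, q in candidates.items()}

# Forward KL: expectation under the teacher (the mass-covering direction).
forward = {name: kl(target, q) for name, q in candidates.items()}

best_reverse = min(reverse, key=reverse.get)
best_forward = min(forward, key=forward.get)
```

With these numbers, reverse KL prefers the student that collapses onto one mode, while forward KL prefers the one that covers both. That is the intuition behind adding a forward-KL term alongside a reverse-KL distillation objective.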

Backbones covered: Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, LTX-2.3-22B (all open video diffusion models, spanning 1.3B to 22B parameters).

Key Results

Converges in a few hundred steps on 8×H200 GPUs, using two training stages where minWM's pipeline needs four
Real-world camera control in settings where minWM loses controllability (per the authors)
Long-horizon rollout stability via self-correcting bidirectional chunks + history compression
Open-source release for resource-constrained research and high-fidelity environment simulation

(The abstract reports no quantitative metrics; benchmark numbers are presumably in the paper body.)

Limitations and Future Work

Quantitative benchmark numbers are deferred to the paper body; the abstract emphasizes pipeline and qualitative stability
The 4-bit NVFP4 path needs hardware with native FP4 support, which in practice means Blackwell-class NVIDIA GPUs; Hopper GPUs such as the H200 do not accelerate FP4 natively
DMD mode-seeking is mitigated but not eliminated; very long rollouts may still drift
The "resource-constrained research" framing likely applies mainly to the smaller backbones; the largest one (22B) presumably still needs a multi-GPU setup

Relevance to Patrick's Research

BiWM is a direct counterpoint to minWM and a practical open-source path to the Yume-1.5 / Matrix-Game-3.0 class of interactive world models. The two-stage recipe (control fine-tune + DMD) is the kind of operational simplification that makes world-model research reproducible outside well-funded labs. The pluggable FramePack / PackForcing compression is directly relevant to long-horizon planning, where cache growth is the bottleneck. For Patrick's tracking, BiWM is the new open-source baseline to beat on camera-controllable interactive rollouts.