Abstract

BiWM is the first full-stack open-source framework for interactive video world models under the bidirectional autoregressive paradigm. Existing causal pipelines (e.g., minWM) need four training stages and still trail bidirectional models in quality because errors accumulate over the rollout, while bidirectional models such as Yume-1.5 and Matrix-Game-3.0 self-correct but lack an open framework. BiWM collapses the pipeline to two training stages (control fine-tuning, then few-step Distribution Matching Distillation, DMD), spans four backbones (Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, LTX-2.3-22B), and adds pluggable history compression plus an optional NVFP4 4-bit path for long rollouts.

Key Contributions

Method Details

Pipeline (two stages from a pretrained video backbone):

  1. Stage 1 — Control fine-tuning: inject camera and action conditioning into the pretrained video diffusion backbone, retaining its bidirectional generative quality
  2. Stage 2 — Few-step DMD: distill the multi-step diffusion sampler into a few-step generator conditioned on actions/camera; this yields the action/camera-controllable world model
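For orientation on Stage 2, the generator update in DMD as it is commonly formulated can be written as below. This is a sketch of the general technique, not a statement of BiWM's exact loss: the conditioning variable c (actions/camera), the weighting w_t (with any noise-schedule factor folded in), and the symbols G_theta, s_real, and s_fake are notation assumed here rather than taken from the source.

```latex
% Generic DMD generator gradient (assumed notation, not quoted from the paper).
% G_\theta : few-step generator;  c : action/camera conditioning
% s_real   : score of the frozen multi-step teacher
% s_fake   : score of an auxiliary model tracking the generator's outputs
\nabla_\theta \mathcal{L}_{\mathrm{DMD}}
  \approx \mathbb{E}_{z,\,t,\,\epsilon}\!\left[
      w_t \,\bigl( s_{\mathrm{fake}}(x_t, t, c) - s_{\mathrm{real}}(x_t, t, c) \bigr)\,
      \frac{\partial G_\theta(z, c)}{\partial \theta}
    \right],
\qquad
x_t = \alpha_t\, G_\theta(z, c) + \sigma_t\, \epsilon
```

Read this way, the update pushes generated samples toward regions the teacher scores as likely and away from regions the generator currently over-produces, which is what lets a few-step sampler stand in for the multi-step one.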

Bidirectional autoregressive rollout at inference: each rollout chunk is generated bidirectionally (high fidelity and self-correcting, which limits error propagation), and chunks are chained autoregressively for long horizons. This combines the strengths of Yume-1.5 / Matrix-Game-3.0 with interactive control.
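The control flow of such a rollout can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not BiWM's API: `rollout`, `toy_generate_chunk`, and `keep_last` are hypothetical names, the "frames" are one-element lists standing in for latents, and the generator is a deterministic placeholder for the few-step bidirectional model. The `compress` argument marks where a pluggable history-compression scheme would attach.

```python
from typing import Callable, List, Sequence, Tuple

Frame = List[float]  # toy stand-in for one latent video frame


def keep_last(history: Sequence[Frame], max_frames: int) -> List[Frame]:
    """Hypothetical compressor: keep only the most recent frames."""
    return list(history[-max_frames:])


def toy_generate_chunk(context: Sequence[Frame], action: float,
                       chunk_len: int) -> List[Frame]:
    """Placeholder for the few-step bidirectional generator (assumption).

    A real model would denoise all chunk_len frames jointly; here each
    frame is a deterministic function of the last context frame and the
    action, so the loop can be run end to end.
    """
    base = context[-1][0] if context else 0.0
    return [[base + action * (i + 1)] for i in range(chunk_len)]


def rollout(
    actions: Sequence[float],
    chunk_len: int = 4,
    max_context: int = 8,
    generate_chunk: Callable[[Sequence[Frame], float, int], List[Frame]] = toy_generate_chunk,
    compress: Callable[[Sequence[Frame], int], List[Frame]] = keep_last,
) -> Tuple[List[Frame], List[Frame]]:
    """Generate one chunk per action and chain the chunks autoregressively."""
    video: List[Frame] = []
    context: List[Frame] = []
    for action in actions:
        # Each chunk is produced as a whole, conditioned on compressed history.
        chunk = generate_chunk(context, action, chunk_len)
        video.extend(chunk)
        # Bound the conditioning history so it does not grow with the horizon.
        context = compress(context + chunk, max_context)
    return video, context


if __name__ == "__main__":
    video, context = rollout([1.0, 0.0, -1.0])
    print(len(video), len(context))
```

Only the outer loop is autoregressive; everything inside `generate_chunk` is free to attend in both directions within the chunk. Swapping `compress` changes how history is bounded without touching the rollout loop.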

Stabilizers:

Backbones covered: Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, LTX-2.3-22B (all leading open video diffusion models).

Key Results

(The abstract reports no quantitative metrics; benchmarks are presumably in the paper body.)

Limitations and Future Work

Relevance to Patrick's Research

BiWM is a direct counterpoint to minWM and a practical open-source path to the Yume-1.5 / Matrix-Game-3.0 class of interactive world models. The two-stage recipe (control fine-tune + DMD) is the kind of operational simplification that makes world-model research reproducible outside well-funded labs. The pluggable FramePack / PackForcing compression is directly relevant to long-horizon planning, where cache growth is the bottleneck. For Patrick's tracking, BiWM is the new open-source baseline to beat on camera-controllable interactive rollouts.