DisCo: World Models with Discrete Camera Motion Control

title: "DisCo: World Models with Discrete Camera Motion Control"

arxiv_id: "2606.07967"

date: "2026-06-06"

tags: [world-model, video-diffusion, controllable-generation, camera-control, benchmark]

---

Abstract

DisCo identifies *action representation entanglement* as the key bottleneck in controllable video world models: when camera motion is encoded as a continuous trajectory, the learned features for distinct motion patterns collapse onto each other, breaking action following. The paper proposes conditioning video generation on a compact set of discrete action primitives and releases DisCoBench for short-horizon, long-horizon, and highly-dynamic exploration evaluation.

Key Contributions

Diagnosis of action representation entanglement: Empirical and analytical evidence that continuous camera representations cause high feature similarity across distinct motion patterns, degrading action controllability
Discrete action primitive conditioning: A compact, learned vocabulary of camera-motion primitives used as the action representation for the video world model, improving action separability
DisCoBench: A new benchmark with short-term, long-horizon, and highly-dynamic exploration scenarios for evaluating controllable video world models
Empirical result: Significantly more reliable action following than continuous-trajectory baselines, with visual quality preserved

Method Details

The architecture is a controllable video diffusion world model with a categorical action-conditioning pathway:

Discrete primitive vocabulary: Camera motions (pan, tilt, dolly, zoom, and their compositions) are clustered into a finite vocabulary of N primitives. Each primitive corresponds to a learned embedding that is fed into the video model alongside the noise/frame conditioning.
Action embedding pathway: At inference, the user (or a higher-level policy) selects a sequence of primitives; these are embedded and injected into the denoising network via cross-attention or additive conditioning, depending on the backbone.
Backbone: Built on top of a latent video diffusion model (DiT-style) operating in a compressed latent space; the contribution is orthogonal to the specific backbone choice.
DisCoBench protocol: Three difficulty tiers: (a) short-term (1 to 3 second rollouts with simple actions), (b) long-horizon (rollouts of 10 seconds or more with action chaining), and (c) highly-dynamic (fast camera moves, large displacements). Metrics include action-following accuracy, visual quality (FVD), and temporal coherence.
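
The conditioning pathway described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the primitive names, the prototype vectors, the nearest-prototype `quantize` rule, the random embedding table (a stand-in for learned embeddings), and the additive `condition` function are all hypothetical. It shows a continuous camera trajectory being mapped to discrete primitive indices, which are then looked up in an embedding table and added to per-step features.

```python
import math
import random

# Hypothetical vocabulary: one prototype (pan, tilt, dolly, zoom) vector per
# primitive. Names and values are illustrative assumptions, not from the paper.
PRIMITIVES = {
    "pan_left": (-1.0, 0.0, 0.0, 0.0),
    "pan_right": (1.0, 0.0, 0.0, 0.0),
    "tilt_up": (0.0, 1.0, 0.0, 0.0),
    "tilt_down": (0.0, -1.0, 0.0, 0.0),
    "dolly_in": (0.0, 0.0, 1.0, 0.0),
    "zoom_in": (0.0, 0.0, 0.0, 1.0),
}
NAMES = list(PRIMITIVES)


def quantize(motion):
    """Return the index of the primitive whose prototype is nearest to `motion`."""
    return min(
        range(len(NAMES)),
        key=lambda i: math.dist(motion, PRIMITIVES[NAMES[i]]),
    )


def make_embedding_table(num_primitives, dim, seed=0):
    """Random stand-in for a learned embedding table (one row per primitive)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_primitives)]


def condition(step_features, action_ids, table):
    """Additive conditioning: add each step's action embedding to its features."""
    return [
        [f + e for f, e in zip(feat, table[a])]
        for feat, a in zip(step_features, action_ids)
    ]


# A toy continuous trajectory, one motion vector per step.
trajectory = [(0.9, 0.1, 0.0, 0.0), (-0.8, 0.0, 0.1, 0.0), (0.0, 0.2, 0.0, 0.7)]
action_ids = [quantize(m) for m in trajectory]
table = make_embedding_table(len(NAMES), dim=8)
features = [[0.0] * 8 for _ in trajectory]  # placeholder per-step features
conditioned = condition(features, action_ids, table)
print([NAMES[i] for i in action_ids])  # ['pan_right', 'pan_left', 'zoom_in']
```

In a real backbone the embeddings would be trained jointly with the denoiser, and cross-attention could replace the additive step; the sketch only fixes the data flow from primitive index to conditioning signal.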

The key insight is that *separability* in action-feature space, not expressivity of the action space, drives action-following reliability in diffusion-based world models.
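
One way to make the separability claim concrete is to measure the mean pairwise cosine similarity between the features of distinct actions: values near 1 indicate entanglement, values near 0 indicate well-separated features. The sketch below is an assumption about how such a diagnostic could look, not the paper's analysis; the function names and both toy feature sets are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(features):
    """Average cosine similarity over all pairs of distinct action features."""
    pairs = list(combinations(features, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy "entangled" features: a large shared component dominates the small
# motion-specific differences, so distinct motions look nearly identical.
entangled = [
    (10.0, 10.0, 0.5, 0.0),
    (10.0, 10.0, 0.0, 0.5),
    (10.0, 10.0, -0.5, 0.0),
]

# Toy "separable" features: one orthogonal direction per primitive.
separable = [
    (1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0),
    (0.0, 0.0, 1.0),
]

print(mean_pairwise_cosine(entangled))  # close to 1: features are hard to tell apart
print(mean_pairwise_cosine(separable))  # 0.0: orthogonal features
```

Both feature sets are hand-constructed to show the two extremes; they are not measurements from any model.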

Key Results

Significantly more reliable action following than continuous-trajectory baselines on DisCoBench across all three difficulty tiers (a qualitative claim; the abstract reports no specific numerical gains)
Visual quality preserved: FVD and other perceptual metrics remain comparable to the continuous-action baseline, indicating the discretization does not trade off generation fidelity
Ablations show that increasing the vocabulary size beyond a modest N yields diminishing returns, suggesting a small discrete set captures the relevant motion manifold
PDF: https://arxiv.org/pdf/2606.07967 (full results tables require reading the paper)
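
For reading the results tables, it helps to have a working definition of action-following accuracy. The sketch below assumes a simple one: the fraction of steps at which the primitive recovered from the generated video matches the commanded primitive, averaged per benchmark tier. The paper's exact definition may differ; the function names are hypothetical, how primitives are recovered from video is left out, and the toy rollouts are made-up inputs, not reported results.

```python
def action_following_accuracy(commanded, recovered):
    """Fraction of steps where the recovered primitive matches the commanded one."""
    if len(commanded) != len(recovered):
        raise ValueError("sequences must have the same length")
    if not commanded:
        raise ValueError("sequences must be non-empty")
    hits = sum(c == r for c, r in zip(commanded, recovered))
    return hits / len(commanded)


def per_tier_accuracy(rollouts):
    """Mean accuracy per tier; `rollouts` maps tier -> list of (commanded, recovered)."""
    return {
        tier: sum(action_following_accuracy(c, r) for c, r in pairs) / len(pairs)
        for tier, pairs in rollouts.items()
    }


# Made-up rollouts for illustration only.
rollouts = {
    "short-term": [
        (["pan_left", "pan_left"], ["pan_left", "pan_left"]),
    ],
    "long-horizon": [
        (
            ["pan_left", "tilt_up", "dolly_in", "zoom_in"],
            ["pan_left", "tilt_up", "dolly_in", "pan_left"],
        ),
    ],
}
print(per_tier_accuracy(rollouts))  # {'short-term': 1.0, 'long-horizon': 0.75}
```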

Limitations and Future Work

The discrete vocabulary is restricted to camera motions; extending to full 6-DoF continuous control, articulated object interactions, or agent embodiment remains future work
DisCoBench focuses on camera-controlled video; benchmarks for action-conditioned video in robotics or autonomous driving are not part of this release
Generalization to out-of-distribution camera trajectories (e.g., novel compound motions) is not extensively evaluated

Relevance to Patrick's Research

DisCo is directly relevant to Patrick's interest in controllable world models: it isolates a concrete architectural choice (continuous vs. discrete action representation) and argues that separability of the action representation, obtained here through discretization, matters more for controllability than expressivity. This is a useful counter-argument to the prevailing trend of using ever-richer continuous control signals in video world models. DisCoBench also gives Patrick a concrete evaluation surface for comparing controllable video world models.