title: "DisCo: World Models with Discrete Camera Motion Control"

arxiv_id: "2606.07967"

date: "2026-06-06"

tags: [world-model, video-diffusion, controllable-generation, camera-control, benchmark]

---

Abstract

DisCo identifies *action representation entanglement* as the key bottleneck in controllable video world models: when camera motion is encoded as a continuous trajectory, the learned features for distinct motion patterns collapse onto each other, breaking action following. The paper proposes conditioning video generation on a compact set of discrete action primitives and releases DisCoBench for short-horizon, long-horizon, and highly-dynamic exploration evaluation.

Key Contributions

Method Details

The architecture is a controllable video diffusion world model with a categorical action-conditioning pathway:

  1. Discrete primitive vocabulary: Camera motions (pan, tilt, dolly, zoom, and their compositions) are clustered into a finite vocabulary of N primitives. Each primitive corresponds to a learned embedding that is fed into the video model alongside the noise/frame conditioning.
  2. Action embedding pathway: At inference, the user (or a higher-level policy) selects a sequence of primitives; these are embedded and injected into the denoising network via cross-attention or additive conditioning, depending on the backbone.
  3. Backbone: Built on top of a latent video diffusion model (DiT-style) operating in a compressed latent space; the contribution is orthogonal to the specific backbone choice.
  4. DisCoBench protocol: Evaluation is split into three difficulty tiers, (a) short-horizon (1-3 second rollouts with simple actions), (b) long-horizon (10+ seconds with action chaining), and (c) highly-dynamic (fast camera moves, large displacements). Metrics include action-following accuracy, visual quality (FVD), and temporal coherence.
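
To make the categorical action-conditioning pathway in items 1 and 2 concrete, here is a minimal sketch of the additive-conditioning variant in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the primitive names, the vocabulary size, `EMBED_DIM`, the random initialization, and every function name are hypothetical, and a real model would learn the embedding table and apply it inside the denoising network.

```python
import random

# Hypothetical vocabulary: the paper's actual primitives and N are not listed in this note.
PRIMITIVES = [
    "pan_left", "pan_right", "tilt_up", "tilt_down",
    "dolly_in", "dolly_out", "zoom_in", "zoom_out",
]
PRIMITIVE_TO_ID = {name: i for i, name in enumerate(PRIMITIVES)}
EMBED_DIM = 4  # toy width, chosen only for readability


def init_embedding_table(num_primitives, dim, seed=0):
    """One vector per primitive. Randomly initialized here; learned in a real model."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_primitives)]


def embed_actions(action_names, table):
    """Map a sequence of primitive names to their embedding vectors (a table lookup)."""
    return [list(table[PRIMITIVE_TO_ID[name]]) for name in action_names]


def add_action_conditioning(frame_tokens, action_embeddings):
    """Additive conditioning: add each frame's action embedding to all of its tokens.

    frame_tokens: list over frames, each a list of tokens, each a list of floats.
    action_embeddings: one embedding per frame, same width as the tokens.
    """
    if len(frame_tokens) != len(action_embeddings):
        raise ValueError("need exactly one action embedding per frame")
    return [
        [[t + a for t, a in zip(token, emb)] for token in tokens]
        for tokens, emb in zip(frame_tokens, action_embeddings)
    ]


# Usage: three frames, two zero-valued latent tokens per frame.
table = init_embedding_table(len(PRIMITIVES), EMBED_DIM)
actions = ["pan_left", "pan_left", "dolly_in"]
latents = [[[0.0] * EMBED_DIM for _ in range(2)] for _ in actions]
conditioned = add_action_conditioning(latents, embed_actions(actions, table))
```

Because each primitive indexes its own row of the table, two different actions never share an input representation by construction. A continuous trajectory encoder gives no such guarantee.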

The key insight is that *separability* in action-feature space, not expressivity of the action space, drives action-following reliability in diffusion-based world models.
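
One way to make the notion of separability tangible is a simple between-class versus within-class distance ratio over action features. The sketch below is an assumed proxy for illustration only: the note does not say how the paper quantifies entanglement, and `separability_score` and the toy 2-D features are hypothetical.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def separability_score(features_by_action):
    """Mean distance between action centroids divided by mean within-action spread.

    Higher means the features of different actions sit further apart relative
    to their own scatter. Requires at least two actions.
    """
    centroids = {a: centroid(pts) for a, pts in features_by_action.items()}
    within = [
        math.dist(p, centroids[a])
        for a, pts in features_by_action.items()
        for p in pts
    ]
    between = [math.dist(centroids[a], centroids[b]) for a, b in combinations(centroids, 2)]
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return mean_between / (mean_within + 1e-8)  # epsilon guards against zero spread


# Toy features: well-separated actions versus nearly collapsed ones.
separated = {"pan_left": [[0.0, 0.0], [0.0, 1.0]], "dolly_in": [[10.0, 0.0], [10.0, 1.0]]}
entangled = {"pan_left": [[0.0, 0.0], [0.0, 1.0]], "dolly_in": [[0.1, 0.0], [0.1, 1.0]]}

assert separability_score(separated) > separability_score(entangled)
```

In the entangled case the two action clusters overlap almost entirely, so a downstream network has little signal for telling the actions apart. That is the failure mode the note attributes to continuous trajectory encodings.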

Key Results

Limitations and Future Work

Relevance to Patrick's Research

DisCo is directly relevant to Patrick's interest in controllable world models: it isolates a concrete architectural choice (continuous vs. discrete action representation) and argues that separability of action features, obtained here through discretization, is the lever for controllability rather than expressivity of the action space. This is a useful counter-argument to the prevailing trend of using ever-richer continuous control signals in video world models. DisCoBench also gives Patrick a concrete evaluation surface for comparing controllable video world models.