Abstract

Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point-cloud memory constructed in RGB space. This is computationally expensive (repeated rendering and VAE encoding) and lossy (the round trip through pixel space discards learned latent features). The authors introduce latent spatial memory, a persistent 3D cache that stores scene information directly in the diffusion latent space, and build Mirage, a framework that lifts latent tokens into 3D via depth-guided back-projection and queries them through direct latent-space warping. This eliminates both pixel-space reconstruction and repeated encoding and rendering.

Key Contributions

Method Details

Architecture:

  1. Latent token extraction — the diffusion model (backbone unspecified, but works with video diffusion priors) produces latent tokens for each generated frame
  2. Depth-guided back-projection — latent tokens are lifted from 2D frame coordinates into a 3D point cloud in latent space using a monocular depth estimator; no render-to-RGB step
  3. Persistent 3D latent cache — points accumulate across frames as the camera moves, indexed by 3D position
  4. Latent-space novel-view querying — to render a new viewpoint, the cache is queried by 3D position and warped directly into the target frame's latent coordinates; the warped latents condition the next diffusion denoising step
  5. No pixel-space round trip — the framework never decodes to RGB for memory operations, preserving the rich features of the learned representation
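
Steps 2 to 4 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function and class names (`backproject_latents`, `LatentCache`, `warp_to_view`) are hypothetical, and it assumes a pinhole camera with intrinsics `K` scaled to latent resolution, a depth map already resampled to the latent grid, and a simple nearest-point-wins splat for occlusion handling. The summary above does not say how Mirage resolves occlusions or feeds the warped latents into the denoiser, so those parts are left as assumptions or omitted.

```python
import numpy as np


def backproject_latents(latents, depth, K, cam_to_world):
    """Lift an (H, W, C) latent grid to world-space points with features.

    depth is (H, W) at latent resolution, K is a 3x3 pinhole intrinsics
    matrix scaled to latent resolution, cam_to_world is a 4x4 pose.
    """
    H, W, C = latents.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Homogeneous coordinates of each latent cell centre.
    pix = np.stack([u + 0.5, v + 0.5, np.ones((H, W))], axis=-1).reshape(-1, 3)
    # Camera-frame rays with z = 1, scaled by the per-cell depth.
    pts_cam = (pix @ np.linalg.inv(K).T) * depth.reshape(-1, 1)
    pts_h = np.concatenate([pts_cam, np.ones((H * W, 1))], axis=1)
    pts_world = (pts_h @ cam_to_world.T)[:, :3]
    return pts_world, latents.reshape(-1, C)


class LatentCache:
    """Append-only store of 3D points and their latent features (hypothetical)."""

    def __init__(self, channels):
        self.points = np.zeros((0, 3))
        self.feats = np.zeros((0, channels))

    def add(self, points, feats):
        self.points = np.concatenate([self.points, points], axis=0)
        self.feats = np.concatenate([self.feats, feats], axis=0)


def warp_to_view(points, feats, K, world_to_cam, H, W):
    """Project cached points into a target view; the nearest point wins per cell.

    Returns an (H, W, C) latent grid and an (H, W) boolean validity mask.
    """
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    pts_cam = (pts_h @ world_to_cam.T)[:, :3]
    # Drop points behind (or on) the camera plane.
    keep = pts_cam[:, 2] > 1e-6
    pts_cam, feats = pts_cam[keep], feats[keep]
    z = pts_cam[:, 2]
    proj = pts_cam @ K.T
    u = np.floor(proj[:, 0] / z).astype(int)
    v = np.floor(proj[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    flat, z, feats = (v * W + u)[inside], z[inside], feats[inside]
    # Sort by cell index, then by depth, so the first entry per cell is nearest.
    order = np.lexsort((z, flat))
    flat, feats = flat[order], feats[order]
    cells, first = np.unique(flat, return_index=True)
    out = np.zeros((H * W, feats.shape[1]))
    mask = np.zeros(H * W, dtype=bool)
    out[cells] = feats[first]
    mask[cells] = True
    return out.reshape(H, W, -1), mask.reshape(H, W)
```

In this sketch, each generated frame would be passed through `backproject_latents` and appended to the cache, and `warp_to_view` would be called with the next camera pose to produce a conditioning grid plus a mask of cells the cache could not fill. Warping back into the source view reproduces the original latent grid, which is a quick sanity check on the geometry. The append-only cache grows linearly with the number of frames; a real system would likely need deduplication or voxel pooling, which is not covered here.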

The design exploits a key insight: the diffusion latent space already carries a geometric prior from training, so 3D consistency can be maintained inside the latents rather than imposed by an external point-cloud module.

Key Results

Limitations and Future Work

Relevance to Patrick's Research

Mirage directly attacks the central efficiency bottleneck of 3D-consistent video world models: the pixel-space round trip. For anyone tracking the Genie/Sora/WAN class of video world models, the reported 10.57× speedup and 55× memory reduction are the kind of numbers that change deployment economics. The core idea, that the diffusion latent already has a 3D prior baked in and so memory should live there too, is conceptually aligned with the JEPA philosophy of "predict in representation space, never decode." It is worth tracking whether this generalizes to interactive game-environment rollouts, where long-horizon cache growth dominates.