Latent Spatial Memory for Video World Models (Mirage)

Abstract

Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point-cloud memory constructed in RGB space, which is computationally expensive (repeated rendering + VAE encoding) and lossy (round-trip through pixel space discards learned latent features). The authors introduce latent spatial memory — a persistent 3D cache that stores scene information directly in the diffusion latent space — and build Mirage, a framework that lifts latent tokens into 3D via depth-guided back-projection and queries them through direct latent-space warping, eliminating both pixel-space reconstruction and repeated encoding/rendering.

Key Contributions

Latent spatial memory — a persistent 3D cache living entirely in the diffusion latent space, not in pixel/RGB space
Mirage framework — depth-guided back-projection of latent tokens + latent-space warping for novel-view synthesis, with no VAE re-encoding
State-of-the-art on WorldScore while delivering 10.57× faster end-to-end video generation and 55× smaller memory footprint than explicit 3D baselines
Strong reconstruction on RealEstate10K by leveraging the diffusion model's own geometric prior

Method Details

Architecture:

Latent token extraction: a video diffusion model (the specific backbone is not named here) produces latent tokens for each generated frame
Depth-guided back-projection: latent tokens are lifted from 2D frame coordinates into a 3D point cloud in latent space using a monocular depth estimator; no render-to-RGB step
Persistent 3D latent cache: points accumulate across frames as the camera moves, indexed by 3D position
Latent-space novel-view querying: to render a new viewpoint, the cache is queried by 3D position and warped directly into the target frame's latent coordinates; the warped latents condition the next diffusion denoising step
No pixel-space round trip: the framework never decodes to RGB for memory operations, preserving the rich features of the learned representation
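
The back-projection and querying steps above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (back_project, warp_to_view), the pinhole intrinsics K scaled to latent resolution, the per-token depth map, and the nearest-point z-buffer splat are all assumptions made for illustration. The summary does not say how Mirage resolves occlusions or fills holes.

```python
import numpy as np

def back_project(latents, depth, K, cam_to_world):
    """Lift an (H, W, C) latent grid to world-space points (hypothetical helper).

    depth: (H, W) depth per latent token (assumed already at latent resolution).
    K: 3x3 pinhole intrinsics scaled to latent resolution (assumption).
    cam_to_world: 4x4 camera pose.
    Returns (N, 3) points and (N, C) latent features.
    """
    H, W, C = latents.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Token centres in homogeneous pixel coordinates.
    pix = np.stack([u + 0.5, v + 0.5, np.ones((H, W))], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T            # camera-space rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)      # scale each ray by its depth
    pts_world = pts_cam @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]
    return pts_world, latents.reshape(-1, C)

def warp_to_view(points, feats, K, world_to_cam, H, W):
    """Splat cached latent points into a target view; nearest point wins."""
    C = feats.shape[1]
    pts_cam = points @ world_to_cam[:3, :3].T + world_to_cam[:3, 3]
    front = pts_cam[:, 2] > 1e-6               # drop points behind the camera
    pts_cam, f = pts_cam[front], feats[front]
    proj = pts_cam @ K.T
    u = np.floor(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.floor(proj[:, 1] / proj[:, 2]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    flat, z, f = (v * W + u)[inside], pts_cam[inside, 2], f[inside]
    # Sort nearest-first, then keep the first hit per target cell (z-buffer).
    order = np.argsort(z, kind="stable")
    cells, first = np.unique(flat[order], return_index=True)
    out = np.zeros((H * W, C), dtype=feats.dtype)
    mask = np.zeros(H * W, dtype=bool)
    out[cells] = f[order][first]
    mask[cells] = True
    return out.reshape(H, W, C), mask.reshape(H, W)

# Toy round trip: querying the cache from the source pose recovers the latents.
rng = np.random.default_rng(0)
H, W, C = 4, 6, 8
latents = rng.normal(size=(H, W, C))
depth = rng.uniform(1.0, 3.0, size=(H, W))
K = np.array([[5.0, 0.0, W / 2], [0.0, 5.0, H / 2], [0.0, 0.0, 1.0]])
pose = np.eye(4)
pts, feats = back_project(latents, depth, K, pose)
warped, mask = warp_to_view(pts, feats, K, np.linalg.inv(pose), H, W)
```

For a new camera pose, the mask marks which target cells received a cached latent. In a pipeline like the one described, the uncovered cells would presumably be left for the diffusion model to fill in during denoising.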

The design exploits a key insight: the diffusion latent space already carries a geometric prior from training, so 3D consistency can be maintained inside the latents rather than imposed by an external point-cloud module.

Key Results

10.57× faster end-to-end video generation vs. explicit 3D memory baselines
55× reduction in memory footprint for the spatial memory module
State-of-the-art on WorldScore (composite world-model generation benchmark)
Strong reconstruction on RealEstate10K — competitive with explicit 3D methods while staying in latent space

Limitations and Future Work

Depth estimator quality caps back-projection accuracy; errors compound over long rollouts
Latent-space warping is bounded by the diffusion backbone's geometric prior — non-rigid scenes and large camera excursions remain hard
The 55× memory reduction is relative to explicit 3D baselines; absolute memory still grows with trajectory length, so cache compression / eviction policies are a clear next step
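
One simple way to bound cache growth is voxel-grid merging, sketched below. This is a generic point-cloud technique swapped in for illustration, not something the paper is stated to use; voxel_compress and voxel_size are hypothetical names, and averaging latent features within a voxel is an assumption that may not preserve what the diffusion model needs.

```python
import numpy as np

def voxel_compress(points, feats, voxel_size):
    """Merge cached points that fall in the same voxel (hypothetical helper).

    points: (N, 3) world-space positions; feats: (N, C) latent features.
    Returns one averaged position and feature vector per occupied voxel.
    """
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, inverse, counts = np.unique(
        keys, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)  # guard against shape differences across NumPy versions
    n = counts.shape[0]
    pts_sum = np.zeros((n, 3))
    feat_sum = np.zeros((n, feats.shape[1]))
    np.add.at(pts_sum, inverse, points)   # unbuffered scatter-add per voxel
    np.add.at(feat_sum, inverse, feats)
    return pts_sum / counts[:, None], feat_sum / counts[:, None]

# Three points, two of which share the unit voxel at the origin.
points = np.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [1.5, 0.0, 0.0]])
feats = np.array([[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]])
merged_pts, merged_feats = voxel_compress(points, feats, voxel_size=1.0)
```

With this scheme the cache size is bounded by the number of occupied voxels rather than by trajectory length, at the cost of spatial resolution set by voxel_size.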

Relevance to Patrick's Research

Mirage directly attacks the central efficiency bottleneck of 3D-consistent video world models: the pixel-space round trip. For anyone tracking the Genie/Sora/WAN class of video world models, the 10.57× speedup and 55× memory reduction are the kind of numbers that change deployment economics. The core idea (the diffusion latent already has a 3D prior baked in, so memory should live there too) is conceptually aligned with the JEPA philosophy of "predict in representation space, never decode." It is worth tracking whether this generalizes to interactive game-environment rollouts, where long-horizon cache growth dominates.