Abstract

Interactive video world models generate video chunk-by-chunk in response to user-controlled camera movements, enabling real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long trajectories is prohibitively expensive because context memory grows with trajectory length and attention cost is quadratic in context size. Light Interaction is a training-free inference acceleration framework that achieves up to 2.59× speedup on HY-WorldPlay and Matrix-Game-3.0 while maintaining competitive visual quality.

Key Contributions

- Introduces adaptive context management that discards retrieved spatial memory during novel exploration and adjusts temporal context based on local latent dynamics
- Proposes denoising cache acceleration that reuses early-step model outputs when the camera revisits familiar regions
- Implements hardware-software co-designed 3D block sparse attention with fused Triton kernels for efficient computation

Method Details

Light Interaction is built on the insight that interaction naturally enables trajectory-dependent adaptive computation. The framework combines three components:

1. Adaptive Context Management: Spatial memory is discarded during novel exploration while temporal context is adjusted according to local latent dynamics. This reduces memory overhead for long trajectories.

2. Denoising Cache Acceleration: When the camera revisits familiar regions, early-step model outputs are reused instead of being recomputed, avoiding repeated denoising steps.

3. 3D Block Sparse Attention: Hardware-software co-designed 3D block sparse attention with fused Triton kernels replaces standard attention, reducing computational complexity for interactive video generation.
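To make the third component more concrete, here is a minimal plain-Python reference sketch of block sparse attention over a 3D (time, height, width) latent grid. It is not the fused Triton kernel the summary refers to. The block size, the rule "keep key blocks within `radius` blocks on every axis", and all function names are assumptions chosen for illustration; the source does not specify the actual sparsity pattern.

```python
import itertools
import math
import random


def token_blocks(grid, block):
    """Group token indices of a (T, H, W) latent grid by their 3D block id."""
    T, H, W = grid
    bt, bh, bw = block
    groups = {}
    coords = itertools.product(range(T), range(H), range(W))
    for idx, (t, h, w) in enumerate(coords):
        groups.setdefault((t // bt, h // bh, w // bw), []).append(idx)
    return groups


def neighbors(block_id, all_ids, radius):
    """Assumed sparsity rule: keep key blocks within `radius` on every axis."""
    return [b for b in all_ids
            if all(abs(x - y) <= radius for x, y in zip(block_id, b))]


def block_sparse_attention(q, k, v, groups, radius=1):
    """Softmax attention where each query block visits only its kept key blocks."""
    dim = len(q[0])
    out = [None] * len(q)
    visited_pairs = 0  # query-key pairs actually computed
    for qb, q_idx in groups.items():
        key_idx = [j for kb in neighbors(qb, groups, radius) for j in groups[kb]]
        visited_pairs += len(q_idx) * len(key_idx)
        for i in q_idx:
            scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(dim)
                      for j in key_idx]
            top = max(scores)  # subtract the max for numerical stability
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            out[i] = [sum(wt * v[j][c] for wt, j in zip(weights, key_idx)) / total
                      for c in range(dim)]
    return out, visited_pairs


grid, block, dim = (8, 4, 4), (2, 2, 2), 8
groups = token_blocks(grid, block)
n = grid[0] * grid[1] * grid[2]
random.seed(0)
q, k, v = ([[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
           for _ in range(3))
out, visited = block_sparse_attention(q, k, v, groups, radius=1)
kept_fraction = visited / (n * n)
print(f"tokens={n}, fraction of query-key pairs computed={kept_fraction:.3f}")
```

The saving comes from never touching the skipped blocks. With a large enough `radius` the function falls back to full attention, which makes it easy to compare a sparse result against the dense baseline.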

The framework is training-free, meaning it can be applied to existing interactive video world models without any fine-tuning or retraining.
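Because the framework works purely at inference time, the first two components can be pictured as control logic wrapped around a frozen model. The sketch below is one hypothetical way to express that logic. The familiarity test (distance from the current camera position to previously visited positions), every threshold and window size, the direction of the temporal-window rule, and names such as `InteractionPlanner` and `generate_chunk` are assumptions for illustration; the source does not give the actual criteria.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ChunkPlan:
    use_spatial_memory: bool            # retrieve stored views of this region?
    temporal_window: int                # number of recent chunks kept as context
    cached_early_output: object = None  # reused early-step output, if any


@dataclass
class InteractionPlanner:
    # All thresholds and window sizes are hypothetical illustration values.
    revisit_radius: float = 1.0
    dynamics_threshold: float = 0.5
    short_window: int = 2
    long_window: int = 6
    visited: list = field(default_factory=list)  # (position, early_output) pairs

    def plan(self, position, latent_change):
        """Decide how much context and computation the next chunk gets."""
        nearest = min(self.visited,
                      key=lambda item: math.dist(item[0], position),
                      default=None)
        familiar = (nearest is not None
                    and math.dist(nearest[0], position) <= self.revisit_radius)
        # Assumption: slowly changing latents need less temporal context.
        window = (self.long_window if latent_change > self.dynamics_threshold
                  else self.short_window)
        return ChunkPlan(
            use_spatial_memory=familiar,  # dropped during novel exploration
            temporal_window=window,
            cached_early_output=nearest[1] if familiar else None,
        )

    def record(self, position, early_output):
        """Store the early-step output so a later revisit can reuse it."""
        self.visited.append((tuple(position), early_output))


def generate_chunk(planner, position, latent_change, denoise_early, denoise_late):
    """Run one chunk; the two denoise callables stand in for a frozen model."""
    plan = planner.plan(position, latent_change)
    if plan.cached_early_output is not None:
        early = plan.cached_early_output       # cache hit: skip early steps
    else:
        early = denoise_early(position, plan)  # cache miss: compute them
    planner.record(position, early)
    return denoise_late(early, plan), plan


calls = {"early": 0}


def fake_early(position, plan):
    calls["early"] += 1
    return f"early@{position}"


def fake_late(early, plan):
    return f"frame({early}, window={plan.temporal_window})"


planner = InteractionPlanner()
# (camera position, latent change): explore, explore, then return near the start.
path = [((0.0, 0.0), 0.9), ((5.0, 0.0), 0.2), ((0.2, 0.1), 0.1)]
for position, change in path:
    frame, plan = generate_chunk(planner, position, change, fake_early, fake_late)
    print(position, plan.use_spatial_memory, plan.temporal_window, frame)
```

In this toy run the third chunk lands close to the first position, so the early denoising callable is invoked only twice across three chunks. Reusing an output computed for a nearby rather than identical pose is an approximation, which is why the reuse is limited to early steps in this sketch.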

Key Results

| Benchmark | Speedup | Quality Maintained |
|-----------|---------|--------------------|
| HY-WorldPlay | up to 2.59× | Yes |
| Matrix-Game-3.0 | up to 2.59× | Yes |
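To read the headline number: a speedup factor divides latency by that factor. The short calculation below works this through for the reported upper bound of 2.59×. The 100 ms baseline is a made-up figure used only for illustration, not a measurement from the paper.

```python
def accelerated_latency(baseline_ms: float, speedup: float) -> float:
    """Latency after acceleration, given a baseline latency and a speedup factor."""
    return baseline_ms / speedup


speedup = 2.59       # upper bound reported in the source
baseline_ms = 100.0  # hypothetical per-chunk latency, for illustration only
new_ms = accelerated_latency(baseline_ms, speedup)
saved_fraction = 1.0 - 1.0 / speedup
print(f"{baseline_ms:.0f} ms -> {new_ms:.1f} ms ({saved_fraction:.1%} less time)")
```

At the upper bound, generation time drops to roughly 39% of the baseline, a saving of about 61%.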

Limitations and Future Work

Future work could explore applying the framework to longer-horizon interactive tasks, other world model architectures, or scenarios with more complex camera movements. The training-free nature means it may not fully exploit domain-specific patterns that could yield further gains with targeted training.

Relevance to Patrick's Research

This work is directly relevant to NVIDIA Voyager (an LLM-powered Minecraft agent) and Uni Virtual World Model (CVPR) research. Interactive video world models are key for embodied AI training, virtual scene navigation, and game simulation. Because Light Interaction is training-free, it is attractive for accelerating existing world models without architectural changes. This could benefit Patrick's tracking of physical world modeling for robotics and game-playing agents.