Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling real-time game simulation, virtual scene navigation, and embodied AI training. Scaling to long trajectories is prohibitively expensive, however, because context memory grows with the trajectory and attention cost is quadratic in the number of tokens. Light Interaction is a training-free inference acceleration framework that achieves up to 2.59× speedup on HY-WorldPlay and Matrix-Game-3.0 while maintaining competitive visual quality.
- Introduces adaptive context management that discards retrieved spatial memory during novel exploration and adjusts temporal context based on local latent dynamics
Light Interaction is built on the insight that interaction naturally enables trajectory-dependent adaptive computation. The framework combines three components:
1. Adaptive Context Management: Spatial memory is discarded during novel exploration while temporal context is adjusted according to local latent dynamics. This reduces memory overhead for long trajectories.
2. Denoising Cache Acceleration: When the camera revisits familiar regions, early-step model outputs are reused instead of being recomputed, avoiding repeated denoising steps.
3. 3D Block Sparse Attention: Hardware-software co-designed 3D block sparse attention with fused Triton kernels replaces standard attention, reducing computational complexity for interactive video generation.
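To make the third component more concrete, here is a minimal sketch of a block-level sparsity mask over a 3D latent grid (time, height, width). The source does not describe the actual sparsity pattern or the Triton kernels, so the local-window rule, the function names (`block_coords`, `block_sparse_mask`, `density`), and the grid and block sizes are all assumptions chosen for illustration.

```python
from itertools import product


def block_coords(grid, block):
    """List the (t, h, w) index of every block that tiles a latent grid."""
    T, H, W = grid
    bt, bh, bw = block
    # Ceil-divide so partial blocks at the border are still counted.
    nt, nh, nw = -(-T // bt), -(-H // bh), -(-W // bw)
    return list(product(range(nt), range(nh), range(nw)))


def block_sparse_mask(grid, block, radius):
    """Build mask[q][k], True when query block q may attend to key block k.

    Assumption (not from the source): a block attends only to blocks that are
    within `radius` of it along every axis.
    """
    coords = block_coords(grid, block)
    return [
        [all(abs(a - b) <= r for a, b, r in zip(q, k, radius)) for k in coords]
        for q in coords
    ]


def density(mask):
    """Fraction of block pairs that are kept (1.0 means dense attention)."""
    kept = sum(sum(row) for row in mask)
    return kept / (len(mask) * len(mask[0]))


# Hypothetical sizes: an 8x16x16 latent grid split into 2x4x4 blocks.
mask = block_sparse_mask(grid=(8, 16, 16), block=(2, 4, 4), radius=(1, 1, 1))
print(len(mask), round(density(mask), 3))
```

With these illustrative sizes the mask keeps roughly a quarter of the block pairs. A fused kernel would skip the masked-out blocks entirely rather than computing and discarding them, which is where the savings over dense attention would come from.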
The framework is training-free, meaning it can be applied to existing interactive video world models without any fine-tuning or retraining.
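Because everything happens at inference time, the trajectory-dependent decisions can be pictured as a small controller wrapped around an unmodified model. The sketch below is only an illustration of that idea: `ChunkPlan`, `plan_chunk`, the distance-based revisit test, every threshold, and the direction of the temporal-context rule (more latent change leads to more frames) are assumptions, not details taken from the source.

```python
import math
from dataclasses import dataclass


@dataclass
class ChunkPlan:
    use_spatial_memory: bool  # keep retrieved spatial memory in the context?
    temporal_frames: int      # number of recent frames kept as temporal context
    skip_steps: int           # early denoising steps served from the cache


def plan_chunk(pose, visited, latent_delta, *, revisit_radius=1.0,
               min_frames=4, max_frames=16, delta_scale=1.0, cached_steps=2):
    """Choose context and cache settings for one chunk (hypothetical heuristics)."""
    # Assumption: a pose is a revisit if it lies near any previously seen pose.
    revisiting = any(math.dist(pose, p) <= revisit_radius for p in visited)
    # Assumption: larger latent change maps to more temporal context, clamped.
    frac = min(max(latent_delta / delta_scale, 0.0), 1.0)
    frames = min_frames + round(frac * (max_frames - min_frames))
    return ChunkPlan(
        use_spatial_memory=revisiting,               # dropped on novel exploration
        temporal_frames=frames,
        skip_steps=cached_steps if revisiting else 0,  # cache reuse only on revisits
    )


visited = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
novel = plan_chunk((20.0, 0.0, 0.0), visited, latent_delta=0.9)
revisit = plan_chunk((0.2, 0.1, 0.0), visited, latent_delta=0.1)
print(novel)
print(revisit)
```

In this sketch the novel pose drops spatial memory and runs every denoising step, while the revisited pose keeps spatial memory and skips the cached early steps. A real system would presumably derive both signals from the model's own memory retrieval and latents rather than from raw camera positions.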
Future work could explore applying the framework to longer-horizon interactive tasks, other world model architectures, or scenarios with more complex camera movements. Because the framework is training-free, it may not fully exploit domain-specific patterns that targeted training could capture for further gains.
Directly relevant to NVIDIA Voyager (LLM-driven Minecraft agent) and Uni Virtual World Model (CVPR) research. Interactive video world models are key for embodied AI training, virtual scene navigation, and game simulation. Because Light Interaction is training-free, it is attractive for accelerating existing world models without retraining them. This could benefit Patrick's tracking of physical world modeling for robotics and game-playing agents.