Abstract

Streaming video generation models typically rely on temporal-centric memory, which organizes historical context as raw frames or chunk segments. This leads to identity drift and semantic inconsistency when entities exit the frame. SlotMemory shifts the memory abstraction from "when" an event occurred to "what" is being represented by decomposing the transformer's key-value manifold into discrete semantic slots. Evaluated on 60-second interactive narratives using Wan2.1-T2V-1.3B, SlotMemory achieves a quality score of 81.61 and a 22.8% relative improvement in dynamic consistency over the strongest existing streaming baseline.

Key Contributions

- Introduces an object-centric key-value memory mechanism that decomposes temporal memory into discrete semantic slots, which act as routing addresses for entity-level persistence
- Achieves a 22.8% relative improvement in dynamic consistency over the strongest existing streaming baseline for long-form video synthesis
- Demonstrates a quality score of 81.61 on 60-second interactive narratives using the Wan2.1-T2V-1.3B backbone

Method Details

SlotMemory replaces temporal-centric memory with an object-centric abstraction:

1. Semantic Slot Decomposition: The transformer's key-value manifold is decomposed into discrete, reusable semantic slots. Each slot acts as a routing address to index and store high-fidelity key-value tokens for specific entities or visual concepts.

2. Entity-Level Persistence: By storing content by "what" rather than "when," SlotMemory enables entities to maintain identity and semantic consistency even when they exit the frame and re-enter later.

3. Prompt-Aware Retrieval: During interactive prompt transitions, the slot-based memory allows the model to retrieve relevant entity information regardless of temporal proximity, avoiding drift caused by storing raw frame-based histories.

The approach shifts the memory primitive from raw temporal capacity to structured semantic representation.
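The routing idea behind the three steps can be illustrated with a toy sketch. This is not the paper's implementation: the class `SlotMemorySketch`, the cosine-similarity routing rule, the fixed per-slot capacity, and the 2-D vectors are all assumptions chosen for illustration, and the summary does not specify how SlotMemory actually assigns tokens to slots or evicts them.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SlotMemorySketch:
    """Hypothetical object-centric KV memory: slots are addresses, not time buckets."""

    def __init__(self, slot_addresses, capacity_per_slot=4):
        self.slot_addresses = slot_addresses  # one address vector per slot (assumed fixed)
        self.capacity = capacity_per_slot     # assumed per-slot budget
        self.store = defaultdict(list)        # slot id -> [(step, key, value), ...]

    def route(self, vector):
        """Return the index of the slot whose address best matches the vector."""
        scores = [cosine(vector, address) for address in self.slot_addresses]
        return max(range(len(scores)), key=scores.__getitem__)

    def write(self, step, key, value):
        """Store a key-value pair in the slot its key routes to."""
        slot = self.route(key)
        bucket = self.store[slot]
        bucket.append((step, key, value))
        if len(bucket) > self.capacity:
            bucket.pop(0)  # evict the oldest entry within this slot only
        return slot

    def read(self, query):
        """Retrieve every key-value pair held by the slot the query routes to."""
        slot = self.route(query)
        return slot, [(key, value) for _, key, value in self.store[slot]]


# Entity A appears once, then entity B fills the stream for 49 steps.
memory = SlotMemorySketch(slot_addresses=[[1.0, 0.0], [0.0, 1.0]], capacity_per_slot=2)
memory.write(step=0, key=[0.9, 0.1], value="entity A, early appearance")
for step in range(1, 50):
    memory.write(step=step, key=[0.1, 0.9], value=f"entity B, step {step}")

# A query resembling entity A still finds it, regardless of how long ago it was written.
slot, entries = memory.read([0.95, 0.05])
print(slot, entries)
```

Because eviction happens per slot, the long run of entity B writes only displaces older entity B entries. A purely temporal buffer of the same total size would have discarded entity A long before the final query.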

Key Results

| Metric | Value | Comparison |
|--------|-------|------------|
| Quality Score | 81.61 | State-of-the-art for streaming |
| Dynamic Consistency Improvement | +22.8% relative | vs. strongest existing streaming baseline |
| Backbone | Wan2.1-T2V-1.3B | - |
| Video Length | 60-second interactive narratives | - |
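The 22.8% figure is a relative improvement, which is easy to confuse with an absolute gain in score points. The short sketch below shows the difference. The baseline and candidate scores in it are hypothetical placeholders, since this summary does not report the underlying dynamic consistency values, and it assumes a higher-is-better metric.

```python
def relative_improvement(baseline: float, candidate: float) -> float:
    """Percentage change of candidate over baseline, assuming higher is better."""
    if baseline == 0:
        raise ValueError("relative improvement is undefined for a zero baseline")
    return (candidate - baseline) / baseline * 100.0


# Hypothetical scores for illustration only; they are NOT taken from the paper.
baseline_score = 50.0
candidate_score = 61.4

relative = relative_improvement(baseline_score, candidate_score)
absolute = candidate_score - baseline_score
print(f"{relative:.1f}% relative, {absolute:.1f} points absolute")
```

With these placeholder numbers, a gain of about 11.4 points corresponds to a relative improvement of about 22.8%, so the same headline percentage can reflect very different absolute gains depending on the baseline.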

Limitations and Future Work

The approach may require careful slot initialization when handling a large number of distinct entities. Future work could explore extending SlotMemory to videos longer than 60 seconds, applying object-centric memory to other video generation architectures, or combining slot-based memory with explicit object tracking mechanisms.

Relevance to Patrick's Research

SlotMemory is relevant to world model research broadly, and particularly to video generation world models (Sora/VDM, Genesis, Genie). It addresses a key challenge in long-form video generation, maintaining entity consistency, that is also central to physical world modeling for robotics and game simulation. The object-centric abstraction could benefit Voyager-like agents that need to track entities across long interactions in generated game environments.