Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players

Abstract

Gamma-World presents a generative multi-agent world model for interactive video simulation that extends beyond single-agent or two-player settings. It introduces Simplex Rotary Agent Encoding, a parameter-free extension of 3D RoPE that represents agents as vertices of a regular simplex in rotary angle space, enabling permutation-symmetric agent identities without learned per-slot embeddings. Sparse Hub Attention reduces cross-agent attention from quadratic to linear complexity. A full-context diffusion teacher is distilled into a causal student with KV caching for real-time 24 FPS action-responsive generation. The model generalizes from two to four players without retraining.

Key Contributions

- First generative multi-agent world model supporting scalable agent counts (beyond two players) with principled permutation symmetry

- Simplex Rotary Agent Encoding: parameter-free agent identity using regular simplex geometry in 3D RoPE space

- Sparse Hub Attention: reduces cross-agent attention from O(n²) to O(n) via learnable hub tokens

- Real-time inference: 24 FPS action-responsive generation via diffusion distillation and KV caching

- Generalizes 2→4 players without additional training

Method Details

Architecture: Gamma-World builds on a video diffusion transformer architecture, extended for multi-agent control.

Simplex Rotary Agent Encoding: Each agent is assigned a unique phase in a rotary embedding space structured as a regular simplex (3 vertices in 3D, 4 in 4D, etc.). This ensures:

- Permutation symmetry: all vertices of a regular simplex are equidistant, so relabeling the agents yields the same set of encodings and no agent slot is privileged

- Distinct phases: each agent has a unique angle

- Parameter-free: no learned embeddings per agent slot

- Scalable: works for any number of agents without architecture changes
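The simplex geometry behind these properties can be checked with a few lines of code. The sketch below is an illustration only, not the paper's construction: it builds n unit vectors that form a regular simplex centered at the origin (standard basis vectors minus their centroid, so the n vertices span an (n-1)-dimensional subspace), then scales them into per-agent angle offsets. The function names and the `scale` parameter are hypothetical, and how the paper maps simplex vertices onto the 3D RoPE frequency channels is not specified in this summary.

```python
import math
from itertools import combinations


def simplex_vertices(n_agents):
    """Return n_agents unit vectors forming a regular simplex centered at the origin.

    Construction (an assumption, not taken from the paper): take the standard
    basis vectors of R^n, subtract their centroid, and normalize.
    """
    if n_agents < 2:
        raise ValueError("a simplex needs at least two vertices")
    n = n_agents
    verts = []
    for i in range(n):
        v = [(1.0 if j == i else 0.0) - 1.0 / n for j in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        verts.append([x / norm for x in v])
    return verts


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]


def agent_angles(n_agents, scale=math.pi / 2):
    """Hypothetical mapping from simplex vertices to rotary angle offsets.

    Each agent gets one angle per coordinate; `scale` is an illustrative choice.
    """
    return [[scale * x for x in v] for v in simplex_vertices(n_agents)]


if __name__ == "__main__":
    for n in (2, 3, 4):
        dists = pairwise_distances(agent_angles(n))
        # Every pair of agents is separated by the same angular distance.
        print(n, [round(d, 6) for d in dists])
```

With unit-norm vertices, every pair is separated by sqrt(2n/(n-1)), so no pair of agents is closer than any other. No per-agent parameter appears anywhere in the construction, which is why adding an agent only means adding a vertex.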

Sparse Hub Attention: Instead of all-to-all cross-agent attention (quadratic in agent count), learnable hub tokens mediate interactions. Each agent attends to the hub tokens and is attended by them, never to another agent directly, which reduces cross-agent attention cost to O(n) in the number of agents.
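A small counting exercise makes the scaling argument concrete. The sketch below works at the level of whole agents rather than individual tokens, and its mask layout is an assumption: agents may attend to themselves and to hubs, hubs may attend to everything, and direct agent-to-agent attention is blocked. The function names and the fixed hub count are hypothetical; the paper's actual hub count and masking rules are not given in this summary.

```python
def hub_attention_mask(n_agents, n_hubs):
    """Boolean mask over [agents..., hubs...]; mask[q][k] is True if q may attend to k.

    Assumed rule: self-attention is allowed, and any pair involving a hub is
    allowed. Direct agent-to-agent attention is not.
    """
    size = n_agents + n_hubs
    mask = []
    for q in range(size):
        row = []
        for k in range(size):
            involves_hub = q >= n_agents or k >= n_agents
            row.append(q == k or involves_hub)
        mask.append(row)
    return mask


def count_agent_hub_links(mask, n_agents):
    """Count directed links where exactly one endpoint is a hub."""
    size = len(mask)
    return sum(
        1
        for q in range(size)
        for k in range(size)
        if mask[q][k] and ((q >= n_agents) != (k >= n_agents))
    )


def dense_cross_agent_links(n_agents):
    """Directed agent-to-agent links under all-to-all attention."""
    return n_agents * (n_agents - 1)


if __name__ == "__main__":
    n_hubs = 2  # illustrative choice
    for n_agents in (2, 4, 8, 16):
        mask = hub_attention_mask(n_agents, n_hubs)
        print(n_agents, dense_cross_agent_links(n_agents),
              count_agent_hub_links(mask, n_agents))
```

With a fixed number of hubs h, the hub-mediated link count is 2·n·h, which grows linearly in n, while the dense count n·(n-1) grows quadratically. For very small n the hub variant can involve more links than dense attention; the saving shows up as the agent count grows.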

Distillation for Real-Time: A full-context diffusion teacher, which processes the entire video at once, is distilled into a causal student that generates temporal blocks sequentially with KV caching, enabling action-responsive generation at 24 FPS.
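The benefit of KV caching for block-by-block generation can be illustrated with a toy attention loop. This is a minimal sketch under stated assumptions, not the paper's student model: it omits the diffusion denoising steps entirely, uses a fixed stand-in `project_kv` instead of learned projections, and assumes tokens within a block see the whole block while blocks see only earlier blocks. All names here are hypothetical. The point is only that caching gives the same outputs as recomputing the full causal context while projecting each token once.

```python
import math


class CallCounter:
    """Counts how many key/value projections were computed."""

    def __init__(self):
        self.calls = 0


def project_kv(token, counter):
    """Stand-in for learned key/value projections (illustrative arithmetic only)."""
    counter.calls += 1
    key = [2.0 * x for x in token]
    value = [x + 1.0 for x in token]
    return key, value


def attend(queries, keys, values):
    """Softmax attention of each query over all supplied keys and values."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        dim = len(values[0])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
        )
    return outputs


def generate_with_cache(blocks, counter):
    """Process blocks in order, projecting each token once and reusing cached K/V."""
    cache_k, cache_v, outputs = [], [], []
    for block in blocks:
        for token in block:
            k, v = project_kv(token, counter)
            cache_k.append(k)
            cache_v.append(v)
        outputs.append(attend(block, cache_k, cache_v))
    return outputs


def recompute_without_cache(blocks, counter):
    """Same computation, but re-projecting the whole history for every block."""
    outputs = []
    for t in range(len(blocks)):
        context = [tok for b in blocks[: t + 1] for tok in b]
        projected = [project_kv(tok, counter) for tok in context]
        keys = [k for k, _ in projected]
        values = [v for _, v in projected]
        outputs.append(attend(blocks[t], keys, values))
    return outputs
```

For B blocks of T tokens each, the cached loop performs B·T projections, while recomputing from scratch performs T·B·(B+1)/2. That gap is what keeps per-block latency bounded as the generated video grows.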

Training: The model is trained in multiplayer virtual environments with ground-truth state information. Agents remain independently controllable and permutation-symmetric.

Key Results

| Metric | Gamma-World | Slot-based Baseline | Dense-attention Baseline |
|--------|-------------|---------------------|--------------------------|
| Video fidelity | Higher | Lower | Lower |
| Action controllability | Higher | Lower | Lower |
| Inter-agent consistency | Higher | Lower | Lower |
| Generalization | 2→4 players, no retrain | N/A | N/A |
| Inference speed | 24 FPS (causal student) | N/A | N/A |

Limitations and Future Work

- Currently demonstrated in multiplayer virtual environments; real-world embodied agents not yet tested

- Performance on heterogeneous agents (different types/abilities) not thoroughly explored

- Long-horizon consistency across very long trajectories (>minutes) remains challenging

- Real-world deployment requires sim-to-real transfer which is not addressed

Relevance to Patrick's Research

Multi-agent world modeling is an important frontier beyond single-agent video prediction. The parameter-free simplex encoding is an elegant solution to the permutation symmetry problem that could inspire other world model architectures. The linear-scaling attention mechanism is critical for practical deployment. This connects to Voyager/NVIDIA world model work and generalizes single-agent world models (like Sora/VDM) to multi-agent settings.

---

*Source: arXiv:2605.28816 | PDF: 2605.28816.pdf*