Controllable and physically grounded egocentric video generation is essential for embodied agents to reason about how their own and others' actions manifest and change the world. E$^3$C is a controllable video diffusion framework for egocentric generation that builds structured and compact conditions disentangling persistent scene structure from human-driven dynamics. From context frames, E$^3$C constructs a semi-dense point cloud-based 3D memory and augments each point with appearance descriptors from video-VAE features. Rendering this memory into target viewpoints produces conditioning aligned with the target frames.
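To make the rendering step concrete, here is a minimal sketch of how a point memory with per-point descriptors could be projected into a target view. It is an illustration under stated assumptions, not the paper's renderer: it assumes a pinhole camera, a known world-to-camera pose, and a simple nearest-point z-buffer. The function name `render_point_memory` and all parameter names are hypothetical.

```python
import numpy as np

def render_point_memory(points_world, features, K, R, t, height, width):
    """Splat per-point descriptors into a target view (hypothetical helper).

    points_world: (N, 3) point positions in world coordinates.
    features:     (N, C) per-point appearance descriptors.
    K:            (3, 3) pinhole intrinsics (assumed camera model).
    R, t:         world-to-camera rotation (3, 3) and translation (3,).
    Returns a (height, width, C) feature map and a (height, width) bool mask.
    """
    n_channels = features.shape[1]

    # Transform points into the target camera frame and drop those behind it.
    cam = points_world @ R.T + t
    in_front = cam[:, 2] > 1e-6
    cam, feats = cam[in_front], features[in_front]
    z = cam[:, 2]

    # Perspective projection to integer pixel coordinates.
    uvw = cam @ K.T
    u = np.round(uvw[:, 0] / z).astype(int)
    v = np.round(uvw[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, feats = u[inside], v[inside], z[inside], feats[inside]

    # Z-buffer: for each pixel keep only the nearest point.
    pix = v * width + u
    order = np.lexsort((z, pix))  # primary key: pixel, secondary key: depth
    pix_sorted = pix[order]
    first = np.ones(len(pix_sorted), dtype=bool)
    first[1:] = pix_sorted[1:] != pix_sorted[:-1]
    winners = order[first]

    flat_map = np.zeros((height * width, n_channels), dtype=features.dtype)
    flat_mask = np.zeros(height * width, dtype=bool)
    flat_map[pix[winners]] = feats[winners]
    flat_mask[pix[winners]] = True
    return flat_map.reshape(height, width, n_channels), flat_mask.reshape(height, width)
```

The returned mask marks which pixels received a point. With a semi-dense cloud many pixels stay empty, so a downstream model would need to handle such holes in the rendered conditioning.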
The framework uses a video diffusion model conditioned on:
- 3D Environmental Memory: Constructs a semi-dense point cloud from context frames, augmented with appearance descriptors from video-VAE features, enabling rendering into novel target viewpoints
Experiments were conducted on the Nymeria dataset.
- Improves visual fidelity over strong baselines
The paper does not explicitly discuss limitations. Future work could extend the framework to handle multiple agents interacting simultaneously or to generalize to outdoor environments with less structured lighting.
E$^3$C represents a concrete step toward world models for embodied agents, specifically toward modeling how the visual world changes as a function of the camera wearer's actions and other agents' actions. The 3D point cloud memory is a form of structured world representation, and the separation of scene structure from human dynamics mirrors the modular world model approach. This is directly relevant to JEPA-style predictive world models for embodied AI.