Abstract

Controllable and physically grounded egocentric video generation is essential for embodied agents to reason about how their own and others' actions manifest and change the world. E$^3$C is a controllable video diffusion framework for egocentric generation that builds structured and compact conditions disentangling persistent scene structure from human-driven dynamics. From context frames, E$^3$C constructs a semi-dense point cloud-based 3D memory and augments each point with appearance descriptors from video-VAE features. Rendering this memory into target viewpoints produces conditioning aligned with the target frames.

Key Contributions

- 3D Environmental Memory: Constructs a semi-dense point cloud from context frames, augmented with appearance descriptors from video-VAE features, enabling rendering into novel target viewpoints
- Ego-Exo Human Pose Control: Separates human dynamics into exo control, where observed people are driven by skeleton renderings, and ego control, where the camera wearer is specified by 3D body joints and 6DoF wrist motion
- Ego Motion Encoder: Introduces persistent cross-attention tokens to preserve ego human control when the wearer's body parts are invisible due to self-occlusions
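
To make the 3D memory idea concrete, here is a minimal NumPy sketch of rendering a feature-augmented point cloud into a target viewpoint. It is not the paper's implementation: the pinhole camera model, the nearest-point z-buffer splatting, and the names `render_point_memory`, `points_w`, `feats`, `K`, and `T_cw` are all assumptions made for illustration.

```python
import numpy as np

def render_point_memory(points_w, feats, K, T_cw, height, width):
    """Splat per-point descriptors into a target view, nearest point wins.

    points_w: (N, 3) world-space points   feats: (N, C) descriptors
    K: (3, 3) pinhole intrinsics          T_cw: (4, 4) world-to-camera
    Returns a (height, width, C) feature map and a boolean coverage mask.
    All names and the pinhole model are illustrative assumptions.
    """
    fmap = np.zeros((height, width, feats.shape[1]), dtype=feats.dtype)
    mask = np.zeros((height, width), dtype=bool)

    # Move points into the target camera frame and drop those behind it.
    pts_c = points_w @ T_cw[:3, :3].T + T_cw[:3, 3]
    front = pts_c[:, 2] > 1e-6
    pts_c, f = pts_c[front], feats[front]

    # Pinhole projection to integer pixel coordinates.
    uvw = pts_c @ K.T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, f = u[inside], v[inside], pts_c[inside, 2], f[inside]

    # Z-buffer: sort by pixel, then by depth, and keep the first hit per pixel.
    pix = v * width + u
    order = np.lexsort((z, pix))
    first = np.ones(len(order), dtype=bool)
    first[1:] = pix[order][1:] != pix[order][:-1]
    keep = order[first]

    fmap[v[keep], u[keep]] = f[keep]
    mask[v[keep], u[keep]] = True
    return fmap, mask


# Toy scene: two points on the optical axis, one behind the camera, one off-screen.
points = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 2.0],
                   [0.0, 0.0, -1.0], [10.0, 0.0, 1.0]])
descriptors = np.array([[1.0], [2.0], [3.0], [4.0]])
K = np.array([[2.0, 0.0, 2.0], [0.0, 2.0, 2.0], [0.0, 0.0, 1.0]])
fmap, mask = render_point_memory(points, descriptors, K, np.eye(4), 4, 4)
```

In the toy scene only the nearer of the two on-axis points reaches the image, so the rendered map carries a single descriptor at the principal point. Because a semi-dense cloud leaves most pixels empty, the coverage mask would likely need to accompany the feature map as part of the conditioning.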

Method Details

The framework uses a video diffusion model conditioned on:

1. 3D Memory Rendering: Semi-dense point cloud built from context frames, each point augmented with video-VAE appearance features. Rendering this memory into target viewpoints produces viewpoint-aligned conditioning.
2. Dual Human Control: Exo human control uses skeleton renderings for observed people; ego human control uses 3D body joints and 6DoF wrist motion for the camera wearer.
3. Ego Motion Encoder: Cross-attention token mechanism that maintains ego control signals even when body parts are self-occluded.
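
The ego motion encoder can be pictured as a set of per-frame motion tokens that the video tokens attend to, so the control signal is present whether or not the wearer's body appears in the frame. The sketch below is a single-head cross-attention layer in NumPy under several assumptions that are not taken from the paper: one token per frame, a linear embedding `W_in`, a 6-number wrist pose (3 translation plus 3 rotation parameters), 23 body joints, and the hypothetical helper names `ego_tokens_from_motion` and `cross_attend`.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def ego_tokens_from_motion(joints, wrists, W_in):
    """One token per frame from 3D joints (T, J, 3) and wrist poses (T, 2, 6).

    The flat concatenation and linear embedding are illustrative assumptions.
    """
    n_frames = joints.shape[0]
    motion = np.concatenate(
        [joints.reshape(n_frames, -1), wrists.reshape(n_frames, -1)], axis=1)
    return motion @ W_in  # (T, E)

def cross_attend(video_tokens, ego_tokens, Wq, Wk, Wv):
    """Single-head cross-attention: video tokens query the ego tokens."""
    q = video_tokens @ Wq                            # (M, d)
    k = ego_tokens @ Wk                              # (T, d)
    v = ego_tokens @ Wv                              # (T, D)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (M, T), rows sum to 1
    return video_tokens + attn @ v, attn             # residual update

# Hypothetical sizes: T frames, J joints, M video tokens, widths D, E, d.
rng = np.random.default_rng(0)
T, J, M, D, E, d = 4, 23, 16, 32, 24, 8
joints = rng.normal(size=(T, J, 3))
wrists = rng.normal(size=(T, 2, 6))
W_in = rng.normal(size=(J * 3 + 12, E)) * 0.1
ego = ego_tokens_from_motion(joints, wrists, W_in)
video = rng.normal(size=(M, D))
out, attn = cross_attend(video, ego,
                         rng.normal(size=(D, d)),
                         rng.normal(size=(E, d)),
                         rng.normal(size=(E, D)))
```

Unlike a pixel-aligned skeleton rendering, which is blank wherever the body is not visible, tokens injected this way stay available to every video token in every frame. That is one plausible reading of why the source describes the tokens as persistent.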

Experiments are conducted on the Nymeria dataset.

Key Results

- Improves visual fidelity over strong baselines
- Improves camera-motion accuracy over baselines
- Improves object consistency over baselines
- Improves ego and exo human control over baselines
- Also enables intuitive scene editing

Limitations and Future Work

The paper does not explicitly discuss limitations. Future work could extend the framework to handle multiple agents interacting simultaneously or to generalize to outdoor environments with less structured lighting.

Relevance to Patrick's Research

E$^3$C represents a concrete step toward world models for embodied agents, specifically modeling how the visual world changes as a function of the camera wearer's actions and other agents' actions. The 3D point cloud memory is a form of structured world representation, and the separation of scene structure from human dynamics mirrors the modular world model approach. This is directly relevant to JEPA-style predictive world models for embodied AI.