GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation

arXiv: 2605.22882
Authors: Kaichen Zhou, Yuzhen Chen, Fangneng Zhan, Hang Hua, Grace Chen, Xinhai Chang, Ao Qu, Yilun Du, Zhuang Liu, Paul Pu Liang, Mengyu Wang
Submitted: 20 May 2026 (revised 5 Jun 2026, v3)
Categories: cs.CV, cs.RO

Abstract

Video world models can generate realistic futures from a single instruction, but they often fail to track the same physical points consistently across time. As a result, the generated videos appear plausible yet lack the physical grounding required for reliable action execution. GEM-4D injects dense 4D correspondence supervision distilled from a pretrained geometry foundation model into the video generative backbone during training.

Key Contributions

  • Proposes GEM-4D, a geometry-grounded video world model that jointly captures appearance and geometric structure
  • Introduces dense 4D correspondence supervision distilled from a pretrained geometry foundation model
  • Develops an inverse dynamics module that converts correspondence-consistent video rollouts into executable robot trajectories
  • Achieves state-of-the-art performance on both video prediction and geometric consistency
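One way to see why correspondence-consistent rollouts matter for the inverse dynamics contribution: if the same physical points can be followed from one predicted frame to the next, the motion between those frames can be recovered in closed form. The sketch below is not the paper's module, which this summary does not describe in detail. It swaps in the classical Kabsch algorithm, and it assumes hypothetical 3D point tracks `points_t` and `points_t1` (same points, consecutive frames) on a rigidly moving object or gripper. The function name and inputs are illustrative only.

```python
import numpy as np

def rigid_transform_from_tracks(points_t, points_t1):
    """Least-squares rigid motion (R, t) such that points_t1 ~= points_t @ R.T + t.

    Classical Kabsch algorithm. Inputs are (N, 3) arrays holding the SAME
    physical points in two consecutive frames (hypothetical tracks).
    """
    centroid_a = points_t.mean(axis=0)
    centroid_b = points_t1.mean(axis=0)
    a = points_t - centroid_a
    b = points_t1 - centroid_b

    # 3x3 cross-covariance between the centered point sets.
    h = a.T @ b
    u, _, vt = np.linalg.svd(h)

    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    translation = centroid_b - rotation @ centroid_a
    return rotation, translation

# Illustrative usage: synthetic tracks moved by a known rotation and translation.
rng = np.random.default_rng(0)
tracks_t = rng.normal(size=(20, 3))
true_rotation = np.array([[0.0, -1.0, 0.0],
                          [1.0, 0.0, 0.0],
                          [0.0, 0.0, 1.0]])  # 90 degrees about the z axis
true_translation = np.array([0.1, -0.2, 0.3])
tracks_t1 = tracks_t @ true_rotation.T + true_translation

est_rotation, est_translation = rigid_transform_from_tracks(tracks_t, tracks_t1)
```

If the tracks drift between frames (the failure mode the abstract describes), the recovered motion degrades accordingly, which is the intuition behind grounding the video model in correspondences before decoding actions.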

Method Details

GEM-4D keeps a single-stream architecture: 4D correspondence supervision is injected during training, so inference costs no more than a standard video prediction model. The key architectural choices:

1. Geometry Foundation Model: a pretrained geometry model provides dense 4D correspondence supervision.
2. Joint Training: the model learns appearance and geometric structure simultaneously.
3. Inverse Dynamics Module: converts correspondence-consistent video rollouts into executable robot trajectories.
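The joint training idea above can be sketched as a two-term objective: an appearance term on the predicted frames plus a weighted geometry term that pulls the model's predicted point tracks toward those of the frozen geometry teacher. This summary does not give the actual loss, so everything below is an assumption made for illustration: the MSE appearance term (a stand-in for whatever generative loss the backbone uses), the L2 track term, the `corr_weight` value, the `valid_mask` input, and all array shapes.

```python
import numpy as np

def joint_loss(pred_frames, target_frames, pred_tracks, teacher_tracks,
               valid_mask, corr_weight=0.1):
    """Hypothetical objective: appearance term + corr_weight * geometry term.

    pred_frames, target_frames: (T, H, W, 3) video tensors.
    pred_tracks, teacher_tracks: (T, N, 3) positions of N tracked points per frame.
    valid_mask: (T, N), 1.0 where the teacher's track is considered reliable.
    """
    # Appearance: mean squared error over all pixels (stand-in for the real loss).
    appearance = np.mean((pred_frames - target_frames) ** 2)

    # Geometry: mean Euclidean distance to the teacher's tracks over valid points.
    per_point = np.linalg.norm(pred_tracks - teacher_tracks, axis=-1)
    geometry = np.sum(per_point * valid_mask) / max(valid_mask.sum(), 1.0)

    return appearance + corr_weight * geometry, appearance, geometry

# Illustrative usage with toy tensors: perfect frames, tracks offset by (3, 4, 0).
frames = np.zeros((2, 4, 4, 3))
teacher = np.zeros((2, 5, 3))
student = teacher + np.array([3.0, 4.0, 0.0])
mask = np.ones((2, 5))
total, appearance, geometry = joint_loss(frames, frames, student, teacher, mask)
```

Because the geometry term only appears in the training objective, dropping it at inference leaves the forward pass unchanged, which is consistent with the "no additional inference cost" claim.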

Key Results

| Metric | Value |
|--------|-------|
| Real-world manipulation success (before) | 61% |
| Real-world manipulation success (after GEM-4D) | 81% |
| Improvement | +20 percentage points |
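The distinction between percentage points and relative change is easy to blur, so a short calculation may help. The two success rates are taken from the table; the relative gain and the failure-rate reduction are derived here and are not figures quoted in this summary.

```python
# Success rates from the table, in percent.
baseline_success = 61
gem4d_success = 81

# Absolute gain in percentage points (the figure the table gives).
absolute_gain_pp = gem4d_success - baseline_success

# Relative gain: improvement as a fraction of the baseline success rate.
relative_gain_pct = absolute_gain_pp / baseline_success * 100

# Relative reduction in failures: failure rate falls from 39% to 19%.
baseline_failure = 100 - baseline_success
gem4d_failure = 100 - gem4d_success
failure_reduction_pct = (baseline_failure - gem4d_failure) / baseline_failure * 100

print(f"absolute gain: {absolute_gain_pp} percentage points")
print(f"relative gain: {relative_gain_pct:.1f}%")
print(f"failure reduction: {failure_reduction_pct:.1f}%")
```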

GEM-4D achieves state-of-the-art on both:

  • Video prediction quality
  • Geometric consistency across simulation and real-world scenarios

Limitations and Future Work

The paper does not explicitly discuss limitations. Future directions could include:

  • Extending to more complex manipulation tasks
  • Improving correspondence tracking in occluded scenes
  • Scaling to larger robotic systems

Relevance to Patrick's Research

GEM-4D directly addresses a key limitation of existing video world models for robotics: geometric inconsistency. Distilling geometry supervision from a foundation model into the video generation backbone is a practical solution because it adds no inference cost. The 20-percentage-point improvement in real-world manipulation success (61% to 81%) suggests that physical grounding matters for robot manipulation tasks.