GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation

arXiv: 2605.22882

Authors: Kaichen Zhou, Yuzhen Chen, Fangneng Zhan, Hang Hua, Grace Chen, Xinhai Chang, Ao Qu, Yilun Du, Zhuang Liu, Paul Pu Liang, Mengyu Wang

Submitted: 20 May 2026 (revised 5 Jun 2026, v3)

Categories: cs.CV, cs.RO

Abstract

Video world models can generate realistic futures from a single instruction, but they often fail to track the same physical points consistently across time. As a result, the generated videos appear plausible yet lack the physical grounding required for reliable action execution. GEM-4D injects dense 4D correspondence supervision distilled from a pretrained geometry foundation model into the video generative backbone during training.

Key Contributions

- Proposes GEM-4D, a geometry-grounded video world model that jointly captures appearance and geometric structure

- Introduces dense 4D correspondence supervision distilled from a pretrained geometry foundation model

- Develops an inverse dynamics module that converts correspondence-consistent video rollouts into executable robot trajectories

- Achieves state-of-the-art performance on both video prediction and geometric consistency

Method Details

GEM-4D uses a single-stream architecture (no additional inference cost) that injects 4D correspondence supervision during training. The key architectural choices:

1. Geometry Foundation Model: Uses a pretrained geometry model to provide dense 4D correspondence supervision

2. Joint Training: Enables the model to capture both appearance and geometric structure simultaneously

3. Inverse Dynamics Module: Converts correspondence-consistent video rollouts into executable robot trajectories
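As a rough illustration of the inverse dynamics idea, the sketch below recovers a per-frame 2D translation from tracked points by averaging their displacement between consecutive frames. This is not the paper's inverse dynamics module, which this summary does not specify. The function name `mean_displacement_actions`, the 2D point format, and the translation-only action are all assumptions, chosen only to show why consistent correspondences matter when recovering motion from video.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def mean_displacement_actions(tracks: List[List[Point]]) -> List[Point]:
    """Hypothetical stand-in for an inverse dynamics step.

    tracks[t][k] is the (x, y) position of tracked point k in frame t.
    Returns one (dx, dy) translation per consecutive frame pair, computed
    as the mean displacement of all tracked points.
    """
    actions = []
    for prev, curr in zip(tracks, tracks[1:]):
        if len(prev) != len(curr) or not prev:
            raise ValueError("each frame must track the same non-empty set of points")
        n = len(prev)
        dx = sum(c[0] - p[0] for p, c in zip(prev, curr)) / n
        dy = sum(c[1] - p[1] for p, c in zip(prev, curr)) / n
        actions.append((dx, dy))
    return actions


# Two points moving rigidly by (+1, 0) and then by (0, +2).
tracks = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(1.0, 0.0), (2.0, 1.0)],
    [(1.0, 2.0), (2.0, 3.0)],
]
actions = mean_displacement_actions(tracks)
print(actions)
```

In this toy setting, a tracked point that drifts to a different physical location between frames would bias the averaged displacement, which is the kind of error that correspondence-consistent rollouts are meant to reduce.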

Because the correspondence supervision is applied during training, the model keeps its single-stream architecture at inference and incurs no additional cost compared to standard video prediction models.
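One way to picture training-time supervision is an auxiliary loss term that is added to the video loss during training and is simply absent at inference. The sketch below is a minimal illustration under that assumption. The squared-error form, the weight `lambda_corr`, and all function names are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def training_loss(
    pred_frames: Sequence[float],
    true_frames: Sequence[float],
    pred_corr: Sequence[float],
    teacher_corr: Sequence[float],
    lambda_corr: float = 0.5,  # assumed weight, not from the paper
) -> float:
    """Video loss plus a weighted correspondence term.

    teacher_corr stands in for correspondences produced by a frozen,
    pretrained geometry model; it is only needed at training time.
    """
    video_loss = mse(pred_frames, true_frames)
    corr_loss = mse(pred_corr, teacher_corr)
    return video_loss + lambda_corr * corr_loss


# Toy values: flattened pixels and flattened correspondence coordinates.
loss = training_loss(
    pred_frames=[0.0, 1.0],
    true_frames=[0.0, 2.0],
    pred_corr=[1.0, 1.0],
    teacher_corr=[1.0, 3.0],
)
print(loss)
```

Under this reading, only the network that produces `pred_frames` runs at inference, so neither the teacher model nor the correspondence term adds to rollout cost.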

Key Results

| Metric | Value |
|--------|-------|
| Real-world manipulation success (before) | 61% |
| Real-world manipulation success (after GEM-4D) | 81% |
| Improvement | +20 percentage points |
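The improvement in the table is an absolute gain in percentage points. For comparison, the short calculation below also derives the relative gain over the baseline from the same two figures. The variable names are illustrative, and the relative figure is computed here, not reported in the paper.

```python
baseline_success = 0.61  # success rate before GEM-4D, from the table
gem4d_success = 0.81     # success rate after GEM-4D, from the table

# Absolute gain: difference of the two rates, in percentage points.
absolute_gain_pp = (gem4d_success - baseline_success) * 100

# Relative gain: the same difference as a fraction of the baseline rate.
relative_gain_pct = (gem4d_success - baseline_success) / baseline_success * 100

print(f"absolute gain: {absolute_gain_pp:.0f} percentage points")
print(f"relative gain: {relative_gain_pct:.1f}%")
```

The two numbers answer different questions: 20 percentage points is the raw increase in success rate, while the relative gain of roughly a third expresses that increase against the 61% starting point.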

GEM-4D achieves state-of-the-art results in both simulation and real-world scenarios on:

- Video prediction quality

- Geometric consistency

Limitations and Future Work

The paper does not explicitly discuss limitations. Future directions could include:

- Extending to more complex manipulation tasks

- Improving correspondence tracking in occluded scenes

- Scaling to larger robotic systems

Relevance to Patrick's Research

GEM-4D directly addresses a key limitation of existing video world models for robotics: geometric inconsistency. Distilling geometry supervision from a foundation model into the video generation backbone is practical because the supervision is applied during training and adds no inference cost. The 20-percentage-point improvement in real-world manipulation success (61% to 81%) indicates that physical grounding matters for robot manipulation tasks.