arXiv: 2605.22882
Authors: Kaichen Zhou, Yuzhen Chen, Fangneng Zhan, Hang Hua, Grace Chen, Xinhai Chang, Ao Qu, Yilun Du, Zhuang Liu, Paul Pu Liang, Mengyu Wang
Submitted: 20 May 2026 (revised 5 Jun 2026, v3)
Categories: cs.CV, cs.RO
Video world models can generate realistic futures from a single instruction, but they often fail to track the same physical points consistently across time. As a result, the generated videos appear plausible yet lack the physical grounding required for reliable action execution. GEM-4D injects dense 4D correspondence supervision distilled from a pretrained geometry foundation model into the video generative backbone during training.
- Proposes GEM-4D, a geometry-grounded video world model that jointly captures appearance and geometric structure
GEM-4D uses a single-stream architecture: 4D correspondence supervision is injected only during training, so inference incurs no additional cost. The key architectural choices:
1. Geometry Foundation Model: Uses a pretrained geometry model to provide dense 4D correspondence supervision
The model maintains a single-stream architecture with no additional inference cost compared to standard video prediction models.
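The training-time supervision described above can be illustrated with a minimal sketch. This is not the paper's loss: the summary does not specify the objective, so the sketch assumes that "4D correspondence" can be represented as 3D point tracks over time with shape `(T, N, 3)`, that the distillation term is a mean Euclidean distance to the teacher's tracks, and that it is added to the generative loss with a scalar weight. The names `correspondence_distillation_loss`, `training_loss`, and `geometry_weight` are hypothetical.

```python
import numpy as np


def correspondence_distillation_loss(student_tracks, teacher_tracks, valid_mask=None):
    """Mean Euclidean distance between student and teacher point tracks.

    Assumed shapes (hypothetical, not from the paper):
      student_tracks, teacher_tracks: (T, N, 3) for T frames, N points, xyz.
      valid_mask: optional (T, N) array, 1 where the teacher track is usable.
    """
    student = np.asarray(student_tracks, dtype=np.float64)
    teacher = np.asarray(teacher_tracks, dtype=np.float64)
    if student.shape != teacher.shape:
        raise ValueError("student and teacher tracks must have the same shape")

    # Per-frame, per-point distance between the two predictions: shape (T, N).
    per_point = np.linalg.norm(student - teacher, axis=-1)

    if valid_mask is None:
        return float(per_point.mean())

    mask = np.asarray(valid_mask, dtype=np.float64)
    total = mask.sum()
    if total == 0:
        # No usable teacher supervision for this sample.
        return 0.0
    return float((per_point * mask).sum() / total)


def training_loss(generation_loss, student_tracks, teacher_tracks, geometry_weight=0.1):
    """Generative loss plus a weighted geometry distillation term.

    The teacher is consulted only here, at training time; a sampling routine
    would never call this function, which is why inference cost is unchanged.
    """
    geometry_loss = correspondence_distillation_loss(student_tracks, teacher_tracks)
    return generation_loss + geometry_weight * geometry_loss
```

Because the geometry term appears only in the training objective, removing the teacher at inference leaves the generation path untouched, which is consistent with the single-stream, no-extra-cost claim.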
GEM-4D achieves state-of-the-art results on both:
The paper does not explicitly discuss limitations. Future directions could include:
GEM-4D directly addresses a key limitation of existing video world models for robotics: geometric inconsistency. Distilling geometry supervision from a foundation model into the video generation backbone is a practical solution, since it adds no inference cost. The reported 20-percentage-point improvement in real-world manipulation success indicates that physical grounding matters for robot manipulation tasks.