GeoVR: Learning Geometric Representations from Videos for Spatial Intelligent MLLMs

Abstract

GeoVR is a framework that learns geometric representations purely from 2D video sequences to endow multimodal LLMs (MLLMs) with intrinsic 3D awareness. Rather than superficially mixing features, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models through four complementary geometric targets: camera pose estimation, depth regression, metric scale prediction, and multi-scale 3D feature distillation.

Key Contributions

- 2D-to-3D representation learning: Learns 3D awareness from 2D videos without requiring large-scale 3D training data

- Multi-objective geometric learning: Four complementary geometric targets jointly train geometric representations

- Internal representation reshaping: Goes beyond superficial feature mixing to restructure the MLLM's latent space with physical and geometric constraints

Method Details

GeoVR distills geometry knowledge into MLLMs through four complementary geometric targets:

1. Inter-frame camera pose estimation: Embeds varying viewpoint dynamics by estimating camera motion between frames

2. Dense depth map regression: Anchors physical distances with pixel-wise depth predictions

3. Metric scale factor prediction: Enables real-world calibration by predicting metric scale

4. Multi-scale 3D feature distillation: Aligns the intermediate feature space with 3D foundation model representations
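
As a rough illustration of the first target, the supervision signal for inter-frame pose can be thought of as the relative transform between two camera poses. The sketch below assumes 4x4 camera-to-world matrices with a rigid rotation and translation; the helper names (`invert_rigid`, `matmul4`, `relative_pose`) are hypothetical, and the source does not specify how GeoVR actually parameterizes camera motion.

```python
from typing import List

Matrix = List[List[float]]  # 4x4 homogeneous transform, row-major


def invert_rigid(T: Matrix) -> Matrix:
    """Invert a rigid transform [R | t] using the closed form [R^T | -R^T t]."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]  # transpose of R
    t = [T[i][3] for i in range(3)]
    t_inv = [-sum(R_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [R_t[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def matmul4(A: Matrix, B: Matrix) -> Matrix:
    """Plain 4x4 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def relative_pose(T_i: Matrix, T_j: Matrix) -> Matrix:
    """Pose of camera j expressed in camera i's frame (assumption:
    both inputs are camera-to-world), i.e. T_i^{-1} @ T_j."""
    return matmul4(invert_rigid(T_i), T_j)


# Two cameras with identical orientation, 2 units apart along x.
cam_a = [[1.0, 0.0, 0.0, 1.0],
         [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0, 1.0]]
cam_b = [[1.0, 0.0, 0.0, 3.0],
         [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0, 1.0]]
rel = relative_pose(cam_a, cam_b)
```

Here `rel` carries an identity rotation and a translation of 2 along x, which is the kind of frame-to-frame motion a pose head would be asked to regress.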

The key insight is that explicit physical and geometric constraints naturally drive the model's internal representations to develop 3D awareness, so the model does not have to rely on post-hoc feature alignment.
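
One way to picture the four targets acting jointly is a weighted sum of per-target losses. The following is a minimal sketch under stated assumptions, not GeoVR's actual objective: the L1 pose and depth terms, the log-space scale error, the cosine distance for distillation, the equal default weights, and all names (`LossWeights`, `geometric_loss`, the dictionary keys) are illustrative choices that do not come from the source.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass
class LossWeights:
    # Hypothetical weights; the source does not report any values.
    pose: float = 1.0
    depth: float = 1.0
    scale: float = 1.0
    distill: float = 1.0


def l1(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error over flattened values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity, with a small epsilon for stability."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm + 1e-8)


def geometric_loss(pred: Dict, target: Dict,
                   w: Optional[LossWeights] = None) -> float:
    """Weighted sum of four geometric terms (all forms are assumptions)."""
    w = w or LossWeights()
    pose = l1(pred["pose"], target["pose"])      # inter-frame camera pose
    depth = l1(pred["depth"], target["depth"])   # dense depth map
    scale = abs(math.log(pred["scale"]) - math.log(target["scale"]))
    # One feature vector per scale; average the distance across scales.
    pairs = list(zip(pred["features"], target["features"]))
    distill = sum(cosine_distance(s, t) for s, t in pairs) / len(pairs)
    return (w.pose * pose + w.depth * depth
            + w.scale * scale + w.distill * distill)


teacher = {"pose": [0.0, 0.0, 1.0], "depth": [2.0, 4.0], "scale": 1.5,
           "features": [[1.0, 0.0], [0.0, 1.0]]}
student = {"pose": [0.0, 0.0, 1.0], "depth": [3.0, 5.0], "scale": 1.5,
           "features": [[1.0, 0.0], [0.0, 1.0]]}
loss = geometric_loss(student, teacher)
```

In this toy case only the depth prediction is off (by 1 at each pixel), so the total is dominated by the depth term. Dropping a target amounts to setting its weight to zero, which is one simple way to set up single-target or two-target comparisons.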

Key Results

GeoVR achieves state-of-the-art performance on spatial reasoning benchmarks, establishing a new paradigm for endowing foundation models with spatial intelligence. The multi-objective geometric learning strategy outperforms approaches that use only one or two of the geometric targets.

Limitations and Future Work

GeoVR requires pre-trained 3D foundation models for distillation, which may not be available for all domains. The framework also focuses on spatial reasoning and may not generalize to other forms of geometric understanding (e.g., physical properties of objects). Future work could explore self-supervised 3D representation learning without reliance on pre-trained 3D models.

Relevance to Patrick's Research

GeoVR represents a paradigm for endowing LLMs/MLLMs with world-model-like spatial understanding through representation reshaping. Unlike video generation world models (Sora, VDM), GeoVR focuses on internal geometric representations, a complementary approach to world modeling that could enhance agents' spatial reasoning capabilities. The four-target distillation strategy provides a template for multi-objective training of world model representations.