GeoVR is a framework that learns geometric representations from purely 2D video sequences to endow Multimodal LLMs (MLLMs) with intrinsic 3D awareness. Rather than mixing features superficially, GeoVR reshapes the internal representations of the MLLM by distilling geometric knowledge from pre-trained 3D foundation models through four complementary geometric targets: camera pose estimation, depth regression, metric scale prediction, and multi-scale 3D feature distillation.
- 2D-to-3D representation learning: Learns 3D awareness from 2D videos without requiring large-scale 3D training data
GeoVR distills geometry knowledge into MLLMs through four complementary geometric targets:
1. Inter-frame camera pose estimation: Embeds varying viewpoint dynamics by estimating camera motion between frames
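
To make the inter-frame pose target concrete, the sketch below computes the relative camera motion between two frames. It is an illustration only: it assumes each frame comes with a camera-to-world pose `(R, t)` (for example, estimated by a 3D foundation model), and that the regression target is the pose of one camera expressed in the other's frame. The helper names (`rot_z`, `relative_pose`, `rotation_angle_deg`) are hypothetical and do not come from GeoVR.

```python
import math

def rot_z(deg):
    """3x3 rotation about the z-axis by `deg` degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def relative_pose(R_a, t_a, R_b, t_b):
    """Pose of camera b expressed in camera a's frame.

    Assumption: (R, t) are camera-to-world poses, so T_rel = T_a^{-1} T_b,
    i.e. R_rel = R_a^T R_b and t_rel = R_a^T (t_b - t_a).
    """
    R_a_inv = transpose(R_a)  # the inverse of a rotation is its transpose
    R_rel = matmul(R_a_inv, R_b)
    t_rel = matvec(R_a_inv, [tb - ta for ta, tb in zip(t_a, t_b)])
    return R_rel, t_rel

def rotation_angle_deg(R):
    """Rotation angle from the trace: cos(theta) = (tr(R) - 1) / 2."""
    cos_theta = (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))

# Two hypothetical frames: the camera yaws and translates between them.
R_rel, t_rel = relative_pose(rot_z(30), [1.0, 0.0, 0.0],
                             rot_z(75), [1.0, 2.0, 0.0])
print(rotation_angle_deg(R_rel), t_rel)
```

A pose head trained against such a target has to encode how the viewpoint changed between frames, which is the "viewpoint dynamics" the item above refers to.
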
The key insight is that explicit, physical geometric constraints naturally push the model's internal representations toward 3D awareness, so the model does not have to rely on post-hoc feature alignment.
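
One way to picture how the four targets could act together as constraints is a weighted sum of per-target losses. The sketch below is a minimal illustration under stated assumptions, not GeoVR's actual objective: the specific loss forms (L1 for pose and depth, log-ratio error for metric scale, cosine distance averaged over feature scales), the dictionary keys, and the weights are all hypothetical choices made for this example.

```python
import math

def l1(pred, target):
    """Mean absolute error between two equal-length vectors."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def geometric_loss(pred, target, weights):
    """Weighted sum of four illustrative geometric loss terms.

    `target` stands in for outputs of a pre-trained 3D teacher model.
    """
    terms = {
        # Inter-frame camera pose, flattened to a vector (assumed encoding).
        "pose": l1(pred["pose"], target["pose"]),
        # Per-pixel depth, flattened to a vector.
        "depth": l1(pred["depth"], target["depth"]),
        # Metric scale compared in log space so the error is a ratio.
        "scale": abs(math.log(pred["scale"]) - math.log(target["scale"])),
        # Feature distillation averaged over several feature scales.
        "feat": sum(cosine_distance(s, t)
                    for s, t in zip(pred["feats"], target["feats"]))
                / len(pred["feats"]),
    }
    total = sum(weights[name] * value for name, value in terms.items())
    return total, terms

# Toy inputs; every number here is made up for illustration.
pred = {"pose": [0.0, 0.0, 0.0], "depth": [1.0, 2.0],
        "scale": 2.0, "feats": [[1.0, 0.0]]}
target = {"pose": [0.3, 0.0, 0.0], "depth": [1.5, 2.0],
          "scale": 1.0, "feats": [[0.0, 1.0]]}
weights = {"pose": 1.0, "depth": 1.0, "scale": 1.0, "feat": 1.0}
total, terms = geometric_loss(pred, target, weights)
print(total, terms)
```

Setting a weight to zero drops the corresponding target, which gives a simple way to set up the kind of single-target or two-target ablation comparison.
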
GeoVR achieves state-of-the-art performance on spatial reasoning benchmarks, establishing a new paradigm for endowing foundation models with spatial intelligence. The multi-objective geometric learning strategy outperforms approaches that use only one or two of the geometric targets.
GeoVR requires pre-trained 3D foundation models for distillation, which may not be available for all domains. The framework also focuses on spatial reasoning and may not generalize to other forms of geometric understanding (e.g., physical properties of objects). Future work could explore self-supervised 3D representation learning without reliance on pre-trained 3D models.
GeoVR represents a paradigm for endowing LLMs/MLLMs with world-model-like spatial understanding through representation reshaping. Unlike video generation world models (Sora, VDM), GeoVR focuses on internal geometric representations, a complementary approach to world modeling that could enhance agents' spatial reasoning capabilities. The four-target distillation strategy provides a template for multi-objective training of world model representations.