REST3D reconstructs physically stable 3D scenes from a single RGB image, enabling casual images to become simulation-ready digital assets. The method introduces an agentic physical scene understanding technique that constructs a scene-tree representation capturing object physical states and inter-object relationships from a gravity-support perspective. This structural prior initializes scene reconstruction, followed by scene-tree-guided alignment and physics-constrained optimization to resolve physical violations (floating objects, penetration) while preserving visual consistency with the input image.
- Scene-tree representation: A structured prior, built through agentic physical scene understanding, that captures each object's physical state and the gravity-support relationships between objects
REST3D operates in three stages:
1. Agentic physical scene understanding: An agent analyzes the input image and constructs a scene-tree, a hierarchical representation that encodes each object's physical state (supported, floating, etc.) and the gravity-support relationships between objects. This provides a structural prior about which objects should rest on which surfaces.
2. Scene initialization and alignment: Image-to-3D models (e.g., reconstruction transformers) generate the initial 3D geometry. The scene-tree then guides alignment, so that objects are positioned according to the physical relationships identified in stage 1.
3. Physics-constrained optimization: Refines the scene to resolve remaining physical violations (objects intersecting, objects floating above support surfaces) through optimization that respects physics constraints while preserving visual consistency with the original input image.
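
The first two stages can be illustrated with a small data-structure sketch. This is not the authors' implementation: the `SceneNode` class, its fields, and the reduction of each object to a vertical extent (`bottom_z`, `height`) are assumptions made purely for illustration. It shows how a tree of gravity-support relationships could drive a top-down alignment pass and a check for floating or penetrating objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SceneNode:
    """One object in a hypothetical scene-tree (names are illustrative)."""
    name: str
    bottom_z: float  # lowest point of the object's bounding box
    height: float    # vertical extent of the bounding box
    children: List["SceneNode"] = field(default_factory=list)

    @property
    def top_z(self) -> float:
        return self.bottom_z + self.height

    def add_support(self, child: "SceneNode") -> "SceneNode":
        """Record that `child` rests on this node and return the child."""
        self.children.append(child)
        return child


def physical_violations(node: SceneNode, tol: float = 1e-3) -> Dict[str, float]:
    """Signed vertical gap per supported object: > 0 floating, < 0 penetrating."""
    found: Dict[str, float] = {}
    for child in node.children:
        gap = child.bottom_z - node.top_z
        if abs(gap) > tol:
            found[child.name] = gap
        found.update(physical_violations(child, tol))
    return found


def align_to_supports(node: SceneNode) -> None:
    """Top-down pass: place each child's base on its supporter's top surface."""
    for child in node.children:
        child.bottom_z = node.top_z
        align_to_supports(child)  # grandchildren follow the moved child


# Toy scene: the table floats above the floor, the mug sinks into the table.
floor = SceneNode("floor", bottom_z=-0.1, height=0.1)
table = floor.add_support(SceneNode("table", bottom_z=0.05, height=0.75))
mug = table.add_support(SceneNode("mug", bottom_z=0.78, height=0.1))

print(physical_violations(floor))  # violations before alignment
align_to_supports(floor)
print(physical_violations(floor))  # violations after alignment
```

Processing parents before children matters here: moving the table also changes the height at which the mug must rest, which is one reason a tree is a convenient structure for this prior.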
The approach combines computer vision scene understanding with physics-based optimization, using the scene-tree as a bridge between visual reconstruction and physical validity.
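
The trade-off between physical validity and visual consistency can be made concrete with a toy objective. The sketch below is an assumption-laden illustration, not the paper's formulation: it optimizes a single object's base height `z` under a quadratic visual term (stay near the reconstructed height `z_init`) and a quadratic physics term (close the gap to the support surface at `z_support`). The function name, weights, and step size are all hypothetical.

```python
def refine_height(
    z_init: float,
    z_support: float,
    w_visual: float = 1.0,
    w_physics: float = 50.0,
    lr: float = 0.005,
    steps: int = 500,
) -> float:
    """Gradient descent on
        L(z) = w_visual * (z - z_init)**2 + w_physics * (z - z_support)**2
    The first term keeps the object near its image-derived position;
    the second penalizes floating (z > z_support) and penetration
    (z < z_support) symmetrically.
    """
    z = z_init
    for _ in range(steps):
        grad = 2.0 * w_visual * (z - z_init) + 2.0 * w_physics * (z - z_support)
        z -= lr * grad
    return z


# An object reconstructed 5 cm above its support surface.
z_refined = refine_height(z_init=0.05, z_support=0.0)
print(z_refined)
```

For this quadratic objective the minimizer has the closed form `(w_visual * z_init + w_physics * z_support) / (w_visual + w_physics)`, so a larger `w_physics` pulls the object closer to exact contact at the cost of moving it further from where the image placed it.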
- Significantly reduces physical violations (floating objects, penetration) compared to prior single-image reconstruction methods
- Scene-tree accuracy depends on the quality of the physical scene understanding stage; errors made there propagate to the downstream alignment and optimization stages
REST3D directly addresses a practical challenge for world models: generating 3D scenes that are not just visually plausible but also physically consistent and simulation-ready. The scene-tree, which encodes gravity-support relationships, offers a way to inject physical commonsense into 3D reconstruction. For world models that need to generate or reason about physical environments, the gap between "looks plausible" and "simulates correctly" is critical, and REST3D takes a concrete step toward closing it.