A Cookbook of 3D Vision: Data, Learning Paradigms, and Application

Abstract

This paper provides a data-centric taxonomy of 3D vision that connects geometric representations, datasets, learning frameworks, and applications within one conceptual map, covering point clouds, meshes, voxels, 3D Gaussians, 2D-supervised 3D learning, implicit neural representations, and 4D world modeling.

Key Contributions

- Unified data-centric map: Organizes 3D vision around data representations (point clouds, meshes, voxels, implicit fields, 3D Gaussians) and their acquisition pipelines.

- Benchmark ecosystem analysis: Shows how dataset design and supervision regimes shape learning paradigms, including scalability constraints.

- 4D world modeling framing: Situates the extension from 3D to 4D (temporal) scene understanding as an emerging frontier, alongside 2D-supervised 3D learning and neural implicit fields.

Method Details

Survey/cookbook format. No novel method proposed. The paper synthesizes existing work into a structured taxonomy organized around three axes:

1. Data Representations: Point clouds, meshes, voxel grids, RGB-D, multi-view images, CAD/B-Rep models, neural implicit fields, 3D Gaussians, each with efficiency-fidelity tradeoffs.

2. Datasets & Benchmarks: How benchmark design constrains and enables progress (ScanNet, nuScenes, BEHAVIOR, etc.).

3. Modeling Paradigms: Geometry-based pipelines, 2D-supervised 3D learning, implicit neural fields, and 4D video/world modeling.
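To make the efficiency-fidelity tradeoff between representations concrete, here is a minimal sketch that converts a point cloud into an occupancy voxel grid at two resolutions. It is an illustration only and is not taken from the paper: the helper names `voxelize` and `make_cloud`, the synthetic unit-cube data, and the voxel sizes are all assumptions chosen for the example.

```python
import random


def voxelize(points, voxel_size):
    """Return the set of integer voxel indices occupied by at least one point.

    Hypothetical helper for illustration; not an API from the surveyed paper.
    """
    occupied = set()
    for x, y, z in points:
        # Floor-divide each coordinate to find the voxel that contains the point.
        occupied.add((int(x // voxel_size),
                      int(y // voxel_size),
                      int(z // voxel_size)))
    return occupied


def make_cloud(n, seed=0):
    """Sample n synthetic points uniformly from the unit cube (assumed test data)."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random(), rng.random()) for _ in range(n)]


cloud = make_cloud(10_000)
coarse = voxelize(cloud, voxel_size=0.25)  # few cells: cheap, but loses detail
fine = voxelize(cloud, voxel_size=0.05)    # many cells: costlier, keeps more detail

print(f"points: {len(cloud)}, coarse voxels: {len(coarse)}, fine voxels: {len(fine)}")
```

The coarse grid stores far fewer cells than the raw cloud has points, at the cost of merging nearby geometry into the same cell. The fine grid keeps more structure but needs many more cells. This is the kind of tradeoff the taxonomy attributes to each representation.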

Authors: Hongyang Du (Brown), Zongxia Li (UMD), Dawei Liu (UPenn), et al. Accepted to CVPR 2026 OpenSUN3D Workshop.

Key Results

This is a survey paper — no experimental results. The contribution is organizational/synthetic:

- Repository: https://github.com/Hongyang-Du/awesome-3d-datasets

- Framework connecting 4D world modeling to the broader 3D vision landscape.

- Identified trend: emerging work on balancing efficiency and fidelity in temporal scene understanding.

Limitations & Future Work

- Survey format — no new methods or experiments.

- 4D world modeling section is brief (~1 page) given scope; a dedicated survey on that topic would be valuable.

- Rapidly evolving area; taxonomy may need updating as 3D Gaussian splatting and neural rendering mature.

Relevance to Patrick's Research

Moderate relevance. The 4D world modeling framing connects to video prediction and temporal world model research. The taxonomy is useful background for understanding how different 3D representations feed into world modeling systems. However, this is primarily a survey/educational resource rather than a research contribution.