Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

Abstract

Astra is an agentic spatial reasoning framework that couples an RL-trained VLM policy (Astra-VL) with a Bagel-based world simulator (Astra-WM), enabling VLMs to actively acquire imagined visual evidence by interacting with the world simulator during reasoning. Astra-WM generates novel-view observations from context images and natural-language camera motions, allowing the agent to "think with imagination."

Key Contributions

- Astra framework: Couples Astra-VL (RL-trained VLM policy) with Astra-WM (Bagel-based world simulator) for action-conditioned visual imagination

- View consistency tuning: Astra-WM is trained with view consistency tuning to improve pose and content consistency across generated views

- World-simulator-in-the-loop two-phase RL curriculum: Stabilizes tool-use exploration and teaches the model when imagined observations improve over direct answering

Method Details

The framework consists of two coupled components:

1. Astra-WM (World Simulator): A Bagel-based world simulator that generates novel-view observations from context images and natural-language camera motions. It is trained with view consistency tuning to improve pose and content consistency across generated views.

2. Astra-VL (Agentic Policy): An RL-trained VLM policy that learns when, where, and how to invoke the world simulator during reasoning. The policy decides whether imagined observations would improve its answer before calling the simulator.
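The interaction between the two components can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: the `Step`, `Episode`, and `run_episode` names, the `max_calls` budget, the string stand-ins for images, and the toy policy and simulator are all hypothetical and only show how a policy might alternate between requesting imagined views and answering.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One policy decision: request a camera motion, or give a final answer."""
    camera_motion: Optional[str] = None  # natural-language motion (hypothetical phrasing)
    answer: Optional[str] = None


@dataclass
class Episode:
    question: str
    views: List[str]  # strings stand in for image observations
    imagined: List[str] = field(default_factory=list)


# Hypothetical interfaces: the policy sees the episode and a "must answer" flag;
# the simulator maps (context views, camera motion) to a new view.
Policy = Callable[[Episode, bool], Step]
Simulator = Callable[[List[str], str], str]


def run_episode(policy: Policy, simulator: Simulator, question: str,
                context_views: List[str], max_calls: int = 3) -> Tuple[str, Episode]:
    ep = Episode(question=question, views=list(context_views))
    for _ in range(max_calls):
        step = policy(ep, False)
        if step.answer is not None:
            # The policy judged that no (further) imagination is needed.
            return step.answer, ep
        view = simulator(ep.views, step.camera_motion or "")
        ep.views.append(view)
        ep.imagined.append(view)
    # Call budget exhausted: force a final answer.
    return policy(ep, True).answer or "", ep


def toy_policy(ep: Episode, must_answer: bool) -> Step:
    # Toy rule: imagine exactly once, then answer.
    if must_answer or ep.imagined:
        return Step(answer=f"answer after {len(ep.imagined)} imagined view(s)")
    return Step(camera_motion="turn left 90 degrees")


def toy_simulator(views: List[str], camera_motion: str) -> str:
    return f"imagined[{camera_motion}] from {len(views)} view(s)"


answer, episode = run_episode(toy_policy, toy_simulator,
                              "Where is the chair?", ["img_0", "img_1"])
print(answer)
print(episode.imagined)
```

In a real system the toy callables would be replaced by the trained policy and the world simulator; the loop structure is the only part this sketch is meant to convey.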

The RL stage uses a two-phase curriculum: the first phase stabilizes tool-use exploration, and the second phase trains the model to invoke the simulator only when imagined observations improve over direct answering.
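The summary does not specify the reward used in either phase, so the following is purely an illustrative assumption of how a two-phase reward could be shaped. The function name `curriculum_reward`, the `format_bonus` and `call_cost` parameters, and their values are hypothetical. Phase 1 adds a small bonus for using the tool at all, to encourage exploration; phase 2 charges a small cost per call, so that a call only pays off when it changes a wrong answer into a right one.

```python
def curriculum_reward(phase: int, correct: bool, num_calls: int,
                      format_bonus: float = 0.1, call_cost: float = 0.05) -> float:
    """Hypothetical reward shaping for a two-phase tool-use curriculum."""
    base = 1.0 if correct else 0.0
    if phase == 1:
        # Assumed phase 1: reward any simulator use to stabilize exploration.
        return base + (format_bonus if num_calls > 0 else 0.0)
    if phase == 2:
        # Assumed phase 2: each call costs a little, so unnecessary calls are discouraged.
        return base - call_cost * num_calls
    raise ValueError(f"unknown phase: {phase}")


# In phase 2, a correct direct answer beats a correct answer that used two calls,
# but a correct answer with calls still beats an incorrect direct answer.
direct_correct = curriculum_reward(2, correct=True, num_calls=0)
tool_correct = curriculum_reward(2, correct=True, num_calls=2)
direct_wrong = curriculum_reward(2, correct=False, num_calls=0)
print(direct_correct > tool_correct > direct_wrong)
```

Under these assumed values, the ordering gives the policy an incentive to call the simulator only on questions where it cannot answer correctly without it.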

Key Results

| Model | Benchmark | Baseline | +Astra-WM | +Astra-VL |
|-------|-----------|----------|-----------|-----------|
| Gemini-3-Flash | MMSI-Bench | 45.1 | 49.5 | - |
| Qwen3-VL | MMSI-Bench | 29.8 | - | 38.8 |
| Qwen3-VL | MindCube | 36.8 | - | 42.7 |
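To make the size of the improvements explicit, the short script below computes the absolute and relative gains from the reported numbers. It assumes the scores are accuracy percentages (the table does not state units); the `gains` helper is introduced here only for illustration.

```python
# (model, benchmark, baseline score, score with the Astra component), taken from the table.
results = [
    ("Gemini-3-Flash", "MMSI-Bench", 45.1, 49.5),  # +Astra-WM
    ("Qwen3-VL", "MMSI-Bench", 29.8, 38.8),        # +Astra-VL
    ("Qwen3-VL", "MindCube", 36.8, 42.7),          # +Astra-VL
]


def gains(baseline: float, augmented: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = round(augmented - baseline, 1)
    relative = round(100 * (augmented - baseline) / baseline, 1)
    return absolute, relative


for model, benchmark, baseline, augmented in results:
    absolute, relative = gains(baseline, augmented)
    print(f"{model} on {benchmark}: +{absolute} points ({relative}% relative)")
```

The absolute gains are 4.4, 9.0, and 5.9 points respectively, so the largest reported improvement is for Qwen3-VL with Astra-VL on MMSI-Bench.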

Key insight: Both the world simulator and the agentic policy are necessary for effective world-model-augmented reasoning. Learning when to imagine is as important as the imagination itself.

Limitations and Future Work

The paper notes that effective world-model-augmented reasoning requires learning when, where, and how to imagine; this is a learned behavior, not an implicit one. Future work could explore more sophisticated criteria for invoking imagination and extend the approach to 3D scene understanding beyond spatial reasoning benchmarks.

Relevance to Patrick's Research

This work directly addresses the question of how to integrate world simulators into VLM reasoning pipelines. The two-phase RL curriculum for learning tool use with world models is particularly relevant for understanding how to build agents that can reason about unobserved states. The distinction between systematic replanning (using the simulator) and blind trial-and-error has implications for world model evaluation.