World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

arXiv: 2606.03603
Authors: Yucheng Zhou, Wei Tao, Yiwen Guo, Jianbing Shen
Submitted: 2 June 2026
Categories: cs.CV, cs.AI, cs.CL

Abstract

World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes. World models generate concrete visual rollouts while MLLMs reason abstractly. This paper studies when visual simulation is useful, whether a rollout is credible, and how it should influence the final answer. The authors propose PF-OPSD for controlled concrete reasoning.

Key Contributions

- Formulates controlled concrete reasoning: learning when to invoke, verify, and integrate visual future simulation alongside abstract reasoning
- Constructs two human-verified benchmarks:
  - VRQABench: controllable spatial lookahead tasks
  - OpenWorldQA: open-domain physical prediction
- Proposes the PF-OPSD (Privileged-Future On-Policy Self-Distillation) training method
- Shows that PF-OPSD improves robustness to noisy or conflicting rollouts

Method Details

Problem: Rollouts generated by world models are stochastic and may be visually plausible yet incorrect for the task. The model therefore needs to determine:

1. When is visual simulation useful?
2. Is a rollout credible?
3. How should it influence the final answer?
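The three questions above can be read as a control flow around the world model. The following is a minimal sketch under stated assumptions, not the paper's implementation: in the paper these decisions are learned by the model, whereas here each one is an explicit callable. All names (`controlled_concrete_reasoning`, `needs_simulation`, `credibility`, `answer_with_rollout`) and the fixed `threshold` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    answer: str
    used_rollout: bool
    reason: str


def controlled_concrete_reasoning(
    question: str,
    abstract_answer: Callable[[str], str],           # hypothetical: MLLM answer without simulation
    needs_simulation: Callable[[str], bool],         # hypothetical: "when is simulation useful?"
    simulate: Callable[[str], str],                  # hypothetical: world-model rollout (as a handle)
    credibility: Callable[[str, str], float],        # hypothetical: "is the rollout credible?" in [0, 1]
    answer_with_rollout: Callable[[str, str], str],  # hypothetical: answer conditioned on the rollout
    threshold: float = 0.5,                          # assumption: fixed cutoff, not from the paper
) -> Decision:
    # 1. Invoke: skip the world model when abstract reasoning is judged sufficient.
    if not needs_simulation(question):
        return Decision(abstract_answer(question), False, "simulation not needed")

    rollout = simulate(question)

    # 2. Verify: a plausible-looking rollout may still be wrong for the task,
    #    so fall back to abstract reasoning when credibility is low.
    if credibility(question, rollout) < threshold:
        return Decision(abstract_answer(question), False, "rollout rejected")

    # 3. Integrate: let the accepted rollout inform the final answer.
    return Decision(answer_with_rollout(question, rollout), True, "rollout accepted")
```

Separating the three steps makes each failure mode visible: an unnecessary rollout, an accepted but wrong rollout, or a correct rollout that is ignored.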

PF-OPSD Training:

- Uses ground-truth future videos and answers as teacher-side privileged context during training
- The deployable student never observes true futures at test time
- Evaluates on-policy concrete-reasoning trajectories

Key insight: Training with ground-truth futures lets the model learn when simulation is reliable; that judgment is then distilled into a student that sees only static observations.
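To make the distillation idea concrete, here is a minimal sketch of a per-token distillation loss, written in plain Python so it runs without dependencies. It assumes that the teacher and the student each produce next-token logits at every step of a trajectory sampled by the student (the on-policy part), with the teacher additionally conditioned on the privileged future. The choice of forward KL, the averaging over steps, and the function names are assumptions for illustration; the summary above does not specify the exact objective used by PF-OPSD.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    # Numerically stable log-softmax over one vocabulary-sized vector.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_kl(teacher_logits: Sequence[float], student_logits: Sequence[float]) -> float:
    # KL(teacher || student) for a single token position (assumed direction).
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def self_distillation_loss(
    teacher_steps: Sequence[Sequence[float]],  # logits with privileged future in context
    student_steps: Sequence[Sequence[float]],  # logits without it, same sampled tokens
) -> float:
    # Mean per-token KL along one student-sampled trajectory.
    if len(teacher_steps) != len(student_steps):
        raise ValueError("teacher and student must score the same trajectory")
    if not teacher_steps:
        return 0.0
    total = sum(token_kl(t, s) for t, s in zip(teacher_steps, student_steps))
    return total / len(teacher_steps)
```

Because the trajectory comes from the student, the loss is computed on states the student actually reaches at test time, which is the usual motivation for on-policy distillation.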

Key Results

| Benchmark | PF-OPSD vs Baseline |
|-----------|---------------------|
| VRQABench | +10.6% improvement |
| OpenWorldQA | +10.9% improvement |

PF-OPSD also increases robustness to noisy or conflicting rollouts.

Limitations and Future Work

The paper does not explicitly discuss limitations. Future directions could include:

- Extending to more complex physical reasoning domains
- Improving rollout verification methods
- Reducing the gap between teacher (privileged) and student (deployable) performance

Relevance to Patrick's Research

This paper directly addresses the integration of world models (for concrete simulation) with MLLMs (for abstract reasoning). PF-OPSD uses ground-truth futures during training but not at test time, which is a practical way to benefit from simulation without requiring a perfect world model at inference. The new benchmarks (VRQABench, OpenWorldQA) provide useful evaluation infrastructure for world model research.