arXiv: 2606.03603
Authors: Yucheng Zhou, Wei Tao, Yiwen Guo, Jianbing Shen
Submitted: 2 June 2026
Categories: cs.CV, cs.AI, cs.CL
World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes. World models generate concrete visual rollouts while MLLMs reason abstractly. This paper studies when visual simulation is useful, whether a rollout is credible, and how it should influence the final answer. The authors propose PF-OPSD for controlled concrete reasoning.
- Formulates controlled concrete reasoning — learning when to invoke, verify, and integrate visual future simulation alongside abstract reasoning
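Problem: Rollouts generated by world models are stochastic and may be visually plausible yet task-incorrect. The model therefore needs to determine when visual simulation is useful, whether a given rollout is credible, and how it should influence the final answer.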
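The three decisions (invoke, verify, integrate) can be pictured as a small control loop. The sketch below is illustrative only and is not the paper's implementation: the `Rollout` type, the `credibility` score, the `min_credibility` threshold, and the credibility-weighted mixing rule are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Rollout:
    """Hypothetical container: answer scores implied by a simulated future,
    plus a credibility estimate in [0, 1] from some verifier."""
    answer_scores: Dict[str, float]
    credibility: float


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    # Assumes non-negative scores with a positive sum.
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}


def controlled_answer(
    abstract_scores: Dict[str, float],
    needs_simulation: bool,
    simulate: Callable[[], Rollout],
    min_credibility: float = 0.5,  # assumed threshold, not from the paper
) -> str:
    abstract = normalize(abstract_scores)

    # 1. Invoke: skip the world model when abstract reasoning suffices.
    if not needs_simulation:
        return max(abstract, key=abstract.get)

    rollout = simulate()

    # 2. Verify: discard rollouts judged not credible.
    if rollout.credibility < min_credibility:
        return max(abstract, key=abstract.get)

    # 3. Integrate: mix abstract and concrete evidence, weighted by credibility.
    w = rollout.credibility
    concrete = normalize(rollout.answer_scores)
    mixed = {k: (1 - w) * abstract[k] + w * concrete.get(k, 0.0) for k in abstract}
    return max(mixed, key=mixed.get)


if __name__ == "__main__":
    abstract = {"falls": 0.6, "stays": 0.4}
    credible = Rollout({"falls": 0.1, "stays": 0.9}, credibility=0.9)
    print(controlled_answer(abstract, True, lambda: credible))
```

In this toy setup a credible rollout can overturn the abstract prior, while a low-credibility rollout is ignored and the abstract answer stands. In the approach the summary describes, these decisions are learned rather than hard-coded.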
PF-OPSD Training:
Key insight: Training with ground-truth futures lets the model learn when simulation is reliable, then distills this judgment into a student that only sees static observations.
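One way to picture the distillation step is as matching a student's answer distribution to that of a teacher with privileged access to the ground-truth future. This summary does not state the actual PF-OPSD objective, so the KL-divergence loss below, the function names, and the logit values are all assumptions chosen for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    # KL(p || q) over a discrete set of answer options.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits: List[float], student_logits: List[float]) -> float:
    """Assumed objective: KL(teacher || student).

    teacher_logits: scores computed WITH the ground-truth future (training only).
    student_logits: scores computed from the static observation alone.
    """
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))


if __name__ == "__main__":
    teacher = [2.0, 0.0, 0.0]       # hypothetical teacher scores
    student_far = [0.0, 2.0, 0.0]   # student that disagrees with the teacher
    student_near = [1.5, 0.0, 0.0]  # student close to the teacher
    print(distillation_loss(teacher, student_far))
    print(distillation_loss(teacher, student_near))
```

Minimizing such a loss pulls the student toward the teacher's judgment. At test time only the student is used, so no ground-truth future is required.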
PF-OPSD also increases robustness to noisy or conflicting rollouts.
The paper does not explicitly discuss limitations. Future directions could include:
This paper directly addresses the integration of world models (for concrete simulation) with MLLMs (for abstract reasoning). By using ground-truth futures during training but not at test time, PF-OPSD offers a practical way to gain the benefits of simulation without requiring a perfect world model at inference. The new benchmarks (VRQABench, OpenWorldQA) add evaluation infrastructure for world model research.