World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

arXiv: 2606.03603
Authors: Yucheng Zhou, Wei Tao, Yiwen Guo, Jianbing Shen
Submitted: 2 June 2026
Categories: cs.CV, cs.AI, cs.CL

Abstract

World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes. World models generate concrete visual rollouts while MLLMs reason abstractly. This paper studies when visual simulation is useful, whether a rollout is credible, and how it should influence the final answer. The authors propose PF-OPSD for controlled concrete reasoning.

Key Contributions

- Formulates controlled concrete reasoning: learning when to invoke, verify, and integrate visual future simulation alongside abstract reasoning
- Constructs two human-verified benchmarks:
  - VRQABench: controllable spatial lookahead tasks
  - OpenWorldQA: open-domain physical prediction
- Proposes PF-OPSD (Privileged-Future On-Policy Self-Distillation), a training method
- Shows that PF-OPSD improves robustness to noisy or conflicting rollouts

Method Details

Problem: Rollouts generated by world models are stochastic and may be visually plausible yet task-incorrect. The model therefore needs to determine:

1. When is visual simulation useful?
2. Is a rollout credible?
3. How should it influence the final answer?
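The three questions above map naturally onto an invoke, verify, integrate control flow. The sketch below is only an illustration of that flow, not the paper's implementation: every callable (`should_simulate`, `simulate`, `credibility`, `integrate`), the `Decision` type, and the fixed `threshold` are hypothetical names introduced here, and the paper learns these judgments rather than hard-coding them.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    answer: str
    used_rollout: bool  # True only if a rollout actually shaped the answer


def controlled_concrete_reasoning(
    question: str,
    abstract_answer: Callable[[str], str],
    should_simulate: Callable[[str], bool],
    simulate: Callable[[str], str],
    credibility: Callable[[str, str], float],
    integrate: Callable[[str, str, str], str],
    threshold: float = 0.5,  # assumed cut-off; not a value from the paper
) -> Decision:
    """Hypothetical invoke -> verify -> integrate loop."""
    # Abstract reasoning always produces a fallback answer.
    draft = abstract_answer(question)

    # 1. Invoke: skip simulation when it is unlikely to help.
    if not should_simulate(question):
        return Decision(draft, used_rollout=False)

    rollout = simulate(question)

    # 2. Verify: discard rollouts judged not credible.
    if credibility(question, rollout) < threshold:
        return Decision(draft, used_rollout=False)

    # 3. Integrate: let the credible rollout revise the draft answer.
    return Decision(integrate(question, draft, rollout), used_rollout=True)
```

In this sketch a rejected rollout falls back to the abstract answer, which is one simple way to keep a plausible-looking but task-incorrect rollout from overriding the final answer.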

PF-OPSD Training:

- Uses ground-truth future videos and answers as teacher-side privileged context during training
- Evaluates the student's on-policy concrete-reasoning trajectories using that privileged context
- The deployable student never observes true futures at test time

Key insight: Training with ground-truth futures lets the model learn when simulation is reliable, then distills this judgment into a student that only sees static observations.
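To make the teacher/student split concrete, here is a minimal sketch of a self-distillation objective on a student-sampled trajectory. It rests on assumptions not stated in this summary: that the teacher and student each expose a per-step probability distribution over discrete choices, and that the loss is a mean per-step KL(teacher || student). The paper's actual objective may differ; `kl_divergence`, `self_distillation_loss`, and the example choice labels are illustrative names only.

```python
import math
from typing import Dict, List

# Maps a discrete choice (e.g. a token or decision) to its probability.
Dist = Dict[str, float]


def kl_divergence(p: Dist, q: Dist, eps: float = 1e-12) -> float:
    """KL(p || q); eps guards against log(0) where q lacks support."""
    return sum(
        pv * math.log((pv + eps) / (q.get(key, 0.0) + eps))
        for key, pv in p.items()
        if pv > 0.0
    )


def self_distillation_loss(
    teacher_steps: List[Dist], student_steps: List[Dist]
) -> float:
    """Mean per-step KL(teacher || student) along one trajectory.

    Assumption: the trajectory was sampled from the student (on-policy).
    teacher_steps are distributions computed WITH the privileged future;
    student_steps are computed from the static observation only.
    """
    if len(teacher_steps) != len(student_steps):
        raise ValueError("teacher and student must score the same steps")
    if not teacher_steps:
        return 0.0
    total = sum(kl_divergence(t, s) for t, s in zip(teacher_steps, student_steps))
    return total / len(teacher_steps)


if __name__ == "__main__":
    # Hypothetical single-step trajectory: should the rollout be trusted?
    teacher = [{"trust_rollout": 0.9, "ignore_rollout": 0.1}]
    student = [{"trust_rollout": 0.5, "ignore_rollout": 0.5}]
    print(self_distillation_loss(teacher, student))
```

Because only the teacher's inputs include the ground-truth future, minimizing a loss of this shape pushes the student toward the teacher's judgment without the student ever needing the future at test time.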

Key Results

| Benchmark | PF-OPSD vs. Baseline |
|-----------|----------------------|
| VRQABench | +10.6% improvement |
| OpenWorldQA | +10.9% improvement |

PF-OPSD also increases robustness to noisy or conflicting rollouts.

Limitations and Future Work

The paper does not explicitly discuss limitations. Future directions could include:

- Extending to more complex physical reasoning domains
- Improving rollout verification methods
- Reducing the gap between teacher (privileged) and student (deployable) performance

Relevance to Patrick's Research

This paper directly addresses the integration of world models (for concrete simulation) with LLMs (for abstract reasoning). The PF-OPSD approach of using ground-truth futures during training but not at test time is a practical way to get the benefits of simulation without requiring perfect world models at inference. The new benchmarks (VRQABench, OpenWorldQA) provide valuable evaluation infrastructure for world model research.