Abstract

This paper introduces W≡W (World Models in Words), an evaluation framework for auditing language-expressed physical commitments of Vision-Language Models (VLMs). Rather than scoring only I,q → a (input, question to answer), the framework asks models to produce a typed trace: I,q → (s₀, Δs, s₁, a) representing initial state, state transition, resulting state, and answer. A hybrid verifier checks schema validity, state grounding, transition consistency, and answer-trace compatibility. The key finding: 35% of correct answers from mid-tier models are backed by physically invalid traces, revealing failures that answer-only evaluation misses.

Key Contributions

  • Typed trace evaluation: Instead of scoring only final answers, the framework requires VLMs to produce explicit physical state transitions
  • Hybrid verifier: Checks schema validity, state grounding, transition consistency, and answer-trace compatibility
  • TraceBank dataset: Controlled trace resource with schema-validated synthetic scenarios across multiple physics families, contrastive preference pairs, and model outputs
  • Recovery via reranking: Verifier-guided reranking recovers up to 7 percentage points of trace validity without sacrificing answer accuracy
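
The summary does not say how the reranking step is implemented. One plausible reading, sketched below, is to sample several candidate traces and keep the one the verifier flags least. The function `rerank` and the `errors` field are hypothetical names used only for illustration.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def rerank(candidates: Sequence[T], count_errors: Callable[[T], int]) -> T:
    """Pick the candidate with the fewest verifier errors (hypothetical helper)."""
    if not candidates:
        raise ValueError("need at least one candidate trace")
    # min() returns the first minimal item, so ties keep the model's original order.
    return min(candidates, key=count_errors)


# Toy candidates: same answer, different verifier verdicts (made-up data).
candidates = [
    {"answer": "down", "errors": ["transition"]},
    {"answer": "down", "errors": []},
    {"answer": "down", "errors": []},
]
best = rerank(candidates, lambda c: len(c["errors"]))
```

In this toy case every candidate gives the same answer, so choosing the cleanest trace changes trace validity but not answer accuracy.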

Method Details

The framework requires VLMs to produce a typed trace with four components:

  1. s₀: Initial physical state
  2. Δs: State transition (the physical change)
  3. s₁: Resulting physical state
  4. a: Final answer
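
To make the four components concrete, here is a minimal sketch of what such a trace could look like as a typed record. The summary does not give the actual schema, so every field name below (`position`, `velocity`, `kind`, `duration_s`, `affected`) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ObjectState:
    """One object in a physical state (hypothetical fields)."""
    position: Tuple[float, float]            # metres, (x, y)
    velocity: Tuple[float, float] = (0.0, 0.0)  # metres per second


@dataclass(frozen=True)
class Transition:
    """The state change, written as Δs in the notes (hypothetical fields)."""
    kind: str                  # illustrative label such as "fall" or "slide"
    duration_s: float          # seconds
    affected: Tuple[str, ...]  # names of the objects that change


@dataclass(frozen=True)
class Trace:
    s0: Dict[str, ObjectState]  # initial state
    delta_s: Transition         # state transition
    s1: Dict[str, ObjectState]  # resulting state
    answer: str                 # final answer


# A ball dropped from 2 m, observed 0.5 s later (g = 9.8 m/s^2).
trace = Trace(
    s0={"ball": ObjectState(position=(0.0, 2.0))},
    delta_s=Transition(kind="fall", duration_s=0.5, affected=("ball",)),
    s1={"ball": ObjectState(position=(0.0, 0.775), velocity=(0.0, -4.9))},
    answer="The ball moves downward.",
)
```

Because each part is a separate typed field, a checker can inspect s₀, Δs, and s₁ independently of the answer string.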

A hybrid verifier then checks:

  • Schema validity (trace is well-formed)
  • State grounding (states match the visual input)
  • Transition consistency (Δs is physically plausible)
  • Answer-trace compatibility (answer follows from the trace)

The verifier produces typed error labels: object, relation, force, transition, temporal, unit/scale, and faithfulness errors.
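
The four checks can be sketched on a toy one-dimensional world where each state maps an object name to its height. This is a purely rule-based illustration, not the paper's hybrid verifier: the trace layout, the `scene_objects` annotation, and the `"schema"` label are assumptions, and only three of the listed error labels appear.

```python
from typing import Dict, List, Set

REQUIRED_KEYS = {"s0", "delta_s", "s1", "answer"}


def verify(trace: Dict, scene_objects: Set[str], tol: float = 1e-6) -> List[str]:
    """Return error labels for a toy trace; an empty list means all checks passed."""
    # 1. Schema validity (top-level keys only; "schema" is a label added here).
    if not REQUIRED_KEYS.issubset(trace):
        return ["schema"]
    errors: List[str] = []
    s0, ds, s1 = trace["s0"], trace["delta_s"], trace["s1"]

    # 2. State grounding: every object in s0 must exist in the scene annotation.
    if not set(s0) <= scene_objects:
        errors.append("object")

    # 3. Transition consistency: s1 must equal s0 with dy applied to the moved object.
    for name, y0 in s0.items():
        expected = y0 + (ds["dy"] if name == ds["object"] else 0.0)
        if name not in s1 or abs(s1[name] - expected) > tol:
            errors.append("transition")
            break

    # 4. Answer-trace compatibility: the answer must match the direction implied by s0 and s1.
    obj = ds["object"]
    if obj in s0 and obj in s1:
        if s1[obj] < s0[obj]:
            direction = "down"
        elif s1[obj] > s0[obj]:
            direction = "up"
        else:
            direction = "still"
        if trace["answer"] != direction:
            errors.append("faithfulness")
    return errors


scene = {"ball", "table"}
good = {"s0": {"ball": 2.0}, "delta_s": {"object": "ball", "dy": -1.5},
        "s1": {"ball": 0.5}, "answer": "down"}
# Correct answer, but s1 does not follow from s0 and delta_s.
hidden = {"s0": {"ball": 2.0}, "delta_s": {"object": "ball", "dy": -1.5},
          "s1": {"ball": 1.0}, "answer": "down"}

good_errors = verify(good, scene)
hidden_errors = verify(hidden, scene)
```

The `hidden` trace is the case answer-only scoring cannot see: its answer is right, yet the transition check fails.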

Key Results

  • 35% of correct answers from mid-tier models are backed by physically invalid traces
  • 7 percentage points of trace validity can be recovered via verifier-guided reranking without sacrificing answer accuracy
  • 41% relative reduction in hidden inconsistency via trace-level preference tuning
  • Evaluated multiple VLMs on controlled and external physical-reasoning examples
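
The headline numbers can be read as simple ratios. The sketch below assumes that "hidden inconsistency" means the share of correct answers whose trace fails verification, and that "relative reduction" is (before − after) / before. Both definitions are assumptions, and the counts are made up for illustration rather than taken from the paper.

```python
from fractions import Fraction
from typing import Iterable, Tuple


def hidden_inconsistency_rate(results: Iterable[Tuple[bool, bool]]) -> Fraction:
    """results holds (answer_correct, trace_valid) pairs, one per example."""
    validity_of_correct = [valid for correct, valid in results if correct]
    if not validity_of_correct:
        raise ValueError("no correct answers to audit")
    invalid = sum(1 for valid in validity_of_correct if not valid)
    return Fraction(invalid, len(validity_of_correct))


def relative_reduction(before: Fraction, after: Fraction) -> Fraction:
    """Fraction of the original rate that was removed."""
    return (before - after) / before


# Made-up counts: 10 correct answers each time, 4 then 3 with invalid traces.
baseline = [(True, True)] * 6 + [(True, False)] * 4 + [(False, False)] * 5
tuned = [(True, True)] * 7 + [(True, False)] * 3 + [(False, False)] * 5

rate_before = hidden_inconsistency_rate(baseline)  # 4/10
rate_after = hidden_inconsistency_rate(tuned)      # 3/10
reduction = relative_reduction(rate_before, rate_after)
```

Incorrect answers are excluded from the denominator, so the rate measures only failures that answer accuracy would hide.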

Limitations and Future Work

  • The framework depends on verifier quality, and the verifier may introduce its own biases
  • TraceBank uses synthetic scenarios, which may not fully capture real-world complexity
  • Future work could extend to more complex physical phenomena and multi-step reasoning chains

Relevance to Patrick's Research

This work provides a critical evaluation framework for VLMs that claim world model capabilities. The key insight, that correct answers often mask incorrect physical reasoning, is directly relevant to assessing whether VLMs truly build world models or just pattern-match to right answers. The framework offers a more rigorous way to audit physical understanding beyond benchmark accuracy scores.