This paper introduces W≡W (World Models in Words), an evaluation framework for auditing language-expressed physical commitments of Vision-Language Models (VLMs). Rather than scoring only I,q → a (input, question to answer), the framework asks models to produce a typed trace: I,q → (s₀, Δs, s₁, a) representing initial state, state transition, resulting state, and answer. A hybrid verifier checks schema validity, state grounding, transition consistency, and answer-trace compatibility. The key finding: 35% of correct answers from mid-tier models are backed by physically invalid traces, revealing failures that answer-only evaluation misses.
- Typed trace evaluation: instead of scoring only final answers, the framework requires VLMs to produce explicit physical state transitions
The framework requires VLMs to produce a typed trace with four components:
- s₀: the initial state
- Δs: the state transition
- s₁: the resulting state
- a: the answer
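To make the trace (s₀, Δs, s₁, a) concrete, here is a minimal sketch of how it might be represented as typed records. The class names, fields, and the 2D position encoding are illustrative assumptions; the summary does not specify the paper's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectState:
    """One object's properties in a scene state (hypothetical schema)."""
    name: str
    position: tuple[float, float]  # assumed 2D coordinates, arbitrary units


@dataclass(frozen=True)
class Transition:
    """The state change Δs: which object changes, why, and by how much."""
    target: str                        # name of the object that changes
    cause: str                         # e.g. "gravity" or "push"
    displacement: tuple[float, float]  # claimed change in position


@dataclass(frozen=True)
class TypedTrace:
    """The four components the model must commit to in words."""
    s0: dict[str, ObjectState]  # initial state
    delta: Transition           # state transition Δs
    s1: dict[str, ObjectState]  # resulting state
    answer: str                 # final answer a


# Toy example: a ball resting at height 2 falls to the floor.
trace = TypedTrace(
    s0={"ball": ObjectState("ball", (0.0, 2.0))},
    delta=Transition(target="ball", cause="gravity", displacement=(0.0, -2.0)),
    s1={"ball": ObjectState("ball", (0.0, 0.0))},
    answer="ball",
)
```

Typing each component separately is what lets a checker inspect the intermediate commitments rather than only the final answer.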
A hybrid verifier then checks:
- Schema validity
- State grounding
- Transition consistency
- Answer-trace compatibility
The verifier produces typed error labels: object, relation, force, transition, temporal, unit/scale, and faithfulness errors.
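The following is a toy, purely rule-based sketch of how such checks could emit typed error labels. It is not the paper's verifier: the dictionary trace format (object name mapped to height), the `verify` function, and the mapping from each check to a particular label are all assumptions made for illustration.

```python
from enum import Enum


class TraceError(Enum):
    """The typed error labels listed in the summary."""
    OBJECT = "object"
    RELATION = "relation"
    FORCE = "force"
    TRANSITION = "transition"
    TEMPORAL = "temporal"
    UNIT_SCALE = "unit/scale"
    FAITHFULNESS = "faithfulness"


REQUIRED_KEYS = ("s0", "delta", "s1", "answer")


def verify(trace: dict, scene_objects: set[str]) -> list[TraceError]:
    """Return typed error labels; an empty list means these toy checks passed."""
    # Schema validity: all four components must be present.
    if any(key not in trace for key in REQUIRED_KEYS):
        raise ValueError("trace is missing a required component")

    errors = []
    s0, delta, s1 = trace["s0"], trace["delta"], trace["s1"]

    # State grounding (assumed mapping): every object named in s0
    # must exist in the annotated scene.
    if not set(s0) <= scene_objects:
        errors.append(TraceError.OBJECT)

    # Transition consistency: applying delta to s0 must reproduce s1.
    expected = {name: height + delta.get(name, 0.0) for name, height in s0.items()}
    if expected != s1:
        errors.append(TraceError.TRANSITION)

    # Answer-trace compatibility (assumed mapping): the answer must name
    # an object that the trace itself says has moved.
    moved = {name for name in s0 if s1.get(name) != s0[name]}
    if trace["answer"] not in moved:
        errors.append(TraceError.FAITHFULNESS)

    return errors


scene = {"ball", "table"}
good = {"s0": {"ball": 2.0, "table": 1.0}, "delta": {"ball": -2.0},
        "s1": {"ball": 0.0, "table": 1.0}, "answer": "ball"}
# Same answer, but the claimed end state contradicts the stated transition.
bad = {"s0": {"ball": 2.0, "table": 1.0}, "delta": {"ball": -2.0},
       "s1": {"ball": 3.0, "table": 1.0}, "answer": "ball"}

good_errors = verify(good, scene)  # no errors
bad_errors = verify(bad, scene)    # one transition error
```

The `bad` trace gives the same answer as `good` yet fails the transition check, which is the kind of case answer-only scoring would count as a success.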
- 35% of correct answers from mid-tier models are backed by physically invalid traces
- The framework depends on verifier quality and may have its own biases
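Because the reported rate is only as reliable as the verifier's verdicts, it helps to spell out how such a rate would be computed. The sketch below conditions on answer correctness and then counts invalid traces; the record fields and the sample data are hypothetical and are not results from the paper.

```python
def invalid_trace_rate(records: list[dict]) -> float:
    """Fraction of correct-answer records whose trace failed verification."""
    correct = [r for r in records if r["answer_correct"]]
    if not correct:
        return 0.0  # no correct answers, so the rate is undefined; report 0
    invalid = sum(1 for r in correct if not r["trace_valid"])
    return invalid / len(correct)


# Made-up evaluation records, for illustration only.
records = [
    {"answer_correct": True, "trace_valid": True},
    {"answer_correct": True, "trace_valid": False},
    {"answer_correct": True, "trace_valid": True},
    {"answer_correct": True, "trace_valid": True},
    {"answer_correct": False, "trace_valid": False},  # excluded: wrong answer
]

rate = invalid_trace_rate(records)  # 1 invalid out of 4 correct
```

Any verifier false positive or false negative shifts the `trace_valid` flags and therefore this rate directly, which is why verifier quality bounds how much weight the statistic can bear.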
This work provides an evaluation framework for VLMs that are claimed to have world-model capabilities. Its key insight, that correct answers often mask incorrect physical reasoning, bears directly on whether VLMs truly build world models or merely pattern-match their way to right answers. The framework offers a more rigorous way to audit physical understanding than benchmark accuracy scores alone.