PRISM: PRior-guided Imagination Sampling in World Models

title: "PRISM: PRior-guided Imagination Sampling in World Models"

arxiv_id: "2606.07974"

date: "2026-06-06"

tags: [world-model, jepa, model-based-rl, mpc, planning, continuous-control]

---

Abstract

PRISM tackles model-based continuous control with a learned latent (JEPA-style) world model and argues that the bottleneck is not the simulator's fidelity but which candidate actions the planner evaluates. It re-uses the world model's frozen encoder to predict a state-conditioned Gaussian action prior, then fuses that prior into the planner's sampling distribution through a parameter-free, precision-weighted Product-of-Gaussians update — no extra VLM, no extra visual encoder.

Key Contributions

Action prior from the world model itself: A lightweight MLP head on the frozen JEPA encoder predicts a state-conditioned Gaussian prior, eliminating the need for an independent VLM or visual encoder
Precision-weighted Product-of-Gaussians fusion: Closed-form, parameter-free integration of the learned prior with the planner's sampling distribution — confident where the prior is confident, hands off where it isn't
Task-agnostic and minimal: One extra MLP, no new datasets, no auxiliary losses; applicable to any latent world model that exposes a frozen encoder
Strong empirical gains on standard continuous-control tasks: +35 percentage points on Cube, +32 percentage points on PushT over vanilla world-model-based MPC, with no significant inference overhead

Method Details

The architecture is a standard JEPA-style latent world model — a vision encoder that maps observations to latents, and a latent dynamics predictor that rolls out future latent states conditioned on actions. On top of this:

Frozen encoder + prior MLP: The encoder weights are frozen; a small MLP is trained to read the encoder's representation of the current state and emit the parameters (μ, σ) of a state-conditioned Gaussian over candidate actions. Training uses the same dataset as the world model: the expert demonstrations supply action labels for supervising the prior rather than being used as demonstrations in their own right.
Planner sampling distribution: A model-predictive-control (MPC) planner samples candidate action sequences from a base distribution (typically CEM with isotropic Gaussian proposals).
Product-of-Gaussians fusion (precision-weighted): At each planning step, the fused precision is the sum of the base distribution's precision (1/σ²) and the prior's precision, and the fused mean is the precision-weighted average of the two means. This is parameter-free, closed-form, and degrades gracefully to the base sampler when the prior is uncertain (large σ, hence low precision).
Closed-loop rollouts: The world model scores each sampled trajectory; the first action of the best-scoring sequence is executed, and the process repeats.
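
The fusion and sampling steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: it treats both distributions as diagonal Gaussians over a single action (no planning horizon), and `fuse_gaussians`, `plan_step`, and `score_fn` are hypothetical names. `score_fn` stands in for the world model's trajectory scoring.

```python
import numpy as np

def fuse_gaussians(mu_base, sigma_base, mu_prior, sigma_prior):
    """Product of two diagonal Gaussians: precisions add, and the
    fused mean is the precision-weighted average of the two means."""
    prec_base = 1.0 / np.square(sigma_base)
    prec_prior = 1.0 / np.square(sigma_prior)
    prec_fused = prec_base + prec_prior
    mu_fused = (prec_base * mu_base + prec_prior * mu_prior) / prec_fused
    sigma_fused = np.sqrt(1.0 / prec_fused)
    return mu_fused, sigma_fused

def plan_step(mu_base, sigma_base, mu_prior, sigma_prior, score_fn, n_samples, rng):
    """Sample candidate actions from the fused distribution and return
    the best one under score_fn (a toy stand-in for world-model scoring)."""
    mu, sigma = fuse_gaussians(mu_base, sigma_base, mu_prior, sigma_prior)
    candidates = rng.normal(mu, sigma, size=(n_samples,) + mu.shape)
    scores = np.array([score_fn(a) for a in candidates])
    return candidates[np.argmax(scores)]

# Two action dimensions: the prior is confident in the first (sigma = 0.1)
# and uncertain in the second (sigma = 10.0).
mu_base, sigma_base = np.zeros(2), np.ones(2)
mu_prior, sigma_prior = np.array([1.0, -0.5]), np.array([0.1, 10.0])

mu_fused, sigma_fused = fuse_gaussians(mu_base, sigma_base, mu_prior, sigma_prior)
print(mu_fused)     # approx. [ 0.990, -0.005]
print(sigma_fused)  # approx. [ 0.100,  0.995]

# Toy scoring function (assumption): negative squared distance to a target action.
target = np.array([0.9, 0.2])
best = plan_step(mu_base, sigma_base, mu_prior, sigma_prior,
                 lambda a: -np.sum((a - target) ** 2),
                 n_samples=64, rng=np.random.default_rng(0))
```

In the first dimension the fused distribution follows the confident prior almost exactly; in the second it stays close to the base sampler, which is the graceful degradation described above.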

The key insight is that the world model already encodes the agent's action intuition — extracting it via a single MLP head avoids the architectural bloat of pairing the world model with a separate large VLM.
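
To make the "single MLP head" concrete, here is a rough NumPy sketch of a head that maps a frozen encoder latent to (μ, σ). Everything specific in it is an assumption for illustration: the layer sizes, the tanh hidden layer, the softplus used to keep σ positive, and the Gaussian negative log-likelihood loss are not taken from the paper, and `init_prior_head`, `prior_head`, and `gaussian_nll` are hypothetical names.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def init_prior_head(latent_dim, hidden_dim, action_dim, rng):
    """Random weights for a one-hidden-layer MLP (sizes are assumptions)."""
    return {
        "W1": rng.normal(0.0, 1.0 / np.sqrt(latent_dim), (latent_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 1.0 / np.sqrt(hidden_dim), (hidden_dim, 2 * action_dim)),
        "b2": np.zeros(2 * action_dim),
    }

def prior_head(params, z):
    """Map encoder latents z of shape (batch, latent_dim) to (mu, sigma)."""
    h = np.tanh(z @ params["W1"] + params["b1"])
    out = h @ params["W2"] + params["b2"]
    mu, raw_sigma = np.split(out, 2, axis=-1)
    sigma = softplus(raw_sigma) + 1e-4  # keep sigma strictly positive
    return mu, sigma

def gaussian_nll(mu, sigma, action):
    """Diagonal-Gaussian negative log-likelihood, dropping the constant term.
    The demonstration actions act as the supervised labels here."""
    per_dim = np.log(sigma) + 0.5 * np.square((action - mu) / sigma)
    return np.mean(np.sum(per_dim, axis=-1))

rng = np.random.default_rng(0)
params = init_prior_head(latent_dim=32, hidden_dim=64, action_dim=2, rng=rng)
z = rng.normal(size=(4, 32))        # stand-in for frozen encoder outputs
actions = rng.normal(size=(4, 2))   # stand-in for demonstration actions
mu, sigma = prior_head(params, z)
loss = gaussian_nll(mu, sigma, actions)
```

Only the head's parameters would be updated during training; the encoder that produces `z` stays frozen, which is what keeps the addition to one small module.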

Key Results

Cube task: +35 percentage points success rate over vanilla world-model-based MPC
PushT task: +32 percentage points success rate over vanilla world-model-based MPC
No significant inference overhead versus the baseline MPC planner
Ablations isolate the contribution of (a) re-using the world model encoder vs. a fresh encoder, and (b) Product-of-Gaussians fusion vs. mean-only or variance-only fusion
PDF: https://arxiv.org/pdf/2606.07974

Limitations and Future Work

Evaluated on short-horizon continuous-control tasks (Cube, PushT); scaling to longer-horizon tasks, language-conditioned goals, or higher-dimensional action spaces is open
The prior is Gaussian and unimodal; multi-modal action distributions (e.g., branching strategies) are not explicitly modeled
The paper assumes the world model is JEPA-style latent; applicability to pixel-space generative world models (e.g., video diffusion) is not explored

Relevance to Patrick's Research

PRISM sits at the intersection of Patrick's interests: JEPA-style latent world models, model-based planning, and the question of how a world model is actually *used* rather than how accurately it predicts. The architectural minimalism — one MLP, no new modules — is a useful counterpoint to the "throw a VLM at it" trend in action prior learning. The +35pp / +32pp numbers also provide a clean baseline for comparing any future action-prior work.