arXiv: 2606.00267
Authors: Junwon Seo, Sushant Veer, Ran Tian, Wenhao Ding, Apoorva Sharma, Karen Leung, Edward Schmerling, Marco Pavone, Andrea Bajcsy
Submitted: 29 May 2026
Project Page: https://junwon.me/StressDream/
Video world models (WMs) have shown promise for policy evaluation and improvement by imagining realistic future observations conditioned on ego-robot actions. While WMs model distributions over futures, policy evaluation typically relies on nominal imaginations, which can miss high-impact outcomes unless prohibitively many samples are drawn. StressDream steers imaginations toward high-impact yet plausible outcomes specified at inference time by optimizing the initial noise of diffusion-based WMs. The optimization uses two complementary objectives: a semantic objective with a Vision-Language Model (VLM) that provides informative gradients by reasoning about the generated video, and a plausibility objective that prevents optimized noise from drifting out-of-distribution (OOD). Evaluated on state-of-the-art video world models for autonomous driving and robotic manipulation, StressDream effectively steers imaginations toward high-impact yet plausible outcomes (e.g., task failures), enabling robust policy evaluation by identifying actions whose plausible futures include undesirable outcomes.
1. StressDream framework: Optimizes initial noise of diffusion-based video world models to steer imaginations toward user-specified high-impact outcomes at inference time
2. Dual-objective optimization: Combines (a) semantic objective via VLM reasoning about generated video content and (b) plausibility objective to prevent OOD drift
3. Broad applicability: Demonstrated on both autonomous driving and robotic manipulation domains with state-of-the-art video world models
Standard video world models generate future observations conditioned on actions, but nominal sampling mostly returns high-probability futures and therefore misses rare but high-impact outcomes. Drawing enough samples to find these outcomes is computationally prohibitive.
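To make the sampling cost concrete, the sketch below assumes the target event appears independently in each nominal sample with a fixed probability `p_event`. This is an illustrative assumption, not a figure from the paper, and the helper name `samples_needed` is hypothetical.

```python
import math

def samples_needed(p_event: float, confidence: float = 0.95) -> int:
    """Smallest n such that n independent samples contain the event
    at least once with probability >= confidence."""
    if not 0.0 < p_event < 1.0:
        raise ValueError("p_event must be in (0, 1)")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    # P(no occurrence in n samples) = (1 - p)^n, required to be <= 1 - confidence.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_event))

print(samples_needed(0.001))  # 2995
```

Under this assumption, an event with probability 0.001 needs roughly 3,000 nominal rollouts for a 95% chance of appearing once, and every rollout is a full video generation.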
StressDream optimizes the initial noise vector of diffusion-based video WMs so that the generated video depicts a user-specified target event (e.g., "task failure").
Two complementary objectives:
1. Semantic objective: Uses a VLM to score how well the generated video depicts the target event. The VLM provides gradient information by reasoning about the semantic content of generated frames.
2. Plausibility objective: Ensures the optimized noise remains on-manifold and does not drift OOD. This prevents the optimizer from producing implausible videos that do not represent likely outcomes of the specified action.
Optimization: The initial noise is treated as a learnable parameter and optimized at inference time to maximize the semantic score while maintaining plausibility.
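The overall shape of such an inference-time noise optimization can be sketched as gradient ascent on a combined objective. Everything concrete below is an assumption for illustration: the linear `toy_semantic_score` stands in for "VLM score of the video generated from the noise", and the norm-shell penalty `(||z|| - sqrt(d))^2` stands in for the plausibility objective, whose actual form is not given in this note. In a real system the semantic gradient would come from backpropagating through the VLM and the world model rather than from a closed-form toy function.

```python
import math
import random

def toy_semantic_score(z, w):
    # Hypothetical stand-in for the VLM's score of the video generated from z.
    return sum(wi * zi for wi, zi in zip(w, z))

def toy_semantic_grad(z, w):
    # Gradient of the linear toy score with respect to z.
    return list(w)

def norm(z):
    return math.sqrt(sum(zi * zi for zi in z))

def shell_penalty_grad(z):
    # Assumed plausibility term: (||z|| - sqrt(d))^2, which keeps z near the
    # radius where standard Gaussian noise of dimension d concentrates.
    r = norm(z)
    if r == 0.0:
        return [0.0] * len(z)
    coeff = 2.0 * (r - math.sqrt(len(z))) / r
    return [coeff * zi for zi in z]

def optimize_noise(z0, w, lam=1.0, lr=0.05, steps=200):
    """Gradient ascent on: semantic score - lam * plausibility penalty."""
    z = list(z0)
    for _ in range(steps):
        g_sem = toy_semantic_grad(z, w)
        g_pen = shell_penalty_grad(z)
        z = [zi + lr * (gs - lam * gp) for zi, gs, gp in zip(z, g_sem, g_pen)]
    return z

rng = random.Random(0)
d = 16
z_init = [rng.gauss(0.0, 1.0) for _ in range(d)]   # nominal initial noise
w = [1.0 / math.sqrt(d)] * d                       # toy "target event" direction
z_opt = optimize_noise(z_init, w)

print("score:", toy_semantic_score(z_init, w), "->", toy_semantic_score(z_opt, w))
print("norm: ", norm(z_init), "->", norm(z_opt), "(sqrt(d) =", math.sqrt(d), ")")
```

In this toy setting the weight `lam` trades off the two terms: a larger value keeps the noise norm closer to `sqrt(d)`, while a smaller value lets the semantic term pull the noise further from where nominal samples concentrate.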
- Policy evaluation: Identify actions whose plausible futures include undesirable outcomes (failures)
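The paper demonstrates StressDream on state-of-the-art video world models for autonomous driving and robotic manipulation.
Results show StressDream successfully steers imaginations toward high-impact yet plausible outcomes (e.g., task failures).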
The approach enables robust policy evaluation by finding low-probability but high-consequence outcomes that nominal sampling would miss.
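One way to turn steered imaginations into a policy-evaluation signal is to flag any action that has at least one future that both depicts the target event and remains plausible. The sketch below is hypothetical throughout: the `SteeredRollout` record, both score fields, the thresholds, and the example values are made up for illustration and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class SteeredRollout:
    action: str
    failure_score: float   # hypothetical semantic score in [0, 1]
    plausibility: float    # hypothetical plausibility score in [0, 1]

def flag_risky_actions(rollouts, score_min=0.8, plausibility_min=0.5):
    """Return actions with at least one failing AND plausible steered future."""
    risky = []
    for r in rollouts:
        is_failure = r.failure_score >= score_min
        is_plausible = r.plausibility >= plausibility_min
        if is_failure and is_plausible and r.action not in risky:
            risky.append(r.action)
    return risky

# Made-up example values, for illustration only.
rollouts = [
    SteeredRollout("push_left", 0.92, 0.71),
    SteeredRollout("push_left", 0.30, 0.90),
    SteeredRollout("push_right", 0.95, 0.12),  # depicts failure, but implausible
    SteeredRollout("lift", 0.15, 0.88),
]
print(flag_risky_actions(rollouts))  # ['push_left']
```

The plausibility check matters here: without it, `push_right` would also be flagged on the basis of a future that is unlikely to follow from that action.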
1. VLM dependence: The quality of semantic steering depends on the VLM's ability to understand and reason about generated video content
- Relevant to video world models for robotics (e.g., Voyager, DeepMind's Genie, Genesis)
- DriveDreamer (autonomous driving world model)
---
*Note generated from arXiv:2606.00267 (29 May 2026)*