StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement

arXiv: 2606.00267
Authors: Junwon Seo, Sushant Veer, Ran Tian, Wenhao Ding, Apoorva Sharma, Karen Leung, Edward Schmerling, Marco Pavone, Andrea Bajcsy
Submitted: 29 May 2026
Project Page: https://junwon.me/StressDream/

Abstract

Video world models (WMs) have shown promise for policy evaluation and improvement by imagining realistic future observations conditioned on ego-robot actions. While WMs model distributions over futures, policy evaluation typically relies on nominal imaginations, which can miss high-impact outcomes unless prohibitively many samples are drawn. StressDream steers imaginations toward high-impact yet plausible outcomes specified at inference time by optimizing the initial noise of diffusion-based WMs. The optimization uses two complementary objectives: a semantic objective with a Vision-Language Model (VLM) that provides informative gradients by reasoning about the generated video, and a plausibility objective that prevents optimized noise from drifting out-of-distribution (OOD). Evaluated on state-of-the-art video world models for autonomous driving and robotic manipulation, StressDream effectively steers imaginations toward high-impact yet plausible outcomes (e.g., task failures), enabling robust policy evaluation by identifying actions whose plausible futures include undesirable outcomes.

Key Contributions

1. StressDream framework: Optimizes the initial noise of diffusion-based video world models to steer imaginations toward user-specified high-impact outcomes at inference time

2. Dual-objective optimization: Combines (a) a semantic objective via VLM reasoning about generated video content and (b) a plausibility objective to prevent OOD drift

3. Broad applicability: Demonstrated on both autonomous driving and robotic manipulation domains with state-of-the-art video world models

Method

Problem

Standard video world models generate future observations conditioned on actions, but nominal sampling mostly returns typical, high-probability futures and rarely surfaces rare but high-impact outcomes. Drawing enough samples to find these outcomes reliably is computationally prohibitive.
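To make the sampling cost concrete, here is a minimal sketch of how many independent nominal rollouts are needed to see at least one outcome of probability p with a given confidence. The function name and the probabilities are illustrative assumptions, not figures from the paper.

```python
import math


def samples_needed(p: float, confidence: float = 0.95) -> int:
    """Smallest N with P(at least one hit in N i.i.d. samples) >= confidence.

    P(at least one hit) = 1 - (1 - p)**N, so
    N >= log(1 - confidence) / log(1 - p).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be in (0, 1)")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


# Assumed event probabilities, chosen only for illustration.
for p in (0.1, 0.01, 0.001):
    print(f"p = {p}: {samples_needed(p)} samples")
```

The required count grows roughly as 1/p, so each tenfold drop in event probability costs about ten times as many video generations.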

StressDream Approach

StressDream optimizes the initial noise vector z₀ of diffusion-based video WMs to generate videos depicting a user-specified target event (e.g., "task failure").

Two complementary objectives:

1. Semantic objective: Uses a VLM to score how well the generated video depicts the target event. The VLM provides gradient information by reasoning about the semantic content of generated frames.

2. Plausibility objective: Ensures the optimized noise remains on-manifold and does not drift OOD. This prevents implausible videos that do not represent likely outcomes of the specified action.
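One way to write the two objectives as a single problem is sketched below. The notation is assumed here for illustration and is not taken from the paper: G(z_0, a) denotes the video the world model generates from initial noise z_0 and actions a, S_VLM scores how well that video depicts the target event c, R penalizes noise that is atypical under the model's noise prior, and λ ≥ 0 weights the trade-off.

```latex
z_0^{\star} \;=\; \arg\max_{z_0} \; S_{\mathrm{VLM}}\big(G(z_0, a),\, c\big) \;-\; \lambda\, R(z_0)
```

Setting λ = 0 recovers purely semantic steering, and larger λ keeps the solution closer to noise the model could plausibly have sampled.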

Optimization: The initial noise z₀ is treated as a learnable parameter and optimized at inference time to maximize the semantic score while maintaining plausibility.
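The toy sketch below illustrates the shape of this optimization loop, under strong simplifying assumptions. The real method backpropagates through a diffusion world model and a VLM. Here both are collapsed into a hypothetical linear score w·z, and the plausibility term is an assumed norm penalty that keeps ||z||²/d near 1, where standard Gaussian noise concentrates. It is not the paper's plausibility objective.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def semantic_score(z, w):
    """Toy stand-in (assumption) for the VLM score of the video generated from z."""
    return dot(w, z)


def norm_ratio(z):
    """||z||^2 / d, which concentrates near 1 for z ~ N(0, I_d)."""
    return dot(z, z) / len(z)


def steer(z0, w, lam, lr=0.05, steps=300):
    """Gradient ascent on J(z) = w.z - lam * (||z||^2 / d - 1)^2."""
    d = len(z0)
    z = list(z0)
    for _ in range(steps):
        # Gradient of the penalty is (4 * lam * (||z||^2/d - 1) / d) * z.
        radial = 4.0 * lam * (norm_ratio(z) - 1.0) / d
        z = [zi + lr * (wi - radial * zi) for zi, wi in zip(z, w)]
    return z


random.seed(0)
d = 32
z0 = [random.gauss(0.0, 1.0) for _ in range(d)]      # nominal initial noise
w_raw = [random.gauss(0.0, 1.0) for _ in range(d)]
w_len = math.sqrt(dot(w_raw, w_raw))
w = [wi / w_len for wi in w_raw]                      # unit "target event" direction

z_free = steer(z0, w, lam=0.0)    # semantic objective only
z_reg = steer(z0, w, lam=50.0)    # semantic + plausibility

for name, z in (("initial", z0),
                ("semantic only", z_free),
                ("semantic + plausibility", z_reg)):
    print(f"{name:>24}: score = {semantic_score(z, w):6.2f}, "
          f"||z||^2/d = {norm_ratio(z):5.2f}")
```

In this toy setting the semantic-only run reaches the highest score, but its norm leaves the range typical of Gaussian noise. The regularized run raises the score while keeping ||z||²/d close to 1, which is the behavior the plausibility objective is meant to enforce.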

Applications

- Policy evaluation: Identify actions whose plausible futures include undesirable outcomes (failures)
- Policy improvement: By understanding what can go wrong, inform better action selection
- Stress testing: Systematically explore edge cases in simulation before real-world deployment
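As a rough illustration of the policy-evaluation use case, the sketch below flags a candidate action when steering finds a future that both depicts the failure and stays plausible. The data class, thresholds, action names, and scores are all hypothetical. They are not results or interfaces from the paper.

```python
from dataclasses import dataclass


@dataclass
class SteeredOutcome:
    action_id: str
    failure_score: float  # assumed semantic score of the steered video, in [0, 1]
    plausibility: float   # assumed plausibility of the steered noise, in [0, 1]


def flag_risky_actions(outcomes, score_min=0.8, plausibility_min=0.5):
    """Return actions whose steered future shows the failure and remains plausible."""
    return [
        o.action_id
        for o in outcomes
        if o.failure_score >= score_min and o.plausibility >= plausibility_min
    ]


# Made-up values for illustration only.
outcomes = [
    SteeredOutcome("lane_change_left", failure_score=0.92, plausibility=0.71),
    SteeredOutcome("keep_lane", failure_score=0.35, plausibility=0.88),
    SteeredOutcome("hard_brake", failure_score=0.90, plausibility=0.12),
]
print(flag_risky_actions(outcomes))  # ['lane_change_left']
```

The second action is not flagged because steering could not produce the failure, and the third is not flagged because the failure was only reachable with implausible noise.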

Key Results

The paper demonstrates StressDream on:

- Autonomous driving scenarios (with DriveDream-style world models)
- Robotic manipulation tasks

Results show StressDream successfully steers imaginations toward:

- Task failures and other high-impact outcomes
- Plausible futures specified by text prompts at inference time

The approach enables robust policy evaluation by finding low-probability but high-consequence outcomes that nominal sampling would miss.

Limitations and Future Work

1. VLM dependence: The quality of semantic steering depends on the VLM's ability to understand and reason about generated video content

2. Optimization overhead: Inference-time optimization adds computational cost compared to a single forward pass

3. Plausibility boundary: The trade-off between semantic alignment and plausibility maintenance requires careful tuning

4. Generalization: The approach may not generalize to all types of high-impact events or all world model architectures

Relevance to Patrick's Research

- Relevant to video world models for robotics (Voyager, DeepMind's Genie, Genesis)
- Novel approach for test-time steering of diffusion-based world models
- Directly addresses policy evaluation in simulation by identifying failure modes before real-world deployment
- Dual-objective optimization (semantic + plausibility) is a pattern applicable to other world model applications
- Authors include researchers from Stanford and NVIDIA, who are connected to the broader world model research ecosystem

References

- DriveDream (autonomous driving world model)
- Video generation models as world simulators (Sora-era work)
- Cosmos World Foundation Model (NVIDIA)

---

*Note generated from arXiv:2606.00267 (29 May 2026)*