PLAN-S: Bridging Planning with Latent Style Dynamics for Autonomous Driving World Models

Abstract

Latent world models (LWMs) have strengthened end-to-end autonomous driving by forecasting compact scene dynamics for downstream planning. However, existing LWM-based planners usually generate trajectories directly from entangled latent representations. This compact latent-to-planner pathway lacks explicit modeling of risk, drivability, and diverse style preferences, making driving-style dynamics difficult to supervise, inspect, or modulate before a final trajectory is selected. We propose PLAN-S (PLANning with latent Style dynamics), a planner-facing bridge that addresses this compactness-controllability dilemma by decoding a style-conditioned, four-channel semantic cost map from the latent representation. The cost map is conditioned on ego state and driving style and is consumed upstream of the planning decision through two host-side interfaces: attention-level fusion for regression planners and reward-level fusion for anchor-score planners. We validate PLAN-S on two architecturally distinct hosts, ResWorld on nuScenes and WoTE on NAVSIM, while keeping the host backbones frozen to isolate the contribution of the proposed bridge. On nuScenes, PLAN-S reduces L2 at every horizon over the baseline, with 0.55 m average L2 and a 42% relative reduction in the 3 s collision rate. On NAVSIM, the rule-cost variant reaches 89.4 Predictive Driver Model Score (PDMS), while the learned cost variant provides complementary gains on baseline-challenging scenes.

Key Contributions

- Style-Conditioned Cost Map: Decodes a four-channel semantic cost map from the latent representation, conditioned on ego state and driving style preferences.
- Dual Interface Fusion: Integrates with host planners via attention-level fusion (regression planners) or reward-level fusion (anchor-score planners), keeping host backbones frozen.
- Interpretable Style Dynamics: Enables explicit supervision, inspection, and modulation of driving-style dynamics before final trajectory selection.

Method Details

PLAN-S bridges the gap between latent world models and downstream planning by introducing a style-conditioned semantic cost map:

- Input: Compact latent representation from the LWM (frozen backbone)
- Output: Four-channel semantic cost map (representing different aspects of drivability, risk, and style)
- Conditioning: Ego state and driving style preferences
- Host interfaces:
  - Attention-level fusion for regression planners
  - Reward-level fusion for anchor-score planners
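The reward-level interface can be illustrated with a minimal sketch. It assumes that each anchor trajectory already has a host score, that its waypoints have been mapped to cost-map cells, and that the four channels are collapsed with a weighted sum. The function names, the channel weights, and the penalty scale `lam` are all hypothetical; the source does not specify how PLAN-S combines channels or scores.

```python
from typing import List, Sequence, Tuple

Grid = List[List[float]]            # one cost channel, indexed [row][col]
Trajectory = List[Tuple[int, int]]  # waypoints already mapped to (row, col) cells


def trajectory_cost(channels: Sequence[Grid], weights: Sequence[float],
                    traj: Trajectory) -> float:
    """Mean weighted cost over the cells a trajectory visits.

    The weighted sum over channels is an assumption made for illustration.
    """
    if not traj:
        raise ValueError("trajectory must contain at least one waypoint")
    total = 0.0
    for row, col in traj:
        total += sum(w * ch[row][col] for w, ch in zip(weights, channels))
    return total / len(traj)


def fuse_rewards(base_scores: Sequence[float], anchors: Sequence[Trajectory],
                 channels: Sequence[Grid], weights: Sequence[float],
                 lam: float = 1.0) -> List[float]:
    """Subtract a scaled cost-map penalty from each anchor's host score."""
    return [score - lam * trajectory_cost(channels, weights, anchor)
            for score, anchor in zip(base_scores, anchors)]


def select_anchor(scores: Sequence[float]) -> int:
    """Index of the highest-scoring anchor."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Toy example: a 2x2 map whose first channel marks cell (0, 1) as costly.
zeros = [[0.0, 0.0], [0.0, 0.0]]
channels = [[[0.0, 1.0], [0.0, 0.0]], zeros, zeros, zeros]
weights = [1.0, 1.0, 1.0, 1.0]            # hypothetical channel weights
anchors = [[(1, 0), (0, 0)],              # stays in the left column
           [(1, 1), (0, 1)]]              # ends in the costly cell
base_scores = [0.5, 0.75]                 # host planner prefers anchor 1

fused = fuse_rewards(base_scores, anchors, channels, weights)
print(select_anchor(base_scores), select_anchor(fused))
```

In this toy case the host alone prefers the second anchor, while the fused scores prefer the first because the second anchor passes through the high-cost cell. Because the penalty is applied to scores only, the host backbone can stay frozen, which matches the setup described above.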

Validated on two distinct host architectures:

1. ResWorld on nuScenes
2. WoTE on NAVSIM
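For the attention-level interface used with regression planners, one plausible reading is that a planner query attends over tokens derived from the cost map and adds the result back as a residual. The sketch below shows single-head scaled dot-product attention under that assumption. The token layout, the absence of learned projections, and the names `cross_attend` and `fuse_query` are hypothetical; the source does not describe the actual fusion layer.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def cross_attend(query: Vector, keys: Sequence[Vector],
                 values: Sequence[Vector]) -> Vector:
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    attn = softmax(logits)
    dim = len(values[0])
    return [sum(a * v[d] for a, v in zip(attn, values)) for d in range(dim)]


def fuse_query(query: Vector, cost_tokens: Sequence[Vector]) -> Vector:
    """Add cost-map context to the planner query as a residual.

    Assumption: cost tokens serve as both keys and values and share the
    query's dimensionality; a real layer would use learned projections.
    """
    context = cross_attend(query, cost_tokens, cost_tokens)
    return [q + c for q, c in zip(query, context)]


# Toy example: two cost-map cells, each a four-channel cost vector.
cost_tokens = [[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0]]
query = [0.0, 0.0, 0.0, 0.0]   # a zero query attends uniformly
print(fuse_query(query, cost_tokens))
```

With a zero query the attention weights are uniform, so the fused query is simply the mean of the cost tokens. A non-zero query would weight the cells whose cost vectors align with it more heavily.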

Key Results

| Metric | Result |
|--------|--------|
| Average L2 (nuScenes) | 0.55 m |
| 3 s collision rate reduction (nuScenes) | 42% relative |
| PDMS (NAVSIM, rule-cost variant) | 89.4 |

- Reduces L2 at every horizon over baseline on nuScenes
- Rule-cost variant reaches 89.4 PDMS on NAVSIM
- Learned cost variant provides complementary gains on baseline-challenging scenes
- Cost pathway ablation shows direct contribution to safer trajectory selection
- Produces diverse cost maps with spatially consistent variations aligned to different driving styles
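The 42% figure is a relative reduction, so it describes how the PLAN-S collision rate compares with the baseline rate rather than giving either rate directly. Writing $c_{\text{base}}$ and $c_{\text{PLAN-S}}$ for the two 3 s collision rates (symbols introduced here for illustration), the relation is:

```latex
\[
\frac{c_{\text{base}} - c_{\text{PLAN-S}}}{c_{\text{base}}} = 0.42
\quad\Longrightarrow\quad
c_{\text{PLAN-S}} = (1 - 0.42)\, c_{\text{base}} = 0.58\, c_{\text{base}}
\]
```

The absolute rates are not stated in this summary, so only the ratio between them can be recovered here.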

Limitations and Future Work

The approach keeps host backbones frozen to isolate PLAN-S's contribution; performance with joint training remains unexplored. Future work could extend style conditioning to multi-agent scenarios and further develop the learned cost variant, which so far is reported to give complementary gains on baseline-challenging scenes.

Relevance to Patrick's Research

PLAN-S is a concrete example of how world models can be integrated into autonomous driving systems with interpretable style controls. The cost map abstraction provides a clean interface between perception (latent world model) and planning. This pattern of "latent model → structured output → planning" is directly relevant to world model research for robotics and embodied AI.