PLAN-S: Bridging Planning with Latent Style Dynamics for Autonomous Driving World Models

Abstract

Latent world models (LWMs) have strengthened end-to-end autonomous driving by forecasting compact scene dynamics for downstream planning. However, existing LWM-based planners usually generate trajectories directly from entangled latent representations. This compact latent-to-planner pathway lacks explicit modeling of risk, drivability, and diverse style preferences, making driving-style dynamics difficult to supervise, inspect, or modulate before a final trajectory is selected. We propose PLAN-S (PLANning with latent Style dynamics), a planner-facing bridge that addresses this compactness-controllability dilemma by decoding a style-conditioned, four-channel semantic cost map from the latent representation. The cost map is conditioned on ego state and driving style and is consumed upstream of the planning decision through two host-side interfaces: attention-level fusion for regression planners and reward-level fusion for anchor-score planners. We validate PLAN-S on two architecturally distinct hosts, ResWorld on nuScenes and WoTE on NAVSIM, while keeping the host backbones frozen to isolate the contribution of the proposed bridge. On nuScenes, PLAN-S reduces L2 at every horizon over the baseline, with 0.55 m average L2 and a 42% relative reduction in the 3 s collision rate. On NAVSIM, the rule-cost variant reaches 89.4 Predictive Driver Model Score (PDMS), while the learned cost variant provides complementary gains on baseline-challenging scenes.

Key Contributions

- Style-Conditioned Cost Map: Decodes a four-channel semantic cost map from the latent representation, conditioned on ego state and driving style preferences.

- Dual Interface Fusion: Integrates with host planners via attention-level fusion (regression planners) or reward-level fusion (anchor-score planners), keeping host backbones frozen.

- Interpretable Style Dynamics: Enables explicit supervision, inspection, and modulation of driving-style dynamics before final trajectory selection.
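
To make the first contribution concrete, the following is a minimal NumPy sketch of decoding a four-channel cost map from a latent vector, conditioned on ego state and a discrete style. It is not the paper's architecture: the linear decoder, the FiLM-style scale-and-shift conditioning, the grid size, the three-element ego state, and all names (`decode_cost_map`, `init_params`, `N_STYLES`) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration (not from the paper).
LATENT_DIM, EGO_DIM, N_STYLES = 32, 3, 3
H, W, N_CHANNELS = 8, 8, 4  # four semantic cost channels on a small BEV grid


def init_params():
    """Random weights for a toy linear decoder with scale-and-shift conditioning."""
    return {
        "w_dec": rng.normal(0.0, 0.1, (LATENT_DIM, N_CHANNELS * H * W)),
        "w_cond": rng.normal(0.0, 0.1, (EGO_DIM + N_STYLES, 2 * N_CHANNELS)),
    }


def decode_cost_map(latent, ego_state, style_id, params):
    """Decode a (4, H, W) cost map from a latent, modulated by ego state and style."""
    style = np.eye(N_STYLES)[style_id]            # one-hot style vector
    cond = np.concatenate([ego_state, style])     # conditioning vector
    gamma, beta = np.split(cond @ params["w_cond"], 2)  # per-channel scale, shift
    base = (latent @ params["w_dec"]).reshape(N_CHANNELS, H, W)
    modulated = (1.0 + gamma)[:, None, None] * base + beta[:, None, None]
    return 1.0 / (1.0 + np.exp(-modulated))       # sigmoid keeps costs in [0, 1]


params = init_params()
latent = rng.normal(size=LATENT_DIM)              # stand-in for a frozen LWM latent
ego = np.array([5.0, 0.0, 0.1])                   # assumed: speed, accel, yaw rate
cautious = decode_cost_map(latent, ego, 0, params)
assertive = decode_cost_map(latent, ego, 2, params)
print(cautious.shape, float(np.abs(cautious - assertive).max()))
```

The same latent yields different cost maps under different style indices, which is the property that makes the style pathway inspectable before any trajectory is chosen.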

Method Details

PLAN-S bridges the gap between latent world models and downstream planning by introducing a style-conditioned semantic cost map:

- Input: Compact latent representation from the LWM (frozen backbone)

- Output: Four-channel semantic cost map (representing different aspects of drivability/risk/style)

- Conditioning: Ego state + driving style preferences

Host Interfaces:

- Attention-level fusion for regression planners

- Reward-level fusion for anchor-score planners
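
As an illustration of the reward-level interface, the sketch below samples a cost map along each anchor trajectory and subtracts the weighted cost from the host planner's anchor scores. This is one plausible reading of reward-level fusion, not the paper's formulation: the grid layout, nearest-cell sampling, the channel weights, the trade-off `lam`, and the function names are all assumptions.

```python
import numpy as np


def trajectory_cost(cost_map, traj_xy, channel_weights, extent=32.0):
    """Weighted mean cost sampled along one trajectory.

    cost_map: (C, H, W) array on an ego-centred grid assumed to cover
              [-extent, extent] metres in both x and y.
    traj_xy:  (T, 2) waypoints in metres, ego frame.
    """
    _, h, w = cost_map.shape
    # Map metres to nearest grid cell, clipping waypoints that leave the grid.
    cols = np.clip(((traj_xy[:, 0] + extent) / (2 * extent) * w).astype(int), 0, w - 1)
    rows = np.clip(((traj_xy[:, 1] + extent) / (2 * extent) * h).astype(int), 0, h - 1)
    sampled = cost_map[:, rows, cols]             # shape (C, T)
    return float(channel_weights @ sampled.mean(axis=1))


def fuse_scores(host_scores, cost_map, anchors, channel_weights, lam=1.0):
    """Subtract a cost-map penalty from each anchor's host score."""
    costs = np.array([trajectory_cost(cost_map, a, channel_weights) for a in anchors])
    return host_scores - lam * costs


# Toy scene: channel 0 marks the half-plane x >= 0 as high cost.
cost_map = np.zeros((4, 16, 16))
cost_map[0, :, 8:] = 1.0
ys = np.linspace(0.0, 20.0, 6)
anchor_right = np.stack([np.full(6, 10.0), ys], axis=1)   # drives through the costly area
anchor_left = np.stack([np.full(6, -10.0), ys], axis=1)   # stays in the free area
anchors = [anchor_right, anchor_left]

host_scores = np.array([0.6, 0.5])                # host alone prefers anchor_right
weights = np.array([1.0, 0.5, 0.5, 0.5])          # hypothetical per-channel weights
fused = fuse_scores(host_scores, cost_map, anchors, weights)
print(int(host_scores.argmax()), int(fused.argmax()))
```

In this toy case the cost penalty flips the selection from the first anchor to the second, while the host's scoring function itself is left untouched, which matches the frozen-backbone setting described above.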

Validated on two distinct host architectures:

1. ResWorld on nuScenes
2. WoTE on NAVSIM

Key Results

| Metric | Result |
|--------|--------|
| Average L2 (nuScenes) | 0.55 m |
| 3s Collision Rate Reduction | 42% relative |
| PDMS Score (NAVSIM, rule-cost) | 89.4 |

- Reduces L2 at every horizon over baseline on nuScenes

- Rule-cost variant reaches 89.4 PDMS on NAVSIM

- Learned cost variant provides complementary gains on baseline-challenging scenes

- Cost pathway ablation shows direct contribution to safer trajectory selection

- Produces diverse cost maps with spatially consistent variations aligned to different driving styles
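
The 42% collision figure is a relative reduction, which is easy to confuse with an absolute percentage-point change. The short sketch below shows the arithmetic. The two collision rates in it are placeholders picked so the ratio comes out at 42%; they are not values from the paper, which these notes do not reproduce.

```python
def relative_reduction(baseline, value):
    """Fractional reduction of `value` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - value) / baseline


# Placeholder collision rates in percent, NOT taken from the paper.
baseline_rate, new_rate = 0.50, 0.29
rel = relative_reduction(baseline_rate, new_rate)
print(f"relative reduction: {rel:.0%}")
print(f"absolute change: {baseline_rate - new_rate:.2f} percentage points")
```

With these placeholder rates, a 42% relative reduction corresponds to an absolute change of only about 0.21 percentage points, so both baseline and final rates are needed to judge the practical size of the gain.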

Limitations and Future Work

The reported experiments keep host backbones frozen to isolate PLAN-S's contribution, so performance when the bridge and host are trained jointly remains unexplored. Future work could extend style conditioning to multi-agent scenarios and develop the learned cost variant further alongside the rule-cost one.

Relevance to Patrick's Research

PLAN-S is a concrete example of how world models can be integrated into autonomous driving systems with interpretable style controls. The cost map abstraction provides a clean interface between perception (latent world model) and planning. This pattern of "latent model → structured output → planning" is directly relevant to world model research for robotics and embodied AI.