Abstract

Joint-Embedding Predictive Architectures (JEPAs) learn compact latent world models by predicting future embeddings, but no single coordinate of the latent is designated to encode task progression. SD-JEPA carves the JEPA latent into two orthogonal subspaces: a low-dimensional progression subspace shaped by a cosine-margin triplet loss, and a high-dimensional content subspace regularized by the SIGReg objective. This disentanglement improves planning and enables the 1-D angular progression coordinate to function as a scene-aware compass.

Key Contributions

- Proposes a subspace decomposition of the JEPA latent into a low-dimensional progression subspace and a high-dimensional content subspace. The two act on disjoint coordinates, so their objectives compose additively rather than competing.
- Shows that the resulting 1-D angular progression coordinate advances with task progress, regresses on backtracking, and relocalizes to semantically appropriate task-phase sectors under perturbation.
- Demonstrates a +0.18 pooled AUROC improvement over standard latent-prediction-error surprise for localizing semantic events, with a 97.5% per-episode win rate at ±1-step tolerance.
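To make the behaviour of the angular coordinate concrete, here is a minimal sketch. The summary does not say how the 1-D angle is read out of the progression subspace, so this assumes it is the angle between the current progression vector and the episode's starting vector; `progression_angle` and the toy trajectory are hypothetical, not the paper's code.

```python
import numpy as np

def progression_angle(z_prog, ref):
    """Angle in radians between each progression vector and a reference direction.

    Assumption: the 1-D coordinate is the angle to the episode's start vector.
    """
    z = z_prog / np.linalg.norm(z_prog, axis=-1, keepdims=True)
    r = ref / np.linalg.norm(ref)
    cos = np.clip(z @ r, -1.0, 1.0)  # guard arccos against rounding
    return np.arccos(cos)

# Toy 8-dim trajectory: rotate forward for 6 steps, then backtrack for 3.
steps = np.array([0, 1, 2, 3, 4, 5, 6, 5, 4, 3])
theta = steps * (np.pi / 12)
z_prog = np.zeros((len(steps), 8))
z_prog[:, 0] = np.cos(theta)
z_prog[:, 1] = np.sin(theta)

angles = progression_angle(z_prog, ref=z_prog[0])
deltas = np.diff(angles)  # positive while advancing, negative while backtracking
```

Under this assumed readout, the sign of `deltas` separates forward progress from backtracking, which is the qualitative behaviour the second contribution describes.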

Method Details

SD-JEPA modifies the JEPA architecture by decomposing the latent representation into two orthogonal subspaces:

1. Progression Subspace (low-dimensional: 8 dimensions, ~4.2% of the latent): Shaped by a cosine-margin triplet loss that encourages the latent to encode task progress as angular displacement. This subspace is designed to be sensitive to task phase rather than to specific observations.

2. Content Subspace (high-dimensional): Regularized by the existing SIGReg (Sketched Isotropic Gaussian Regularization) objective from LeWM, which prevents collapse while preserving semantic content.
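A cosine-margin triplet loss restricted to the progression slice could look like the sketch below. The summary gives neither the exact form of the loss nor the triplet-sampling rule, so the hinge form, the margin value, the choice of the first 8 coordinates, and the 192-dimensional latent (8 / 192 is about 4.2%) are all assumptions made for illustration.

```python
import numpy as np

PROG_DIM = 8      # progression subspace size, from the text
LATENT_DIM = 192  # assumed: 8 / 192 is about 4.2%

def cosine_sim(a, b, eps=1e-8):
    """Row-wise cosine similarity between two batches of vectors."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return num / den

def cosine_margin_triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on cosine similarity, computed on the progression slice only.

    Assumption: positives are temporally close to the anchor, negatives far.
    """
    a, p, n = (x[..., :PROG_DIM] for x in (anchor, positive, negative))
    violation = margin - cosine_sim(a, p) + cosine_sim(a, n)
    return float(np.maximum(violation, 0.0).mean())

def unit(i):
    """One latent vector (batch of 1) pointing along coordinate i."""
    v = np.zeros((1, LATENT_DIM))
    v[0, i] = 1.0
    return v

anchor, near, far = unit(0), unit(0), unit(1)
loss_good = cosine_margin_triplet_loss(anchor, near, far)  # aligned positive
loss_bad = cosine_margin_triplet_loss(anchor, far, near)   # roles swapped
```

Because the loss slices `[..., :PROG_DIM]` before doing anything else, it never reads the content coordinates.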

The key insight is that the two anti-collapse forces, the triplet loss and SIGReg, act on disjoint coordinates, so they compose additively rather than competing. A subspace-ablation falsifier confirms that the split is the load-bearing ingredient.
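The "disjoint coordinates" argument can be checked numerically with a small sketch. Both objective functions here are stand-ins chosen for brevity: `progression_term` is not the paper's triplet loss, and `content_term` is a simple moment-matching penalty, not SIGReg. The 192-dimensional latent is likewise an assumption. Only the structure of the sum is the point.

```python
import numpy as np

PROG_DIM = 8      # progression subspace size, from the text
LATENT_DIM = 192  # assumed: 8 / 192 is about 4.2%

def split_latent(z):
    """Split latents of shape (T, LATENT_DIM) into (progression, content)."""
    return z[..., :PROG_DIM], z[..., PROG_DIM:]

def progression_term(z_prog):
    """Stand-in objective: mean cosine similarity of consecutive steps."""
    u = z_prog / np.linalg.norm(z_prog, axis=-1, keepdims=True)
    return float(np.mean(np.sum(u[:-1] * u[1:], axis=-1)))

def content_term(z_content):
    """Stand-in anti-collapse penalty (NOT SIGReg): zero mean, identity covariance."""
    mean = z_content.mean(axis=0)
    cov = np.cov(z_content, rowvar=False)
    eye = np.eye(z_content.shape[-1])
    return float(np.sum(mean ** 2) + np.sum((cov - eye) ** 2))

def total_loss(z, content_weight=1.0):
    """Additive composition: each term sees only its own coordinates."""
    z_prog, z_content = split_latent(z)
    return progression_term(z_prog) + content_weight * content_term(z_content)

rng = np.random.default_rng(0)
z = rng.normal(size=(64, LATENT_DIM))
z_shifted = z.copy()
z_shifted[:, PROG_DIM:] += 3.0  # perturb only the content coordinates

prog_before = progression_term(split_latent(z)[0])
prog_after = progression_term(split_latent(z_shifted)[0])
content_before = content_term(split_latent(z)[1])
content_after = content_term(split_latent(z_shifted)[1])
```

Perturbing the content coordinates changes `content_term` but leaves `progression_term` untouched, which is what lets the two terms be summed without one pulling against the other.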

Key Results

| Metric | Value | Context |
|--------|-------|---------|
| AUROC improvement (semantic event localization) | +0.18 pooled AUROC | vs. latent-prediction-error surprise |
| Per-episode win rate (±1-step tolerance) | 97.5% | on 40 held-out cube episodes |
| Task-progress variance explained | 72-95% | by the 8-dim progression subspace across 4 environments |
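For readers unfamiliar with the two localization metrics, the sketch below shows one plausible way to compute them. The exact protocol is not given in this summary, so the definitions here are assumptions: "pooled" is taken to mean concatenating all timesteps across episodes, the ±1-step tolerance is applied by widening each event label by one step on each side, and an episode counts as a win when the method's per-episode AUROC exceeds the baseline's. The toy scores are made up. Note that 97.5% of 40 episodes corresponds to 39 episodes.

```python
import numpy as np

def auroc(scores, labels):
    """Chance that a random event step outscores a random non-event step (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def dilate(labels, tol=1):
    """Mark steps within `tol` of an event as positive (assumed tolerance rule)."""
    labels = np.asarray(labels, dtype=bool)
    out = labels.copy()
    for k in range(1, tol + 1):
        out[k:] |= labels[:-k]
        out[:-k] |= labels[k:]
    return out

def pooled_auroc(score_eps, label_eps, tol=1):
    """AUROC over all timesteps of all episodes concatenated together."""
    scores = np.concatenate(score_eps)
    labels = np.concatenate([dilate(l, tol) for l in label_eps])
    return auroc(scores, labels)

def win_rate(method_eps, baseline_eps, label_eps, tol=1):
    """Fraction of episodes where the method's AUROC beats the baseline's."""
    wins = [
        auroc(m, dilate(l, tol)) > auroc(b, dilate(l, tol))
        for m, b, l in zip(method_eps, baseline_eps, label_eps)
    ]
    return float(np.mean(wins))

# Two made-up episodes with one semantic event each.
label_eps = [[0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 1, 0]]
method_eps = [[0.1, 0.2, 0.9, 0.3, 0.1, 0.0], [0.0, 0.1, 0.1, 0.4, 0.8, 0.2]]
baseline_eps = [[0.5, 0.1, 0.2, 0.1, 0.6, 0.3], [0.3, 0.7, 0.2, 0.1, 0.4, 0.1]]

gain = pooled_auroc(method_eps, label_eps) - pooled_auroc(baseline_eps, label_eps)
rate = win_rate(method_eps, baseline_eps, label_eps)
```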

SD-JEPA improves over the LeWM baseline on a majority of the control benchmarks at matched compute, and outperforms the strongest non-LeWM JEPA baseline on Push-T.

Limitations and Future Work

The paper focuses on planning and semantic event localization. Future work could explore using the progression subspace for hierarchical planning, skill discovery, or as a reward signal for reinforcement learning. The method requires careful initialization of the cosine-margin triplet loss; more research may be needed to make it robust across diverse environments.

Relevance to Patrick's Research

Directly relevant to the AMI/JEPA (Joint-Embedding Predictive Architecture) research tracked in Patrick's world-model knowledge base. SD-JEPA advances the understanding of how latent world models can separate task progress from visual content, enabling more interpretable and usable representations for planning. The progress-detecting "compass" property has implications for task understanding in embodied AI agents like Voyager.