Joint-Embedding Predictive Architectures (JEPAs) learn compact latent world models by predicting future embeddings, but no single coordinate of the latent is designated to encode task progression. SD-JEPA carves the JEPA latent into two orthogonal subspaces: a low-dimensional progression subspace shaped by a cosine-margin triplet loss, and a high-dimensional content subspace regularized by the SIGReg objective. This disentanglement improves planning and enables the 1-D angular progression coordinate to function as a scene-aware compass.
- Proposes a subspace decomposition of the JEPA latent into a low-dimensional progression subspace and a high-dimensional content subspace. Because the two regularizers act on disjoint coordinates, they compose additively rather than competing.
SD-JEPA modifies the JEPA architecture by decomposing the latent representation into two orthogonal subspaces:
1. Progression Subspace (low-dimensional: 8 dimensions, about 4.2% of the latent): Shaped by a cosine-margin triplet loss that encourages the latent to encode task progress as angular displacement. This subspace is designed to be sensitive to task phase rather than to specific observations.
2. Content Subspace (high-dimensional): Regularized by the existing SIGReg (Sketched Isotropic Gaussian Regularization) objective from LeWM, which prevents collapse while preserving semantic content representations.
The key insight is that the two anti-collapse forces act on disjoint coordinates, so they compose additively. A subspace-ablation falsifier confirms the split is the load-bearing ingredient.
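As a rough illustration of the split described above, the following NumPy sketch slices a latent into disjoint progression and content coordinates and applies a cosine-margin triplet loss to the progression slice only. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the 192-dimensional latent, the margin of 0.2, the choice of the first 8 coordinates as the progression slice, and the random stand-in triplets are all hypothetical. SIGReg is not reproduced here.

```python
import numpy as np

PROG_DIMS = 8  # progression subspace size stated in the text


def split_latent(z, prog_dims=PROG_DIMS):
    """Split latents of shape (batch, dim) into disjoint progression and content slices."""
    return z[:, :prog_dims], z[:, prog_dims:]


def cosine_similarity(a, b, eps=1e-8):
    """Row-wise cosine similarity between two (batch, d) arrays."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return num / den


def cosine_margin_triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on cosine similarity: the positive should beat the negative by `margin`.

    The margin value is an assumption made for illustration.
    """
    gap = cosine_similarity(anchor, negative) - cosine_similarity(anchor, positive) + margin
    return float(np.mean(np.maximum(gap, 0.0)))


rng = np.random.default_rng(0)
latent_dim = 192  # assumed; roughly consistent with 8 dims being ~4.2% of the latent

# Random stand-ins for encoder outputs; how triplets are mined is not specified in the text.
z_a, z_p, z_n = (rng.normal(size=(16, latent_dim)) for _ in range(3))

prog_a, content_a = split_latent(z_a)
prog_p, _ = split_latent(z_p)
prog_n, _ = split_latent(z_n)
loss = cosine_margin_triplet_loss(prog_a, prog_p, prog_n)

# Perturbing only the content coordinates leaves the progression loss unchanged,
# which is the sense in which the two objectives act on disjoint coordinates.
z_a_perturbed = z_a.copy()
z_a_perturbed[:, PROG_DIMS:] += 5.0
loss_perturbed = cosine_margin_triplet_loss(split_latent(z_a_perturbed)[0], prog_p, prog_n)

print(loss, loss_perturbed)
```

In a full training setup, a content regularizer would be applied to `content_a` alone, so that neither objective touches the other's coordinates.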
SD-JEPA improves over the LeWM baseline on a majority of the control benchmarks at matched compute, and outperforms the strongest non-LeWM JEPA baseline on Push-T.
The paper focuses on planning and semantic event localization. Future work could explore using the progression subspace for hierarchical planning, skill discovery, or as a reward signal for reinforcement learning. The method requires careful initialization of the cosine-margin triplet loss; more research may be needed to make it robust across diverse environments.
This paper is directly relevant to the AMI/JEPA (Joint-Embedding Predictive Architecture) research tracked for Patrick's world model knowledge base. SD-JEPA advances the understanding of how latent world models can separate task progress from visual content, enabling more interpretable and usable representations for planning. The progress-detecting "compass" property has implications for task understanding in embodied AI agents like Voyager.