Abstract

Supply-chain AI agents face a gap: LLMs interpret policies but lack physical grounding, while RL optimizes flows but is semantically blind to unstructured constraints. ReflectiChain bridges this with a Generative Supply Chain World Model (SC-WM) — a 6-dim graph-latent space with physical conservation — plus Double-Loop Learning that separates epistemic uncertainty (KL-trust-region-bounded policy adaptation) from aleatoric uncertainty (stochastic latent rollouts). On Semi-Sim, a 10-node semiconductor benchmark with SIR risk propagation, 6 perturbation types, and 10 policy constraint templates, ReflectiChain improves Rationale Consistency Score by +33.0% (p < 0.0001, d = 2.78), maintains 82.3% operability under adversarial shocks, and gains +40.2% under moderate pressure (anti-fragile behavior).
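
The SIR risk propagation mentioned for Semi-Sim can be illustrated with a minimal sketch. It assumes a discrete-time, mean-field SIR update on a directed supply graph, where each node carries probabilities of being susceptible, disrupted ("infected"), or recovered. The function name `sir_step`, the rates `beta` and `gamma`, and the 4-node chain are hypothetical and are not taken from the benchmark.

```python
def sir_step(state, upstream, beta=0.3, gamma=0.1):
    """One synchronous SIR update; state maps node -> (s, i, r) probabilities."""
    nxt = {}
    for node, (s, i, r) in state.items():
        # Probability that no disrupted upstream neighbour transmits risk.
        escape = 1.0
        for nb in upstream.get(node, ()):
            escape *= 1.0 - beta * state[nb][1]
        new_inf = s * (1.0 - escape)   # susceptible mass that becomes disrupted
        new_rec = gamma * i            # disrupted mass that recovers
        nxt[node] = (s - new_inf, i + new_inf - new_rec, r + new_rec)
    return nxt

# Hypothetical 4-node chain: fab -> packaging -> assembly -> distributor.
upstream = {
    "packaging": ["fab"],
    "assembly": ["packaging"],
    "distributor": ["assembly"],
}
state = {n: (1.0, 0.0, 0.0) for n in ("fab", "packaging", "assembly", "distributor")}
state["fab"] = (0.0, 1.0, 0.0)  # the shock starts at the fab

history = [state]
for _ in range(5):
    history.append(sir_step(history[-1], upstream))
```

Under these assumptions the disruption moves one hop downstream per step, and each node's three probabilities keep summing to 1.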

Key Contributions

Method Details

Architecture:

  1. Graph encoder: a heterogeneous supply network (nodes = firms/facilities, edges = material/information flows) is encoded into a 6-dimensional latent space per node; physical conservation (flow balance, capacity) is enforced as a hard constraint
  2. Generative latent dynamics: SC-WM rolls the latent graph forward under policy actions and exogenous perturbations, producing a distribution over future states (aleatoric uncertainty is captured by sampling)
  3. Epistemic uncertainty estimator: a separate module that uses a knowledge-boundary detector to estimate where the world model itself is uncertain (e.g., sparse data, distribution shift)
  4. Double-Loop policy adaptation:
     1. Empirical Bayesian updating: observed outcomes update both the world-model posterior and the policy's belief state
     2. Rationale Consistency scoring: evaluated against human-annotated policy rationales (used as the primary metric)
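
Items 1 to 3 above can be illustrated with a toy sketch. It is not the SC-WM. A single scalar inventory stands in for the 6-dimensional graph latent, and a small ensemble of hand-set models stands in for the knowledge-boundary detector, which these notes do not specify. All names and parameter values (`rollout`, `decompose`, `demand_mean`, and so on) are hypothetical. The sketch shows conservation enforced by construction, aleatoric uncertainty obtained by sampling rollouts, and epistemic uncertainty read off as disagreement between models via the law of total variance.

```python
import random
import statistics

def rollout(inventory, inflow, capacity, demand_mean, demand_sd, steps, rng):
    """Sample one trajectory for a single node and return its final inventory."""
    for _ in range(steps):
        demand = max(0.0, rng.gauss(demand_mean, demand_sd))  # aleatoric shock
        # Hard constraints: never ship more than demand, capacity, or stock on hand.
        shipped = min(demand, capacity, inventory + inflow)
        # Flow balance holds by construction: stock changes only via inflow and shipments.
        inventory = inventory + inflow - shipped
    return inventory

def decompose(models, n_samples=500, seed=0):
    """Split predictive variance into (aleatoric, epistemic) parts."""
    rng = random.Random(seed)
    means, variances = [], []
    for params in models:
        finals = [rollout(rng=rng, **params) for _ in range(n_samples)]
        means.append(statistics.fmean(finals))
        variances.append(statistics.pvariance(finals))
    aleatoric = statistics.fmean(variances)   # average within-model spread
    epistemic = statistics.pvariance(means)   # disagreement between models
    return aleatoric, epistemic

# Hypothetical ensembles: one whose members agree on demand, one whose members do not.
base = dict(inventory=50.0, inflow=10.0, capacity=15.0, demand_sd=3.0, steps=5)
agree = [dict(base, demand_mean=m) for m in (10.0, 10.0, 10.0)]
disagree = [dict(base, demand_mean=m) for m in (6.0, 10.0, 14.0)]

alea_a, epi_a = decompose(agree)
alea_d, epi_d = decompose(disagree)
```

In this toy setting the epistemic term is close to zero when the ensemble members share parameters and grows when they disagree, while the aleatoric term stays positive in both cases because demand is sampled.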

Key Results

Limitations and Future Work

The authors enumerate five limitation categories (per the abstract); the specifics are in the paper body.

Relevance to Patrick's Research

ReflectiChain is a domain-specific world model that demonstrates the epistemic-vs-aleatoric decomposition LeCun and others have argued is essential for autonomous intelligence. The Double-Loop Learning pattern (KL trust region in the outer loop + stochastic rollouts in the inner loop) is a concrete, implementable instantiation of "know what you don't know" that generalizes beyond supply chains to any agent that must act under model uncertainty. The +33% rationale consistency with d=2.78 is a strong effect size to cite when arguing that world-model-grounded LLMs beat either pure-LLM or pure-RL approaches. For Patrick's tracking, this is a useful case study in applied world modeling, complementary to the embodied/simulator-focused papers.
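
The KL trust region named above can be sketched for a discrete action distribution. This is an illustrative stand-in, not the paper's update rule: it assumes a categorical policy, measures KL(new || old), and backs off toward the old policy by bisection on a mixing weight when the proposed update exceeds the bound. The names `kl`, `trust_region_update`, and `delta`, and the example distributions, are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; requires q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def trust_region_update(old, proposed, delta, iters=50):
    """Move from `old` toward `proposed` as far as KL(new || old) <= delta allows."""
    if kl(proposed, old) <= delta:
        return list(proposed)
    lo, hi = 0.0, 1.0  # mixing weight placed on `proposed`
    for _ in range(iters):
        mid = (lo + hi) / 2
        mix = [(1 - mid) * o + mid * p for o, p in zip(old, proposed)]
        if kl(mix, old) <= delta:
            lo = mid   # still inside the trust region: try a larger step
        else:
            hi = mid   # outside the trust region: shrink the step
    return [(1 - lo) * o + lo * p for o, p in zip(old, proposed)]

# Hypothetical 4-action policy and an aggressive proposed update.
old = [0.25, 0.25, 0.25, 0.25]
proposed = [0.97, 0.01, 0.01, 0.01]
new = trust_region_update(old, proposed, delta=0.05)
```

Bisection is valid here because KL(mix || old) is convex in the mixing weight and zero at weight 0, so it does not decrease as the weight grows. A small proposed change that already satisfies the bound is returned unchanged.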