Abstract

ForesightFlow is a self-guided flow-matching policy that augments generated action chunks with a learned success-potential trajectory, enabling a single model to both propose and score candidate actions (best-of-K) without an external critic.

Key Contributions

- Decoupled advantage-weighted flow matching: exponentiated advantage weights apply only to the action velocities, not to the potential coordinates, so the potential is still trained on failed trajectories and does not hallucinate overconfident values.
- One-step boundary estimator: computes advantages with a single stop-gradient forward pass, reducing compute overhead.
- Self-guided best-of-K inference: the same flow model proposes and ranks K candidate actions, so no separate critic network is needed.
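These notes do not spell out the one-step boundary estimator. One plausible reading, assuming a linear (rectified-flow style) interpolation path x_t = (1 - t) x_0 + t x_1, is to extrapolate to the t = 1 boundary with a single velocity evaluation and read the potential there. In the sketch below, the function names, the mean reduction over the potential coordinates, and the return-minus-baseline form of the advantage are all assumptions for illustration, not the paper's definitions.

```python
import numpy as np

def one_step_endpoint(x_t, t, velocity):
    """Extrapolate to the t=1 boundary with one velocity evaluation.

    For a linear path x_t = (1 - t) * x_0 + t * x_1 the exact velocity is
    x_1 - x_0, so x_t + (1 - t) * velocity recovers x_1.
    """
    return x_t + (1.0 - t) * velocity

def advantage_from_potential(x_t, t, velocity, observed_return, action_dim):
    """Hypothetical advantage: observed return minus a potential baseline.

    `velocity` stands in for the model's prediction at (x_t, t). In an
    autodiff framework it would be computed under a stop-gradient so the
    baseline does not receive gradients from the policy loss.
    """
    x1_hat = one_step_endpoint(x_t, t, velocity)
    # Assumption: coordinates after the first `action_dim` entries hold
    # the potential; their mean serves as a scalar baseline.
    baseline = x1_hat[..., action_dim:].mean(axis=-1)
    return observed_return - baseline
```

The point of the single extrapolation is that no multi-step ODE solve is needed to obtain a baseline during training, which is consistent with the compute saving claimed above.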

Method Details

ForesightFlow trains a conditional flow-matching model over action chunks conditioned on vision-language inputs. Each action chunk is augmented with a learned "success-potential" scalar field. The flow model simultaneously:

1. Proposes candidate action sequences via flow interpolation
2. Scores them via the learned potential (higher potential = more likely success)
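The propose-and-score loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: `velocity_fn` is a hypothetical stand-in for the trained flow model, the chunk is flattened into one vector with the potential coordinates appended after the action coordinates, sampling uses plain Euler integration, and the final potential coordinate is taken as the candidate's score. All of those choices are assumptions.

```python
import numpy as np

def sample_chunk(velocity_fn, obs, dim, rng, n_steps=10):
    """Euler-integrate the flow from Gaussian noise (t=0) to a sample (t=1)."""
    x = rng.standard_normal(dim)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_fn(x, i * dt, obs)
    return x

def best_of_k(velocity_fn, obs, action_dim, potential_dim, k, rng, n_steps=10):
    """Draw k augmented chunks and keep the one with the highest potential."""
    dim = action_dim + potential_dim
    candidates = np.stack(
        [sample_chunk(velocity_fn, obs, dim, rng, n_steps) for _ in range(k)]
    )
    # Assumption: the last potential coordinate is the candidate's score.
    scores = candidates[:, -1]
    best = int(np.argmax(scores))
    # Only the action coordinates are executed; the potential is discarded.
    return candidates[best, :action_dim], best, scores
```

Because the score is generated alongside the actions, ranking the K candidates adds no forward passes beyond those used for sampling.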

The key architectural insight is separating the velocity field into two heads: one for action velocities (weighted by exponentiated advantages during training) and one for potential velocities (trained uniformly). Because the potential head is not reweighted, gradients from failed trajectories are not suppressed during policy improvement, which keeps the potential from becoming overconfident.
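A minimal sketch of the decoupled weighting, assuming a squared-error flow-matching objective on flattened chunks. The function name, the temperature `beta`, and the weight clip `w_max` are hypothetical and not taken from the paper.

```python
import numpy as np

def decoupled_awfm_loss(v_pred, v_target, advantage, action_dim,
                        beta=1.0, w_max=20.0):
    """Return (action_loss, potential_loss) for one batch.

    v_pred, v_target: arrays of shape (batch, action_dim + potential_dim)
    advantage:        array of shape (batch,)
    """
    # Exponentiated advantage weights, clipped for stability (assumption).
    w = np.minimum(np.exp(advantage / beta), w_max)
    sq_err = (v_pred - v_target) ** 2
    # Action head: each sample's error is scaled by its advantage weight,
    # so low-advantage (failed) samples contribute little.
    action_loss = np.mean(w * sq_err[:, :action_dim].mean(axis=1))
    # Potential head: uniform weighting, so failed samples count fully.
    potential_loss = np.mean(sq_err[:, action_dim:].mean(axis=1))
    return action_loss, potential_loss
```

With a strongly negative advantage the sample almost vanishes from `action_loss` but still counts fully in `potential_loss`, which is the behavior the decoupling is meant to preserve.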

Evaluated on 5 BEHAVIOR-1K simulation tasks and 5 real-world bimanual manipulation tasks, using a VLA backbone (RT-series architecture) with flow-matching heads.

Key Results

| Setting | Result |
|---------|--------|
| Simulation success | Matches strongest separate-critic baseline |
| Real-world bimanual success | Improves over imitation baselines |
| Training compute | Reduces by 38% vs. separate-critic offline RL |
| Ablation: decoupled vs. coupled | Decoupling prevents value hallucination |
| Ablation: one-step estimator | Preserves candidate-ranking fidelity |
| Ablation: self-guided sampling | Improves long-horizon execution |

Limitations & Future Work

- Best-of-K inference at test time is computationally expensive; the one-step estimator mitigates but doesn't eliminate this.
- Relies on the quality of the success-potential supervision signal; sparse-reward environments may need shaped potentials.
- Evaluated on manipulation; generalization to other robot morphologies not tested.

Relevance to Patrick's Research

Directly relevant to VLA world-model policy learning. The flow-matching paradigm offers an alternative to diffusion or autoregressive action generation for world-model-based control. The key idea of using the "potential" as an internal world model to guide exploration aligns with predictive-architecture research.