title: "FAWAM: Force-Aware World Action Models for Closed-Loop Contact-Rich Manipulation"
arxiv_id: "2606.08555"
date: "2026-06-07"
tags: [world-model, robot-learning, contact-rich, force-aware, manipulation, closed-loop]
---
Abstract
FAWAM extends the World Action Model (WAM) paradigm to contact-rich manipulation by incorporating force signals at three levels: perception, prediction, and closed-loop execution. It encodes the history of 6-axis force/torque (F/T) readings to modulate action generation, jointly predicts future actions and end-effector wrenches to model how contact evolves, and uses the predicted wrench trajectory as an execution-time reference for online residual correction.
Key Contributions
- Three-level force integration: Force is used at (1) perception, where encoded F/T history modulates the action head; (2) prediction, where actions and future wrenches are predicted jointly; and (3) closed-loop execution, where a residual correction tracks the predicted wrench trajectory
- Joint action + wrench prediction: The model explicitly forecasts how contact forces will evolve, not just what action to take
- Online residual correction: At execution time, real-time force feedback is compared against the predicted wrench trajectory; a learned residual corrects the action online
- Large empirical gains: +36.25% average success rate over vision-only baselines; +21.25% over existing force-aware baselines on real-world contact-rich tasks
Method Details
The model is a World Action Model (WAM) with three force-aware extensions:
- Perception-level force encoding: A small encoder processes the history of 6-axis F/T signals (forces and torques in 3D) and produces a force embedding. This embedding modulates the action head via FiLM or cross-attention, conditioning action generation on contact state.
- Prediction-level wrench forecasting: The model has two heads sharing a backbone:
  - Action head: Predicts a chunk of future actions (similar to standard WAMs).
  - Wrench head: Predicts the trajectory of future end-effector wrenches.
  The two heads are trained jointly, so the action and wrench predictions are mutually consistent: the model learns a "what contact will look like if I do this" representation.
- Execution-level residual correction: At runtime, the predicted wrench trajectory is used as a reference. A small residual policy reads (predicted wrench, real-time measured wrench, current state) and outputs a correction that is added to the planned action. This makes the system robust to model error and to unexpected contact events.
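
The perception-level and execution-level steps above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the function names (`film_modulate`, `residual_correction`, `execute_chunk`) are hypothetical, the sketch picks FiLM where the source says "FiLM or cross-attention", and the learned residual policy is replaced by a fixed linear gain on the wrench error that ignores the current state.

```python
import numpy as np


def film_modulate(features, force_embedding, w_gamma, w_beta):
    """FiLM conditioning: scale and shift features using the force embedding.

    Assumption: gamma and beta are linear projections of the embedding, and
    the scale is written as (1 + gamma) so that zero weights give identity.
    """
    gamma = force_embedding @ w_gamma  # per-channel scale offset
    beta = force_embedding @ w_beta    # per-channel shift
    return (1.0 + gamma) * features + beta


def residual_correction(planned_action, predicted_wrench, measured_wrench, gain):
    """Add a correction driven by the wrench tracking error to the planned action.

    Assumption: a fixed linear gain stands in for the learned residual policy,
    which in the source also reads the current state.
    """
    wrench_error = predicted_wrench - measured_wrench
    return planned_action + gain @ wrench_error


def execute_chunk(actions, predicted_wrenches, read_wrench, gain):
    """Step through an action chunk, correcting each action online.

    read_wrench is a hypothetical callable returning the latest 6-axis
    F/T measurement as a length-6 array.
    """
    corrected = []
    for action, w_pred in zip(actions, predicted_wrenches):
        w_meas = read_wrench()
        corrected.append(residual_correction(action, w_pred, w_meas, gain))
    return np.stack(corrected)
```

When the measured wrench matches the prediction, the correction vanishes and the planned chunk is executed unchanged; any mismatch shifts the action in proportion to the tracking error.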
The backbone is a video + force transformer. The training objective combines three terms: a behavior-cloning loss, a wrench-prediction loss, and a wrench-trajectory consistency loss that penalizes disagreement between the contact dynamics implied by the predicted actions and the predicted wrench trajectory.
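
The three-term objective can be written compactly as follows. This is a sketch only: the source names the terms but gives neither their weighting nor their exact form, so the weights $\lambda_w$ and $\lambda_c$, the horizon $H$, and the notation below are assumptions introduced for illustration.

$$
\mathcal{L} = \mathcal{L}_{\text{BC}}\left(\hat{a}_{t:t+H},\, a_{t:t+H}\right) + \lambda_w \, \mathcal{L}_{\text{wrench}}\left(\hat{w}_{t:t+H},\, w_{t:t+H}\right) + \lambda_c \, \mathcal{L}_{\text{cons}}\left(\hat{a}_{t:t+H},\, \hat{w}_{t:t+H}\right)
$$

Here $\hat{a}$ and $\hat{w}$ denote the predicted action and wrench chunks, while $a$ and $w$ denote the demonstrated actions and measured wrenches. The first two terms compare predictions against data; the consistency term compares the two predictions against each other.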
Key Results
- +36.25% average success rate over vision-only baselines across multiple real-world contact-rich tasks
- +21.25% average success rate over existing force-aware baselines
- Real-world experiments (not just simulation) across multiple contact-rich tasks (e.g., insertion, assembly, surface following)
- Ablations isolate the contribution of each of the three force-integration levels
- PDF: https://arxiv.org/pdf/2606.08555
Limitations and Future Work
- Requires a 6-axis F/T sensor at the end-effector; sensor noise, calibration drift, and tactile sensor integration are not extensively addressed
- The residual correction module is learned, so it inherits any bias of the training distribution and may not generalize to out-of-distribution contact regimes
- Joint action-wrench prediction assumes a relatively short horizon; long-horizon contact-rich tasks with multi-stage contact transitions are not the focus
Relevance to Patrick's Research
FAWAM is relevant to Patrick's interest in world models for control because it shows that predicting auxiliary physical signals (wrenches) alongside actions materially improves performance on contact-rich tasks. This generalizes a useful principle: the value of a world model is not just in rolling out future observations, but in rolling out task-relevant physical quantities. The three-level integration pattern (perception / prediction / execution) is also a clean architectural template for any future work on multi-modal world models.