Abstract

Qwen-VLA is a unified Vision-Language-Action foundation model that addresses fragmentation in embodied intelligence research. Rather than specialized models for individual tasks (manipulation vs. navigation), Qwen-VLA unifies heterogeneous embodied decision-making problems within a single model. It extends Qwen's vision-language modeling stack from perception, understanding, and reasoning to continuous action and trajectory generation through a DiT-based action decoder. Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, trajectory-centric supervision, and auxiliary vision-language data. It introduces embodiment-aware prompt conditioning where robot-specific textual descriptions specify the current embodiment and control convention.

Key Contributions

  • Unified VLA architecture: A single model handles manipulation, navigation, and trajectory prediction via a DiT-based action decoder
  • Multi-source joint pretraining: Trained on robotics manipulation, egocentric video, synthetic simulation, VLN navigation, and auxiliary VL data
  • Embodiment-aware prompt conditioning: Natural language descriptions of robot morphology enable cross-embodiment generalization
  • Unified action-and-trajectory framework: Casts manipulation, navigation, and trajectory prediction into a single formulation

Method Details

Qwen-VLA extends the Qwen vision-language model architecture to output executable robot actions through a DiT-based action decoder:

  1. Vision encoder: Processes RGB images from robot cameras (frozen or fine-tuned depending on config)
  2. Qwen LLM backbone: Handles vision-language understanding and reasoning
  3. DiT-based action decoder: Generates continuous actions as discretized tokens in the same embedding space as language
  4. Embodiment-aware prompt conditioning: Robot-specific textual descriptions encode embodiment constraints and control conventions
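
As a rough illustration of embodiment-aware prompt conditioning, the sketch below renders a robot description into a text prefix that is prepended to the task instruction. The `EmbodimentSpec` fields, the `build_prompt` helper, the prompt wording, and the example values are all hypothetical; the source does not give the actual template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbodimentSpec:
    """Hypothetical robot description; field names are illustrative only."""
    name: str
    action_dim: int
    control_mode: str  # e.g. "end-effector delta pose" or "base velocity"
    control_hz: int


def build_prompt(spec: EmbodimentSpec, instruction: str) -> str:
    """Prepend a textual embodiment description to the task instruction."""
    header = (
        f"Embodiment: {spec.name}. "
        f"Action space: {spec.action_dim}-D {spec.control_mode} "
        f"at {spec.control_hz} Hz."
    )
    return f"{header}\nTask: {instruction.strip()}"


# Illustrative values, not taken from the source.
widowx = EmbodimentSpec("WidowX arm", 7, "end-effector delta pose", 5)
prompt = build_prompt(widowx, "put the spoon on the towel ")
print(prompt)
```

Keeping the embodiment description in plain text means a new robot can, in principle, be described without changing the model's input interface; whether that is enough for good transfer is an empirical question.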

The training recipe involves joint pretraining on diverse data sources:

  • Robotics manipulation trajectories (real and simulated)
  • Human egocentric demonstration video
  • Synthetic simulation data
  • Vision-and-language navigation (VLN) data
  • Auxiliary vision-language data for perception grounding
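
One simple way to realize joint pretraining over these sources is weighted sampling, where each training example is drawn from a source with a fixed probability. The sketch below assumes that scheme; the `MIXTURE` weights and source keys are hypothetical, since the source lists the data sources but not their ratios.

```python
import random
from collections import Counter

# Hypothetical mixture weights; the actual ratios are not given in the source.
MIXTURE = {
    "robot_manipulation": 0.40,
    "egocentric_video": 0.15,
    "synthetic_sim": 0.15,
    "vln_navigation": 0.15,
    "aux_vision_language": 0.15,
}


def sample_sources(mixture: dict[str, float], n: int, seed: int = 0) -> list[str]:
    """Draw n source names with probability proportional to their weights."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


batch_sources = sample_sources(MIXTURE, n=10_000)
counts = Counter(batch_sources)
for name in MIXTURE:
    # Empirical fraction should be close to the configured weight.
    print(f"{name:22s} {counts[name] / len(batch_sources):.3f}")
```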

The unified action-and-trajectory prediction framework enables transferable visual grounding and spatial reasoning across robot morphologies, task families, and environments.
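
The source does not spell out how actions from different embodiments are cast into a single formulation. A common approach, assumed here purely for illustration, is to zero-pad every action vector to a shared width and carry a mask that marks the dimensions in use. `MAX_ACTION_DIM`, `ActionChunk`, and `to_unified` are hypothetical names.

```python
from dataclasses import dataclass

MAX_ACTION_DIM = 14  # assumed upper bound (e.g. two 7-D arms); not from the source


@dataclass
class ActionChunk:
    values: list[list[float]]  # horizon x MAX_ACTION_DIM, zero-padded
    mask: list[bool]           # True where this embodiment uses the dimension


def to_unified(actions: list[list[float]], max_dim: int = MAX_ACTION_DIM) -> ActionChunk:
    """Pad a sequence of per-step action vectors to a shared width."""
    dim = len(actions[0])
    if any(len(step) != dim for step in actions):
        raise ValueError("all steps must share one action dimension")
    if dim > max_dim:
        raise ValueError(f"action dim {dim} exceeds max_dim {max_dim}")
    pad = [0.0] * (max_dim - dim)
    return ActionChunk(
        values=[list(step) + pad for step in actions],
        mask=[True] * dim + [False] * (max_dim - dim),
    )


arm = to_unified([[0.1] * 7, [0.2] * 7])  # two steps of a 7-D arm action
nav = to_unified([[0.5, 0.0, 0.1]])       # one (x, y, yaw) navigation waypoint
print(sum(arm.mask), sum(nav.mask))
```

With this layout, a manipulation step and a navigation waypoint share one tensor shape, and a training loss can ignore the padded dimensions by applying the mask.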

Key Results

  • LIBERO: 97.9% (instruction-following robotics manipulation)
  • Simpler-WidowX: 73.7% (manipulation)
  • RoboTwin-Easy/Hard: 86.1% / 87.2% (bimanual manipulation)
  • R2R: 69.0% OSR (oracle success rate; Vision-and-Language Navigation)
  • RxR: 59.6% SR (success rate; RxR-VLN benchmark)
  • ALOHA real-world: 76.9% average OOD success
  • DOMINO: 26.6% zero-shot success on dynamic manipulation
  • Achieves consistent multi-task performance and out-of-distribution generalization under variations in scene layout, background, lighting, object configuration, and robot embodiment
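
The two navigation numbers use different metrics, so they are not directly comparable. The sketch below shows the usual definitions: SR counts an episode as successful if the agent stops within a threshold of the goal, while OSR counts it if any point on the path comes within that threshold. The 3 m radius and the 2-D Euclidean distance are simplifying assumptions (benchmarks typically use geodesic distance), and the episodes are made up.

```python
import math

SUCCESS_RADIUS_M = 3.0  # assumed threshold; check the benchmark's own definition

Point = tuple[float, float]


def dist(a: Point, b: Point) -> float:
    return math.hypot(a[0] - b[0], a[1] - b[1])


def success(path: list[Point], goal: Point, radius: float = SUCCESS_RADIUS_M) -> bool:
    """SR criterion: the final position is within the radius of the goal."""
    return dist(path[-1], goal) <= radius


def oracle_success(path: list[Point], goal: Point, radius: float = SUCCESS_RADIUS_M) -> bool:
    """OSR criterion: any position on the path is within the radius of the goal."""
    return any(dist(p, goal) <= radius for p in path)


def rate(flags: list[bool]) -> float:
    return sum(flags) / len(flags)


# Made-up episodes: (path, goal).
episodes = [
    ([(0.0, 0.0), (5.0, 0.0), (9.0, 0.0)], (10.0, 0.0)),   # stops near goal
    ([(0.0, 0.0), (9.0, 0.0), (15.0, 0.0)], (10.0, 0.0)),  # passes goal, overshoots
    ([(0.0, 0.0), (2.0, 0.0)], (10.0, 0.0)),               # never gets close
]
sr = rate([success(p, g) for p, g in episodes])
osr = rate([oracle_success(p, g) for p, g in episodes])
print(f"SR={sr:.3f} OSR={osr:.3f}")
```

Because the final position is itself a point on the path, OSR can never be lower than SR on the same set of episodes.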

Limitations and Future Work

  • The 26.6% zero-shot success on DOMINO suggests that generalization to dynamic manipulation remains limited
  • Action discretization granularity may lose fine-motor control nuance
  • Simulation-to-real transfer under varying lighting/background requires the full pretraining recipe
  • Long-horizon task composition is not explicitly addressed
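
To make the discretization concern concrete, the sketch below assumes actions are uniformly binned over a fixed range and decoded to bin centers. Under that assumption the worst-case round-trip error is half a bin width, so it shrinks linearly with the number of bins. The range, bin counts, and function names are illustrative; the source does not state the actual scheme.

```python
def quantize(x: float, low: float, high: float, bins: int) -> int:
    """Map x in [low, high] to one of `bins` equal-width bin indices."""
    x = min(max(x, low), high)
    idx = int((x - low) / (high - low) * bins)
    return min(idx, bins - 1)  # x == high falls into the last bin


def dequantize(idx: int, low: float, high: float, bins: int) -> float:
    """Return the center of bin `idx`."""
    width = (high - low) / bins
    return low + (idx + 0.5) * width


def max_roundtrip_error(low: float, high: float, bins: int, samples: int = 10_001) -> float:
    """Largest |x - dequantize(quantize(x))| over an evenly spaced grid."""
    worst = 0.0
    for i in range(samples):
        x = low + (high - low) * i / (samples - 1)
        x_hat = dequantize(quantize(x, low, high, bins), low, high, bins)
        worst = max(worst, abs(x - x_hat))
    return worst


for bins in (16, 256, 1024):
    half_width = (1.0 - (-1.0)) / (2 * bins)
    print(bins, round(max_roundtrip_error(-1.0, 1.0, bins), 6), "bound:", half_width)
```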

Relevance to Patrick's Research

Qwen-VLA represents the foundation-model approach to embodied AI: a single model that perceives through its vision-language stack and acts through its action decoder across diverse embodiments. The 97.9% on LIBERO and the strong OOD generalization numbers suggest VLAs are approaching practical utility. However, a key question for world model research remains: is Qwen-VLA learning a predictive world model, or just a highly capable visuomotor policy? The distinction matters because a world model should predict the consequences of actions, not just execute them. Patrick should track whether Qwen-VLA's internal representations encode forward predictions or whether the model remains a purely reactive policy.
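
The policy versus world model distinction can be stated as a difference in interface. In the toy sketch below, a policy maps an observation to an action, while a world model maps an observation and an action to a predicted next observation, which allows imagined rollouts without touching the environment. All class and function names are hypothetical and say nothing about Qwen-VLA's internals.

```python
from typing import Protocol, Sequence

Observation = Sequence[float]
Action = Sequence[float]


class Policy(Protocol):
    def act(self, obs: Observation) -> Action: ...


class WorldModel(Protocol):
    def predict(self, obs: Observation, action: Action) -> Observation: ...


class GoToOrigin:
    """Toy reactive policy: step halfway toward 0 on a 1-D line."""
    def act(self, obs: Observation) -> Action:
        return [-0.5 * obs[0]]


class Integrator:
    """Toy world model: next position = position + action."""
    def predict(self, obs: Observation, action: Action) -> Observation:
        return [obs[0] + action[0]]


def rollout(policy: Policy, model: WorldModel, obs: Observation, steps: int) -> list[list[float]]:
    """Imagine a trajectory using only the model's predictions."""
    traj = [list(obs)]
    for _ in range(steps):
        obs = model.predict(obs, policy.act(obs))
        traj.append(list(obs))
    return traj


imagined = rollout(GoToOrigin(), Integrator(), [8.0], steps=3)
print(imagined)
```

A purely reactive policy only needs to implement `act`; evidence for a world model would be that something like `predict` can be recovered from the model and matches what actually happens after the action is executed.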