Robix: A Unified Model for
Robot Interaction, Reasoning and Planning

Huang Fang*, Mengxi Zhang*, Heng Dong*, Wei Li*†,
Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, Hang Li,
ByteDance Seed

* Equal Contribution  † Project Lead

Correspondence: liwei.85@bytedance.com, lihang.lh@bytedance.com



Abstract

We introduce Robix, a unified model that integrates robot reasoning, task planning, and natural language interaction within a single vision-language architecture. Acting as the high-level cognitive layer in a hierarchical robot system, Robix dynamically generates atomic commands for the low-level controller and verbal responses for human interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally with humans within an end-to-end framework. Robix further introduces novel capabilities such as proactive dialogue, real-time interruption handling, and context-aware commonsense reasoning during task execution. At its core, Robix leverages chain-of-thought reasoning and adopts a three-stage training strategy: (1) continued pretraining to enhance foundational embodied reasoning abilities, including 3D spatial understanding, visual grounding, and task-centric reasoning; (2) supervised finetuning to model human-robot interaction and task planning as a unified reasoning-action sequence; and (3) reinforcement learning to improve reasoning-action consistency and long-horizon task coherence. Extensive experiments show that Robix outperforms both open-source and commercial baselines (e.g., GPT-4o and Gemini 2.5 Pro) in interactive task execution, demonstrating strong generalization across diverse instruction types (e.g., open-ended, multi-stage, constrained, invalid, and interrupted) and various user-involved tasks such as table bussing, grocery shopping, and dietary filtering.


Overview of Robix

The main features of Robix are summarized as follows:

  • 🌟 Unified model. Robix is a single vision-language model that unifies robot reasoning, task planning, and human-robot interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally in an end-to-end manner (see the illustrative sketch after this list).
  • 🌟 Flexible interaction. Within this unified framework, Robix supports proactive dialogue to clarify ambiguity and infer user intent, real-time interruption handling that seamlessly incorporates feedback, and context-aware commonsense reasoning for complex, open-ended tasks.
  • 🌟 Robust Performance. We assess Robix in two setups: (i) on a curated interactive-task benchmark covering both in- and out-of-distribution scenarios with diverse instruction types, and (ii) across five real-world scenarios in a hierarchical robot system with both human teleoperation and an automatic VLA model as the low-level controller. These evaluations demonstrate that Robix consistently delivers strong performance across all settings.
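To make the unified formulation above concrete, the sketch below shows what a single Robix output turn could look like, combining chain-of-thought reasoning, an atomic command for the low-level controller, and a verbal response in one sequence. The field names and command syntax are illustrative assumptions, not the model's actual output format.

# Hypothetical shape of one unified reasoning-action output turn.
# All field names and the command syntax are assumptions for illustration only.
example_turn = {
    "reasoning": (
        "The user asked for a caffeine-free drink; the cola contains caffeine, "
        "so the herbal tea is the only valid choice on the table."
    ),
    "command": "pick_up('herbal tea bottle')",  # atomic command for the low-level controller
    "response": "Sure, I'll grab the herbal tea for you, since the cola has caffeine.",
}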


A demo of Robix, showcasing (1) complex instruction understanding with commonsense reasoning; (2) real-time interruption handling; (3) task-status monitoring and dynamic replanning; and (4) proactive dialogue to clarify ambiguous instructions or infer user intent.


Architecture of Robix

The figure below illustrates the hierarchical robot system, where Robix serves as the high-level cognitive layer, interpreting tasks and reasoning over multimodal inputs to generate language responses and action plans, while the low-level controller—typically a vision–language–action (VLA) model—executes the atomic commands produced by Robix. This hierarchical design enables the robot to interact seamlessly with both humans and the physical environment.


Illustration of the hierarchical robot system.


At each iteration, Robix directly processes visual observations from robot-mounted cameras and user utterances, selectively producing atomic action commands for the low-level controller and appropriate verbal responses. This iterative reasoning-action loop allows Robix to perform deliberate reasoning and generate contextually grounded behaviors.
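The loop below is a minimal sketch of this reasoning-action cycle; the interface names (get_observation, poll_utterance, execute, speak) and the structure of the model's output are our own assumptions for illustration, not an actual Robix or robot API.

# Minimal, hypothetical sketch of the high-level reasoning-action loop.
# Robix is treated as a black-box vision-language model; the robot and user
# interfaces are placeholders.
def interaction_loop(robix, robot, user):
    history = []                                   # accumulated dialogue and action context
    while True:
        image = robot.get_observation()            # frame from a robot-mounted camera
        utterance = user.poll_utterance()          # may be None if the user said nothing
        # Reason over the multimodal context, then decide whether to act,
        # to speak, or to do both in this iteration.
        step = robix.generate(images=[image], utterance=utterance, history=history)
        if step.command is not None:
            robot.execute(step.command)            # atomic command for the low-level controller (e.g., a VLA model)
        if step.response is not None:
            user.speak(step.response)              # verbal response to the human
        history.append(step)
        if step.task_done:                         # stop once the task is judged complete
            break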

Fundamental Perception & Reasoning Evaluation

We first evaluate the fundamental perception and reasoning capabilities of Robix on a comprehensive suite of public benchmarks, comparing it against state-of-the-art multimodal models including Qwen-2.5-VL-7B & 32B, RoboBrain-2.0-32B, Cosmos-Reason1-7B, Gemini-2.5-Pro, GPT-4o, Seed-1.5-VL, and Seed-1.5-VL-Think. The evaluation covers (1) robotics-relevant embodied reasoning—including 3D spatial understanding, visual grounding, and task-centric reasoning—and (2) general multimodal understanding and reasoning.


Performance of Robix on public vision-language benchmarks compared to prior models. The left side shows Robix and state-of-the-art open-source baselines, while the right side presents closed-source large commercial models. The highest score in each benchmark is highlighted in bold within each group.



Offline Evaluation of Robix

The offline evaluation enables fully automated assessment of planning and interaction capabilities using predefined evaluation sets. To thoroughly evaluate both interactive long-horizon planning and out-of-distribution (OOD) generalization, we design three dedicated evaluation sets:
  • AGIBot Evaluation Set: We manually select 16 high-frequency daily tasks from the AGIBot dataset and ensure none appear in the training data. This set primarily evaluates the model’s long-horizon task planning capability on OOD tasks.
  • Internal OOD Benchmark: We manually design 16 scripts covering task planning and diverse human–robot interaction scenarios, including table organization, dietary filtering, checkout packing, grocery shopping, and shoe cabinet organization. The benchmark includes tasks and items absent from the training data and is intended to evaluate interactive task execution in unseen scenarios.
  • Internal ID Benchmark: This evaluation set is randomly sampled from our synthesized data and divided into six groups by task type and user instruction; each group targets a corresponding instruction-following or task-planning capability.


Offline evaluation results. Robix-7B-SFT-wo-R refers to our SFT model without chain-of-thought reasoning, while Robix-7B-RL denotes the final policy obtained by applying RL after SFT. For AGIBot, Internal OOD, and Internal ID–MultiStage/Constrained/Interrupt/OpenEnded, we report plan accuracy; for Internal ID–Invalid/Replan, we report F1 score. The best result for each evaluation set is shown in bold, and the best among baselines is underlined.
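For reference, the sketch below shows one plausible way to compute the reported metrics: plan accuracy as exact match between predicted and reference atomic commands, and F1 over the model's binary decisions (e.g., rejecting an invalid instruction or triggering a replan). These definitions are our own assumptions and do not reproduce the paper's evaluation code.

# Hedged sketch of the offline metrics; the exact-match and binary-decision
# formulations are assumptions, not the official evaluation protocol.
def plan_accuracy(predicted_plans, reference_plans):
    # Fraction of steps whose predicted atomic command exactly matches the reference.
    correct = sum(p == r for p, r in zip(predicted_plans, reference_plans))
    return correct / len(reference_plans)

def f1_score(predicted_flags, reference_flags):
    # F1 over binary decisions such as "instruction is invalid" or "replanning is needed".
    tp = sum(p and r for p, r in zip(predicted_flags, reference_flags))
    fp = sum(p and not r for p, r in zip(predicted_flags, reference_flags))
    fn = sum(not p and r for p, r in zip(predicted_flags, reference_flags))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0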



Online Evaluation of Robix

We deploy Robix within a hierarchical robot system across diverse real-world tasks, including:
  • Table Bussing: Clearing used tableware, utensils, and food items.
  • Checkout Packing: Organizing purchased goods during checkout and placing them into bags or boxes.
  • Dietary Filtering: Selecting or excluding food and beverages according to dietary restrictions (e.g., caffeine-free).
  • Grocery Shopping: Recommending and purchasing grocery items based on user instructions.
  • Tableware Organization & Shipment: Categorizing and packing tableware, then transporting it to designated locations.

We design two sets of experiments:
(1) In the first set of experiments, VLMs serve as the high-level planning and interaction module, while human labelers equipped with a UMI device act as the low-level controller, enabling evaluation under a fully reliable control setting.


Online evaluation results with a human labeler operating a UMI device as the low-level controller.


(2) In the second set, we use our in-house VLA model, GR-3, as the low-level controller and deploy the integrated VLM–VLA system on the ByteMini robot.


Online evaluation on the ByteMini robot with GR-3 model as the low-level controller.




BibTeX

@article{fang2025robix,
    title   = {Robix: A Unified Model for Robot Interaction, Reasoning and Planning},
    author  = {Huang Fang and Mengxi Zhang and Heng Dong and Wei Li and Zixuan Wang and Qifeng Zhang and Xueyun Tian and Yucheng Hu and Hang Li},
    journal = {arXiv preprint arXiv:2509.01106},
    year    = {2025}
}