Robix: A Unified Model for
Robot Interaction, Reasoning and Planning

Huang Fang*, Mengxi Zhang*, Heng Dong*, Wei Li*†,
Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, Hang Li,
ByteDance Seed

* Equal Contribution  † Project Lead

Correspondence: liwei.85@bytedance.com, lihang.lh@bytedance.com



Abstract

We introduce Robix, a unified model that integrates robot reasoning, task planning, and natural language interaction within a single vision-language architecture. Acting as the high-level cognitive layer in a hierarchical robot system, Robix dynamically generates atomic commands for the low-level controller and verbal responses for human interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally with humans within an end-to-end framework. Robix further introduces novel capabilities such as proactive dialogue, real-time interruption handling, and context-aware commonsense reasoning during task execution. At its core, Robix leverages chain-of-thought reasoning and adopts a three-stage training strategy: (1) continued pretraining to enhance foundational embodied reasoning abilities, including 3D spatial understanding, visual grounding, and task-centric reasoning; (2) supervised finetuning to model human-robot interaction and task planning as a unified reasoning-action sequence; and (3) reinforcement learning to improve reasoning-action consistency and long-horizon task coherence. Extensive experiments show that Robix outperforms both open-source and commercial baselines (e.g., GPT-4o and Gemini 2.5 Pro) in interactive task execution, demonstrating strong generalization across diverse instruction types (e.g., open-ended, multi-stage, constrained, invalid, and interrupted) and various user-involved tasks such as table bussing, grocery shopping, and dietary filtering.


Overview of Robix

The main features of Robix are summarized as follows:

  • 🌟 Unified model. Robix is a single vision-language model that unifies robot reasoning, task planning, and human-robot interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally in an end-to-end manner (see the illustrative sketch after this list).
  • 🌟 Flexible interaction. Within this unified framework, Robix supports proactive dialogue to clarify ambiguity and infer user intent, real-time interruption handling that seamlessly incorporates feedback, and context-aware commonsense reasoning for complex, open-ended tasks.
  • 🌟 Robust Performance. We assess Robix in two setups: (i) on a curated interactive-task benchmark covering both in- and out-of-distribution scenarios with diverse instruction types, and (ii) across five real-world scenarios in a hierarchical robot system with both human teleoperation and an automatic VLA model as the low-level controller. These evaluations demonstrate that Robix consistently delivers strong performance across all settings.
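To make the unified formulation above concrete, the sketch below shows what a single Robix output turn could look like, combining chain-of-thought reasoning, an atomic command for the low-level controller, and a verbal response in one sequence. The field names and command syntax are illustrative assumptions, not the model's actual output format.

# Hypothetical shape of one unified reasoning-action output turn.
# All field names and the command syntax are assumptions for illustration only.
example_turn = {
    "reasoning": (
        "The user asked for a caffeine-free drink; the cola contains caffeine, "
        "so the herbal tea is the only valid choice on the table."
    ),
    "command": "pick_up('herbal tea bottle')",  # atomic command for the low-level controller
    "response": "Sure, I'll grab the herbal tea for you, since the cola has caffeine.",
}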


A demo of Robix, showcasing (1) complex instruction understanding with commonsense reasoning; (2) real-time interruption handling; (3) task-status monitoring and dynamic replanning; and (4) proactive dialogue to clarify ambiguous instructions or infer user intent.


Architecture of Robix

The figure below illustrates the hierarchical robot system, where Robix serves as the high-level cognitive layer, interpreting tasks and reasoning over multimodal inputs to generate language responses and action plans, while the low-level controller—typically a vision–language–action (VLA) model—executes the atomic commands produced by Robix. This hierarchical design enables the robot to interact seamlessly with both humans and the physical environment.


Illustration of the hierarchical robot system.


At each iteration, Robix directly processes visual observations from robot-mounted cameras and user utterances, selectively producing atomic action commands for the low-level controller and appropriate verbal responses. This iterative reasoning-action loop allows Robix to perform deliberate reasoning and generate contextually grounded behaviors.
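The loop below is a minimal sketch of this reasoning-action cycle; the interface names (get_observation, poll_utterance, execute, speak) and the structure of the model's output are our own assumptions for illustration, not an actual Robix or robot API.

# Minimal, hypothetical sketch of the high-level reasoning-action loop.
# Robix is treated as a black-box vision-language model; the robot and user
# interfaces are placeholders.
def interaction_loop(robix, robot, user):
    history = []                                   # accumulated dialogue and action context
    while True:
        image = robot.get_observation()            # frame from a robot-mounted camera
        utterance = user.poll_utterance()          # may be None if the user said nothing
        # Reason over the multimodal context, then decide whether to act,
        # to speak, or to do both in this iteration.
        step = robix.generate(images=[image], utterance=utterance, history=history)
        if step.command is not None:
            robot.execute(step.command)            # atomic command for the low-level controller (e.g., a VLA model)
        if step.response is not None:
            user.speak(step.response)              # verbal response to the human
        history.append(step)
        if step.task_done:                         # stop once the task is judged complete
            break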

Fundamental Perception & Reasoning Evaluation

We first evaluate the fundamental perception and reasoning capabilities of Robix on a comprehensive suite of public benchmarks, comparing it against state-of-the-art multimodal models including Qwen-2.5-VL-7B & 32B, RoboBrain-2.0-32B, Cosmos-Reason1-7B, Gemini-2.5-Pro, GPT-4o, Seed-1.5-VL, and Seed-1.5-VL-Think. The evaluation covers (1) robotics-relevant embodied reasoning—including 3D spatial understanding, visual grounding, and task-centric reasoning—and (2) general multimodal understanding and reasoning.


Performance of Robix on public vision-language benchmarks compared to prior models. The left side shows Robix and state-of-the-art open-source baselines, while the right side presents closed-source large commercial models. The highest score in each benchmark is highlighted in bold within each group.



Offline Evaluation of Robix

The offline evaluation enables fully automated assessment of planning and interaction capabilities using predefined evaluation sets. To thoroughly evaluate both interactive long-horizon planning and out-of-distribution (OOD) generalization, we design three dedicated evaluation sets:
  • AGIBot Evaluation Set: We manually select 16 high-frequency daily tasks from the AGIBot dataset and ensure none appear in the training data. This set primarily evaluates the model’s long-horizon task planning capability on OOD tasks.
  • Internal OOD Benchmark: We manually design 16 scripts covering task planning and diverse human–robot interaction scenarios, including table organization, dietary filtering, checkout packing, grocery shopping, and shoe cabinet organization. The benchmark includes tasks and items absent from the training data and is intended to evaluate interactive task execution in unseen scenarios.
  • Internal ID Benchmark: This evaluation set is randomly sampled from our synthesized data and divided into six groups by task type and user instruction; each group targets a corresponding instruction-following or task-planning capability.


Offline evaluation results. Robix-7B-SFT-wo-R refers to our SFT model without chain-of-thought reasoning, while Robix-7B-RL denotes the final policy obtained by applying RL after SFT. For AGIBot, Internal OOD, and Internal ID–MultiStage/Constrained/Interrupt/OpenEnded, we report plan accuracy; for Internal ID–Invalid/Replan, we report F1 score. The best result for each evaluation set is shown in bold, and the best among baselines is underlined.
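For reference, the sketch below shows one plausible way to compute the reported metrics: plan accuracy as exact match between predicted and reference atomic commands, and F1 over the model's binary decisions (e.g., rejecting an invalid instruction or triggering a replan). These definitions are our own assumptions and do not reproduce the paper's evaluation code.

# Hedged sketch of the offline metrics; the exact-match and binary-decision
# formulations are assumptions, not the official evaluation protocol.
def plan_accuracy(predicted_plans, reference_plans):
    # Fraction of steps whose predicted atomic command exactly matches the reference.
    correct = sum(p == r for p, r in zip(predicted_plans, reference_plans))
    return correct / len(reference_plans)

def f1_score(predicted_flags, reference_flags):
    # F1 over binary decisions such as "instruction is invalid" or "replanning is needed".
    tp = sum(p and r for p, r in zip(predicted_flags, reference_flags))
    fp = sum(p and not r for p, r in zip(predicted_flags, reference_flags))
    fn = sum(not p and r for p, r in zip(predicted_flags, reference_flags))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0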



Online Evaluation of Robix

We deploy Robix within a hierarchical robot system across diverse real-world tasks, including:
  • Table Bussing: Clearing used tableware, utensils, and food items.
  • Checkout Packing: Organizing purchased goods during checkout and placing them into bags or boxes.
  • Dietary Filtering: Selecting or excluding food and beverages according to dietary restrictions (e.g., caffeine-free).
  • Grocery Shopping: Recommending and purchasing grocery items based on user instructions.
  • Tableware Organization & Shipment: Categorizing and packing tableware, then transporting it to designated locations.

We design two sets of experiments:
(1) In the first set of experiments, VLMs serve as the high-level planning and interaction module, while human labelers equipped with a UMI device act as the low-level controller, enabling evaluation under a fully reliable control setting.


Online evaluation results with a human labeler operating a UMI device as the low-level controller.


(2) In the second set, we use our in-house VLA model, GR-3, as the low-level controller and deploy the integrated VLM–VLA system on the ByteMini robot.


Online evaluation on the ByteMini robot with GR-3 model as the low-level controller.




BibTeX

@article{fang2025robix,
    title   = {Robix: A Unified Model for Robot Interaction, Reasoning and Planning},
    author  = {Huang Fang and Mengxi Zhang and Heng Dong and Wei Li and Zixuan Wang and Qifeng Zhang and Xueyun Tian and Yucheng Hu and Hang Li},
    journal = {arXiv preprint arXiv:2509.01106},
    year    = {2025}
}