Xiaomi is best known for smartphones, smart home gear, and the occasional electric vehicle update. Now it wants a place in robotics research too.
The company has announced Xiaomi-Robotics-0, an open-source vision-language-action (VLA) model with 4.7 billion parameters. It’s designed to combine visual understanding, language comprehension, and real-time action execution, which Xiaomi says are the core of “physical intelligence.” And according to the company, it’s already setting multiple state-of-the-art records in both simulations and real-world tests.

At a high level, robotics models like this solve a closed loop: perception, decision, and execution. A robot needs to see the world, understand what it’s being asked to do, decide on a plan, and then carry it out smoothly. Xiaomi says Robotics-0 was built specifically to balance broad understanding with fine motor control.
The Xiaomi-Robotics-0 model is built on two main components
To do that, the model uses what’s known as a Mixture-of-Transformers (MoT) architecture. It splits responsibilities between two main components.
The first is a Visual Language Model (VLM), which acts as the “brain.” It’s trained to interpret human instructions — including vague ones like “Please fold the towel” — and understand spatial relationships from high-resolution visual input. This part handles object detection, visual question answering, and logical reasoning.

The second component is what Xiaomi calls the Action Expert, built around a multi-layer Diffusion Transformer (DiT). Instead of producing one action at a time, it generates an "Action Chunk" (a short sequence of movements), using flow-matching techniques to keep motion accurate and smooth.
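To make the flow-matching idea concrete, here is a minimal toy sketch: an "action expert" turns pure noise into an action chunk by Euler-integrating a velocity field, conditioned on context features from the language side. The linear `velocity_field`, the shapes, and all names are illustrative assumptions, not Xiaomi's implementation.

```python
import numpy as np

CHUNK_LEN, ACTION_DIM = 8, 7   # e.g. 8 steps of 7-DoF arm commands

rng = np.random.default_rng(0)
# Frozen toy weights standing in for the multi-layer DiT.
W = rng.normal(scale=0.1, size=(ACTION_DIM, ACTION_DIM))

def velocity_field(x, t, context):
    # Predicts the instantaneous velocity that carries the noisy sample
    # toward a clean action chunk, conditioned on VLM context features
    # (modeled here as a simple additive bias).
    return x @ W + context - x * t

def generate_chunk(context, steps=10):
    x = rng.normal(size=(CHUNK_LEN, ACTION_DIM))  # start from pure noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_field(x, t, context)  # Euler integration
    return x

context = rng.normal(size=(ACTION_DIM,))  # stand-in for VLM features
chunk = generate_chunk(context)
print(chunk.shape)  # (8, 7): one chunk of movements, executed as a unit
```

The point of generating a whole chunk rather than single actions is that the robot receives a coherent short trajectory, which is easier to execute smoothly.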
One common issue with VLA models is that when they learn to perform physical actions, they tend to lose some of their original understanding capabilities. Xiaomi says it avoided that by co-training the model on both multimodal data and action data. The result, at least in theory, is a system that can still reason about the world while learning how to move within it.
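The co-training recipe can be pictured as interleaving the two kinds of batches so neither objective dominates. The datasets and loss names below are invented for illustration; the source only states that multimodal data and action data are trained together.

```python
import itertools

# Toy multimodal (vision-language) samples: image + question.
multimodal = [("img_a", "what is on the table?"),
              ("img_b", "describe the scene")]
# Toy action samples: image + instruction + target action vector.
action = [("img_c", "fold the towel", [0.1, 0.2]),
          ("img_d", "pick up the block", [0.3, 0.0])]

schedule = []
for mm, act in zip(itertools.cycle(multimodal), action):
    schedule.append(("multimodal_loss", mm))  # preserves VQA / reasoning
    schedule.append(("action_loss", act))     # teaches motor control

print([kind for kind, _ in schedule])
# alternates: multimodal_loss, action_loss, multimodal_loss, action_loss
```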
How is it trained?
The training process happens in stages. First, an “Action Proposal” mechanism forces the VLM to predict possible action distributions while interpreting images. This aligns its internal representation of what it sees with how actions are performed. After that, the VLM is frozen, and the DiT is trained separately to generate accurate action sequences from noise, relying on key-value features rather than discrete language tokens.
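The staged recipe can be sketched structurally: first the VLM learns with an action-proposal head, then it is frozen and the DiT is trained on its key-value features. All class and method names here are hypothetical placeholders for the described pipeline.

```python
class VLM:
    """Toy stand-in for the vision-language 'brain'."""
    def __init__(self):
        self.frozen = False

    def forward(self, image, instruction):
        # Returns key-value features the action expert conditions on,
        # plus an "Action Proposal" head output (stage 1 alignment).
        return {"kv": [0.1, 0.2], "action_proposal": [0.0, 0.0]}

    def freeze(self):
        self.frozen = True  # stage 2: VLM weights no longer update


class ActionExpertDiT:
    """Toy stand-in for the diffusion-based action expert."""
    def denoise(self, noise, kv):
        # Maps noise -> action sequence, conditioned on KV features
        # rather than on discrete language tokens.
        return [n + sum(kv) for n in noise]


# Stage 1: train the VLM so its Action Proposal head predicts action
# distributions while it interprets images.
vlm = VLM()
feats = vlm.forward(image=None, instruction="fold the towel")

# Stage 2: freeze the VLM, train the DiT to generate actions from noise.
vlm.freeze()
expert = ActionExpertDiT()
chunk = expert.denoise(noise=[0.0, 1.0, -1.0], kv=feats["kv"])
print(vlm.frozen, len(chunk))  # True 3
```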
Xiaomi also tackled another practical problem: inference latency, where delays between model predictions and physical movement create awkward pauses or unstable behavior.

Xiaomi says it implemented asynchronous inference, decoupling model computation from robot operation so that movements remain continuous even when the model takes extra time to think.
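A minimal sketch of that decoupling: the model computes the next action chunk in a background thread while the robot drains already-computed chunks from a queue, so execution never waits on inference. Timings and function names are illustrative assumptions.

```python
import queue
import threading
import time

chunk_queue = queue.Queue(maxsize=2)  # buffer between model and robot
executed = []

def model_inference(n_chunks):
    # Runs in the background: slow "thinking" does not stall the robot.
    for i in range(n_chunks):
        time.sleep(0.05)  # simulated model latency
        chunk_queue.put([f"step{i}.{j}" for j in range(4)])
    chunk_queue.put(None)  # sentinel: no more chunks

def robot_executor():
    # Runs in the foreground: executes whatever chunk is ready.
    while True:
        chunk = chunk_queue.get()  # next chunk, computed in parallel
        if chunk is None:
            break
        for action in chunk:       # motion continues uninterrupted
            executed.append(action)
            time.sleep(0.01)       # simulated actuation time

t = threading.Thread(target=model_inference, args=(3,))
t.start()
robot_executor()
t.join()
print(len(executed))  # 12 actions across 3 chunks
```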
To improve stability, Xiaomi is using a “Clean Action Prefix” technique, which feeds the previously predicted action back into the model to ensure smooth, jitter-free motion over time.
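One way to read "Clean Action Prefix" is that the tail of the previous chunk is passed through unnoised as the start of the next chunk, so consecutive chunks join without a jump. The sketch below shows that stitching under invented names and sizes; the toy `denoise` merely stands in for the DiT.

```python
import numpy as np

rng = np.random.default_rng(1)
CHUNK_LEN, PREFIX_LEN, DIM = 8, 2, 7

def denoise(x, steps=5):
    # Toy stand-in for the DiT: pull noisy actions toward zero.
    for _ in range(steps):
        x = x * 0.5
    return x

def next_chunk(prev_chunk):
    prefix = prev_chunk[-PREFIX_LEN:]  # clean, previously predicted actions
    noisy = rng.normal(size=(CHUNK_LEN - PREFIX_LEN, DIM))
    denoised = denoise(noisy)          # only the new steps are generated
    return np.vstack([prefix, denoised])  # prefix passes through untouched

prev = np.zeros((CHUNK_LEN, DIM))
chunk = next_chunk(prev)
print(chunk.shape, np.allclose(chunk[:PREFIX_LEN], prev[-PREFIX_LEN:]))
# (8, 7) True: the new chunk starts exactly where the old one ended
```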
Meanwhile, a Λ-shaped attention mask biases the model toward current visual input instead of relying too heavily on past states. The goal is to make the robot more responsive to sudden environmental changes.
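One plausible reading of the Λ shape, sketched below as an assumption: every token may attend to a global block of current-observation tokens, while past-state tokens only get a short causal window, so stale history is masked out. Block sizes are arbitrary.

```python
import numpy as np

N_OBS, N_PAST, WINDOW = 4, 8, 3  # current-frame tokens, past states, window
n = N_OBS + N_PAST

mask = np.zeros((n, n), dtype=bool)  # True = attention allowed
mask[:, :N_OBS] = True               # everyone sees the current visual input
for q in range(N_OBS, n):            # past-state tokens: causal, recent-only
    lo = max(N_OBS, q - WINDOW + 1)
    mask[q, lo:q + 1] = True

print(mask.shape)  # (12, 12); most allowed positions sit on current input
```

Under this mask, attention weight concentrates on fresh observations, which is one way to make a policy react quickly to sudden environmental changes.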
Xiaomi-Robotics-0 Benchmark
In benchmark testing, Xiaomi-Robotics-0 reportedly achieved state-of-the-art results in LIBERO, CALVIN, and SimplerEnv simulations, outperforming around 30 other models.
More interestingly, Xiaomi deployed it on a dual-arm robot platform in real-world experiments. In long-horizon tasks like folding towels and disassembling building blocks, Xiaomi says the robot demonstrated steady hand-eye coordination and handled both rigid and flexible objects without obvious breakdowns.
Unlike earlier VLA systems that often sacrificed multimodal reasoning once action training began, the Robotics-0 model retains strong visual and language capabilities, especially in tasks that blend perception with physical interaction.
Originally written by: Rajesh
Source: Gizmochina
Published on: 12 February 2026
Link to original article: Xiaomi announces Xiaomi-Robotics-0, its first-generation robot large-scale model