
Microsoft’s New AI Helps Robots Decide What To Do and Exactly Where To Act

New Microsoft AI trains robots to understand actions and locations together, improving real-world task performance.

by Editor

Microsoft, along with a consortium of academic researchers, has built a new benchmark called GroundedPlanBench to tackle a persistent problem in robotics: robots still struggle to decide what to do and where to do it at the same time.

Most current systems split these decisions into two steps. A vision-language model first creates a plan in natural language. Then another model turns that plan into actions. This split often leads to mistakes.

The issue shows up even in simple tasks. A robot told to discard paper cups may pick the wrong cup or invent steps that were never asked for. In cluttered environments, these errors become more frequent.

This happens because planning and spatial reasoning are handled separately, allowing errors in one stage to affect the next.

Planning meets spatial grounding

To tackle this, the team developed GroundedPlanBench to test whether AI models can plan tasks while also identifying exactly where each action should happen.

Instead of relying only on text, each action is tied to a specific location in an image. Basic actions like grasp, place, open, and close are linked to objects or positions, forcing the system to connect decisions with the physical world.

The benchmark includes more than 1,000 tasks built from real robot interactions. Some instructions are direct, such as placing a spoon on a plate. Others are more open-ended, like tidying a table.
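The article does not publish the benchmark's data format, but the idea of tying each primitive action to an object and an image location can be sketched roughly like this. All names and fields here are illustrative assumptions, not GroundedPlanBench's actual schema.

```python
from dataclasses import dataclass

@dataclass
class GroundedAction:
    verb: str                    # a primitive action: "grasp", "place", "open", "close"
    target: str                  # object label, e.g. "spoon"
    point: tuple[float, float]   # (x, y) pixel location in the scene image

@dataclass
class GroundedPlan:
    instruction: str
    steps: list[GroundedAction]

# A direct instruction becomes a short sequence of spatially anchored steps.
plan = GroundedPlan(
    instruction="place the spoon on the plate",
    steps=[
        GroundedAction("grasp", "spoon", (412.0, 233.0)),
        GroundedAction("place", "plate", (290.0, 310.0)),
    ],
)

# Every step must name both an action and a physical location,
# which is the constraint the benchmark is built to enforce.
assert all(s.verb in {"grasp", "place", "open", "close"} for s in plan.steps)
```

The point of such a representation is that a plan step without a valid location is simply not well-formed, so planning and grounding cannot drift apart.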

This mix is important because robots often fail when instructions are vague. Language that humans easily understand can be too ambiguous for machines, especially when multiple objects look similar.

In one example, a system was asked to put four napkins on a couch. It repeatedly chose the same napkin because the description did not clearly distinguish between them. Even more detailed phrases like “top-left napkin” were not precise enough for reliable execution.

The researchers note that “ambiguous language leads to non-executable actions,” highlighting a core limitation in current systems.

Learning from real tasks

To improve performance, the team also developed a training method called Video-to-Spatially Grounded Planning, or V2GP.

This system learns from videos of robots performing tasks. It detects when a robot interacts with objects, identifies those objects, and tracks their positions. The result is a structured plan that links every action to a specific location.

Using this approach, the team generated more than 40,000 grounded plans. These range from simple one-step actions to longer sequences involving up to 26 steps.

When models were trained on this data, their performance improved. They were better at choosing correct actions and linking them to the right objects. The system also reduced repeated mistakes, such as acting on the same item multiple times.

Still, challenges remain. Long and complex tasks are difficult, especially when instructions are indirect. The researchers said, “Models must reason over longer sequences of actions and maintain consistency across many steps.”

The study also compared this approach with traditional systems that separate planning and grounding. Those systems struggled with ambiguity, often mapping multiple actions to the same object or location.

By combining both steps into a single process, the new approach reduces this mismatch. It keeps decisions about actions and locations tightly connected.

The team suggests that future work could combine this method with predictive models that estimate the outcome of actions before they happen. This could help robots avoid mistakes in real time.

For now, the findings point to a clear direction for robotics. Systems that understand both actions and locations together are more likely to work in real-world environments.

The study was published on the arXiv preprint server.


Originally written by: Neetika Walter

Source: Interesting Engineering

Published on: 26 March 2026

Link to original article: Microsoft’s new AI helps robots decide what to do and exactly where to act
