Microsoft Open-Sources OmniParser: Intelligent Agent for Controlling PCs and Mobile Devices
Discover OmniParser: Microsoft's open-source screen parsing tool enabling AI to control PCs and smartphones with unprecedented accuracy.
Are Large Models That Control Computers Truly the Future?
In recent days, there has been an explosion of research and applications enabling large models to control computers (including desktops and smartphones).
First, Anthropic released the new Claude 3.5 Sonnet, which can control a computer. Then Honor introduced a global intelligent agent in MagicOS 9.0. Yesterday, Zhipu launched AutoGLM with "full-stack tool usage capability," and Huawei unveiled LiMAC, new research that lets AI operate smartphones the way humans do.
It's clear that this trend shows no sign of stopping.
Today, users discovered that Apple quietly released two versions of Ferret-UI (based on Gemma 2B and Llama 8B), a technology introduced in May that enables AI to understand smartphone screens.
Microsoft has also quietly open-sourced its research project OmniParser, a large-model-based screen parsing tool that converts UI screenshots into structured elements. Its ability to parse and understand UIs is reportedly the best so far, even surpassing GPT-4V.
With this tool, anyone might be able to create their own computer-controlling intelligent agent.
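To make "structured elements" concrete, here is a minimal Python sketch of what a parsed screen might look like once a screenshot has been turned into data. The `UIElement` class, its fields, and the `to_prompt` helper are assumptions made for illustration only; they are not OmniParser's actual output format or API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class UIElement:
    """Hypothetical record for one element detected on a screenshot."""
    element_id: int
    kind: str                          # e.g. "icon" or "text" (assumed categories)
    bbox: Tuple[int, int, int, int]    # (x_min, y_min, x_max, y_max) in pixels
    description: str                   # short label describing the element

    def center(self) -> Tuple[int, int]:
        """Return the pixel at the middle of the box, a natural click target."""
        x_min, y_min, x_max, y_max = self.bbox
        return ((x_min + x_max) // 2, (y_min + y_max) // 2)


def to_prompt(elements: List[UIElement]) -> str:
    """Render the elements as numbered lines a language model could read."""
    return "\n".join(
        f"[{e.element_id}] {e.kind}: {e.description}" for e in elements
    )


# Made-up example data, not real parser output.
elements = [
    UIElement(0, "icon", (40, 100, 160, 140), "Restaurants"),
    UIElement(1, "text", (200, 100, 400, 140), "Search"),
]
print(to_prompt(elements))
```

The idea is that a downstream model no longer has to reason over raw pixels: it can pick an element by its ID, and the bounding box tells the executor where to click.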
Let’s take a look at OmniParser in action. Consider this user task: “Save vegetarian-friendly restaurants in Johannesburg to my itinerary.”
OmniParser first parses all the elements on the Tripadvisor webpage and successfully identifies the "Restaurants" option. That option is then clicked (OmniParser only parses the screen; executing actions requires other models), which opens a search box.
After parsing again, the required keyword is not found, so "Johannesburg" is entered into the search box. After one more parse, the relevant search results are opened. The vegetarian option is then identified and selected, and finally the appropriate button is clicked to save the result to the itinerary. Task complete.
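The walkthrough above follows a parse, decide, act loop. The sketch below simulates that loop in Python under heavy assumptions: the screens, labels, transitions, and the rule-based `choose_action` are all hypothetical stand-ins. In a real agent, a screen parser such as OmniParser would replace `parse_screen`, and a separate model would replace the hard-coded preference list.

```python
from typing import Dict, List, Optional, Tuple

Action = Tuple[str, str]  # (verb, target label)

# Hypothetical screens: each maps to the labels a parser would report.
SCREENS: Dict[str, List[str]] = {
    "home": ["Hotels", "Restaurants"],
    "search": ["Search box"],
    "results": ["Johannesburg Restaurants"],
    "listing": ["Vegetarian friendly"],
    "filtered": ["Vegetarian friendly (selected)", "Save to itinerary"],
}

# Hypothetical effect of each action on the simulated UI.
TRANSITIONS: Dict[Tuple[str, str, str], str] = {
    ("home", "click", "Restaurants"): "search",
    ("search", "type", "Search box"): "results",
    ("results", "click", "Johannesburg Restaurants"): "listing",
    ("listing", "click", "Vegetarian friendly"): "filtered",
    ("filtered", "click", "Save to itinerary"): "done",
}

# Stand-in for the planning model: prefer actions closest to the goal.
PREFERENCES: List[Action] = [
    ("click", "Save to itinerary"),
    ("click", "Vegetarian friendly"),
    ("click", "Johannesburg Restaurants"),
    ("type", "Search box"),
    ("click", "Restaurants"),
]


def parse_screen(screen: str) -> List[str]:
    """Stub for the screen parser: return the labels visible on this screen."""
    return SCREENS[screen]


def choose_action(labels: List[str]) -> Optional[Action]:
    """Pick the most goal-relevant action whose target is currently visible."""
    for verb, target in PREFERENCES:
        if target in labels:
            return (verb, target)
    return None


def run(start: str = "home", max_steps: int = 10) -> List[Action]:
    """Re-parse the screen before every step until the task is done."""
    screen, history = start, []
    for _ in range(max_steps):
        if screen == "done":
            break
        action = choose_action(parse_screen(screen))
        if action is None:
            break  # nothing actionable is visible; give up
        history.append(action)
        screen = TRANSITIONS[(screen, *action)]
    return history


for step in run():
    print(step)
```

The key design point is that the screen is parsed again after every action, since each click or keystroke can change which elements are available.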
And if you want to check whether you can visit Bryce Canyon National Park, OmniParser can easily assist with that too.