Control a Windows 11 VM with OmniParser + your vision model of choice.
- OmniParser V2 is 60% faster than V1 and now understands a wide variety of OS, app, and in-app icons!
- OmniBox uses 50% less disk space than other Windows VMs for agent testing, whilst providing the same computer use API
- OmniTool supports the following vision models out of the box: OpenAI (4o/o1/o3-mini), DeepSeek (R1), Qwen (2.5VL), or Anthropic Computer Use
There are three components:
Component | Description |
---|---|
omniparserserver | FastAPI server running OmniParser V2. |
omnibox | A Windows 11 VM running in a Docker container. |
gradio | UI to provide commands and watch reasoning + execution on OmniBox |

OmniParser V2 | Watch Video |
---|---|
OmniTool | Watch Video |
- Although OmniParser V2 can run on a CPU, we have separated it out as its own server so that you can run it faster on a GPU machine.
- The OmniBox Windows 11 VM Docker container depends on KVM, so it only runs quickly on Windows and Linux. It can run on a CPU machine (no GPU needed).
- The Gradio UI can also run on a CPU machine. We suggest running omnibox and gradio on the same CPU machine and omniparserserver on a GPU server.
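
As a quick pre-flight for the machine that will host OmniBox, the sketch below checks whether a KVM device is visible. It assumes a Linux host, where KVM is exposed as `/dev/kvm`; the check is only meaningful there. The helper name `kvm_available` is hypothetical and not part of OmniTool.

```python
import os


def kvm_available(device: str = "/dev/kvm") -> bool:
    """Return True if the KVM device node exists and is readable and writable.

    Assumption: a Linux host, where KVM is exposed as /dev/kvm.
    """
    return os.path.exists(device) and os.access(device, os.R_OK | os.W_OK)


if __name__ == "__main__":
    if kvm_available():
        print("KVM device found; the OmniBox VM can use hardware acceleration.")
    else:
        print("KVM device not found or not accessible; the VM may run slowly.")
```

If the device exists but the check still fails, the current user most likely lacks permission on the device node.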
- omniparserserver:
a. If you already have a conda environment for OmniParser, you can use that. Otherwise, follow the steps below to create one.
b. Ensure conda is installed, or install it from the Anaconda website. Check the installation with
conda --version
c. Navigate to the root of the repo with
cd OmniParser
d. Create a conda python environment with
conda create -n "omni" python==3.12
e. Set the python environment to be used with
conda activate omni
f. Install the dependencies with
pip install -r requirements.txt
g. Continue from here if you already had the conda environment.
h. Ensure the V2 weights are downloaded in the weights folder (the caption weights folder must be called icon_caption_florence). If not, download them with:
rm -rf weights/icon_detect weights/icon_caption weights/icon_caption_florence
for folder in icon_caption icon_detect; do huggingface-cli download microsoft/OmniParser-v2.0 --local-dir weights --repo-type model --include "$folder/*"; done
mv weights/icon_caption weights/icon_caption_florence
i. Navigate to the server directory with
cd OmniParser/omnitool/omniparserserver
j. Start the server with
python -m omniparserserver
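
Before starting the server, it can help to confirm that the weight folders from the download step are in place. The sketch below is a minimal check, assuming it is run from the repo root and that a non-empty folder means the download succeeded; the helper name `missing_weight_dirs` is hypothetical and not part of OmniParser.

```python
from pathlib import Path

# Folder names taken from the download step above (after the rename).
REQUIRED_WEIGHT_DIRS = ("icon_detect", "icon_caption_florence")


def missing_weight_dirs(weights_root: str = "weights") -> list[str]:
    """Return the required weight folders that are absent or empty."""
    root = Path(weights_root)
    missing = []
    for name in REQUIRED_WEIGHT_DIRS:
        folder = root / name
        # A folder counts as missing if it does not exist or has no entries.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_weight_dirs()
    if missing:
        print("Missing weight folders:", ", ".join(missing))
    else:
        print("All V2 weight folders are present.")
```

This only checks that the folders exist and are non-empty; it does not verify the contents of the weight files.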
- omnibox:
a. Install Docker Desktop
b. Visit Microsoft Evaluation Center, accept the Terms of Service, and download a Windows 11 Enterprise Evaluation (90-day trial, English, United States) ISO file [~6GB]. Rename the file to custom.iso and copy it to the directory OmniParser/omnitool/omnibox/vm/win11iso
c. Navigate to the VM management script directory with
cd OmniParser/omnitool/omnibox/scripts
d. Build the docker container [400MB] and install the ISO to a storage folder [20GB] with
./manage_vm.sh create
e. The first create stores a save of the VM state in vm/win11storage. After that, start the VM with
./manage_vm.sh start
and stop it with
./manage_vm.sh stop
To delete the VM, run
./manage_vm.sh delete
and then delete the OmniParser/omnitool/omnibox/vm/win11storage directory.
- gradio:
a. Navigate to the gradio directory with
cd OmniParser/omnitool/gradio
b. Ensure you have activated the conda python environment with
conda activate omni
c. Start the server with
python app.py --windows_host_url localhost:8006 --omniparser_server_url localhost:8000
d. Open the URL in the terminal output, set your API Key and start playing with the AI agent!
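
If the agent cannot reach the VM or the parser, a quick connectivity check can narrow down which side is down. The sketch below opens a TCP connection to each address passed to app.py above. It assumes the default addresses from that command; the helper names `split_host_port` and `is_reachable` are hypothetical and not part of OmniTool. A successful connection only shows that the port is open, not that the service behind it is healthy.

```python
import socket


def split_host_port(address: str) -> tuple[str, int]:
    """Split a 'host:port' string such as 'localhost:8006' into its parts."""
    host, _, port = address.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected 'host:port', got {address!r}")
    return host, int(port)


def is_reachable(address: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to 'host:port' succeeds within the timeout."""
    host, port = split_host_port(address)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Addresses copied from the app.py command above.
    for name, address in [
        ("omnibox", "localhost:8006"),
        ("omniparserserver", "localhost:8000"),
    ]:
        status = "reachable" if is_reachable(address) else "NOT reachable"
        print(f"{name} at {address}: {status}")
```

If omniparserserver runs on a separate GPU server, replace its address with that machine's host and port.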
To align with the Microsoft AI principles and Responsible AI practices, we conduct risk mitigation by training the icon caption model with Responsible AI data, which helps the model avoid, as much as possible, inferring sensitive attributes (e.g., race, religion, etc.) of individuals who happen to appear in icon images. At the same time, we encourage users to apply OmniParser only to screenshots that do not contain harmful or violent content. For OmniTool, we conducted a threat model analysis using the Microsoft Threat Modeling Tool. We advise keeping a human in the loop in order to minimize risk.
Kudos to the amazing resources that were invaluable in the development of our code: Claude Computer Use, OS World, Windows Agent Arena, and computer_use_ootb. We are grateful for the helpful suggestions and feedback provided by Francesco Bonacci, Jianwei Yang, Dillon DuPont, Yue Wu, and Anh Nguyen.