AskUI is building the Vision Agent for enterprises, enabling secure automation of native devices.
Key features of AskUI include:
- Support for Windows, Linux, MacOS, Android and iOS device automation (Citrix supported)
- Support for single-step UI automation commands (RPA like) as well as agentic intent-based instructions
- In-background automation on Windows machines (agent can create a second session; you do not have to watch it take over mouse and keyboard)
- Flexible model use (hot swap of models) and infrastructure for reteaching of models (available on-premise)
- Secure deployment of agents in enterprise environments
Join the AskUI Discord.
Agent OS is a device controller that allows agents to take screenshots, move the mouse, click, and type on the keyboard across any operating system.
Linux
curl -L -o /tmp/AskUI-Suite-Latest-User-Installer-Linux-x64-Full.run https://files.askui.com/releases/Installer/Latest/AskUI-Suite-Latest-User-Installer-Linux-x64-Full.run
bash /tmp/AskUI-Suite-Latest-User-Installer-Linux-x64-Full.run
curl -L -o /tmp/AskUI-Suite-Latest-User-Installer-Linux-ARM64-Full.run https://files.askui.com/releases/Installer/Latest/AskUI-Suite-Latest-User-Installer-Linux-ARM64-Full.run
bash /tmp/AskUI-Suite-Latest-User-Installer-Linux-ARM64-Full.run
MacOS
curl -L -o /tmp/AskUI-Suite-Latest-User-Installer-MacOS-ARM64-Full.run https://files.askui.com/releases/Installer/Latest/AskUI-Suite-Latest-User-Installer-MacOS-ARM64-Full.run
bash /tmp/AskUI-Suite-Latest-User-Installer-MacOS-ARM64-Full.run
pip install askui
Note: Requires Python version >=3.10.
| | AskUI INFO | Anthropic INFO |
|---|---|---|
| ENV Variables | ASKUI_WORKSPACE_ID, ASKUI_TOKEN | ANTHROPIC_API_KEY |
| Supported Commands | click() | click(), get(), act() |
| Description | Faster Inference, European Server, Enterprise Ready | Supports complex actions |
To get started, set the environment variables required to authenticate with your chosen model provider.
Linux & MacOS
Use export to set an environment variable:
export ANTHROPIC_API_KEY=<your-api-key-here>
Windows PowerShell
Set an environment variable with $env:
$env:ANTHROPIC_API_KEY="<your-api-key-here>"
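Before starting an agent, it can help to check which provider credentials are actually present in the environment. The following is a minimal sketch and not part of the askui package: the helper name `configured_providers` is hypothetical, and it only inspects the variable names listed in the table above.

```python
import os

# Environment variables each provider needs, taken from the table above.
REQUIRED_ENV = {
    "askui": ("ASKUI_WORKSPACE_ID", "ASKUI_TOKEN"),
    "anthropic": ("ANTHROPIC_API_KEY",),
}


def configured_providers(env=None):
    """Return the providers whose required variables are all set and non-empty.

    Hypothetical helper for illustration; it does not validate the credentials.
    """
    env = os.environ if env is None else env
    return [
        provider
        for provider, names in REQUIRED_ENV.items()
        if all(env.get(name) for name in names)
    ]


if __name__ == "__main__":
    print("Configured providers:", configured_providers() or "none")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try without touching your shell configuration.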
You can test the Vision Agent with Hugging Face models via their Spaces API. Please note that the API is rate-limited, so for production use cases it is recommended to choose step 3a.
Note: Hugging Face Spaces host model demos provided by individuals not associated with Hugging Face or AskUI. Don't use these models on screens with sensitive information.
Supported Models:
AskUI/PTA-1
OS-Copilot/OS-Atlas-Base-7B
showlab/ShowUI-2B
Qwen/Qwen2-VL-2B-Instruct
Qwen/Qwen2-VL-7B-Instruct
Example Code:
agent.click("search field", model_name="OS-Copilot/OS-Atlas-Base-7B")
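Because the Spaces API is rate-limited, a single model may be temporarily unavailable. One possible way to cope is to try the supported models in order until one succeeds. This is a sketch under assumptions: `click_with_fallback` is a hypothetical helper, and it assumes a failed call surfaces as a Python exception, which the text above does not confirm.

```python
# Model names copied from the supported-models list above.
HF_MODELS = [
    "AskUI/PTA-1",
    "OS-Copilot/OS-Atlas-Base-7B",
    "showlab/ShowUI-2B",
    "Qwen/Qwen2-VL-2B-Instruct",
    "Qwen/Qwen2-VL-7B-Instruct",
]


def click_with_fallback(agent, locator, model_names=HF_MODELS):
    """Call agent.click() with each model in turn; return the model that worked.

    Hypothetical helper: `agent` is anything with a
    click(locator, model_name=...) method, such as a VisionAgent.
    """
    last_error = None
    for model_name in model_names:
        try:
            agent.click(locator, model_name=model_name)
            return model_name
        except Exception as error:  # assumption: failures raise exceptions
            last_error = error
    raise RuntimeError(f"No model could click {locator!r}") from last_error
```

Inside a `with VisionAgent() as agent:` block this would be called as `click_with_fallback(agent, "search field")`.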
You can use Vision Agent with UI-TARS if you provide your own UI-TARS API endpoint.
- Step: Host the model locally or in the cloud. More information about hosting UI-TARS can be found here.
- Step: Provide the TARS_URL and TARS_API_KEY environment variables to Vision Agent.
- Step: Use the model_name="tars" parameter in your click(), get() and act() commands.
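The second step above is easy to get wrong silently, so a quick pre-flight check can be useful. The snippet below is a minimal sketch; `missing_tars_settings` is a hypothetical helper that only checks that the two variables named above are set, not that the endpoint is reachable.

```python
import os

# Variable names taken from the steps above.
TARS_ENV_VARS = ("TARS_URL", "TARS_API_KEY")


def missing_tars_settings(env=None):
    """Return the UI-TARS variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in TARS_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_tars_settings()
    if missing:
        print("Set these before using model_name='tars':", ", ".join(missing))
    else:
        print("UI-TARS settings found.")
```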
from askui import VisionAgent

# Initialize your agent context manager
with VisionAgent() as agent:
    # Use the webbrowser tool to start browsing
    agent.tools.webbrowser.open_new("http://www.google.com")

    # Start to automate individual steps
    agent.click("url bar")
    agent.type("http://www.google.com")
    agent.keyboard("enter")

    # Extract information from the screen
    datetime = agent.get("What is the datetime at the top of the screen?")
    print(datetime)

    # Or let the agent work on its own, needs an Anthropic key set
    agent.act("search for a flight from Berlin to Paris in January")
Instead of relying on the default model for the entire automation script, you can specify a model for each click command using the model_name parameter.
| | AskUI | Anthropic |
|---|---|---|
| click() | askui-combo, askui-pta, askui-ocr | anthropic-claude-3-5-sonnet-20241022 |
Example: agent.click("Preview", model_name="askui-combo")
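When a script mixes models, it can be tidier to keep the choice next to each step. The following sketch shows one way to do that; `run_click_plan` is a hypothetical helper, and the choice of askui-combo as the fallback is only an illustration, not a documented default.

```python
def run_click_plan(agent, plan, default_model="askui-combo"):
    """Click each locator with its own model.

    `plan` is a list of (locator, model_name) pairs; a model_name of None
    means "use default_model". Returns the (locator, model) pairs actually used.
    Hypothetical helper: relies only on agent.click(locator, model_name=...).
    """
    used = []
    for locator, model_name in plan:
        chosen = model_name or default_model
        agent.click(locator, model_name=chosen)
        used.append((locator, chosen))
    return used
```

For example, `run_click_plan(agent, [("Preview", None), ("Login", "askui-ocr")])` would click "Preview" with askui-combo and "Login" with askui-ocr.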
Anthropic AI Models
Supported commands are: click(), type(), mouse_move(), get(), act()
Model Name | Info | Execution Speed | Security | Cost | Reliability |
---|---|---|---|---|---|
anthropic-claude-3-5-sonnet-20241022 | The Computer Use model from Anthropic is a Large Action Model (LAM), which can autonomously achieve goals, e.g. "Book me a flight from Berlin to Rome" | Slow, >1s per step | Model hosting by Anthropic | High, up to $1.50 per act | Not recommended for production usage |
Note: Configure your Anthropic Model Provider here
AskUI AI Models
Supported commands are: click(), type(), mouse_move()
Model Name | Info | Execution Speed | Security | Cost | Reliability |
---|---|---|---|---|---|
askui-pta | PTA-1 (Prompt-to-Automation) is a vision language model (VLM) trained by AskUI to address all kinds of UI elements by a textual description, e.g. "Login button", "Text login" | Fast, <500ms per step | Secure hosting by AskUI or on-premise | Low, <$0.05 per step | Recommended for production usage, can be retrained |
askui-ocr | AskUI OCR is an OCR model trained to address texts on UI screens, e.g. "Login", "Search" | Fast, <500ms per step | Secure hosting by AskUI or on-premise | Low, <$0.05 per step | Recommended for production usage, can be retrained |
askui-combo | AskUI Combo is a combination of the askui-pta and the askui-ocr models to improve accuracy. | Fast, <500ms per step | Secure hosting by AskUI or on-premise | Low, <$0.05 per step | Recommended for production usage, can be retrained |
askui-ai-element | AskUI AI Element allows you to address visual elements like icons or images by demonstrating what you are looking for. To do so, you crop out the element and give it a name. | Very fast, <5ms per step | Secure hosting by AskUI or on-premise | Low, <$0.05 per step | Recommended for production usage, deterministic behaviour |
Note: Configure your AskUI Model Provider here
Hugging Face AI Models (Spaces API)
Supported commands are: click(), type(), mouse_move()
Model Name | Info | Execution Speed | Security | Cost | Reliability |
---|---|---|---|---|---|
AskUI/PTA-1 | PTA-1 (Prompt-to-Automation) is a vision language model (VLM) trained by AskUI to address all kinds of UI elements by a textual description, e.g. "Login button", "Text login" | Fast, <500ms per step | Hugging Face hosted | Prices for Hugging Face hosting | Not recommended for production usage |
OS-Copilot/OS-Atlas-Base-7B | OS-Atlas-Base-7B is a Large Action Model (LAM), which can autonomously achieve goals, e.g. "Please help me modify VS Code settings to hide all folders in the explorer view". This model is not available in the act() command | Slow, >1s per step | Hugging Face hosted | Prices for Hugging Face hosting | Not recommended for production usage |
showlab/ShowUI-2B | showlab/ShowUI-2B is a Large Action Model (LAM), which can autonomously achieve goals, e.g. "Search in google maps for Nahant". This model is not available in the act() command | Slow, >1s per step | Hugging Face hosted | Prices for Hugging Face hosting | Not recommended for production usage |
Qwen/Qwen2-VL-2B-Instruct | Qwen/Qwen2-VL-2B-Instruct is a Visual Language Model (VLM) pre-trained on multiple datasets including UI data. This model is not available in the act() command | Slow, >1s per step | Hugging Face hosted | Prices for Hugging Face hosting | Not recommended for production usage |
Qwen/Qwen2-VL-7B-Instruct | [Qwen/Qwen2-VL-7B-Instruct](https://github.com/QwenLM/Qwen2.5-VLB) is a Visual Language Model (VLM) pre-trained on multiple datasets including UI data. This model is not available in the act() command | Slow, >1s per step | Hugging Face hosted | Prices for Hugging Face hosting | Not recommended for production usage |
Note: No authentication is required, but the API is rate-limited.
Self Hosted UI Models
Supported commands are: click(), type(), mouse_move(), get(), act()
Model Name | Info | Execution Speed | Security | Cost | Reliability |
---|---|---|---|---|---|
tars | UI-TARS is a Large Action Model (LAM) based on Qwen2 and fine-tuned by ByteDance on UI data, e.g. "Book me a flight to Rome" | Slow, >1s per step | Self-hosted | Depending on infrastructure | Not recommended for production usage out of the box |
Note: These models need to be self-hosted. (See here)
Under the hood, agents are using a set of tools. You can directly access these tools.
The controller for the operating system.
agent.tools.os.click("left", 2) # clicking
agent.tools.os.mouse(100, 100) # mouse movement
agent.tools.os.keyboard_tap("v", modifier_keys=["control"]) # Paste
# and many more
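The OS tool calls above can be combined into small helpers. The sketch below double-clicks at a screen coordinate; `double_click_at` is a hypothetical helper, and it assumes the second argument of `click` is the number of clicks, which the example above suggests but does not state.

```python
def double_click_at(agent, x, y):
    """Move the pointer to (x, y), then double-click the left mouse button.

    Hypothetical helper built only from the two OS tool calls shown above.
    Assumption: click("left", 2) means "left button, two clicks".
    """
    agent.tools.os.mouse(x, y)
    agent.tools.os.click("left", 2)
```

Inside a `with VisionAgent() as agent:` block, `double_click_at(agent, 100, 100)` would move to the coordinate used in the example above and double-click there.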
The webbrowser tool, powered by webbrowser, allows you to directly access web browsers in your environment.
agent.tools.webbrowser.open_new("http://www.google.com")
# also check out open and open_new_tab
The clipboard tool powered by pyperclip allows you to interact with the clipboard.
agent.tools.clipboard.copy("...")
result = agent.tools.clipboard.paste()
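The clipboard and OS tools work well together, for instance to enter long text by pasting instead of typing. This is a minimal sketch: `paste_text` is a hypothetical helper, "control" is the only modifier name that appears in this document, and any other value (for example on MacOS) is an assumption you would need to verify.

```python
def paste_text(agent, text, modifier="control"):
    """Put `text` on the clipboard, then send the paste shortcut.

    Hypothetical helper built from clipboard.copy() and os.keyboard_tap()
    as shown above. It overwrites whatever was on the clipboard before.
    """
    agent.tools.clipboard.copy(text)
    agent.tools.os.keyboard_tap("v", modifier_keys=[modifier])
```

The target input field needs to have focus before `paste_text(agent, "...")` is called, for example after an `agent.click(...)` on it.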
Want a better understanding of what your agent is doing? Set the log_level to DEBUG. You can also generate a report of the automation run by setting enable_report to True.
import logging

with VisionAgent(log_level=logging.DEBUG, enable_report=True) as agent:
    agent...
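If you prefer to switch verbosity without editing the script, the level can be read from the environment. The following is a sketch only: the variable name `AGENT_LOG_LEVEL` and the helper `log_level_from_env` are hypothetical and not something AskUI defines.

```python
import logging
import os

# Level names accepted by this sketch, mapped to the standard logging constants.
_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def log_level_from_env(env=None, variable="AGENT_LOG_LEVEL", default=logging.INFO):
    """Translate a level name from the environment into a logging constant.

    Hypothetical helper; unknown or missing values fall back to `default`.
    """
    env = os.environ if env is None else env
    name = env.get(variable, "").strip().upper()
    return _LEVELS.get(name, default)


if __name__ == "__main__":
    print("Selected log level:", log_level_from_env())
```

The result can then be passed on as `VisionAgent(log_level=log_level_from_env())`.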
Have multiple monitors? Choose which one to automate by setting display to 1 or 2.
with VisionAgent(display=1) as agent:
    agent...
AskUI Vision Agent is a versatile AI-powered framework that enables you to automate computer tasks in Python.
It connects Agent OS with powerful computer use models like Anthropic's Claude Sonnet 3.5 v2 and the AskUI Prompt-to-Action series. It is your entry point for building complex automation scenarios with detailed instructions, or for letting the agent explore new challenges on its own.
Agent OS is a custom-built OS controller designed to enhance your automation experience.
It offers powerful features like
- multi-screen support,
- support for all major operating systems (incl. Windows, MacOS and Linux),
- process visualizations,
- real Unicode character typing
More features, such as application selection, in-background automation and video streaming, are to be released soon.