
Overview
The Browser-Using Agent (BUA) follows the Computer-Using Agent (CUA) model popularized by OpenAI and adapts it to browser environments. OpenAI-style CUA models typically combine the vision capabilities of LLMs with a reasoning system to control computer interfaces and perform tasks. A Browser-Using Agent instead treats the browser as the only interface the agent interacts with. Browsers are a special kind of computer interface: giving an AI agent access to the DOM of the page, in addition to screenshots, can greatly improve its performance. BUA is available through the `bua/completions` endpoint.
How it works
On top of the usual CUA inputs (a screenshot and a prompt), BUA also takes the DOM of the page, which improves its understanding of and reasoning about web pages. This is explained in the figure below.
1. Send a request to `bua/completions`
Include the computer tool as part of the available tools, specifying the display size and environment. You can also include a screenshot of the initial state of the environment in the first request.
2. Receive a response from the BUA model
The response will contain a list of actions to take to make progress towards the specified goal. These actions could be clicking at a given position, typing in text, scrolling, or even waiting.
3. Execute the requested action
Execute the corresponding action in your browser environment through code.
4. Capture the updated state
After executing the action, capture the updated state of the environment as a screenshot.
5. Repeat
Send a new request with the updated state as a `computer_call_output`, and repeat this loop until the model stops requesting actions or you decide to stop.
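The five steps above can be sketched as a small driver loop. This is a minimal sketch and not the actual BUA client: `send_request`, `execute_action`, and `capture_state` are hypothetical callables you would supply, and the request dictionary shape (`goal`, `state`) is an assumption made for illustration, not the real `bua/completions` schema.

```python
from typing import Any, Callable, Dict, List

Action = Dict[str, Any]


def run_bua_loop(
    send_request: Callable[[Dict[str, Any]], List[Action]],
    execute_action: Callable[[Action], None],
    capture_state: Callable[[], Dict[str, Any]],
    goal: str,
    max_steps: int = 20,
) -> int:
    """Drive the request/act/capture loop; return the number of actions executed."""
    executed = 0
    # Steps 1: the first request carries the goal and the initial state.
    # The field names here are hypothetical, not the real request schema.
    request = {"goal": goal, "state": capture_state()}
    for _ in range(max_steps):  # max_steps is the "you decide to stop" condition
        actions = send_request(request)  # step 2: receive the suggested actions
        if not actions:
            break  # the model stopped requesting actions
        for action in actions:
            execute_action(action)  # step 3: act on the browser
            executed += 1
        # Steps 4 and 5: capture the updated state and send it back.
        request = {"goal": goal, "state": capture_state()}
    return executed
```

Passing the three operations in as callables keeps the loop independent of any particular HTTP client or browser automation library, so it can be exercised with stubs before it is wired to a real browser.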
Setting up your environment
Before you can use BUA, you need a browser environment that can capture screenshots and DOM snapshots of a given web page. We recommend using `playwright` for this purpose.
You can check out the bua-playwright library for an example implementation, in particular:
computer.screenshot()
computer.dom()
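To illustrate what such an environment needs to provide, here is a minimal sketch using Playwright's Python sync API directly. It is not the bua-playwright implementation of `computer.screenshot()` or `computer.dom()`: the function name `capture_state`, the returned keys (`screenshot_b64`, `dom`), the viewport size, and the example URL are all assumptions chosen for illustration. Only `page.screenshot()` and `page.content()` are real Playwright calls.

```python
import base64
from typing import Any, Dict


def capture_state(page: Any) -> Dict[str, str]:
    """Capture a screenshot and a DOM snapshot from a Playwright-style page."""
    png_bytes = page.screenshot()  # Playwright returns the image as bytes (PNG by default)
    html = page.content()          # full HTML serialization of the current document
    return {
        # Hypothetical field names; base64 makes the image safe to embed in JSON.
        "screenshot_b64": base64.b64encode(png_bytes).decode("ascii"),
        "dom": html,
    }


def demo() -> None:
    """Open a page in headless Chromium and print the size of the captured state."""
    # Imported lazily so capture_state can be reused without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto("https://example.com")
        state = capture_state(page)
        print(len(state["screenshot_b64"]), len(state["dom"]))
        browser.close()
```

Calling `demo()` requires the `playwright` package and an installed Chromium build (`playwright install chromium`). `capture_state` itself only relies on the page object exposing `screenshot()` and `content()`.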
Integrating the BUA loop
1. Send a request to the model
The first request will contain the initial state of the environment, which is a screenshot of the page and the DOM of the page.
2. Receive a suggested action
The response will contain a list of actions to take to make progress towards the specified goal. These actions could be clicking at a given position, typing in text, scrolling, or even waiting.
3. Execute the action in your environment
How you map a browser call to actions through code depends on your environment. If you are using `playwright` as your browser automation library, we already have a library that maps the browser calls to playwright actions: bua-playwright-agent.
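If you want to write the mapping yourself, the following sketch shows one way to dispatch the four action kinds mentioned above (click, type, scroll, wait) onto Playwright's Python sync API. The action dictionary shape (`type`, `x`, `y`, `text`, `delta_x`, `delta_y`, `ms`) is an assumption made for illustration and may not match what the BUA model actually returns, and this is not how bua-playwright-agent is implemented. The Playwright calls themselves (`page.mouse.click`, `page.keyboard.type`, `page.mouse.wheel`, `page.wait_for_timeout`) are real.

```python
from typing import Any, Dict


def execute_action(page: Any, action: Dict[str, Any]) -> None:
    """Apply one model-suggested action to a Playwright page.

    The action schema used here is hypothetical.
    """
    kind = action["type"]
    if kind == "click":
        # Click at viewport coordinates.
        page.mouse.click(action["x"], action["y"])
    elif kind == "type":
        # Type text into whichever element currently has focus.
        page.keyboard.type(action["text"])
    elif kind == "scroll":
        # Dispatch a wheel event; positive delta_y scrolls down.
        page.mouse.wheel(action.get("delta_x", 0), action.get("delta_y", 0))
    elif kind == "wait":
        # Pause for the given number of milliseconds.
        page.wait_for_timeout(action.get("ms", 1000))
    else:
        raise ValueError(f"Unsupported action type: {kind!r}")
```

Raising on unknown action types, rather than silently skipping them, makes it obvious when the model requests something the environment does not yet support.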