Browser-Using Agent (BUA)
Build a browser-using agent that can perform tasks on your behalf on the web
Overview
The Browser-Using Agent (BUA) follows the Computer-Using Agent (CUA) model popularized by OpenAI but adapts it to browser environments.
Traditional OpenAI-like CUA models usually combine the vision capabilities of LLMs with reasoning systems to simulate controlling computer interfaces and perform tasks. Browser-Using Agents focus exclusively on the browser as the interface the agent interacts with. Browsers are a special kind of computer interface: giving an AI agent access to the DOM of the page, in addition to what it sees on screen, can greatly improve its performance.
BUA is available through the `bua/completions` endpoint.
How it works
In addition to the screenshot and prompt used by the traditional CUA approach, BUA also takes the DOM of the page as input, which improves its understanding of and reasoning about web pages. The figure below illustrates this.
Send a request to `bua/completions`
Include the computer tool as part of the available tools, specifying the display size and environment. You can also include a screenshot of the initial state of the environment in the first request.
Receive a response from the BUA model
The response will contain a list of actions to take to make progress towards the specified goal. These actions could be clicking at a given position, typing in text, scrolling, or even waiting.
Execute the requested action
Execute the corresponding action in your browser environment through code.
Capture the updated state
After executing the action, capture the updated state of the environment as a screenshot.
Repeat
Send a new request with the updated state as a `computer_call_output`, and repeat this loop until the model stops requesting actions or you decide to stop.
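The five steps above can be sketched as a small driver loop. This is only an illustrative sketch, not the official client: `run_bua_loop` and its three callables are hypothetical names, and the state and action dictionaries are assumed shapes rather than the actual `bua/completions` schema. Building the real request, including wrapping the new state as a `computer_call_output`, is left to whatever you pass as `request_actions`.

```python
from typing import Any, Callable, Dict, List

# Assumed shapes, for illustration only (not the real API schema).
State = Dict[str, Any]    # e.g. {"screenshot": ..., "dom": ...}
Action = Dict[str, Any]   # e.g. {"type": "click", "x": 10, "y": 20}


def run_bua_loop(
    request_actions: Callable[[State], List[Action]],
    execute_action: Callable[[Action], None],
    capture_state: Callable[[], State],
    max_steps: int = 20,
) -> int:
    """Run the request/execute/capture loop.

    Returns the number of requests sent to the model. The loop ends when
    the model returns no actions or when `max_steps` requests were made.
    """
    state = capture_state()                  # initial state of the environment
    for step in range(1, max_steps + 1):
        actions = request_actions(state)     # steps 1-2: send request, receive actions
        if not actions:                      # the model stopped requesting actions
            return step
        for action in actions:
            execute_action(action)           # step 3: execute each requested action
        state = capture_state()              # step 4: capture the updated state
    return max_steps                         # we decided to stop (step cap reached)
```

The `max_steps` cap is one simple way to implement "or you decide to stop"; a timeout or a human confirmation step would work equally well.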
Setting up your environment
Before you can use BUA, you need a browser environment that can capture screenshots and DOM snapshots of a given web page. We recommend using Playwright for this purpose.
You can check out the bua-playwright library for an example implementation, in particular:
computer.screenshot()
computer.dom()
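If you would rather see what such a capture might look like directly in Playwright, here is a minimal sketch using its Python sync API. It is not the bua-playwright implementation: `capture_state`, `demo`, the returned keys, and the viewport size are all assumptions made for illustration. Only `page.screenshot()` and `page.content()` are real Playwright calls.

```python
import base64
from typing import Any, Dict


def capture_state(page: Any) -> Dict[str, str]:
    """Return a base64-encoded screenshot and the serialized DOM of `page`.

    `page` is expected to behave like a Playwright sync `Page`.
    The key names below are hypothetical, not an official schema.
    """
    png_bytes = page.screenshot()   # Playwright returns the image as bytes (PNG by default)
    html = page.content()           # full HTML contents of the page as a string
    return {
        "screenshot_b64": base64.b64encode(png_bytes).decode("ascii"),
        "dom": html,
    }


def demo(url: str = "https://example.com") -> Dict[str, str]:
    """Open `url` in headless Chromium and capture its state."""
    # Imported lazily so that capture_state can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(url)
        state = capture_state(page)
        browser.close()
        return state
```

Calling `demo()` requires `pip install playwright` followed by `playwright install chromium`. Base64 is used here only because it is a common way to embed binary images in JSON requests.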
Integrating the BUA loop
1. Send a request to the model
The first request contains the initial state of the environment: a screenshot of the page and its DOM.
2. Receive a suggested action
The response will contain a list of actions to take to make progress towards the specified goal. These actions could be clicking at a given position, typing in text, scrolling, or even waiting.
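To make these action kinds concrete, the sketch below dispatches one action at a time onto Playwright's Python sync API. The action dictionaries and their field names (`type`, `x`, `y`, `text`, `delta_x`, `delta_y`, `ms`) are assumptions for illustration and may not match the actual response format. The Playwright calls themselves (`page.mouse.click`, `page.keyboard.type`, `page.mouse.wheel`, `page.wait_for_timeout`) are real.

```python
from typing import Any, Dict


def execute_action(page: Any, action: Dict[str, Any]) -> None:
    """Apply one model-suggested action to a Playwright-like `page`.

    The action schema here is hypothetical; adapt the field names to
    whatever the model response actually contains.
    """
    kind = action["type"]
    if kind == "click":
        # Click at the given viewport coordinates.
        page.mouse.click(action["x"], action["y"])
    elif kind == "type":
        # Type text into whichever element currently has focus.
        page.keyboard.type(action["text"])
    elif kind == "scroll":
        # Dispatch a wheel event; positive delta_y scrolls down.
        page.mouse.wheel(action.get("delta_x", 0), action.get("delta_y", 0))
    elif kind == "wait":
        # Pause for the given number of milliseconds.
        page.wait_for_timeout(action.get("ms", 1000))
    else:
        raise ValueError(f"unsupported action type: {kind!r}")
```

Raising on unknown action types, rather than silently ignoring them, keeps the loop from drifting out of sync with what the model believes has happened.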
3. Execute the action in your environment
How you map a browser call to actions in code depends on your environment. If you are using Playwright as your browser automation library, we already provide a library that maps the browser calls to Playwright actions: