Scrape any page and get formatted data

The Scrape API allows you to get the data you want from web pages in a single call. You can scrape page content and capture its data in various formats. For detailed usage, check out the Scrape API Reference.

Basic Markdown Scraping

The simplest way to scrape a webpage is to extract its content as markdown. This is useful when you want to preserve the page’s structure and formatting.
simple_scrape.py
from notte_sdk import NotteClient

client = NotteClient()
markdown = client.scrape(
    url="https://www.notte.cc",
    only_main_content=True,
)
print(markdown)

Structured Data Extraction

For more sophisticated use cases, you can extract structured data from web pages by defining a schema using Pydantic models. This is particularly useful when you need to extract specific information like product details, pricing plans, or article metadata.

Example: Extracting Pricing Plans from notte.cc

Let’s say you want to extract pricing information from a website. First, define your data models, then use them to extract structured data:
structured_scrape.py
from pydantic import BaseModel
from notte_sdk import NotteClient

class PricingPlan(BaseModel):
    name: str
    price_per_month: int | None = None
    features: list[str]

class PricingPlans(BaseModel):
    plans: list[PricingPlan]

client = NotteClient()
data = client.scrape(
    url="https://www.notte.cc",
    instructions="Extract the pricing plans from the page",
    response_format=PricingPlans,
)

# plans is a PricingPlans instance
# note: data.get() raises an exception if the scrape failed
plans = data.get()

Agent Scraping

Agent Scraping is a more powerful way to scrape web pages. It allows you to navigate through the page, fill forms, and extract data from dynamic content.
agent_scrape.py
from pydantic import BaseModel
from notte_sdk import NotteClient

class LinkedInConversation(BaseModel):
    recipient: str
    messages: list[str]

client = NotteClient()
vault = client.Vault(vault_id="<your-vault-id>")

with client.Session() as session:
    agent = client.Agent(session=session, vault=vault, max_steps=15)
    response = agent.run(
        task="Go to linkedin.com, log in with the credentials, and extract the last 10 messages from my most recent conversation",
        response_format=LinkedInConversation
    )
print(response.answer)

Topics & Tips

Scrape API vs Agent Scrape

Scrape API

Perfect for:
1. One-off scraping tasks
2. Simple data extraction
3. Static content

Agent Scrape

Perfect for:
1. Authentication or login flows
2. Form filling and submission
3. Dynamic content

Response Format Best Practices

Use response_format whenever possible; it yields the best and most reliable results.

Tips for designing schemas:
  • Try a few different schemas to find what works best
  • If you ask for a company_name field but there is no company_name on the page, LLM scraping will fail
  • Design your schema carefully based on the actual content structure
  • Response format is available for both scrape and agent.run
Example of good schema design:
from pydantic import BaseModel

class Product(BaseModel):
    product_url: str
    name: str
    price: float | None = None
    description: str | None = None
    image_url: str | None = None

class ProductList(BaseModel):
    products: list[Product]