
# Advanced Scraping

> How to build reliable web scrapers with Notte

<Tip>
  STOP. Do not read past this section until you have read and followed [/quickstart.md](/quickstart.md).

  Goal: produce a reliable SDK automation script. The intended authoring flow is generate then edit, not write browser automation from scratch.

  Do not hand-write SDK code with element IDs, selectors, or form fields from memory. Element IDs, selectors, and field mappings must come from a live `observe()` call, CLI `notte page observe` output, or generated workflow code. Guessed browser targets fail on real pages.

  Use the Notte skill and CLI to inspect the live site, validate actions/selectors, handle auth/session state, and export workflow code with `notte sessions workflow-code` before using SDK docs or SDK code.

  SDK reference pages are for understanding, running, or editing generated workflow code. They are not the starting point for manually authoring the initial browser automation. SDK-first code is guesswork on real-world pages with dynamic selectors, auth state, CAPTCHAs, and anti-bot behavior.
</Tip>

## Scrape any page and get formatted data

The Scrape API extracts data from a web page in a single call and can return the page content in several formats, such as markdown or structured data.
For detailed usage, check out the [Scrape API Reference](/sdk-reference/remotesession/scrape).

## Basic Markdown Scraping

The simplest way to scrape a webpage is to extract its content as markdown. This is useful when you want to preserve the page's structure and formatting.

{/* @sniptest testers/scraping/simple.py */}

```python simple_scrape.py theme={null}
from notte_sdk import NotteClient

client = NotteClient()
markdown = client.scrape(
    url="https://www.notte.cc",
    only_main_content=True,
)
print(markdown)
```

## Structured Data Extraction

For more sophisticated use cases, you can extract structured data from web pages by defining a schema using Pydantic models. This is particularly useful when you need to extract specific information like product details, pricing plans, or article metadata.

### Example: Extracting Pricing Plans from `notte.cc`

Let's say you want to extract pricing information from a website. First, define your data models, then pass them to `scrape` to extract structured data:

{/* @sniptest testers/scraping/structured.py */}

```python structured_scrape.py theme={null}
from notte_sdk import NotteClient
from pydantic import BaseModel


class PricingPlan(BaseModel):
    name: str
    price_per_month: int | None = None
    features: list[str]


class PricingPlans(BaseModel):
    plans: list[PricingPlan]


client = NotteClient()

# With response_format set, plans is a PricingPlans instance directly.
# Note: scrape() can raise ScrapeFailedError if extraction fails.
plans = client.scrape(
    url="https://www.notte.cc",
    instructions="Extract the pricing plans from the page",
    response_format=PricingPlans,
)
```
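
Because `scrape()` can raise `ScrapeFailedError` when extraction fails, it may be worth retrying the call a few times. The sketch below is a generic helper built only on the standard library. `call_with_retries` and its parameters are hypothetical names introduced for illustration and are not part of the Notte SDK. The exception class is passed in as an argument because this page does not show where `ScrapeFailedError` is imported from.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    retry_on: tuple[type[Exception], ...],
    attempts: int = 3,
    delay_seconds: float = 1.0,
) -> T:
    """Call fn(), retrying when it raises one of the retry_on exceptions.

    Hypothetical helper for illustration; not part of the Notte SDK.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            # Linear backoff: wait a little longer after each failed attempt.
            time.sleep(delay_seconds * attempt)
    raise AssertionError("unreachable")
```

Assuming you have imported `ScrapeFailedError` from wherever your SDK version exposes it, the structured scrape above could be wrapped as `call_with_retries(lambda: client.scrape(...), retry_on=(ScrapeFailedError,))`.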

## Agent Scraping

Agent scraping is a more powerful way to scrape web pages: an agent can navigate through a site, fill in forms, and extract data from dynamic content.

{/* @sniptest testers/scraping/agent.py */}

```python agent_scrape.py theme={null}
from notte_sdk import NotteClient
from pydantic import BaseModel


class LinkedInConversation(BaseModel):
    recipient: str
    messages: list[str]


client = NotteClient()
vault = client.Vault(vault_id="<your-vault-id>")

with client.Session() as session:
    agent = client.Agent(session=session, vault=vault, max_steps=15)
    response = agent.run(
        task="Go to linkedin.com, log in with the credentials, and extract the last 10 messages from my most recent conversation",
        response_format=LinkedInConversation,
    )
print(response.answer)
```

## Topics & Tips

### Scrape API vs Agent Scrape

<Columns cols={2}>
  <Card title="Scrape API" icon="flag">
    Perfect for

    **1. One-off scraping tasks**

    **2. Simple data extraction**

    **3. Static content**
  </Card>

  <Card title="Agent Scrape" icon="robot">
    Perfect for

    **1. Authentication or login flows**

    **2. Form filling and submission**

    **3. Dynamic content**
  </Card>
</Columns>
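
The comparison above can be read as a simple decision rule. The sketch below encodes it as a small function. `choose_scraping_mode` and its parameter names are hypothetical and not part of the Notte SDK; treat it as a summary of the two cards rather than a complete policy.

```python
def choose_scraping_mode(
    needs_login: bool = False,
    fills_forms: bool = False,
    dynamic_content: bool = False,
) -> str:
    """Return "agent" when the page needs interaction, otherwise "scrape".

    Hypothetical helper for illustration; not part of the Notte SDK.
    """
    # Any requirement listed on the Agent Scrape card points to the agent.
    if needs_login or fills_forms or dynamic_content:
        return "agent"
    # One-off, simple, static extraction fits the Scrape API.
    return "scrape"


print(choose_scraping_mode())                  # scrape
print(choose_scraping_mode(needs_login=True))  # agent
```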

### Response Format Best Practices

<Tip>
  Use `response_format` whenever possible. It yields the best and most reliable results.
</Tip>

**Tips for designing schemas:**

* Try a few different schemas to find what works best
* If your schema requires a `company_name` field but the page contains no company name, LLM scraping will fail
* Design your schema carefully based on the actual content structure
* Response format is available for both `scrape` and `agent.run`

**Example of good schema design:**

{/* @sniptest testers/guides/schema_design.py */}

```python schema_design.py theme={null}
from pydantic import BaseModel


class Product(BaseModel):
    product_url: str
    name: str
    price: float | None = None
    description: str | None = None
    image_url: str | None = None


class ProductList(BaseModel):
    products: list[Product]
```
