> ## Documentation Index
> Fetch the complete documentation index at: https://docs.notte.cc/llms.txt
> Use this file to discover all available pages before exploring further.

# Scraping

> Extract content and structured data from web pages

<Tip>
  STOP. Do not read past this section until you have read and followed [/quickstart.md](/quickstart.md).

  Goal: produce a reliable SDK automation script. The intended authoring flow is generate then edit, not write browser automation from scratch.

  Do not hand-write SDK code with element IDs, selectors, or form fields from memory. Element IDs, selectors, and field mappings must come from a live `observe()` call, CLI `notte page observe` output, or generated workflow code. Guessed browser targets fail on real pages.

  Use the Notte skill and CLI to inspect the live site, validate actions/selectors, handle auth/session state, and export workflow code with `notte sessions workflow-code` before using SDK docs or SDK code.

  SDK reference pages are for understanding, running, or editing generated workflow code. They are not the starting point for manually authoring the initial browser automation. SDK-first code is guesswork on real-world pages with dynamic selectors, auth state, CAPTCHAs, and anti-bot behavior.
</Tip>

Extract web page content as markdown or structured data using LLM-powered extraction.

## Quick Start

Extract structured data from a page:

<CodeGroup>
  ```python Python theme={null}
  from pydantic import BaseModel
  from notte_sdk import NotteClient

  class HackerNewsPost(BaseModel):
      title: str
      url: str
      points: int
      author: str

  class HackerNewsFeed(BaseModel):
      posts: list[HackerNewsPost]

  client = NotteClient()

  result = client.scrape(
      url="https://news.ycombinator.com",
      response_format=HackerNewsFeed,
      instructions="Extract the top 5 posts from the front page"
  )

  for i, post in enumerate(result.data.posts, 1):
      print(f"{i}. {post.points} - {post.title}")
  ```

  ```javascript JavaScript theme={null}
  import { z } from 'zod';
  import { NotteClient } from 'notte-sdk';

  const HackerNewsPost = z.object({
    title: z.string(),
    url: z.string(),
    points: z.number(),
    author: z.string(),
  });

  const HackerNewsFeed = z.object({
    posts: z.array(HackerNewsPost),
  });

  const notte = new NotteClient({
    apiKey: process.env.NOTTE_API_KEY,
  });

  const result = await notte.scrape('https://news.ycombinator.com', {
    instructions: 'Extract the top 5 posts from the front page',
    response_format: HackerNewsFeed,
  });

  result.posts.forEach((post, index) => {
    console.log(`${index + 1}. ${post.points} - ${post.title}`);
  });
  ```
</CodeGroup>

## Scraping Methods

Notte provides two ways to scrape:

| Method               | Use Case                                    |
| -------------------- | ------------------------------------------- |
| `client.scrape(url)` | Quick, one-off scrapes                      |
| `session.scrape()`   | Scraping after navigation or authentication |

### Quick Scrape

For simple scraping without session management:

<CodeGroup>
  ```python Python theme={null}
  from notte_sdk import NotteClient

  client = NotteClient()

  # Returns markdown content
  markdown = client.scrape("https://example.com")
  ```

  ```javascript JavaScript theme={null}
  import { NotteClient } from 'notte-sdk';

  const client = new NotteClient({
    apiKey: process.env.NOTTE_API_KEY,
  });

  // Returns markdown content
  const markdown = await client.scrape('https://example.com');
  ```
</CodeGroup>

### Session-Based Scrape

For scraping after authentication or navigation:

<CodeGroup>
  ```python Python theme={null}
  from notte_sdk import NotteClient

  client = NotteClient()

  with client.Session() as session:
      # Navigate and authenticate
      session.execute(type="goto", url="https://example.com/login")
      session.execute(type="fill", selector="input[name='email']", value="user@example.com")
      session.execute(type="fill", selector="input[name='password']", value="password")
      session.execute(type="click", selector="button[type='submit']")

      # Navigate to protected page
      session.execute(type="goto", url="https://example.com/dashboard")

      # Scrape the page
      content = session.scrape()
  ```

  ```javascript JavaScript theme={null}
  import { NotteClient } from 'notte-sdk';

  const client = new NotteClient({
    apiKey: process.env.NOTTE_API_KEY,
  });

  await client.Session().use(async (session) => {
    // Navigate and authenticate
    await session.execute({ type: 'goto', url: 'https://example.com/login' });
    await session.execute({ type: 'fill', selector: "input[name='email']", value: 'user@example.com' });
    await session.execute({ type: 'fill', selector: "input[name='password']", value: 'password' });
    await session.execute({ type: 'click', selector: "button[type='submit']" });

    // Navigate to protected page
    await session.execute({ type: 'goto', url: 'https://example.com/dashboard' });

    // Scrape the page
    const content = await session.scrape();
  });
  ```
</CodeGroup>
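
The login example above hardcodes credentials for brevity. A real script would more likely read them from the environment, or from a vault (see Vaults under Next Steps). A minimal sketch, assuming hypothetical environment variable names `EXAMPLE_EMAIL` and `EXAMPLE_PASSWORD`; the `load_credentials` helper is illustrative and not part of the SDK:

```python
import os


def load_credentials(env=None):
    """Read login credentials from environment variables.

    EXAMPLE_EMAIL and EXAMPLE_PASSWORD are hypothetical names chosen for
    this sketch; use whatever names fit your deployment.
    """
    env = os.environ if env is None else env
    required = ("EXAMPLE_EMAIL", "EXAMPLE_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env["EXAMPLE_EMAIL"], env["EXAMPLE_PASSWORD"]
```

The returned pair can then replace the literal `value=` arguments in the `fill` steps.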

## Structured Extraction

Extract data into typed objects using Pydantic models in Python or Zod schemas in JavaScript. Extraction is powered by an LLM that reads the page content and fills in the fields your schema defines.

### Using Pydantic Models

Define a schema and extract matching data:

<CodeGroup>
  ```python Python theme={null}
  from notte_sdk import NotteClient
  from pydantic import BaseModel

  class Product(BaseModel):
      name: str
      price: float
      description: str


  client = NotteClient()
  product = client.scrape(
      "https://example.com/product", response_format=Product, instructions="Extract the product details"
  )

  print(f"Name: {product.name}, Price: {product.price}")
  ```

  ```javascript JavaScript theme={null}
  import { z } from 'zod';
  import { NotteClient } from 'notte-sdk';

  const Product = z.object({
    name: z.string(),
    price: z.number(),
    description: z.string(),
  });

  const client = new NotteClient({
    apiKey: process.env.NOTTE_API_KEY,
  });

  const product = await client.scrape('https://example.com/product', {
    response_format: Product,
    instructions: 'Extract the product details',
  });

  console.log(`Name: ${product.name}, Price: ${product.price}`);
  ```
</CodeGroup>

### Using Instructions Only

For flexible extraction without a strict schema:

<CodeGroup>
  ```python Python theme={null}
  from notte_sdk import NotteClient

  client = NotteClient()
  result = client.scrape(
      "https://example.com/article", instructions="Extract the article title, author, and publication date"
  )

  print(result)
  ```

  ```javascript JavaScript theme={null}
  import { NotteClient } from 'notte-sdk';

  const client = new NotteClient({
    apiKey: process.env.NOTTE_API_KEY,
  });

  const result = await client.scrape('https://example.com/article', {
    instructions: 'Extract the article title, author, and publication date',
  });

  console.log(result);
  ```
</CodeGroup>

### Extracting Lists

Extract multiple items from a page:

<CodeGroup>
  ```python Python theme={null}
  from notte_sdk import NotteClient
  from pydantic import BaseModel

  class Article(BaseModel):
      title: str
      url: str
      summary: str


  class ArticleList(BaseModel):
      articles: list[Article]


  client = NotteClient()
  articles = client.scrape(
      "https://news.example.com", response_format=ArticleList, instructions="Extract all articles from the homepage"
  )

  for article in articles.articles:
      print(f"{article.title}: {article.url}")
  ```

  ```javascript JavaScript theme={null}
  import { z } from 'zod';
  import { NotteClient } from 'notte-sdk';

  const Article = z.object({
    title: z.string(),
    url: z.string(),
    summary: z.string(),
  });

  const ArticleList = z.object({
    articles: z.array(Article),
  });

  const client = new NotteClient({
    apiKey: process.env.NOTTE_API_KEY,
  });

  const articles = await client.scrape('https://news.example.com', {
    response_format: ArticleList,
    instructions: 'Extract all articles from the homepage',
  });

  for (const article of articles.articles) {
    console.log(`${article.title}: ${article.url}`);
  }
  ```
</CodeGroup>
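
Once a list has been extracted, a common next step is to export it. The sketch below writes article rows to CSV using only the standard library. It assumes each article has already been converted to a plain dict (with Pydantic v2 models, `article.model_dump()` produces one); the `articles_to_csv` helper and the sample rows are illustrative, not part of the SDK.

```python
import csv
import io


def articles_to_csv(articles):
    """Serialize a list of article dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["title", "url", "summary"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(articles)
    return buffer.getvalue()


# Stand-in rows with the same fields as the Article schema above.
rows = [
    {"title": "First post", "url": "https://news.example.com/1", "summary": "Intro"},
    {"title": "Second, with comma", "url": "https://news.example.com/2", "summary": "More"},
]
print(articles_to_csv(rows))
```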

### Nested Structures

Handle complex, nested data:

<CodeGroup>
  ```python Python theme={null}
  from notte_sdk import NotteClient
  from pydantic import BaseModel

  class Address(BaseModel):
      street: str
      city: str
      country: str


  class Company(BaseModel):
      name: str
      description: str
      address: Address
      employee_count: int | None


  client = NotteClient()
  company = client.scrape(
      "https://example.com/about", response_format=Company, instructions="Extract company information including address"
  )

  print(company.address.city)
  ```

  ```javascript JavaScript theme={null}
  import { z } from 'zod';
  import { NotteClient } from 'notte-sdk';

  const Address = z.object({
    street: z.string(),
    city: z.string(),
    country: z.string(),
  });

  const Company = z.object({
    name: z.string(),
    description: z.string(),
    address: Address,
    employee_count: z.number().nullable(),
  });

  const client = new NotteClient({
    apiKey: process.env.NOTTE_API_KEY,
  });

  const company = await client.scrape('https://example.com/about', {
    response_format: Company,
    instructions: 'Extract company information including address',
  });

  console.log(company.address.city);
  ```
</CodeGroup>
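
Nested results often need flattening before they go into a spreadsheet or a flat database table. A minimal sketch, assuming the extracted company has been converted to a nested dict; the `flatten` helper and the dotted-key convention are choices made for this example:

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects such as the address.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Stand-in data shaped like the Company schema above.
company = {
    "name": "Example Corp",
    "description": "Makes examples",
    "address": {"street": "1 Main St", "city": "Paris", "country": "France"},
    "employee_count": None,
}
print(flatten(company)["address.city"])  # prints "Paris"
```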

## Image Extraction

Extract all images from a page:

<CodeGroup>
  ```python Python theme={null}
  from notte_sdk import NotteClient

  client = NotteClient()
  images = client.scrape("https://example.com/gallery", only_images=True)

  for image in images:
      print(f"URL: {image.url}")
      print(f"Description: {image.description}")
  ```

  ```javascript JavaScript theme={null}
  import { NotteClient } from 'notte-sdk';

  const client = new NotteClient({
    apiKey: process.env.NOTTE_API_KEY,
  });

  const images = await client.scrape('https://example.com/gallery', {
    only_images: true,
  });

  for (const image of images) {
    console.log(`URL: ${image.url}`);
    console.log(`Description: ${image.description}`);
  }
  ```
</CodeGroup>
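
Pages frequently reference the same image more than once, so you may want to deduplicate the results before downloading or storing them. The sketch below uses a stand-in `Image` dataclass with the `url` and `description` attributes shown above; the real objects returned by the SDK may carry more fields.

```python
from dataclasses import dataclass


@dataclass
class Image:
    """Stand-in for the items returned with only_images=True."""
    url: str
    description: str


def unique_images(images):
    """Return images with duplicate URLs removed, keeping the first occurrence."""
    seen = set()
    result = []
    for image in images:
        if image.url not in seen:
            seen.add(image.url)
            result.append(image)
    return result
```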

## Configuration Options

### Content Filtering

Control what content gets extracted:

<CodeGroup>
  ```python Python theme={null}
  # Only main content (excludes navbars, footers, sidebars)
  markdown = client.scrape(url, only_main_content=True)  # Default

  # Include all page content
  markdown = client.scrape(url, only_main_content=False)
  ```

  ```javascript JavaScript theme={null}
  // Only main content (excludes navbars, footers, sidebars)
  const markdown = await client.scrape(url, { only_main_content: true }); // Default

  // Include all page content
  const fullMarkdown = await client.scrape(url, { only_main_content: false });
  ```
</CodeGroup>

### Links and Images

Control link and image extraction:

<CodeGroup>
  ```python Python theme={null}
  # Include links (default)
  markdown = client.scrape(url, scrape_links=True)

  # Exclude links
  markdown = client.scrape(url, scrape_links=False)

  # Include images in markdown
  markdown = client.scrape(url, scrape_images=True)

  # Exclude images (default)
  markdown = client.scrape(url, scrape_images=False)
  ```

  ```javascript JavaScript theme={null}
  // Include links (default)
  const markdownWithLinks = await client.scrape(url, { scrape_links: true });

  // Exclude links
  const markdownWithoutLinks = await client.scrape(url, { scrape_links: false });

  // Include images in markdown
  const markdownWithImages = await client.scrape(url, { scrape_images: true });

  // Exclude images (default)
  const markdownWithoutImages = await client.scrape(url, { scrape_images: false });
  ```
</CodeGroup>

### Scoped Scraping

Scrape only a specific section of the page:

<CodeGroup>
  ```python Python theme={null}
  from notte_sdk import NotteClient

  client = NotteClient()

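  with client.Session() as session:
      # Navigate first so the session has a page loaded
      session.execute(type="goto", url="https://example.com")

      # Scrape content within a specific selector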
      content = session.scrape(selector="article.main-content")

      # Scrape a specific container
      content = session.scrape(selector="#product-details")
  ```

  ```javascript JavaScript theme={null}
  import { NotteClient } from 'notte-sdk';

  const client = new NotteClient({
    apiKey: process.env.NOTTE_API_KEY,
  });

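  await client.Session().use(async (session) => {
    // Navigate first so the session has a page loaded
    await session.execute({ type: 'goto', url: 'https://example.com' });

    // Scrape content within a specific selector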
    const articleContent = await session.scrape({
      selector: 'article.main-content',
    });

    // Scrape a specific container
    const productContent = await session.scrape({
      selector: '#product-details',
    });
  });
  ```
</CodeGroup>

### Link Placeholders

Reduce output size by replacing link and image URLs with placeholders:

<CodeGroup>
  ```python Python theme={null}
  # Use placeholders for links and images
  markdown = client.scrape(url, use_link_placeholders=True)
  ```

  ```javascript JavaScript theme={null}
  // Use placeholders for links and images
  const markdown = await client.scrape(url, {
    use_link_placeholders: true,
  });
  ```
</CodeGroup>
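
This page does not specify what Notte's placeholders look like, so the following is only a conceptual sketch of why the technique shrinks output: each distinct URL is swapped for a short token, and a lookup table is kept to resolve it later. The `link_N` naming and the `replace_urls_with_placeholders` helper are made up for illustration and are not part of the SDK.

```python
import re

# Matches inline markdown links and images: [text](url) or ![alt](url)
LINK_PATTERN = re.compile(r"(!?\[[^\]]*\])\(([^)\s]+)\)")


def replace_urls_with_placeholders(markdown):
    """Swap each distinct URL for a short placeholder; return (text, mapping)."""
    url_to_placeholder = {}

    def substitute(match):
        label, url = match.group(1), match.group(2)
        # Reuse the existing placeholder when the same URL appears again.
        placeholder = url_to_placeholder.setdefault(
            url, f"link_{len(url_to_placeholder) + 1}"
        )
        return f"{label}({placeholder})"

    compact = LINK_PATTERN.sub(substitute, markdown)
    # Invert the table so placeholders can be resolved back to URLs.
    mapping = {placeholder: url for url, placeholder in url_to_placeholder.items()}
    return compact, mapping
```

Long or repeated URLs benefit most, since every occurrence collapses to the same short token.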

## Return Types

The scrape method returns different types based on parameters:

| Parameters         | Return Type                 |
| ------------------ | --------------------------- |
| None               | `str` (markdown)            |
| `instructions`     | `StructuredData[BaseModel]` |
| `response_format`  | `StructuredData[YourModel]` |
| `only_images=True` | `list[ImageData]`           |

### StructuredData Response

When you pass `response_format`, the extracted fields are available as attributes on the returned object:

<CodeGroup>
  ```python Python theme={null}
  from notte_sdk import NotteClient
  from pydantic import BaseModel

  class Product(BaseModel):
      name: str
      price: float


  client = NotteClient()
  url = "https://example.com/product"
  product = client.scrape(url, response_format=Product)

  # Access the extracted data
  print(f"Name: {product.name}, Price: {product.price}")
  ```

  ```javascript JavaScript theme={null}
  import { z } from 'zod';
  import { NotteClient } from 'notte-sdk';

  const Product = z.object({
    name: z.string(),
    price: z.number(),
  });

  const client = new NotteClient({
    apiKey: process.env.NOTTE_API_KEY,
  });

  const url = 'https://example.com/product';
  const product = await client.scrape(url, {
    response_format: Product,
  });

  // Access the extracted data
  console.log(`Name: ${product.name}, Price: ${product.price}`);
  ```
</CodeGroup>

## Use Cases

### Data Collection

Collect product information:

<CodeGroup>
  ```python Python theme={null}
  from notte_sdk import NotteClient
  from pydantic import BaseModel


  class ProductInfo(BaseModel):
      name: str
      price: float
      rating: float | None
      reviews_count: int | None


  client = NotteClient()

  urls = [
      "https://store.example.com/product/1",
      "https://store.example.com/product/2",
  ]

  products: list[ProductInfo] = []
  for url in urls:
      data = client.scrape(url, response_format=ProductInfo)
      products.append(data)
  ```

  ```javascript JavaScript theme={null}
  import { z } from 'zod';
  import { NotteClient } from 'notte-sdk';

  const ProductInfo = z.object({
    name: z.string(),
    price: z.number(),
    rating: z.number().nullable(),
    reviews_count: z.number().nullable(),
  });

  const client = new NotteClient({
    apiKey: process.env.NOTTE_API_KEY,
  });

  const urls = [
    'https://store.example.com/product/1',
    'https://store.example.com/product/2',
  ];

  const products = [];
  for (const url of urls) {
    const data = await client.scrape(url, {
      response_format: ProductInfo,
    });
    products.push(data);
  }
  ```
</CodeGroup>
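
When collecting from many URLs, one failing page should not abort the whole batch. A minimal sketch of that pattern, where `scrape_fn` is a stand-in for any callable that takes a URL (for instance a small wrapper around `client.scrape(url, response_format=ProductInfo)`); the `collect` helper itself is hypothetical:

```python
def collect(urls, scrape_fn):
    """Scrape each URL, separating successful results from failures."""
    results = {}
    failures = {}
    for url in urls:
        try:
            results[url] = scrape_fn(url)
        except Exception as exc:
            # Record the error and keep going so one bad page
            # does not stop the rest of the batch.
            failures[url] = str(exc)
    return results, failures
```

Returning the failures alongside the results makes it easy to retry only the URLs that did not succeed.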

### Content Monitoring

Track content changes:

<CodeGroup>
  ```python Python theme={null}
  from notte_sdk import NotteClient

  client = NotteClient()

  # Get current content
  content = client.scrape("https://example.com/pricing", instructions="Extract all pricing tiers and their features")

  # Compare with previous version
  # ...
  ```

  ```javascript JavaScript theme={null}
  import { NotteClient } from 'notte-sdk';

  const client = new NotteClient({
    apiKey: process.env.NOTTE_API_KEY,
  });

  // Get current content
  const content = await client.scrape('https://example.com/pricing', {
    instructions: 'Extract all pricing tiers and their features',
  });

  // Compare with previous version
  // ...
  ```
</CodeGroup>
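
The comparison step left open above could look like the following sketch. It assumes the scraped content has been serialized to a string (for example with `str(content)`) and that you store the previous version yourself. The `fingerprint` and `diff_lines` helpers are illustrative and use only the standard library.

```python
import difflib
import hashlib


def fingerprint(content):
    """Return a stable SHA-256 hex digest for a piece of scraped content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def diff_lines(previous, current):
    """Return unified-diff lines describing how the content changed."""
    return list(
        difflib.unified_diff(
            previous.splitlines(),
            current.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )


# Stand-in snapshots of a pricing page.
previous = "Starter: $10/month\nPro: $30/month"
current = "Starter: $12/month\nPro: $30/month"

# Comparing digests is cheap; only compute the diff when something changed.
if fingerprint(previous) != fingerprint(current):
    print("\n".join(diff_lines(previous, current)))
```

Because LLM-based extraction can phrase the same content slightly differently between runs, comparing structured fields rather than free text may give fewer false positives.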

### Research and Analysis

Extract structured research data:

<CodeGroup>
  ```python Python theme={null}
  from notte_sdk import NotteClient
  from pydantic import BaseModel


  class ResearchPaper(BaseModel):
      title: str
      authors: list[str]
      abstract: str
      publication_date: str | None
      citations: int | None


  client = NotteClient()
  result = client.scrape("https://papers.example.com/paper/123", response_format=ResearchPaper)
  ```

  ```javascript JavaScript theme={null}
  import { z } from 'zod';
  import { NotteClient } from 'notte-sdk';

  const ResearchPaper = z.object({
    title: z.string(),
    authors: z.array(z.string()),
    abstract: z.string(),
    publication_date: z.string().nullable(),
    citations: z.number().nullable(),
  });

  const client = new NotteClient({
    apiKey: process.env.NOTTE_API_KEY,
  });

  const result = await client.scrape('https://papers.example.com/paper/123', {
    response_format: ResearchPaper,
  });
  ```
</CodeGroup>

## Best Practices

### 1. Use Specific Instructions

Clear instructions improve extraction accuracy:

<CodeGroup>
  ```python Python theme={null}
  # Good
  instructions = "Extract the product name, price in USD, and availability status"

  # Vague
  instructions = "Get product info"
  ```

  ```javascript JavaScript theme={null}
  // Good
  const instructions = 'Extract the product name, price in USD, and availability status';

  // Vague
  const vagueInstructions = 'Get product info';
  ```
</CodeGroup>

### 2. Define Precise Schemas

Match your schema to the actual page content:

<CodeGroup>
  ```python Python theme={null}
  from pydantic import BaseModel


  # Good - matches page structure
  class GoodProduct(BaseModel):
      name: str
      price: float
      in_stock: bool


  # Bad - fields that may not exist
  class BadProduct(BaseModel):
      name: str
      price: float
      manufacturer: str  # Page might not have this
      warranty: str  # Page might not have this
  ```

  ```javascript JavaScript theme={null}
  import { z } from 'zod';

  // Good - matches page structure
  const GoodProduct = z.object({
    name: z.string(),
    price: z.number(),
    in_stock: z.boolean(),
  });

  // Bad - fields that may not exist
  const BadProduct = z.object({
    name: z.string(),
    price: z.number(),
    manufacturer: z.string(), // Page might not have this
    warranty: z.string(), // Page might not have this
  });
  ```
</CodeGroup>

### 3. Handle Missing Data

Use optional fields for data that might not exist:

<CodeGroup>
  ```python Python theme={null}
  from pydantic import BaseModel


  class Product(BaseModel):
      name: str
      price: float
      discount_price: float | None = None  # Optional
      rating: float | None = None  # Optional
  ```

  ```javascript JavaScript theme={null}
  import { z } from 'zod';

  const Product = z.object({
    name: z.string(),
    price: z.number(),
    discount_price: z.number().nullable().optional(), // Optional
    rating: z.number().nullable().optional(), // Optional
  });
  ```
</CodeGroup>
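
Optional fields also need handling wherever the result is used. The sketch below mirrors the model above with a standard-library dataclass so it runs without extra dependencies; the `effective_price` and `format_rating` helpers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Product:
    """Stand-in mirroring the Pydantic model above."""
    name: str
    price: float
    discount_price: Optional[float] = None
    rating: Optional[float] = None


def effective_price(product):
    """Use the discounted price when present, otherwise the regular price."""
    return product.price if product.discount_price is None else product.discount_price


def format_rating(product):
    """Render the rating, or a fallback label when the page had none."""
    return "unrated" if product.rating is None else f"{product.rating:.1f}"
```

Checking `is None` rather than truthiness keeps legitimate zero values, such as a discount price of `0.0`, from being treated as missing.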

### 4. Scope Your Scrapes

Use selectors to focus on relevant content:

<CodeGroup>
  ```python Python theme={null}
  from notte_sdk import NotteClient

  client = NotteClient()

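  with client.Session() as session:
      # Navigate first so the session has a page loaded
      session.execute(type="goto", url="https://example.com/article")

      # Scrape only the main article, not comments or sidebar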
      content = session.scrape(selector="article.main")
  ```

  ```javascript JavaScript theme={null}
  import { NotteClient } from 'notte-sdk';

  const client = new NotteClient({
    apiKey: process.env.NOTTE_API_KEY,
  });

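  await client.Session().use(async (session) => {
    // Navigate first so the session has a page loaded
    await session.execute({ type: 'goto', url: 'https://example.com/article' });

    // Scrape only the main article, not comments or sidebar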
    const content = await session.scrape({
      selector: 'article.main',
    });
  });
  ```
</CodeGroup>

## Next Steps

<CardGroup cols={2}>
  <Card title="Browser Sessions" icon="browser" href="/concepts/sessions">
    Learn about session management
  </Card>

  <Card title="Browser Agents" icon="robot" href="/concepts/agents">
    Use AI agents for complex scraping
  </Card>

  <Card title="Vaults" icon="lock" href="/concepts/vaults">
    Store credentials for authenticated scraping
  </Card>

  <Card title="Functions" icon="function" href="/concepts/functions">
    Deploy scraping as serverless functions
  </Card>
</CardGroup>
