POST /scrape

Example request (Python):
from notte_sdk import NotteClient
from pydantic import BaseModel


class CompanyInfo(BaseModel):
    name: str
    foundation_year: int
    location: str


notte = NotteClient()
company_info = notte.scrape(
    url="https://www.ycombinator.com/companies/nottelabs",
    instructions="Extract the company info from the webpage",
    response_format=CompanyInfo,
)

Example response:

{
  "markdown": "<string>",
  "images": [
    {
      "url": "<string>",
      "category": "favicon",
      "description": "<string>"
    }
  ],
  "structured": {
    "success": true,
    "error": "<string>",
    "data": {}
  },
  "session": {
    "session_id": "<string>",
    "timeout_minutes": 123,
    "created_at": "2023-11-07T05:31:56Z",
    "closed_at": "2023-11-07T05:31:56Z",
    "last_accessed_at": "2023-11-07T05:31:56Z",
    "duration": "<string>",
    "status": "active",
    "steps": [
      {}
    ],
    "error": "<string>",
    "credit_usage": 123,
    "proxies": false,
    "browser_type": "chromium",
    "use_file_storage": false
  }
}

Authorizations

Authorization
string
header
required

The access token received from the authorization server in the OAuth 2.0 flow.

Headers

x-notte-request-origin
string | null
x-notte-sdk-version
string | null

Body

application/json
url
string
required
headless
boolean
default:true

Whether to run the session in headless mode.

solve_captchas
boolean
default:false

Whether to try to automatically solve captchas

timeout_minutes
integer
default:3

Session timeout in minutes. Cannot exceed the global timeout.

Required range: 0 < x <= 30
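
These session options can be passed with the scrape call. A minimal sketch, assuming the SDK forwards these body parameters as keyword arguments (as it does for url above); example.com is a placeholder URL:

from notte_sdk import NotteClient

notte = NotteClient()

# Run headful, attempt captcha solving, and cap the session at 5 minutes
# (timeout_minutes must satisfy 0 < x <= 30).
result = notte.scrape(
    url="https://example.com",
    headless=False,
    solve_captchas=True,
    timeout_minutes=5,
)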
proxies

List of custom proxies to use for the session. If True, the default proxies will be used.

  • NotteProxy
  • ExternalProxy
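
For example, a minimal sketch using the default proxies (keyword-argument passthrough assumed; the NotteProxy/ExternalProxy field shapes are not listed above, so they are omitted here):

from notte_sdk import NotteClient

notte = NotteClient()

# proxies=True enables the default Notte proxies; a list of
# NotteProxy/ExternalProxy objects would select custom ones instead.
result = notte.scrape(
    url="https://example.com",
    proxies=True,
)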
browser_type
enum<string>
default:chromium

The browser type to use. Can be chromium, chrome, firefox, or chrome-nightly.

Available options:
chromium,
chrome,
firefox,
chrome-nightly
user_agent
string | null

The user agent to use for the session

chrome_args
string[] | null

Override the Chrome instance arguments

viewport_width
integer | null

The width of the viewport

viewport_height
integer | null

The height of the viewport
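
As an illustration, a hedged sketch combining these browser options (keyword-argument passthrough assumed; the URL and user agent string are placeholders):

from notte_sdk import NotteClient

notte = NotteClient()

# Use Firefox with a fixed 1280x720 viewport and a custom user agent.
result = notte.scrape(
    url="https://example.com",
    browser_type="firefox",
    user_agent="Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    viewport_width=1280,
    viewport_height=720,
)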

cdp_url
string | null

The CDP URL of another remote session provider.

use_file_storage
boolean
default:false

Whether FileStorage should be attached to the session.

screenshot_type
enum<string>
default:last_action

The type of screenshot to use for the session.

Available options:
raw,
full,
last_action

scrape_links
boolean
default:true

Whether to scrape links from the page. Links are scraped by default.

scrape_images
boolean
default:false

Whether to scrape images from the page. Images are not scraped by default.

ignored_tags
string[] | null

HTML tags to ignore from the page

only_main_content
boolean
default:true

Whether to only scrape the main content of the page. If True, navbars, footers, etc. are excluded.

only_images
boolean
default:false

Whether to only scrape images from the page. If True, the page content is excluded.
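
A sketch combining these content filters (keyword-argument passthrough assumed; the URL is a placeholder):

from notte_sdk import NotteClient

notte = NotteClient()

# Keep only the main content and strip script/style/nav tags;
# images are excluded since scrape_images defaults to false.
result = notte.scrape(
    url="https://example.com/blog/post",
    only_main_content=True,
    ignored_tags=["script", "style", "nav"],
)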

response_format
any

The response format to use for the scrape. You can use a Pydantic model or a JSON Schema dict (cf. https://docs.pydantic.dev/latest/concepts/json_schema/#generating-json-schema).
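
For instance, the CompanyInfo model from the example above can be expressed as an equivalent JSON Schema dict (the shape Pydantic's model_json_schema() produces):

from notte_sdk import NotteClient

notte = NotteClient()

# JSON Schema equivalent of the CompanyInfo Pydantic model above.
schema = {
    "title": "CompanyInfo",
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "foundation_year": {"type": "integer"},
        "location": {"type": "string"},
    },
    "required": ["name", "foundation_year", "location"],
}
company_info = notte.scrape(
    url="https://www.ycombinator.com/companies/nottelabs",
    instructions="Extract the company info from the webpage",
    response_format=schema,
)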

instructions
string | null

Additional instructions to use for the scrape. E.g. 'Extract only the title, date and content of the articles.'

Whether to use link/image placeholders to reduce the number of prompt tokens and hallucinations. This is an experimental feature and might not work as expected.

Response

Successful Response

markdown
string
required

Markdown representation of the extracted data

session
object
required

Browser session information

images
ImageData · object[] | null

List of images extracted from the page (ID and download link)

structured
object | null

Structured data extracted from the page in JSON format
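
Putting the response fields together, a sketch of consuming a scrape result (attribute access is assumed to mirror the JSON keys shown above; the URL is a placeholder):

from notte_sdk import NotteClient

notte = NotteClient()
response = notte.scrape(url="https://example.com")

print(response.markdown)                 # always present
if response.structured is not None:      # only set for structured scrapes
    if response.structured.success:
        print(response.structured.data)
    else:
        print(response.structured.error)
for image in response.images or []:      # images may be null
    print(image.url, image.category)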
