Scrape From HTML
POST /scrape_from_html
curl --request POST \
  --url https://api.example.com/scrape_from_html \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "selector": "<string>",
  "scrape_links": true,
  "scrape_images": false,
  "ignored_tags": [
    "<string>"
  ],
  "only_main_content": true,
  "only_images": false,
  "response_format": "<unknown>",
  "instructions": "",
  "use_link_placeholders": false,
  "frames": [
    {
      "frameUrl": "<string>",
      "frameData": "<string>"
    }
  ]
}
'
{
  "model_schema": {
    "success": true,
    "model_schema": {},
    "error": "<string>"
  },
  "scrape": {
    "success": true,
    "error": "<string>",
    "data": {}
  }
}

Authorizations

Authorization
string
header
required

The access token received from the authorization server in the OAuth 2.0 flow.
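As a rough illustration of how the bearer token is attached, the sketch below builds (but does not send) the POST request using only the Python standard library. The URL is the placeholder from the example request above, and `build_request` and the token value are hypothetical names introduced here for illustration.

```python
import json
import urllib.request

# Placeholder URL taken from the example request; replace with the real host.
API_URL = "https://api.example.com/scrape_from_html"


def build_request(token: str, body: dict) -> urllib.request.Request:
    """Build a POST request carrying the bearer token (hypothetical helper)."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# "my-token" is a stand-in for the access token obtained from the OAuth 2.0 flow.
req = build_request("my-token", {"instructions": ""})
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without network access.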

Headers

x-notte-request-origin
string | null
x-notte-sdk-version
string | null

Body

application/json
selector
string | null

Playwright selector to scope the scrape to. Only content inside this selector will be scraped.

scrape_links
boolean
default:true

Whether to scrape links from the page. Links are scraped by default.

scrape_images
boolean
default:false

Whether to scrape images from the page. Images are not scraped by default.

ignored_tags
string[] | null

HTML tags to exclude from the scrape.

only_main_content
boolean
default:true

Whether to scrape only the main content of the page. If true, navbars, footers, etc. are excluded.

only_images
boolean
default:false

Whether to scrape only images from the page. If true, the page content is excluded.

response_format
any | null

The response format to use for the scrape. You can use a Pydantic model or a JSON Schema dict (cf. https://docs.pydantic.dev/latest/concepts/json_schema/#generating-json-schema).

instructions
string
default:""

User-provided description of what needs to be scraped.

use_link_placeholders
boolean
default:false

Whether to use link/image placeholders to reduce the number of tokens in the prompt and the risk of hallucinations. This is an experimental feature and might not work as expected.

frames
FrameData · object[]

Frame data for all frames found in the page. The main frame must be the first entry.
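The body fields above can be assembled roughly as follows. This is a minimal sketch, not the official client: `build_body` is a hypothetical helper, the hand-written JSON Schema is only an example of a `response_format` value, and it assumes `frameData` holds each frame's HTML, which the field list does not state explicitly.

```python
import json


def build_body(main_frame, child_frames=(), instructions="", response_format=None):
    """Assemble a request body with the main frame listed first (hypothetical helper).

    Each frame is a (url, html) tuple; treating frameData as HTML is an assumption.
    """
    frames = [
        {"frameUrl": url, "frameData": html}
        for url, html in [main_frame, *child_frames]
    ]
    body = {
        "instructions": instructions,
        "only_main_content": True,
        "frames": frames,
    }
    if response_format is not None:
        body["response_format"] = response_format
    return body


# A plain JSON Schema dict, one of the two forms response_format accepts.
product_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["title"],
}

body = build_body(
    main_frame=("https://example.com/", "<html><body><h1>Widget</h1></body></html>"),
    child_frames=[("https://example.com/ad", "<html><body>ad</body></html>")],
    instructions="Extract the product title and price",
    response_format=product_schema,
)
print(json.dumps(body, indent=2))
```

Fields left out of the dict (for example `scrape_images` or `only_images`) fall back to the defaults listed above.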

Response

Successful Response

model_schema
SchemaGenerationResponse · object
required

Output schema generated by the LLM according to the instructions.

scrape
StructuredData[BaseModel] · object
required

Output from the scrape, validated against the model schema.
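One way a caller might unpack this response is sketched below. `extract_data` is a hypothetical helper, and it assumes that a failed step sets `success` to false and populates `error`, which the example response suggests but does not spell out.

```python
def extract_data(response: dict):
    """Return scrape.data, raising if either step reports failure (hypothetical helper)."""
    schema = response["model_schema"]
    if not schema.get("success"):
        # Assumption: `error` describes the failure when success is false.
        raise ValueError(f"schema generation failed: {schema.get('error')}")

    scrape = response["scrape"]
    if not scrape.get("success"):
        raise ValueError(f"scrape failed: {scrape.get('error')}")

    return scrape.get("data")


# Sample payload shaped like the example response; the values are made up.
sample = {
    "model_schema": {"success": True, "model_schema": {}, "error": None},
    "scrape": {"success": True, "error": None, "data": {"title": "Widget"}},
}
print(extract_data(sample))
```

Checking `model_schema` before `scrape` mirrors the order in which the two objects appear in the response; the actual server-side ordering is not documented here.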