Web crawler

Crawl any public or authenticated website to ingest documentation directly from the web.

| Parameter | Description |
| --- | --- |
| URLs | The starting URL(s) for the crawler (e.g., `https://docs.company.com`). |
| URL prefix filters | The domain(s) from which data should be ingested. For example, when crawling from `docs.stripe.com/getting-started`, set the prefix to `docs.stripe.com`. Without a prefix filter, the crawl job will not complete successfully. |
| Enable recursive crawling | Crawls all links reachable from the starting URLs. Strongly recommended. |
| Include Images | Ingests and cites images found on crawled pages. |
| Ignore URL parameters | Strips query parameters (e.g., `?ref=nav`) from URLs. Strongly recommended to avoid ingesting duplicate pages. |
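To see why stripping query parameters prevents duplicate ingestion, here is a minimal sketch of URL canonicalization using only the Python standard library. The function name and the example URL are illustrative, not part of the crawler's API:

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url: str) -> str:
    """Drop the query string and fragment so that variants of the same page
    (e.g., tracking parameters like ?ref=nav) collapse to one canonical URL,
    as the "Ignore URL parameters" option does."""
    scheme, netloc, path, _query, _fragment = urlsplit(url)
    return urlunsplit((scheme, netloc, path, "", ""))

strip_query("https://docs.stripe.com/getting-started?ref=nav")
# -> "https://docs.stripe.com/getting-started"
```

With this option enabled, `page?ref=nav` and `page?ref=footer` are treated as the same document rather than ingested twice.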

Optional parameters

| Parameter | Description |
| --- | --- |
| Include / Exclude Regex | Regex patterns that include or exclude specific URLs during crawling. |
| Enable anchor tags | Uses HTML anchor tags to segment page content, enabling more precise citations. |
| Ignore HTML classes | Class names of page elements to exclude from ingestion (e.g., navbars, footers). |
| Add headers to crawlers | Custom HTTP headers for authenticated pages (e.g., `Authorization: Bearer <token>`). |
| Enable dynamic crawling | Renders JavaScript before crawling. Use for pages that load content dynamically. |
| Bypass Cloudflare verification | Experimental option for Cloudflare-protected pages. May not always succeed. |
| Enhance your data sources with knowledge graph | Builds a knowledge graph from crawled pages to improve answer quality. Recommended when recursive crawling is enabled. |
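The prefix filter and the include/exclude regexes combine to decide which discovered links get crawled. The sketch below shows one plausible way that logic fits together; the prefix value, the patterns, and the `should_crawl` helper are all assumptions for illustration, not the crawler's actual implementation:

```python
import re

# Assumed example configuration: crawl only the docs.stripe.com prefix,
# include API reference pages, and skip changelog pages.
PREFIX = "https://docs.stripe.com"
INCLUDE = [re.compile(r"/api/")]
EXCLUDE = [re.compile(r"/changelog/")]

def should_crawl(url: str) -> bool:
    """Hypothetical filter: prefix check first, then exclude, then include."""
    if not url.startswith(PREFIX):
        return False                       # outside the URL prefix filter
    if any(p.search(url) for p in EXCLUDE):
        return False                       # matched an exclude pattern
    if INCLUDE and not any(p.search(url) for p in INCLUDE):
        return False                       # include patterns set, none matched
    return True
```

Exclude patterns taking precedence over include patterns is a common convention, but verify the actual precedence against the product's behavior before relying on it.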