# Web crawler
Crawl any public or authenticated website to ingest documentation directly from the web.
| Parameter | Description |
|---|---|
| URLs | The starting URL(s) for the crawler (e.g., https://docs.company.com). |
| URL prefix filters | The domain(s) from which data should be ingested. For example, when crawling from docs.stripe.com/getting-started, set the prefix to docs.stripe.com. This filter is required; without it, the crawl job will not complete successfully. |
| Enable recursive crawling | Follows and crawls every link reachable from the starting URLs. Strongly recommended. |
| Include Images | Ingest and cite images found on crawled pages. |
| Ignore URL parameters | Strips query parameters from URLs (e.g., ?ref=nav). Strongly recommended to avoid duplicate page ingestion. |
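To illustrate how the URL prefix filter and the ignore-URL-parameters option interact, here is a minimal sketch in Python. The function names are hypothetical, for illustration only; they are not part of the product.

```python
from urllib.parse import urlparse, urlunparse

def normalize_url(url: str, ignore_params: bool = True) -> str:
    """Strip query parameters and fragments so that
    .../page?ref=nav and .../page dedupe to a single page."""
    parts = urlparse(url)
    if ignore_params:
        parts = parts._replace(query="", fragment="")
    return urlunparse(parts)

def in_scope(url: str, prefixes: list[str]) -> bool:
    """A URL is crawled only if it starts with a configured prefix."""
    stripped = url.removeprefix("https://").removeprefix("http://")
    return any(stripped.startswith(p) for p in prefixes)

print(normalize_url("https://docs.stripe.com/getting-started?ref=nav"))
# https://docs.stripe.com/getting-started
print(in_scope("https://docs.stripe.com/getting-started", ["docs.stripe.com"]))
# True
print(in_scope("https://stripe.com/pricing", ["docs.stripe.com"]))
# False
```

This is why both settings matter together: the prefix keeps the crawl inside your documentation domain, and parameter stripping keeps tracking variants of the same page from being ingested twice.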
## Optional parameters
| Parameter | Description |
|---|---|
| Include / Exclude Regex | Regex patterns to include or exclude specific URLs during crawling. |
| Enable anchor tags | Uses HTML anchor tags to better segment page content, enabling more precise citations. |
| Ignore HTML classes | Class names for page elements to exclude from ingestion (e.g., navbars, footers). |
| Add headers to crawlers | Custom HTTP headers for authenticated pages (e.g., `Authorization: Bearer <token>`). |
| Enable dynamic crawling | Enables JavaScript rendering before crawling. Use for pages that load content dynamically. |
| Bypass Cloudflare verification | Experimental option for Cloudflare-protected pages. May not always succeed. |
| Enhance your data sources with knowledge graph | Builds a knowledge graph from crawled pages to improve answer quality. Recommended when recursive crawling is enabled. |
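As a sketch of how include/exclude regex filters might be evaluated, consider the snippet below. The precedence shown (exclude wins over include) is an assumption for illustration and may differ from the product's actual behavior.

```python
import re

def url_allowed(url: str, include=None, exclude=None) -> bool:
    # Assumption: exclude patterns take precedence over include patterns.
    if exclude and any(re.search(p, url) for p in exclude):
        return False
    if include:
        return any(re.search(p, url) for p in include)
    return True  # no include patterns means everything in scope is allowed

# Example: crawl only English docs pages, skipping the changelog.
include = [r"^https://docs\.company\.com/en/"]
exclude = [r"/changelog/"]

print(url_allowed("https://docs.company.com/en/api", include, exclude))
# True
print(url_allowed("https://docs.company.com/en/changelog/v2", include, exclude))
# False
```

Anchoring include patterns with `^` avoids accidentally matching the string in the middle of an unrelated URL, such as a redirect parameter.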