# Web crawler
Crawl any public or authenticated website to ingest documentation directly from the web.
| Parameter | Description |
|---|---|
| URLs | The starting URL(s) for the crawler (e.g., https://docs.company.com). |
| URL prefix filters | The domain(s) from which data should be ingested. For example, when crawling from docs.stripe.com/getting-started, set the prefix to docs.stripe.com. This filter is required; without it, the crawl job will not complete successfully. |
| Enable recursive crawling | Follows and crawls every link reachable from the starting URLs. Strongly recommended. |
| Include Images | Ingest and cite images found on crawled pages. |
| Ignore URL parameters | Strips query parameters from URLs (e.g., ?ref=nav). Strongly recommended to avoid duplicate page ingestion. |
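To illustrate how the URL prefix filter and the ignore-URL-parameters option interact, here is a minimal sketch in Python. The function names are hypothetical, for illustration only; they are not part of the product.

```python
from urllib.parse import urlparse, urlunparse

def normalize_url(url: str, ignore_params: bool = True) -> str:
    """Strip query parameters and fragments so that
    .../page?ref=nav and .../page dedupe to a single page."""
    parts = urlparse(url)
    if ignore_params:
        parts = parts._replace(query="", fragment="")
    return urlunparse(parts)

def in_scope(url: str, prefixes: list[str]) -> bool:
    """A URL is crawled only if it starts with a configured prefix."""
    stripped = url.removeprefix("https://").removeprefix("http://")
    return any(stripped.startswith(p) for p in prefixes)

print(normalize_url("https://docs.stripe.com/getting-started?ref=nav"))
# https://docs.stripe.com/getting-started
print(in_scope("https://docs.stripe.com/getting-started", ["docs.stripe.com"]))
# True
print(in_scope("https://stripe.com/pricing", ["docs.stripe.com"]))
# False
```

This is why both settings matter together: the prefix keeps the crawl inside your documentation domain, and parameter stripping keeps tracking variants of the same page from being ingested twice.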
## Optional parameters
| Parameter | Description |
|---|---|
| Include / Exclude Regex | Regex patterns to include or exclude specific URLs during crawling. |
| Enable anchor tags | Uses HTML anchor tags to better segment page content, enabling more precise citations. |
| Ignore HTML classes | Class names for page elements to exclude from ingestion (e.g., navbars, footers). |
| Add headers to crawlers | Custom HTTP headers for authenticated pages (e.g., `Authorization: Bearer <token>`). |
| Enable dynamic crawling | Enables JavaScript rendering before crawling. Use for pages that load content dynamically. |
| Bypass Cloudflare verification | Experimental option for Cloudflare-protected pages. May not always succeed. |
| Enhance your data sources with knowledge graph | Builds a knowledge graph from crawled pages to improve answer quality. Recommended when recursive crawling is enabled. |
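As a sketch of how include/exclude regex filters might be evaluated, consider the snippet below. The precedence shown (exclude wins over include) is an assumption for illustration and may differ from the product's actual behavior.

```python
import re

def url_allowed(url: str, include=None, exclude=None) -> bool:
    # Assumption: exclude patterns take precedence over include patterns.
    if exclude and any(re.search(p, url) for p in exclude):
        return False
    if include:
        return any(re.search(p, url) for p in include)
    return True  # no include patterns means everything in scope is allowed

# Example: crawl only English docs pages, skipping the changelog.
include = [r"^https://docs\.company\.com/en/"]
exclude = [r"/changelog/"]

print(url_allowed("https://docs.company.com/en/api", include, exclude))
# True
print(url_allowed("https://docs.company.com/en/changelog/v2", include, exclude))
# False
```

Anchoring include patterns with `^` avoids accidentally matching the string in the middle of an unrelated URL, such as a redirect parameter.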