Data Sources
RunLLM can learn from a wide variety of data sources, including your documentation, customer conversations, and codebases. Each data connector is backed by a custom-built data pipeline that reads, classifies, and annotates the information as it's ingested into your assistant's knowledge base.
Configuring data sources is easy. Navigate to the config tab of your assistant page, click Add New next to Data, and select the data source you would like to add. Fill out the form on the next page, give RunLLM any necessary permissions, and you're done!
Once you add a data source, you can track its ingestion progress from the RunLLM dashboard. When you register the source, you should see a new entry in the data table on the config tab. Clicking on the entry will show you how much data RunLLM has ingested and what work is left.
RunLLM supports the following data sources:
- Web crawler
- Slack channels
- Discord channels
- GitHub Issues, Discussions, and PRs
- GitHub Repos
- Notion documents
- Confluence documents
- SharePoint documents
- Zendesk tickets
- Jira tickets
- Intercom tickets
- Guru cards
- YouTube videos
- Discourse forums
- File uploads
If you find that something you'd like to see is missing, please don't hesitate to reach out. Here are some of the data sources we're currently working on adding:
- Linear issues
- Salesforce tickets
- Google Docs
- Slab documents
Data Source Configuration
There are a handful of parameters that are shared across every data source:
- Name: Every data source requires a name. This should hopefully be self-explanatory.
- Update schedule: By default, every data source is ingested monthly. You can set the update schedule under the advanced configuration for each data source. A daily schedule updates every midnight, a weekly schedule on Sunday at midnight, and a monthly schedule on the 1st at midnight. If you would like to specify a cron string, you can use a Custom schedule (see the examples after this list).
- Data source group: By default, this should be left blank. This field can be used if there are priorities among different categories of data (e.g., prioritize documentation over Slack messages). Please contact our team if you'd like to set this up.
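For reference, if you choose a Custom schedule, standard five-field cron strings look like the following. These are generic cron examples, not RunLLM-specific syntax, and the variable names are purely illustrative.

```python
# Illustrative five-field cron strings (minute hour day-of-month month day-of-week).
DAILY = "0 0 * * *"              # every day at midnight (the Daily preset)
WEEKLY = "0 0 * * 0"             # Sundays at midnight (the Weekly preset)
MONTHLY = "0 0 1 * *"            # the 1st at midnight (the Monthly preset)
EVERY_SIX_HOURS = "0 */6 * * *"  # a tighter schedule for fast-moving sources
```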
Sometimes, data needs to be updated between scheduled ingestion jobs. To update your data source, go to the RunLLM dashboard, click on the data source you'd like to update, and click the Run Now button. Within a few minutes, your data source should be updated!
Web crawler
RunLLM can crawl any static or dynamic website, and its web crawler uses the structure of your pages to better understand the content you've provided.
Required parameters
- URLs: The URL(s) from which you'd like the crawler to start searching.
- Enable recursive crawling: Tells the crawler to ingest all links that are reachable from the URL(s) provided above. We strongly recommend enabling recursive crawling.
- Include images: Ingest images in addition to text when crawling your documentation. These images will be used to supplement answers to customer questions.
- Ignore URL parameters: Any query parameters (values that come after ? in the URL) will be ignored. We strongly recommend ignoring query parameters.
- URL prefix filters: The domains from which data should be ingested. For example, if you start crawling from docs.stripe.com/getting-started, the URL prefix should be docs.stripe.com. If you do not provide specific prefix filters, your ingestion job will not succeed. (See the sketch after this list for how these two settings interact.)
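To make these settings concrete, here is a minimal sketch of how query-parameter stripping and prefix filtering behave conceptually. It is an illustration, not RunLLM's crawler code; the prefix value is taken from the Stripe example above.

```python
from urllib.parse import urlparse, urlunparse

PREFIX_FILTERS = ["docs.stripe.com"]  # hypothetical value from the example above

def normalize(url: str) -> str:
    """Drop query parameters and fragments, per 'Ignore URL parameters'."""
    p = urlparse(url)
    return urlunparse((p.scheme, p.netloc, p.path, "", "", ""))

def in_scope(url: str) -> bool:
    """Keep only links whose host + path start with an allowed prefix."""
    p = urlparse(url)
    return any((p.netloc + p.path).startswith(prefix) for prefix in PREFIX_FILTERS)

print(normalize("https://docs.stripe.com/getting-started?ref=nav"))
# -> https://docs.stripe.com/getting-started
print(in_scope("https://docs.stripe.com/api"))  # True
print(in_scope("https://stripe.com/blog"))      # False
```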
Optional parameters
- Enable anchor tags: Uses header tags in the structure of your page to better segment and ingest your data. This also enables the RunLLM assistant to cite specific parts of your documentation.
- Ignore HTML classes: If your website includes HTML classes for supplementary components like headers, footers, or navigation bars, include those classes here. Removing these components helps improve the quality of the data ingested.
- Add headers to crawler: If your website requires specific authentication, you can add request headers (e.g., Authorization: Bearer <token>) as key-value pairs; a sketch of what this amounts to follows this list.
- Enable dynamic crawling: If your website requires JavaScript to run before the page fully renders, enable dynamic crawling.
- Bypass Cloudflare verification (experimental): If your website is protected by Cloudflare's crawling prevention feature, check this box. Note that this feature is experimental and may not always result in accurate data being ingested.
- Enable knowledge graph: Use RunLLM's knowledge graph implementation when constructing the assistant's knowledge base. We strongly recommend enabling the knowledge graph.
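For intuition, the headers you configure are attached to every request the crawler makes, roughly as in this sketch. The URL and token are placeholders; this is not RunLLM's crawler code.

```python
import requests

# Hypothetical headers for a docs site behind bearer-token auth.
CRAWLER_HEADERS = {"Authorization": "Bearer <token>"}

resp = requests.get("https://docs.example.com/internal", headers=CRAWLER_HEADERS)
resp.raise_for_status()  # a 401/403 here usually means the header is wrong
print(resp.status_code)
```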
Slack channels
To ingest Slack messages as a data source, you will need to set:
- Slack Channel Names: The names of the channels in your workspace that you'd like to read from. Once you install the app, you will need to add the RunLLM app to each one of these channels.
- Message history: The length (in days) of the message history that should be ingested. Note that this is subject to Slack's free-plan limitations, so entering a longer history on a free plan will not unlock access to historical messages. (See the sketch at the end of this section for how this maps to Slack's API.)
Once you hit Save, you will be redirected to the Slack app installation page. Select the right workspace and install the RunLLM app; this requires admin permissions.
Slack does not allow apps to read from channels they're not in. After installation, you will need to add the RunLLM app to every channel you would like to ingest messages from. Otherwise, the data ingestion will fail.
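As a rough illustration of how the message-history setting maps onto Slack's API: conversations.history accepts an oldest timestamp, so a history of N days becomes a cutoff like the one below. The token and channel ID are placeholders, and this is not RunLLM's ingestion code.

```python
import time
from slack_sdk import WebClient

HISTORY_DAYS = 90  # hypothetical "Message history" value
client = WebClient(token="xoxb-<your-bot-token>")

# Slack expects `oldest` as a Unix timestamp; messages before it are excluded.
oldest = time.time() - HISTORY_DAYS * 24 * 60 * 60
resp = client.conversations_history(channel="C0123456789", oldest=str(oldest))
print(len(resp["messages"]), "messages within the history window")
```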
Discord channels
To ingest Discord messages as a data source, you will need to set:
- Discord Channel Names: The names of the channels in your server that you'd like to read from.
- Message history: The length (in days) of the message history that should be ingested.
Once you hit Save, you will be redirected to the Discord app installation page. Select the right server and install the RunLLM app; this requires admin permissions.
GitHub Issues, Discussions, and PRs
RunLLM can ingest issues, pull requests, and discussions from GitHub repositories. All three are configured via the same data connector. To ingest from a GitHub repository, you will need to set:
- Repo owner: The user or organization name associated with the repository.
- Repo name: The name of the repository.
- Public vs. private repository: Whether this repository is publicly visible or not. If the repo is private, you will need to authorize the RunLLM GitHub application to have permissions to access this repository.
- Number of issues to ingest: In reverse chronological order, the total number of issues to be ingested. If this is set to 0, no issues will be ingested. (See the sketch after this list for what reverse-chronological fetching looks like.)
- Number of PRs to ingest: In reverse chronological order, the total number of PRs to be ingested. If this is set to 0, no PRs will be ingested.
- Number of discussions to ingest: In reverse chronological order, the total number of discussions to be ingested. If this is set to 0, no discussions will be ingested.
- Only ingest closed topics: If checked, only issues and PRs that are marked as closed and discussions that are marked as answered will be ingested. If un-checked, open issues, PRs, and discussions will be ingested.
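For intuition about the reverse-chronological counts, here is a sketch of fetching the N most recent closed issues via GitHub's public REST API. The owner, repo, and token are placeholders; this is not RunLLM's connector code.

```python
import requests

OWNER, REPO, N_ISSUES = "example-org", "example-repo", 50  # hypothetical values

resp = requests.get(
    f"https://api.github.com/repos/{OWNER}/{REPO}/issues",
    params={"state": "closed", "sort": "created", "direction": "desc",
            "per_page": N_ISSUES},
    headers={"Authorization": "Bearer <token>"},  # only needed for private repos
)
resp.raise_for_status()
# GitHub's issues endpoint also returns PRs; filter them out here.
issues = [i for i in resp.json() if "pull_request" not in i]
print(len(issues), "recent closed issues fetched")
```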
GitHub Repos
RunLLM can ingest both code and documentation files from GitHub repositories. To ingest files from a GitHub repository, you will need to set:
- Repo owner: The user or organization name associated with the repository.
- Repo name: The name of the repository.
- Public vs. private repository: Whether this repository is publicly visible or not. If the repo is private, you will need to authorize the RunLLM GitHub application to have permissions to access this repository.
- Include images: Whether image files stored in the repository should be ingested (if checked) or ignored (if unchecked).
- Only load documentation files: If checked, only files with the .md, .mdx, and .rst file suffixes will be ingested.
- Paths (optional): An optional list of directories; if specified, only files in these directories will be ingested.
- File suffixes (optional): An optional list of file suffixes to use. If specified, only files ending in these suffixes (e.g., py) will be included. Note: Do not include the . before the file suffix; only specify py, not .py. (A sketch of how these filters combine follows this list.)
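As a conceptual sketch of how the Paths and File suffixes filters combine (this is illustrative, not RunLLM's ingestion logic):

```python
PATHS = ["docs/", "examples/"]  # hypothetical directory filters
SUFFIXES = ["md", "py"]         # note: no leading "."

def should_ingest(repo_path: str) -> bool:
    """A file is ingested only if it passes both filters (empty = no filter)."""
    in_paths = not PATHS or any(repo_path.startswith(p) for p in PATHS)
    has_suffix = not SUFFIXES or any(repo_path.endswith("." + s) for s in SUFFIXES)
    return in_paths and has_suffix

print(should_ingest("docs/intro.md"))    # True
print(should_ingest("src/main.py"))      # False: outside the Paths filter
print(should_ingest("examples/run.py"))  # True
```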
Notion documents
RunLLM can crawl both individual pages and databases in Notion. To ingest Notion documents, you will need to set:
- Notion page URLs: Links to the Notion pages you would like to crawl from.
- Notion database URLs: Links to the Notion databases you would like to crawl from. All entries in these databases will be crawled.
- Enable recursive search: If checked, nested pages in the Notion pages and the Notion databases linked above will be crawled.
Once you hit Save, you will be redirected to the Notion app installation page. Select the right workspace and install the RunLLM app; this requires admin permissions.
Confluence documents
RunLLM can learn from your Confluence documents. To ingest Confluence documents, you will need to set:
- Confluence domain name: The name of your Confluence workspace; this is typically the URL prefix you see for Confluence links.
- Space keys: The IDs for each of the space(s) from which you want to ingest pages.
- Page IDs: The IDs of the pages you'd like to ingest.
- Ingest child pages: If checked, nested pages in the Confluence pages specified above will be crawled.
Once you hit Save, you will be redirected to the Confluence app installation page. Select the right Confluence site and install the RunLLM app; this requires admin permissions.
Zendesk tickets
To ingest Zendesk tickets as a data source, you will need to set:
- Zendesk subdomain name: The name of your Zendesk organization. This will typically be the URL prefix that you see in Zendesk links (e.g., runllm.zendesk.com).
- Ticket history: The length (in days) of the ticket history that should be ingested. (See the sketch after this list.)
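As a rough illustration of the ticket-history window, Zendesk's search API accepts a created> cutoff date, as in this sketch. The subdomain, email, and token are placeholders; this is not RunLLM's connector code.

```python
import datetime
import requests

SUBDOMAIN = "runllm"   # hypothetical subdomain from the example above
HISTORY_DAYS = 180     # hypothetical "Ticket history" value

cutoff = (datetime.date.today() - datetime.timedelta(days=HISTORY_DAYS)).isoformat()
resp = requests.get(
    f"https://{SUBDOMAIN}.zendesk.com/api/v2/search.json",
    params={"query": f"type:ticket created>{cutoff}"},
    auth=("agent@example.com/token", "<api-token>"),  # Zendesk API token auth
)
resp.raise_for_status()
print(resp.json()["count"], "tickets inside the history window")
```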
Once you hit Save, you will be redirected to the Zendesk app installation page. Select the right workspace and install the RunLLM app; this requires admin permissions.
Jira tickets
To ingest Jira tickets as a data source, you will need to set:
- Jira domain name: The name of your Jira workspace; this is typically the URL prefix you see for Jira links.
- Projects: A list of the names of the Jira projects from which you'd like to ingest tickets.
Once you hit Save, you will be redirected to the Jira app installation page. Select the right Jira site and install the RunLLM app; this requires admin permissions.
Intercom tickets
To ingest Intercom tickets as a data source, you will need to set:
- Intercom app ID: The unique ID of your Intercom workspace. This is the unique ID in any Intercom link. For example, if your inbox link is app.intercom.io/a/inbox/abc123, your app ID is abc123. (See the sketch after this list.)
- Ticket history: The length (in days) of the ticket history that should be ingested.
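If you want to double-check the app ID, it is simply the last path segment of the inbox link, as in this tiny sketch (the link is the example above):

```python
from urllib.parse import urlparse

inbox_link = "https://app.intercom.io/a/inbox/abc123"  # example from above
app_id = urlparse(inbox_link).path.rstrip("/").split("/")[-1]
print(app_id)  # -> abc123
```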
Once you hit Save, you will be redirected to the Intercom app installation page. Select the right workspace and install the RunLLM app; this requires admin permissions.
Guru cards
To ingest Guru cards as a data source, you will first need to create an API key that has read permissions for the set of cards you would like to ingest. You will then need to set:
- Guru user name: The user ID associated with the API key you've created.
- Guru API key: The API key you've created with access to the relevant set of cards.
When you hit save, RunLLM will ingest all the Guru cards that the API key you've created has access to.
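If you'd like to sanity-check the key before saving, Guru's API uses HTTP basic auth with your user name and the API key. The endpoint below is our best understanding of Guru's public API, so verify it against their docs; all values are placeholders.

```python
import requests

GURU_USER = "you@example.com"  # the user ID tied to the API key
GURU_API_KEY = "<your-api-key>"

resp = requests.get(
    "https://api.getguru.com/api/v1/teams",  # assumed read-only endpoint; check Guru's docs
    auth=(GURU_USER, GURU_API_KEY),
)
resp.raise_for_status()
print("Key works; RunLLM will ingest every card this key can read")
```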
YouTube videos
RunLLM can learn from the descriptions and links associated with your YouTube videos. You simply need to provide the ID of the YouTube channel you'd like to ingest from, and RunLLM will do the rest!
We are working on adding support for ingesting the transcripts from your YouTube videos in addition to the descriptions. When this is implemented, you will need to provide permissions to the RunLLM app. Please reach out if you're interested in this feature!
Discourse forums
To ingest Discourse forum posts as a data source, you will first need to create a read-only API key that has access to your Discourse forum (a sketch for sanity-checking the key follows the list below). Once that's done, you will need to set:
- Base URL: The base domain of your Discourse forum. For example, you should enter forum.datahubproject.io, without a link to a specific post.
- API username: The username associated with the API key you've created.
- API key: The API key you created above with access to your forum.
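To sanity-check the key before saving, Discourse's API takes Api-Key and Api-Username headers, and /latest.json is a standard read-only listing of recent topics. The values below are placeholders; this is not RunLLM's connector code.

```python
import requests

BASE_URL = "https://forum.datahubproject.io"  # example base URL from above
HEADERS = {
    "Api-Key": "<your-read-only-api-key>",
    "Api-Username": "<api-username>",
}

resp = requests.get(f"{BASE_URL}/latest.json", headers=HEADERS)
resp.raise_for_status()
topics = resp.json()["topic_list"]["topics"]
print(len(topics), "recent topics visible to this key")
```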
File uploads
To upload individual files to RunLLM, you can select the files from your local filesystem or simply drag and drop them into RunLLM. The following file formats are supported: .csv, .json, .md, .pdf, .txt, .xlsx, .xml, .yaml, .yml.
We are adding support for .docx and .pptx files soon.
SharePoint documents
Please reach out if you need access to the SharePoint connector.