Under the Hood

RunLLM focuses on delivering the highest-quality answers possible, which means that our architecture goes far beyond the generic RAG-based systems that back the chat-with-your-docs applications you see out on the internet. RunLLM builds on a variety of techniques developed in our co-founder Joey Gonzalez's research group at the UC Berkeley Sky Lab.

There are three key techniques we use to deliver gold-standard technical support answers: (1) custom data engineering, (2) fine-tuned LLMs, and (3) multi-LLM agents. At a high level, here's how the full RunLLM architecture looks:

Architecture Overview

This architecture allows us to provide clear, concise, and actionable answers while innovating on our chat UX and helping you understand your customers' problems & goals. Let's dive into each one of these.

Custom Data Engineering

No matter how advanced LLMs get, most AI systems remain garbage in, garbage out. That means that RunLLM's first order of business is understanding your product as deeply as possible. To do that, RunLLM has a custom data pipeline for each of the supported data sources. Each data pipeline reads the data that's being ingested and annotates each of the documents with metadata like its topic, the questions it might be used to answer, and how recently it was updated. Different data sources are engineered and searched differently, because the information density and structure of each source vary.
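
To make this concrete, here's a minimal sketch of what per-document annotation could look like. The `Document` shape, the `annotate_document` function, and the prompt wording are hypothetical stand-ins, not RunLLM's internal pipeline, and `llm` is assumed to be any prompt-to-text callable:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Document:
    source: str   # e.g. "docs", "slack", "github-issues"
    text: str
    metadata: dict = field(default_factory=dict)

def annotate_document(doc: Document, llm) -> Document:
    """Attach search metadata to one ingested document.

    `llm` is assumed to be any prompt -> text callable; the prompt
    wording here is illustrative, not RunLLM's actual prompts.
    """
    doc.metadata["topic"] = llm(
        f"In a few words, what is the topic of this document?\n\n{doc.text}"
    )
    doc.metadata["candidate_questions"] = llm(
        f"List three questions this document could answer.\n\n{doc.text}"
    )
    doc.metadata["last_updated"] = datetime.now(timezone.utc).isoformat()
    return doc
```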

The data and the metadata are ingested into an index that uses a mix of vector search, text search, graph search, and predicate search to find the right information to answer each question.
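
As a simplified picture of how a hybrid index might combine these signals, the sketch below mixes only a vector score and a keyword score after a metadata predicate narrows the candidates. The 0.7/0.3 weights and the document shape are assumptions for illustration, and graph search over related documents is omitted:

```python
import math
from collections import Counter

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u)) or 1.0
    norm_v = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (norm_u * norm_v)

def keyword_overlap(query, text):
    # Fraction of distinct query terms that appear in the document text.
    q, t = Counter(query.lower().split()), Counter(text.lower().split())
    return sum((q & t).values()) / (len(q) or 1)

def hybrid_search(query, query_vec, docs, predicate=None, top_k=5):
    """Rank documents by a weighted mix of vector and text similarity,
    after a metadata predicate (e.g. a topic or recency filter) narrows
    the candidate set."""
    candidates = [d for d in docs if predicate is None or predicate(d["metadata"])]
    scored = [
        (
            0.7 * cosine(query_vec, d["vec"])
            + 0.3 * keyword_overlap(query, d["text"]),
            d,
        )
        for d in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_k]]
```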

Fine-tuned LLMs

Once RunLLM has ingested your product documentation, it fine-tunes a custom LLM that's a narrowly tailored expert on your product. Unlike other systems that claim to fine-tune models but simply use RAG, RunLLM trains a custom version of Llama 3 for each assistant that we create.

The fine-tuning process itself works by first generating a large volume of synthetic question-answer pairs from the data that you've provided. This synthetic data is then used to instruction-tune Llama 3. This expert model is then able to deeply understand your product's vocabulary, nuances, and best practices. Most importantly, it makes sure that RunLLM has the information it needs to answer questions and that RunLLM's search process finds the most relevant information for each question.
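
Here's a minimal sketch of the synthetic-data step, assuming `llm` is a prompt-to-text callable that returns JSON; the prompt and the record shape are illustrative, not RunLLM's actual pipeline:

```python
import json

def generate_training_pairs(chunks, llm, pairs_per_chunk=3):
    """Turn documentation chunks into instruction-tuning records."""
    records = []
    for chunk in chunks:
        raw = llm(
            f'Write {pairs_per_chunk} question-answer pairs grounded only in '
            f'the text below, as a JSON list of objects with "question" and '
            f'"answer" keys.\n\n{chunk}'
        )
        for pair in json.loads(raw):
            # Standard instruction-tuning record: the synthetic question is
            # the instruction, the source chunk is the context, and the
            # synthetic answer is the training target.
            records.append({
                "instruction": pair["question"],
                "input": chunk,
                "output": pair["answer"],
            })
    return records
```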

Multi-LLM Agents

While off-the-shelf LLMs are powerful, they can also be high-variance. The common wisdom is that GPT-4 is like a smart college graduate. If you give a smart college graduate 8,000 words and 47 instructions and ask them to answer a question, they'll probably get the answer wrong half the time too — either because they didn't read document number 6 or because they forgot instruction number 34.

To build a technical support agent that you can trust enough to put in front of your customers, RunLLM enforces strong guardrails on its outputs. RunLLM uses 20-40 different LLM calls to generate each answer; each LLM call has a narrowly defined task, like determining whether a question is relevant to the product or checking whether there's enough data to answer the question. Each step's output is checked against strong guardrails to ensure that we provide the highest-quality answers possible.
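
The sketch below shows the shape of such a pipeline with just two of those narrowly scoped calls; a full pipeline chains many more steps (query rewriting, citation checking, and so on). Here `llm` and `retrieve` are assumed callables, and the prompts are illustrative rather than RunLLM's actual prompts:

```python
def answer_with_guardrails(question, llm, retrieve):
    """Answer a question via narrowly scoped LLM calls, each with a guardrail."""
    # Guardrail 1: relevance -- refuse questions that aren't about the product.
    relevant = llm(
        f"Answer only YES or NO. Is this question about the product?\n{question}"
    )
    if relevant.strip().upper() != "YES":
        return "Sorry, I can only answer questions about this product."

    # Guardrail 2: sufficiency -- refuse rather than guess when data is thin.
    context = retrieve(question)
    grounded = llm(
        "Answer only YES or NO. Does the context contain enough information "
        f"to answer the question?\nQuestion: {question}\nContext: {context}"
    )
    if grounded.strip().upper() != "YES":
        return "I don't have enough information to answer that confidently."

    # Final step: generate an answer constrained to the retrieved context.
    return llm(
        f"Using only this context, answer the question.\n"
        f"Context: {context}\nQuestion: {question}"
    )
```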