Avoiding factual errors with GenAI responses: An intro to retrieval-augmented generation

Ben Colborn, Member of Knowledge Staff

Very shortly after GPT-3.5 was released in late 2022, users realized that it had a problem with making things up, a behavior known as hallucination or confabulation. With some statistical cleverness, LLMs can produce meaningful and (usually) grammatical responses to prompts. They are, however, not reliably factual.

Much of the time hallucinations are merely humorous, fodder for AI skeptics and other curmudgeons. There are, however, very real AI use cases in business, and very real consequences for hallucinations. Those consequences can be as common and private as the loss of an individual customer’s satisfaction, or as specific and public as the recent judgment against Air Canada. In that case, the tribunal found that the airline’s AI agent “negligently misrepresented” an airline policy, and that Air Canada was required to honor the policy as described by the AI agent, even though it contradicted other policy documentation.

It is this sort of “negligent misrepresentation” that retrieval-augmented generation, or RAG, aims to address. In addition, RAG supplies LLMs with organization- or domain-specific context and information to enhance their performance.

LLM training

Large Language Models, as their name indicates, are huge: they contain several billion parameters. This means that their training requires a lot of resources. Now, suppose we have an LLM that is trained on a set of documents. Since it has full knowledge of the documents, it is easily able to answer queries related to the documents. Now what if we want to add or remove a particular document in our set?

Both of these tasks—adding new knowledge and removing old knowledge—require re-training the entire LLM from scratch, which is a costly process. Estimates vary from around half a million to five million dollars.

Now consider chatbots and copilots. Their prime function is to use a set of documents to assist users with their queries. In today’s world, these documents can change very rapidly. We might ship a new feature update and want our bot to use it when answering queries. We might mark a feature as deprecated and want our bot to stop using it.

Apart from this, what if we want our chatbot to be user-specific? What if we want certain documents to be used only if the user has a certain level of authorization? Or what if we only want to use documents that are relevant for a particular user’s licensing level?
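One way to picture these requirements is as metadata attached to each document and checked at query time. The following is a minimal sketch, not a description of any particular product: the `Doc` fields (`min_auth_level`, `license_tiers`, `deprecated`) and the tier names are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    # Hypothetical metadata fields, assumed for this sketch.
    min_auth_level: int = 0
    license_tiers: frozenset = frozenset({"free", "pro"})
    deprecated: bool = False


def visible_docs(docs, auth_level, license_tier):
    """Return only documents this user may see, skipping deprecated ones."""
    return [
        d for d in docs
        if not d.deprecated
        and auth_level >= d.min_auth_level
        and license_tier in d.license_tiers
    ]


KB = [
    Doc("kb-1", "How to reset a password."),
    Doc("kb-2", "Admin audit log export.", min_auth_level=2),
    Doc("kb-3", "Pro-only analytics dashboard.", license_tiers=frozenset({"pro"})),
    Doc("kb-4", "Legacy import tool.", deprecated=True),
]

print([d.doc_id for d in visible_docs(KB, auth_level=0, license_tier="free")])
```

Because the filter runs per query, marking a document as deprecated or changing its authorization level takes effect immediately, with no change to the model.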

With the large overhead of training, we cannot afford to have an LLM trained for each dataset. Fine-tuning is much less costly than retraining. However, research has indicated that using RAG with an external knowledge base (KB) is more effective and efficient than fine-tuning.

In RAG, the KB can be updated on demand. Whenever a query comes in, we fetch relevant pieces of information from our knowledge base and provide them to our LLM. The LLM now has one simple task: text generation, which it can perform very well.

RAG also helps us to avoid hallucinations. Since we are the ones providing the information to be used while answering the query, there is less chance of the answer being hallucinated. This reduces, although it does not entirely remove, the randomness present in LLM output.
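To make "providing the information" concrete, here is a minimal sketch of how retrieved text might be placed into a prompt. The function name `build_grounded_prompt` and the prompt wording are assumptions for illustration, not the prompt any specific system uses.

```python
def build_grounded_prompt(query, contexts):
    """Assemble a prompt that asks the LLM to answer from supplied context."""
    # Number each context so an answer can refer back to its source.
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(contexts, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


prompt = build_grounded_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days.", "Refunds go to the original card."],
)
print(prompt)
```

The resulting string would be sent to whichever LLM is in use; that call is deliberately left out because it depends on the provider.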

First attempt

DevRev’s very first attempt at RAG, which we call Turing, had a simple architecture.

User submits a query.
Search fetches relevant information from a knowledge base.
The LLM generates an answer from the query and context.

This simple model turned out to have a lot of gaps. To begin with, not every query needs to go through RAG, and we need to be able to distinguish those cases. For those queries that do need to go through RAG, the user’s query might not be the best search query for our retriever. It can contain redundant information that might affect search. Ideally, we should take the user’s query and then generate a search query for the retriever.
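The two gaps described above, routing and query rewriting, can be sketched with simple heuristics. In practice an LLM would more likely perform both steps; the word lists and the function names `needs_retrieval` and `to_search_query` are hypothetical stand-ins used only to show where each step sits.

```python
import re

# Hypothetical word lists; a real system would likely use an LLM or classifier.
SMALL_TALK = {"hi", "hello", "hey", "thanks", "thank you", "bye"}
FILLER = {
    "hi", "hello", "please", "can", "could", "you", "tell", "me", "i",
    "how", "do", "to", "my", "a", "an", "the", "is", "what",
}


def needs_retrieval(query):
    """Route: greetings and thanks can be answered without searching the KB."""
    normalized = re.sub(r"[^\w\s]", "", query.lower()).strip()
    return bool(normalized) and normalized not in SMALL_TALK


def to_search_query(query):
    """Rewrite: drop filler words so the retriever sees only the key terms."""
    words = re.findall(r"\w+", query.lower())
    return " ".join(w for w in words if w not in FILLER)


user_query = "Hi, could you please tell me how to rotate my API key?"
if needs_retrieval(user_query):
    print(to_search_query(user_query))
```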

The R and the G

Two key processes, retrieval and generation, comprise RAG itself. Indexing is a necessary preliminary step.

Indexing means analyzing the KB, which resides outside the LLM. Chunking, extracting keywords, and creating embeddings are all part of indexing. Chunking can be based on various criteria, such as a fixed number of words or organizational units like paragraphs, lists, and sections. Embeddings are vector representations of the chunks, stored in a vector DB, that can be used to find similar content. Syntactic and semantic search rely on the indexes created in this step. It is faster and more efficient to index content when it is added or changed than when a search is executed.
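The indexing steps can be illustrated with a small, self-contained sketch. It chunks by paragraph with a word cap and uses a toy bag-of-words vector in place of a real embedding model, so `embed` here is an assumption for illustration; a production system would call an embedding model and store the vectors in a vector DB.

```python
import math
import re
from collections import Counter


def chunk_by_paragraph(text, max_words=50):
    """Split on blank lines, then cap each chunk at max_words words."""
    chunks = []
    for para in re.split(r"\n\s*\n", text.strip()):
        words = para.split()
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks


def embed(text):
    """Toy embedding: word counts. Stands in for a real embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(text, max_words=50):
    """Index once, when content is added or changed, not at search time."""
    return [(chunk, embed(chunk)) for chunk in chunk_by_paragraph(text, max_words)]


kb_text = (
    "Reset your password from the login page.\n\n"
    "Export audit logs from the admin console."
)
index = build_index(kb_text)
query_vec = embed("how do I reset my password")
best_chunk, _ = max(index, key=lambda item: cosine(query_vec, item[1]))
print(best_chunk)
```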

Retrieval involves searching through the indexes to find relevant information. The results of the syntactic and semantic searches have to be combined into some aggregate score of their relevance. In the context of text generation, this might mean pulling in existing sentences, paragraphs, or documents that are similar to what you want to create.
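The text does not say how the two result lists are combined. One common choice is reciprocal rank fusion (RRF), shown below as an assumption rather than as the method used here. Each document receives 1 / (k + rank) from every list it appears in; k = 60 is a conventional default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one aggregate ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties are broken by document ID for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


syntactic_hits = ["a", "b", "c"]   # e.g. keyword search results, best first
semantic_hits = ["b", "d", "a"]    # e.g. embedding search results, best first
print(reciprocal_rank_fusion([syntactic_hits, semantic_hits]))
```

RRF uses only rank positions, which avoids having to normalize keyword scores and vector similarities onto a common scale. A weighted sum of normalized scores is another reasonable option.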

Once this relevant information has been retrieved, the LLM uses it as a foundation to generate new content. This could involve rearranging, consolidating, summarizing, modifying, or combining the retrieved information to create something fresh and contextually appropriate.

Validating context

Semantic search itself has limitations and might not fetch the most relevant context. If we retrieve the wrong context, it will still be used for generating the response, which might lead to inaccuracies or hallucinations.

We tried updating our answer generation prompt by instructing the LLM to generate an answer only if the retrieved information is relevant. It turned out that the LLM was not able to handle two different tasks, validation in addition to generation, at the same time. Asking it to do so led to an increase in hallucinations and a decrease in response quality. It is for this reason that we decided to separate validation from generation.

We took the output of our retriever and passed it to a validation module, which returned only the contexts that were relevant to our query. The validation step consists of comparing the query to the items returned from retrieval and then filtering out the items the LLM deems irrelevant to the query.
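The shape of such a validation module can be sketched as a filter with a pluggable judge. In the approach described above the judge is an LLM call; the `keyword_overlap_judge` below is a hypothetical stand-in so the example runs without a model, and its `min_shared` threshold is an arbitrary illustrative value.

```python
import re


def keyword_overlap_judge(query, context, min_shared=2):
    """Stand-in for an LLM relevance check: count words shared by both texts."""
    q_words = set(re.findall(r"\w+", query.lower()))
    c_words = set(re.findall(r"\w+", context.lower()))
    return len(q_words & c_words) >= min_shared


def validate_contexts(query, contexts, judge=keyword_overlap_judge):
    """Keep only the retrieved contexts the judge deems relevant to the query."""
    return [c for c in contexts if judge(query, c)]


retrieved = [
    "To reset your password, open the email link.",
    "Invoices are sent monthly.",
]
print(validate_contexts("reset password email", retrieved))
```

Keeping the judge as a separate function means the generation prompt stays focused on a single task, and the judge can be replaced or tuned independently.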

Current approach

Based on the results of our first attempt, we added stages to the RAG pipeline.

User submits a query.
LLM rephrases the query.
Syntactic search and semantic search fetch relevant information from a knowledge base.
The search results are ranked for relevance.
Validation checks that the most-relevant results apply to the context.
LLM generates an answer using the validated search results.
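The stages listed above can be wired together as a sequence of pluggable functions. This is a structural sketch only: the class name `RagPipeline`, its field names, and the `top_k` cutoff are assumptions, and the lambdas are stubs standing in for real LLM and search calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RagPipeline:
    # Each stage is injected so it can be swapped or tested on its own.
    rephrase: Callable[[str], str]
    syntactic_search: Callable[[str], List[str]]
    semantic_search: Callable[[str], List[str]]
    rank: Callable[[List[str], List[str]], List[str]]
    validate: Callable[[str, List[str]], List[str]]
    generate: Callable[[str, List[str]], str]
    top_k: int = 3  # hypothetical cutoff for how many results get validated

    def answer(self, user_query: str) -> str:
        search_query = self.rephrase(user_query)
        ranked = self.rank(
            self.syntactic_search(search_query),
            self.semantic_search(search_query),
        )
        valid = self.validate(user_query, ranked[: self.top_k])
        return self.generate(user_query, valid)


# Stub stages for illustration; real ones would call an LLM and a search backend.
pipeline = RagPipeline(
    rephrase=lambda q: q.lower(),
    syntactic_search=lambda q: ["Keys rotate under Settings > API."],
    semantic_search=lambda q: [
        "Keys rotate under Settings > API.",
        "Billing runs monthly.",
    ],
    rank=lambda a, b: list(dict.fromkeys(a + b)),  # merge, dropping duplicates
    validate=lambda q, ctxs: [c for c in ctxs if "key" in c.lower()],
    generate=lambda q, ctxs: f"Based on {len(ctxs)} source(s): {' '.join(ctxs)}",
)
print(pipeline.answer("How do I rotate a key?"))
```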

For applying LLMs to business problems using a KB specific to a particular business, an LLM alone is not adequate. It requires a good deal of scaffolding around it, and the information transformed at each stage needs to be handled with care.

Furthermore, to continuously optimize Turing and even customize it for particular customers, we need to be able to trace the results of each step in the process, a capability known as observability. Observability will be discussed in the second post in this series.