
Observing the provenance of GenAI responses

Ben Colborn, Member of Knowledge Staff

The first article in this series explained DevRev’s RAG pipeline, Turing, and how we got there. Now that we are applying Turing to customers’ business challenges, we need to continually ensure that the responses provided to customers are accurate and relevant. To do so, we need to be able to investigate what happens at each stage through analytics.

To give an example of why Turing analytics are important, recently I was going through actual queries and responses with my colleagues on the support team. We noticed that a lot of the unanswered queries were about the API. We had moved to a new platform for developer docs some weeks earlier and hadn’t yet fully implemented search there. Nevertheless, we got the API docs loaded into the DevRev knowledge base (KB) as articles so our AI agent could start answering queries about the APIs.

Once I saw the articles were present in the KB, I went to PLuG on our website and entered the query “How do I delete an article with the API?” To my great joy, it came back with exactly the right answer, articles.delete, along with a link to the reference documentation. Before announcing it on Slack, I wanted a more common topic for the screenshot, so I tried “How can I create a ticket with the API?” No relevant answer. Then “How can I create an account in a workspace with the DevRev API?” No relevant answer. Then back to the original query that had worked less than an hour earlier. No relevant answer.

So I took it up with the dev team. Through the Turing analytics, they traced the responses back to find why each one was or wasn’t generated and which particular pieces of information from the KB were used.

Questions & answers in the knowledge base

In the preceding post, we discussed the KB as if it were synonymous with a set of articles, which may come from multiple sources. It’s a bit more complicated than that. While probably around 90% of the words in the KB are from articles, DevRev also has a question & answer (QnA) object. Turing includes QnAs in the KB. QnAs differ from articles in some important ways.

Structure is the first difference. Articles tend to cover a topic in some detail. For example, the DevRev article about the support portal has about 1,000 words and contains an overview, lists of features and benefits, an explanation of permissions, instructions for adding support articles, and customization options. Length and complexity are the reasons that articles are pre-processed into chunks before being indexed for semantic search.
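
To make that concrete, here is a minimal sketch of what chunking a long article for semantic indexing could look like. The function, window size, and overlap are illustrative assumptions, not Turing’s actual implementation.

```python
# Illustrative only: Turing's real chunking logic is internal to DevRev.
def chunk_article(body: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split a long article into overlapping word-window chunks for embedding."""
    words = body.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + max_words])
        if chunk:
            chunks.append(chunk)
    return chunks

# A ~1,000-word article like the support portal page would yield a handful
# of chunks, each indexed separately for semantic search.
```
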
QnAs are simpler than articles. They consist, as the name indicates, mainly of a question and a brief answer to that question, along with basic metadata like part association and visibility.

Unlike articles, which are created by a person, Turing itself creates QnAs based on customer questions that are not answered in the KB. Once Turing creates a QnA, a person has to verify the answer and mark it as reviewed before Turing will use it in responses. There is also a “suggest-only” mode, in which Turing runs but stops short of conveying the response to the customer. This mode lets support monitor the quality of responses before exposing them to customers.
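
As a rough sketch of that verification gate (the object shape and field names here are hypothetical, not DevRev’s actual schema), the rule might look like this:

```python
from dataclasses import dataclass

# Hypothetical QnA shape; field names are illustrative, not the DevRev schema.
@dataclass
class QnA:
    question: str
    answer: str
    part: str          # part association
    visibility: str    # who can see it, e.g. "external"
    verified: bool = False

def citable_qnas(qnas: list[QnA]) -> list[QnA]:
    """Turing only draws on QnAs whose answers a person has verified."""
    return [q for q in qnas if q.verified]
```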

Finally, QnAs are not accessed on their own as articles are: there is no catalog of QnAs that a customer can browse through. Turing is the only way that a customer gains access to the content of a QnA. For the most part, we would want questions to be answered by articles, so QnAs tend to have a short lifespan. There are, however, certain instances where a piece of information persists as a QnA.

Tracing a query and response

Now let’s dig into the example that starts this post. In the DevRev app, we have a Turing analytics dashboard (not yet released to customers) that shows all queries and responses. There I can find both runs of the article-deletion query.

Same query, same items retrieved from the KB, yet one is answered and one is not. From these records I can see that for one query, the context was considered to be invalid. But I can’t see exactly what happened, so I can’t yet figure out why.

So then we have to go one level deeper. Turing analytics contain the following parameters for each query (in addition to event context like time and user):

  1. The query as entered by the user (same as “Query” in the in-product analytics).
  2. The query rephrased by the LLM.
  3. The KB items (articles and QnAs) retrieved by search (same as “Articles Retrieved” and “QAs Retrieved”).
  4. The KB items that appear to be relevant to the rephrased query (same as “Valid Sources”).
  5. The answer that was generated, if any (same as “Generated Answer”).

I already knew from the in-product analytics that items 1 and 3 were the same between the two queries. I also knew the valid sources (item 4) and generated answer (item 5) for the successful case and the lack of those in the failure case. The question was why the context was invalid sometimes and valid others.
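
As a concrete illustration, a per-query trace could be modeled roughly like this. The fields mirror the list above, but the names and types are my own, not the actual analytics schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical shape of one trace record; fields mirror the list above,
# but names and types are illustrative, not the real analytics schema.
@dataclass
class QueryTrace:
    user_query: str                  # 1. query as entered by the user
    rephrased_query: str             # 2. query rephrased by the LLM
    retrieved_items: list[str]       # 3. article/QnA IDs returned by search
    valid_sources: list[str]         # 4. items judged relevant to the rephrased query
    generated_answer: Optional[str]  # 5. the generated answer, or None

# In the failing run, user_query and retrieved_items matched the successful
# run, but valid_sources came back empty and generated_answer was None.
```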

It turns out that the failure was in the rephrasing step. The query was being rephrased in two different ways (emphasis mine):

  1. “What is the method to delete an article using the API?”
  2. “What is the process to delete an article using the API?”

The first way led to the correct response, the second to no response. While the search retrieved the same articles and QnAs, the different rephrased queries caused validation to fail in the second instance. What’s interesting is that the LLM introduced context (“process” vs. “method”) that was not in the original query. The significant terms are “API”, “article”, and “delete”. In common usage “method” and “process” are similar in meaning; however, “method” happens to have a particular meaning in the context of APIs.
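
Putting the stages together, a stripped-down sketch of the flow (with hypothetical stage names standing in for the real pipeline) shows where the two runs diverged:

```python
from typing import Callable, Optional, Sequence

# Hypothetical decomposition of the pipeline; the real Turing stages are
# internal to DevRev, so each stage is passed in as a callable here.
def answer_query(
    query: str,
    rephrase: Callable[[str], str],
    search: Callable[[str], Sequence[str]],
    is_relevant: Callable[[str, str], bool],
    generate: Callable[[str, Sequence[str]], str],
) -> Optional[str]:
    rephrased = rephrase(query)      # "method" vs. "process" was introduced here
    retrieved = search(rephrased)    # the same KB items came back in both runs
    valid = [item for item in retrieved if is_relevant(item, rephrased)]
    if not valid:                    # empty valid sources means no answer is sent
        return None
    return generate(rephrased, valid)
```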

Remediation

Shortly after I brought this issue to the attention of the development team, they deployed an update to the RAG pipeline. Queries on API documentation now work reliably. One of the changes was to the rephrasing prompt, and it was this change that resolved this particular issue.
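
I don’t know the exact wording of the updated prompt, but one plausible shape for such a fix is to instruct the model to keep domain-significant terms verbatim rather than substituting synonyms. A purely hypothetical sketch:

```python
# Purely illustrative; DevRev's actual rephrasing prompt is not public.
REPHRASE_PROMPT = """Rewrite the user's question so it is self-contained and
unambiguous for document search. Keep product names, API terms, and object
names exactly as written; do not replace them with synonyms.

User question: {query}
Rewritten question:"""

def build_rephrase_prompt(query: str) -> str:
    return REPHRASE_PROMPT.format(query=query)
```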

That’s very helpful, but what about the issues that we may uncover tomorrow or next week or next month? How can we continue to address specific issues while avoiding breaking something else? That’s the subject of the next and final post of the series.

Ben Colborn, Member of Knowledge Staff

Ben leads the Knowledge team at DevRev, responsible for product documentation. Previously he was at Nutanix as Director of Technical Publications.