---
Title: "Why your AI won't tell you when it's wrong and what to do about it"
Url: "https://devrev.ai/blog/why-your-ai-wont-tell-you-when-its-wrong"
Published: "2026-05-27"
Last Updated: "2026-05-27"
Author: "Jeff Smith"
Category: "Engineering"
Excerpt: "Your AI answered confidently. It was wrong. And it didn't tell you. A look at why general-purpose AI tools silently fail at enterprise scale — and what a knowledge graph architecture does differently."
Reading Time: 6
---

# Why your AI won't tell you when it's wrong and what to do about it

Last week, during a live benchmarking stream, something unexpected happened.

We asked Claude – connected to live Jira and Salesforce instances via MCP – a realistic enterprise query:

"Find all high-priority open engineering issues, identify the customer accounts that have open support tickets in those product areas, and give me a detailed breakdown."

Claude thought for about six minutes. Then it gave us a confident, well-formatted answer. Complete with issue IDs, account names, ticket counts.

There was one problem: its Jira MCP authentication had timed out silently overnight.

Claude had no access to engineering issues at all. It never told us. It just went ahead, pulled what it could from Salesforce, inferred the rest, and returned an answer that looked authoritative. When we asked how it had arrived at that answer, it admitted – only when pressed – that it had never actually queried Jira.

A confident answer. Wrong data. No warning.

That moment captures something important about where enterprise AI is right now.

### The benchmark

We've been running a structured comparison between DevRev Computer and Claude Code, using the same model (Sonnet 4.6) against the same dataset – 33 engineering issues, 124 support tickets, 37 product areas, 39 accounts.

The query is the kind a VP of Support or a CTO might ask on a Monday morning: cross-reference your open high-priority issues with the customer accounts affected. Not a complex analytical task. Just a join across two systems.

Here's what we saw across multiple runs:

|  | Claude + MCP | Computer, By DevRev |
| --- | --- | --- |
| Average tokens per run | ~3.2 million | ~157,000 |
| Time to answer | ~8-9 minutes | ~1.5 minutes |
| Token reduction | — | ~95% fewer |
| Speed | — | ~5.5× faster |

Same model. Same query. Same data. The variable was how context was delivered to the model.
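
For readers who want to check the ratios, here is a small sketch of the arithmetic behind the two derived rows. It uses only the averages reported in the table; the function names are ours, and the 8 and 9 minute bounds are simply the two ends of the reported range.

```python
# Averages reported in the benchmark table.
CLAUDE_TOKENS = 3_200_000
COMPUTER_TOKENS = 157_000
CLAUDE_MINUTES_RANGE = (8.0, 9.0)   # the reported "~8-9 minutes"
COMPUTER_MINUTES = 1.5


def token_reduction(baseline: int, candidate: int) -> float:
    """Fraction of tokens saved relative to the baseline."""
    return 1 - candidate / baseline


def speedup(baseline_minutes: float, candidate_minutes: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_minutes / candidate_minutes


reduction = token_reduction(CLAUDE_TOKENS, COMPUTER_TOKENS)
speedup_low = speedup(CLAUDE_MINUTES_RANGE[0], COMPUTER_MINUTES)
speedup_high = speedup(CLAUDE_MINUTES_RANGE[1], COMPUTER_MINUTES)

print(f"Token reduction: {reduction:.1%}")
print(f"Speedup: {speedup_low:.1f}x to {speedup_high:.1f}x")
```

The reported ~5.5× sits inside the range this produces, and the token reduction rounds to the ~95% shown in the table.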

### What Claude actually does

Even when MCP is working correctly, Claude faces a fundamental problem before it can answer: it doesn't know your schema.

It doesn't know what Jira calls "high priority" (is it High? Highest? Critical? P0? P1?). It doesn't know which custom field links issues to product areas. It doesn't know how Salesforce has structured your account records.

So it explores. It fetches a sample issue to inspect the fields. It tries a JQL query, sees the results, refines it. It pulls a Salesforce object to find the right field name. It may paginate through multiple pages of results – or stop early if the context window starts filling up.

Each of these discovery calls returns large payloads. Most of that data never appears in the final answer – it was just the AI figuring out where to look.

In our worst-case run, this exploration consumed **6 million tokens** and cost **$4.28** – for a single query, on a dataset of 33 issues. The answer it returned still wasn't fully reliable.
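
As a rough back-of-the-envelope sketch, the worst-case figures imply a blended price per million tokens, which can then be applied to the average runs from the benchmark table. This is an extrapolation and not a measured result: it assumes the same blended rate holds for every run, which will not be exactly true because the real price depends on the mix of input, output, and cached tokens. The function names are hypothetical.

```python
# Figures reported in this post.
WORST_CASE_TOKENS = 6_000_000
WORST_CASE_COST_USD = 4.28
AVERAGE_TOKENS = {"Claude + MCP": 3_200_000, "Computer": 157_000}


def implied_rate_per_million(cost_usd: float, tokens: int) -> float:
    """Blended USD price per million tokens implied by one run."""
    return cost_usd / (tokens / 1_000_000)


def estimated_cost(tokens: int, rate_per_million: float) -> float:
    """Estimated USD cost of a run, ASSUMING the same blended rate applies."""
    return tokens / 1_000_000 * rate_per_million


rate = implied_rate_per_million(WORST_CASE_COST_USD, WORST_CASE_TOKENS)
print(f"Implied blended rate: ${rate:.2f} per million tokens")

for name, tokens in AVERAGE_TOKENS.items():
    print(f"{name}: ~${estimated_cost(tokens, rate):.2f} per average run")
```

Under that assumption, an average run costs on the order of a couple of dollars for the exploratory approach and roughly a tenth of a dollar for the graph-backed one.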

And here's the deeper problem: Claude pays this schema exploration cost every session. It has no persistent memory of your data model. Tomorrow morning, it starts from zero again.

### Why the non-determinism matters

Across our test runs, Claude took different paths to the same answer. Sometimes it called Jira first. Sometimes Salesforce. Sometimes it used JQL; sometimes it fetched individual records. The outputs generally looked similar – but "generally similar" is not the standard you want for decisions about your highest-priority customers and most critical engineering issues. Precision is essential.

With SQL, you get the same answer every time, from the same source of truth, via the same query. You can inspect that query. You can share it. You can build a dashboard from it.

We asked Computer how it arrived at its answer. It showed us the SQL – a clean set of joins, filters, and ordering. Reproducible and auditable.
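
To make "a clean set of joins, filters, and ordering" concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is not the SQL that Computer generated: the table names, column names, and sample rows are all hypothetical, chosen only to mirror the shape of the benchmark query.

```python
import sqlite3

# Hypothetical schema: NOT DevRev's actual data model.
SCHEMA = """
CREATE TABLE issues   (id TEXT PRIMARY KEY, priority TEXT, status TEXT, product_area TEXT);
CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tickets  (id INTEGER PRIMARY KEY, account_id INTEGER, product_area TEXT, status TEXT);
"""

# Open high-priority issues joined to accounts with open tickets in the same area.
QUERY = """
SELECT i.id AS issue_id, a.name AS account, COUNT(t.id) AS open_tickets
FROM issues i
JOIN tickets t  ON t.product_area = i.product_area
JOIN accounts a ON a.id = t.account_id
WHERE i.priority = 'high' AND i.status = 'open' AND t.status = 'open'
GROUP BY i.id, a.name
ORDER BY i.id, a.name;
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO issues VALUES (?, ?, ?, ?)", [
        ("ENG-1", "high", "open", "billing"),
        ("ENG-2", "low", "open", "search"),
        ("ENG-3", "high", "closed", "auth"),
    ])
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
    conn.executemany("INSERT INTO tickets VALUES (?, ?, ?, ?)", [
        (101, 1, "billing", "open"),
        (102, 1, "billing", "open"),
        (103, 2, "billing", "closed"),
        (104, 2, "search", "open"),
        (105, 2, "auth", "open"),
        (106, 2, "billing", "open"),
    ])
    return conn


def high_priority_impact(conn: sqlite3.Connection) -> list:
    """Run the fixed query; the same data always yields the same rows."""
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    for row in high_priority_impact(build_demo_db()):
        print(row)
```

The point of the sketch is the property, not the schema: the query is a plain string you can read, share, and rerun, and the explicit `ORDER BY` means repeated runs return rows in the same order.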

We asked Claude how it arrived at its answer. It described a journey of exploration – and in the failed run, that journey had quietly excluded the core portion of the required data.

### The architectural gap

This isn't a criticism of Claude, or of MCP. Both are genuinely impressive. The issue is structural.

General-purpose AI tools are built to retrieve and reason. They aren't built to remember. Every query starts from scratch. Every session pays the schema exploration tax. Every answer is only as good as what the model managed to pull into its context window before it ran out of time or tokens.

DevRev Computer is built differently. The schema is known before any question is asked. Relationships between engineering issues, product areas, support tickets, and customer accounts are typed, pre-mapped, and maintained continuously. When you ask "which customers are affected by this issue?", Computer doesn't explore – it traverses. The join key is a first-class entity in the knowledge graph.
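
The difference between exploring and traversing can be sketched with a toy typed graph. This is an illustration of the general idea only, not DevRev's implementation: the node IDs, the relation names `in_area` and `raised_by`, and the helper functions are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical typed edges: (source, relation, target).
EDGES = [
    ("issue:ENG-1", "in_area", "area:billing"),
    ("ticket:101", "in_area", "area:billing"),
    ("ticket:102", "in_area", "area:search"),
    ("ticket:101", "raised_by", "account:Acme"),
    ("ticket:102", "raised_by", "account:Globex"),
]


def build_index(edges):
    """Index edges in both directions so each hop is a dictionary lookup."""
    forward = defaultdict(list)
    reverse = defaultdict(list)
    for src, rel, dst in edges:
        forward[(src, rel)].append(dst)
        reverse[(dst, rel)].append(src)
    return forward, reverse


def affected_accounts(issue, forward, reverse):
    """Walk issue -> product area -> tickets -> accounts along known relations."""
    accounts = set()
    for area in forward[(issue, "in_area")]:
        for node in reverse[(area, "in_area")]:
            if node.startswith("ticket:"):
                accounts.update(forward[(node, "raised_by")])
    return sorted(accounts)


forward, reverse = build_index(EDGES)
print(affected_accounts("issue:ENG-1", forward, reverse))
```

Because the relations are known up front, nothing in the lookup involves sampling records or guessing field names. The work done depends on the edges touched by the question, not on the total size of the graph.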

Token cost scales with the size of the answer, not the size of your data. That's why we saw 95% fewer tokens on a small dataset – and why we expect that gap to widen as data grows. As enterprises use AI more broadly, efficiency becomes an increasingly important concern.

### The silent failure problem

The live failure during our stream wasn't just embarrassing. It illustrated something critical about MCP-based architectures at enterprise scale.

You're managing connections. Each MCP server can fail, expire, or lose authentication – silently. The AI may not know it's missing data. It may not tell you. It will simply do its best with what it has, and return a confident-sounding answer.

At enterprise scale – where you might have five, ten, fifteen integrations – this becomes a serious governance problem. Who is responsible for ensuring the AI had access to the right data when it made that recommendation?

With a knowledge graph architecture, data is ingested and kept current through managed sync pipelines. If a sync fails, you know. The answer doesn't silently degrade.
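
Whatever the architecture, the underlying principle is to fail loudly instead of answering from partial data. The sketch below shows one way that guard could look. Everything in it is hypothetical: the `Source` class, the `healthy` flag (a stand-in for a real authentication or sync-status probe), and the `require_sources` helper are not part of MCP or of any DevRev API.

```python
from dataclasses import dataclass


class SourceUnavailableError(RuntimeError):
    """Raised when a source the question depends on cannot be reached."""


@dataclass
class Source:
    name: str
    healthy: bool  # stand-in for a real auth/connectivity/sync check


def require_sources(sources, required):
    """Return the required sources, or raise if any is missing or unhealthy."""
    by_name = {source.name: source for source in sources}
    missing = [
        name for name in required
        if name not in by_name or not by_name[name].healthy
    ]
    if missing:
        raise SourceUnavailableError(
            f"Cannot answer: no access to {', '.join(missing)}"
        )
    return [by_name[name] for name in required]


if __name__ == "__main__":
    # Mirrors the failed run: Salesforce reachable, Jira auth expired.
    sources = [Source("salesforce", healthy=True), Source("jira", healthy=False)]
    try:
        require_sources(sources, required=["jira", "salesforce"])
    except SourceUnavailableError as error:
        print(error)
```

With a check like this in front of the query, the failed run described above would have stopped with an explicit error about Jira instead of producing a confident answer.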

### What this means in practice

The **95% token reduction and 5.5× speed improvement** are real and meaningful – especially as token costs accumulate across hundreds of employees running dozens of queries per day. But the more important number is the one that is harder to measure: how often does your AI give you a confident wrong answer because it couldn't reach the right data? Whether people can trust the actions and decisions based on those answers is what will determine the success or failure of your AI projects.

> [!INFO]
> Enterprise AI needs three things to be trustworthy:
> 
> 1. A known, pre-mapped data model – so it doesn't spend tokens discovering what it should already know.  
> 2. Persistent shared memory – so context isn't rebuilt from scratch every session.  
> 3. Deterministic retrieval for structured data – so you get the same answer every time, from a source you can audit.

  
MCP is a useful protocol. But connecting tools to an AI doesn't give it memory. It gives it reach – and reach without memory is how you end up with a $4.28 answer or an answer built on data the AI never actually accessed.

Some will suggest prompt caching as a solution, but it won't solve this problem. I'll walk through why in a future post.