Most enterprise AI fails three tests. Here's the framework

Updated Jul 03, 2026

Patrick van de Werken

Head of EMEA, DevRev

Test 1: Precision - can you take this answer to your CFO?

Not "is this answer plausible?" That's a remarkably low bar for enterprise software. The real test: when the AI gives an answer, can you show exactly where it came from?

Most AI systems generate responses that sound right. At enterprise scale, that's not good enough. Your AI tells a sales rep that a renewal is at risk - can they trace that judgment to the exact support tickets, engineering delays, and contract terms that informed it? Or is it "based on patterns in the data" - which is another way of saying "we're guessing confidently"?

The architectural divide:

Precision isn't a feature you add on top. It's a consequence of when your data gets structured.

If the AI is assembling context at query time - fetching fragments from five systems and hoping it gets the right pieces - precision is probabilistic. Same question, different answer, depending on what the model happened to retrieve. If data relationships are mapped before the question is asked - customer to ticket to product to contract to engineering status - then every answer has explicit provenance. Same question, same answer, every time. Deterministic, cited, auditable.
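
To make the contrast concrete, here is a minimal sketch of the pre-mapped approach, not a description of any particular product. The record IDs, field names, the 90-day renewal window, and the renewal_risk function are all hypothetical, and the permission model is left out for brevity. The point is only that a fixed traversal over explicit relationships returns the same verdict and the same field-level citations every time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    record_id: str  # hypothetical ID scheme, e.g. "ticket:T-101"
    kind: str       # "customer", "ticket", "contract", ...
    fields: tuple   # immutable (name, value) pairs


# Hypothetical data: relationships are mapped before any question is asked.
RECORDS = {
    "customer:acme": Record("customer:acme", "customer", (("name", "Acme"),)),
    "ticket:T-101": Record("ticket:T-101", "ticket",
                           (("severity", "high"), ("status", "open"))),
    "ticket:T-102": Record("ticket:T-102", "ticket",
                           (("severity", "low"), ("status", "closed"))),
    "contract:C-7": Record("contract:C-7", "contract", (("renewal_days", 30),)),
}
EDGES = {"customer:acme": ["ticket:T-101", "ticket:T-102", "contract:C-7"]}


def renewal_risk(customer_id):
    """Return (verdict, citations); each citation is a (record_id, field) pair."""
    citations = []
    open_high_tickets = 0
    renewal_soon = False
    # Sorted so the traversal order, and therefore the output, never varies.
    for record_id in sorted(EDGES.get(customer_id, [])):
        record = RECORDS[record_id]
        values = dict(record.fields)
        if record.kind == "ticket":
            if values["severity"] == "high" and values["status"] == "open":
                open_high_tickets += 1
                citations.append((record_id, "severity"))
                citations.append((record_id, "status"))
        elif record.kind == "contract" and values["renewal_days"] <= 90:
            renewal_soon = True
            citations.append((record_id, "renewal_days"))
    verdict = "at risk" if open_high_tickets and renewal_soon else "not at risk"
    return verdict, citations


first = renewal_risk("customer:acme")
second = renewal_risk("customer:acme")
print(first == second, first)
```

Asking the question twice and comparing the two results is the simplest version of the check described next.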

What to ask: "Show me the same business question answered twice. Are the answers identical? Can you trace both to their source records, at the field level, respecting my permission model?"

Test 2: Efficiency - what happens when your data doubles?

Here's a question that rarely appears in vendor evaluations but determines the entire economics of your AI investment: as your data grows, does the cost per query stay flat - or does it grow with the data?

The dominant architecture today is brute-force context loading. Ingest raw data into a prompt window. Process it. Generate a response. At demo scale, this is invisible. At enterprise scale, it creates a perverse incentive: the more data your organisation generates - and it will generate more - the more expensive every AI interaction becomes.

Where we see this breaking:

A European professional services client calculated that employees spend 20% of their time searching for information across systems - over 12,000 euros per employee per year in lost productivity. Their instinct was to buy a better search tool. But layering AI on top of fragmented data just means the AI is doing the same expensive work the humans were doing. Faster, yes. But at increasing token cost, with no structural improvement.
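
A quick back-of-the-envelope check puts those two figures side by side. It assumes the loss is simply the search share multiplied by a fully loaded annual cost per employee, which may not be how the client calculated it, and the headcount of 500 is a hypothetical used only for illustration.

```python
# Figures quoted in the paragraph above.
SEARCH_SHARE_PCT = 20        # percent of working time spent searching
LOST_PER_EMPLOYEE = 12_000   # euros per employee per year (quoted as a lower bound)

# Assumption: loss = search share * fully loaded annual cost per employee.
implied_annual_cost = LOST_PER_EMPLOYEE * 100 // SEARCH_SHARE_PCT


def annual_loss(headcount):
    """Total yearly loss in euros for a given (hypothetical) headcount."""
    return headcount * LOST_PER_EMPLOYEE


print(implied_annual_cost)   # implied cost per employee per year, in euros
print(annual_loss(500))      # yearly loss for a hypothetical 500-person firm
```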

This is also why the "build vs. buy" trap burns so much runway in European enterprises. We've watched a data reconciliation firm in London explore this for 18 months. We've watched a financial services group take 400 days to close because five internal teams couldn't align on a shared data approach. And we've watched an Italian energy company spend 178 days in cycle because regulatory complexity kept surfacing new requirements, each of which required a new data pipeline.

The root cause is always the same: without shared memory, each new use case requires its own sync logic, its own maintenance burden, its own cost centre. The second use case costs as much as the first. The third costs more, because now you're coordinating.

The architectural divide:

When the AI navigates pre-structured relationships instead of re-learning your data model with every query, cost scales with the complexity of the answer, not the volume of your data. The graph absorbs new information without requiring more processing per question. That's the difference between a flat cost curve and one that climbs with every record you add.
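
A toy cost model shows the shape of the two curves. It is a deliberate simplification under stated assumptions: the 200 tokens per record and the 12 records per answer are made-up numbers, and real systems sit somewhere between the two extremes. What matters is which variable each function depends on.

```python
TOKENS_PER_RECORD = 200  # assumed average; purely illustrative


def context_loading_tokens(total_records):
    """Brute-force loading: each query pays for the data volume it ingests."""
    return total_records * TOKENS_PER_RECORD


def traversal_tokens(records_in_answer):
    """Pre-structured traversal: each query pays only for the records it touches."""
    return records_in_answer * TOKENS_PER_RECORD


# Same question (assumed to touch 12 records) asked at 1x, 2x and 10x data volume.
for volume in (10_000, 20_000, 100_000):
    print(volume, context_loading_tokens(volume), traversal_tokens(12))
```

In this model the first column of costs doubles when the data doubles, while the second does not move, because the answer still touches the same 12 records.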

What to ask: "What's my cost per query today? What will it be when my data volume doubles in 18 months? Show me the architecture that makes that possible - not the pricing model that hides it."

Test 3: Safety - if it goes wrong at 2 AM, can you undo it by 2:01?

AI is moving from answering questions to taking actions. That's where the real value lives - the support AI that resolves a ticket, the sales AI that updates a pipeline field, the operations AI that routes work and notifies customers. But every autonomous action is a potential failure point.

The question isn't whether the AI will make a mistake. It will. The question is: what's the blast radius when it does?

The non-negotiable requirements:

Every action staged before execution. Review before it fires.
Every action versioned. See what changed, when, and why.
Every action reversible. Undo in seconds, not days.
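
The three requirements can be sketched in a few lines. This is an illustration under assumptions, not any platform's actual mechanism: ActionLog, its method names, and the CRM field are hypothetical, the store is a plain dictionary standing in for a real system of record, and revert ignores the case where a later change touched the same field.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

_MISSING = object()  # marks "the key did not exist before this change"


@dataclass
class Change:
    version: int
    key: str
    old_value: object
    new_value: object
    reason: str
    at: datetime


@dataclass
class ActionLog:
    store: dict = field(default_factory=dict)    # stand-in for a CRM record
    staged: list = field(default_factory=list)   # proposed, not yet applied
    history: list = field(default_factory=list)  # every applied change, in order

    def stage(self, key, new_value, reason):
        """Staged: record the proposal without touching the store."""
        self.staged.append((key, new_value, reason))

    def _apply(self, key, new_value, reason):
        old_value = self.store.get(key, _MISSING)
        if new_value is _MISSING:
            self.store.pop(key, None)
        else:
            self.store[key] = new_value
        change = Change(len(self.history) + 1, key, old_value, new_value,
                        reason, datetime.now(timezone.utc))
        self.history.append(change)
        return change

    def approve_all(self):
        """Versioned: apply staged changes, logging what, when, and why."""
        applied = [self._apply(k, v, r) for k, v, r in self.staged]
        self.staged.clear()
        return applied

    def revert(self, version):
        """Reversible: restore the old value as a new, logged change."""
        target = self.history[version - 1]
        return self._apply(target.key, target.old_value, f"revert of v{version}")


log = ActionLog(store={"renewal_stage": "negotiation"})
log.stage("renewal_stage", "closed_lost", "agent inferred churn")
applied = log.approve_all()
log.revert(applied[0].version)
print(log.store, len(log.history))
```

Note that the revert is itself written to the history rather than erasing the original entry, so the trail still shows both the mistake and its correction.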

This is the same transactional governance your database has had for decades. The question is whether your AI platform extends those guarantees to AI-driven actions - or whether agents operate outside them entirely.

Why European enterprises can't compromise here

GDPR requires demonstrable control over automated decisions affecting customers. The EU AI Act's provisions are tightening auditability requirements for AI-driven actions. The buying committees I sit in across Europe now route every agentic AI evaluation through the CISO, through procurement, through legal. That's not because they don't believe in the technology. It's because nobody has answered their four questions convincingly: How do I stop it? How do I scope it? How do I audit it? How do I revert it?
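
As a rough illustration of what convincing answers to the first three questions might look like in code, here is a hypothetical guard that every agent write must pass through. AgentGuard, its method names, and the field names are assumptions made for this sketch, and reverting is deliberately left out of it.

```python
class AgentGuard:
    """Hypothetical gate in front of agent writes: stop, scope, audit."""

    def __init__(self, allowed_fields):
        self.allowed_fields = frozenset(allowed_fields)  # scope: what may be written
        self.halted = False                              # stop: global kill switch
        self.audit = []                                  # audit: every decision logged

    def halt(self):
        """Stop all further agent writes until a human re-enables them."""
        self.halted = True

    def authorize(self, agent, field_name):
        """Return True only if the write is in scope and the guard is not halted."""
        if self.halted:
            decision = "denied: halted"
        elif field_name not in self.allowed_fields:
            decision = "denied: out of scope"
        else:
            decision = "allowed"
        # Denials are logged too, so the trail shows what was attempted.
        self.audit.append((agent, field_name, decision))
        return decision == "allowed"


guard = AgentGuard(allowed_fields={"ticket_status"})
print(guard.authorize("support-agent", "ticket_status"))
print(guard.authorize("support-agent", "contract_value"))
guard.halt()
print(guard.authorize("support-agent", "ticket_status"))
print(guard.audit)
```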

The firms we work with aren't debating whether safety is "nice to have." They're asking whether they can deploy at all without creating regulatory exposure.

What to ask: "Show me the audit trail. Show me the rollback. If your AI agent writes a wrong value to my CRM at 2 AM on Saturday, show me exactly how my team undoes it by 2:01 AM without escalating to your engineering team."

The test behind the tests: does the intelligence compound?

Precision, efficiency, and safety are necessary. But they're not sufficient on their own. The real differentiator is what happens when all three are solved on one shared foundation.

Consider what most enterprises look like today: support uses one AI tool, sales uses another, operations uses a third, IT uses a fourth. Each has its own data silo, its own cost structure, its own limitations. The support AI doesn't know engineering shipped the fix. The sales AI doesn't know the account has three escalated tickets. Operations can't answer "how will a 2-week delay affect our commitments?" because the data spans four disconnected systems.

This is the paradox: organisations are investing more in AI than ever, yet getting less compounding value from it. Every point solution starts from zero. Every context window closes after the response. Nothing carries forward.

The alternative: one shared memory that connects customer interactions, engineering status, sales pipeline, and operational metrics. When support resolves a ticket, that resolution enriches knowledge available to sales. When engineering ships a fix, the customer is notified. When a rep prepares for a meeting, the briefing includes support health, engineering delivery, and contract risk - from one query, in minutes.

The compounding effect is the economic moat. The first use case is sold. The second sells itself. The graph gets richer. The cost per additional capability drops. After 18 months of accumulated organisational intelligence, you've built something no competitor can replicate by starting from raw context loading. They'd need to replay years of your decisions to catch up.

The evaluation framework

If you're in the middle of a platform decision, here's what we'd challenge you to ask:

	Precision	Efficiency	Safety
The test	Same question returns same answer? Cited, traceable, permission-aware?	Cost trajectory flat as data doubles? Pre-structured traversal, not re-learning?	Every action staged, versioned, and reversible?
The proof	Prove accuracy in a 4-week POC. Not a demo. A POC on your data.	Show the cost curve at 2x and 10x data volume.	Demonstrate rollback of an AI-taken action in under 60 seconds.
The compounding test	Does the answer get better because of what support resolved last week?	Does use case #2 cost less than use case #1?	Does the audit trail span all AI actions across the platform, not just one tool?

Start where the pain is loudest. Prove value in weeks, not quarters. Then pay attention to what happens when the second use case arrives - and the memory is already there.

This is the executive summary of The AI Platform Paradox - a full exploration of the three-layer architecture (continuous synchronisation, contextual memory, and foundational services) that makes precision, efficiency, and safety possible on a single platform. If these questions resonated, the deeper perspective goes into the architectural how. Download it here.