The Purity of Uselessness: Why Academic AI Benchmarks Don't Pay the Bills
Really. The AI Benchmark Industry Just Needs to Stop. Please Stop.
Let me tell you something about purity.
There’s a certain kind of person—you’ve met them—who gets genuinely excited when a new benchmark drops on arXiv. Their eyes light up. They forward it to the team Slack channel with a note that says “we should try this.” And here’s what I don’t understand: at no point do they ask the only question that matters, which is:
“Does this measure anything our customers actually care about?”
I build AI agents that search large information repositories—think 10,000-table data lakes on Oracle’s AI Data Platform—and help customers find answers to questions they’ve been asking data engineers for years. Real questions. Questions like “Why did revenue dip in Q3?” and “Which suppliers have delivery issues?” The kind of questions where, if you get the answer wrong, someone notices. Someone important.
To evaluate this agent, I did what any reasonable engineer would do: I built tests based on the actual questions customers ask, manually evaluated the answers, and tweaked my prompts and tools in ways that generalize until my customers were happy.
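For the curious, here’s roughly what that looks like, stripped down to a sketch. The question list and the agent.answer() entry point are illustrative stand-ins, not our actual implementation.

```python
# Minimal sketch of a customer-question eval harness. The question list and the
# agent.answer() entry point are hypothetical stand-ins, not a real implementation.

CUSTOMER_QUESTIONS = [
    {
        "question": "Why did revenue dip in Q3?",
        "ground_truth_notes": "Should identify the regions and product lines driving the dip.",
    },
    {
        "question": "Which suppliers have delivery issues?",
        "ground_truth_notes": "Should surface suppliers with late-delivery rates above the norm.",
    },
]

def run_eval(agent):
    """Run every customer question through the agent and queue the answers for human review."""
    results = []
    for case in CUSTOMER_QUESTIONS:
        answer = agent.answer(case["question"])   # hypothetical entry point
        results.append({
            "question": case["question"],
            "answer": answer,
            "expected": case["ground_truth_notes"],
            "reviewer_verdict": None,             # filled in by a human, not a metric
        })
    return results
```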
Novel concept, I know. And then my colleagues got involved. And some of them—not all of them—wanted to throw this new thing called DABStep at it. It was my first encounter with it, so I took a deep look.
DABStep. A benchmark for “multi-step reasoning” across a handful of tables.
You know what? Let me be precise about this, because precision matters.
Here’s What DABStep Actually Is
DABStep—Data Agent Benchmark for Multi-step Reasoning—came out of Adyen and Hugging Face in June 2025. It’s 450 questions derived from real financial operations. That part’s good. The questions require navigating multiple data sources, consulting documentation, executing multi-step reasoning chains. Also good.
And the results are genuinely illuminating. On easy tasks—the ones you can nearly solve in a single shot—o4-mini hits 76% accuracy. On hard tasks requiring genuine multi-step reasoning? Same model drops to 14.5%. Claude 3.5 Sonnet manages 12%.
Here’s the thing, though: DABStep uses a single fixed dataset for all 450 questions. One dataset. The same tables, over and over. It evaluates only text-based, factoid-style outputs. And it focuses narrowly on payments and financial data.
Now. I’m solving a problem where customers have tens of thousands of tables. Where the hard part isn’t reasoning about data you’ve already found—it’s finding the right data in the first place. Where 80% of a data practitioner’s time goes to locating, cleaning, and organizing data, leaving only 20% for actual analysis.
DABStep measures depth. Customers want me to solve for breadth. They ask messy, multi-step, multi-subject, compound questions and expect answers to every part of them.
These are not the same problem. They’re not even adjacent problems. So I started looking at a number of other benchmarks in the data-analytics-agent space.
I’ve had the benefit of spending the last two months with customers, and I can’t say that a single one of these benchmarks actually mirrors a customer environment.
The Uncomfortable Truth About Benchmark Theater
Here’s what’s happening in this industry, and I want to be clear about it because clarity seems to be in short supply:
Andrej Karpathy—OpenAI co-founder, former Tesla AI director, not exactly a fringe voice—said in October that he’s become “suspicious” of benchmark rankings. His words:
“Public demos, benchmark competitions, chatbot conversations, and code-generation tests tend to reflect narrow optimizations, rather than addressing the hardest unsolved problems.”
Sara Hooker at Cohere Labs analyzed 2.8 million model comparison records. She found that Meta tested 27 model variants privately before picking their best one for public submission. Google tested 10. Amazon tested 7. This is what she called “score gaming.” This is what I call the opposite of useful information.
An Oxford Internet Institute study—and Oxford is not known for being cavalier about methodology—found that of 445 LLM benchmarks analyzed, only 16% use rigorous scientific methods.
Half of them claim to measure concepts like “reasoning” or “harmlessness” without bothering to define what those words mean.
And then there was the Meta Llama 4 situation in April, where Meta submitted a specially-crafted “experimental” variant to LMArena, optimized specifically for human preference voting, and rocketed to number two on the leaderboard. LMArena’s response was diplomatic. The industry’s response was less so. “Benchmark hacking” was the polite term.
Fortune Magazine—not exactly a radical publication—put it plainly:
“For companies in the market for enterprise AI models, basing decisions on these leaderboards alone can lead to costly mistakes.”
And by the way? The performance gap between benchmarks and reality isn’t subtle. Models achieve 86-94% accuracy on public benchmarks. On actual enterprise data? 6-38%. SAP found that LLMs scoring 0.94 F1 on public benchmarks dropped to 0.07 on real customer data. Customer-defined columns scored “near zero.”
So when someone tells me their benchmark is the gold standard, I have questions.
And one of the leaders in my org brought up a good point: overfitting. These are all examples of overfitting the model to the benchmark instead of looking carefully at which model solves the broadest enterprise problems and generates the best ROI. I would rather have access to seven variants of a model that give me options for solving a customer problem than one variant that passed an arbitrary benchmark that doesn’t help customers.
It makes me wonder whether we’d be better off just using the standard thumbs-up/thumbs-down UX for chatbot responses and fitting to that signal.
Single-Shot Testing Is Testing the Wrong Thing
Here’s what really bothers me about these benchmarks: they test whether an AI gets the right answer on the first try.
You know what? That’s not how AI creates value anymore. That’s not even close.
Modern AI agents excel through iteration. Through self-correction. Through failure recovery. Through trying something, realizing it’s wrong, and trying something better. Research from deepsense.ai found that a baseline single-request approach achieved 53.8% success on code generation. A multi-step agentic approach? 81.8%. Analytics Vidhya documented GPT-3.5 with an agent workflow outperforming GPT-4 zero-shot.
Read that again. The architecture mattered more than the model.
Anthropic’s engineering team explicitly recommends designing agents that “check and improve their own output” because they “catch mistakes before they compound, self-correct when they drift, and get better as they iterate.” Their multi-agent research architecture outperformed a single Claude Opus 4 agent by 90.2% on complex research tasks.
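If “iteration” sounds abstract, here is the loop that single-shot benchmarks never exercise, sketched with placeholder llm and run_sql callables rather than any particular vendor’s API:

```python
def answer_with_retries(question, llm, run_sql, max_attempts=5):
    """Draft a query, run it, let the model critique the result, and retry on failure.

    `llm` is any text-in/text-out callable and `run_sql` executes SQL against your
    warehouse; both are placeholders, not a specific vendor's API.
    """
    feedback = ""
    for attempt in range(max_attempts):
        # 1. Draft (or re-draft) an approach, folding in feedback from earlier attempts.
        sql = llm(f"Question: {question}\nPrior feedback: {feedback}\nWrite one SQL query.")
        try:
            rows = run_sql(sql)                       # 2. Actually try it.
        except Exception as err:                      # 3. Execution failed: feed the error back.
            feedback = f"Attempt {attempt + 1} failed: {err}"
            continue
        verdict = llm(                                # 4. Self-check the result before returning it.
            f"Question: {question}\nRows: {rows[:20]}\n"
            "Does this answer the question? Reply YES or explain what is wrong."
        )
        if verdict.strip().upper().startswith("YES"):
            return rows
        feedback = verdict                            # 5. The critique seeds the next attempt.
    return None                                       # Out of attempts; say so honestly.
```

A single-shot benchmark scores only what comes out of the first pass through that loop.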
The UIUC-Kang Lab found that benchmark flaws lead to “up to 100% relative performance misrepresentation.” They recommend process-based evaluation alongside outcome-based metrics. Because single-outcome metrics don’t tell you whether the agent reasoned correctly or just got lucky.
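And process-based evaluation isn’t exotic. A hedged sketch, scoring both the outcome and the trace, might look like this (the required-step rubric is illustrative, not anyone’s official methodology):

```python
def score_run(final_answer, trace, expected_answer, required_steps):
    """Score one agent run on both the outcome and the process that produced it.

    `trace` is a list of strings describing what the agent did; `required_steps`
    is an illustrative rubric (e.g. "joined orders to suppliers", "filtered to Q3").
    """
    outcome_score = 1.0 if expected_answer.lower() in final_answer.lower() else 0.0

    # Process score: what fraction of the expected steps actually appear in the trace?
    hits = sum(1 for step in required_steps if any(step in event for event in trace))
    process_score = hits / len(required_steps) if required_steps else 1.0

    # A right answer with a wrong process is a lucky guess, not a capability.
    return {"outcome": outcome_score, "process": process_score}
```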
And here’s my question: when you’re navigating a 10,000-table data lake, when you’re helping a customer understand why their supply chain metrics look wrong, when the stakes are real and the data is messy—do you want an AI that gets it right on the first try 14% of the time? Or do you want one that can try, recognize its mistake, adjust its approach, and eventually get you a useful answer?
Because I know which one my customers want. They want the one that works.
Sorry, tech leaders, here’s the basic math you need to think through:
Higher Accuracy / Response Quality / Creativity ==
Increased Response Output Token Limits ==
Latency
This is the bottom-line truth of transformer-based large language models. There is no way around it.
“LLM backspace” == latency. This is an O(n log n) problem at scale. Get over it.
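Back-of-the-envelope, with purely illustrative numbers (decode rates vary wildly by model, hardware, and serving stack):

```python
# Purely illustrative numbers: decode speed depends on the model, the hardware,
# and the serving stack. The shape of the math is the point, not the values.
decode_tokens_per_sec = 50        # assumed streaming decode rate
short_answer_tokens = 300         # terse, single-shot style response
deep_answer_tokens = 6_000        # multi-step reasoning plus a detailed write-up

print(short_answer_tokens / decode_tokens_per_sec)   # ~6 seconds
print(deep_answer_tokens / decode_tokens_per_sec)    # ~120 seconds, i.e. a couple of minutes
```

More thinking and more output means more tokens, and more tokens means waiting. Which brings us to the next point.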
The Latency Fallacy
Single-shot benchmarks have another problem, and it’s embedded so deep in their assumptions that people don’t even see it anymore: they optimize for latency.
Fast answers. Quick responses. Minimal thinking time.
And look, I understand the appeal. We’ve built AI into chatbots, and chatbots feel wrong when they take too long. There’s a UX expectation at play. But here’s what I need you to understand:
When the ROI of an AI tool is that 15 minutes with a bot saves 6 hours with a data engineer—does the difference between 15 minutes and 1 minute actually matter?
The documented time savings in enterprise AI are staggering. Legal medical chronology drafting: 480 minutes saved per case. Teachers using AI weekly: 5.9 hours saved per week. One franchise network: 140 consultant hours saved monthly.
These aren’t latency wins. These are existence wins. The AI either solves the problem or it doesn’t. Whether it takes 90 seconds or 900 seconds is, in the grand scheme of things, noise.
We’re placing more emphasis in these tests on UX than on ROI. The ROI is what matters. UX problems can be solved with UX. You can build an interface that says “I’m working on this, come back in ten minutes.” You can send an email when the answer’s ready. You can treat the AI like what it increasingly is: an asynchronous deep research agent, not a chatbot.
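A minimal sketch of that asynchronous pattern, assuming a hypothetical job store, agent, and notifier rather than any particular framework:

```python
import threading
import uuid

JOBS = {}  # in-memory stand-in for a real job store

def submit_question(question, agent, notify):
    """Accept the question immediately; do the slow agentic work out of band."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "running", "answer": None}

    def worker():
        answer = agent.answer(question)               # may take minutes, and that's fine
        JOBS[job_id] = {"status": "done", "answer": answer}
        notify(job_id)                                # e.g. email or Slack ping to the user

    # A real system would use a task queue; a thread keeps the sketch self-contained.
    threading.Thread(target=worker, daemon=True).start()
    return job_id                                     # the UI shows "working on it, check back soon"
```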
What you can’t do is fake ROI. Either the AI delivers value or it doesn’t. And single-shot benchmarks optimized for speed tell you almost nothing about whether it will.
The LLM Is Becoming the Customer
Here’s where it gets interesting. And I mean genuinely interesting, not “interesting” in the way people say when they mean “I have no idea what you’re talking about.”
The differentiation in AI is shifting. Better prompts? Everyone has those. Better models? Everyone has access to the same frontier models. What’s going to matter—what already matters—is the quality of the tools and data you give the AI to work with.
And here’s the paradigm shift that most people haven’t internalized yet: the LLM is becoming your customer.
Think about it. The LLM sits between your systems and your end user. It advocates on your behalf. It interprets your data, navigates your tools, presents your insights. And just like any customer, it has preferences. It has quirks. It has ways it likes to receive information.
A Towards Data Science article put it perfectly:
API responses consumed by LLMs are “in essence a reverse prompt.”
When you return an empty array for no results, you’ve created a dead end. When you return detailed guidance with suggested next steps, you’ve given the agent a thread to pull.
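Here’s a concrete, hypothetical illustration: two ways a search tool could respond when a query matches nothing. The schema names and the describe_schema tool are made up for the example.

```python
# Dead end: the agent has nowhere to go from here.
dead_end = {"results": []}

# Reverse prompt: the response itself tells the agent what to try next.
thread_to_pull = {
    "results": [],
    "message": "No tables matched 'supplier delays'.",
    "suggestions": [
        "Try the synonym 'late deliveries'; the logistics schema uses that term.",
        "Search the SUPPLY_CHAIN schema, which holds most vendor performance tables.",
        "Call describe_schema('SUPPLY_CHAIN') to list candidate tables first.",
    ],
}
```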
Stytch’s engineering team proposed using LLM agents as DX smoke tests: “The success rate of an LLM agent can be a direct reflection of the clarity of your docs and error design.”
And yet. Look at MCP—the Model Context Protocol. Anthropic discovered agents processing 150,000 tokens just to load tool definitions before reading the user’s request. Functionality achievable in 2,000 tokens. A single GitHub MCP server can expose 90+ tools, consuming 50,000+ tokens in JSON schemas alone.
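One obvious mitigation, sketched here with hypothetical names and any off-the-shelf embedding function: don’t hand the model all 90+ schemas up front; pick the handful that plausibly match the request and defer the rest.

```python
def cosine(a, b):
    """Plain cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def select_tools(user_request, all_tools, embed, top_k=8):
    """Send the model only the tool schemas most relevant to this request.

    `all_tools` is a list of {"name", "description", "schema"} dicts and `embed`
    is any text-embedding function; both are placeholders, not MCP's actual API.
    """
    request_vec = embed(user_request)
    ranked = sorted(
        all_tools,
        key=lambda tool: cosine(request_vec, embed(tool["description"])),
        reverse=True,
    )
    return ranked[:top_k]   # ~8 schemas instead of 90+ saves tens of thousands of tokens
```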
But that is a patch, not a solution. MCP is a protocol. It doesn’t speak to what I’d call the pseudo-neurology of the model. We get blunt guidance like “send the prior thinking messages so it does better” rather than “here’s how to store the strategy behind past answers and pass that strategy back in.”
The future isn’t a bigger pile of tools. It’s a better way of relating to tools. And we’re not building tests to evaluate that relationship at all.
What We Should Be Testing Instead
So here’s where I land on this, and I want to be unambiguous about it:
Academic benchmarks measure what’s easy to measure, not what matters. Single-shot accuracy on clean, small datasets tells us almost nothing about an AI system’s ability to navigate massive data landscapes, recover from errors, iterate toward solutions, and deliver measurable business ROI.
The mature enterprise AI teams are figuring this out. Samsung created TRUEBench because “existing benchmarks focus on academic or general knowledge tests.” Salesforce built internal benchmarks for CRM-specific tasks. OpenAI’s GDPval initiative acknowledges that “classic academic benchmarks like MMLU” don’t capture “real-world knowledge work.”
The path forward requires evaluation frameworks built around iterative agentic behavior. Enterprise-scale data navigation. Failure recovery capabilities. Business outcome measurement.
Gartner found that 49% of business leaders cite “proving generative AI’s business value” as the biggest hurdle to adoption. Only 15% have established formal metrics for measuring AI returns. We’re failing, 97% of the time, to demonstrate business value from AI efforts—and the problem isn’t AI capability. It’s measurement.
As one practitioner summarized: “Don’t let vendors’ benchmark theater guide your AI strategy. Build your own evaluation frameworks, test relentlessly on real tasks.”
The Purity Problem
I started by talking about purity. Let me finish there.
There’s a certain appeal to academic benchmarks. They’re clean. They’re standardized. They come from papers with citations and methodology sections. They feel rigorous. They feel pure.
But purity doesn’t pay bills. Happy customers pay bills.
And customers don’t care whether your AI scores 76% on DABStep. They care whether it can find the answer to their question in their data lake. They care whether it saves them six hours with a data engineer. They care whether it works.
So the next time someone forwards you a benchmark from arXiv and suggests you should “try it,” ask them a question first:
“What customer problem does this help us solve?”
If they can’t answer that—if the honest answer is “none, but it would be interesting”—then you’ve learned something important. Not about the benchmark. About your priorities.
Build tests that measure what matters. Test against the questions your customers actually ask. Evaluate the tools, not just the tools-plus-model. Measure ROI, not just accuracy. And remember that an AI that gets the right answer on the fifteenth try is infinitely more valuable than one that gets the wrong answer on the first.
Automation isn’t nirvana either. Your test, for now, may involve hand-reading LLM responses to see whether they match ground truth.
That’s not a complicated insight. It’s just one that gets lost in the pursuit of purity.




