AI agent evaluations: The hidden cost of deployment

Organizations deploying AI agents may be in for a nasty surprise when it comes to the cost of tuning their performance.

According to some surveys, nearly 80% of enterprises have deployed AI agents, but most don’t understand what it costs to train them and evaluate their outputs, and the resulting bills can far exceed expectations, experts say.

Many organizations are still experimenting to find the best ways to catch agent problems before they cause chaos after deployment, says Lior Gavish, cofounder and CTO at AI observability vendor Monte Carlo.

Because many organizations use a second large language model to vet the outputs of an LLM-powered agent, agent testing can be many times more expensive than testing traditional software, he says. Moreover, this method, called LLM as a judge, can be more expensive than running the agent itself, as the cost of running an LLM over an extended period can add up quickly.

“It’s tricky to test or monitor these outputs,” Gavish says. “People basically ask another LLM to rate the performance of an LLM based on various criteria, and the criteria vary wildly between different use cases.”

Monte Carlo saw this problem itself when the company left an LLM-powered eval running for days and ended up with a five-figure bill, Gavish notes. “An LLM call usually is orders of magnitude more expensive than anything that we would do in traditional software,” he says.

LLMs rating LLMs

Using a second LLM to review the outputs of an agent can also be problematic because it assumes the second LLM’s conclusions are accurate, Gavish says. Questions about accuracy can add to costs if organizations keep running tests to verify results.

“These checks are non-deterministic and not even repeatable,” he says. “You might get different answers and different runs if you’re not careful, so it’s different from more traditional software monitoring or testing where it either passed or it failed.”
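One common way to cope with that variability is to run the judge several times on the same output and report how often the runs agree. The following is a minimal Python sketch of that idea, not a method described by Gavish. The names `judge_with_repeats` and `stub_judge` are hypothetical, and the stub uses a random number generator in place of a real LLM call.

```python
import random
from collections import Counter
from typing import Callable


def judge_with_repeats(judge: Callable[[str], str], output: str, runs: int) -> tuple[str, float]:
    """Call the judge `runs` times on one output.

    Returns the most common verdict and the share of runs that agreed with it.
    Use an odd number of runs with a two-label judge to avoid ties.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    verdicts = [judge(output) for _ in range(runs)]
    verdict, votes = Counter(verdicts).most_common(1)[0]
    return verdict, votes / runs


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the demo is repeatable

    def stub_judge(output: str) -> str:
        # Hypothetical stand-in for an LLM judge: passes about 80% of the time.
        return "pass" if rng.random() < 0.8 else "fail"

    verdict, agreement = judge_with_repeats(stub_judge, "agent reply text", runs=5)
    print(verdict, agreement)
```

Each extra run is another paid judge call, so the repeat count is itself a cost decision. A low agreement share is a reasonable signal to route that output to a human reviewer.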

The cost of agent evals can vary wildly depending on the complexity of the agent, says Russell Twilligear, head of AI R&D at AI-generated content provider BlogBuster. For example, an evaluation for a small, well-scoped agent can run into the thousands of dollars, while evals for more complex agents can cost tens of thousands of dollars, he says.

“You have to factor in all of the test runs, logging, and human reviews,” Twilligear notes. “Every single change means they have to rerun the evals, and that adds up pretty fast.”

Agent evals can be complicated because they test for several possible metrics, including agent reasoning, execution, data leakage, response tone, privacy, and even moral alignment, according to AI experts.

Good evals incorporate a human element, with subject-matter experts needed to check agent outputs, says Paul Ferguson, founder of Clearlead AI Consulting. A major challenge in agent evals is establishing what “correct” means in ambiguous use cases, he adds.

Most IT leaders budget for obvious costs — including compute time, API calls, and engineering hours — but miss the cost of human judgment in defining what Ferguson calls the “ground truth.”

“When evaluating whether an agent properly handled a customer query or drafted an appropriate response, you need domain experts to manually grade outputs and achieve consensus on what ‘correct’ looks like,” he adds. “This human calibration layer is expensive and often overlooked.”

Software evals can be straightforward when organizations are checking for code to compile and pass all unit tests, he says. “But for the vague queries like, ‘Help me understand this data,’ or ‘Draft a response to this customer,’ defining what constitutes a correct answer becomes genuinely difficult,” he adds. “Even humans can disagree in some cases.”
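How much human graders disagree can be measured before their grades are treated as ground truth. The sketch below uses Cohen's kappa, a standard agreement statistic that the experts quoted here do not mention by name. It compares two graders' labels on the same outputs and corrects for agreement expected by chance. The function name and the sample labels are illustrative only.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(grader_a: Sequence[str], grader_b: Sequence[str]) -> float:
    """Cohen's kappa for two graders who labeled the same items.

    1.0 means perfect agreement; 0.0 means no better than chance.
    """
    if len(grader_a) != len(grader_b) or len(grader_a) == 0:
        raise ValueError("both graders must label the same non-empty set of items")
    n = len(grader_a)
    # Share of items on which the two graders gave the same label.
    observed = sum(a == b for a, b in zip(grader_a, grader_b)) / n
    freq_a, freq_b = Counter(grader_a), Counter(grader_b)
    # Agreement expected if each grader labeled at random with their own label rates.
    expected = sum((freq_a[label] / n) * (freq_b[label] / n) for label in freq_a)
    if expected == 1.0:
        return 1.0  # both graders used a single identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Illustrative labels only, not data from the article.
    a = ["ok", "ok", "bad", "ok"]
    b = ["ok", "bad", "bad", "ok"]
    print(cohens_kappa(a, b))
```

A low score on a pilot batch suggests the grading rubric needs tightening before more expert hours are spent on labeling.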

Agent evaluation advice

The sticker shock of agent evals rarely comes from the compute costs of the agent itself; it comes from the “non-deterministic multiplier” of testing, adds Chengyu “Cay” Zhang, founding software engineer at voice AI vendor Redcar.ai. He compares training agents to training new employees, noting that both have moods.

“You can’t just test a prompt once; you have to test it 50 times across different scenarios to see if the agent holds up or if it hallucinates,” he says. “Every time you tweak a prompt or swap a model, you aren’t just running one test; you’re rerunning thousands of simulations.”
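The multiplier Zhang describes is simple arithmetic, and a rough Python sketch makes it concrete. Only the 50 repeats come from his quote. The scenario count, the number of prompt or model changes, and the per-call prices are hypothetical placeholders, not real vendor pricing. Costs are kept in whole cents to avoid floating-point rounding.

```python
def total_eval_runs(scenarios: int, repeats: int, changes: int) -> int:
    """Every prompt or model change reruns every scenario `repeats` times."""
    return scenarios * repeats * changes


def total_eval_cost_cents(runs: int, agent_cents_per_run: int, judge_cents_per_run: int) -> int:
    """Each run pays for one agent call plus one LLM-as-a-judge call."""
    return runs * (agent_cents_per_run + judge_cents_per_run)


if __name__ == "__main__":
    # Hypothetical inputs: 40 scenarios, 50 repeats each, 10 prompt or model changes.
    runs = total_eval_runs(scenarios=40, repeats=50, changes=10)
    # Hypothetical prices: 1 cent per agent run, 2 cents per judge call.
    cents = total_eval_cost_cents(runs, agent_cents_per_run=1, judge_cents_per_run=2)
    print(f"{runs} runs, ${cents / 100:.2f}")
```

Under these placeholder numbers, a single change already triggers 2,000 runs, which is where the “thousands of simulations” come from. The figure excludes logging and human review time.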

There are several ways to run agent evals, including low-cost unit testing, synthetic grading using another AI model, red-team simulations, and high-cost human shadowing, in which a human expert runs alongside an agent for a week or more, Zhang says.

Organizations often look for shortcuts, usually by relying entirely on other AI models to do the grading, he says, recommending against that route.

“My view is that evaluations are an insurance policy,” he says. “Shortcuts in evals are just deferred technical debt that you pay with interest when the agent hallucinates in front of a VIP client. You might save $10,000 on evals today, but if your financial agent hallucinates a transaction, that cost is negligible compared to the brand damage.”

If an organization wants to save money, the better alternative is to narrow the agent’s scope, instead of cutting back on testing, Zhang adds.

“If you skip the expensive steps — like human review or red-teaming — you’re relying entirely on probability,” he says.

To limit eval costs, Clearlead AI Consulting’s Ferguson recommends organizations start with use cases that have clear right and wrong answers, like code compilation, before tackling more subjective scenarios.

Organizations should also use LLM evaluation frameworks such as LangSmith, PromptLayer, or Ragas rather than building their own tools from scratch, he advises.

IT teams should also start testing early, he adds. “Building evaluations before production is far cheaper than retrofitting them later,” Ferguson says.

Monte Carlo’s Gavish offers other ways to keep costs down, such as setting spending limits for evals and performing due diligence on which LLMs to use when testing agents.

“You can rightsize the model a little bit,” he says. “Of course, you can use the latest and greatest ChatGPT for every evaluation, but you probably shouldn’t.”
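A spending limit of the kind Gavish suggests can be enforced in the eval loop itself. The following Python sketch is one possible approach under stated assumptions, not Monte Carlo's implementation. `EvalBudget`, `BudgetExceeded`, and `run_evals` are hypothetical names, the judge is any callable, and the per-call cost is a placeholder the caller supplies.

```python
from typing import Callable, Iterable


class BudgetExceeded(Exception):
    """Raised when a charge would push spending past the cap."""


class EvalBudget:
    def __init__(self, limit_cents: int) -> None:
        self.limit_cents = limit_cents
        self.spent_cents = 0

    def charge(self, cents: int) -> None:
        # Refuse the charge before the call is made, so the cap is never crossed.
        if self.spent_cents + cents > self.limit_cents:
            raise BudgetExceeded(f"cap of {self.limit_cents} cents reached")
        self.spent_cents += cents


def run_evals(cases: Iterable[str], judge: Callable[[str], str],
              cost_per_call_cents: int, budget: EvalBudget) -> list[str]:
    """Judge cases in order and stop cleanly once the budget runs out."""
    results = []
    for case in cases:
        try:
            budget.charge(cost_per_call_cents)
        except BudgetExceeded:
            break  # leave the remaining cases unjudged rather than overspend
        results.append(judge(case))
    return results


if __name__ == "__main__":
    budget = EvalBudget(limit_cents=10)
    verdicts = run_evals([f"case {i}" for i in range(10)],
                         judge=lambda case: "pass",  # stand-in for a real judge call
                         cost_per_call_cents=3, budget=budget)
    print(len(verdicts), budget.spent_cents)
```

Rightsizing shows up here as a lower `cost_per_call_cents`: a smaller judge model lets the same cap cover more cases.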
