New IT roles emerge to tackle AI evaluation

New IT jobs are emerging to help organizations better evaluate AI outputs as they move from AI pilots to full-scale deployments.

Many organizations are now considering assembling or hiring AI evaluation teams, with some experts calling these recently created roles an essential safety net for companies rolling out AI tools.

The rapid rise of AI agents is spurring this trend, with AI evaluation teams beginning to take shape in recent months, says Yasmeen Ahmad, managing director of product management, data, and AI cloud at Google Cloud.

“Up until now, we weren’t really at the stage of having multi-step reasoning, complex agents that are autonomous,” she says. “As customers look at how agents are behaving in the wild, so to speak, there’s a realization that evaluation isn’t a gate, it has to be a continuous practice.”

At Google, evaluation teams are embedded with agent development groups so that the two functions happen simultaneously, Ahmad says.

“As the agent builders are building, there’s eval happening right alongside so that there’s that fast iteration loop,” she says.

Other organizations have begun to create AI evaluation task forces within their larger AI and IT departments, says Maksim Hodar, CIO at software development firm Innowise. In some cases, companies are combining data architects, security officers, and compliance leads into the new team, instead of recruiting from scratch, he notes.

Evaluation becomes necessary

AI evaluation team members take a hybrid role, sitting between raw coding and ethical business practices, Hodar adds.

“It is safe to say that we are witnessing the evolution of the AI evaluation team from ‘nice-to-have’ to a necessity,” Hodar says. “We’ve observed that companies are moving away from blind AI adoption and embracing a more measured approach to the so-called ‘safety net.’”

While an emerging set of tools, including observability and governance products, focuses on preventing AI slop, technology isn’t a complete answer, he adds. Humans will be needed to decide whether an AI tool is aligned with company values and complies with regulations such as GDPR, he says.

“While technology can identify technical errors, it cannot evaluate context,” Hodar adds. “Technology helps provide information, but the evaluation team still gives the green light. You cannot automate accountability.”

Human evaluation teams need the data that observability tools provide, but the technology itself cannot provide the context that AI models and agents need to repair bad outputs, adds Google’s Ahmad. AI agents have gotten very good at passing output checks in testing environments, but evaluation teams are needed to track their output in real-world situations, she says.

“Agentic apps might pass the initial unit test of this specific scenario that you were outlining,” she says. “But agentic systems are non-deterministic decision-makers, so it doesn’t behave; you’re not testing for all the potential ways it could behave out in the real world.”
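One way to picture the gap between a single passing test and real-world behavior is to run the same scenario many times and measure how often the agent passes. The sketch below is illustrative only: `toy_agent`, its 90% success rate, the refund scenario, and the `pass_rate` helper are all hypothetical stand-ins, not anything described by the people quoted here.

```python
import random
from typing import Callable


def pass_rate(agent: Callable[[str, random.Random], str],
              prompt: str,
              check: Callable[[str], bool],
              trials: int = 200,
              seed: int = 0) -> float:
    """Run the agent repeatedly on one prompt; return the fraction of runs that pass."""
    rng = random.Random(seed)
    passes = sum(check(agent(prompt, rng)) for _ in range(trials))
    return passes / trials


def toy_agent(prompt: str, rng: random.Random) -> str:
    # Hypothetical stand-in for a real agent (assumption):
    # it gives the expected answer roughly 90% of the time.
    return "refund approved" if rng.random() < 0.9 else "refund denied"


rate = pass_rate(
    toy_agent,
    "Customer requests refund within policy window",
    check=lambda out: out == "refund approved",
)
print(f"pass rate over repeated runs: {rate:.2%}")
```

A single run of this toy agent would usually pass, which is exactly why one green unit test says little about a non-deterministic system; the repeated-trial rate is the more honest number.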

While an observability tool may be able to provide data on token usage, tool usage, tool failures, and reasoning errors, human evaluators are needed to fix many of the problems, she adds. Evaluation teams can also supply the context behind the reasoning errors that agents commonly make, she says.
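As a rough illustration of that division of labor, the sketch below aggregates the kinds of signals mentioned above (token usage, tool calls, tool failures, reasoning errors) and queues the problem traces for a human evaluator. The `Trace` record, its field names, and the sample data are assumptions made for this example, not the schema of any particular observability product.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    # Hypothetical per-run record; field names are assumptions for illustration.
    trace_id: str
    tokens: int
    tool_calls: int
    tool_failures: int
    reasoning_error: bool


def summarize(traces: list[Trace]) -> dict:
    """Aggregate metrics and list the traces a human evaluator should inspect."""
    total_calls = sum(t.tool_calls for t in traces)
    total_failures = sum(t.tool_failures for t in traces)
    return {
        "total_tokens": sum(t.tokens for t in traces),
        "tool_failure_rate": total_failures / total_calls if total_calls else 0.0,
        # The tooling only flags these runs; deciding why they failed is left to people.
        "needs_review": [
            t.trace_id for t in traces
            if t.tool_failures > 0 or t.reasoning_error
        ],
    }


traces = [
    Trace("a1", tokens=1200, tool_calls=4, tool_failures=0, reasoning_error=False),
    Trace("a2", tokens=3400, tool_calls=6, tool_failures=2, reasoning_error=False),
    Trace("a3", tokens=900, tool_calls=2, tool_failures=0, reasoning_error=True),
]
report = summarize(traces)
print(report)
```

The code can count failures and surface them, but it has no way to say what context the agent was missing; that judgment stays with the review queue.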

“When our internal eval teams are spending a lot of time on our agents, a big chunk of time is, ‘Why did the reasoning logic fail here?’” Ahmad says. “It’s because the agent doesn’t have access to enough context. The solve to that is providing the right context at the right layers in the agent so that it can make better reasoning decisions.”

Testing in a complex environment

A good evaluation team also addresses several other issues, including governance, cultural readiness, organizational workflow alignment, and measurable business impact of AI tools, adds Noe Ramos, vice president of AI operations at contract lifecycle management vendor Agiloft. Technology alone can’t deal with all those issues, she says.

“The biggest hurdle isn’t technical — it’s human,” she adds. “You can buy powerful tools and still struggle if people don’t trust them, understand them, or see how they fit into their work.”

Like Hodar and Ahmad, Ramos sees growing demand for AI evaluation teams, although the roles are emerging more as a capability than as formalized titles.

“As organizations move beyond experimentation, they’re realizing AI can’t be deployed based on excitement alone,” she adds.

A formal evaluation discipline becomes essential as organizations scale AI, she stresses.

“Ultimately, AI evaluation isn’t just about safety — it’s about ensuring AI drives clarity and action rather than adding noise,” Ramos says. “Or, as we frame it internally, we’re using AI to drive clarity and action — not overwhelming teams with more dashboards.”

Ramos was recently promoted from vice president of IT to vice president of AI operations, and her team includes an AI operations lead, an AI agent engineer, and a GPT and AI systems lead, she notes. The goal is to embed evaluation into Agiloft’s AI operating model.

As organizations mature in their use of AI, the shift from enthusiasm to disciplined evaluation is creating the need for a structured evaluation function, she adds.

“In my experience, one of the biggest risks is that AI initiatives become driven by the squeakiest wheels rather than real operational priorities,” she adds. “I don’t think that AI development should rely on the loudest voices; it should be around the most sound being amplified for the impact in the organization.”

In most enterprises, the evaluation role should sit at the intersection of IT, security, data leadership, and operational stakeholders, Ramos says, adding that evaluation leaders also need to have a deep understanding of how the organization functions.

“One of the reasons AI evaluation fails is that companies don’t always understand their own workflows,” Ramos says. “You can’t intelligently evaluate AI against workflows you haven’t mapped, bottlenecks you haven’t identified, or priorities you haven’t aligned.”

