Somewhere in your organization’s hiring stack, there is probably an AI system producing candidate scores. If you’re a leader who helped evaluate or approve that system, here’s a question worth sitting with: If one of those scores were challenged by a candidate, an internal audit or a regulator, could your team explain how it was produced?
Not “the vendor said it’s accurate.” Not “the model was trained on historical data.” A specific, documented explanation of what criteria were evaluated, how the candidate performed against them and why those criteria are job-relevant.
For a growing number of organizations using AI video interview scoring tools, the honest answer is no. And as regulatory frameworks targeting employment AI move from guidance to enforcement, that answer is a risk.
What the system is actually optimizing for
Before asking how accurate an AI scoring system is, ask what it is optimizing for.
Many video interview scoring platforms evaluate tone of voice, pace, eye contact, facial expressions and fluency alongside, or in some cases instead of, the actual content of candidate responses. The underlying assumption is that these signals correlate with job performance or cultural fit. The evidence for that assumption is weak. The evidence that measuring these signals introduces systematic, legally significant bias is much stronger.
Several major players in this space removed facial analysis features after regulatory pressure and public scrutiny. That acknowledgment — that criteria advertised as objective were neither reliable nor fair — should raise a harder question. If those criteria were in production and no one caught it until outside pressure forced a change, what else is still being measured that shouldn’t be?
This is not a hypothetical risk. The EEOC has made it clear that employers are liable under Title VII for discriminatory outcomes from AI hiring tools, regardless of whether those tools were built in-house or purchased from a vendor. New York City’s Local Law 144 requires annual independent bias audits of automated employment decision tools and public disclosure of results. Illinois requires notice and consent before AI is used to evaluate video interviews. The EU AI Act, whose high-risk AI provisions take full effect this August, explicitly classifies employment AI as high-risk, with binding requirements for transparency, explainability and human oversight.
The common thread: Can you explain what your AI is measuring, and can you demonstrate that it’s measuring the right things?
The accountability problem at the executive level
For technology leaders, this is where the conversation becomes concrete.
Consider the scenario: A hiring decision gets challenged by a candidate, an internal audit or a regulator. The question is how the decision was made. “The AI scored them lower” is not a defensible answer in any of those contexts. It can’t be traced to specific job-relevant criteria. It can’t be explained to the candidate. It won’t satisfy an auditor. And if the system’s logic is proprietary and opaque, the organization has no way to produce a satisfying answer even if it wants to.
The organizations that adopt black-box scoring tools often do so with the right intentions: to reduce human bias and create a more consistent process. Those are legitimate goals. But a system whose internal logic can’t be questioned, explained or audited just obscures bias. It doesn’t reduce it. And when bias becomes difficult to see, it becomes more difficult to address.
This is a pattern you’ll recognize from other domains. When a system produces outcomes that look plausible but are wrong in ways that aren’t immediately visible, the failure compounds before it surfaces. The cost of discovering it late is almost always higher than the cost of building it right from the start.
What a defensible architecture looks like
There is a meaningful difference between AI that scores interviews and AI that scores interviews in a way that can be explained and defended. The distinction is structural.
Defensible scoring starts before any candidate records a response. It starts with the job. What competencies does this role require, and what does strong performance against each competency look like? From those answers, explicit rubrics are developed: criteria that describe what high-quality, adequate and weak responses look like for each dimension being evaluated. Those rubrics are reviewed and approved by the hiring team before scoring begins.
When responses come in, the AI evaluates what candidates actually said against those pre-defined criteria. Not tone. Not pacing. Not facial expression. What they communicated, measured against a standard the hiring team set and can explain. Criterion-level scores roll up to an overall assessment, and every part of that chain is visible and auditable.
This architecture has an important secondary property: The human remains meaningfully in the loop. The AI generates a starting point by identifying relevant competencies and drafting rubric criteria from the job description, but the standard is owned by the people responsible for the hire. If a hiring manager can’t look at a scoring rubric and explain what it’s evaluating and why, it should not be deployed. That is not a burden on the tool. It is the minimum condition for using it responsibly.
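To make that structure concrete, here is a minimal sketch of a rubric-first scoring pipeline. Every name in it (Criterion, Rubric, CriterionScore, the weighting scheme, the approval field) is an illustrative assumption rather than a description of any particular product; the point is simply that no overall score exists without an approved, job-specific criterion behind each of its parts.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One job-relevant dimension, with descriptors the hiring team can read and approve."""
    name: str                    # e.g. "Explains a technical trade-off to a non-technical stakeholder"
    weight: float                # relative importance, set by the hiring team
    descriptors: dict[int, str]  # score level -> what a response at that level looks like


@dataclass
class Rubric:
    role: str
    criteria: list[Criterion]
    approved_by: str | None = None  # scoring is blocked until a named human signs off


@dataclass
class CriterionScore:
    criterion: str
    level: int
    evidence: str  # quote or summary from the transcript that justifies the level


def score_response(rubric: Rubric, scores: list[CriterionScore]) -> dict:
    """Roll criterion-level scores up to an overall assessment, keeping the full chain auditable."""
    if rubric.approved_by is None:
        raise ValueError("Rubric must be reviewed and approved before any candidate is scored.")
    by_name = {c.name: c for c in rubric.criteria}
    total_weight = sum(by_name[s.criterion].weight for s in scores)
    overall = sum(by_name[s.criterion].weight * s.level for s in scores) / total_weight
    return {
        "role": rubric.role,
        "approved_by": rubric.approved_by,
        "overall": round(overall, 2),
        "breakdown": [  # the record a candidate, auditor or regulator would be shown
            {
                "criterion": s.criterion,
                "level": s.level,
                "descriptor": by_name[s.criterion].descriptors[s.level],
                "evidence": s.evidence,
            }
            for s in scores
        ],
    }
```

Nothing in this sketch is sophisticated, and that is the point: the overall number is derived from, and stored alongside, the criterion-level breakdown, so it can always be traced back to standards the hiring team approved.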
Four questions for the governance conversation
For leaders evaluating or overseeing AI video interview tools, four questions surface most of what matters.
- What specifically is the system scoring? Request an explicit list of evaluation criteria. If the answer includes anything beyond the content of candidate responses, ask for the validation data that connects those criteria to job performance outcomes.
- Are the criteria derived from job requirements? Generic rubrics applied uniformly across roles produce standardized evaluation, which is not the same as structured evaluation. Legitimate scoring starts from the specific competencies required for the specific role.
- Can the criteria be reviewed, modified and approved before scoring begins? If the rubrics are fixed and opaque, the organization is not in control of its own evaluation standard. That is a governance gap.
- Can any score be explained to a candidate or a regulator? This is the accountability test. If the explanation requires “the AI said so” rather than pointing to specific, documented criteria and how a candidate performed against them, the process will not withstand scrutiny. (A sketch of what such an explanation can look like follows this list.)
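The fourth question is the easiest to make concrete. Continuing the illustrative sketch above (the field names remain assumptions, not any vendor’s schema), a score passes the accountability test if the stored record alone is enough to generate a plain-language account of how it was produced, without re-running the model:

```python
def explain_score(candidate: str, record: dict) -> str:
    """Turn a stored scoring record into the explanation a candidate or auditor would read."""
    lines = [
        f"{candidate} was assessed for the role of {record['role']} and received an overall "
        f"score of {record['overall']}, against a rubric approved by {record['approved_by']}."
    ]
    for item in record["breakdown"]:
        lines.append(
            f"- {item['criterion']}: level {item['level']} ({item['descriptor']}). "
            f"Evidence: {item['evidence']}"
        )
    return "\n".join(lines)
```

If a system cannot produce something like this from its own records, either the breakdown was never captured or it cannot be disclosed; in both cases the accountability test fails.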
Well-designed systems answer these questions directly. The ones that can’t are telling you something important about the tradeoffs their creators made.
Why this moment matters
The EU AI Act deadline is in August, forcing organizations with global operations or EU-based candidates to evaluate their technology. But getting this right isn’t just a regulatory matter; it’s a practical one.
When hiring teams can see exactly how a score was produced, they use it. When they can’t explain it, they override it or work around it, and the efficiency gains disappear. The tools that will last in enterprise hiring stacks are the ones that make decisions transparently enough that the humans responsible for those decisions trust them.
That’s not a high bar. But it requires being precise about what any given AI system is really measuring. And honest about whether that’s what you actually want to know.
This article is published as part of the Foundry Expert Contributor Network.