When it comes to AI, not all data is created equal

Gen AI is becoming a disruptive influence on nearly every industry, but using the best AI models and tools isn’t enough. Everybody’s using the same ones. What really creates competitive advantage is being able to train and fine-tune your own models, or provide unique context to them, and that requires data.

Your company’s extensive code base, documentation, and change logs? That’s data for your coding agents. Your library of past proposals and contracts? Data for your writing assistants. Your customer databases and support tickets? Data for your customer service chatbot.

But just because all this data exists doesn’t mean it’s good.

“It’s so easy to point your models to any data that’s available,” says Manju Naglapur, SVP and GM of cloud, applications, and infrastructure solutions at Unisys. “For the past three years, we’ve seen this mistake made over and over again. The old adage ‘garbage in, garbage out’ still holds true.”

According to a Boston Consulting Group survey released in September, 68% of 1,250 senior AI decision-makers said lack of access to high-quality data was a key challenge in adopting AI. Other recent research confirms this. In an October Cisco survey of over 8,000 AI leaders, only 35% of companies had clean, centralized data with real-time integration for AI agents. And by 2027, according to IDC, companies that don’t prioritize high-quality, AI-ready data will struggle to scale gen AI and agentic solutions, resulting in a 15% productivity loss.

Losing track of the semantics

Another problem with using data that’s all lumped together is that the semantic layer gets confused. When data comes from multiple sources, the same type of information can be defined and structured in many different ways. And as the number of data sources proliferates due to new projects or new acquisitions, the challenge grows. Even keeping track of customers, the most critical data type, and resolving basic data issues is difficult for many companies.

Dun & Bradstreet reported last year that more than half of organizations surveyed have concerns about the trustworthiness and quality of the data they’re leveraging for AI. For example, in the financial services sector, 52% of companies say AI projects have failed because of poor data. And for 44%, data quality is their biggest concern for 2026, second only to cybersecurity, based on a survey of over 2,000 industry professionals released in December.

Having multiple conflicting data standards is a challenge for everybody, says Eamonn O’Neill, CTO at Lemongrass, a cloud consultancy.

“Every mismatch is a risk,” he says. “But humans figure out ways around it.”

AI can also be configured to do something similar, he adds, if you understand what the challenge is, and dedicate time and effort to address it. Even if the data is clean, a company should still go through a semantic mapping exercise. And if the data isn’t perfect, it’ll take time to tidy it up.

“Take a use case with a small amount of data and get it right,” he says. “That’s feasible. And then you expand. That’s what successful adoption looks like.”
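
As a sketch of what a semantic mapping exercise like the one O’Neill describes might look like, the hypothetical example below reconciles the same customer from two source systems (a CRM and a billing system, with made-up field names) into one canonical schema before any model sees the data:

```python
# Hypothetical records: the same customer, represented differently by two systems.
crm_record = {"cust_id": "C-1001", "full_name": "Ada Lovelace", "revenue_usd": 12000}
billing_record = {"customerNumber": "C-1001", "name": "Lovelace, Ada", "annual_rev": "12,000"}

def to_canonical_crm(rec):
    """Map a CRM record into the canonical customer schema."""
    return {"customer_id": rec["cust_id"],
            "name": rec["full_name"],
            "annual_revenue_usd": float(rec["revenue_usd"])}

def to_canonical_billing(rec):
    """Map a billing record into the same canonical schema."""
    last, first = [p.strip() for p in rec["name"].split(",")]
    return {"customer_id": rec["customerNumber"],
            "name": f"{first} {last}",
            "annual_revenue_usd": float(rec["annual_rev"].replace(",", ""))}

# After mapping, both sources agree on what a "customer" is.
canonical = to_canonical_crm(crm_record)
```

Starting with one small, well-understood entity like this, then expanding source by source, matches the incremental adoption path O’Neill recommends.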

Unmanaged and unstructured

Another mistake companies make when connecting AI to company information is pointing it at unstructured data sources, says O’Neill. Yes, LLMs are very good at reading unstructured data and making sense of text and images. The problem is that not all documents are worthy of the AI’s attention.

Documents could be out of date, for example. Or they could be early versions of documents that haven’t been edited yet, or that have mistakes in them.

“People see this all the time,” he says. “We connect your OneDrive or your file storage to a chatbot, and suddenly it can’t tell the difference between ‘version 2’ and ‘version 2 final.’”

It’s very difficult for human users to maintain proper version control, he adds. “Microsoft can handle the different versions for you, but people still do ‘save as’ and you end up with a plethora of unstructured data,” O’Neill says.
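
One pragmatic mitigation, sketched below with hypothetical filenames and an assumed ad-hoc naming convention (`_v2`, `_final`, `_draft`), is to deduplicate “save as” versions before indexing a file share for a chatbot, keeping only the best candidate per document:

```python
import re

# Hypothetical listing from a shared drive littered with "save as" copies.
files = ["policy_v1.docx", "policy_v2.docx", "policy_v2_final.docx",
         "handbook.docx", "handbook_draft.docx"]

def base_and_rank(name):
    """Strip version suffixes; rank 'final' above numbered versions, drafts lowest."""
    stem = name.rsplit(".", 1)[0]
    rank = 0
    if stem.endswith("_final"):
        stem, rank = stem[:-6], 1000
    m = re.search(r"_v(\d+)$", stem)
    if m:
        rank += int(m.group(1))
        stem = stem[:m.start()]
    if stem.endswith("_draft"):
        stem, rank = stem[:-6], -1
    return stem, rank

# Keep only the highest-ranked version of each document for indexing.
latest = {}
for f in files:
    stem, rank = base_and_rank(f)
    if stem not in latest or rank > latest[stem][0]:
        latest[stem] = (rank, f)

to_index = sorted(f for _, f in latest.values())
```

This is a heuristic, not a substitute for real version control, but it keeps “version 2” and “version 2 final” from competing for the chatbot’s attention.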

Losing track of security

When CIOs think of security as it relates to AI systems, they typically consider guardrails on the models, or protections around the training data and the data used for RAG embeddings. But as chatbot-based AI evolves into agentic AI, the security problems get more complex.

Say for example there’s a database of employee salaries. If an employee has a question about their salary and asks an AI chatbot embedded into their portal, the RAG embedding approach would be to collect only the relevant data from the database using traditional code, embed it into the prompt, and then send the query off to the AI. The AI sees only the information it’s allowed to see, and the traditional, deterministic software stack handles the problem of keeping the rest of the employee data secure.
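
A minimal sketch of that pattern, with a made-up in-memory salary table standing in for a real database, shows how deterministic code keeps everyone else’s data out of the prompt:

```python
# Hypothetical salary table; in production this would be a parameterized DB query.
SALARIES = {"emp_42": {"name": "Sam", "salary": 95000},
            "emp_43": {"name": "Kim", "salary": 88000}}

def build_prompt(requesting_employee_id: str, question: str) -> str:
    # Deterministic code enforces scope: only the requester's own row is fetched.
    row = SALARIES[requesting_employee_id]
    context = f"Employee {row['name']} has a salary of ${row['salary']}."
    return f"Context: {context}\n\nQuestion: {question}"

prompt = build_prompt("emp_42", "How does my salary compare to last year?")
# The model never receives emp_43's row, so it cannot leak it.
```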

But when the system evolves into an agentic one, the AI agents can query the databases autonomously via MCP servers, and since they need to be able to answer questions from any employee, they require access to all employee data, and keeping it from getting into the wrong hands becomes a big task.

According to the Cisco survey, only 27% of companies have dynamic and detailed access controls for AI systems, and fewer than half feel confident in safeguarding sensitive data or preventing unauthorized access.
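
One form those dynamic access controls can take, sketched here with hypothetical names rather than any real MCP server implementation, is an authorization check inside the tool itself, so the tool rather than the model decides which rows a given caller may see:

```python
# Hypothetical data and roles for illustration only.
SALARIES = {"emp_42": 95000, "emp_43": 88000}
HR_STAFF = {"emp_99"}

def get_salary(caller_id: str, subject_id: str) -> int:
    """Agent-facing tool with a per-call authorization check.

    The check is enforced in deterministic code, so a persuasive prompt
    cannot talk the model past it: self-service or HR staff only.
    """
    if caller_id != subject_id and caller_id not in HR_STAFF:
        raise PermissionError(f"{caller_id} may not read {subject_id}'s salary")
    return SALARIES[subject_id]
```

The key design choice is that the caller’s identity travels with every tool call and is checked at the data boundary, not inferred by the model.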

And the situation gets even more complicated if all the data is collected into a data lake, says O’Neill.

“If you’ve put in data from lots of different sources, each of those individual sources might have its own security model,” he says. “When you pile it all into block storage, you lose that granularity of control.”

Trying to add the security layer back in after the fact can be difficult. The solution, he says, is to go directly to the original data sources and skip the data lake entirely.

The data lake’s original appeal was different, he notes. “It was about keeping history forever because storage was so cheap, and machine learning could see patterns over time and trends,” he says. “Plus, cross-disciplinary patterns could be spotted if you mix data from different sources.”

In general, data access changes dramatically when instead of humans, AI agents are involved, says Doug Gilbert, CIO and CDO at Sutherland Global, a digital transformation consultancy.

“With humans, there’s a tremendous amount of security that lives around the human,” he says. “For example, most user interfaces have been written so if it’s a number-only field, you can’t put a letter in there. But once you put in an AI, all that’s gone. It’s a raw back door into your systems.”
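
One way to close the back door Gilbert describes is to restore the old UI’s checks at the tool or API boundary, since an agent can pass arbitrary strings where a form once enforced a number-only field. The sketch below uses a hypothetical discount parameter and an assumed 0–50% business rule:

```python
def set_discount(raw_value: str) -> float:
    """Validate agent-supplied input the way the old number-only UI field did."""
    value = float(raw_value)        # raises ValueError on non-numeric input
    if not 0.0 <= value <= 50.0:    # business rule the UI used to enforce
        raise ValueError(f"discount {value}% out of range")
    return value
```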

The speed trap

But the number-one mistake Gilbert sees CIOs making is they simply move too fast. “This is why most projects fail,” he says. “There’s such a race for speed.”

Too often, CIOs look at data issues as slowdowns, but those issues are massive risks, he adds. “A lot of people doing AI projects are going to get audited and they’ll have to stop and re-do everything,” he says.

So getting the data right isn’t a slowdown. “When you put the proper infrastructure in place, then you speed through your innovation, you pass audits, and you have compliance,” he says.

Another area that might feel like an unnecessary waste of time is testing. It’s not always a good strategy to move fast, break things, and then fix them later on after deployment.

“What’s the cost of a mistake that moves at the speed of light?” he asks. “I would always go to testing first. It’s amazing how many products we see that are pushed to market without any testing.”

Putting AI to work to fix the data

The lack of quality data might feel like a hopeless problem that’s only going to get worse as AI use cases expand.

In an October AvePoint report based on a survey of 775 global business leaders, 81% of organizations had already delayed deployment of AI assistants due to data management or data security issues, with an average delay of six months.

Meanwhile, not only does the number of AI projects continue to grow, but so does the amount of data. Nearly 52% of respondents also said their companies were managing more than 500 petabytes of data, up from just 41% a year ago.

But Unisys’ Naglapur says it’s going to become easier to get a 360-degree view of a customer, and to clean up and reconcile other data sources, because of AI.

“This is the paradox,” he says. “AI will help with everything. If you think about a digital transformation that would take three years, you can do it now in 12 to 18 months with AI.” The tools are getting closer to reality, and they’ll accelerate the pace of change, he says.


January 14, 2026
