Once the province of the data warehouse team, data management has increasingly become a C-suite priority, with data quality seen as key for both customer experience and business performance. But along with siloed data and compliance concerns, poor data quality is holding back enterprise AI projects. And while most executives say they generally trust their data, they also admit that less than two-thirds of it is usable.
For many organizations, preparing their data for AI is the first time they’ve looked at data in a cross-cutting way that shows the discrepancies between systems, says Eren Yahav, co-founder and CTO of AI coding assistant Tabnine.
Addressing that might mean starting with basic data hygiene like making sure the right fields are in the database to cover the needs of different teams, or pruning the data you use with AI to reflect the outcomes you want. “We’re trying to get the AI to have the same knowledge as the best employee in the business,” he says. “That requires curation and cleaning for hygiene and consistency, and it also requires a feedback loop.”
Organizations using their own codebase to teach AI coding assistants best practices need to remove legacy code with patterns they don’t want repeated, and a large dataset isn’t always better than a small one. “One customer was creating new projects by copying an existing one and modifying it,” Yahav says. “They had a hundred copies of the same thing with minor variations and no way to distinguish if it’s important or not because it’s drowned in the repetition.”
Good data governance has always involved dealing with errors and inconsistencies in datasets, as well as indexing and classifying structured data: removing duplicates, correcting typos, standardizing and validating formats and data types, augmenting incomplete information, and detecting unusual or impossible values. That's still important, but it's not always as relevant to the unstructured and semi-structured data gen AI deals with, which has far more variation. Data quality for AI also needs to cover bias detection, infringement prevention, skew detection in data for model features, and noise detection.
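For structured data, those traditional checks are straightforward to automate. Here's a minimal sketch in Python with pandas; the column names, regex, and age bounds are illustrative assumptions, not a standard:

```python
# Minimal sketch of classic structured-data hygiene checks.
# Column names and thresholds are illustrative.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "email": ["a@example.com", "a@example.com", "not-an-email", None],
    "age": [34, 34, 240, 28],  # 240 is an impossible value
})

# Remove exact duplicates.
df = df.drop_duplicates()

# Validate format: flag suspect values rather than silently fixing them.
df["email_valid"] = df["email"].str.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)

# Detect impossible variations with a range check.
df["age_plausible"] = df["age"].between(0, 120)

print(df)
```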
Common data management practices are too slow, structured, and rigid for AI, where data cleaning needs to be context-specific and tailored to the particular use case. For AI, there's no universal standard for when data is 'clean enough.'
Even for more traditional machine learning (ML), the large-scale data cleaning efforts that pay dividends for business intelligence and finance rarely meet the needs of data science teams, who are probably already doing their own data engineering for AI and creating more silos of ungoverned data in the process, says Kjell Carlsson, head of AI strategy at Domino Data Lab.
Not cleaning your data enough causes obvious problems, but context is key. Google suggested pizza recipes with glue because that's how food photographers make images of melted mozzarella look enticing, and that should probably be sanitized out of a generic LLM. But it's exactly the kind of data you want to include when training an AI to give photography tips. Conversely, some of the other inappropriate advice surfaced in Google searches might have been avoided if the training set had retained the fact that the content came from obviously satirical sites.
“Data quality is extremely important, but it leads to very sequential thinking that can lead you astray,” Carlsson says. “It can end up, at best, wasting a lot of time and effort. At worst, it can go in and remove signal from your data, and actually be at cross purposes with what you need.”
In a relative sense
Different domains and applications require different levels of data cleaning. You can’t treat data cleaning as a one-size-fits-all way to get data that’ll be suitable for every purpose, and the traditional ‘single version of the truth’ that’s been a goal of business intelligence is effectively a biased data set. “There’s no such thing as ‘clean data,’” says Carlsson. “It’s always relative to what it is you’re using it for. What clean looks like is very different across all of these different use cases.”
Take the data quality of employee records you might use for both salary processing and an internal mailing campaign with company news. “Those should be looked at differently, and the quality determined differently for those,” says Kunju Kashalikar, senior director of product management at Pentaho, a wholly owned subsidiary of Hitachi Ltd.
AI needs data cleaning that’s more agile, collaborative, iterative and customized for how data is being used, adds Carlsson. “The great thing is we’re using data in lots of different ways we didn’t before,” he says. “But the challenge is now you need to think about cleanliness in every one of those different ways in which you use the data.” Sometimes that’ll mean doing more work on cleaning, and sometimes it’ll mean doing less.
An organization can undermine itself by trying to get its data ready for AI before starting work on understanding and building out its AI use cases, Carlsson cautions. So, before embarking on major data cleaning for enterprise AI, consider the downsides of making your data too clean.
Diminishing returns
CIOs ask how to get data clean, but they should ask how far to take it, says Mark Molyneux, EMEA CTO at software developer Cohesity. “You could, in theory, be cleaning forever, depending on the size of your data,” he says.
Case in point: Syniti EMEA managing director Chris Gorton spent considerable time early in his career cleaning customer addresses for a vending machine company, only to discover that what the company actually needed was either email addresses to send invoices or the specific locations of the equipment for servicing.
Many organizations are hoarding large datasets that don't have operational usefulness, he cautions, and it's important to establish what value cleaner data will deliver before embarking on large and expensive data cleaning programs. "If you can't describe how the activity or outcome you need from the data links back to some value to the business, then it probably doesn't need to be done," says Gorton.
As so often, the 80/20 rule applies and the marginal gains, especially from cleaning older data, may not be worth the work. That applies whatever you’re using data for. If it costs more to detect and remove incorrect phone numbers in your dataset than it costs to make that number of wasted calls or send that many undeliverable text messages, then there’s no ROI in fixing the numbers in advance.
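As a back-of-the-envelope illustration of that break-even logic (all figures here are hypothetical):

```python
# Break-even check for fixing bad phone numbers; every number is assumed.
records = 100_000
bad_rate = 0.02                  # share of bad numbers
cost_per_fix = 0.50              # detect and correct one number
cost_per_wasted_contact = 0.10   # one undeliverable call or text

bad_records = records * bad_rate
cleaning_cost = bad_records * cost_per_fix            # $1,000
waste_cost = bad_records * cost_per_wasted_contact    # $200

print(f"Cleaning: ${cleaning_cost:,.0f} vs. doing nothing: ${waste_cost:,.0f}")
# Here cleaning costs five times what the bad numbers waste: no ROI.
```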
“A lot of organizations spend a lot of time discarding or improving zip codes, but for most data science, the subsection in the zip code doesn’t matter,” says Kashalikar. “We’re looking at a general geographical area to see what the trend might be. That’s a classic example of too much good is wasted.”
To understand if you're getting value from data cleaning, start by defining success and understanding the point of the model, says Howard Friedman, adjunct professor of health policy and management at Columbia University. Begin with basic data triage and standard quality checks around missing data, range checks, distribution, and correlation. Not all columns are equal, so prioritize cleaning the data features that matter to your model and your business outcomes. Rather than cleaning everything by hand, automate the basics, look for patterns that explain missing data, and consider transforming features, since scaling can compress values or increase variance.
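A triage pass along those lines might look like the following in Python with pandas; the synthetic data and feature names are placeholders for your own:

```python
# Basic data triage: missingness, range checks, distributions, correlation.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(50, 10, 1000),
    "feature_b": rng.normal(0, 1, 1000),
    "target": rng.normal(100, 20, 1000),
})
df.loc[::50, "feature_a"] = np.nan  # simulate missing data

# 1. Missingness per column.
print(df.isna().mean())

# 2. Range check against assumed domain limits.
print(((df["feature_a"] < 0) | (df["feature_a"] > 100)).sum(), "out-of-range rows")

# 3. Distribution summary.
print(df.describe())

# 4. Correlation with the outcome, to prioritize which columns to clean.
print(df.corr(numeric_only=True)["target"].sort_values())
```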
But before you pursue more advanced methods of data quality improvement, assess what the incremental model improvement will be. "What if I could get 90% of my model value with data that's had only a few hours of effort and a few thousand dollars of investment, versus if I have to spend a quarter of a million dollars to get the data perfect?" asks Friedman. Getting the extra 10% of data quality may not be worth it for a small improvement in the model.
“Think about it as a business problem of where I put my investments of time and money, and what do I expect to see in returns,” he says.
Investigate existing projects to see what impact data quality issues actually have. There may be other sources you can use rather than investing in cleaning a low-quality dataset. That might be data you buy or a golden dataset you build. “If you have limited budget for data cleaning, it’s worth spending that to create a high-quality data set of inputs and gold standard outputs curated by humans,” says Akshay Swaminathan, Knight-Hennessy scholar in biomedical data at Stanford University School of Medicine. “In the generative AI world, the notion of accuracy is much more nebulous.” A golden dataset of questions paired with a gold standard response can help you quickly benchmark new models as the technology improves.
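What benchmarking against a golden dataset can look like, sketched with a placeholder call_model function and deliberately naive exact-match scoring (in practice you'd use a task-appropriate metric or human review):

```python
# Benchmark a model against a human-curated golden dataset.
golden_set = [
    {"question": "What is our refund window?", "gold": "30 days"},
    {"question": "Do we ship internationally?", "gold": "Yes, to 40 countries"},
]

def call_model(question: str) -> str:
    """Stand-in for whatever model or API you're evaluating."""
    return "30 days" if "refund" in question else "Unknown"

correct = sum(call_model(ex["question"]) == ex["gold"] for ex in golden_set)
print(f"Exact-match accuracy: {correct}/{len(golden_set)}")
```

As models improve, you rerun the same golden set and compare scores rather than re-curating from scratch.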
Opportunity cost
Not only can too much data cleaning waste time and money, it can also remove data that's useful, even when that data appears incomplete.
“If you had a million records originally available and you got 500,000 records supplied with the best quality, what you really want to know is of the missing 500,000, how many were of sufficient quality that you didn’t get,” says Kashalikar. “If you had 250,000 that had sufficient but not pristine quality, then either you lost a quarter of your potential data or you spent time cleaning a quarter of the records when you didn’t need to.”
It’s also important not to clean data so much that it loses its distinctiveness, also known as over-normalizing. Excessive standardization or homogenization of the dataset can remove valuable variations and nuances that are important features for an AI model, degrading its ability to generalize. For example, normalizing address spellings without considering regional variations could erase important demographic insights.
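One hedge against over-normalizing is to keep the raw values alongside the standardized ones, so the variation is still available when it turns out to matter. A small illustration, with an assumed canonical mapping:

```python
# Normalize spellings without destroying the original signal.
import pandas as pd

df = pd.DataFrame({"city": ["Bengaluru", "Bangalore", "Mumbai", "Bombay"]})

canonical = {"bangalore": "Bengaluru", "bombay": "Mumbai"}  # assumed mapping
df["city_normalized"] = df["city"].str.lower().map(canonical).fillna(df["city"])

# The raw spelling may itself carry demographic or temporal signal,
# so keep the original column rather than overwriting it in place.
print(df)
```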
Losing outliers is a similar problem to over-normalizing, but for individual data points rather than the entire dataset. Aggressively removing outliers and extreme cases takes out important edge cases. “One person’s trash is another person’s treasure,” as Swaminathan puts it.
Some impossible values in a dataset are easy and safe to fix, like negative prices or human ages over 200, but other errors come from manual data collection or badly designed databases. "Maybe the data was entered during an emergency in a hospital and the person switched the height and weight," says Tabnine's Yahav. One product database he dealt with, for instance, didn't have a field for product serial numbers, so staff put them in the weight field. "Suddenly you have products weighing five tons in a toy store," he adds.
But some outliers or seemingly “dirty” data points will be genuine signals rather than errors, and may indicate interesting areas to explore. “Somebody spent five hours in traffic because it was raining? That’s an interesting outlier for traffic information,” says Yahav.
If you're training a model to de-identify medical data, it needs to be robust to outliers like unique names, variant address formats, and identification numbers so they're detected correctly, which means you need those in the training set. Especially when dealing with legacy systems where code isn't likely to get updated, your data pipeline needs to validate and clean known issues. But Yahav suggests some of this requires human judgment to differentiate genuine errors from meaningful signals that help a model generalize.
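In practice, that often means pipeline checks that flag implausible records for review instead of silently dropping or 'fixing' them. A sketch, with field names and thresholds invented to echo the toy-store example:

```python
# Flag suspicious records for human review; don't auto-delete them.
def review_flags(record: dict) -> list[str]:
    flags = []
    if record.get("weight_kg", 0) > 1000:
        flags.append("implausible weight: possible serial number in weight field")
    if record.get("height_cm", 0) > 250:
        flags.append("implausible height: possibly swapped with another field")
    return flags

record = {"sku": "TOY-123", "weight_kg": 5000, "height_cm": 30}
for flag in review_flags(record):
    print(f"{record['sku']}: {flag}")  # route to a person, not the bit bucket
```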
Adding bias
Overly aggressive cleaning that removes records that fail validation can introduce bias to your dataset because you’re losing records with specific characteristics. Removing records that don’t have middle initials will remove people from certain areas of the Indian subcontinent, warns Kashalikar. Similarly, removing unusual names or insisting that all names are longer than two letters could lead to biased models that perform poorly on diverse populations.
“The data scientist creating a model may not understand the business implications of what it means to not have data,” he points out. It’s important to have someone who understands the context of the problem you’re trying to solve involved in decisions about data cleaning.
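A simple audit is to compare the distribution of a sensitive attribute before and after a cleaning rule runs. A sketch with synthetic records mirroring the middle-initial example:

```python
# Compare a sensitive attribute's distribution before and after filtering.
import pandas as pd

df = pd.DataFrame({
    "name": ["A. B. Rao", "Priya", "J. K. Smith", "Arun"],
    "region": ["IN", "IN", "US", "IN"],
    "middle_initial": ["B", None, "K", None],
})

filtered = df.dropna(subset=["middle_initial"])  # the "validation" rule

print("Before:", df["region"].value_counts(normalize=True).to_dict())
print("After: ", filtered["region"].value_counts(normalize=True).to_dict())
# A large shift between the two is a red flag that the rule is
# disproportionately removing records with specific characteristics.
```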
Removing context
Clean a dataset too thoroughly and you can strip out contextual information that’s crucial to the full picture. Some phishing messages deliberately include poor spelling and grammar to select for less cautious and less informed victims, and fake links will include URLs that are close to real domain names. Cleaning that data up — or cleaning up the language in messages from frustrated customers — can remove valuable clues about how to react. And LLMs use data in a different way from more traditional ML; the semantics of the data can be critically important.
The clean dataset for a medical transcription model clearly shouldn't include phrases common in YouTube videos, like asking users to 'like and subscribe': a general-purpose model such as OpenAI's Whisper frequently hallucinates those phrases when dealing with garbled audio, making it unsuitable for medical transcription. But that same data would be critical for a model built to transcribe videos.
Standard data cleaning would also remove pauses, sighs, hesitations, and words that speakers don’t bother finishing, but those cues would be useful in trying to predict willingness or intent to buy, Carlsson points out. “It would be useful to have a model that detected customer interest and told the customer representative you should probably stop trying to do the hard sell because this person is clearly not interested,” he says. This is why it’s so important to know what you’re going to use data for before you clean it.
Missing real-world mess
Traditional ML is fragile with messy data, so it’s tempting to take it out. But making data too uniform can lead to models that perform well on clean, structured data like their training set, but struggle with real-world messy data, giving you poor performance in production environments.
LLMs can pass the bar exam or the medical boards because those tests are too clean to be useful benchmarks, explains Swaminathan. "It's giving you a patient vignette with all the pertinent information already there for you," he says. "It tells you the patient's vital signs, and imaging and lab results. In the real world, it's up to the doctor to elicit all of those pieces of information separately." Similarly, if you're creating a golden dataset for customer support, avoid the temptation to make the customer requests too clean and informative.
There’s an obvious tension here, admits Friedman. “The dirtier the data set you’re training on, the tougher it is for that model to learn and achieve success,” he says. “Yet, at the same time, for it to be fully functional in the real world, it’s going to need to be able to operate in those dirtier environments.”
LLMs in particular need to be able to respond to incorrect inputs. Removing colloquialisms, misspellings, or regional language differences can hinder a model’s ability to handle real-world language use. “Understanding how to respond to dirty data as well as ideally clean data — it’s nice to start with the clean data, but eventually it has to be robust,” adds Friedman.
Missing trends
Cleaning old and new data in the same way can lead to other problems. New sensors are likely to be more precise and accurate, customer support requests will be about newer versions of your products, and you'll get more metadata about new prospects from their online footprint. Whatever the data source, there may be new information to capture, or the features in the data may change over time. In India, for example, divorce has only recently been officially acknowledged. You can't add that to old records, but you shouldn't scrub it out of new ones for consistency's sake. So take care that data cleaning doesn't disguise the difference between old and new data, leading to models that don't account for evolving trends.
“Even for the same use case, the underlying data can shift over time,” warns Swaminathan. “A golden benchmark that we make in October of 2024 for answering client questions, for example, might become outdated in three months when a natural disaster hits, and all of a sudden there’s a shortage of toilet paper. Even on the same task at the same company for the same clients, the benchmark can become outdated with time.”
You might lose signals in data as trends change, too. When contact numbers for customers shifted from landline to mobile phones, organizations lost the ability to extract the customer location from the number. “If you were using area codes to validate locality, you lost a lot of records,” Kashalikar adds. Two companies you work with might also merge, so deciding whether to treat them as the same entity or keep them separate in your golden master record of companies depends on the use case.
Even without major changes, the underlying data itself might have drifted. “The relationships between the outcome variables of interest and your features may have changed,” Friedman says. “You can’t simply lock in and say, ‘This dataset is absolutely perfect’ and lift it off the shelf to use for a problem a year from now.”
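One way to catch that kind of drift is to periodically compare feature distributions in fresh data against the original training window, for instance with a two-sample Kolmogorov-Smirnov test. A minimal sketch with synthetic data and an arbitrary significance threshold:

```python
# Detect distribution shift between training-era and fresh data.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
train_feature = rng.normal(0, 1, 5000)      # last year's data
fresh_feature = rng.normal(0.4, 1.2, 5000)  # this quarter's data

stat, p_value = ks_2samp(train_feature, fresh_feature)
if p_value < 0.01:
    print(f"Shift detected (KS={stat:.3f}): re-examine the 'perfect' dataset")
```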
To avoid all these problems, you need to involve people with the expertise to differentiate between genuine errors and meaningful signals, document the decisions you make about data cleaning and the reasons for them, and regularly review the impact of data cleaning on both model performance and business outcomes.
Rather than doing masses of data cleaning up front and only then starting development, take an iterative approach with incremental data cleaning and quick experiments.
“What we’ve seen to be successful is onboard data incrementally,” says Yahav. “There’s a huge temptation to say let’s connect everything and trust that it works. But then when it hits you, you don’t know what’s broken, and then you have to start disconnecting things.”
So start with small amounts of recent data, or data you trust, see how that works, and build more sources or volume of data from there and see where it breaks. “It’s going to eventually break because something you forgot is going to reach the main pipeline, and something’s going to surprise you,” he says. “You want this process to be gradual enough for you to understand what caused that.”
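That gradual process is easy to encode as a gate in the pipeline itself. A sketch, with synthetic sources and a single assumed validation rule standing in for your real checks:

```python
# Onboard sources one at a time, in order of trust, and stop where it breaks.
import pandas as pd

sources = {  # synthetic stand-ins for real data sources
    "recent_orders": pd.DataFrame({"amount": [10.0, 20.0, 15.0]}),
    "crm_export": pd.DataFrame({"amount": [12.0, None, 18.0]}),
    "legacy_erp": pd.DataFrame({"amount": [None, None, 5.0]}),
}

def validate(df: pd.DataFrame) -> list[str]:
    issues = []
    if df.isna().mean().max() > 0.2:  # assumed tolerance
        issues.append("more than 20% missing in some column")
    return issues

onboarded = []
for name, df in sources.items():
    problems = validate(df)
    if problems:
        print(f"Stopping at {name}: {problems}")  # you know exactly what broke
        break
    onboarded.append(df)
```

The details will differ, but the principle holds: add sources one at a time, validate at each step, and keep the blast radius of any surprise small.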