In the race to build the smartest LLM, the rallying cry has been “more data!” That same mantra has made its way to company boardrooms, too. As businesses hurry to harness AI to gain a competitive edge, finding and using as much company data as possible may feel like the most reasonable approach.
After all, if more data leads to better LLMs, shouldn’t the same be true for AI business solutions?
The short answer is no. A mad rush to throw data at AI is shortsighted. Instead, your business needs to understand the challenges of existing data and the steps needed to ensure you have and use good data to power your AI solutions. The data reckoning has arrived, and you must reckon not only with how much data you use, but also with the quality of that data.
The urgency of now
The rise of artificial intelligence has forced businesses to think much more about how they store, maintain, and use large quantities of data. One of the realities businesses quickly face when implementing AI solutions is that once data has been used to train an LLM or SLM (small language model), there is no going back.
Traditionally, companies struggling with large amounts of data used data lakes to store and process it. While the data was stored, there was often no significant management of sources, recent updates, and other key governance measures to ensure data integrity.
That approach to data storage is a problem for enterprises today because if they use outdated or inaccurate data to train an LLM, those errors get baked into the model. The consequence is not hallucination; the model is working exactly as designed. The data that trained it is simply wrong.
Equally concerning, since the data is within the LLM’s black box, will anyone even know that the answer is wrong? If users have nothing else to compare the answer to, they often just accept it at face value. This drives home the point: we may need more data to power AI, but not if the data is wrong.
Today’s challenges
There are several major challenges with business data today:
1. Provenance
Housing mass amounts of data in data lakes has caused much uncertainty about enterprise data. Who created this data? Where did it come from? When was it last updated? Is it a trusted source? Knowing the lineage of a dataset is a crucial first step in trusting and using the data with confidence.
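In practice, these lineage questions can be tracked as a lightweight metadata record attached to each dataset, so unanswered questions surface before the data is used. This is only a sketch; the field names (`created_by`, `trusted_source`, and so on) are illustrative assumptions, not any standard schema:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class DatasetLineage:
    """Minimal provenance record for one dataset (illustrative fields)."""
    name: str
    created_by: str       # who created this data?
    source: str           # where did it come from?
    last_updated: date    # when was it last updated?
    trusted_source: bool  # has the source been vetted?

def provenance_gaps(record: DatasetLineage, max_age_days: int = 365) -> list[str]:
    """Return the lineage questions this dataset leaves unanswered."""
    gaps = []
    if not record.created_by:
        gaps.append("unknown creator")
    if not record.source:
        gaps.append("unknown source")
    if date.today() - record.last_updated > timedelta(days=max_age_days):
        gaps.append("stale: not updated within the freshness window")
    if not record.trusted_source:
        gaps.append("unvetted source")
    return gaps
```

A dataset that returns an empty list has answered the basic lineage questions; anything else should be resolved before the data feeds a model.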
2. Data classification
As data gets housed in data lakes and other increasingly connected ways, another challenge is classification. Who is allowed to look at particular data? From government security classifications to confidential HR information, data shouldn’t be accessible to everyone. Data must be properly classified, and those categories and the limits they entail must be maintained and live on as companies integrate and harness data in new ways.
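One way to make classification enforceable rather than aspirational is to encode sensitivity levels explicitly and check them at every access point, so the categories travel with the data wherever it goes. A minimal sketch; the level names are made up for illustration:

```python
from enum import IntEnum

class Classification(IntEnum):
    """Ordered sensitivity levels (illustrative, not an official scheme)."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2   # e.g., HR records
    RESTRICTED = 3     # e.g., government-classified material

def can_access(user_clearance: Classification, data_level: Classification) -> bool:
    """A user may read data at or below their clearance level."""
    return user_clearance >= data_level
```

The point is less the three-line check than where it runs: every pipeline that copies data into a lake or a training set should carry the label along and apply the same rule.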
3. Stability
A lot of data is transient. If you’re taking data from sensors, for example, you need to understand how often you’ll refresh the data based on sensor readings. This is an issue of data stability, as constantly changing data may lead to different results.
Data also ages. For example, imagine your company followed a specific process for raising a job requisition for nine years, then revised the process last year. If you use all 10 years’ worth of data to train a model and then ask how to open a job requisition, you will usually get a wrong answer, because most of the data describes the outdated process.
This is a clear example of how more data is not always better. Ten years’ worth of data spanning major process changes is less valuable than a smaller chunk of data that accurately captures existing processes.
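The requisition example above boils down to a simple curation step: exclude training records that predate the last process change. A sketch with hypothetical records and dates:

```python
from datetime import date

def filter_current_process(records, process_changed_on: date):
    """Keep only records created on or after the last process revision,
    so the model learns the current workflow rather than the old one."""
    return [r for r in records if r["created"] >= process_changed_on]

# Ten years of requisition records, but the process changed last year:
records = (
    [{"created": date(2015 + i, 1, 1), "process": "old"} for i in range(9)]
    + [{"created": date(2024, 1, 1), "process": "new"}]
)
current = filter_current_process(records, process_changed_on=date(2024, 1, 1))
# 9 of the 10 records describe the obsolete process; only 1 survives the filter.
```

The smaller filtered set is worth more for training than the full history, because every record it contains reflects how the process actually works today.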
4. Replicating bias
As you start using data to train AI, you run the risk of training your models on how things are now rather than the desired outcome. For example, imagine your HR department is using AI to screen job applicants. If you use your company’s existing data to train the model on what an ideal candidate would look like, your model may end up replicating existing biases in your workforce related to age or gender, for example.
You want to train the model not based on the reality in the dataset, but on the outcome you want to achieve, which starts with a clear understanding of the data and its limitations.
Dangers of problematic data
Using problematic data to train your LLMs carries serious risks. At a basic level, it can increase hallucinations and undermine your confidence in the results. You may get inaccuracies, or systems that don’t function the way you want them to. When that happens, employees’ trust in those systems, and their willingness to use them, may decline.
Using bad data could even cause reputational damage. If you use data to train a customer-facing tool that performs poorly, you may hurt customer confidence in your company’s capabilities.
Using compromised data to produce company reports or other public information may even become a regulatory and compliance issue. And if data gets misclassified, you risk exposing personal information. All these scenarios can be costly, both financially and reputationally.
Act today
Your business can take the following data management steps today to capitalize on the AI revolution:
1. Strengthen your data governance process
Every enterprise needs a robust data governance process. You must define the rules around handling, storing, and updating your data by answering questions such as:
- Who is responsible for the classification of data?
- Who is responsible for looking at the access rights of your data?
- Who is going to control the stewardship of that data?
- Will you appoint a chief data officer, an analytics team, or someone else?
- How long will you keep data, and who makes those decisions?
Your business will benefit by answering these questions before you start using company data for AI solutions.
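One way to make the checklist above actionable is to treat each question as a required field in a policy record, so any unanswered question is flagged before company data reaches an AI pipeline. A sketch with placeholder role names, not prescriptions:

```python
from dataclasses import dataclass, fields

@dataclass
class GovernancePolicy:
    """One answer slot per governance question (placeholder roles)."""
    classification_owner: str   # who classifies data?
    access_rights_owner: str    # who reviews access rights?
    data_steward: str           # who controls stewardship?
    accountable_role: str       # CDO, analytics team, or someone else?
    retention_period_days: int  # how long is data kept?
    retention_decider: str      # who decides retention?

def unanswered(policy: GovernancePolicy) -> list[str]:
    """List the governance questions still lacking an owner or decision."""
    return [f.name for f in fields(policy) if not getattr(policy, f.name)]
```

Running such a check as a gate on every new data source makes “answer these questions first” an enforced step rather than a recommendation.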
2. Shore up your compliance processes
Your enterprise should pair its robust governance processes with equally strong compliance processes. When data is targeted for consumption, do you have a compliance process to confirm that the person submitting it has gone through the appropriate governance checks?
As you start adopting AI tools, properly storing data isn’t enough. You must ensure your policies and procedures around data integrity extend to everywhere data is being accessed and used.
Taken together, governance and compliance processes are central to maintaining data integrity, and they will only grow in importance given the staggering amounts of data companies are amassing.
For example, as Brian Eastwood notes: “[t]he average hospital produces roughly 50 petabytes of data every year. That’s more than twice the amount of data housed in the Library of Congress, and it amounts to 137 terabytes per day.” When data is critical to your company, especially when it is also growing rapidly, you need clear planning and role responsibilities to protect, manage, and harness it.
3. Know your data
The question of how much data to use shouldn’t be based on how much data you have, but instead on understanding your data and your goals. In the early days of AI, the conventional wisdom was that more data meant a better LLM. Then there was a trend toward small language models that were highly tuned using more accurate data. Deciding which approach to take will depend on the situation at hand. But you cannot make an informed decision if you don’t first have a strong understanding of your data and its limitations.
Agentic AI’s data reckoning
The next great frontier is how to use data with agentic AI. Will it be more effective to have individual AI agents built on LLMs, or one master agent coordinating multiple AI agents, each with its own SLM?
It’s exciting to think about the possibilities that agentic AI will deliver for businesses. Regardless of which approach wins out, agentic AI will rest on the back of strong data governance and compliance processes. Strong data integrity will enable AI to truly deliver.
In the rush to train AI models, we cannot just yell, “More data!” Instead, let’s demand quality data, knowing that setting high standards now will deliver optimized results in the future.