Remember the phrase “big data”? It was a mainstay of tech articles, talk shows and webinars for at least a decade before AI took over and supplanted it entirely in the minds of tech enthusiasts.
But that doesn’t change the fact that AI models rely on large amounts of data. The patterns and interdependencies that machine learning (ML) algorithms identify in that data, and then apply, form the basis of every use case. Depending on the stage of development of the AI model, the data used falls into one of three categories: training data, validation data and test data.
From personal experience, I’d say that training data is the most crucial, in both quantity and quality. An AI model is only as good as the data it’s trained on. Without a large volume of relevant and accurate training data, the model will either not learn what it’s supposed to, or it will learn the wrong things.
Conversely, the higher the volume and diversity of your data and the more reliable your data sources, the better and more accurately your AI model functions. Whether you’re developing large language models (LLMs), computer vision systems or specialized industry applications, the breadth and depth of training data directly impact a model’s capabilities, reliability, performance and consistency.
Do we have enough data?
According to recent analysis published by MIT, AI models’ data requirements may be outpacing the supply of suitable, usable data available today. The median training dataset contained about 3,300 datapoints in 2020. This figure grew dramatically to over 750,000 datapoints in just the three years that followed. Even though the total data generated is expected to hit 180 ZB by the end of this year, it might not be enough to feed the AI monster.
This may well slow down our ability to train LLMs and other large models. Further, these models might lack accuracy and scope due to insufficient breadth and depth in the data, stalling innovation in sectors that depend heavily on AI adoption.
While synthetic data solutions do have some appeal, relying on them too heavily can lead to model collapse, the gradual degradation that sets in when models are trained largely on the output of other models.
Where do we find more data?
At present, AI companies source their data in multiple ways.
One is internal data: We frequently help our clients use their own data to train small AI models to enable a greater range of functions within their platform, especially those related to marketing and customer service.
Social networks are good examples of this—they use their own algorithms to push more content that amplifies the user’s echo chamber. Ecommerce retailers such as Amazon do something similar with their product recommendation algorithm. Netflix and Spotify use your watch/listen history to keep you glued. The list goes on.
Another source is companies that collect or harness data via their platforms. These could be specialized data vendors such as Datos (acquired by Semrush in 2023) that gather clickstream data, package it and sell it to ad or analytics companies such as ours. Once you have access to a steady stream of clickstream data, you can build customized AI-based predictive analytics models, as the sketch that follows illustrates.
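As a minimal sketch of what such a model could look like, the snippet below trains a simple conversion predictor on a hypothetical clickstream export; the file name, column names and target label are illustrative placeholders, not any vendor’s actual schema.

```python
# Minimal sketch: predicting conversion from a hypothetical clickstream export.
# The file name and columns (pages_viewed, session_seconds, referrer, converted)
# are placeholders, not a real vendor schema.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

clicks = pd.read_csv("clickstream_export.csv")
X = pd.get_dummies(
    clicks[["pages_viewed", "session_seconds", "referrer"]],
    columns=["referrer"],            # one-hot encode the traffic source
)
y = clicks["converted"]              # 1 if the session ended in a purchase

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Holdout AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```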
Then, there are platforms that spawn tons of user-generated data — such as Reddit, which struck a licensing deal with Google that allowed forum comments to be fed into Google’s AI models.
Finally, you have open datasets provided by government organizations, academic institutions, and market research companies that share or sell their data to any interested entity.
However, all these sources are still limited in size and scope. The best (albeit somewhat controversial) source is public web data: a vast, diverse and constantly updating repository of human knowledge and interaction. Tellingly, the biggest players in the business, OpenAI and Google, crawl and index publicly available content from websites, forums, social media and other online sources, then use it to train their LLMs and other AI models.
Public web data in turn comes in two flavors. One is a public repository such as Common Crawl, a free, open repository of historical and current web crawl data available to pretty much anyone on the internet.
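For a sense of what working with it looks like, here is a minimal sketch that queries Common Crawl’s public CDX index for captures of one domain. The crawl ID is just an example (current IDs are listed at index.commoncrawl.org), and the printed fields follow the index’s JSON output.

```python
# Minimal sketch: query the Common Crawl CDX index for captures of one domain.
# The crawl ID is an example; pick a current one from https://index.commoncrawl.org/
import json
import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-33-index"
resp = requests.get(INDEX, params={"url": "example.com/*", "output": "json"}, timeout=30)
resp.raise_for_status()
records = [json.loads(line) for line in resp.text.splitlines() if line.strip()]

for rec in records[:5]:
    # Each record points to a WARC file plus a byte offset and length, so the page
    # itself can be retrieved with a ranged GET from https://data.commoncrawl.org/<filename>
    print(rec["status"], rec["url"], rec["filename"], rec["offset"], rec["length"])
```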
The other is a web data collection service, which can pull timely data straight from live online sources, in a variety of ways and for a variety of purposes.
For instance, Bright Data allows developers to use residential proxy IPs and on-demand APIs at scale to extract data from publicly available, ethically sourced web pages in real time. Then there’s Apify, which has a marketplace of scrapers and AI agents and also lets you build your own custom scrapers with serverless tools called “actors.” Another web data extractor of note is Zyte, which has legal compliance built in and a pay-as-you-go pricing model to keep data costs down.
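The vendors’ products and APIs differ, but the core pattern they wrap is broadly the same: route HTTP requests through a proxy pool, identify yourself honestly and parse the returned HTML. Below is a minimal, vendor-neutral sketch of that pattern; the proxy endpoint, credentials and user agent string are placeholders rather than any provider’s actual interface.

```python
# Generic pattern behind most web data collection services: fetch a public page
# through a proxy and parse the HTML. Proxy URL, credentials and user agent are
# placeholders, not any specific vendor's API.
import requests
from bs4 import BeautifulSoup

PROXY = "http://USERNAME:PASSWORD@proxy.example.com:8000"   # placeholder endpoint
proxies = {"http": PROXY, "https": PROXY}

resp = requests.get(
    "https://example.com/public-page",
    proxies=proxies,
    headers={"User-Agent": "research-bot/0.1 (+https://example.com/contact)"},
    timeout=30,
)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
print(soup.title.get_text(strip=True) if soup.title else "no title found")
```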
The advantages of using public web data
The biggest plus? Cost efficiency.
The cost implications of data sourcing cannot be overstated. Public web data, when properly collected and filtered, eliminates the need for many expensive proprietary datasets.
The recent disruption caused by DeepSeek demonstrated the remarkable gains that intelligent use of web data can bring: high-quality, timely data trumps complex algorithms and brute computing power when it comes to training AI models.
There are more positives to using public web data beyond cost and complexity:
- Diverse (but relevant) web data results in faster convergence during training, which in turn reduces computational requirements.
- More edge cases and unusual scenarios are covered without proportional cost increases.
- The dynamic nature of web data supports continuous model refinement without repeated data-acquisition investments.
Let’s look at a few industry-specific applications.
Financial services
Financial institutions developing AI solutions face unique challenges requiring specialized data. Public web sources provide access to:
- Real-time market commentary and analysis
- Regulatory updates and compliance documentation
- Consumer sentiment on financial products and services
- Economic indicators and forecasts
- Corporate disclosures and earnings reports
Integrating these diverse data sources can enhance predictive analytics capabilities while significantly reducing the need for expensive proprietary financial data services.
Ad tech
For ad tech companies, understanding consumer behavior and preferences is paramount. Web-sourced training data offers:
- Consumer reviews and product discussions
- Social media engagement patterns
- Content consumption trends
- Cultural references and evolving language usage
- Visual design preferences and engagement metrics
By leveraging public web data, ad tech AI models can develop a nuanced understanding of audience segments at a fraction of the cost of traditional market research.
Travel and hospitality
AI applications in travel benefit from the rich multimedia content available across the web:
- Destination imagery and descriptions
- Reviews with personalized recommendations and plans
- Seasonal trends and preference patterns
- Real-time info on hotel availability, flight prices and weather
- Cultural contexts and local information
- Transportation logistics and optimization data
Travel companies implementing models trained on diverse web sources report enhanced personalization capabilities and more contextually aware customer service automation.
Challenges in using public web data
Sourcing data from the public web mandates careful consideration of legal, ethical, and technical factors. You must ensure compliance with terms of service, copyright laws and data privacy regulations such as GDPR and CCPA. Google and OpenAI have both been sued by data owners, media houses and content creators for using their copyrighted material to train their AI models, and failing to notify or compensate them.
China proactively revised its official AI policy in 2023 to unify data standards and expedite data sharing across industries. Less than a year later, the outcome was visible in the form of DeepSeek.
However, many social platforms and media properties are actively campaigning to block bots from collecting information from their pages for use in AI models, fearing their content will be used without compensation or attribution. At the same time, Big Tech wants free and full access to public data while blocking new entrants and startups from doing the same. Google and OpenAI have openly called for weakening US copyright rules, purportedly to “support AI innovation.”
On the other hand, Bright Data has led the charge in accessing public data from major media and social platforms while stopping Big Tech from obstructing that access. A U.S. federal court dismissed Meta’s claim against the company for accessing public data from Facebook and Instagram.
“Public information should remain public,” said Or Lenchner, CEO of Bright Data and a thought leader in the realm of ethical data sourcing. “It falls to us to uphold the highest ethical standards and compliance measures, ensuring all practices that lead to the collection of public data are transparent and beneficial. We will continue to raise the bar as we develop new technologies to make accessing data feasible for the world.”
Beyond intellectual property and copyright concerns, using public web data to train AI models involves the following common stumbling blocks:
- The quality of web content varies dramatically, requiring sophisticated filtering mechanisms.
- Web data reflects societal biases that must be identified and addressed during training.
- It’s hard to separate factual, technically correct content from speculation, noise, misinformation, disinformation and plain inaccuracy.
- Information becomes outdated at varying rates across domains.
- Comprehensive models require balanced representation across languages and cultures.
Industry leaders addressing these challenges have found that the investment in robust data processing pipelines pays dividends in model performance and reliability.
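As a minimal sketch of the first stage of such a pipeline, the snippet below applies a few heuristic quality filters and exact deduplication to web-sourced documents; the thresholds and boilerplate patterns are illustrative, not a production recipe.

```python
# Minimal sketch of a first-pass cleaning step for web-sourced text: heuristic
# quality filters plus exact deduplication. Thresholds are illustrative only.
import hashlib
import re

def passes_quality_filters(text: str) -> bool:
    if len(text.split()) < 50:                 # too short to be useful training text
        return False
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    if alpha_ratio < 0.6:                      # mostly markup debris or symbols
        return False
    if re.search(r"lorem ipsum|click here to subscribe", text, re.I):
        return False                           # common boilerplate tell-tales
    return True

def clean_corpus(docs: list[str]) -> list[str]:
    seen, kept = set(), []
    for doc in docs:
        # Hash a whitespace-normalized, lowercased copy for exact-duplicate detection
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen and passes_quality_filters(doc):
            seen.add(key)
            kept.append(doc)
    return kept
```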
Best practices when using public web data to train AI models
The public web remains a rich and ever-expanding resource for AI training data, and its strategic use is not just a cost-cutting measure; it’s a potential competitive advantage. Here are a few pointers for how AI-driven organizations can acquire and use public data ethically and legally:
- Respect copyright, terms of service, and data privacy regulations when collecting and using web content. Use web scraping and API integrations responsibly, adhering to IETF and W3C guidelines (see the robots.txt sketch after this list).
- Maintain detailed and transparent records of data sources, collection methods, and preprocessing techniques. Ensure adherence to data protection laws by avoiding personal data collection without consent.
- Develop specialized preprocessing for industry-specific terminology and contexts to enhance relevance. Build domain-specific web crawlers for particular industries and applications.
- Apply rigorous data validation, deduplication, and cleansing processes to remove inaccuracies and inconsistencies.
- Implement robust mechanisms to identify high-quality content and filter out low-value or misleading information. Use collaborative filtering systems that leverage user feedback to continuously improve data quality.
- Regularly audit and refine datasets to minimize biases that may distort AI model predictions. Source data from varied channels and in varied formats to ensure comprehensive coverage, improve diversity and reduce bias.
- Create rigorous testing protocols to verify the impact of web-sourced data on model performance. Establish pipelines for ongoing data collection to capture evolving language, trends, and information.
- Try hybrid approaches combining web data with proprietary datasets for competitive advantage.
- Implement federated learning systems that learn from distributed web data without centralized collection.
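On the first pointer above, a minimal compliance sketch (using Python’s standard library and a placeholder user agent) is simply to consult a site’s robots.txt before fetching anything from it:

```python
# Minimal sketch: check robots.txt before fetching a URL. The user agent string
# is a placeholder; use one that identifies your organization and a contact point.
from urllib import robotparser
from urllib.parse import urlparse

def allowed_to_fetch(url: str, user_agent: str = "research-bot") -> bool:
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()                                   # download and parse the site's rules
    return rp.can_fetch(user_agent, url)

print(allowed_to_fetch("https://example.com/some-public-page"))
```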
Organizations that master these evolving approaches will gain significant advantages in both model performance and development economics.
For LLM and AI developers, the message is clear: strategic sourcing and utilization of public web data represent one of the most powerful levers for enhancing model capabilities while controlling development costs. Models like DeepSeek have demonstrated that with a smart approach to web data, it is possible to develop sophisticated AI solutions that deliver exceptional performance without exponential increases in expense.
As the competitive landscape intensifies, only the AI models built on training data of the greatest quantity and quality, and the organizations behind them, will remain relevant.