Tiatra, LLC
Synthetic data’s fine line between reward and disaster

Up to 20% of the data used for training AI is already synthetic (that is, generated rather than obtained by observing the real world), with LLMs using millions of synthesized samples. Gartner projects that share could reach 80% by 2028, and that by 2030 synthetic data will be used for more business decision making than real data. Technically, though, any output you get from an LLM is synthetic data.

AI training is where synthetic data shines, says Gartner principal researcher Vibha Chitkara. “It effectively addresses many inherent challenges associated with real-world data, such as bias, incompleteness, noise, historical limitations, and privacy and regulatory concerns, including personally identifiable information,” she says.

Generating large volumes of training data on demand is appealing compared to slow, expensive gathering of real-world data, which can be fraught with privacy concerns, or just not available. Synthetic data ought to help preserve privacy, speed up development, and be more cost effective for long-tail scenarios enterprises couldn’t otherwise tackle, she adds. It can even be used for controlled experimentation, assuming you can make it accurate enough.

Purpose-built data is ideal for scenario planning and running intelligent simulations, and synthetic data detailed enough to cover entire scenarios could predict the future behavior of assets, processes, and customers, which would be invaluable for business planning. This kind of advanced use requires simulation engines, though, and the simulation equivalent of digital twins is still in development outside some early adoption areas.

Materials science, pharmaceutical research, oil and gas, and manufacturing are obvious markets, but interest is growing in supply chain and insurance industries. Sufficiently accessible and accurate tools could deliver operational improvements and revenue, as well as optimized costs and reduced risks in many areas of business decision making.

Also, marketing and product design teams could create simulated customers based on purchase data and existing customer surveys, and then interview them for feedback on new products and campaigns. One global supply chain company is experimenting with simulating disruptions like natural disasters, pandemics, and geopolitical shifts to improve resilience. That’s a multi-stage process of building simulation engines that generate datasets of the impact these scenarios will have on supply and delivery routes, and then training AI models to analyze those scenarios and suggest how to harden supply chains.

More immediate uses for synthetic data may be more prosaic. Indeed, organizations are probably already using it in limited ways outside AI. Web and application developers rely on synthetic monitoring, which simulates user interactions at scale to measure performance and availability across different scenarios, locations, and devices instead of waiting for real users to hit problem areas, and to test new apps and features before launch.

Accurate amplification

Created properly, synthetic data mimics the statistical properties and patterns of real-world data without containing actual records from the original dataset, says Jarrod Vawdrey, field chief data scientist at Domino Data Lab. And David Cox, VP of AI models at IBM Research, suggests viewing it as amplifying rather than creating data. “Real data can be extremely expensive to produce, but if you have a little bit of it, you can multiply it,” he says. “In some cases, you can make synthetic data that’s much higher quality than the original. The real data is a sample. It doesn’t cover all the different variations and permutations you might encounter in the real world.”

It’s most useful where there’s no personal data and no threat model. For example, synthesizing multiple examples of your LLM-based agents calling functions and APIs in your own environment demonstrably makes the models better.
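A minimal, hypothetical sketch of what that looks like in practice: templating synthetic (request, tool call) training pairs for an in-house API surface. The tool names, parameters, and phrasing templates below are invented for illustration, not taken from any vendor's pipeline.

```python
import json
import random

# Hypothetical in-house API surface we want the agent to learn to call.
TOOLS = {
    "get_invoice_status": ["invoice_id"],
    "reset_password": ["username"],
}

# Hypothetical natural-language templates for each tool.
TEMPLATES = {
    "get_invoice_status": "What's the status of invoice {invoice_id}?",
    "reset_password": "Please reset the password for {username}.",
}

def synthesize_tool_call_examples(n, seed=0):
    """Generate n (user request, expected tool call) training pairs."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        tool = rng.choice(list(TOOLS))
        # Fill each parameter with a synthetic placeholder value.
        args = {p: f"{p}_{rng.randint(1000, 9999)}" for p in TOOLS[tool]}
        examples.append({
            "prompt": TEMPLATES[tool].format(**args),
            "completion": json.dumps({"name": tool, "arguments": args}),
        })
    return examples

samples = synthesize_tool_call_examples(100)
parsed_names = [json.loads(s["completion"])["name"] for s in samples]
```

In a real pipeline the templates would be far more varied (often themselves LLM-generated), but the shape is the same: cheap, privacy-free examples of the behavior you want.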

For those scenarios, Cox maintains turnkey tools from vendors like IBM are both safe and powerful. “Synthetic data is your friend here,” he says. “It helps you make the model better at something. It’s not associated with real people or data you worry about leaking. It’s completely innocuous and safe.”

Infusing domain knowledge, and ensuring the synthetic data reflects the true distribution of traits, properties, and features, can actually make models better than they would have been if trained only on real data.

“Most issues you see in production are because of boundary conditions, but the real data doesn’t represent all those conditions,” says Rahul Rastogi, chief innovation officer of real-time data platform SingleStore.

Manufacturers wanting to detect damaged or blemished products on an assembly line, for instance, are unlikely to have images of all possible defect combinations they want computer vision models to detect. Fraud detection and cybersecurity can do more extreme testing with synthetic data, he says. “It’s probably best practice to do threat modeling and generate as much synthetic data as you can, because you can’t afford to wait for your model to have leaks, or generate incorrect results or too many false positives,” he says.

The EU AI Act may encourage more use of synthetic data, because if organizations want to use personal data in an AI regulatory sandbox for workloads meeting the public interest criteria — energy sustainability or protecting critical infrastructure, for example — they have to prove synthetic data couldn’t be used instead. Showing that requires experimenting with synthetic data, which may mean it gets more widely adopted where it is, in fact, useful enough.

Even for organizations not affected by the EU AI Act, Gartner recommends synthetic data where possible because of how likely it is that gen-AI models can retain personal data included (directly or indirectly) in a prompt. Patterns of language use, topics of interest, or just the user profile can be enough to risk re-identifying an individual. But despite the potential advantages, getting synthetic data right isn’t always straightforward.

“Synthetic data can be a force for good but you can really mess up with it, too,” says Gartner VP Analyst Kjell Carlsson. “We could improve most of our use cases by using synthetic data in some way but it carries risks, and people aren’t familiar with it. You need people who know what they’re doing, and you need to be careful about what you’re doing.”

Replicating too much reality

Healthcare, where privacy protections block data sharing that could improve AI, is an obvious customer for synthetic data, but it’s helpful for any organization where customer data is particularly valuable.

Although he can’t name the company for which he ran global reporting, analytics, and data services while at Apple, Rastogi says that despite initial skepticism, his former team successfully used synthetic customer data for bakeoffs, evaluating new technology without giving vendors access to real customer data. Before trusting it, they first checked its dimensionality, data distribution, and Cartesian relationships against the real data.

“We were sensitive about using our real data,” he says. “While real data will give you the best results, we were always very hesitant.” That was five years ago, but he believes enterprises face similar challenges today using their data for AI.

“Real data is low-grade radioactive material,” IBM Research’s Cox adds. “You’re not moving it outside the walls of your company, but you don’t want to move it around at all if you can help it. And data copied for developers is data that can get stolen. There’s enormous opportunity there, as many enterprises sit on a gold mine of data they’re very cautious about and don’t get full value out of. Making a copy of the customer database and putting it somewhere else is a major risk, so it’s much safer to create a synthetic surrogate.”

Synthetic data promises to do that in a privacy-preserving way, Carlsson says, since you create synthetic versions of the dataset, which shouldn’t include any real individuals. But that can misfire. “You might have made a mistake and oversampled an individual too frequently, so you ended up replicating that person and didn’t sanitize it afterward to remove anyone corresponding to real people,” he says. “Or someone can just reverse engineer it, because the relationships between your different fields are strong enough you can figure that out.” Reidentification is even more likely when you combine multiple datasets.

Vawdrey calls that kind of inadvertent replication model leakage. “This risk has evolved alongside generation techniques,” he says. “Modern GAN and LLM-based methods can sometimes memorize and reproduce sensitive training examples, so enterprises should implement rigorous privacy-preserving methods like differential privacy to mathematically guarantee protection against re-identification.”
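One common sanity check for this kind of leakage, sketched here with invented numbers, is to measure each synthetic record's distance to its nearest real record and flag near-duplicates. This is only a rough heuristic, not the mathematical guarantee that differential privacy provides.

```python
import math

def nearest_real_distance(synthetic_row, real_rows):
    """Euclidean distance from one synthetic record to its closest real record."""
    return min(math.dist(synthetic_row, real_row) for real_row in real_rows)

def flag_leaky_rows(synthetic, real, threshold=0.05):
    """Return indices of synthetic rows suspiciously close to a real record."""
    return [
        i for i, row in enumerate(synthetic)
        if nearest_real_distance(row, real) < threshold
    ]

# Invented, normalized feature vectors for three real customers.
real = [(0.10, 0.90), (0.40, 0.20), (0.75, 0.55)]
synthetic = [(0.11, 0.89),   # near-duplicate of a real customer: flagged
             (0.50, 0.50)]   # genuinely novel point: fine

print(flag_leaky_rows(synthetic, real))  # -> [0]
```

The threshold and the distance metric both need domain judgment; a flagged row means a possible memorized individual that should be removed or regenerated.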

Say you have a database with customer demographics and buying habits. Differential privacy lets you guarantee privacy by adding noise, but it’s a trade-off that can reduce accuracy. “The more noise you add, the less your data looks like data,” Cox warns.
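That trade-off can be sketched with the classic Laplace mechanism: a count query has sensitivity 1, so adding Laplace noise with scale 1/ε gives ε-differential privacy, and a smaller ε (stronger privacy) means a noisier answer. A stdlib-only illustration with invented numbers:

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise via the inverse CDF of a uniform draw."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count, epsilon, rng):
    """Epsilon-DP count: a count query has sensitivity 1, so Laplace
    noise with scale 1/epsilon satisfies epsilon-differential privacy."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)
true_count = 1280  # hypothetical: customers matching some demographic query
noisy = private_count(true_count, epsilon=0.5, rng=rng)

# Averaging many noisy releases shows the noise is centered on the truth.
# (In practice you release the count once; repeated releases spend more
# privacy budget.)
avg = sum(private_count(true_count, 0.5, rng) for _ in range(20000)) / 20000
```

Dropping ε from 0.5 to 0.05 multiplies the noise scale by ten, which is exactly the "less your data looks like data" effect Cox warns about.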

Synthetic data already requires expertise, and advanced techniques like differential privacy raise the bar even higher, so many organizations will rely on AI platforms or work with a sophisticated partner rather than on internal capabilities.

The limits of debiasing

All datasets are effectively biased, says Carlsson; it’s just a question of by how much. Adding underrepresented populations back into the dataset can debias the model.

In theory, synthetic data can deliver models that perform better with diverse populations, or in difficult situations. For audio, you might add more examples of edge cases, accents, noisy conditions like retail environments, rare terminology you need to get right, or conversations that shift from one language to another.

“You can create synthetic, additional versions with variations of underrepresented groups in your data,” Carlsson says. “In my clinical trial, I don’t have enough people who are a certain ethnicity, age, or gender.” Increasing representation with sufficient variety rebalances the dataset. “I can create synthetic versions of these individuals with additional variations around them, and make this model actually perform better for that group. I can also completely mess it up and oversample too small of a group of people, and just end up duplicating the same individual over and over again, which is both bad from a privacy point of view and also doesn’t help you because that person might not have been terribly representative of this group. You can easily go astray and worsen the problems with your data, and make it even more biased than it was before.”
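Carlsson's failure mode can be made concrete: if the underrepresented group has only one real member, naive oversampling just clones that person, and even jittered copies only add noise around a single individual rather than real variety. A hypothetical single-feature sketch:

```python
import random
import statistics

def duplicate_oversample(group, n, rng):
    """Naive: replicate existing records verbatim (privacy and bias risk)."""
    return [rng.choice(group) for _ in range(n)]

def jitter_oversample(group, n, rng, noise=0.1):
    """Better: perturb sampled records so no synthetic row equals a real one."""
    return [rng.choice(group) + rng.gauss(0, noise) for _ in range(n)]

rng = random.Random(7)
minority = [52.0]  # hypothetical: only one member of the group in the real data
dup = duplicate_oversample(minority, 50, rng)
jit = jitter_oversample(minority, 50, rng)

print(len(set(dup)))  # always 1: the same person replicated 50 times
print(statistics.stdev(jit) > 0)  # True: jittered copies at least vary
```

But note the jittered version is still just noise around one person. If that individual isn't representative of the group, no amount of augmentation fixes it; you need enough genuinely varied real members to start from.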

A recent study in the ACM Digital Library shows that even tools promising unbiased datasets, when they offer no guidance or controls based on demographic data, can produce a dramatically unbalanced racial dataset: one that appears diverse but completely omits groups that make up a significant proportion of the real population. If generated data is based on a very small number of base samples, without knowing how specific features of those samples are distributed in the real population, you can get statistical diversity that isn’t representative.

“You’ve lulled yourself into a false sense of security that the model would work,” Carlsson says.

The obvious danger, then, is that synthetic data might be poor quality, or just wrong, so using the right techniques to generate data for each use case is as vital as checking it thoroughly.

“With tabular data, statistical correlations may be oversimplified, while synthetic images might lack subtle variations present in real-world visual data,” says Vawdrey. “Text generation faces challenges with factual accuracy and coherence. Problems also happen when synthetic data fails to capture the true complexity and nuances of real-world data, leading to models that perform well on synthetic tests but fail in production environments.”

Build on your expertise

Like LLMs, synthetic data needs stringent grounding in real-world contexts, such as through RAG, to avoid hallucinated or nonsensical output, says Nikhil Pareek, CEO of AI lifecycle platform Future AGI. Plausible-looking synthetic data can cause problems if the distribution is inaccurate, with class imbalance or correlation mismatches.

Iterative validation and semantic clustering to anchor generated data in actually observed patterns can help here, and that requires domain expertise so you can spot data that’s wrong, especially if you venture into simulation.
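A first step in that kind of validation, sketched here with invented columns, is comparing pairwise correlations between the real and synthetic tables and flagging mismatches above a tolerance:

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_mismatch(real_a, real_b, syn_a, syn_b, tol=0.15):
    """True if the synthetic table lost (or invented) a correlation."""
    return abs(pearson(real_a, real_b) - pearson(syn_a, syn_b)) > tol

# Invented example: income and spend are strongly correlated in real data.
real_income = [30, 40, 50, 60, 70]
real_spend  = [3,  4,  5,  6,  7]
# The synthetic generator got the marginals right but broke the relationship.
syn_income  = [30, 40, 50, 60, 70]
syn_spend   = [7,  3,  6,  4,  5]

print(correlation_mismatch(real_income, real_spend, syn_income, syn_spend))  # True
```

Both tables have identical column-wise values here, so marginal checks alone would pass; only the joint check catches that the synthetic generator destroyed the income-spend relationship.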

The good news is this gives organizations an opportunity for differentiation, Cox says. “The domain expertise you have about your business, your customers, and how your business works, that’s the most essential piece,” he says.

The trick is involving the right experts inside the business and acquiring the right technical expertise. But there are few experienced synthetic data engineers for enterprises to hire. “Generating high-quality, fit-for-purpose data requires specialized knowledge and expertise, posing a barrier for many organizations today,” warns Chitkara. And until organizations can trust synthetic data and the governance around it, adoption will be slow.

“For the business stakeholder looking at applying AI, the most important skill to develop today is benchmarking and evaluation,” Cox continues. “You need to have that baseline of what does good mean and how am I going to test the system to understand if it’s doing better than it was before I added synthetic data.” Monitoring and evaluation needs to be continuous and tied to business goals.

Running out of space

Because synthetic data is often easier to produce than real data, and because the whole point is creating many examples to cover multiple scenarios, enterprises are likely to end up with much larger datasets. They may also underestimate the infrastructure required to generate synthetic data.

“Early approaches like rule-based generation or SMOTE required minimal computational resources, while modern deep learning approaches like GANs demand substantial GPU capacity,” Vawdrey says. “The latest LLM-based synthetic data generation can require enterprise-grade infrastructure, especially for large-scale image or video synthesis.”
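SMOTE, one of the lightweight early approaches Vawdrey mentions, creates minority-class samples by interpolating between a minority point and one of its nearest minority neighbors. A simplified sketch with invented points (production implementations, e.g. in imbalanced-learn, are more involved):

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Interpolate new points between a sampled minority point and one
    of its k nearest minority neighbors (simplified SMOTE)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        p = rng.choice(minority)
        neighbors = sorted(
            (q for q in minority if q != p),
            key=lambda q: math.dist(p, q),
        )[:k]
        q = rng.choice(neighbors)
        t = rng.random()  # interpolation fraction in [0, 1)
        out.append(tuple(pi + t * (qi - pi) for pi, qi in zip(p, q)))
    return out

# Invented 2-D feature vectors for a small minority class.
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
new_points = smote_like(minority, 10)
```

Everything here runs on a laptop, which is the point of the contrast: the jump to GAN- or LLM-based generation is a jump from loops like this to GPU clusters.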

Once generated, enterprises also need to retain synthetic datasets and model artifacts for auditing; clear documentation trails must show how synthetic data was created, validated, and used.

Synthetic data can be structured and compact, without the noise, redundancies, and unstructured elements of messy, real-world data. But scenario exploration and intelligent simulations demand significant compute resources and storage capacity due to the large volumes of data generated, Chitkara says. Synthetic media datasets can run into petabytes.

“It’s an embarrassment of riches situation,” Cox adds. “You can easily create more things than you know what to do with. Just because it’s synthetic data doesn’t mean you don’t have to keep it around, audit it, and understand how you created it and how you used it. You still have to deal with it.”


Source: News
May 21, 2025
