Tiatra, LLC
Synthetic data’s fine line between reward and disaster

Up to 20% of the data used for training AI is already synthetic (that is, generated rather than obtained by observing the real world), with LLMs using millions of synthesized samples. Gartner projects that share could reach 80% by 2028, and that by 2030 synthetic data will be used for more business decision making than real data. Technically, though, any output you get from an LLM is synthetic data.

AI training is where synthetic data shines, says Gartner principal researcher Vibha Chitkara. “It effectively addresses many inherent challenges associated with real-world data, such as bias, incompleteness, noise, historical limitations, and privacy and regulatory concerns, including personally identifiable information,” she says.

Generating large volumes of training data on demand is appealing compared to slow, expensive gathering of real-world data, which can be fraught with privacy concerns, or just not available. Synthetic data ought to help preserve privacy, speed up development, and be more cost effective for long-tail scenarios enterprises couldn’t otherwise tackle, she adds. It can even be used for controlled experimentation, assuming you can make it accurate enough.

Purpose-built data is ideal for scenario planning and running intelligent simulations, and synthetic data detailed enough to cover entire scenarios could predict the future behavior of assets, processes, and customers, which would be invaluable for business planning. This kind of advanced use requires simulation engines, though, and the simulation equivalent of digital twins is still in development outside some early adoption areas.

Materials science, pharmaceutical research, oil and gas, and manufacturing are obvious markets, but interest is growing in supply chain and insurance industries. Sufficiently accessible and accurate tools could deliver operational improvements and revenue, as well as optimized costs and reduced risks in many areas of business decision making.

Also, marketing and product design teams could create simulated customers based on purchase data and existing customer surveys, and then interview them for feedback on new products and campaigns. One global supply chain company is experimenting with simulating disruptions like natural disasters, pandemics, and geopolitical shifts to improve resilience. That’s a multi-stage process of building simulation engines that generate datasets of the impact these scenarios will have on supply and delivery routes, and then training AI models to analyze those scenarios and suggest how to harden supply chains.

More immediate uses for synthetic data may be more prosaic. Indeed, organizations are probably already using it in limited ways outside AI. Web and application developers rely on synthetic monitoring, which simulates user interactions at scale to measure performance and availability across different scenarios, locations, and devices instead of waiting for real users to hit problem areas, and to test new apps and features before launch.

Accurate amplification

Created properly, synthetic data mimics the statistical properties and patterns of real-world data without containing actual records from the original dataset, says Jarrod Vawdrey, field chief data scientist at Domino Data Lab. And David Cox, VP of AI models at IBM Research, suggests viewing it as amplifying rather than creating data. “Real data can be extremely expensive to produce, but if you have a little bit of it, you can multiply it,” he says. “In some cases, you can make synthetic data that’s much higher quality than the original. The real data is a sample. It doesn’t cover all the different variations and permutations you might encounter in the real world.”

It’s most useful where there’s no personal data and no threat model. For example, synthesizing multiple examples of your LLM-based agents calling functions and APIs in your own environment demonstrably makes the models better.
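A minimal, hypothetical sketch of what that looks like in practice: templating synthetic (request, tool call) training pairs for an in-house API surface. The tool names, parameters, and phrasing templates below are invented for illustration, not taken from any vendor's pipeline.

```python
import json
import random

# Hypothetical in-house API surface we want the agent to learn to call.
TOOLS = {
    "get_invoice_status": ["invoice_id"],
    "reset_password": ["username"],
}

# Hypothetical natural-language templates for each tool.
TEMPLATES = {
    "get_invoice_status": "What's the status of invoice {invoice_id}?",
    "reset_password": "Please reset the password for {username}.",
}

def synthesize_tool_call_examples(n, seed=0):
    """Generate n (user request, expected tool call) training pairs."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        tool = rng.choice(list(TOOLS))
        # Fill each parameter with a synthetic placeholder value.
        args = {p: f"{p}_{rng.randint(1000, 9999)}" for p in TOOLS[tool]}
        examples.append({
            "prompt": TEMPLATES[tool].format(**args),
            "completion": json.dumps({"name": tool, "arguments": args}),
        })
    return examples

samples = synthesize_tool_call_examples(100)
parsed_names = [json.loads(s["completion"])["name"] for s in samples]
```

In a real pipeline the templates would be far more varied (often themselves LLM-generated), but the shape is the same: cheap, privacy-free examples of the behavior you want.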

For those scenarios, Cox maintains turnkey tools from vendors like IBM are both safe and powerful. “Synthetic data is your friend here,” he says. “It helps you make the model better at something. It’s not associated with real people or data you worry about leaking. It’s completely innocuous and safe.”

Infusing domain knowledge, and ensuring the synthetic data reflects the true distribution of traits, properties, and features, can actually make models better than they would have been if trained only on real data.

“Most issues you see in production are because of boundary conditions, but the real data doesn’t represent all those conditions,” says Rahul Rastogi, chief innovation officer of real-time data platform SingleStore.

Manufacturers wanting to detect damaged or blemished products on an assembly line, for instance, are unlikely to have images of all possible defect combinations they want computer vision models to detect. Fraud detection and cybersecurity can do more extreme testing with synthetic data, he says. “It’s probably best practice to do threat modeling and generate as much synthetic data as you can, because you can’t afford to wait for your model to have leaks, or generate incorrect results or too many false positives,” he says.

The EU AI Act may encourage more use of synthetic data, because if organizations want to use personal data in an AI regulatory sandbox for workloads meeting the public interest criteria — energy sustainability or protecting critical infrastructure, for example — they have to prove synthetic data couldn’t be used instead. Showing that requires experimenting with synthetic data, which may mean it gets more widely adopted where it is, in fact, useful enough.

Even for organizations not affected by the EU AI Act, Gartner recommends synthetic data where possible because of how likely it is that gen-AI models can retain personal data included (directly or indirectly) in a prompt. Patterns of language use, topics of interest, or just the user profile can be enough to risk re-identifying an individual. But despite the potential advantages, getting synthetic data right isn’t always straightforward.

“Synthetic data can be a force for good but you can really mess up with it, too,” says Gartner VP Analyst Kjell Carlsson. “We could improve most of our use cases by using synthetic data in some way but it carries risks, and people aren’t familiar with it. You need people who know what they’re doing, and you need to be careful about what you’re doing.”

Replicating too much reality

Healthcare, where privacy protections block data sharing that could improve AI, is an obvious customer for synthetic data, but it’s helpful for any organization where customer data is particularly valuable.

Although he can’t name the company for which he ran global reporting, analytics, and data services while at Apple, Rastogi says that despite initial skepticism, his former team successfully used synthetic customer data for bakeoffs, evaluating new technology without giving vendors access to real customer data. Before trusting it, they first checked its dimensionality, data distribution, and Cartesian relationships against the real data.

“We were sensitive about using our real data,” he says. “While real data will give you the best results, we were always very hesitant.” That was five years ago, but he believes enterprises face similar challenges today using their data for AI.

“Real data is low-grade radioactive material,” IBM Research’s Cox adds. “You’re not moving it outside the walls of your company, but you don’t want to move it around at all if you can help it. And data copied for developers is data that can get stolen. There’s enormous opportunity there, as many enterprises sit on a gold mine of data they’re very cautious about and don’t get full value out of. Making a copy of the customer database and putting it somewhere else is a major risk, so it’s much safer to create a synthetic surrogate.”

Synthetic data promises to do that in a privacy-preserving way, Carlsson says, since you create synthetic versions of the dataset, which shouldn’t include any real individuals. But that can misfire. “You might have made a mistake and oversampled an individual too frequently, so you ended up replicating that person and didn’t sanitize it afterward to remove anyone corresponding to real people,” he says. “Or someone can just reverse engineer it, because the relationships between your different fields are strong enough you can figure that out.” Reidentification is even more likely when you combine multiple datasets.

Vawdrey calls that kind of inadvertent replication model leakage. “This risk has evolved alongside generation techniques,” he says. “Modern GAN and LLM-based methods can sometimes memorize and reproduce sensitive training examples, so enterprises should implement rigorous privacy-preserving methods like differential privacy to mathematically guarantee protection against re-identification.”
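One common sanity check for this kind of leakage, sketched here with invented numbers, is to measure each synthetic record's distance to its nearest real record and flag near-duplicates. This is only a rough heuristic, not the mathematical guarantee that differential privacy provides.

```python
import math

def nearest_real_distance(synthetic_row, real_rows):
    """Euclidean distance from one synthetic record to its closest real record."""
    return min(math.dist(synthetic_row, real_row) for real_row in real_rows)

def flag_leaky_rows(synthetic, real, threshold=0.05):
    """Return indices of synthetic rows suspiciously close to a real record."""
    return [
        i for i, row in enumerate(synthetic)
        if nearest_real_distance(row, real) < threshold
    ]

# Invented, normalized feature vectors for three real customers.
real = [(0.10, 0.90), (0.40, 0.20), (0.75, 0.55)]
synthetic = [(0.11, 0.89),   # near-duplicate of a real customer: flagged
             (0.50, 0.50)]   # genuinely novel point: fine

print(flag_leaky_rows(synthetic, real))  # -> [0]
```

The threshold and the distance metric both need domain judgment; a flagged row means a possible memorized individual that should be removed or regenerated.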

Say you have a database with customer demographics and buying habits. Differential privacy lets you guarantee privacy by adding noise, but it’s a trade-off that can reduce accuracy. “The more noise you add, the less your data looks like data,” Cox warns.
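That trade-off can be sketched with the classic Laplace mechanism: a count query has sensitivity 1, so adding Laplace noise with scale 1/ε gives ε-differential privacy, and a smaller ε (stronger privacy) means a noisier answer. A stdlib-only illustration with invented numbers:

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise via the inverse CDF of a uniform draw."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count, epsilon, rng):
    """Epsilon-DP count: a count query has sensitivity 1, so Laplace
    noise with scale 1/epsilon satisfies epsilon-differential privacy."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)
true_count = 1280  # hypothetical: customers matching some demographic query
noisy = private_count(true_count, epsilon=0.5, rng=rng)

# Averaging many noisy releases shows the noise is centered on the truth.
# (In practice you release the count once; repeated releases spend more
# privacy budget.)
avg = sum(private_count(true_count, 0.5, rng) for _ in range(20000)) / 20000
```

Dropping ε from 0.5 to 0.05 multiplies the noise scale by ten, which is exactly the "less your data looks like data" effect Cox warns about.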

Synthetic data already requires expertise, and advanced techniques like differential privacy raise the bar even higher, so many organizations will rely on AI platforms or work with a sophisticated partner rather than on internal capabilities.

The limits of debiasing

All datasets are effectively biased, says Carlsson; it’s just a question of by how much. Adding underrepresented populations back into the dataset can debias the model.

In theory, synthetic data can deliver models that perform better with diverse populations, or in difficult situations. For audio, you might add more examples of edge cases, accents, noisy conditions like retail environments, rare terminology you need to get right, or conversations that shift from one language to another.

“You can create synthetic, additional versions with variations of underrepresented groups in your data,” Carlsson says. “In my clinical trial, I don’t have enough people who are a certain ethnicity, age, or gender.” Increasing representation with sufficient variety rebalances the dataset. “I can create synthetic versions of these individuals with additional variations around them, and make this model actually perform better for that group. I can also completely mess it up and oversample too small of a group of people, and just end up duplicating the same individual over and over again, which is both bad from a privacy point of view and also doesn’t help you because that person might not have been terribly representative of this group. You can easily go astray and worsen the problems with your data, and make it even more biased than it was before.”
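Carlsson's failure mode can be made concrete: if the underrepresented group has only one real member, naive oversampling just clones that person, and even jittered copies only add noise around a single individual rather than real variety. A hypothetical single-feature sketch:

```python
import random
import statistics

def duplicate_oversample(group, n, rng):
    """Naive: replicate existing records verbatim (privacy and bias risk)."""
    return [rng.choice(group) for _ in range(n)]

def jitter_oversample(group, n, rng, noise=0.1):
    """Better: perturb sampled records so no synthetic row equals a real one."""
    return [rng.choice(group) + rng.gauss(0, noise) for _ in range(n)]

rng = random.Random(7)
minority = [52.0]  # hypothetical: only one member of the group in the real data
dup = duplicate_oversample(minority, 50, rng)
jit = jitter_oversample(minority, 50, rng)

print(len(set(dup)))  # always 1: the same person replicated 50 times
print(statistics.stdev(jit) > 0)  # True: jittered copies at least vary
```

But note the jittered version is still just noise around one person. If that individual isn't representative of the group, no amount of augmentation fixes it; you need enough genuinely varied real members to start from.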

A recent study in the ACM Digital Library shows that even tools promising unbiased datasets, when they offer no guidance or controls based on demographic data, can produce a dramatically unbalanced racial dataset: one that appears diverse but completely omits groups that make up a significant proportion of the real population. If generated data is based on a very small number of base samples, without knowing how specific features of those samples are distributed in the real population, you can get statistical diversity that isn’t representative.

“You’ve lulled yourself into a false sense of security that the model would work,” Carlsson says.

The obvious danger, then, is that synthetic data might be poor quality, or just wrong, so using the right techniques to generate data for each use case is as vital as checking it thoroughly.

“With tabular data, statistical correlations may be oversimplified, while synthetic images might lack subtle variations present in real-world visual data,” says Vawdrey. “Text generation faces challenges with factual accuracy and coherence. Problems also happen when synthetic data fails to capture the true complexity and nuances of real-world data, leading to models that perform well on synthetic tests but fail in production environments.”

Build on your expertise

Like LLMs, synthetic data needs stringent grounding in real-world contexts, such as through RAG, to avoid hallucinated or nonsensical output, says Nikhil Pareek, CEO of AI lifecycle platform Future AGI. Plausible-looking synthetic data can cause problems if the distribution is inaccurate, with class imbalance or correlation mismatches.

Iterative validation and semantic clustering to anchor generated data in actually observed patterns can help here, and that requires domain expertise so you can spot data that’s wrong, especially if you venture into simulation.
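A first step in that kind of validation, sketched here with invented columns, is comparing pairwise correlations between the real and synthetic tables and flagging mismatches above a tolerance:

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_mismatch(real_a, real_b, syn_a, syn_b, tol=0.15):
    """True if the synthetic table lost (or invented) a correlation."""
    return abs(pearson(real_a, real_b) - pearson(syn_a, syn_b)) > tol

# Invented example: income and spend are strongly correlated in real data.
real_income = [30, 40, 50, 60, 70]
real_spend  = [3,  4,  5,  6,  7]
# The synthetic generator got the marginals right but broke the relationship.
syn_income  = [30, 40, 50, 60, 70]
syn_spend   = [7,  3,  6,  4,  5]

print(correlation_mismatch(real_income, real_spend, syn_income, syn_spend))  # True
```

Both tables have identical column-wise values here, so marginal checks alone would pass; only the joint check catches that the synthetic generator destroyed the income-spend relationship.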

The good news is this gives organizations an opportunity for differentiation, Cox says. “The domain expertise you have about your business, your customers, and how your business works, that’s the most essential piece,” he says.

The trick is involving the right experts inside the business and acquiring the right technical expertise. But there are few experienced synthetic data engineers for enterprises to hire. “Generating high-quality, fit-for-purpose data requires specialized knowledge and expertise, posing a barrier for many organizations today,” warns Chitkara. And until organizations can trust synthetic data and the governance around it, adoption will be slow.

“For the business stakeholder looking at applying AI, the most important skill to develop today is benchmarking and evaluation,” Cox continues. “You need to have that baseline of what does good mean and how am I going to test the system to understand if it’s doing better than it was before I added synthetic data.” Monitoring and evaluation needs to be continuous and tied to business goals.

Running out of space

Because synthetic data is often easier to produce than real data, and because the whole point is creating many examples to cover multiple scenarios, enterprises are likely to end up with much larger datasets. They may also underestimate the infrastructure required to generate synthetic data.

“Early approaches like rule-based generation or SMOTE required minimal computational resources, while modern deep learning approaches like GANs demand substantial GPU capacity,” Vawdrey says. “The latest LLM-based synthetic data generation can require enterprise-grade infrastructure, especially for large-scale image or video synthesis.”
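SMOTE, one of the lightweight early approaches Vawdrey mentions, creates minority-class samples by interpolating between a minority point and one of its nearest minority neighbors. A simplified sketch with invented points (production implementations, e.g. in imbalanced-learn, are more involved):

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Interpolate new points between a sampled minority point and one
    of its k nearest minority neighbors (simplified SMOTE)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        p = rng.choice(minority)
        neighbors = sorted(
            (q for q in minority if q != p),
            key=lambda q: math.dist(p, q),
        )[:k]
        q = rng.choice(neighbors)
        t = rng.random()  # interpolation fraction in [0, 1)
        out.append(tuple(pi + t * (qi - pi) for pi, qi in zip(p, q)))
    return out

# Invented 2-D feature vectors for a small minority class.
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
new_points = smote_like(minority, 10)
```

Everything here runs on a laptop, which is the point of the contrast: the jump to GAN- or LLM-based generation is a jump from loops like this to GPU clusters.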

Once generated, enterprises also need to retain synthetic datasets and model artifacts for auditing; clear documentation trails must show how synthetic data was created, validated, and used.

Synthetic data can be structured and compact, without the noise, redundancies, and unstructured elements of messy, real-world data. But scenario exploration and intelligent simulations demand significant compute resources and storage capacity due to the large volumes of data generated, Chitkara says. Synthetic media datasets can run into petabytes.

“It’s an embarrassment of riches situation,” Cox adds. “You can easily create more things than you know what to do with. Just because it’s synthetic data doesn’t mean you don’t have to keep it around, audit it, and understand how you created it and how you used it. You still have to deal with it.”


Source: News
May 21, 2025
