AI is being rapidly adopted by virtually all enterprises this year, but most are deploying the same platforms from the same vendors as everyone else.
Creating a customized AI solution based on a company’s unique needs requires data. Unfortunately, the data companies have at hand might have significant gaps, could be messy, and could raise privacy or compliance issues when it comes to using it. There might also simply not be enough of it.
Synthetic data can bridge that gap, helping enterprises find real business value from their AI deployments.
In mid-April, digital transformation consultancy EPAM released a survey of over 7,300 executives and IT professionals at large companies. All respondents were either experimenting with or deploying AI, with 14% just starting out and 32% developing competency but not yet seeing consistent results. Nearly half, though, say they’re achieving results and using AI to become more competitive. But only 5% consider themselves disruptors leading the pack in their use of AI.
Deloitte also found that 30% of senior executives say a shortage of high-quality data is one of the top barriers to gen AI adoption. And this is where synthetic data comes in.
“Having real data is key to any business,” says Chida Sadayappan, lead specialist for data cloud and ML at Deloitte Consulting. “But complementing that with synthetic data is a great business differentiator. AI models generated using this synthetic data will give an edge to companies.”
According to Gartner, 75% of businesses will use gen AI to create synthetic customer data by next year, up from less than 5% in 2023.
In fact, according to Forrester, the majority of global businesses are already working on initiatives involving synthetic data. More specifically, reports show that 14% have deployed multiple use cases at enterprise scale, 22% at departmental scale, and 22% are working on initial production implementations. Plus, 15% are in pilot stages and the same number are in late-stage research and proof of concept.
So how does synthetic data help companies create business value? Here are the top ways.
1. Building AI that truly understands your business
The AI models from the big AI companies are, by necessity, generic.
“When major AI vendors train models on the same publicly available datasets, the result is often homogenized AI outputs,” says Andy Frawley, CEO at Data Axle, a data services company. That limits differentiation.
“On top of that, these datasets can perpetuate inaccuracies that have been embedded over time, reducing the reliability of AI-driven decisions,” he adds. Reliability can also suffer because the information available to the big commercial models might not cover the nuances of specific customer segments.
Companies can address this gap by fine-tuning or augmenting existing AI models, or building small custom models by using their own data or purchased data. And when this isn’t enough, they can do it by creating new, synthetic data.
Nextuple, an inventory management company, uses synthetic data to create custom AI and ML models that can understand inventory management challenges. For example, say a large batch of inventory comes into a central warehouse. “We need to decide where to send it,” says Darpan Seth, the company’s cofounder and CEO. “It’s a high value decision you’re making at that point.” There are a lot of logistics and optimization factors that go into making such a decision, factors unique to every company, and synthetic data has been critical to building and testing such systems for years.
“So that’s not new,” says Seth. “But the way you can use synthetic data now — the possibilities are greater than they’ve ever been.”
And Nextuple isn’t just using synthetic data to help train ML and AI models, he says. Gen AI is now used to create synthetic data, making the process faster, easier, more flexible, and more intelligent than ever before.
“We fed it a lot of requirements that we see across the board, across all our customers,” says Seth. “It’s got all that data, and now you can ask it to generate user stories, test cases, test data — and test automations, as well.”
In the past, generating this synthetic data would’ve been a heavily manual process. For example, an order could be created with three items in it, then another with 10 items, and so on, with different minor variations. “All of that is blazing fast today because you can do it with generative AI,” Seth says.
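The kind of order generation Seth describes can be sketched in a few lines. This is a minimal illustration, not Nextuple’s actual tooling; the SKU catalog, quantity ranges, and field names here are all assumptions for demonstration.

```python
import random

def generate_orders(n_orders, max_items=10, skus=None, seed=42):
    """Generate synthetic orders with varying item counts.

    `skus` is a hypothetical product catalog; in a real pipeline
    these IDs would come from an actual (or synthesized) product master.
    """
    rng = random.Random(seed)  # seeded for reproducible test data
    skus = skus or [f"SKU-{i:04d}" for i in range(100)]
    orders = []
    for order_id in range(n_orders):
        n_items = rng.randint(1, max_items)  # vary order size, e.g. 3 vs. 10 items
        items = [
            {"sku": rng.choice(skus), "qty": rng.randint(1, 5)}
            for _ in range(n_items)
        ]
        orders.append({"order_id": order_id, "items": items})
    return orders

orders = generate_orders(3)
```

With gen AI, the equivalent variations can be produced from a natural-language prompt instead of hand-coded rules, which is what makes the process so much faster.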
Gen AI has also democratized the entire process.
“Late last year, we enabled everyone on the team with AI tools,” he says. “This is something anyone can use.”
And since their business customers use a variety of platforms, Nextuple builds its systems to be model-agnostic.
“We use everything from OpenAI and Claude, to Llama and Gemini,” says Seth. “AWS has Bedrock, and there’s Azure, and all of these providers have a range of models available. There are 75 to 80 companies with a range of different models.”
So Nextuple built its technology to make the back-end AI interchangeable. “Plus, tomorrow you might find a different model that does it better, or at lower cost,” he says.
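One common way to make the back-end AI interchangeable is a thin adapter interface. The sketch below is an illustration of that pattern, not Nextuple’s actual code; `EchoModel` is a hypothetical stand-in where a real OpenAI, Claude, or Gemini adapter would plug in.

```python
from abc import ABC, abstractmethod

class ChatModel(ABC):
    """Minimal provider-agnostic interface. Frameworks like LangChain
    provide a richer version of this abstraction."""
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class EchoModel(ChatModel):
    # Stand-in backend for testing; swap in a real provider adapter
    # without touching any calling code.
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"

def generate_test_data(model: ChatModel, spec: str) -> str:
    # Callers depend only on the interface, so tomorrow's better or
    # cheaper model is a one-line swap.
    return model.complete(f"Generate test data for: {spec}")

print(generate_test_data(EchoModel(), "orders with 3 items"))
```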
Since Nextuple has to work with all the major cloud providers and AI platforms, it doesn’t use the AI technology stack from any particular vendor but instead built its own, using open source components including LangChain, LangGraph, and Langflow and, for RAG embeddings, vector stores such as PostgreSQL with the pgvector extension.
“And there are some new paradigms emerging, such as the Model Context Protocol,” he says. “Things are changing so fast.”
2. Filling in the gaps
Actual data is rarely complete. Sometimes, the gaps are due to changing behaviors. For example, historical shopping data might show a spike on Black Friday. But today, everyone might be shopping online and a one-day spike might be extended to an entire week. And sometimes gaps occur because some situations happen very rarely, so there aren’t enough examples of them. For some enterprises, those gaps can be consequential.
“I do a lot of traffic management,” says Karen Panetta, IEEE fellow and dean of graduate engineering at Tufts University. There’s plenty of data available from various video cameras, she says. But some of the most critical data, such as certain kinds of traffic accidents, are also the rarest.
“We didn’t have enough video on rollovers,” she says. “So we used synthetic data to generate that.”

Then there’s facial recognition. There are plenty of databases of faces with photos taken in good light where the subjects are looking straight ahead. Training on only this kind of data results in systems that don’t always work and can even be dangerous if they’re being used for security.
“The minute you turn your head or put your glasses on, or smile, or put a mask on, it fails,” she says.
Image generators can be used to create permutations of photographs that simulate different lighting conditions or angles. But there’s a limit to how much can be done with current technology.
“We tried to generate some synthetic data for people with masks but it didn’t match human anatomy very well,” Panetta says. “Those contours are important. So it failed miserably. But it’s a good tool if the synthetic data really exhibits the behaviors you want to match.”
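The simplest of the permutations described above, simulating different lighting, amounts to rescaling pixel intensities. This toy sketch assumes an 8-bit grayscale image stored as nested lists; real augmentation pipelines use libraries such as torchvision or Albumentations, which also handle rotation, occlusion, and more realistic lighting models.

```python
def adjust_brightness(pixels, factor):
    """Scale pixel intensities to simulate different lighting.

    A toy stand-in for real augmentation libraries. Values are
    clamped to the 8-bit range [0, 255].
    """
    return [[min(255, int(p * factor)) for p in row] for row in pixels]

image = [[100, 200], [50, 255]]
darker = adjust_brightness(image, 0.5)    # simulates dimmer lighting
brighter = adjust_brightness(image, 1.5)  # simulates brighter lighting
```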
3. Protecting privacy while maximizing data value
Many companies have specific use cases that off-the-shelf models don’t cover well. It’s not just specialized inventory management applications or self-driving cars. It could also be as simple as generating an email or a slide deck for a prospective customer.
“There’s no objective answer about how to draft an email to a client,” says Eric Lin, VP of applied AI at Dynamo AI, a company focusing on AI guardrails and compliance. That’s because companies have their own style, language, and, of course, unique product information. The product information gap can be filled by pointing the AI at a vector database at the point of inference, via retrieval-augmented generation (RAG). But training an AI on emails to actual customers could violate their privacy, whether done through fine-tuning or RAG. You don’t want an AI to include sensitive information about one customer in a message to another.
“We’ve been afraid to leverage this data because of privacy and safety concerns,” adds Lin. But synthetic data can strip away all the sensitive private information so it doesn’t get into an AI’s knowledge base and enable enterprises to create models that write exactly the kind of emails and slide decks they need. And it’s not just for marketing applications.
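Stripping sensitive information before data reaches an AI’s knowledge base can be as simple in concept as replacing detected identifiers with placeholder tokens. The sketch below is a deliberately minimal illustration; production PII redaction uses dedicated detection tools (often NER-based), since regexes alone miss names, addresses, and context-dependent identifiers.

```python
import re

# Hypothetical minimal scrubber covering just two PII types.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace detected PII with placeholder tokens before the text
    is used for fine-tuning or added to a RAG store."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(scrub("Reach Jane at jane@example.com or 555-123-4567"))
```

Note that the customer’s name survives this pass, which is exactly why real redaction pipelines go well beyond pattern matching.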
“For companies in healthcare, for instance, synthetic data helps simulate patient data and clinical scenarios, ensuring compliance with privacy laws while creating diverse training sets,” says Bharath Thota, partner in the digital and analytics practice at Kearney.
By using synthetic data, healthcare firms can get better accuracy or create innovative new products, he says, even though the field is highly regulated.
4. Accelerating product development and R&D
Speaking of creating products, if a company is building something new, the problem might not be privacy but the absence of any historical data to work with. That happened when Nextuple wanted to build a new application for inventory management.
“We wanted to simulate how a company’s inventory gets consumed on their network of distribution centers and stores based on typical demand factors,” says Nextuple’s Seth. “Without having real-world data, there was no way to test if it works in a real-world scenario.”
The synthetic data they created included inventory positions across a network of stores and warehouses, as well as simulated orders and the timing in which they came in.
“We used simulations to understand that, say, during Thanksgiving, there are certain surges in sales, and understanding what those real-life situations are, we created synthetic data,” he says. “Then we had the good fortune to test it out with a prospect, which validated our hypothesis.”
Another example of using synthetic data for product development? Building robots.
“We’re seeing so much improvement in robotics these days,” says Agustin Huerta, SVP of digital innovation at software development company Globant. There are virtual environments, like the Nvidia Omniverse, where simulated robots can interact with simulated objects, creating large amounts of training data to jump-start a robot’s ability to navigate spaces or handle products.
“And if you’re talking about computer vision data for training autonomous driving solutions, we need synthetic data — there’s no other way to do it,” he says. “Otherwise, we’ll need to be crashing cars.”
5. Exploring new markets without historical data
Another use case for synthetic data is when a company has a product, but wants to sell it in a new market. Businesses can model how consumers might behave, what they prefer, and how they might respond to new products or services, says Thota. They can also use the simulated data to help refine features and marketing strategies.
“A bank looking to enter a new region can use synthetic data to simulate local economic conditions, spending habits, and how people might adopt their financial products,” he adds.
Anand Rao, AI professor at Carnegie Mellon University, once worked with a ride sharing company looking to expand to new markets. But using the same strategy everywhere wouldn’t have been very effective since conditions vary geographically.
“In New York City, you need a five to 10 minute turnaround,” Rao says. “They’re less tolerant of mispredictions, like if it says eight minutes but it takes 12 minutes for the car to come. But in Ann Arbor, Michigan, if it’s a few minutes late, they can live with it.”
That means the optimization strategies needed to be different, and synthetic data helped to refine those strategies.
“We had over 200,000 go-to-market scenarios for ten cities,” he adds. That gave executives real insights into how to adapt for the new markets.
6. Constructing digital twins
Historically, digital twins have been used for things like modeling jet engines, helping companies with predictive maintenance, or for designing and managing factories and other complex physical facilities. Today, the definition of digital twins is expanding to include things like software systems, business workflows, or even people.
Companies are simulating customers, their behaviors, shopping journeys, buying patterns, and how they’ll respond to a particular promotion, says Tom Edwards, Americas consumer AI leader at EY. They do it by creating synthetic customer profiles. “It helps us understand how different demographics will respond to different product positioning,” he says. “And what we get out is better demand forecasting and better targeting.”
And he’s seeing companies using synthetic personas instead of focus groups.
“You can create hundreds of personas and test different messaging,” he says. “Synthetic data allows you to fill in psychographic details.”
These simulated personas can also be used to improve ecommerce personalization.
“I can run millions of different combinations, and when it comes time for you to shop, I can immediately match you based on one of these preconfigured personas, built on synthetic data,” he adds. “I know you better than a traditional algorithm might because I’ve already extrapolated millions of potential paths forward.”
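Generating persona combinations at that scale starts with enumerating attribute values. This sketch is an illustration of the combinatorial idea only; the attributes and values are invented for the example, and in practice gen AI would fill in the psychographic detail for each combination.

```python
import itertools

# All attribute values below are illustrative assumptions,
# not real customer segments.
ATTRIBUTES = {
    "age_band": ["18-24", "25-34", "35-54", "55+"],
    "channel": ["email", "social", "in-store"],
    "price_sensitivity": ["low", "medium", "high"],
}

def generate_personas():
    """Yield every combination of attribute values as one persona."""
    keys = list(ATTRIBUTES)
    for values in itertools.product(*(ATTRIBUTES[k] for k in keys)):
        yield dict(zip(keys, values))

personas = list(generate_personas())
print(len(personas))  # 4 * 3 * 3 = 36 personas
```

Add a handful more attributes and the combination count quickly reaches the millions Edwards describes.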
The business value here could potentially be in the millions of dollars, he says, as it unlocks a way to seamlessly align with consumers and provide recommended products they haven’t seen before. A company can also create digital twins of employees.
“Internally, one of the things we’re looking at is our staffing and skills,” says Nick Kramer, leader of applied solutions at SSA & Company, a management consulting firm.
“We have historic data about our consultants, and unreliable data about skills and capabilities,” he says. “But we have rich project data and out of that, we’ve got our lump of clay, so to speak, and have been experimenting with different ways to synthesize data.”
The synthetic personas could be people, project roles, or specific titles, he says. Those are combined into simulated project teams, and that, in turn, creates a view of what staffing could look like and how to balance it against skills and tools, and how to optimize for outcome, speed, revenues, and margins.
7. Preparing for agentic AI
As AI evolves, so do the opportunities to use synthetic data. This year, for example, it’s all about agentic AI.
According to an April Cloudera survey, 96% of enterprise IT leaders say they plan to expand their use of AI agents in the next 12 months. And although 57% say they’ve already implemented AI agents, the single biggest barrier is data privacy, with 53% saying it’s slowing adoption. But it’s not just about preserving privacy when it comes to training AI agents.
“Synthetic data is a great way to accelerate the learning of those agents and map through complex scenarios,” says EY’s Edwards. It can also be used to ensure that agents can handle anything that’s thrown at them.
“If you’re able to run millions of different scenarios based on complex interactions, that becomes an incredibly valuable tool,” he says. “It’s going to become a foundational aspect for how you deploy an agent within an organization.”
Reality check: The risks of overreliance on synthetic data
There are also dangers in overusing synthetic data. As Panetta discovered when trying to create synthetic images of people wearing face masks, it has its limits.
“If abused, you risk the equivalent of the overfitting problem where outputs become highly repetitive,” says Gordon Van Huizen, SVP of strategy at Mendix, an AI platform company. “Then feeding a prompt outside the training data can result in random or bizarre results because the system has difficulty interpreting the new pattern.”
There are ways to address this, though. Companies can create more diverse data sets, blend synthetic data with real data, or add noise to the data to create outliers.
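The blending and noise-injection mitigations can be sketched simply. This is an illustrative toy over one-dimensional samples with invented parameters; real pipelines work on full feature vectors and tune noise scales to the data’s distribution.

```python
import random

def blend_with_noise(real, synthetic, noise_scale=0.1,
                     outlier_rate=0.05, seed=0):
    """Blend real and synthetic samples, adding Gaussian noise and
    occasional large outliers so the combined set isn't overly uniform.

    All parameters here are illustrative assumptions.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    blended = []
    for x in real + synthetic:
        x = x + rng.gauss(0, noise_scale)   # jitter every sample
        if rng.random() < outlier_rate:
            x = x * rng.uniform(3, 5)       # inject a rare outlier
        blended.append(x)
    rng.shuffle(blended)                    # mix real and synthetic
    return blended

data = blend_with_noise([1.0, 2.0, 3.0], [1.5, 2.5])
```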
“But the key to capitalizing on synthetic data is to always include human validation protocols wherever possible,” he says.
Read More from This Article: 7 ways synthetic data creates business value