The AI data governance gap that keeps getting worse

Every enterprise I talk to right now is somewhere in the middle of building AI into their products or operations. Internal copilots, fraud detection, customer service bots. The ambition is real and the pace is aggressive.

But here is what keeps coming up in my conversations, and it is not a technical issue. It is a governance one.

When teams build AI systems, they need data. Usually a lot of it, and it must be realistic enough that the model learns something useful. So where does that data come from? Mostly production databases. The same systems that hold customer names, financial records, health data, government identifiers. The stuff organizations spend millions protecting in their live environments.

The data gets pulled out. It moves into a dev environment. Then a data lake. Then a training pipeline. Maybe a third-party annotation service. Maybe a shared notebook a contractor has access to. And at some point along that chain, someone should be asking whether the data was supposed to leave the production boundary at all.

Almost nobody is asking.

What this looks like in practice

Last year I worked with the engineering leadership of a mid-sized lender that was understandably proud of how quickly its team had stood up a fraud-detection model. The model was performing well. When we walked the data flow together, we found a CSV with roughly 200,000 real customer records (names, account numbers, transaction histories, a partial column of government IDs) that had been exported eight months earlier for “initial training.”

That file was now living in three places I could verify and probably more I could not. A shared cloud folder the data science team used. Two laptops belonging to data scientists who had since rotated to other projects. And a machine in a different country belonging to a contractor who had been brought in for a labeling pass and whose access had never been revoked. The export itself had been approved at the time. The follow-up cleanup just never happened. The data had quietly become permanent.

Nobody had done anything wrong, exactly. The CISO had not been asked. The privacy team had signed off on the original use case six quarters earlier and assumed the data lived where it was supposed to live. The data engineers thought governance was somebody else’s job. The data scientists thought, “We have a privacy team for that.” Everybody was doing their part. The gap between those parts is where 200,000 real customer records ended up sitting on a contractor’s laptop.

That is the pattern I keep seeing, and it is not because anyone is being reckless. Security teams are watching for external threats. Privacy teams are managing consent and regulatory filings for customer-facing systems. Data engineering is keeping pipelines running. Nobody has been told it is their job to ask whether production data should be leaving the production boundary at all for AI work.

Why this keeps happening

Here is what I think is going on. Most AI development workflows in enterprises today evolved out of data-science notebooks, where the only design goal was speed of experimentation. Governance was meant to come later, except later never arrived, and now there are production datasets sitting in staging environments that have been labeled “temporary” for over a year.

What makes this different from traditional software testing is the sheer number of copies AI work generates. A developer testing an application pulls production data into one test database. That is one extra copy. Manageable. AI workflows produce a chain. The data gets extracted, transformed, sampled, split into training and evaluation sets, fed through multiple model iterations and sometimes exported to external platforms for labeling or benchmarking. Each step can create a new copy. Each copy is another place where sensitive data exists with weaker protections than the original.

And the risk is not theoretical. Security researchers have repeatedly demonstrated that large language models can memorize fragments of their training data and reproduce them when prompted the right way. The original Carlini et al. paper on GPT-2 extracted real names, phone numbers and email addresses by querying the model. A follow-up team did the same against production ChatGPT using a divergence attack that caused the model to dump training data fragments verbatim. If a model is trained on raw customer records, those records can surface in the model’s outputs. The data does not just disappear into the weights. It lingers.

The financial weight of that lingering shows up in IBM’s 2024 Cost of a Data Breach Report, which puts the average global cost of a breach at USD 4.88 million and notes that 40% of breaches now involve data stored across multiple environments and 35% involve shadow data. Every untracked copy of sensitive data in an AI pipeline is one more piece of that statistic.

Regulators are not waiting for the industry to catch up. GDPR Article 25 already requires data protection by design and by default, naming pseudonymization as an example measure and data minimization as a principle those measures must implement, and it does not distinguish between customer-facing systems and internal model development. The EU AI Act’s Article 10 adds an explicit data governance obligation for high-risk AI systems, including documentation of where training data came from and how it was handled. When a regulator asks where a model’s training data originated and whether personal information was properly managed, “we’ll have to check with the data science team” is not going to be an acceptable answer.

What good looks like

The encouraging part is that the techniques to fix this are not exotic. They are not even particularly new. The challenge has always been making them the default path rather than something teams opt into when they remember to.

Two examples from my own work in the past six months.

A healthcare client I worked with had been training internal triage and scheduling models on raw patient records pulled from their EHR. We helped them shift the entire AI development environment to a synthetic-data pipeline. The new pipeline produced statistically faithful records that preserved the distributions and relationships their models needed to learn from, but contained zero real patient information. The data scientists got cleaner datasets than they had before. The privacy team got a clean break in the data lineage. The engineering effort was measured in weeks, not quarters.

A bank I advised had the same problem on a fraud-detection rebuild: raw customer data flowing straight from production into the model development environment. We replaced that export step with on-the-fly masking, so when a data scientist pulled records into a notebook, real names came through as realistic-but-fake names and account numbers came through as masked tokens.
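For readers who want a concrete picture, here is a minimal Python sketch of the kind of deterministic masking described above. It is an illustration under my own assumptions, not the bank's implementation: the field names (`name`, `account_number`), the `MASKING_KEY` value, the small name lists and the `acct_` token format are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical key for illustration only. In a real setup the key would be
# held in a secrets manager outside the development environment.
MASKING_KEY = b"example-key-not-for-real-use"

# Deliberately tiny, made-up name pools. Different customers can map to the
# same fake name, so the name is cosmetic and the token is the join key.
FIRST_NAMES = ["Alex", "Jordan", "Sam", "Taylor", "Morgan", "Riley"]
LAST_NAMES = ["Reed", "Nguyen", "Okafor", "Silva", "Kowalski", "Haddad"]


def _digest(value: str) -> bytes:
    """Keyed hash: the same input always produces the same output."""
    return hmac.new(MASKING_KEY, value.encode("utf-8"), hashlib.sha256).digest()


def mask_name(real_name: str) -> str:
    """Pick a fake first and last name from the pools, driven by the hash."""
    d = _digest(real_name)
    first = FIRST_NAMES[d[0] % len(FIRST_NAMES)]
    last = LAST_NAMES[d[1] % len(LAST_NAMES)]
    return f"{first} {last}"


def mask_account(account_number: str) -> str:
    """Replace the account number with a stable, opaque token."""
    return "acct_" + _digest(account_number).hex()[:16]


def mask_record(record: dict) -> dict:
    """Return a copy with identifying fields masked; other fields pass through."""
    masked = dict(record)
    masked["name"] = mask_name(record["name"])
    masked["account_number"] = mask_account(record["account_number"])
    return masked


if __name__ == "__main__":
    raw = {"name": "Priya Sharma", "account_number": "1234567890", "amount": 40.0}
    print(mask_record(raw))
```

Because the mapping is keyed and deterministic, the same customer gets the same token every time, so joins across tables and per-customer histories keep working. Behavioral fields such as the amount pass through untouched.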

Here is the part that often gets lost in these conversations. The fraud model did not actually need any of the real identifying information to do its job. What it needed to learn was patterns of behavior, things like how often a customer transacts, what kinds of merchants they tend to use, how a given purchase compares to their normal spending. Think of it this way: To spot a fraudulent charge, the model needs to know that “this customer usually spends $40 at coffee shops on weekday mornings, and now there is a $4,000 charge at 3 a.m. from another country.” Whether the customer is named Priya Sharma or a masked stand-in does not change that signal at all.
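To make that concrete, here is a small Python sketch of one such behavioral signal: how far a new charge sits from a customer's usual spending, measured in standard deviations. It is a toy under my own assumptions, not the bank's feature set. The token `acct_3f9a` and the spending history are invented for illustration.

```python
import math
from statistics import mean, pstdev
from typing import Sequence


def amount_zscore(history: Sequence[float], new_amount: float) -> float:
    """Standard deviations between new_amount and the customer's usual spend."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Every past amount was identical: any difference is maximally unusual.
        if new_amount == mu:
            return 0.0
        return math.copysign(math.inf, new_amount - mu)
    return (new_amount - mu) / sigma


# Hypothetical history, keyed by a masked token rather than a name.
history_by_token = {"acct_3f9a": [38.0, 41.5, 40.0, 39.0, 42.0]}

usual = amount_zscore(history_by_token["acct_3f9a"], 40.0)
suspicious = amount_zscore(history_by_token["acct_3f9a"], 4000.0)

if __name__ == "__main__":
    print(f"usual charge: {usual:.2f}, suspicious charge: {suspicious:.2f}")
```

Nothing in the calculation touches a name or a real account number. The key could be any stable identifier, which is why a masked token is enough.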

The model trained on the masked data went into testing and the accuracy numbers came back within a percentage point of what the team had been getting from raw production data. Same model behavior, dramatically smaller blast radius. The team did not lose a sprint over it.

Neither of these is a science project. Data masking and synthetic data generation are techniques the software testing world has used for two decades. The reason they have not been the default in AI development is not technical. It is that nobody has owned the question of whether they should be.

What I would push for

Three things, if I were starting from where most enterprises sit today.

Map the real data flows. Not the documented ones, the actual ones. Walk them with the data engineering and ML teams and trace where production data goes during model development. Include the cloud buckets, the Jupyter notebooks, the exports to third-party annotation platforms, the contractor laptops. That exercise is almost always uncomfortable because it surfaces data in places nobody knew about and nobody is monitoring.

Make masking or synthesis a hard gate, not guidance. If a team needs realistic data to train a model, they should get realistic masked or synthetic data by default. Raw production data crossing the dev boundary should require an exception, not an absence of objection.

Fold data provenance into the AI risk review you almost certainly already have. If the organization is already reviewing models for bias and performance drift, it should also be reviewing where the training data came from and whether it was handled responsibly. These are the same conversations, and treating them as separate is most of the reason this gap keeps getting wider.
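The hard gate in the second item can be small. The Python sketch below shows one hypothetical shape for it: masked and synthetic datasets pass by default, and raw production data passes only with a named approver and an unexpired exception. The class names, the classification labels and the manifest format are my own assumptions for illustration, not a standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical classification labels; an organization would define its own.
ALLOWED_BY_DEFAULT = {"masked", "synthetic"}


@dataclass(frozen=True)
class ExceptionApproval:
    approver: str
    expires: date  # exceptions are time-boxed so "temporary" stays temporary


@dataclass(frozen=True)
class DatasetManifest:
    name: str
    classification: str  # e.g. "raw_production", "masked", "synthetic"
    exception: Optional[ExceptionApproval] = None


class GateError(Exception):
    """Raised when a dataset may not cross into the development environment."""


def check_dev_boundary(manifest: DatasetManifest, today: date) -> None:
    """Return silently if the dataset may enter dev; raise GateError otherwise."""
    if manifest.classification in ALLOWED_BY_DEFAULT:
        return
    approval = manifest.exception
    if approval is None:
        raise GateError(f"{manifest.name}: no approved exception on file")
    if approval.expires < today:
        raise GateError(
            f"{manifest.name}: exception approved by {approval.approver} "
            f"expired on {approval.expires.isoformat()}"
        )
```

A check like this could run wherever data first enters the development environment, for example as the first step of an export job. The point is the default: the absence of an exception blocks the transfer, and every exception carries an owner and an end date.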

The longer this waits, the more copies accumulate, the harder cleanup becomes and the more exposed the organization will be when a regulator or a breach forces the issue.

This article is published as part of the Foundry Expert Contributor Network.