Why LLMs fail science — and what every CPG executive must know

We live in an era where generative AI can draft complex legal agreements in minutes, design plausible marketing campaigns in seconds and translate between dozens of languages on demand. The leap in capability from early machine learning models to today’s large language models (LLMs) — GPT-4, Claude, Gemini and beyond — has been nothing short of remarkable.

It’s no surprise that business leaders are asking: If an AI can write a convincing research paper or simulate a technical conversation, why can’t it run scientific experiments? In some circles, there’s even a whispered narrative that scientists — like travel agents or film projectionists before them — may soon be “disrupted” into irrelevance.

As someone who has spent over two decades at the intersection of AI innovation, scientific R&D, and enterprise-scale product development, I can tell you this narrative is both dangerously wrong and strategically misleading.

Yes, LLMs are transformative.

No, they cannot replace the process of scientific experimentation — and misunderstanding this boundary could derail your innovation agenda, especially in industries like Consumer Packaged Goods (CPG) where physical product success depends on rigorous, reproducible, real-world testing.

Why this matters for CPG leaders

In CPG, especially in food, beverage and personal care, the competitive edge increasingly comes from faster innovation cycles, breakthrough formulations and sustainable product designs.

The temptation to lean heavily on LLMs is understandable: speed to insight is everything.

But here’s the rub — formulation is science, and science is not a language game.

An LLM can describe the perfect dairy-free ice cream base; it cannot prove it will hold texture over a 9-month shelf life, survive transportation or comply with regulatory requirements across 30 markets.

Those proofs come only from empirical experimentation.

The 5 fundamental reasons LLMs cannot do science

1. LLMs lack grounded causality

Science is fundamentally about cause and effect.

You adjust an input variable — ingredient concentration, pH, temperature — and observe how the outcome changes. You refine hypotheses, model the relationships and test again.

An LLM has no access to the causal fabric of the physical world. It learns from statistical patterns in text, not from interacting with reality. Ask it to predict the viscosity of a new emulsion, and it will produce an answer that sounds plausible — because it’s mimicking patterns from its training data — but it has no understanding of the molecular dynamics at play.
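The intervention loop described above can be made concrete with a minimal sketch. Everything here is invented for illustration: `measure_viscosity` stands in for a real lab assay, and the numbers are arbitrary. The point is the shape of the loop — vary one input under controlled conditions, measure the outcome, estimate the effect — which is exactly what a text-pattern model cannot do.

```python
import random

# Illustrative sketch only: measure_viscosity() stands in for a real
# lab measurement; the coefficients and noise level are invented.
random.seed(42)

def measure_viscosity(concentration):
    # Pretend the true effect is 2.0 units per % concentration,
    # observed through instrument noise.
    return 5.0 + 2.0 * concentration + random.gauss(0, 0.1)

# Controlled intervention: vary ONE input, hold everything else fixed.
concentrations = [0.5, 1.0, 1.5, 2.0, 2.5]
responses = [measure_viscosity(c) for c in concentrations]

# Ordinary least-squares slope: the estimated causal effect.
n = len(concentrations)
mean_x = sum(concentrations) / n
mean_y = sum(responses) / n
slope = sum((x - mean_x) * (y - mean_y)
            for x, y in zip(concentrations, responses)) / \
        sum((x - mean_x) ** 2 for x in concentrations)

print(f"estimated effect: {slope:.2f} units per % concentration")
```

The estimate converges on the true effect only because the loop feeds measured outcomes back into the model — the feedback an LLM, reasoning from text alone, never receives.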

Case in point: A recent large-scale study evaluated thousands of research ideas generated by LLMs against human-generated ones. On paper, the AI-generated ideas scored higher for novelty and excitement. In practice? They performed significantly worse when executed in real experiments. The causal gap between “sounds promising” and “works in reality” remains wide.

In CPG R&D, trusting such ungrounded predictions is more than a technical flaw — it’s a brand and safety risk.

2. LLMs cannot interact with the physical world

Science is a contact sport. You mix chemicals, bake prototypes, run machinery and observe results. Sensors measure properties, equipment logs conditions and analysts validate findings. An LLM can’t run a chromatography assay. It can’t measure shelf stability. It can’t taste-test a product, detect microbial growth or watch a new formulation fail in the filler line.

Instead, it produces second-hand knowledge — a language simulation of what has been measured by others in the past. That’s useful for inspiration and planning, but without a direct link to empirical feedback, it is incapable of scientific validation.

Case in point: In healthcare, where the stakes are life and death, a Nature Medicine analysis concluded that LLMs are not yet safe for clinical decision-making. They frequently misinterpret instructions and are sensitive to small changes in input formatting. Medicine, like CPG science, demands physically grounded data. Without it, a model can only offer guesses — and guesses are not enough.

3. LLMs struggle with novel phenomena

The most valuable discoveries in science happen at the edge of the known — where data is sparse or nonexistent. When CRISPR gene editing emerged, it wasn’t an idea floating in published literature for a model to remix. It was an experimental breakthrough achieved by scientists manipulating bacterial immune systems in the lab.

LLMs are interpolation engines — they recombine existing patterns. Faced with a phenomenon no one has recorded before, they can’t generate the underlying truth.

At best, they’ll invent an answer based on analogies — which may sound convincing but have no empirical anchor.

Case in point: Even in a well-documented field like history, nuance trips them up. In the Hist-LLM benchmark — drawn from the Seshat Global History Databank — GPT-4 Turbo scored only 46% accuracy on high-level historical reasoning tasks, barely above chance and riddled with factual errors. If a model struggles to reason about known historical facts, how can we expect it to handle unknown scientific frontiers?

For CPG, this matters because market-winning innovations often require novel formulations that haven’t been documented anywhere. If you’re first to market, there is no prior dataset for an LLM to draw from.

4. LLMs fail the reproducibility test

In science, reproducibility is the gold standard. If a finding can’t be replicated, it doesn’t stand.

LLM outputs, even when prompted identically, can vary from run to run. They can hallucinate — producing confident, specific claims without any verifiable source. Worse, the “source” of an LLM answer is an opaque blend of billions of learned parameters. There’s no experimental logbook, no metadata trail, no conditions record.

Case in point: In the GSM-IC benchmark, simple grade-school math problems were padded with irrelevant details. The result? Accuracy plummeted across models. Small, extraneous changes in input context destabilized performance — a direct violation of reproducibility.

In a regulated industry, you need traceability from hypothesis to final result. LLMs, as they stand today, cannot provide it.
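The run-to-run variance can be illustrated with a toy simulation. This is not a real LLM — the vocabulary and weights are invented — but it mimics the relevant mechanism: decoding is a weighted random draw, so identical "prompts" can produce different outputs, and only explicit seeding (the analogue of a complete experimental-conditions record) restores reproducibility.

```python
import random

# Toy simulation (invented vocabulary and weights, not a real LLM):
# an answer is "sampled" token by token, as LLM decoding is.
vocab = ["stable", "unstable", "thickens", "separates"]
weights = [0.4, 0.3, 0.2, 0.1]

def sample_answer(rng):
    # Draw three weighted tokens, mimicking temperature > 0 decoding.
    return " ".join(rng.choices(vocab, weights=weights, k=3))

# Unseeded runs: the same "prompt" may answer differently each time.
run1 = sample_answer(random.Random())
run2 = sample_answer(random.Random())

# Seeded runs: fixing the seed -- the analogue of logging every
# experimental condition -- makes the output reproducible.
seeded1 = sample_answer(random.Random(7))
seeded2 = sample_answer(random.Random(7))
print(seeded1 == seeded2)  # -> True
```

Production LLM APIs add further variance this toy omits (batching effects, model updates), which is why provenance logging, discussed below, matters even when sampling is nominally deterministic.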

5. LLMs confuse correlation with causation

LLMs excel at finding correlations — but in science, correlation without causation is a trap. It’s the classic “ice cream sales and shark attacks” problem: both go up in summer, but one doesn’t cause the other. In CPG innovation, this risk is acute. An LLM might note that certain emulsifiers are often used in plant-based dairy products with long shelf lives — but that doesn’t mean adding that emulsifier will extend your product’s shelf life.

Case in point: In a benchmark comparing nearly 5,000 LLM-generated science summaries to their source papers, overgeneralization occurred in 26% to 73% of cases depending on the model. The summaries often turned tentative correlations into definitive-sounding claims — exactly the kind of leap scientists are trained to avoid.

Only a designed experiment will tell you if the relationship is causal.
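The shark-attack example can be reproduced numerically. In this sketch (all numbers invented), summer temperature drives both series: the raw correlation between sales and attacks is strong, yet it vanishes once the confounder is controlled for — which is what a designed experiment achieves by construction.

```python
import random

# Invented toy data: temperature is a confounder driving both series.
random.seed(0)

def pearson(xs, ys):
    # Standard Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

temps = [random.uniform(10, 35) for _ in range(500)]      # confounder
sales = [3 * t + random.gauss(0, 5) for t in temps]       # driven by temp
attacks = [0.2 * t + random.gauss(0, 1) for t in temps]   # driven by temp

print(f"raw correlation: {pearson(sales, attacks):.2f}")  # strong

def residuals(xs, ts):
    # Regress xs on ts and return what temperature does NOT explain.
    n = len(ts)
    mt, mx = sum(ts) / n, sum(xs) / n
    b = sum((t - mt) * (x - mx) for t, x in zip(ts, xs)) / \
        sum((t - mt) ** 2 for t in ts)
    a = mx - b * mt
    return [x - (a + b * t) for x, t in zip(xs, ts)]

r = pearson(residuals(sales, temps), residuals(attacks, temps))
print(f"after controlling for temperature: {r:.2f}")      # near zero
```

A text-trained model sees only the raw co-occurrence; identifying and controlling for the confounder requires either domain knowledge or, better, an experiment that randomizes the input.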

What LLMs can do for science — and CPG

If LLMs can’t do science, what can they do for science?

Plenty — as long as we use them with precision. LLMs can:

  • Accelerate literature reviews. They can synthesize hundreds of papers and patents in minutes, surfacing patterns and knowledge that might take human teams weeks to uncover.
  • Assist in hypothesis generation. They can suggest potential variables to test, based on prior art and analogous fields.
  • Support experimental design. They can help outline experimental protocols — to be refined by scientists — saving valuable time in the planning stage.
  • Automate documentation. Drafting lab reports, summarizing experiment outcomes or preparing regulatory submissions can be streamlined dramatically.
  • Enhance cross-disciplinary collaboration. They can translate technical findings into language accessible to marketing, supply chain or executive stakeholders.

Used wisely, LLMs become force multipliers for human scientists — not replacements.

The strategic risk of misuse

Here’s the executive danger: If your teams treat LLM outputs as equivalent to experimental data, you invite bad science at scale. Poor formulations, regulatory setbacks, product recalls — all of these can stem from an overreliance on AI-generated “facts” that were never tested.

The opposite extreme is just as risky: ignoring AI altogether. Competitors who learn to integrate LLMs as accelerators for ideation, documentation and knowledge transfer will outpace those who don’t.

The winning middle ground is AI-augmented experimentation — combining the speed and reach of LLMs with the rigor and certainty of empirical science.

A blueprint for responsible AI use in CPG R&D

To strike this balance, I recommend CPG leaders adopt a structured framework:

1. Separate ideation from validation

  • Allow LLMs to generate ideas, hypotheses and design options.
  • Require all experimental claims to pass through lab validation before use.

2. Establish AI provenance rules

  • Document all AI-assisted work, including prompts and versions used.
  • Create a clear chain from suggestion to validation.
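One way such provenance rules might be implemented is a structured record attached to every LLM suggestion. The sketch below is illustrative only — the field names, model identifier and experiment ID are all invented — but it captures the chain: prompt and model version in, validation status out.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Illustrative provenance record (all names invented): every
# LLM-assisted suggestion is logged with the prompt and model version
# that produced it, and stays "UNVALIDATED" until a lab run confirms it.
@dataclass
class ProvenanceRecord:
    prompt: str
    model_version: str
    suggestion: str
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())
    validation_status: str = "UNVALIDATED"
    lab_experiment_id: Optional[str] = None

    def record_id(self) -> str:
        # Stable hash tying the suggestion to its exact prompt + model.
        payload = json.dumps([self.prompt, self.model_version,
                              self.suggestion])
        return hashlib.sha256(payload.encode()).hexdigest()[:12]

    def mark_validated(self, experiment_id: str) -> None:
        self.validation_status = "LAB_VALIDATED"
        self.lab_experiment_id = experiment_id

rec = ProvenanceRecord(
    prompt="Suggest emulsifiers for a dairy-free ice cream base",
    model_version="example-llm-2025-01",      # assumed identifier
    suggestion="Trial sunflower lecithin at 0.3-0.5%",
)
rec.mark_validated("EXP-2025-0142")           # hypothetical lab run
print(rec.record_id(), rec.validation_status)
```

The content hash gives each suggestion a stable identifier that a lab information management system can reference, closing the loop from AI suggestion to empirical result.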

3. Build AI literacy in R&D teams

  • Train scientists and engineers on both the strengths and limits of LLMs.
  • Ensure they can distinguish language-based plausibility from physical truth.

4. Integrate with digital R&D platforms

  • Connect LLM tools to lab data management systems for traceability.
  • Avoid standalone “chatbot” use that’s disconnected from the experimental record.

5. Measure impact responsibly

  • Track how LLMs affect R&D speed, cost and quality — not just output volume.

Why this is a C-suite conversation

The question of whether LLMs can “do science” is not just a technical one — it’s a strategic one.
In the next decade, the companies that dominate CPG will be those that marry AI speed with scientific integrity.

That requires leadership from the top. Your role as an executive is to set the guardrails, invest in the right infrastructure and empower your teams to innovate safely and effectively.

The bottom line

LLMs are extraordinary — but they are not experimental scientists. Treating them as such risks your brand, your product pipeline and your consumers’ trust.

The future of innovation in CPG lies in AI-empowered human experimentation — where LLMs amplify human insight, but never replace the physical testing and validation that science demands.

If you’re building your next-gen R&D strategy, remember: Use LLMs to accelerate science, not to replace it. The difference could define your competitive position for the next decade.

This article is published as part of the Foundry Expert Contributor Network.
September 2, 2025