I spent most of 2024 convinced I knew where the hard work was.
I was building Flow Orchestra, an AI-powered content workflow platform. Solo. All of it: the retrieval components, the generation agents, the scheduling layer, the autonomous content pipelines. From where I was sitting, the engineering challenge was obvious. Get the language models to do what I needed them to do. Prompt architecture. Model selection. That’s where the complexity lived, or so I thought.
I was completely wrong.
The models worked fine. They were, in fact, the easy part. The orchestration layer nearly broke me. Getting every agent in the system to correctly receive, interpret and pass context to the next one in the chain consumed weeks I hadn’t planned for, forced me to rethink the core architecture and taught me more about how multi-agent AI systems actually fail than anything else in thirty years of working with technology.
Context passing between agents. That was the problem. It sounds like a plumbing detail. It isn’t. It’s the difference between a system that works at scale and one that produces excellent demos that fall apart in production.
I’m sharing this because most organizations building AI systems right now are making the exact same mistake, with larger teams and significantly more money at stake.
The wrong question is still winning
Walk into any enterprise AI planning meeting and the conversation centers on models. Which LLM? Proprietary or open source? Fine-tuned or base? These are legitimate questions. They aren’t the questions that determine whether your AI system works across a business function at production scale.
Deloitte’s 2026 State of AI in the Enterprise report, drawn from a survey of 3,235 senior leaders across 24 countries, found that only 20 percent of organizations are seeing actual revenue impact from their AI investments, while 74 percent say revenue growth is still an aspiration. One widely cited analysis of enterprise deployments puts the pilot-to-production success rate at just 12 percent. The models in those failed projects weren’t the problem. They were often excellent. The problem was everything around them: the coordination infrastructure, the workflow design, the architecture connecting agent to agent.
The model isn’t your competitive advantage. The orchestration layer is. Most organizations are still optimizing the wrong thing.
You can’t orchestrate a broken workflow
Here’s a pattern I’ve watched play out many times. An organization has a workflow that runs inefficiently: slow approvals, documents lost between systems and handoffs between teams that produce duplication and errors. Then they decide to layer AI on top of it.
They build a retrieval agent, add a generation component and wire in automation. The demo looks exactly like the slide deck promised.
Then they push to production. And it fails in new, faster, more expensive ways.
Here’s what happened: the AI faithfully automated a broken process. It now does at machine speed what humans were doing badly at human speed. Context passed between stages arrives incomplete. Tasks route to the wrong places. Errors that previously took a week to compound now compound in minutes.
The AI didn’t create these problems. It amplified the ones already there. You can’t bolt coordination infrastructure onto a process that doesn’t make sense. The workflow has to be redesigned first. That sequencing is non-negotiable and most organizations skip it.
When I rebuilt the orchestration layer at Flow Orchestra, I got clear on what the non-negotiables actually were. There were three.
The first is a defined context contract between agents. Every agent in the system has to know exactly what it receives from the previous step, what it’s expected to produce and what format information travels in across the pipeline. This isn’t a prompt engineering decision. It’s an architectural decision, and it has to happen before anything else gets built. Without it, you’re hoping each agent correctly interprets what the last one meant. At small scale, that hope sometimes holds. At production scale, it doesn’t.
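To make the idea concrete, here is a minimal Python sketch of a context contract at a single handoff. It is an illustration under stated assumptions, not the Flow Orchestra implementation: the names `RetrievalOutput` and `generation_agent` and the fields they carry are hypothetical. The point is that the shape of the handoff is declared and checked before the next agent runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalOutput:
    """Hypothetical contract: what a retrieval agent hands to a generation agent."""
    query: str
    passages: tuple[str, ...]
    source_ids: tuple[str, ...]

    def __post_init__(self):
        # Fail at the handoff rather than letting the next agent guess.
        if not self.query.strip():
            raise ValueError("query must be non-empty")
        if not self.passages:
            raise ValueError("at least one passage is required")
        if len(self.passages) != len(self.source_ids):
            raise ValueError("each passage needs exactly one source id")


def generation_agent(payload: RetrievalOutput) -> str:
    # Placeholder for a model call (assumed); the input is typed and validated.
    return f"Draft for '{payload.query}' using {len(payload.passages)} passage(s)"
```

A malformed handoff now raises at the boundary where it happened, instead of surfacing three agents later as a subtly wrong draft.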
The second is a routing layer that isn’t just another language model. Most teams build an orchestrator agent to coordinate the others, and that orchestrator is itself a large language model making routing decisions with all the probabilistic variability that comes along with it. For business-critical workflows, that’s a liability. Routing logic needs to be deterministic where determinism matters: rules, classifiers, workflow engines. The model handles the language. The routing layer handles the logic. These shouldn’t be the same component. I’ve seen production systems fail at scale precisely because the orchestrator was brilliant at understanding language and inconsistent at routing reliably under volume.
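As a rough sketch of that separation, the routing step can be an ordered rule table evaluated by plain code. Everything named here is an assumption made for illustration: the task fields `intent` and `contains_contract`, and the three destination names. A model or classifier upstream might fill in those fields, but the decision about where the task goes is a pure function of them.

```python
from typing import Callable


# Hypothetical handlers; in a real system each would wrap an agent or a queue.
def handle_refund(task: dict) -> str:
    return "refund_agent"


def handle_legal_review(task: dict) -> str:
    return "legal_review_queue"


def handle_default(task: dict) -> str:
    return "general_agent"


# Ordered rules: the first matching predicate wins, so routing is repeatable.
RULES: list[tuple[Callable[[dict], bool], Callable[[dict], str]]] = [
    (lambda t: t.get("contains_contract", False), handle_legal_review),
    (lambda t: t.get("intent") == "refund", handle_refund),
]


def route(task: dict) -> str:
    for predicate, handler in RULES:
        if predicate(task):
            return handler(task)
    return handle_default(task)
```

The same task always lands in the same place, and the rule order can be read, reviewed and tested like any other piece of business logic.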
The third is a memory layer that survives agent transitions. Context that crosses three agents in a pipeline has to make it through each hop intact. That means external state stored outside the agents themselves: session stores, structured databases that every agent in the chain reads from and writes to consistently. If your agents only have access to what’s in their immediate context window, your system forgets at exactly the wrong moments. And it won’t tell you it’s forgotten. It will produce subtly wrong outputs until something obviously breaks downstream.
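One way to picture external state is a small keyed store that every agent reads from and writes to. The sketch below uses SQLite from the Python standard library purely for illustration. The `SessionStore` class, its table layout and the key names are assumptions, and a production system would more likely sit on a shared database or a dedicated session service.

```python
import json
import sqlite3


class SessionStore:
    """Hypothetical shared state store that lives outside any single agent."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS context ("
            "session_id TEXT, key TEXT, value TEXT, "
            "PRIMARY KEY (session_id, key))"
        )

    def write(self, session_id: str, key: str, value) -> None:
        # Values are stored as JSON so every agent sees the same structure.
        self.conn.execute(
            "INSERT OR REPLACE INTO context VALUES (?, ?, ?)",
            (session_id, key, json.dumps(value)),
        )
        self.conn.commit()

    def read(self, session_id: str, key: str):
        row = self.conn.execute(
            "SELECT value FROM context WHERE session_id = ? AND key = ?",
            (session_id, key),
        ).fetchone()
        if row is None:
            # Surface the gap instead of letting an agent continue without context.
            raise KeyError(f"missing context {key!r} for session {session_id!r}")
        return json.loads(row[0])
```

The design choice that matters is the `KeyError`: when context is missing, the read fails loudly at the hop where the gap appears rather than letting the agent carry on without it.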
These aren’t glamorous components. Nobody gives conference talks about context contracts. But they’re what the working systems have that the failed ones didn’t.
Start by drawing the context flow
Before your team builds another agent or evaluates another model, map your context flow.
Not the task flow or the feature list, but the context flow.
Draw every agent in your system. Draw what information enters each one. Draw what it produces and what it passes forward. Draw what happens to shared understanding at each transition. Draw what happens when one step fails. The diagram doesn’t need to be pretty. It needs to be honest. Where does one agent hand off to the next? What breaks if that handoff fails? And when it does, does the system recover or just quietly produce garbage? Those three questions will tell you more about your architecture than any technical review. If you can answer them on paper, you’re ready to build. If you can’t, you’re not. If you’ve already built and still can’t answer them, you’ve found your problem.
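The same exercise can be written down as data once the drawing exists. The Python sketch below assumes a hypothetical three-stage pipeline (`retrieve`, `generate`, `schedule`) with made-up context keys. It checks two things for each stage: that everything it needs is produced somewhere upstream, and that someone has decided what happens when it fails.

```python
# Hypothetical pipeline description: what each stage needs, what it produces,
# and what should happen when it fails.
FLOW = [
    {"name": "retrieve", "needs": {"brief"}, "produces": {"passages"}, "on_failure": "retry"},
    {"name": "generate", "needs": {"brief", "passages"}, "produces": {"draft"}, "on_failure": "halt"},
    {"name": "schedule", "needs": {"draft", "publish_date"}, "produces": {"job_id"}, "on_failure": None},
]


def audit_flow(flow, initial_context):
    """Return a list of gaps: inputs nobody produces and undefined failure paths."""
    available = set(initial_context)
    problems = []
    for stage in flow:
        missing = stage["needs"] - available
        if missing:
            problems.append(f"{stage['name']}: missing input {sorted(missing)}")
        if not stage["on_failure"]:
            problems.append(f"{stage['name']}: no failure behavior defined")
        available |= stage["produces"]
    return problems


for problem in audit_flow(FLOW, {"brief"}):
    print(problem)
```

In this made-up flow the audit flags the `schedule` stage twice: nothing upstream produces `publish_date`, and nobody has said what happens if scheduling fails.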
Every CIO I talk to who’s frustrated with their AI deployment has the same presenting symptom: their agents work in isolation and fail in combination. The fix is never a better model. It’s always the same: go back to the context flow and design it like the infrastructure it actually is.
In two years, nobody’s going to remember which model they picked in 2025. They’re going to remember whether their systems actually worked.
Build the air traffic control system. Start by drawing the context flow.
This article is published as part of the Foundry Expert Contributor Network.

