The one-model trap: Why agentic AI won’t scale in production

Whenever I see a new agent project kick off, I can almost always predict the first architecture decision: pick one monolithic model, wire it to some tools, and then tune prompts until something works. I have been there myself. It feels clean. It keeps procurement simple. It gives teams one benchmark to watch.  

It also breaks down as soon as you start to see any real traffic. 

Production agents don’t fail because the model is “bad.” They fail because the operating environment is messy: requests change shape, latency budgets conflict, tools flake out, costs spike, policy constraints shift and failure modes compound. A single-model architecture concentrates all of those problems on one point of failure. In practice, that becomes an availability risk, a cost risk and a governance risk over time.

The thing that changed my mind was moving from demo success metrics to operational success metrics. In demos, I cared about “did the model answer correctly?” In production, I had to care about “did the whole system complete safely, on time, and at an acceptable unit cost?” That is a different question, and it demands a different design. 

The failure mode is not ‘intelligence,’ it is variance

A lot of engineering teams approach model choice as a leaderboard problem: pick the model with the highest quality score, then standardize. That works as far as it goes, but agent workloads are not narrow. They are a distribution of tasks with very different complexity profiles.

For one product I worked on, around 70% of user tasks were routine classification, retrieval and transformation. Another 20% needed moderate reasoning with interleaved tool use. The final 10% were hard edge cases that required long context, planning and retries. We first tried to route all of that through one big model because it gave the best average quality in demos and tests. The result was entirely predictable: we paid premium cost and latency on the simple tasks and still saw brittle behavior on the hardest 10%.

The core problem was not average quality, but variance. Production traffic has spikes, tool outages and adversarial users. If every request must depend on one model with one latency curve and one pricing curve, then your tail behavior will dominate your user experience. In practice, your p95 and p99 are what people remember. 

This is one reason why operational guidance like NIST’s AI Risk Management Framework ends up mattering in agent design: it pushes teams to think about reliability, monitoring and governance as first-class concerns, not post-launch cleanup. Once you start to frame agents as risk-bearing systems, single-model centralization starts to look a lot like technical debt you are knowingly incurring. 

I have also found that single-model setups make incident response slower. If model quality drops, is it a model update issue, prompt regression, retrieval drift, tool contract breakage, context truncation or an evaluation blind spot? With one giant pathway, everything is coupled. Coupling is expensive during incidents.  

Production agents are systems, not prompts  

The mental shift that finally stuck with my team is this: an agent is an orchestrated system with policies, not a prompt that just happens to call tools. Once you accept that, multi-model design starts to feel less like complexity for the sake of it, and more like the kind of systems engineering you would expect anywhere else.

For reasoning flows, I often borrow the patterns from the ReAct paper: interleave thinking and acting, and ground decisions in tool results. In production, I find that pattern works better when you decouple roles across models. For example:

  • A small fast model for intent detection, policy checks and tool argument normalization. 
  • A medium model for most retrieval-grounded synthesis.  
  • A high-capability model reserved for escalations, ambiguous requests or high-impact outputs. 
  • A deterministic layer for guardrails, schema validation and redaction no matter which model you use. 

The core idea is to create isolation boundaries. If the high-capability model hits an outage or a cost spike, core traffic still flows through the lower tiers with graceful degradation. If the small model misroutes a fraction of tasks, fallbacks and confidence thresholds recover with degraded behavior rather than total failure.
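As a minimal sketch of that tiering-with-fallback idea: the task classes, confidence threshold and `call_model` helper below are hypothetical placeholders I am inventing for illustration, not any provider's API or the exact routing logic we shipped.

```python
from dataclasses import dataclass

# Hypothetical capability tiers, ordered cheapest to most capable.
TIERS = ["fast-cheap", "balanced", "premium"]

@dataclass
class Task:
    kind: str          # e.g. "classify", "retrieve", "transform", "plan"
    confidence: float  # router's confidence in its own classification
    high_impact: bool  # e.g. touches money, PII or irreversible actions

def pick_tier(task: Task) -> str:
    """Route by task class, confidence and impact; escalate when unsure."""
    if task.high_impact or task.confidence < 0.6:  # illustrative threshold
        return "premium"
    if task.kind in ("classify", "retrieve", "transform"):
        return "fast-cheap"
    return "balanced"

def run(task: Task, call_model) -> str:
    """Try the chosen tier, then degrade one tier at a time on failure."""
    start = TIERS.index(pick_tier(task))
    for tier in TIERS[start::-1]:  # walk downward: premium -> balanced -> fast-cheap
        try:
            return call_model(tier, task)
        except Exception:
            continue  # isolation boundary: one tier's outage is not total failure
    return "fallback: safe canned response or human handoff"
```

The point of the sketch is the shape, not the numbers: routing and fallback live in the control layer, so a tier can fail without the request failing.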

Observability is equally important here. Agent teams often log final answers and call that monitoring. That is a poor substitute for real observability signals. You need traces across orchestration steps, tool calls, retrieval versions and policy decisions. I default to OpenTelemetry-style distributed tracing because it makes model routing issues visible fast. If you don’t have that, you are debugging by anecdote.
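A minimal sketch of what that looks like with the OpenTelemetry Python API; the span and attribute names are illustrative rather than a standard agent schema, and the retrieval and synthesis helpers are hypothetical stubs.

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent.orchestrator")

def fetch_documents(request):   # hypothetical retrieval helper
    return []

def synthesize(request, docs):  # hypothetical generation helper
    return "answer"

def answer(request):
    # Without a configured TracerProvider these spans are no-ops,
    # but the API calls below are standard OpenTelemetry Python.
    with tracer.start_as_current_span("agent.request") as span:
        span.set_attribute("route.tier", "balanced")  # which tier handled it
        with tracer.start_as_current_span("retrieval") as ret:
            ret.set_attribute("retrieval.index_version", "v42")
            docs = fetch_documents(request)
        with tracer.start_as_current_span("model.call"):
            return synthesize(request, docs)
```

With spans like these, a misrouted request or a retrieval regression shows up as a trace you can query, not a hunch.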

One other hard lesson is that governance policies change far faster than model contracts. Legal or security teams can require new redaction rules, retention windows or prohibited actions with little or no notice. If one model is deeply embedded in every stage of every reasoning flow, policy changes become large, painful migrations. In a multi-model architecture with clean interfaces, policy changes are mostly routing and control-plane updates.

A practical multi-model architecture that actually survives operations 

For teams that ask me how to start and avoid overengineering, I suggest a staged approach that keeps complexity proportional to risk. 

  1. Stage 1: Separate control from generation. Maintain a control layer for routing, policies, budgets and retries. Keep generation models stateless behind well-defined interfaces. This lets you swap models without changing business logic.
  2. Stage 2: Capability tiering. Define at least three classes: fast-cheap, balanced and premium reasoning. Route based on task class, confidence and impact. If confidence is low or the action is high risk, escalate. If the request is routine, keep it in the lower tiers.
  3. Stage 3: Failure-aware execution. Build explicit timeouts, circuit breakers and fallback responses for every external dependency: model APIs, vector stores, internal tools and identity services. If retrieval fails, answer with bounded behavior instead of pretending certainty. If a high-end model is unavailable, degrade to a human handoff path when needed.
  4. Stage 4: Production-like evaluation. Offline benchmark numbers are useful, but they are not enough for agent systems. You need scenario suites with real tool behavior, delayed dependencies and policy edge cases. I require per-route metrics for success rate, p95 latency, token cost, escalation rate and policy violations. Only that level of instrumentation lets you tune routing thresholds responsibly.
  5. Stage 5: Economic controls. Most agent cost overruns do not come from a single very expensive call. They come from retries, long contexts and recursive tool loops. Set per-session and per-step token budgets, cap retries by route, and enforce stop conditions in your planners. Cost governance should be automatic, not a monthly surprise (see the sketch after this list).
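Here is a rough sketch of stages 3 and 5 combined: a circuit breaker plus retry and token budgets around a single model call. The limits and the `call_model` client are invented for illustration, not recommendations.

```python
import time

# Hypothetical per-route limits; the numbers are illustrative only.
MAX_RETRIES = 2
STEP_TOKEN_BUDGET = 4_000
SESSION_TOKEN_BUDGET = 50_000
BREAKER_THRESHOLD = 5      # consecutive failures before the route opens
BREAKER_COOLDOWN_S = 30.0

class CircuitBreaker:
    """Open a route after repeated failures; allow a probe after a cooldown."""
    def __init__(self):
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        if self.failures < BREAKER_THRESHOLD:
            return True
        return time.monotonic() - self.opened_at > BREAKER_COOLDOWN_S

    def record(self, ok: bool):
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            self.opened_at = time.monotonic()

def call_with_guards(route, breaker, session_tokens_used, call_model):
    """Enforce budgets, retries and the breaker around one model call."""
    if session_tokens_used >= SESSION_TOKEN_BUDGET:
        return "fallback: session budget exhausted"         # automatic cost governance
    if not breaker.allow():
        return "fallback: route open, degrade or hand off"  # bounded behavior
    for _ in range(MAX_RETRIES + 1):
        try:
            out = call_model(route, max_tokens=STEP_TOKEN_BUDGET)
            breaker.record(ok=True)
            return out
        except TimeoutError:
            breaker.record(ok=False)
    return "fallback: retries exhausted"
```

The design choice that matters is that every exit path is bounded: budget exhaustion, an open breaker and exhausted retries all return a defined fallback instead of looping or pretending certainty.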

The objection I hear most often is that multi-model setups are harder to govern. In my experience, the opposite is usually true if the architecture is explicit enough. Governance is hard when the behavioral surface is hidden in prompt text. Governance is tractable when routing decisions, policy checks and escalation criteria are visible, versioned and testable.

Another objection is that multiple providers or model families increase vendor lock-in risk. That is a fair concern, but my experience is that lock-in risk is lower when you maintain an internal model abstraction and keep prompts, evaluation harnesses and tool schemas portable. Single-model stacks often feel simpler to start, then become deeply coupled to provider-specific behavior over time.
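A minimal sketch of such an internal abstraction, assuming a hypothetical `ModelClient` interface and adapter rather than any provider's actual SDK:

```python
from typing import Protocol

class ModelClient(Protocol):
    """Internal model interface; providers are wired in behind adapters."""
    def complete(self, prompt: str, max_tokens: int) -> str: ...

class VendorAAdapter:
    """Hypothetical adapter; the real call would wrap the vendor's SDK."""
    def complete(self, prompt: str, max_tokens: int) -> str:
        raise NotImplementedError("wrap the vendor SDK here")

# Routing, prompts and eval harnesses depend only on ModelClient,
# so swapping providers is an adapter change, not a rewrite.
```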

The final question I am always asked is: When is one model still fine? One model is fine for low-volume internal copilots, non-critical workflows or early prototypes with a narrow task scope. It is not a sustainable default for customer-facing agents with uptime, compliance and cost targets.

If I had to summarize in one sentence, it would be: production agent scalability is a control-plane problem that is commonly misdiagnosed as a model-choice problem. A single model can be a brilliant model and still fail your system goals. A multi-model architecture with strong routing and policy controls is the only approach I have seen scale for quality, reliability and cost at the same time.

Disclaimer: The views and opinions expressed in this article are solely those of the author and do not necessarily represent the views, policies, or positions of any organization or employer. 

This article is published as part of the Foundry Expert Contributor Network.