Tiatra, LLC
Information Technology Solutions for Washington, DC Government Agencies
Delivering resilience and continuity for AI

Infrastructure might be the reason so many organizations report failures when scaling AI from POC to production. Almost every company in Microsoft’s latest State of AI Infrastructure report spoke of challenges scaling and operationalizing AI, and over half of more than 1,500 business leaders from various sectors and regions said they don’t have the right infrastructure to support the AI workloads they want to run — a proportion repeated in other surveys.

Building, deploying, and operationalizing AI models is when you find out how modern your infrastructure really is and where it lets you down. “Running AI on legacy architecture is like streaming 4K video over dial-up,” says Frank Miller, chief AI and platforms officer at digital infrastructure company Colt Technology Services. “You can convince yourself it will work, but the reality is very different.”

If you don’t want to be stuck firefighting just to keep the AI you’ve spent so much on available, you need both governance and modern architecture. “This means replacing rigid legacy systems with hybrid, cloud-native designs that scale for AI workloads,” he adds. “High-bandwidth, low-latency connectivity ensures fast data access; redundancy and automated failover provide continuity; and zero-trust security with encryption protects sensitive AI flows. Adding observability and predictive monitoring helps anticipate issues before they disrupt operations, creating an infrastructure that’s resilient, secure, and ready for AI innovation.”

Think of it as technical debt, suggests IDC group VP Daniel Saroff, as most enterprises underestimate the strain AI puts on connectivity and compute. Siloed infrastructure won’t deliver what AI needs, and CIOs have to think about these and other factors in a more integrated way to make AI successful. “You have to look at your GPU infrastructure, bandwidth, network availability, and connectivity between respective applications,” he says. “If you have environments not set up for highly transactional, GPU-intensive workloads, you’re going to have a problem. And having very fragmented infrastructure means you need to pull data from and integrate multiple different systems, especially when you start to look at agentic AI.”

Training, RAG, and agent workflows assume that data isn’t only correct but always reachable and never behind a bottleneck. API technologies like the Model Context Protocol (MCP) are emerging as a way to standardize access to data, and legacy systems may not support them easily, he adds.

Get good at GPUs

Resilience is hardly a new idea for enterprise IT. High availability, failover, and disaster recovery are so universally required that one of the first six agents Microsoft added to Azure Copilot is there specifically to improve resiliency in the cloud. On premises, enterprises have decades of experience with infrastructure to draw from, but that rarely includes the expensive GPUs and other accelerators that are key to AI, whether you’re training or running inferencing.

GPUs are also more demanding, whether that’s the added complication of needing to autoconfigure GPU Kubernetes clusters with the right drivers and operators, or building out dedicated AI infrastructure that’s harder to service and needs high-speed networking for distributed traffic with unfamiliar, fast-changing patterns.
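One concrete thing that autoconfiguration has to get right is checking that every GPU node actually exposes its accelerators to the scheduler. The sketch below assumes node status has been exported as dictionaries (for example from `kubectl get nodes -o json`) and uses the NVIDIA device plugin’s resource name; both are illustrative assumptions, not a prescribed tool.

```python
# Sketch: flag GPU nodes whose allocatable accelerator count lags capacity,
# which usually means a failed driver or device plugin. Field names follow
# the NVIDIA device plugin convention ("nvidia.com/gpu"); adjust per vendor.

def unhealthy_gpu_nodes(nodes: list[dict], resource: str = "nvidia.com/gpu") -> list[str]:
    """Return names of nodes reporting fewer allocatable GPUs than capacity."""
    bad = []
    for node in nodes:
        capacity = int(node["status"]["capacity"].get(resource, 0))
        allocatable = int(node["status"]["allocatable"].get(resource, 0))
        if capacity and allocatable < capacity:
            bad.append(node["metadata"]["name"])
    return bad

nodes = [
    {"metadata": {"name": "gpu-a"},
     "status": {"capacity": {"nvidia.com/gpu": "8"},
                "allocatable": {"nvidia.com/gpu": "8"}}},
    {"metadata": {"name": "gpu-b"},  # driver or plugin failed: 0 allocatable
     "status": {"capacity": {"nvidia.com/gpu": "8"},
                "allocatable": {"nvidia.com/gpu": "0"}}},
]
print(unhealthy_gpu_nodes(nodes))  # ['gpu-b']
```

Running this kind of check continuously, rather than at provisioning time only, is what catches the silent failures that make GPU clusters brittle.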

“Building GPU infrastructure is really hard,” says Jason Hammons, VP of international systems engineering at VAST Data. “It’s brittle, in large part due to its massively parallel nature, but also because of the componentry. They’re just way more complex.”

AI demands high-bandwidth networks with low and, critically, predictable latency to deliver large payloads of data and small payloads of inferences and API calls. That might mean at least part of your enterprise network looking more like what’s in a cloud data center, perhaps with SmartNICs, InfiniBand, or RoCE, and programmable network operating systems like SONiC, as well as stable links with direct routes to AI data centers and cloud APIs.

If enterprises have high-speed networking internal to the GPU cluster itself, they can deliver a good AI experience, Hammons says, but building agents is even more demanding in terms of storage and networking. “When you start scaling agentic workloads, because of the complex I/O patterns they exhibit, the complicated nature of keeping those systems up can be exacerbated,” he says.

Intelligent routing and underlay optimization matter more for AI, and load balancing becomes more important than ever, requiring adaptive routing and dynamic, multipath I/O so one congested or unhealthy path doesn’t interrupt an AI pipeline. You have to give AI traffic high enough priority to support your workloads without getting in the way of critical production systems like ERP and payment services, or VoIP and video meetings.
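The path-selection policy described above can be sketched as a small function that steers traffic away from congested or failed links. The path names, health flags, and the 80% utilization threshold are all illustrative assumptions, not part of any real routing controller.

```python
# Sketch: adaptive multipath selection. Prefer the healthy path with the
# most spare capacity; refuse paths that are down or above a congestion
# threshold, so one bad link never stalls the AI pipeline silently.

def pick_path(paths: list[dict], max_util: float = 0.8) -> str:
    """Choose the healthy path with the lowest utilization."""
    candidates = [p for p in paths if p["healthy"] and p["utilization"] < max_util]
    if not candidates:
        raise RuntimeError("no healthy path below utilization threshold")
    best = min(candidates, key=lambda p: p["utilization"])
    return best["name"]

paths = [
    {"name": "spine-1", "healthy": True,  "utilization": 0.92},  # congested
    {"name": "spine-2", "healthy": False, "utilization": 0.10},  # link down
    {"name": "spine-3", "healthy": True,  "utilization": 0.35},
]
print(pick_path(paths))  # spine-3
```

Real multipath I/O makes this decision per flow and continuously, but the principle is the same: routing decisions use live health and load, not static weights.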

“AI workflows are much more network-based,” says Artur Bergman, CTO at software developer Fastly. “You have to scale across machines and that’s quite a big shift from enterprise workloads that don’t have those levels of network or latency requirements.”

It’s no longer just about avoiding critical failures or recovering from them fast. You also have to design systems for graceful degradation so they can still perform well enough when there are failures.
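A minimal sketch of what graceful degradation can look like for an inference endpoint, assuming a hypothetical tiered setup (primary model, smaller fallback model, response cache). The model functions here are stand-ins for illustration, not a real API.

```python
# Sketch: a degradation chain for inference. If the primary model fails,
# try a smaller fallback model, then a cached answer, instead of failing
# the request outright.

def degrade_gracefully(prompt, primary, fallback, cache):
    """Return (tier_name, answer) from the first tier that succeeds."""
    for tier, fn in (("primary", primary), ("fallback", fallback)):
        try:
            return tier, fn(prompt)
        except Exception:
            continue  # fall through to the next tier
    if prompt in cache:
        return "cache", cache[prompt]
    raise RuntimeError("all tiers exhausted")

def big_model(prompt):
    raise TimeoutError("GPU pool saturated")  # simulate an outage

def small_model(prompt):
    return f"(smaller model) answer to: {prompt}"

tier, answer = degrade_gracefully("status?", big_model, small_model, {})
print(tier)  # fallback
```

The point is that failure handling is a designed-in response hierarchy, not an exception page.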

Similarly, resilient AI needs more than the synchronous replication you’re used to having for any production workload. “A lot of these systems need to be load balanced across sites and have that redundancy across multiple domains,” Hammons says. The complexity of that has even sophisticated organizations turning to providers like CoreWeave and what he calls AI-native neoclouds.

Taking a hybrid approach to AI is almost universal. Whether you’re bursting out to an AI data center, building on hyperscaler GPU infrastructure and cloud databases, or calling cloud APIs, you need to think about those connections. That means updating legacy networks and considering multiple connectivity providers for redundancy.

And if you’re doing AI at the edge, especially in near-real-time environments like factories and retail, you also have to think about distributed reliability, and what connectivity and latency are needed to deliver inferencing or update local models across sites for consistency.

“Cross-cloud communication is just going to grow,” Bergman says. Fastly customers are already keeping training set data there so they can use it in multiple clouds. “We can ingress it to all the clouds without the cloud egress charges.”

Authenticating agent access and privileges when agents act on behalf of employees may add complexity in the future, he suggests. That doesn’t require low-level network changes, but at the application layer he predicts a lot of evolution will have to happen for these things to scale out in a secure, reliable way.

Flatten your architecture

Most AI adoption today is happening on architectures never designed for this level of volatility, says Richard Copeland, CEO of cloud services provider Leaseweb. “Everyone wants the magic of AI, but the moment they scale it, they’re confronted with the messy reality of data gravity, latency budgets, and storage economics,” he adds. “Teams are trying to secure endpoints, expand pipelines, add GPUs, and increase bandwidth but none of that stops the operational chaos if the foundation beneath it isn’t intentionally resilient.”

You’ll almost certainly need more storage to support AI and not just for training sets, he points out. “You’re storing embeddings, vector indexes, model checkpoints, agent logs, synthetic datasets, and the agents themselves are producing new data every second,” he says. So spend the time to work out how much of that you actually need to store, where, and for how long.
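One way to make those storage decisions explicit is a declarative retention policy. The artifact types below come from the paragraph above, while the tiers and retention windows are invented for illustration, not recommendations.

```python
# Sketch: an explicit placement and retention policy for AI artifacts,
# so storage growth is a decision rather than an accident.

from datetime import timedelta

POLICY = {
    "embeddings":        {"tier": "hot",     "retain": timedelta(days=90)},
    "vector_index":      {"tier": "hot",     "retain": None},  # rebuildable: keep latest only
    "model_checkpoint":  {"tier": "warm",    "retain": timedelta(days=30)},
    "agent_log":         {"tier": "cold",    "retain": timedelta(days=365)},
    "synthetic_dataset": {"tier": "archive", "retain": timedelta(days=180)},
}

def placement(kind: str, age: timedelta) -> str:
    """Return the artifact's tier, or 'expire' once it outlives its retention."""
    rule = POLICY[kind]
    if rule["retain"] is not None and age > rule["retain"]:
        return "expire"
    return rule["tier"]

print(placement("model_checkpoint", timedelta(days=45)))  # expire
print(placement("agent_log", timedelta(days=45)))         # cold
```

A table like this is also something lifecycle automation can enforce, which is what keeps agents that “produce new data every second” from quietly filling the hot tier.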

But designing for continuity means treating resilience as a design principle, not an insurance policy. Organizations that stay ahead are flattening architectures, pushing compute closer to data, automating lifecycle policies, and building environments where AI pipelines can fail over without anyone breaking a sweat, says Copeland.

Flatter architectures also reduce technical debt, but most enterprises have accumulated so many layers of tools, proxies, queues, storage tiers, and checkpoints that their AI pipelines behave like Rube Goldberg machines, he adds. “Data has to climb up and down that stack before it reaches the models that need it, and every hop adds latency, fragility, and operational overhead,” he says.

Find out where delays are coming from and you may find systems you don’t need. “Remove redundant middleware, automate data-placement and lifecycle policies, and shift workloads toward the environments where the data already lives,” he continues. Consolidating storage tiers, moving GPU workloads into simpler regional or on-premises environments, and tuning the network path should yield a system that behaves predictably rather than chaotically.

Data by design

Making AI scale will almost certainly mean taking a hard look at your data architecture. Every database is adding features for AI, and lakehouses promise you can bring operational data and analytics together without affecting the SLAs of production workloads. Or you can go further with data platforms like Microsoft Fabric that bring in streaming and time-series data to use for AI applications.

If you’ve already tried different approaches, you likely need to rearchitect your data layer to get away from the operational sprawl of fragmented microservices, where every data hand-off between separate vector stores, graph databases, and document silos introduces latency and governance gaps. Too many points of failure make it hard to deliver high availability guarantees.

“The traditional patchwork of databases, pipelines, and bespoke vector stores simply can’t keep up with AI’s latency, governance, and scale requirements,” says Nadeem Asghar, chief product and technology officer at cloud AI database platform SingleStore. “Unified intelligence planes will replace today’s fragmented stacks, collapsing data, compute, and inference into a single live system.”

Dominik Tomicevic, CEO of graph database provider Memgraph, recommends separating the models and agents that form your intelligence layer from your knowledge layer, where truth, data, and information live, and which requires synchronous or near-synchronous replicas across zones or regions.

Although AI infrastructure means tackling data- and network-heavy distributed systems, he views that as a solvable engineering problem. “A resilient AI stack starts with a strongly-typed knowledge graph or GraphRAG store that can be clustered, replicated, backed up, monitored, and access-controlled just like any other mission-critical database,” he says.

That gives you flexibility to scale search and data nodes separately, or even change models and vendors in the future. It also means security and resilience go hand in hand.

“Fine-grained access control at the graph level means the retrieval layer will never leak data that the underlying database wouldn’t allow, even if an LLM is curious,” he adds. “On top of that, you bake in observability and service-level objectives specifically for AI, like latency and error budgets for GraphRAG queries, quality metrics for retrieval results, and cost budgets for model calls.”
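The AI-specific service-level objectives described here can be sketched as a small tracker that counts queries as “good” only if they return without error and within a latency budget. The 90% target and 250ms budget below are illustrative numbers, not recommendations.

```python
# Sketch: an availability-style SLO tracker for retrieval (e.g. GraphRAG)
# queries. A query counts toward the SLO only if it succeeded AND came
# back within the latency budget.

class SloTracker:
    def __init__(self, target: float = 0.995, latency_budget_ms: float = 250):
        self.target = target
        self.latency_budget_ms = latency_budget_ms
        self.total = 0
        self.good = 0

    def record(self, latency_ms: float, ok: bool = True) -> None:
        self.total += 1
        if ok and latency_ms <= self.latency_budget_ms:
            self.good += 1

    def within_slo(self) -> bool:
        """True while the good-query ratio still meets the target."""
        return self.total == 0 or self.good / self.total >= self.target

slo = SloTracker(target=0.9, latency_budget_ms=250)
for ms in [120, 180, 90, 400]:  # one slow query out of four
    slo.record(ms)
slo.record(150, ok=False)        # and one retrieval error
print(slo.within_slo())  # False: only 3 of 5 queries were good, below 90%
```

The same structure extends to retrieval-quality metrics and per-model cost budgets: define what “good” means, count it continuously, and alert while the budget is burning rather than after it is gone.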

Put platforms in place

The pressure to move from prototypes to production deployments that deliver the value of AI means individual projects need policies and best practices to build on, rather than having to make all the right decisions themselves. That way, teams can focus on technical questions like choosing models rather than on building infrastructure.

If this sounds like the principles of platform engineering, that’s because a platform approach is how you make AI a capability rather than a series of experiments. IDC’s Saroff argues that a unified platform built on work you’ve already done gives you a backbone of process, APIs, data, and technologies. Instead of solving the same problem over and over, you deliver infrastructure that includes GPUs, accelerators, and multiple flavors of compute; observability for models, API calls, and applications; and cost management and governance.

All these systems need to feed into observability and optimization tools with near-real-time feedback. You can’t wait until you get your monthly cloud bill to discover you’ve blown past your budget, or until you hit an outage to realize the APIs you rely on are returning errors and requiring multiple retries. API management is key to tracking usage and optimizing costs.
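A sketch of that near-real-time cost feedback, applied to per-call token spend. The prices, budget, and 80% alert threshold are made-up examples; a real implementation would pull metered usage from your API management layer.

```python
# Sketch: running spend tracking for model API calls, raising a flag well
# before the monthly bill arrives and a hard stop before overrun.

class CostGuard:
    def __init__(self, monthly_budget_usd: float, alert_at: float = 0.8):
        self.budget = monthly_budget_usd
        self.alert_at = alert_at  # fraction of budget that triggers an alert
        self.spent = 0.0

    def charge(self, tokens: int, usd_per_1k_tokens: float):
        """Record one call's cost; return an action string, or None if fine."""
        self.spent += tokens / 1000 * usd_per_1k_tokens
        frac = self.spent / self.budget
        if frac >= 1.0:
            return "over budget: block non-critical calls"
        if frac >= self.alert_at:
            return "alert: budget nearly exhausted"
        return None

guard = CostGuard(monthly_budget_usd=100.0)
print(guard.charge(2_000_000, usd_per_1k_tokens=0.03))  # None ($60 spent)
print(guard.charge(1_000_000, usd_per_1k_tokens=0.03))  # alert ($90 spent)
```

Wiring the same counters to retry rates and error codes gives you the API-health signal the paragraph above describes, from the same metering path.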

And you need all of that to integrate with existing infrastructure and workflows. “Every company has the same problem. You need AI to compete, but all your actual business runs on legacy infrastructure and software that predates the iPhone,” argues Jarrod Vawdrey, field chief data scientist at Domino Data Lab.

He defines forward deployed engineers as translators who navigate the complexity between desired business outcomes, legacy systems, and modern AI capabilities. “They can wrangle a large language model and integrate it with your 20-year-old ERP system that nobody wants to touch.”

The integrations will be new but the fundamentals aren’t. Doing IT right is what will allow you to do AI correctly, says Bola Rotibi, chief of enterprise research at technology research and advisory firm CCS Insight.

The good news is you may already have done the heavy lifting using, for example, well-architected frameworks for cloud, as AI applications will inherit that redundancy, exception handling, and chaos engineering. “If your architecture is built for resiliency, then chances are you’ve already started thinking about all the things required to underpin AI,” she says.

All of this, of course, is going to cost money. IDC predicts that by 2027, organizations will realize they’re underestimating the costs of AI infrastructure by almost a third, and will start applying FinOps to it.

But true resilience relies on understanding both business and operational context, making for a more combined, collaborative environment, Rotibi suggests. While CIOs usually struggle to justify infrastructure investments, tying them to delivering reliable and secure AI allows IT to continue providing value that’s aligned with business priorities rather than being seen as a cost center.


Read More from This Article: Delivering resilience and continuity for AI
Source: News
Category: News | December 30, 2025
