Infrastructure might be the reason so many organizations report failures when scaling AI from POC to production. Almost every company in Microsoft’s latest State of AI Infrastructure report spoke of challenges scaling and operationalizing AI, and over half of more than 1,500 business leaders from various sectors and regions said they don’t have the right infrastructure to support the AI workloads they want to run — a proportion repeated in other surveys.
Building, deploying, and operationalizing AI models is when you find out how modern your infrastructure really is and where it lets you down. “Running AI on legacy architecture is like streaming 4K video over dial-up,” says Frank Miller, chief AI and platforms officer at digital infrastructure company Colt Technology Services. “You can convince yourself it will work, but the reality is very different.”
If you don’t want to be stuck firefighting just to keep the AI you’ve spent so much on available, you need both governance and modern architecture. “This means replacing rigid legacy systems with hybrid, cloud-native designs that scale for AI workloads,” he adds. “High-bandwidth, low-latency connectivity ensures fast data access; redundancy and automated failover provide continuity; and zero-trust security with encryption protects sensitive AI flows. Adding observability and predictive monitoring helps anticipate issues before they disrupt operations, creating an infrastructure that’s resilient, secure, and ready for AI innovation.”
Think of it as technical debt, suggests IDC group VP Daniel Saroff, because most enterprises underestimate the strain AI puts on connectivity and compute. Siloed infrastructure won’t deliver what AI needs, and CIOs need to think about compute, networking, and data in a more integrated way to make AI successful. “You have to look at your GPU infrastructure, bandwidth, network availability, and connectivity between respective applications,” he says. “If you have environments not set up for highly transactional, GPU-intensive workloads, you’re going to have a problem,” Saroff warns. “And having very fragmented infrastructure means you need to pull data from and integrate multiple different systems, especially when you start to look at agentic AI.”
Training, RAG, and agent workflows assume that data isn’t only correct but always reachable and never stuck behind a bottleneck. Protocols like MCP (Model Context Protocol) are emerging as a way to standardize access to data, and legacy systems may not support that easily, he adds.
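As a rough illustration of what that standardization looks like, here’s a minimal sketch of an MCP server that wraps a legacy data source as a tool agents can call, using the FastMCP helper from the official MCP Python SDK; the `get_customer_record` tool and its backing lookup are hypothetical:

```python
# Minimal MCP server exposing a legacy data source as a standardized tool.
# Sketch only: get_customer_record and its backing query are hypothetical;
# assumes the official MCP Python SDK (pip install mcp) is available.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("legacy-data-bridge")

@mcp.tool()
def get_customer_record(customer_id: str) -> dict:
    """Fetch a customer record from the legacy system of record."""
    # In a real deployment this would query the legacy database or API.
    return {"customer_id": customer_id, "status": "placeholder"}

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default; MCP-aware agents connect to it
```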
Get good at GPUs
Resilience is hardly a new idea for enterprise IT. High availability, failover, and disaster recovery are so universally required that one of the first six agents Microsoft added to Azure Copilot is there specifically to improve resiliency in the cloud. On premises, enterprises have decades of experience with infrastructure to draw from, but that rarely includes the expensive GPUs and other accelerators that are key to AI, whether you’re training models or running inference.
They’re also more demanding, whether that’s the added complication of needing to autoconfigure GPU Kubernetes clusters with the right drivers and operators, or building out dedicated AI infrastructure that’s harder to service and needs high-speed networking for distributed traffic with unfamiliar and fast-changing patterns.
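As a small example of that added complication, a preflight check like the following sketch, using the official Kubernetes Python client, can confirm nodes actually advertise schedulable GPUs; it assumes the NVIDIA device plugin or GPU Operator is installed, which exposes GPUs as the `nvidia.com/gpu` resource:

```python
# Preflight check: confirm cluster nodes actually expose schedulable GPUs.
# Sketch assuming the NVIDIA device plugin or GPU Operator is installed,
# which advertises GPUs as the "nvidia.com/gpu" allocatable resource.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    allocatable = node.status.allocatable or {}
    gpus = allocatable.get("nvidia.com/gpu", "0")
    if gpus == "0":
        print(f"{node.metadata.name}: no GPUs advertised -- check drivers/operator")
    else:
        print(f"{node.metadata.name}: {gpus} GPU(s) schedulable")
```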
“Building GPU infrastructure is really hard,” says Jason Hammons, VP of international systems engineering at VAST Data. “It’s brittle, in large part due to its massively parallel nature, but also because of the componentry. They’re just way more complex.”
AI demands high-bandwidth networks with low and, critically, predictable latency to deliver large payloads of data and small payloads of inferences and API calls. That might mean at least part of your enterprise network looking more like what’s in a cloud data center, perhaps with SmartNICs, InfiniBand, or RoCE, and programmable network operating systems like SONiC, as well as stable links with direct routes to AI data centers and cloud APIs.
With high-speed networking internal to the GPU cluster itself, enterprises can deliver a good AI experience, Hammons says, but building agents is even more demanding in terms of storage and networking. “When you start scaling agentic workloads, because of the complex I/O patterns they exhibit, the complicated nature of keeping those systems up can be exacerbated,” he says.
Intelligent routing and underlay optimization matter more with AI, and load balancing becomes more important than ever, requiring adaptive routing and dynamic, multipath I/O so one congested or unhealthy path doesn’t interrupt an AI pipeline. You also have to give AI traffic high enough priority to support your workloads without getting in the way of critical production systems like ERP and payment services, or VoIP and video meetings.
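One concrete way to express that priority is DSCP marking at the application socket, which the network’s QoS policies can then map to forwarding classes. A minimal sketch, assuming the switches and routers along the path are configured to honor DSCP, and using Expedited Forwarding purely as an illustration:

```python
# Mark an AI pipeline's traffic with a DSCP class so network QoS policies
# can prioritize (or deprioritize) it. Sketch only: assumes the devices
# along the path are configured to honor DSCP markings.
import socket

DSCP_EF = 46            # Expedited Forwarding: low-loss, low-latency class
TOS = DSCP_EF << 2      # DSCP occupies the top six bits of the TOS byte

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS)
# sock.connect(("inference-gateway.internal", 8443))  # hypothetical endpoint
```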
“AI workflows are much more network-based,” says Artur Bergman, CTO at software developer Fastly. “You have to scale across machines and that’s quite a big shift from enterprise workloads that don’t have those levels of network or latency requirements.”
It’s no longer just about avoiding critical failures or recovering from them fast. You also have to design systems for graceful degradation so they can still perform well enough when there are failures.
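In practice, graceful degradation can be as simple as a fallback chain that times out the primary model and degrades to a cheaper tier or a cached answer instead of returning an error. A sketch with hypothetical callables, using only the Python standard library:

```python
# Graceful degradation: fall through progressively cheaper answer tiers
# rather than failing outright. primary_model, smaller_model, and
# cache_lookup are hypothetical callables -- swap in your real clients.
import concurrent.futures

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def with_timeout(fn, seconds, *args):
    # Note: a timed-out call keeps running in its worker thread;
    # we simply stop waiting for it and move on.
    return _pool.submit(fn, *args).result(timeout=seconds)

def answer(query, primary_model, smaller_model, cache_lookup):
    """Try progressively cheaper tiers instead of failing outright."""
    for fn, timeout in ((primary_model, 2.0), (smaller_model, 1.0)):
        try:
            return with_timeout(fn, timeout, query)
        except Exception:
            continue  # degraded path: fall through to the next tier
    return cache_lookup(query)  # last resort: stale-but-served answer
```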
Similarly, resilient AI needs more than the synchronous replication you’re used to having for any production workload. “A lot of these systems need to be load balanced across sites and have that redundancy across multiple domains,” Hammons says. The complexity of that has even sophisticated organizations turning to providers like CoreWeave and what he calls AI-native neoclouds.
Taking a hybrid approach to AI is almost universal. Whether you’re bursting out to an AI data center, building on hyperscaler GPU infrastructure and cloud databases, or calling cloud APIs, you need to think about those connections. That means updating legacy networks and considering multiple connectivity providers for redundancy.
And if you’re doing AI at the edge, especially in near-real-time environments like factories and retail, you also have to think about distributed reliability, and what connectivity and latency are needed to deliver inferencing or update local models across sites for consistency.
“Cross-cloud communication is just going to grow,” Bergman says. Fastly customers are already keeping training set data on its platform so they can use it in multiple clouds. “We can ingress it to all the clouds without the cloud egress charges.”
Authenticating agents’ access and privileges when they act on behalf of employees may add complexity in the future, he suggests. That doesn’t require low-level network changes, but at the application layer, he predicts, a lot of evolution has to happen for these things to scale out in a secure, reliable way.
Flatten your architecture
Most AI adoption today is happening on architectures never designed for this level of volatility, says Richard Copeland, CEO of cloud services provider Leaseweb. “Everyone wants the magic of AI, but the moment they scale it, they’re confronted with the messy reality of data gravity, latency budgets, and storage economics,” he adds. “Teams are trying to secure endpoints, expand pipelines, add GPUs, and increase bandwidth but none of that stops the operational chaos if the foundation beneath it isn’t intentionally resilient.”
You’ll almost certainly need more storage to support AI and not just for training sets, he points out. “You’re storing embeddings, vector indexes, model checkpoints, agent logs, synthetic datasets, and the agents themselves are producing new data every second,” he says. So spend the time to work out how much of that you actually need to store, where, and for how long.
But designing for continuity means treating resilience as a design principle, not an insurance policy. Organizations that stay ahead are flattening architectures, pushing compute closer to data, automating lifecycle policies, and building environments where AI pipelines can fail over without anyone breaking a sweat, says Copeland.
Flatter architectures also reduce technical debt, but most enterprises have accumulated so many layers of tools, proxies, queues, storage tiers, and checkpoints that their AI pipelines behave like Rube Goldberg machines, he adds. “Data has to climb up and down that stack before it reaches the models that need it, and every hop adds latency, fragility, and operational overhead,” he says.
Find out where delays are coming from and you may find systems you don’t need. “Remove redundant middleware, automate data-placement and lifecycle policies, and shift workloads toward the environments where the data already lives,” he continues. Consolidating storage tiers, moving GPU workloads into simpler regional or on-premises environments, and tuning the network path should produce a system that behaves predictably rather than chaotically.
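Automating those lifecycle policies can be as direct as codifying retention rules on an S3-compatible object store. A sketch using boto3, in which the bucket name, prefixes, and retention windows are all hypothetical and should be tuned to what your pipelines actually replay or audit:

```python
# Codify lifecycle policy for AI artifacts on S3-compatible object storage.
# Sketch: bucket, prefixes, and retention windows are hypothetical.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="ai-artifacts",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {   # model checkpoints: cold storage after 30 days, deleted after 180
                "ID": "checkpoints",
                "Filter": {"Prefix": "checkpoints/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 180},
            },
            {   # agent logs: keep 90 days, then expire
                "ID": "agent-logs",
                "Filter": {"Prefix": "agent-logs/"},
                "Status": "Enabled",
                "Expiration": {"Days": 90},
            },
        ]
    },
)
```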
Data by design
Making AI scale will almost certainly mean taking a hard look at your data architecture. Every database vendor is adding AI features, and lakehouses promise you can bring operational data and analytics together without affecting the SLAs of production workloads. Or you can go further with data platforms like Microsoft Fabric that bring in streaming and time series data to use for AI applications.
If you’ve already tried different approaches, you likely need to rearchitect your data layer to get away from the operational sprawl of fragmented microservices, where every data hand-off between separate vector stores, graph databases, and document silos introduces latency and governance gaps. Too many points of failure make it hard to deliver high availability guarantees.
“The traditional patchwork of databases, pipelines, and bespoke vector stores simply can’t keep up with AI’s latency, governance, and scale requirements,” says Nadeem Asghar, chief product and technology officer at cloud AI database platform SingleStore. “Unified intelligence planes will replace today’s fragmented stacks, collapsing data, compute, and inference into a single live system.”
Dominik Tomicevic, CEO of graph database provider Memgraph, recommends separating the models and agents that form your intelligence layer from your knowledge layer, where truth, data, and information live and require synchronous or near-synchronous replicas across zones or regions.
Although AI infrastructure means tackling data- and network-heavy distributed systems, he views that as a solvable engineering problem. “A resilient AI stack starts with a strongly typed knowledge graph or GraphRAG store that can be clustered, replicated, backed up, monitored, and access-controlled just like any other mission-critical database,” he says.
That gives you flexibility to scale search and data nodes separately, or even change models and vendors in the future. It also means security and resilience go hand in hand.
“Fine‑grained access control at the graph level means the retrieval layer will never leak data that the underlying database wouldn’t allow, even if an LLM is curious,” he adds. “On top of that, you bake in observability and service-level objectives specifically for AI, like latency and error budgets for GraphRAG queries, quality metrics for retrieval results, and cost budgets for model calls.”
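Those AI-specific service-level objectives can start very simply: track query latencies and failures, compare them against an explicit target, and watch the error budget burn. A minimal sketch in which the objective and the sample numbers are illustrative assumptions:

```python
# Track an AI-specific SLO: p95 latency for GraphRAG queries against a
# target, plus a simple error budget. All numbers are illustrative.
import statistics

SLO_P95_MS = 500          # hypothetical latency objective
ERROR_BUDGET = 0.01       # allow 1% failed or out-of-SLO queries

def evaluate(latencies_ms, failures, total):
    p95 = statistics.quantiles(latencies_ms, n=20)[18]  # 95th percentile
    budget_burn = failures / total
    return {
        "p95_ms": p95,
        "latency_slo_met": p95 <= SLO_P95_MS,
        "budget_burn": budget_burn,
        "budget_exhausted": budget_burn > ERROR_BUDGET,
    }

# Example: one failure in 1,000 queries stays inside the 1% budget.
print(evaluate([120, 180, 240, 310, 480, 95, 150, 220, 760, 130],
               failures=1, total=1000))
```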
Put platforms in place
The pressure to move from prototypes to production deployments that deliver the value of AI means individual projects need policies and best practices to build on, rather than having to make all the right decisions themselves. That frees teams to focus on technical questions like choosing models rather than building infrastructure.
If this sounds like the principles of platform engineering, that’s because platform thinking is how you make AI a capability rather than a series of experiments. IDC’s Saroff argues that a unified platform that reuses work you’ve already done gives you a backbone of process, APIs, data, and technologies. Instead of solving the same problem over and over, you deliver infrastructure that includes GPUs and accelerators, multiple flavors of compute, observability for models, API calls, and applications, plus cost management and governance.
All these systems need to feed into observability and optimization tools with near-real-time feedback. You can’t wait until you get your monthly cloud bill to discover you’ve blown past your budget, or until you hit an outage to realize the APIs you rely on are returning errors and requiring multiple retries. API management is key to tracking usage and optimizing costs.
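A simple version of that near-real-time feedback is metering every model call against a running budget rather than waiting for the invoice. A sketch in which the client, pricing, and budget figures are all illustrative assumptions; it assumes a `call_model` client that reports tokens used per call:

```python
# Near-real-time spend tracking for model API calls, instead of waiting
# for the monthly bill. Pricing and budget figures are illustrative.
import functools

PRICE_PER_1K_TOKENS = 0.002   # hypothetical blended rate, USD
DAILY_BUDGET_USD = 250.0
_spend = {"usd": 0.0}

def metered(call_model):
    """Wrap a (hypothetical) model client returning (response, tokens_used)."""
    @functools.wraps(call_model)
    def wrapper(prompt, **kwargs):
        response, tokens_used = call_model(prompt, **kwargs)
        _spend["usd"] += tokens_used / 1000 * PRICE_PER_1K_TOKENS
        if _spend["usd"] > DAILY_BUDGET_USD:
            # Hook for alerting or throttling long before the invoice lands.
            raise RuntimeError(f"AI spend ${_spend['usd']:.2f} over daily budget")
        return response
    return wrapper
```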
And you need all of that to integrate with existing infrastructure and workflows. “Every company has the same problem. You need AI to compete, but all your actual business runs on legacy infrastructure and software that predates the iPhone,” argues Jarrod Vawdrey, field chief data scientist at Domino Data Lab.
He defines forward deployed engineers as translators who navigate the complexity between desired business outcomes, legacy systems, and modern AI capabilities. “They can wrangle a large language model and integrate it with your 20-year-old ERP system that nobody wants to touch.”
The integrations will be new but the fundamentals aren’t. Doing IT right is what will allow you to do AI correctly, says Bola Rotibi, chief of enterprise research at technology research and advisory firm CCS Insight.
The good news is you may already have done the heavy lifting using, for example, well-architected frameworks for cloud, as AI applications will inherit that redundancy, exception handling, and chaos engineering. “If your architecture is built for resiliency, then chances are you’ve already started thinking about all the things required to underpin AI,” she says.
All of this, of course, is going to cost money. IDC predicts that by 2027, organizations will realize they’re underestimating the costs of AI infrastructure by almost a third, and will start applying FinOps to it.
But true resilience relies on understanding both business and operational context, making for a more combined, collaborative environment, Rotibi suggests. While CIOs usually struggle to justify infrastructure investments, tying them to delivering reliable and secure AI allows IT to continue providing value that’s aligned with business priorities rather than being seen as a cost center.