When the AI goes dark: Building enterprise resilience for the age of agentic AI

[Note: This article was written in conjunction with Eugene Chuvyrov and Sheeraz Memon, Innovation Engineering, ServiceNow.]

When my home internet went down for days, it initially felt like a pleasant break. Then reality hit: doorbell, security system, thermostat, lights, gym, speakers…all dead. Every assistant stopped assisting. It turned out that my home had become dependent on connectivity in ways I hadn’t fully grasped.

Well before AI became ubiquitous, we already saw what traditional IT fragility could cost. Southwest Airlines lost $800 million when it canceled 16,700 flights during peak holiday travel in late 2022, while Meta’s six-hour global outage in 2021 cost the company $100 million in revenue and a five percent decline in stock valuation.

Now consider how enterprises are racing to deploy AI agents, much more complex and far harder to restore, at unprecedented speed in this AI-first world. I’ve spent years in executive conversations about AI — strategy, architecture, transformation — yet not once has anyone raised the topic of AI disaster recovery. Not a single time.

We are sprinting toward what I call agentic amnesia, a state where enterprises become so dependent on AI that its failure erases the organizational intelligence needed to recover. In that process, companies are building intelligence fragility into their very foundation. Yet no one seems to be planning for what happens when it breaks.

Why traditional disaster recovery falls short

For decades, disaster recovery has centered on a straightforward premise: Back up your systems, replicate your data and restore from a known state when things go wrong. Assets such as servers, storage and databases can be snapshotted, copied and recovered. The playbook was well understood.

AI systems break this model entirely. Instead of merely storing data, AI accumulates intelligence. When we talk about AI “state,” we’re describing something fundamentally different from a database that can be rolled back.

Consider what’s actually at stake. Embeddings are how an AI system encodes and retrieves knowledge. Think of them as an employee’s mental map of where information lives across the organization. Fine-tuned model weights represent customizations that shape how the AI reasons about your specific business context, much like institutional knowledge built over time. Agent workflows are multistep processes that AI executes autonomously, like a trained team running a complex playbook without supervision.

Lose this state, and you haven’t just lost data. You’ve lost the organizational intelligence that took hundreds of human days of annotation, iteration and refinement to create. You can’t simply re-enter it from memory.

Worse, a corrupted AI state doesn’t announce itself the way a crashed server does. Joint research from Anthropic, the UK AI Security Institute and the Alan Turing Institute found that as few as 250 malicious documents can produce a backdoor vulnerability in a large language model. A 13-billion-parameter model can be compromised by the same small number of poisoned documents as a 600-million-parameter model, challenging the assumption that scale provides protection. By the time you notice a poisoned model has degraded performance and subtly propagated wrong outputs, the damage may already be embedded in decisions across the enterprise.

You won’t be able to simply restore from backup when the very model itself has been compromised.

The intelligence fragility

This challenge is compounded by the immaturity of the AI vendor landscape. Hyperscale cloud providers may advertise “four nines” of uptime (99.99% availability, which translates to roughly 52 minutes of downtime per year), but many AI providers, particularly the startups emerging rapidly in this space, cannot yet offer these enterprise-grade service guarantees.
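To make the arithmetic behind those figures concrete, here is a minimal Python sketch. The function names are illustrative, and the 99.5% figure in the usage note is a hypothetical value for an AI service without a strong SLA, not a number from any vendor. The composite calculation assumes the services fail independently and that all of them must be up for a workflow to run.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def composite_availability(*availability_percents: float) -> float:
    """Availability of a chain of services that must ALL be up.

    Assumption: failures are independent, so availabilities multiply.
    """
    combined = 1.0
    for pct in availability_percents:
        combined *= pct / 100
    return combined * 100
```

Under these assumptions, `annual_downtime_minutes(99.99)` comes to about 52.6 minutes. A workflow that chains a 99.99% platform to a hypothetical 99.5% AI service lands near 99.49% overall, which `annual_downtime_minutes(composite_availability(99.99, 99.5))` puts at roughly 45 hours of potential downtime a year.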

In June 2024, ChatGPT, Claude, Perplexity and Google Gemini all experienced outages at roughly the same time. Your AI-powered workforce may be far more fragile than your continuity plans assume, especially without clear commitments for model uptime, latency and recovery time. You may already be at a point where your business cannot function until your AI provider recovers its service.

ServiceNow’s 2025 Enterprise AI Maturity Index found that average maturity scores dropped 9 points year-over-year, with fewer than 1% of organizations scoring above 50 on a 100-point scale. The finding suggests AI innovation is outpacing organizations’ capacity to deploy it safely at scale.

We are moving toward a world where businesses cannot function without their digital workforce. When AI agents handle customer interactions, manage supply chains, execute financial processes and coordinate operations, a sustained AI outage isn’t an inconvenience. It’s an existential threat.

The overlooked resilience layer

The solution isn’t purely technical. Even in an AI-driven enterprise, people remain the final line of resilience. History suggests that new jobs and capabilities emerge alongside technological disruption, as long as organizations invest in developing them.

In most companies, workforce readiness for AI, particularly in the event of AI failure, appears to be an afterthought at best. This is a fatal blind spot. Unlike a database outage, where employees can revert to manual processes they remember, AI agents perform work that humans may no longer know how to do, or cannot perform at the necessary scale. If your AI-powered customer service goes down, can your team step in? Do they understand the workflows well enough to execute them? Have they been cross-trained to bridge the gap?

Business continuity planning must ensure that staff understand AI pipelines and data flows, that teams are cross-trained to prevent reliance on a handful of specialists, and that substitution plans exist for when AI systems falter.

Humans are not just a fallback option. They are an integral component of a resilient AI-native enterprise. Motivated, trained and prepared teams can bridge gaps when AI fails, ensuring continuity of both systems and operations. When you continually reduce your workforce to appease your shareholders, will your human employees remain motivated, trained and prepared?

The strategic imperative

AI is no longer experimental technology. It has become foundational business infrastructure. Without robust continuity planning that accounts for AI’s unique fragility, organizations risk operational paralysis when — not if — these systems fail. And the risk will only increase over time: as AI-powered automation expands across the enterprise, the people and knowledge needed to handle those tasks if it fails will continue to diminish.

The stakes extend beyond operational risk. Trust has become the foundational architecture separating organizations capable of deploying autonomous agents from those perpetually managing the consequences of systems they cannot safely control. As organizations establish AI councils and governance committees, this conversation must be on the table. In my experience, it hasn’t been.

So how do you do it?

Commission an AI resilience audit. Map every AI dependency, identify single points of failure and assess recovery capabilities for each.
Demand enterprise-grade SLAs from AI vendors. If they can’t commit to uptime, latency and recovery guarantees in writing, factor that fragility into your risk planning.
Designate AI continuity owners. Someone must be accountable for AI resilience in the same way that someone owns cybersecurity or financial controls.
Run failure drills. Simulate AI outages quarterly. Discover what breaks and who can bridge the gap before a real crisis forces the question, then build a full tactical framework from what you learn. Netflix famously created Chaos Monkey, a tool that deliberately disrupts systems to test resilience; an AI-powered Chaos Monkey could be valuable for more randomized resilience testing.
Invest in workforce readiness. Cross-train teams, document AI workflows and build substitution plans so humans can step in when agents step out.
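As one way to get started on the audit and the failure drills, the sketch below keeps a small inventory of AI-dependent workflows, flags single points of failure and runs a tabletop outage drill. Everything named here (the `Workflow` fields, the vendor labels, the workflow names and the status strings) is a hypothetical illustration, not any product's API; a real audit would draw this inventory from your own systems of record.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    name: str
    ai_providers: tuple[str, ...]   # any one of these can serve the workflow
    manual_fallback: bool = False   # True if a trained team can take over
    owner: str = "unassigned"       # the accountable AI continuity owner


def single_points_of_failure(workflows: list[Workflow]) -> list[str]:
    """Workflows that rely on exactly one AI provider and have no manual fallback."""
    return [
        w.name
        for w in workflows
        if len(w.ai_providers) == 1 and not w.manual_fallback
    ]


def simulate_outage(workflows: list[Workflow], down: set[str]) -> dict[str, str]:
    """Tabletop drill: report each workflow's status if providers in `down` go dark."""
    report = {}
    for w in workflows:
        surviving = [p for p in w.ai_providers if p not in down]
        if len(surviving) == len(w.ai_providers):
            status = "unaffected"   # none of its providers are down
        elif surviving:
            status = "degraded"     # at least one alternate provider remains
        elif w.manual_fallback:
            status = "manual"       # humans step in
        else:
            status = "stopped"      # nothing and no one can run it
        report[w.name] = status
    return report


# Hypothetical inventory, for illustration only.
inventory = [
    Workflow("customer-service", ("vendor-a",)),
    Workflow("invoice-matching", ("vendor-a", "vendor-b")),
    Workflow("supply-forecast", ("vendor-b",), manual_fallback=True),
]
```

With this inventory, `single_points_of_failure(inventory)` flags only `customer-service`. A drill with `simulate_outage(inventory, {"vendor-a"})` reports that workflow as stopped, `invoice-matching` as degraded and `supply-forecast` as unaffected. Taking both vendors down stops `invoice-matching` as well and leaves `supply-forecast` running manually, which is the kind of gap a drill is meant to surface before a real outage does.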

The question every leadership team should be asking today is: If our AI went dark tomorrow, how would we continue to serve customers and keep the business running?

Don’t wait for your own “weekend without internet” moment to discover that your business depends entirely on systems you haven’t prepared to lose. Agentic amnesia isn’t inevitable. Intelligence fragility is a design choice. But so is resilience. By preparing today, you are choosing to stay resilient when others falter at their first major AI outage.

This article is published as part of the Foundry Expert Contributor Network.