A database error knocked out digital services at Starbucks, McDonald’s, and many more big brands last November. A cybersecurity update grounded flights, canceled surgeries, and disrupted thousands of other services in 2024.
Today IT-related disasters get their own year-in-review recaps. Tomorrow the growing prevalence of agentic AI systems will compound complexity and court greater risk.
A collection of public and private clouds, edge networks, and so-called AI factories filled with GPUs and specialized hardware supports these emerging workloads, each a link in a complex web that can put business resiliency at risk when any component goes down.
“As technology leaders look ahead, the question is how to build infrastructure that can thrive under the scale, speed, and complexity that AI demands,” notes Iram Parveen, assistant manager at Deloitte’s Center for Integrated Research.
The good news? IT leaders can focus on controlling what they can control and take steps to mitigate the risks of widespread IT dependencies — and maybe, just maybe, avert disasters. But first, a recap of how organizations became so enshrouded in complexity.
COVID compounded IT complexity
It’s been a rough couple of years for digital services, but no one should be surprised.
Organizations, desperate to maintain business productivity for globally distributed teams during the COVID-19 pandemic, boosted complexity by adopting dozens of new applications and services — both on-premises and in the cloud.
This, in turn, increased the interdependence between customers and their providers, with more API calls and endpoints than most organizations have any business supporting. From on-premises to cloud, to microservices to SaaS, the failure points are myriad — and they’re increasing.
For CIOs, the risk profile has shifted from data center uptime to ecosystem fragility, says Bradd Busick, principal for AI, data, and technology at Frazier Healthcare Partners and former CIO of MultiCare Health System.
Put another way: Businesses have effectively institutionalized complexity, says IDC analyst Frank Dickson. “With that complexity driven into interconnecting systems, what may have caused problems in a system now can be replicated through all systems,” he adds.
IT resilience, from 30,000 feet
What’s an IT leader to do at a time when most organizations are adding more applications and services — especially as the lure of AI proves too strong to ignore? What’s the playbook for IT resiliency amid the growing complexity?
Broadly, CIOs must fuse cybersecurity, business continuity, and architecture into an enterprise discipline that assumes failure and designs around it, says Busick.
IT leaders must leverage these elements to run their minimum viable business (MVB). What constitutes an MVB varies by industry, but for an airline it would include keeping the flight reservation system available to customers at all times.
“If end users can’t check email, that’s a problem,” IDC’s Dickson says. “If airlines can’t fly their planes, that ends their business.”
The IT resilience playbook
How do IT leaders do that in practice? The approach must be multitiered, with classes of protection: proactive, active, and reactive.
Proactive measures, which include technology architecture choices and contractual approaches that improve the resilience posture of technology in production, are key, says Brent Ellis, principal analyst at Forrester Research. These may include “fire drills” that pressure-test employees and critical systems for outages, cybersecurity incidents, and natural disasters.
Busick says he’s segmented critical platforms, such as electronic health record (EHR) systems, as well as medications and monitoring systems, from general enterprise IT systems to hedge against outages or cyberattacks.
IT leaders naturally have plenty of technology tools at their disposal to help with these endeavors.
One such tool is observability, a class of software designed to give deep visibility into IT systems’ health and performance through telemetry data such as logs, metrics, and traces. While past approaches monitored for known problems, observability enables IT staff to query systems’ behavior to detect and debug novel issues before they adversely impact the environment.
Active measures cover day-to-day operations and the services used to monitor and manage technology in the business. Classic reactive measures include backups, disaster recovery infrastructure, failover and high-availability environments, incident plans, and crisis management practices.
Some of these areas may overlap, but ultimately establishing resilience at the system level, rather than at the component or service level, is critical, Ellis says. Resilience must also be retested as the tech environment changes.
“Organizationally, enterprises must break down the wall between technology implementation and the business,” Ellis says. “Because at this point, technology is the business and technology resilience is business resilience.”
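To make the reactive tier a little more concrete, here is a minimal Python sketch of a failover path: a caller tries a primary service and falls back to a standby when the primary fails. It is an illustration only, not something drawn from the analysts quoted here. The names `call_with_failover`, `primary_reservations`, and `standby_reservations` are hypothetical, and the airline reservation scenario simply echoes the MVB example above.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllBackendsFailed(Exception):
    """Raised when every backend in the failover chain has failed."""


def call_with_failover(backends: Sequence[Callable[[], T]]) -> T:
    """Try each backend in order and return the first successful result."""
    errors: List[Exception] = []
    for backend in backends:
        try:
            return backend()
        except Exception as exc:  # broad on purpose, for illustration only
            errors.append(exc)
    # Nothing answered: surface every collected error to the caller.
    raise AllBackendsFailed(errors)


# Hypothetical stand-ins for a primary and a standby reservation service.
def primary_reservations() -> str:
    raise ConnectionError("primary region unavailable")


def standby_reservations() -> str:
    return "booking confirmed via standby"


result = call_with_failover([primary_reservations, standby_reservations])
print(result)
```

In a real environment the ordering, timeouts, and health checks would depend on the architecture. The point of the sketch is only that the fallback path is explicit and can itself be exercised in a fire drill.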
Preparing for the AI tidal wave
These best practices, designed to protect core and ancillary business operations against systemic outages, cybersecurity attacks, and other risks, are all critical arrows in the IT resiliency quiver. Collectively, these approaches will take on greater importance as organizations grow their consumption of AI workloads.
While most organizations are not yet launching agentic AI en masse, the technology will sharply increase business risks when it goes mainstream. After all, while agentic systems may scale productivity, they can also wreck “your entire organization at scale,” says Dickson, citing an incident last year in which an agentic coding tool deleted an entire database.
Whether IT leaders are protecting agentic systems, or physical or virtual supply chains, there is no 100% solution. For all the talk of preparing people, technology, and processes, organizations are still at the mercy of their providers, their tools, and good old-fashioned human error.
“It’s not about perfect; it’s about good,” Dickson says. “How do we reduce the complexity, increase the redundancy, and how do we make these systems better?”

