Tiatra, LLC
Information Technology Solutions for Washington, DC Government Agencies

What CIOs can learn from the massive Optus outage

This week’s high-profile resignation of Optus CEO Kelly Bayer Rosmarin underscores the stakes involved in setting an IT strategy for business resilience. She stepped down in the wake of the Australian telco’s massive outage earlier this month, which left 10 million Australians and 400,000 businesses without phone or internet service for up to 12 hours.

At an Australian Senate inquiry last week, Lambo Kanagaratnam, the telco’s managing director of networks, told lawmakers that Optus “didn’t have a plan in place for that specific scale of outage.” Bayer Rosmarin herself admitted that prior to the outage she carried a spare SIM card from competitor Vodafone, and that since the outage she also carries a second spare SIM from rival Telstra.

During the outage, Optus failed to connect 228 triple-0 emergency calls, including one from the colleague of a man suffering a heart attack.

The network outage, which exposed the vulnerabilities of interconnected systems, is a reminder that things can and will go wrong even in sophisticated environments, and it offers important lessons for CIOs who want to take prudent action now.

As dramatic and widespread as the Optus outage was, such incidents are far from isolated anomalies; they happen to many organizations with differing levels of severity. And the cost of such outages is increasing, according to Uptime Institute’s Annual Outage Report 2023.

For CIOs, handling such incidents goes beyond just managing IT systems. It demands a blend of foresight, strategic prioritization, and effective disaster recovery planning. The Optus outage provides a prompt for assessment, offering IT leaders insights into how to strengthen defenses and how to respond better when things go wrong. Here are some of the key lessons of this latest high-profile IT outage.

Adopt a protocol to test updates first

Initial reports from Optus connected the outage to “changes to routing information from an international peering network” in the wake of a “routine software upgrade.” Parent company SingTel has since disputed that explanation, saying that safety systems in Optus’ routers, not the software upgrade, were at fault.

In her Senate testimony, Bayer Rosmarin stated that the root cause was that the company’s routers “hit a fail-safe mechanism, which meant that each one of them independently shut down,” an event she said was “triggered by the upgrade on the SingTel international peering network.”

Be that as it may, the outage underscores an important point: Before rolling out updates, particularly organization- or network-wide updates, it’s advisable to test them on an internal system before deploying them to the network. “It’s what they call ‘fat fingers,’” says telecommunications analyst Paul Budde.

“If there is an error in it, you want the network to recognize it and filter it out or you can get this cascading effect across the whole system,” Budde says. “And if the whole network is down, technicians will have problems just getting into the system. Then the question becomes: What is your redundancy?”

In the case of Optus, the fix involved a system reset of more than 100 devices in 14 sites across Australia. In all, a core group of 150 engineers and technicians worked to remedy the outage, “while 250 other workers and five international companies also provided support,” according to a report from ABC News based on Senate inquiry documents.

Map weak points and address them

Gabby Fredkin, head of data and analytics at IT research and advisory firm Adapt, says it is vital to map your company’s infrastructure, segment services so they can stand alone in the event of an outage, identify weak points, and stress-test those weak points to understand any vulnerabilities in the system.

“It’s easier said than done,” Fredkin concedes.

Still, networks are only as robust as their weakest points, and when there’s a single point of failure, especially if it relates to critical infrastructure, it can result in crippling system-wide outages. At the very least, CIOs must know where these single points of failure exist in their systems to help ensure redundancy and provide context for making decisions around priorities and budget.
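
One way to start looking for single points of failure is to model the infrastructure as a graph and check what happens when each node is taken away. The following is a minimal sketch in Python, not a tool used by Optus or any of the analysts quoted here. The topology, the node names, and the helper functions `build_graph` and `single_points_of_failure` are all hypothetical, and the sketch assumes the network is fully connected to begin with.

```python
from collections import deque


def build_graph(links):
    """Build an undirected adjacency map from (node, node) link pairs."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def reachable(graph, start, removed):
    """Return the nodes reachable from `start` while `removed` is offline."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def single_points_of_failure(graph):
    """List nodes whose loss would cut at least one other node off."""
    critical = []
    for candidate in sorted(graph):
        remaining = [node for node in graph if node != candidate]
        if not remaining:
            continue
        # If the survivors cannot all reach one another, the candidate
        # was the only path between them.
        survivors = reachable(graph, remaining[0], candidate)
        if len(survivors) < len(remaining):
            critical.append(candidate)
    return critical


# Hypothetical topology, for illustration only.
links = [
    ("core-router", "site-a"),
    ("core-router", "site-b"),
    ("site-a", "site-b"),
    ("core-router", "billing"),
    ("billing", "payments-gateway"),
]
critical = single_points_of_failure(build_graph(links))
print(critical)
```

In this made-up topology the check flags `billing` and `core-router`, because each is the only path to at least one other node. Real networks need far richer models (capacity, shared power, shared vendors), but even a rough map like this gives context for the budget and priority decisions described above.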

“You may not be able to have redundant paths across your entire network; it’s just too expensive. But when major outages happen to your organization or others, it’s an opportunity to review the risk versus the cost,” says Matt Tett, managing director of Enex Test Lab.

“It is worth reviewing the budget and considering whether it’s good to have more dual loading on the network to save a bit of pain in the future,” he says.

Plan for inevitable outages

Even if they’re not overseeing vast networks like Optus’, IT leaders and their executive counterparts must plan for outages, their own or those of their service providers, as even small or localized outages can still disrupt the business and its customers.

“It’s important to review your business continuity plans and ensure you’ve got some kind of backup, where possible, to continue with [business as usual],” says Tett.

This business continuity plan might include processes for reverting to paper-based systems, shifting to cellular coverage instead of internet, ensuring executives and key staff have dual SIM phones to switch networks to ensure continuity of communications, or whatever is relevant to the organization.

“It’s like having a flight manual so that if you lose a significant part of the technology you can try and ensure there are some offline ways to continue functioning,” he says.

Spark the disaster recovery conversation

CIOs can use these headline-making incidents to spur conversations with their infrastructure leaders to review their disaster recovery plan. “Don’t wait for something to happen. It should be an ongoing, systematic approach to look at where vulnerabilities lie,” says Fredkin, who cites Netflix’s Chaos Monkey, which creates random outages in its production environment, as a key component of the streaming media giant’s strategy for improving the resiliency of its complex systems.

“Causing chaos in their system allows them to expose weak points, see how things might pan out, and plan and run drills of what could happen,” he says.
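
The idea behind this kind of drill can be illustrated with a toy simulation: take one randomly chosen instance offline and see which services no longer have everything they need. This is a hedged sketch of the general approach, not Netflix’s Chaos Monkey or its API. The component names, the service definitions, and the `run_drill` function are assumptions made up for illustration, and instance names are assumed to be unique.

```python
import random


def run_drill(instances, services, rng):
    """Take one random instance offline and report which services break.

    instances: dict mapping component name -> list of instance ids
    services:  dict mapping service name -> list of components it needs
    """
    component = rng.choice(sorted(instances))
    victim = rng.choice(instances[component])
    # Work out which instances of each component are still running.
    survivors = {
        name: [i for i in ids if i != victim]
        for name, ids in instances.items()
    }
    # A service breaks if any component it needs has no instance left.
    broken = sorted(
        service
        for service, needs in services.items()
        if any(not survivors[c] for c in needs)
    )
    return victim, broken


# Hypothetical environment: two web servers but only one database.
instances = {"web": ["web-1", "web-2"], "db": ["db-1"]}
services = {"storefront": ["web", "db"], "status-page": ["web"]}

rng = random.Random()
for _ in range(5):
    victim, broken = run_drill(instances, services, rng)
    print(f"took down {victim}: broken services = {broken}")
```

In this example, losing either web server breaks nothing, while losing the lone database takes the storefront down. Repeated drills surface that kind of gap before a real outage does.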

Conversations around disaster recovery need to involve the CFO and CEO to map the risks of being offline and of losing customer trust, as well as the costs to mitigate those risks. “How one company is impacted can differ substantially to the way another company’s impacted, so you’ve got to take that into account too,” Fredkin says.

Understand third-party risks

According to Uptime, managed digital infrastructure services, including cloud, colocation, telecom, and hosting companies, account for a growing proportion of outages today. As such, IT leaders must be aware of, and know how to manage, third-party vendor risks, says Budde, “particularly in a technological landscape where cost-saving measures and outsourcing have become common.”

For software or hardware updates, it’s vital to have a list of critical vendors along with the timing and nature of updates. CIOs need to look at whether it’s feasible to roll out updates to some customers and not others or to parts of their infrastructure and not others, Fredkin says. They also need to find “a way you can do some testing so it doesn’t impact the entire production environment,” he adds.

“Having good relationships with the people who provide the hardware and the software is crucial. Knowing when something, like an update, is coming ahead of time, and having some sort of control over when that update is pushed through to your organization can be very beneficial,” he says.
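
Rolling an update out to part of the infrastructure first can be sketched as a staged rollout: update a small wave, check health, and stop before the problem spreads. The Python below is an illustrative sketch only. The `staged_rollout` function, the wave sizes, the device names, and the `apply_update` and `is_healthy` callbacks are hypothetical stand-ins for whatever deployment and monitoring tooling an organization actually uses.

```python
def staged_rollout(targets, apply_update, is_healthy, wave_sizes=(1, 5)):
    """Apply an update in growing waves and halt at the first unhealthy target.

    Returns (updated, halted_at); halted_at is None if every target passed.
    Wave sizes are assumed to be positive; once they run out, the final
    wave covers all remaining targets.
    """
    updated = []
    sizes = list(wave_sizes)
    position = 0
    while position < len(targets):
        size = sizes.pop(0) if sizes else len(targets) - position
        for target in targets[position:position + size]:
            apply_update(target)
            updated.append(target)
            if not is_healthy(target):
                # Stop here so the remaining targets keep the old version.
                return updated, target
        position += size
    return updated, None


# Hypothetical fleet in which one device reacts badly to the update.
targets = [f"router-{n:02d}" for n in range(1, 11)]
applied = []
updated, halted_at = staged_rollout(
    targets,
    apply_update=applied.append,               # stand-in for a real deployment step
    is_healthy=lambda t: t != "router-03",     # stand-in for a real health check
)
print(updated, halted_at)
```

Here the rollout stops at `router-03`, leaving seven of the ten devices untouched. The trade-off is speed: smaller waves take longer to finish but limit how much of the production environment a bad update can reach.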

Make the case for IT modernization

As unfortunate as they are, headline-grabbing outages often offer the opportunity for IT leaders to make their own case for IT modernization, Fredkin advises. Although not expressly the case with Optus, when systems go offline, it is often related to a legacy technology issue, and these incidents can help motivate buy-in at the leadership and board level to update systems to ensure they’re secure and resilient at speed and at scale, he says.

“When CIOs are making a modernization use case, they need to have the stakeholder buy-in for the business to come along the journey,” he says.

Modernizing complex, mission-critical functions can take two to three years to complete, so there needs to be a way of ordering and prioritizing efforts as well. “Think of it like a traffic-light system,” Fredkin says, looking at what is crucial and critical, and what is urgent. “What are the biggest gaps in the system? And in terms of the longer-term refresh, that’s a different prioritization, because some things need to be done in a specific order,” he says.

“It’s that classic waterfall mentality, which still has a very big place when it comes to redesigning critical infrastructure,” he adds.

Consider the larger picture

Whether they originate with your systems or are the result of connected networks, outages can impact a wide range of businesses at once. As such, IT leaders might want to consider thinking beyond their organization’s four walls, Budde says.

“A tailored disaster and resilience plan needs to include compliance with industry standards and regular review of IT systems and protocols to ensure robustness, particularly in response to potential network stress and security threats,” he says, adding that such efforts might need to go further than just your organization, depending on your industry.

“We may need some out-of-the-box thinking and start looking at nationwide solutions and industry-wide solutions in how organizations can assist each other in these situations,” he says.

Overlook communications to your peril

Last, but by no means least, organizations need a comprehensive communications playbook for when outages or disruptions occur, regardless of whether those outages originate with them.

“It’s vital to have clear, concise communication about any outages or issues,” says Enex Test Lab’s Tett. This communication should go up the chain to the CEO as well as outward to customers and the media to provide as much clarity as possible about the situation.

“The first thing organizations need to think of is how to clearly communicate with their customers, even if it’s not them that’s causing a disruption. And the second is, if they can’t communicate with their customers because of network outages, have a strategy in place to be able to communicate via the media,” he says.

That communication should also include some kind of time frame to help manage expectations around downtime and restoration of business as usual. “Whether it’s a few hours or 48 hours, be open and transparent,” says Tett.

Business Continuity, Disaster Recovery, Networking, Telecommunications Industry

Category: News
November 23, 2023