Last summer, a faulty CrowdStrike software update took down millions of computers, caused billions in damages, and underscored that companies are still not able to manage third-party risks, or respond quickly and efficiently to disruptions.
“It was an interesting case study of global cyber impact,” says Charles Clancy, CTO at Mitre.
In response to the outage, 84% of companies are either considering diversifying their software and service providers, or are already doing so, according to a survey by Adaptavist released in late January.
For companies that had been using CrowdStrike, switching vendors might seem like an obvious solution.
“But then what endpoint detection and response platform should you use instead?” Clancy asks. “Ditching them isn’t the answer if they’re the best product on the market.”
What happened
According to CrowdStrike’s own root cause analysis, the cybersecurity company’s Falcon platform deploys a sensor to user machines to monitor for potential threats. On July 19, 2024, CrowdStrike released a faulty update to that sensor, and it crashed user machines.
The company released a fix 78 minutes later, but applying it required users to manually access the affected devices, reboot in safe mode, and delete a bad file. An automated fix wasn’t released until three days later.
A total of 8.5 million computers were affected. As a result of the outage, thousands of flights were canceled and tens of thousands delayed worldwide. Several hospitals canceled surgeries as well, and banks, airports, public transit systems, 911 centers, and multiple government agencies — including the Department of Homeland Security — also suffered outages.
The overall cost was estimated at $5.4 billion for Fortune 500 firms alone, according to an analysis by Parametrix, and total economic damages could run into tens of billions, Nir Perry, CEO of cyber insurance risk platform Cyberwrite, told Reuters. By comparison, the previous record-holder for most expensive downtime was the 2017 AWS outage, which cost customers an estimated $150 million.
Delta alone had more than $500 million in losses as a result of crippled operations and thousands of flight cancellations and delays. In a lawsuit the airline filed in October, Delta claimed the faulty update was pushed out in an unsafe manner and CrowdStrike should pay for the losses. In a countersuit, CrowdStrike blamed Delta for the airline’s problems, saying that other airlines were able to recover much faster, and that the contract between the two companies meant Delta wasn’t allowed to sue for damages.
CrowdStrike’s stock price fell from $343 the day before the outage to a low of $218 on August 2, a loss of over $30 billion, or more than a third of its total market capitalization.
But, as of January 28, the company’s stock price was over $400, an all-time high, helped by a perfect score on an industry ransomware detection test and by improvements to its quality control processes: after the outage, CrowdStrike added a check for that particular problem, along with other tests, deployment layers, and checks. Customers also gained additional controls over how updates are deployed.
In addition, CrowdStrike hired two independent software security vendors to review the Falcon sensor code, its quality control, and release processes, and also changed how its updates are released: more gradually, to “increasing rings of deployment,” says Adam Meyers, CrowdStrike’s SVP for counter adversary operations. “This allows us to monitor for issues in a controlled environment and proactively roll back changes if problems are detected before affecting a wider population,” he told a Congressional subcommittee in September.
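Meyers doesn’t describe the mechanism in detail, but the ring-based rollout he refers to can be sketched roughly as follows. This is a minimal illustration, not CrowdStrike’s implementation: the ring sizes, error budget, and health-check hook are all invented for the example.

```python
import random

# Illustrative ring sizes: each stage exposes a larger share of the fleet.
RINGS = [0.01, 0.05, 0.25, 1.00]  # fraction of hosts per deployment ring
ERROR_BUDGET = 0.001              # abort if >0.1% of updated hosts fail

def deploy_update(hosts, apply_update, health_check):
    """Roll an update out ring by ring, rolling back on elevated errors."""
    deployed = []
    shuffled = random.sample(hosts, len(hosts))
    for ring_fraction in RINGS:
        cutoff = int(len(shuffled) * ring_fraction)
        ring = [h for h in shuffled[:cutoff] if h not in deployed]
        for host in ring:
            apply_update(host)
            deployed.append(host)
        failures = sum(1 for h in deployed if not health_check(h))
        if failures > len(deployed) * ERROR_BUDGET:
            for host in deployed:   # proactively roll back every touched host
                rollback(host)
            return False, deployed
    return True, deployed

def rollback(host):
    host["version"] = host.get("previous_version", host["version"])

# Usage: a toy fleet where the update turns out to be healthy everywhere.
fleet = [{"id": i, "version": "7.15"} for i in range(1000)]
ok, touched = deploy_update(
    fleet,
    apply_update=lambda h: h.update(previous_version=h["version"], version="7.16"),
    health_check=lambda h: True,
)
print(ok, len(touched))  # True 1000
```

The point of the pattern is that a bad update is caught while it affects only the innermost ring, so the rollback touches 1% of the fleet rather than all of it.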
But while CrowdStrike made changes, companies around the world re-evaluated how much trust they placed in their vendors, reviewed their software security processes, and refocused their attention on resilience.
Trust, but verify. On second thought, don’t trust…
The outage was a rude awakening for Akamai, a content delivery company, says CIO and SVP Kate Prouty. “It was a reminder of how incredibly interconnected the world is,” she says.
Akamai was not itself a CrowdStrike customer, but does use similar services from outside vendors to help protect its systems.
“The first thing we did was audit all the solutions we have that have an agent that sits on a machine and has access to an operating system to make sure none of them have auto update,” she says. “When you have a third-party vendor that pushes updates to a system automatically, that takes control out of your hands.”
But turning off automatic updates can be a problem for some companies. What if there’s an urgent security fix? It can take time to test each update to make sure it works before rolling it out — time bad actors can take advantage of.
If there’s a security threat and potential exposure, you have to go through the testing process as quickly as you can, Prouty says. “There’s no point in patching even a security issue without knowing if it’s going to cause harm in your environment,” she adds.
Akamai has a structure in place that allows it to do the testing quickly, and involves both automation and human intervention. “It’s worth doing that extra step of diligence because it can save you problems down the road,” she says. After the testing is complete, the update is then rolled out in stages. “It doesn’t completely eliminate the risk, but it certainly reduces the risk of having a large-scale impact,” she adds.
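Akamai’s internal process isn’t public, but the pattern Prouty describes (hold vendor-pushed updates, test them automatically, then roll out in stages) can be sketched like this. All names, the "smoke"/"full" suite split, and the stage fractions are hypothetical assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class VendorUpdate:
    vendor: str
    version: str
    security_fix: bool = False  # urgent fixes get an expedited test run

def gate_update(update, run_tests, stages=(0.05, 0.25, 1.0)):
    """Hold a vendor-pushed update until tests pass, then stage the rollout."""
    # Even urgent security fixes are tested first, just with a smaller,
    # faster suite: an untested patch can itself cause harm.
    suite = "smoke" if update.security_fix else "full"
    if not run_tests(update, suite):
        return {"approved": False, "stages": []}
    return {"approved": True, "stages": list(stages)}

# Usage with a stubbed test runner that always passes.
result = gate_update(
    VendorUpdate("edr-vendor", "2.4.1", security_fix=True),
    run_tests=lambda update, suite: True,
)
print(result)  # {'approved': True, 'stages': [0.05, 0.25, 1.0]}
```

The design choice mirrors Prouty’s reasoning: automation keeps the gate fast enough for urgent security patches, while the staged rollout fractions limit blast radius if testing misses something.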
When possible, Akamai avoids using tools that require agents, though there are areas, including cybersecurity, where they’re necessary and the benefits outweigh the risks. “But we didn’t have a lot of them to audit, and we didn’t find anything that was misconfigured,” says Prouty.
Akamai also has other measures in place to reduce the risk of problems third-party software causes, including microsegmentation and identity-based authentication and access controls.
Contracts, audits, and SBOMs
Beyond protecting enterprise architecture from dangerous updates, and dangerous software in general, there are other steps companies can take to safeguard their software supply chain, starting with selecting the vendor and signing the contract. “I’m a CIO in an enviable position in that we sell security solutions that work very well,” Prouty says. “Our legal team knows exactly what to ask for when negotiating contracts. If a company isn’t willing to provide us with what we require to keep our company safe, then we don’t do business with them.”
According to the Cybersecurity and Infrastructure Security Agency, it’s hard for vendors to invest money in security if customers aren’t asking for it. That means, in addition to creating a secure by design philosophy within software companies, the industry also needs a secure by demand philosophy on the buyer side.
As part of this effort, CISA released a software acquisition guide in August for government enterprise customers that could serve as a model for enterprises in general.
The guide addresses four phases of software ownership: software supply chains, development practices, deployment, and vulnerability management. CISA says these phases help organizations buying software better understand their software manufacturers’ approach to cybersecurity and ensure that secure by design is a core consideration.
After the CrowdStrike incident, Akamai began reviewing all its vendor agreements to make sure the contracts had all the necessary protections in place. “We’re still in the process of looking at everything,” Prouty says.
And, again, it’s not enough to take the vendor’s word for it that they’re safe. Akamai, for example, uses tools that audit the configuration of cloud software solutions, as well as run other security checks. “They’re not going to eliminate risk but they’ll significantly reduce it,” she says.
Another approach that enterprises are increasingly using is asking vendors to provide a software bill of materials (SBOM). According to an Anchore survey released in November, 78% of organizations plan to increase their use of SBOMs in the next 18 months.
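An SBOM is a machine-readable inventory of the components inside a software product, which lets a buyer check a vendor’s dependencies against known vulnerabilities. A minimal sketch of consuming one, using the CycloneDX JSON layout; the component list and the watchlist here are invented for illustration:

```python
import json

# A minimal CycloneDX-style SBOM; real SBOMs are emitted by build tools.
SBOM = json.loads("""
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "components": [
    {"type": "library", "name": "openssl",    "version": "3.0.7"},
    {"type": "library", "name": "zlib",       "version": "1.2.11"},
    {"type": "library", "name": "log4j-core", "version": "2.14.1"}
  ]
}
""")

# Hypothetical watchlist of (name, vulnerable-version) pairs.
WATCHLIST = {("log4j-core", "2.14.1"), ("zlib", "1.2.11")}

def flag_components(sbom, watchlist):
    """Return the SBOM components that appear on a vulnerability watchlist."""
    return [
        c for c in sbom.get("components", [])
        if (c["name"], c["version"]) in watchlist
    ]

flagged = flag_components(SBOM, WATCHLIST)
print([c["name"] for c in flagged])  # ['zlib', 'log4j-core']
```

In practice the watchlist would come from a vulnerability database feed rather than a hardcoded set, but the value of the SBOM is the same: the buyer can run this check without the vendor’s cooperation.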
Building resilience
Unfortunately, all the precautions in the world can only reduce risk, not eliminate it. That’s why Akamai also plans for worst-case scenarios and runs drills to gauge its ability to respond quickly and to find areas that need improvement. Immediately after the CrowdStrike outage, for example, Akamai ran a tabletop exercise.
“If this had happened to us, what would it look like?” Prouty asks. The exercise even involved running through CrowdStrike’s remediation process. The exercise worked, she says, and Akamai would’ve been able to recover if the bad update had slipped through the checks.
More companies should be doing these kinds of preparedness drills, says Mitre’s Clancy. “You need to understand your incident response plan, your communication plan, and not just have it written down, but practice it so those skills are fresh,” he says.
In addition, it’s important to involve more than just the security team in these exercises. “When you have an incident, the entire business is impacted,” he adds. “CIOs need to bring the other business executives in on these exercises and disaster response plans. In the real world, they’re the ones calling the shots, not some incident response manager three levels down.”
Resiliency is particularly important since enterprises can’t always test all third-party software. “Independently auditing every software update isn’t practical,” Clancy says. “The best thing to do is have playbooks in place to respond and recover if something like this does happen.” But 84% of organizations didn’t have an adequate incident response plan in place before the CrowdStrike outage, the Adaptavist survey shows. And of those who did have a plan, only 16% found it effective during the crisis. Fortunately, that might now be changing.
After the outage, 54% of organizations say they’re implementing an incident response plan or investing more in the one they have. About half are also introducing or increasing investment in a variety of testing measures, as well as monitoring and observability technologies, over the next 12 months.
Next steps
Guy Moskowitz, CEO and co-founder at Coro Cybersecurity, says the big problem is when vendors prioritize speed and profits over best practices. “CrowdStrike pushes out around a dozen updates every day,” he says. That’s a lot of opportunities for things to go wrong. “I hope we’ll see a push for legislation that recommends or even requires that all cybersecurity companies immediately implement staging environment safeguards to their software upgrade rollout process,” he adds. “This way, they’ll catch any mishaps in a secure environment before rolling out the update broadly to customers.”
He’s not the only one who wants to see government action. In the Adaptavist survey, 47% of respondents say they’re now more supportive of regulations around cybersecurity and resilience than they were before, and 48% are more supportive of regulations around software quality assurance. In addition, 49% endorse mandatory incident reporting requirements.
In August, the US Technology Policy Committee of the Association for Computing Machinery released a statement calling for a thorough investigation of the incident so both private enterprises and regulators can learn how to better strengthen cyberinfrastructure, improve incident response programs and remediation processes, improve international coordination and cooperation, and develop claims processes for these incidents.
“When mistakes happen, it can be serious — and this was a very serious incident,” says Jody Westby, vice-chair of ACM’s US Technology Policy Committee. “Companies had to go through and reset systems, and it took weeks to recover from this.”
But there’s only so much individual customers can do, she says.
“The big vendors aren’t going to have 5,000 different contracts with 5,000 different customers,” she says. “In some cases we can push contract clauses and say, ‘You’ll send us a SOC 2 report every year and you’ll attest you have all these controls.’ And they might sign and say yes, but you won’t really know. There’s only so far you can go with due diligence.”
What the CrowdStrike incident has done is highlight the need for better government assistance, she says.
The Association for Computing Machinery says there’s already an organization uniquely positioned to investigate the incident and publish its results: CISA’s Cyber Safety Review Board. In its statement, the ACM urged the US government to give the CSRB the resources it needs to take on that investigation. Instead, the Department of Homeland Security disbanded the board, citing “misuse of resources,” and disbanded the AI Safety and Security Board as well. That’s a particular problem because, just as with CrowdStrike, there’s a growing dependence on a small number of vendors. OpenAI’s ChatGPT, Anthropic’s Claude, Google’s Gemini, and Meta’s Llama are the foundation of nearly all enterprise AI applications, says Chuck Herrin, field CISO at security firm F5.
“Our rush to adopt AI without corresponding investment in security and resilience suggests we’re setting ourselves up for potentially catastrophic failures that could make the CrowdStrike incident appear minor in retrospect,” he says. “The CrowdStrike incident required physical access to affected systems for recovery, yet organizations are now creating AI dependencies so deep that manual intervention may become impossible.”