Cybersecurity vendor CrowdStrike initiated a series of computer system outages across the world on Friday, July 19, disrupting nearly every industry and sowing chaos at airports, financial institutions, and healthcare systems, among others.
At issue was a flawed update to CrowdStrike Falcon, the company’s popular endpoint detection and response (EDR) platform, which crashed Windows machines and sent them into an endless reboot cycle, taking down servers and rendering ‘blue screens of death’ on displays across the world.
How did the CrowdStrike outage unfold?
Australian businesses were among the first to report encountering difficulties on Friday morning, with some continuing to encounter difficulties throughout the day. Travelers at Sydney Airport experienced delays and cancellations. At 6pm Australian Eastern Standard Time (08:00 UTC), Bank Australia posted an announcement to its home page saying that its contact center services were still experiencing problems.
Businesses across the globe followed suit, as their days began. Travelers at airports in Hong Kong, India, Berlin, and Amsterdam encountered delays and cancelations. The Federal Aviation Administration reported that US airlines grounded all flights for a period of time, according to the New York Times.
What has been the impact of the CrowdStrike outage?
As one of the largest cybersecurity companies, CrowdStrike’s software is very popular among businesses across the globe. For example, it is estimated that over half of Fortune 500 companies use its security products.
Because of this, fallout from the flawed update has been widespread and substantial, with some calling it the “largest IT outage in history.”
To provide scope for this, more than 3,000 flights within, into, or out of the US were canceled on July 19, with more than 11,000 delayed. Planes continued to be grounded in the days since, with nearly 2,500 flights canceled within, into, or out of the US, and more than 38,000 delayed, three days after the outage occurred.
The outage also significantly impacted the healthcare industry, with some healthcare systems and hospitals postponing all or most procedures and clinicians resorting to pen and paper, unable to access EHRs.
Given the nature of the fix for many enterprises, and the popularity of CrowdStrike’s software, IT organizations have been working around the clock to restore their systems, with many still mired in doing so days after the initial faulty update was served up by CrowdStrike.
What caused the CrowdStrike outage?
In a blog post on July 19, CrowdStrike CEO George Kurtz apologized to the company’s customers and partners for crashing their Windows systems. Separately, the company provided initial details about what caused the disaster.
According to CrowdStrike, a defective content update to its Falcon EDR platform was pushed to Windows machines at 04:09 UTC (0:09 ET) on Friday, July 19. CrowdStrike typically pushes updates to configuration files (called “Channel Files”) for Falcon endpoint sensors several times a day.
The defect that triggered the outage was in Channel File 291, which is stored in “C:WindowsSystem32driversCrowdStrike” with a filename beginning “C-00000291-” and ending “.sys”. Channel File 291 passes information to the Falcon sensor about how to evaluate “named pipe” execution, which Windows systems use for intersystem or interprocess communication. These commands are not inherently malicious but can be misused.
“The update that occurred at 04:09 UTC was designed to target newly observed, malicious named pipes being used by common C2 [command and control] frameworks in cyberattacks,” the technical post explained.
However, according to CrowdStrike, “The configuration update triggered a logic error that resulted in an operating system crash.”
Upon automatic reboot, the Windows systems with the defective Channel File 291 installed would crash again, causing an endless reboot cycle.
The flawed update only impacted machines running Windows. Linux and MacOS machines using CrowdStrike were unaffected, according to the company.
How has CrowdStrike responded?
According to the company, CrowdStrike pushed out a fix removing the defective content in Channel File 291 just 79 minutes after the initial flawed update was sent. Machines that had not yet updated to the faulty Channel File 291 update would not be impacted by the flaw. But those machines that had already downloaded the defective content weren’t so lucky.
To remediate those systems caught up in endless reboot, CrowdStrike published another blog post with a far longer set of actions to perform. Included were suggestions for remotely detecting and automatically recovering affected systems, with detailed sets of instructions for temporary workarounds for affected physical machines or virtual servers, including manual reboots.
How has recovery from the outage fared?
For many organizations, recovering from the outage is an ongoing issue. With one suggested solution for remedying the defective content being to reboot each machine manually into safe mode, deleting the defective file, and restarting the computer, doing so at scale will remain a challenge.
It has been noted that some organizations with hardware refresh plans in place are considering accelerating those plans as a remedy to replace affected machines rather than commit the resources necessary to conduct the manual fix to their fleets.
What is CrowdStrike Falcon?
CrowdStrike Falcon is endpoint detection and response (EDR) software that monitors end-user hardware devices across a network for suspicious activities and behavior, reacting automatically to block perceived threats and saving forensics data for further investigation.
Like all EDR platforms, CrowdStrike has deep visibility into everything happening on an endpoint device — processes, changes to registry settings, file and network activity — which it combines with data aggregation and analytics capabilities to recognize and counter threats by either automated processes or human intervention.
Because of this, Falcon is privileged software with deep administrative access to the systems it monitors, making it tightly integrated with core operating systems, with the ability to shut down activities that it deems malicious. This tight integration proved to be a weakness for IT organizations in this instance, rendering Windows machines inoperable due to the flawed Falcon update.
The company has also introduced AI-powered automation capabilities into Falcon for IT, to help bridge the gap between IT and security operations, according to the company.
What has been the fallout of CrowdStrike’s failure?
In addition to dealing with fixing their Windows machines, IT leaders and their teams are evaluating lessons that can be gleaned from the incident, with many looking at ways to avoid single points of failure and re-evaluating their cloud strategies.
As for CrowdStrike, US Congress has called on CEO Kurtz to testify at a hearing about the tech outage. According to the New York Times, Kurtz was sent a letter by Representative Mark Green (R-Tenn.), chairman of the Homeland Security Committee, and Representative Andrew Garbarino (R-NY).
Americans “deserve to know in detail how this incident happened and the mitigation steps CrowdStrike is taking,” they wrote, according to the New York Times.
Ongoing coverage of the CrowdStrike failure
- July 19: Blue screen of death strikes crowd of CrowdStrike servers
- July 20: CrowdStrike CEO apologizes for crashing IT systems around the world, details fix
- July 20: Put not your trust in Windows — or CrowdStrike
- July 22: CrowdStrike incident has CIOs rethinking their cloud strategies
- July 22: Microsoft pins Windows outage on EU-enforced ‘interoperability’ deal
- July 22: Focusing open source on security, not ideology
- July 22: Early IT takeaways from the CrowdStrike outage
Read More from This Article: CrowdStrike failure: What you need to know
Source: News