CrowdStrike, the cybersecurity company at the center of massive global IT outages, blamed the meltdown on a bug in its quality-control software that allowed bad data in an update to be sent to millions of computers running Microsoft Windows.
CrowdStrike routinely sends out configuration updates for its Falcon Sensor product, a software suite that monitors and protects the user’s computer against threats and attacks.
Those updates are delivered in two different ways. One is called “Sensor Content,” which directly updates CrowdStrike’s Falcon Sensor and runs at the highest level of access to system resources. Separately, there is “Rapid Response Content” that updates how that sensor behaves to detect malware, allowing for fast response to changing threats.
But a Rapid Response Content update released on the morning of July 19 contained a faulty file that slipped past CrowdStrike’s quality-control software.
CrowdStrike’s incident review indicates that while the company performs both automated and manual testing on Sensor Content, it places “trust in the checks performed in the Content Validator” for Rapid Response Content, a process that had run smoothly until this point.
Because CrowdStrike assumed the Rapid Response Content rollout would not cause problems, the Falcon Sensor loaded the faulty update. That caused an out-of-bounds memory read, a type of error that occurs when a program tries to read data from memory outside the bounds of what it is allowed to access. The read triggered an exception that “could not be gracefully handled, resulting in a Windows operating system crash,” according to the review.
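To make that failure mode concrete, here is a minimal Python sketch of a bounds-checked read. It is an analogy only: Python raises an error rather than reading stray memory, and the names ContentError, read_field and apply_content are hypothetical, not taken from CrowdStrike’s code. The idea is that a program can test an index before using it and reject bad content instead of crashing.

```python
class ContentError(Exception):
    """Raised when content refers to a field that does not exist."""


def read_field(fields, index):
    # Check the index before touching the data. Unchecked native code
    # would skip this test and read whatever memory lies past the end.
    if not 0 <= index < len(fields):
        raise ContentError(
            f"field {index} is out of bounds for {len(fields)} fields"
        )
    return fields[index]


def apply_content(fields, wanted_indexes):
    # Handle the error gracefully: reject the whole update and keep running.
    try:
        return [read_field(fields, i) for i in wanted_indexes]
    except ContentError:
        return None


fields = ["a", "b", "c"]
print(apply_content(fields, [0, 2]))  # every index is in bounds
print(apply_content(fields, [0, 3]))  # index 3 is past the end, so the update is rejected
```

In this sketch a rejected update simply returns None, so the caller can fall back to the last known good content.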
CrowdStrike, a Texas-based company, lost one-fifth of its stock value in the wake of the disaster. The firm promised to reform the way it issues critical content updates.
Specifically, the company said it is planning to implement a “staggered deployment strategy” for future updates, first sending them out to just a handful of machines before a global rollout. This method is known in the industry as a “canary deployment.”
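A staggered rollout of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not CrowdStrike’s actual process: the function name staggered_rollout, the 1, 10 and 100 percent stages, and the apply_update and is_healthy callbacks are all hypothetical.

```python
def staggered_rollout(hosts, apply_update, is_healthy, stages=(1, 10, 100)):
    """Deploy to growing percentages of hosts, halting on an unhealthy batch.

    Returns (updated_hosts, completed); completed is False if the
    rollout was halted before reaching every host.
    """
    updated = []
    for percent in stages:
        # Always include at least one host so the first stage is a real canary.
        target = max(1, len(hosts) * percent // 100)
        batch = hosts[len(updated):target]
        for host in batch:
            apply_update(host)
        updated.extend(batch)
        # Check the batch before widening the rollout.
        if not all(is_healthy(host) for host in batch):
            return updated, False
    return updated, True


hosts = [f"host-{i}" for i in range(200)]
# A bad update that leaves every machine unhealthy stops at the canary stage.
updated, completed = staggered_rollout(hosts, lambda h: None, lambda h: False)
print(len(updated), completed)
```

With 200 machines and these stages, a faulty update would reach only the first small batch before the rollout halts, rather than the entire fleet.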
CrowdStrike will also “enhance existing error handling in the Content Interpreter,” which is part of the Falcon Sensor.
CrowdStrike also promised to have people test its Rapid Response Content, add extra validation checks to the Content Validator, and give customers the option to decide when and where these updates are deployed.
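What layered validation checks might look like can be shown with a short Python sketch. The specific checks, the three-field format and the names validate and CHECKS are assumptions made for illustration; the review does not describe how the Content Validator is built.

```python
def check_not_empty(content):
    return None if content else "content is empty"


def check_field_count(content, expected=3):
    # Hypothetical rule: every update must carry exactly `expected` fields.
    if len(content) != expected:
        return f"expected {expected} fields, got {len(content)}"
    return None


def check_no_blank_fields(content):
    blanks = [i for i, field in enumerate(content) if not str(field).strip()]
    return f"blank fields at positions {blanks}" if blanks else None


CHECKS = [check_not_empty, check_field_count, check_no_blank_fields]


def validate(content, checks=CHECKS):
    """Run every check and collect the failures; an empty list means pass."""
    problems = []
    for check in checks:
        problem = check(content)
        if problem is not None:
            problems.append(problem)
    return problems


print(validate(["a", "b", "c"]))  # passes every check
print(validate(["a", " ", "c"]))  # fails the blank-field check
```

Running every check and reporting all failures, rather than stopping at the first, gives the person reviewing an update a fuller picture before it ships.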
“Nothing is more important to me than the trust and confidence that our customers and partners have put into CrowdStrike,” George Kurtz, the company’s founder and chief executive, said in a statement following the outages.
“As we resolve this incident, you have my commitment to provide full transparency on how this occurred and steps we’re taking to prevent anything like this from happening again.”