CrowdStrike showed us the risks of over-automation. Will we heed the warning?
Sometimes, disruption is not innovation.
Last month’s CrowdStrike incident was preventable. Even though the company claims to have testing protocols, it still managed to release buggy software in an automated update that caused tremendous damage and inconvenience to thousands of people. There are many factors that contributed to this mistake, and some of them have to do with people, assumptions, and misapplied startup culture.
The computerized systems that run most of the world we depend on (shipping, rail and air freight, flights, hospitals, banks, governments, emergency services, etc.) all have operating systems with specific configurations, and many of these machine configurations are not standardized to each other. Because they are heterogeneous, they do not run, fail, or stay intact in standardized ways. There are different vendors for different needs, machines, industries, and preferences. And within those contexts, even when the same software is used, there are different versions of the same software. A lot of that software relies on automated updates. With the CrowdStrike fiasco, we saw what happens if one of those automated updates goes wrong.
CrowdStrike said the incident was caused by a bug in the “Content Validator,” which is software that evaluates other software to ensure that the data contained in various types of attachments are valid and usable. CrowdStrike’s automated testing process inadvertently released a bug that caused automated updates to crash machines remotely. The crashes were so bad that many machines could not accept a remote automated boot up to fix them. As a result, many machines still need to be manually reset. That costs money and time, and keeps critical services offline, which is dangerous.
CrowdStrike was founded by CEO George Kurtz, who was trained as an accountant and ran Risk Management for McAfee, before becoming its CTO. While at McAfee, Kurtz was no stranger to meddlesome bugs: In 2010, McAfee experienced a glitch with Windows XP that caused “widespread problems,” and preceded the company’s sale later that year to Intel.
Startups often have more nimble ways to produce and deploy code than enterprise companies, but this is not always a positive. If CrowdStrike is similar to other publicly traded companies, its management is focused on quarterly earnings and its accountability to shareholders. Enthusiastic automation of testing might have been an area where CrowdStrike attempted to be frugal, or to “disrupt and automate security,” producing the “change” its CEO desires. It’s been a common startup practice to automate testing internally and then to release sequential updates to fix any bugs, but nowadays other businesses are incorporating it, too, as part of being “agile” and pushing for more and more efficiency.
But startups that grow into larger businesses have both small-business and enterprise clients, and they may not necessarily consider the context of each in the rush to scale. Sometimes thinking like a startup in that way can be dangerous. Rather than initially taking more care in-house, this shifts the testing to their customers’ live businesses instead. The problem with this becomes apparent when an automated update with automated or partially automated testing has bugs that pass through the automated safeguards and cause fatal crashes to machines in industries with services upon which people’s real lives depend, like 911 or hospitals.
We’ve developed a cultural consumer model that is all about replacing the old with the new, and not much about repairing the old. If something is working, the people who are empowered to make decisions may figure that it doesn’t need maintenance or updating.
One-stop solutions like CrowdStrike that offer automated updates make maintenance invisible, and perhaps even offer companies a way to trim their IT departments. Financially, this is a compelling narrative. This approach was successful for CrowdStrike, and it quickly outgrew its startup stature, but perhaps not its culture. Startup culture has to mature when it provides services to enterprise companies running software that controls services and products upon which people’s lives and livelihoods depend. In those use cases, software needs to be resilient and robust. It’s why government projects take so long and have so many steps. Safety requires redundancy, attention, and care. And that means thorough and complete testing.
Data theft and hacking are huge security issues that can have physical repercussions, as with the Las Vegas hotel hack and others. But when the trusted security company crashes everyone’s machines by mistake, it also creates vulnerabilities, giving hackers easy opportunities to breach systems while machines all over the world are compromised by the faulty updates.
During the outages, the FCC posted on X that it was “aware of reports of a systems outage causing disruptions in service, including 911.” But beyond that, the agency didn’t offer the public much intel.
What makes this all the more frustrating is that CrowdStrike did comment on a prior FCC call for shaping policy. In 2023, CrowdStrike responded to the FCC’s request for comment on Data Breach Reporting Requirements with a lengthy submission suggesting ways to protect systems. It concluded with the ill-fated quote, “There’s only one thing to remember about CrowdStrike: We stop breaches.”
There must be something that CrowdStrike was missing internally to actually cause outages instead. To ensure that it won’t happen again, CrowdStrike said that it intends to take measures to localize some testing with developers, add additional checks to the Content Validator, and include other safeguards to prevent this type of problematic content from being deployed in the future.
These protocols could have been implemented in the first place. Why weren’t they?
The security industry has likely had some good reasons for evolving the way it has, but radical disruptions, such as the way CrowdStrike approaches wanting to “change the security industry” (Kurtz’s words), are potentially insecure and expensive for its customers. It’s a hard lesson for the CrowdStrike CEO, who is now being called by Congressional Republicans on the House Homeland Security Committee to testify about how these incidents happened and what can be done to prevent them in the future.
Perhaps this will encourage CrowdStrike and others to add people back into the testing loop. But it seems just as likely that the glitch we saw last month becomes known not as a bug, but rather as an increasingly common feature in our rush to automate. And adding AI without thorough testing will just make things worse.
[Correction, August 2, 2024: A previous version of this story said that the CrowdStrike incident caused breaches. In fact, it caused outages.]