

97% of CrowdStrike systems are back online; Microsoft suggests Windows changes

falcon punch —

Kernel access gives security software a lot of power, but not without problems.

A bad update to CrowdStrike’s Falcon security software crashed millions of Windows PCs last week. Credit: CrowdStrike

CrowdStrike CEO George Kurtz said Thursday that 97 percent of all Windows systems running its Falcon sensor software were back online, a week after an update-related outage to the corporate security software delayed flights and took down emergency response systems, among many other disruptions. The update, which caused Windows PCs to throw the dreaded Blue Screen of Death and reboot, affected about 8.5 million systems by Microsoft’s count, leaving roughly 250,000 that still need to be brought back online.

Microsoft VP John Cable said in a blog post that the company has “engaged over 5,000 support engineers working 24×7” to help clean up the mess created by CrowdStrike’s update and hinted at Windows changes that could help—if they don’t run afoul of regulators, anyway.

“This incident shows clearly that Windows must prioritize change and innovation in the area of end-to-end resilience,” wrote Cable. “These improvements must go hand in hand with ongoing improvements in security and be in close cooperation with our many partners, who also care deeply about the security of the Windows ecosystem.”

Cable pointed to VBS enclaves and Azure Attestation as examples of products that could keep Windows secure without requiring kernel-level access, as most Windows-based security products (including CrowdStrike’s Falcon sensor) do now. But he stopped short of outlining what specific changes might be made to Windows, saying only that Microsoft would continue to “harden our platform, and do even more to improve the resiliency of the Windows ecosystem, working openly and collaboratively with the broad security community.”

When running in kernel mode rather than user mode, security software has full access to a system’s hardware and software, which makes it more powerful and flexible. But it also means that a bad update like CrowdStrike’s can do a lot more damage.

Recent versions of macOS have deprecated third-party kernel extensions for exactly this reason, one explanation for why Macs weren’t taken down by the CrowdStrike update. But past efforts by Microsoft to lock third-party security companies out of the Windows kernel—most recently in the Windows Vista era—have been met with pushback from European Commission regulators. That level of skepticism is warranted, given Microsoft’s past (and continuing) record of using Windows’ market position to push its own products and services. Any present-day attempt to restrict third-party vendors’ access to the Windows kernel would be likely to draw similar scrutiny.

Microsoft has also had plenty of its own security problems to deal with recently, to the point that it has promised to restructure the company to make security more of a focus.

CrowdStrike’s aftermath

CrowdStrike has made its own promises in the wake of the outage, including more thorough testing of updates and a phased-rollout system that could prevent a bad update file from causing quite as much trouble as the one last week did. The company’s initial incident report pointed to a lapse in its testing procedures as the cause of the problem.

Meanwhile, recovery continues. Some systems could be fixed simply by rebooting, though in some cases it took as many as 15 restarts; each reboot gives a system another chance to grab the fixed update file before the bad one can crash it. For the rest, IT admins were left to either restore them from backups or delete the bad update file manually. Microsoft published a bootable tool that could help automate the process of deleting that file, but it still required laying hands on every single affected Windows install, whether on a virtual machine or a physical system.

And not all of CrowdStrike’s remediation solutions have been well-received. The company sent out $10 Uber Eats promo codes to cover some of its partners’ “next cup of coffee or late night snack,” which occasioned some eye-rolling on social media sites (the code was also briefly unusable because Uber flagged it as fraudulent, according to a CrowdStrike representative). For context, analytics company Parametrix Insurance estimated the cost of the outage to Fortune 500 companies at somewhere in the realm of $5.4 billion.



CrowdStrike blames testing bugs for security update that took down 8.5M Windows PCs

oops —

Company says it’s improving testing processes to avoid a repeat.

CrowdStrike’s Falcon security software brought down as many as 8.5 million Windows PCs over the weekend. Credit: CrowdStrike

Security firm CrowdStrike has posted a preliminary post-incident report about the botched update to its Falcon security software that caused as many as 8.5 million Windows PCs to crash over the weekend, delaying flights, disrupting emergency response systems, and generally wreaking havoc.

The detailed post explains exactly what happened: At just after midnight Eastern time, CrowdStrike deployed “a content configuration update” to allow its software to “gather telemetry on possible novel threat techniques.” CrowdStrike says that these Rapid Response Content updates are tested before being deployed, and one of those steps involves checking updates using something called the Content Validator. In this case, “a bug in the Content Validator” meant that it failed to detect the “problematic content data” in the update responsible for the crashing systems.

CrowdStrike says it is making changes to its testing and deployment processes to prevent something like this from happening again. The company is specifically including “additional validation checks to the Content Validator” and adding more layers of testing to its process.
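
CrowdStrike hasn’t published what the Content Validator actually checks, so the sketch below is purely hypothetical: a Python illustration of the kind of defensive validation the company describes adding, with the file name, checks, and structure all invented.

```python
# Hypothetical sketch only: CrowdStrike has not published its Content
# Validator, so these names and checks are invented for illustration.
from pathlib import Path

def validate_update(path: Path) -> list[str]:
    """Return a list of problems found in a candidate content update."""
    problems = []
    data = path.read_bytes()
    if not data:
        problems.append("file is empty")
    elif data == b"\x00" * len(data):
        problems.append("file contains only zero bytes")
    # A real validator would go much further: parse the header, verify
    # record counts and signatures, and run the file against a test sensor.
    return problems

issues = validate_update(Path("C-00000291-candidate.sys"))  # invented name
if issues:
    raise SystemExit(f"rejecting update: {issues}")
print("candidate passed basic validation")
```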

The biggest change will probably be “a staggered deployment strategy for Rapid Response Content” going forward. In a staggered deployment system, updates are initially released to a small group of PCs, and availability is then slowly expanded once it becomes clear that the update isn’t causing major problems. Microsoft has used phased rollouts for Windows security and feature updates since a couple of major hiccups during the Windows 10 era. To this end, CrowdStrike will “improve monitoring for both sensor and system performance” to help “guide a phased rollout.”
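
Mechanically, a staggered rollout is just a loop that widens the blast radius only while crash telemetry stays clean. Here’s a minimal sketch assuming a hypothetical fleet-management API; deploy, crash_rate, and rollback are stubs, and the ring sizes, soak time, and threshold are invented:

```python
# Minimal sketch of a staggered (ringed) rollout. The deploy/crash_rate/
# rollback functions are stubs standing in for a real fleet-management API.
import time

RINGS = [0.01, 0.05, 0.25, 1.00]   # cumulative fraction of the fleet per ring
SOAK_SECONDS = 5                   # toy value; real soak times would be hours
CRASH_THRESHOLD = 0.001            # halt if more than 0.1% of hosts crash

def deploy(update_id, hosts):      # stub: push the update to these hosts
    print(f"deploying {update_id} to {len(hosts)} hosts")

def crash_rate(update_id):         # stub: read crash telemetry for this update
    return 0.0

def rollback(update_id, hosts):    # stub: revert the update on these hosts
    print(f"rolling back {update_id} on {len(hosts)} hosts")

def staggered_rollout(update_id, fleet):
    deployed = 0
    for fraction in RINGS:
        target = int(len(fleet) * fraction)
        deploy(update_id, fleet[deployed:target])
        deployed = target
        time.sleep(SOAK_SECONDS)               # let telemetry accumulate
        if crash_rate(update_id) > CRASH_THRESHOLD:
            rollback(update_id, fleet[:deployed])
            raise RuntimeError("rollout halted: elevated crash rate")

staggered_rollout("rapid-response-291", [f"host-{i}" for i in range(10_000)])
```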

CrowdStrike says it will also give its customers more control over when Rapid Response Content updates are deployed so that updates that take down millions of systems aren’t deployed at (say) midnight when fewer people are around to notice or fix things. Customers will also be able to subscribe to release notes about these updates.

Recovery of affected systems is ongoing. Rebooting systems multiple times (as many as 15, according to Microsoft) can give them enough time to grab a new, non-broken update file before they crash, resolving the issue. Microsoft has also created tools that can boot systems via USB or a network so that the bad update file can be deleted, allowing systems to restart normally.

In addition to this preliminary incident report, CrowdStrike says it will release “the full Root Cause Analysis” once it has finished investigating the issue.



CrowdStrike fixes start at “reboot up to 15 times” and get more complex from there

turning it off and back on again, and again, and again —

Admins can also restore backups or manually delete CrowdStrike’s buggy driver.


Airlines, payment processors, 911 call centers, TV networks, and other businesses have been scrambling this morning after a buggy update to CrowdStrike’s Falcon security software caused Windows-based systems to crash with a dreaded blue screen of death (BSOD) error message.

We’re updating our story about the outage with new details as we have them. Microsoft and CrowdStrike both say that “the affected update has been pulled,” so what’s most important for IT admins in the short term is getting their systems back up and running again. According to guidance from Microsoft, fixes range from annoying but easy to incredibly time-consuming and complex, depending on the number of systems you have to fix and the way your systems are configured.

Microsoft’s Azure status page outlines several fixes. The first and easiest is simply to reboot affected machines over and over, which gives them multiple chances to try to grab CrowdStrike’s non-broken update before the bad driver can cause the BSOD. Microsoft says that some of its customers have had to reboot their systems as many as 15 times to pull down the update.

Early guidance for fixing the CrowdStrike bug is simply to reboot systems over and over again so that they can try to grab a non-broken update. Credit: Microsoft

If rebooting doesn’t work

If rebooting multiple times isn’t fixing your problem, Microsoft recommends restoring your systems using a backup from before 04:09 UTC on July 19 (just after midnight on Friday, Eastern time), when CrowdStrike began pushing out the buggy update. CrowdStrike says a reverted version of the file was deployed at 05:27 UTC.
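
If you’re scripting the restore, the selection logic amounts to picking the newest backup stamped before that cutoff. A minimal sketch, with an invented backup catalog standing in for whatever your backup system actually reports:

```python
# Sketch: choose the newest backup taken before the bad update went out.
# The backup names and timestamps below are invented for illustration.
from datetime import datetime, timezone

CUTOFF = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)  # per the guidance above

backups = {  # hypothetical catalog: backup name -> time taken
    "nightly-0718": datetime(2024, 7, 18, 3, 0, tzinfo=timezone.utc),
    "nightly-0719": datetime(2024, 7, 19, 3, 0, tzinfo=timezone.utc),
    "hourly-0719-05": datetime(2024, 7, 19, 5, 0, tzinfo=timezone.utc),  # too late
}

safe = {name: ts for name, ts in backups.items() if ts < CUTOFF}
newest = max(safe, key=safe.get)
print(f"restore from: {newest} ({safe[newest].isoformat()})")
```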

If these simpler fixes don’t work, you may need to boot your machines into Safe Mode so you can manually delete the file that’s causing the BSOD errors. For virtual machines, Microsoft recommends attaching the virtual disk to a known-working repair VM so the file can be deleted, then reattaching the virtual disk to its original VM.

The file in question is a CrowdStrike driver located at Windows/System32/Drivers/CrowdStrike/C-00000291*.sys. Once it’s gone, the machine should boot normally and grab a non-broken version of the driver.
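
From Safe Mode or a repair environment, the deletion itself is a simple glob-and-unlink. A minimal sketch, assuming the affected Windows volume is mounted at C: (substitute the appropriate drive letter if the disk is attached to a repair VM):

```python
# Sketch: delete the bad CrowdStrike channel file from a recovery
# environment. Assumes the affected volume is mounted at C:.
from pathlib import Path

DRIVERS = Path("C:/Windows/System32/Drivers/CrowdStrike")

for bad in DRIVERS.glob("C-00000291*.sys"):
    print(f"deleting {bad}")
    bad.unlink()
```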

Deleting that file on each and every one of your affected systems individually is time-consuming enough, but it’s even more time-consuming for customers using Microsoft’s BitLocker drive encryption to protect data at rest. Before you can delete the file on those systems, you’ll need the recovery key that unlocks those encrypted disks and makes them readable (normally, this process is invisible, because the system can just read the key stored in a physical or virtual TPM).

This can cause problems for admins who aren’t using key management to store their recovery keys, since (by design!) you can’t access a drive without its recovery key. Cryptography and infrastructure engineer Tony Arcieri compared the situation on Mastodon to a “self-inflicted ransomware attack,” where an attacker encrypts the disks on your systems and withholds the key until they get paid.

And even if you do have a recovery key, your key management server might also be affected by the CrowdStrike bug.
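
For admins who did escrow their keys somewhere still reachable, recovery becomes a lookup problem: match the key ID shown on each machine’s BitLocker recovery screen to its escrowed 48-digit recovery password. A minimal sketch, assuming a hypothetical CSV export from your key-management system (real exports vary by setup):

```python
# Sketch: look up BitLocker recovery passwords from an escrowed CSV export.
# The file name and column layout are invented for illustration.
import csv

def load_keys(path):
    """Map each key ID to its 48-digit recovery password."""
    with open(path, newline="") as f:
        return {row["key_id"]: row["recovery_password"] for row in csv.DictReader(f)}

keys = load_keys("recovery_keys_export.csv")  # hypothetical export file
key_id = input("Key ID shown on the BitLocker recovery screen: ").strip()
print(keys.get(key_id, "not found; check your key-management server"))
```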

We’ll continue to track recommendations from Microsoft and CrowdStrike about fixes as each company’s respective status pages are updated.

“We understand the gravity of the situation and are deeply sorry for the inconvenience and disruption,” wrote CrowdStrike CEO George Kurtz on X, formerly Twitter. “We are working with all impacted customers to ensure that systems are back up and they can deliver the services their customers are counting on.”
