CrowdStrike


TSA silent on CrowdStrike’s claim Delta skipped required security update


We’re all trying to find the guy who did this

CrowdStrike and Delta’s legal battle has begun. Will Microsoft be sued next?

Travelers sit with their luggage on the check-in floor of the Delta Air Lines terminal at Los Angeles International Airport (LAX) on July 23, 2024 in Los Angeles, California. Credit: Mario Tama / Staff | Getty Images News

Delta and CrowdStrike have locked legal horns, threatening to drag out the aftermath of the worst IT outage in history for months or possibly years.

Each refuses to be blamed for Delta’s substantial losses following a global IT outage caused by CrowdStrike suddenly pushing a flawed security update despite Delta and many other customers turning off auto-updates.

CrowdStrike has since given customers more control over updates and made other commitments to ensure an outage of that scale will never happen again, but Delta isn’t satisfied. The airline has accused CrowdStrike of willfully causing its losses by knowingly deceiving customers, specifically by failing to disclose an unauthorized door into their operating systems that enabled the outage.

In a court filing last Friday, Delta alleged that CrowdStrike should be on the hook for the airline’s more than $500 million in losses—partly because CrowdStrike has admitted that it should have done more testing and staggered deployments to catch the bug before a wide-scale rollout that disrupted businesses worldwide.

“As a result of CrowdStrike’s failure to use a staged deployment and without rollback capabilities, the Faulty Update caused widespread and catastrophic damage to millions of computers, including Delta’s systems, crashing Delta’s workstations, servers, and redundancy systems,” Delta’s complaint said.

Delta has further alleged that CrowdStrike postured as a certified best-in-class security provider who “never cuts corners” while secretly designing its software to bypass Microsoft security certifications in order to make changes at the core of Delta’s computing systems without Delta’s knowledge.

“Delta would have never agreed to such a dangerous process had CrowdStrike disclosed it,” Delta’s complaint said.

In testimony to Congress, CrowdStrike executive Adam Meyers suggested that the faulty update did follow standard protocols. He explained that “CrowdStrike’s software code is certified by Microsoft” and that it’s “updated less frequently,” and “new configurations are sent with rapid occurrence to protect against threats as they evolve,” not to bypass security checks, as Delta alleged.

But by misleading customers about these security practices, Delta alleged, CrowdStrike put “profit ahead of protection and software stability.” As Delta sees it, CrowdStrike built in the unauthorized door so that it could claim to resolve security issues more quickly than competitors. And if a court agrees that CrowdStrike’s alleged failure to follow standard industry best practices does constitute, at the very least, “gross negligence,” Delta could win.

“While we aimed to reach a business resolution that puts customers first, Delta has chosen a different path,” CrowdStrike’s spokesperson told Ars. “Delta’s claims are based on disproven misinformation, demonstrate a lack of understanding of how modern cybersecurity works, and reflect a desperate attempt to shift blame for its slow recovery away from its failure to modernize its antiquated IT infrastructure. We have filed for a declaratory judgment to make it clear that CrowdStrike did not cause the harm that Delta claims and they repeatedly refused assistance from both CrowdStrike and Microsoft. Any claims of gross negligence and willful misconduct have no basis in fact.”

CrowdStrike sues to expose Delta’s IT flaws

In its court filing, however, CrowdStrike said there’s much more to the story than that. It has accused Delta of failing to follow laws, including best practices established by the Transportation Security Administration (TSA).

While many CrowdStrike customers got systems back up and running within a day of the outage, Delta’s issues stretched on painfully for five days, disrupting travel for a million customers. According to CrowdStrike, the prolonged delay was not due to CrowdStrike failing to provide adequate assistance but allegedly to Delta’s own negligent failure to comply with TSA requirements designed to ensure that no major airline ever experiences prolonged system outages.

“Despite the immediate response from CrowdStrike, it was Delta’s own response and IT infrastructure that caused delays in Delta’s ability to resume normal operation, resulting in a longer recovery period than other major airlines,” CrowdStrike’s complaint said.

In March 2023, the TSA added a cybersecurity emergency amendment to its cybersecurity programs. The amendment required airlines like Delta to develop “policies and controls to ensure that operational technology systems can continue to safely operate in the event that an information technology system has been compromised,” CrowdStrike’s complaint said.

Complying with the amendment ensured that airlines could “timely” respond to any exploitation of their cybersecurity or operating systems, CrowdStrike explained.

CrowdStrike realized that Delta was allegedly non-compliant with the TSA requirement and other laws when its “efforts to help remediate the issues revealed” alleged “technological shortcomings and failures to follow security best practices, including outdated IT systems, issues in Delta’s active directory environment, and thousands of compromised passwords.”

TSA declined Ars’ request to comment on whether it has any checks in place to ensure compliance with the emergency amendment.

While TSA has made no indication so far that it intends to investigate CrowdStrike’s claims, the Department of Transportation (DOT) is currently investigating Delta’s seemingly inferior customer service during the outage. That probe could lead to monetary fines, potentially further expanding Delta’s losses.

In a statement, DOT Secretary Pete Buttigieg said, “We have made clear to Delta that they must take care of their passengers and honor their customer service commitments. This is not just the right thing to do, it’s the law, and our department will leverage the full extent of our investigative and enforcement power to ensure the rights of Delta’s passengers are upheld.”

On X (formerly Twitter), Buttigieg said that the probe was sparked after DOT received hundreds of complaints about Delta’s response. A few days later, Buttigieg confirmed that the probe would “ensure the airline is following the law and taking care of its passengers during continued widespread disruptions.” But DOT declined Ars’ request to comment on whether DOT was investigating Delta’s alleged non-compliance with TSA security requirements, only noting that “TSA is not part of DOT.”

Will Microsoft be sued next?

Delta has been threatening legal action over the CrowdStrike outage since August, when Delta confirmed in an SEC filing that the outage caused “approximately 7,000 flight cancellations over five days.” At that time, Delta CEO Ed Bastian announced, “We are pursuing legal claims against CrowdStrike and Microsoft to recover damages caused by the outage, which total at least $500 million.”

But Delta’s lawsuit Friday notably does not name Microsoft as a defendant.

Ars could not immediately reach Delta’s lawyer, David Boies, to confirm if another lawsuit may be coming or if that legal threat to Microsoft was dropped.

It could be that Microsoft dissuaded Delta from filing a complaint. In August, Microsoft immediately pushed back on Delta’s claims that the tech giant was in any way liable for Delta’s losses, The Register reported. In a letter to Boies, Microsoft lawyer Mark Cheffo wrote that Microsoft “empathizes” with Delta, but Delta’s public comments blaming Microsoft for the outage are “incomplete, false, misleading, and damaging to Microsoft and its reputation.”

“The truth is very different from the false picture you and Delta have sought to paint,” Cheffo wrote, noting that Microsoft did not cause the outage and Delta repeatedly turned down Microsoft’s offers to help restore its systems. That includes one instance where a Delta employee allegedly responded to a Microsoft inquiry three days after the outage by saying that Delta was “all good.” Additionally, a message from Microsoft CEO Satya Nadella to Delta’s Bastian allegedly went unanswered.

Cheffo alleged that Delta was cagey about accepting Microsoft’s help because “the IT system it was most having trouble restoring—its crew-tracking and scheduling system—was being serviced by other technology providers, such as IBM, because it runs on those providers’ systems, and not Microsoft Windows or Azure.”

According to Cheffo, Microsoft was “surprised” when Delta threatened to sue since the issues seemed to be with Delta’s IT infrastructure, not Microsoft’s services.

“Microsoft continues to investigate the circumstances surrounding the CrowdStrike incident to understand why other airlines were able to fully restore business operations so much faster than Delta, including American Airlines and United Airlines,” Cheffo wrote. “Our preliminary review suggests that Delta, unlike its competitors, apparently has not modernized its IT infrastructure, either for the benefit of its customers or for its pilots and flight attendants.”

At that time, Cheffo told Boies that Microsoft planned to “vigorously defend” against any litigation. Additionally, Microsoft’s lawyer demanded that Delta preserve documents, including ones showing “the extent to which non-Microsoft systems or software, including systems provided and/or designed by IBM, Oracle, Amazon Web Services, Kyndryl or others, and systems using other operating systems, such as Linux, contributed to the interruption of Delta’s business operations between July 19 and July 24.”

It seems possible that Cheffo’s letter spooked Delta out of naming Microsoft as a defendant in the lawsuit over the outage, potentially to avoid a well-resourced opponent or to save public face if Microsoft’s proposed discovery threatened to further expose Delta’s allegedly flawed IT infrastructure.

Microsoft declined Ars’ request to comment.

CrowdStrike says TOS severely limits damages

CrowdStrike appears to be echoing Microsoft’s defense tactics, arguing that Delta struggled to recover due to its own IT failures.

According to CrowdStrike, even if Delta’s breach of contract claims are valid, CrowdStrike’s terms of service severely limit damages. At most, CrowdStrike’s terms stipulate, damages owed to Delta may be “two times the value of the fees paid to service provider for the relevant subscription services subscription term,” which is likely substantially less than $500 million.

And Delta wants much more than lost revenue returned. Beyond the $500 million in losses, the airline has asked a Georgia court to calculate punitive damages and to compensate Delta for future revenue losses, as its reputation took a hit from the public backlash over Delta’s lackluster response to the outage.

“CrowdStrike must ‘own’ the disaster it created,” Delta’s complaint said, alleging that “CrowdStrike failed to exercise the slight diligence or care of the degree that persons of common sense, however inattentive they may be, would use under the same or similar circumstances.”

CrowdStrike is hoping a US district court jury will agree that Delta was the one that dropped the ball the most as the world scrambled to recover from the outage. The cybersecurity company has asked the jury to declare that any potential damages are limited by CrowdStrike’s subscriber terms and that “CrowdStrike was not grossly negligent and did not commit willful misconduct in any way.”

This story was updated to include CrowdStrike’s statement.


Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.



Microsoft to host security summit after CrowdStrike disaster

Bugging out —

Redmond wants to improve the resilience of Windows to buggy software.


Microsoft is stepping up its plans to make Windows more resilient to buggy software after a botched CrowdStrike update took down millions of PCs and servers in a global IT outage.

The tech giant has in the past month intensified talks with partners about adapting the security procedures around its operating system to better withstand the kind of software error that crashed 8.5 million Windows devices on July 19.

Critics say that any changes by Microsoft would amount to a concession of shortcomings in Windows’ handling of third-party security software that could have been addressed sooner.

Yet they would also prove controversial among security vendors that would have to make radical changes to their products, and force many Microsoft customers to adapt their software.

Last month’s outages—which are estimated to have caused billions of dollars in damages after grounding thousands of flights and disrupting hospital appointments worldwide—heightened scrutiny from regulators and business leaders over the extent of access that third-party software vendors have to the core, or kernel, of Windows operating systems.

Microsoft will host a summit next month for government representatives and cyber security companies, including CrowdStrike, to “discuss concrete steps we will all take to improve security and resiliency for our joint customers,” Microsoft said on Friday.

The gathering will take place on September 10 at Microsoft’s headquarters near Seattle, it said in a blog post.

Bugs in the kernel can quickly crash an entire operating system, triggering the millions of “blue screens of death” that appeared around the globe after CrowdStrike’s faulty software update was sent out to clients’ devices.

Microsoft told the Financial Times it was considering several options to make its systems more stable and had not ruled out completely blocking access to the Windows kernel—an option some rivals fear would put their software at a disadvantage to the company’s internal security product, Microsoft Defender.

“All of the competitors are concerned that [Microsoft] will use this to prefer their own products over third-party alternatives,” said Ryan Kalember, head of cyber security strategy at Proofpoint.

Microsoft may also demand new testing procedures from cyber security vendors rather than adapting the Windows system itself.

Apple, which was not hit by the outages, blocks all third-party providers from accessing the kernel of its MacOS operating system, forcing them to operate in the more limited “user-mode.”

Microsoft has previously said it could not do the same, after coming to an understanding with the European Commission in 2009 that it would give third parties the same access to its systems as that for Microsoft Defender.

Some experts said, however, that this voluntary commitment to the EU had not tied Microsoft’s hands in the way it claimed, arguing that the company had always been free to make the changes now under consideration.

“These are technical decisions of Microsoft that were not part of [the arrangement],” said Thomas Graf, a partner at Cleary Gottlieb in Brussels who was involved in the case.

“The text [of the understanding] does not require them to give access to the kernel,” added AJ Grotto, a former senior director for cyber security policy at the White House.

Grotto said Microsoft shared some of the blame for the July disruption since the outages would not have been possible without its decision to allow access to the kernel.

Nevertheless, while it might boost a system’s resilience, blocking kernel access could also bring “real trade-offs” for the compatibility with other software that had made Windows so popular among business customers, Forrester analyst Allie Mellen said.

“That would be a fundamental shift for Microsoft’s philosophy and business model,” she added.

Operating exclusively outside the kernel may lower the risk of triggering mass outages but it was also “very limiting” for security vendors and could make their products “less effective” against hackers, Mellen added.

Operating within the kernel gave security companies more information about potential threats and enabled their defensive tools to activate before malware could take hold, she added.

An alternative option could be to replicate the model used by the open-source operating system Linux, which uses a filtering mechanism that creates a segregated environment within the kernel in which software, including cyber defense tools, can run.

But the complexity of overhauling how other security software works with Windows means that any changes will be hard for regulators to police and Microsoft will have strong incentives to favor its own products, rivals said.

It “sounds good on paper, but the devil is in the details,” said Matthew Prince, chief executive of digital services group Cloudflare.

© 2024 The Financial Times Ltd. All rights reserved. Not to be redistributed, copied, or modified in any way.



Parody site ClownStrike refused to bow to CrowdStrike’s bogus DMCA takedown


Doesn’t CrowdStrike have more important things to do right now than try to take down a parody site?

That’s what IT consultant David Senk wondered when CrowdStrike sent a Digital Millennium Copyright Act (DMCA) takedown notice targeting his parody site ClownStrike.

Senk created ClownStrike in the aftermath of the largest IT outage the world has ever seen—which CrowdStrike blamed on a buggy security update that shut down systems and incited prolonged chaos in airports, hospitals, and businesses worldwide.

Although Senk wasn’t personally impacted by the outage, he told Ars he is “a proponent of decentralization.” He seized the opportunity to mock “CrowdStrike’s ability to cause literal billions of dollars of damage” because he viewed this as “collateral from the incredible amount of ‘centralization’ in the tech industry.”

Senk set up the parody site at clownstrike.lol on July 24, and its design is simple. It shows the CrowdStrike logo fading into a cartoon clown, with circus music blasting throughout the transition. For the first 48 hours of its existence, the site used an unaltered version of CrowdStrike’s Falcon logo, which is used for its cybersecurity platform, but Senk later added a rainbow propeller hat to the falcon’s head.

“I put the site up initially just to be silly,” Senk told Ars, noting that he’s a bit “old-school” and has “always loved parody sites” (like this one).

It was all fun and games, but on July 31, Senk received a DMCA notice from Cloudflare’s trust and safety team, which was then hosting the parody site. The notice informed Senk that CSC Digital Brand Services’ global anti-fraud team, on behalf of CrowdStrike, was requesting the immediate removal of the CrowdStrike logo from the parody site, or else Senk risked Cloudflare taking down the whole site.

Senk immediately felt the takedown was bogus. His site was obviously a parody, which he felt should have made his use of the CrowdStrike logos—altered or not—fair use. He promptly responded to Cloudflare to contest the notice, but Cloudflare did not respond to or even acknowledge receipt of his counter notice. Instead, Cloudflare sent a second email warning Senk of the alleged infringement, and once again failed to respond to his counter notice.

This left Senk with little choice but to relocate his parody site “somewhere less-susceptible to DMCA takedown requests,” Senk told Ars, which ended up being a Hetzner server in Finland.

Currently on the ClownStrike site, when you click a CSC logo altered with a clown wig, you can find Senk venting about “corporate cyberbullies” taking down “content that they disagree with” and calling Cloudflare’s counter notice system “hilariously ineffective.”

“The DMCA requires service providers to ‘act expeditiously to remove or disable access to the infringing material,’ yet it gives those same ‘service providers’ 14 days to restore access in the event of a counternotice!” Senk complained. “The DMCA, like much American legislation, is heavily biased towards corporations instead of the actual living, breathing citizens of the country.”

Reached for comment, CrowdStrike declined to comment on ClownStrike’s takedown directly. But it seems like the takedown notice probably never should have been sent to Senk. His parody site likely got swept up in CrowdStrike’s anti-fraud efforts to stop bad actors attempting to take advantage of the global IT outage by deceptively using CrowdStrike’s logo on malicious sites.

“As part of our proactive fraud management activities, CrowdStrike’s anti-fraud partners have issued more than 500 takedown notices in the last two weeks to help prevent bad actors from exploiting current events,” CrowdStrike’s statement said. “These actions are taken to help protect customers and the industry from phishing sites and malicious activity. While parody sites are not the intended target of these efforts, it’s possible for such sites to be inadvertently impacted. We will review the process and, where appropriate, evolve ongoing anti-fraud activities.”



CrowdStrike claps back at Delta, says airline rejected offers for help

Who’s going to pay for this mess? —

Delta is creating a “misleading narrative,” according to CrowdStrike’s lawyers.

Travelers from France wait on their delayed flight on the check-in floor of the Delta Air Lines terminal at Los Angeles International Airport (LAX) on July 23, 2024 in Los Angeles, California.

CrowdStrike has hit back at Delta Air Lines’ threat of litigation against the cyber security company over a botched software update that grounded thousands of flights, denying it was responsible for the carrier’s own IT decisions and days-long disruption.

In a letter on Sunday, lawyers for CrowdStrike argued that the US carrier had created a “misleading narrative” that the cyber security firm was “grossly negligent” in an incident that the airline has said will cost it $500 million.

Delta took days longer than its rivals to recover when CrowdStrike’s update brought down millions of Windows computers around the world last month. The airline has alerted the cyber security company that it plans to seek damages for the disruptions and hired litigation firm Boies Schiller Flexner.

CrowdStrike addressed Sunday’s letter to the law firm, whose chair, David Boies, has previously represented the US government in its antitrust case against Microsoft, as well as Harvey Weinstein, among other prominent clients.

Microsoft has estimated that about 8.5 million Windows devices were hit by the faulty update, which stranded airline passengers, interrupted hospital appointments and took broadcasters off air around the world. CrowdStrike said last week that 99 percent of Windows devices running the affected Falcon software were now back online.

Major US airlines Delta, United, and American briefly grounded their aircraft on the morning of July 19. But while United and American were able to restore their operations over the weekend, Delta’s flight disruptions continued well into the following week.

The Atlanta-based carrier in the end canceled more than 6,000 flights, triggering an investigation from the US Department of Transportation amid claims of poor customer service during the operational chaos.

CrowdStrike’s lawyer, Michael Carlinsky, co-managing partner of Quinn Emanuel Urquhart & Sullivan, wrote that if it pursued legal action, Delta Air Lines would have to explain why its competitors were able to restore their operations much faster.

He added: “Should Delta pursue this path, Delta will have to explain to the public, its shareholders, and ultimately a jury why CrowdStrike took responsibility for its actions—swiftly, transparently and constructively—while Delta did not.”

CrowdStrike also claimed that Delta’s leadership had ignored and rejected offers for help: “CrowdStrike’s CEO personally reached out to Delta’s CEO to offer onsite assistance, but received no response. CrowdStrike followed up with Delta on the offer for onsite support and was told that the onsite resources were not needed.”

Delta Chief Executive Ed Bastian said last week that CrowdStrike had not “offered anything” to make up for the disruption at the airline. “Free consulting advice to help us—that’s the extent of it,” he told CNBC on Wednesday.

While Bastian has said that the disruption would cost Delta $500 million, CrowdStrike insisted that “any liability by CrowdStrike is contractually capped at an amount in the single-digit millions.”

A spokesperson for CrowdStrike accused Delta of “public posturing about potentially bringing a meritless lawsuit against CrowdStrike” and said it hoped the airline would “agree to work cooperatively to find a resolution.”

Delta Air Lines declined to comment.

© 2024 The Financial Times Ltd. All rights reserved. Not to be redistributed, copied, or modified in any way.



97% of CrowdStrike systems are back online; Microsoft suggests Windows changes

falcon punch —

Kernel access gives security software a lot of power, but not without problems.

A bad update to CrowdStrike’s Falcon security software crashed millions of Windows PCs last week. Credit: CrowdStrike

CrowdStrike CEO George Kurtz said Thursday that 97 percent of all Windows systems running its Falcon sensor software were back online, a week after an update-related outage to the corporate security software delayed flights and took down emergency response systems, among many other disruptions. The update, which caused Windows PCs to throw the dreaded Blue Screen of Death and reboot, affected about 8.5 million systems by Microsoft’s count, leaving roughly 250,000 that still need to be brought back online.

Microsoft VP John Cable said in a blog post that the company has “engaged over 5,000 support engineers working 24×7” to help clean up the mess created by CrowdStrike’s update and hinted at Windows changes that could help—if they don’t run afoul of regulators, anyway.

“This incident shows clearly that Windows must prioritize change and innovation in the area of end-to-end resilience,” wrote Cable. “These improvements must go hand in hand with ongoing improvements in security and be in close cooperation with our many partners, who also care deeply about the security of the Windows ecosystem.”

Cable pointed to VBS enclaves and Azure Attestation as examples of products that could keep Windows secure without requiring kernel-level access, as most Windows-based security products (including CrowdStrike’s Falcon sensor) do now. But he stopped short of outlining what specific changes might be made to Windows, saying only that Microsoft would continue to “harden our platform, and do even more to improve the resiliency of the Windows ecosystem, working openly and collaboratively with the broad security community.”

When running in kernel mode rather than user mode, security software has full access to a system’s hardware and software, which makes it more powerful and flexible; this also means that a bad update like CrowdStrike’s can cause a lot more problems.

Recent versions of macOS have deprecated third-party kernel extensions for exactly this reason, one explanation for why Macs weren’t taken down by the CrowdStrike update. But past efforts by Microsoft to lock third-party security companies out of the Windows kernel—most recently in the Windows Vista era—have been met with pushback from European Commission regulators. That level of skepticism is warranted, given Microsoft’s past (and continuing) record of using Windows’ market position to push its own products and services. Any present-day attempt to restrict third-party vendors’ access to the Windows kernel would be likely to draw similar scrutiny.

Microsoft has also had plenty of its own security problems to deal with recently, to the point that it has promised to restructure the company to make security more of a focus.

CrowdStrike’s aftermath

CrowdStrike has made its own promises in the wake of the outage, including more thorough testing of updates and a phased-rollout system that could prevent a bad update file from causing quite as much trouble as the one last week did. The company’s initial incident report pointed to a lapse in its testing procedures as the cause of the problem.

Meanwhile, recovery continues. Some systems could be fixed simply by rebooting, though it sometimes took as many as 15 reboots—this could give systems a chance to grab a new update file before they could crash. For the rest, IT admins were left to either restore them from backups or delete the bad update file manually. Microsoft published a bootable tool that could help automate the process of deleting that file, but it still required laying hands on every single affected Windows install, whether on a virtual machine or a physical system.

And not all of CrowdStrike’s remediation solutions have been well-received. The company sent out $10 UberEats promo codes to cover some of its partners’ “next cup of coffee or late night snack,” which occasioned some eye-rolling on social media sites (the code was also briefly unusable because Uber flagged it as fraudulent, according to a CrowdStrike representative). For context, analytics company Parametrix Insurance estimated the cost of the outage to Fortune 500 companies somewhere in the realm of $5.4 billion.



CrowdStrike blames testing bugs for security update that took down 8.5M Windows PCs

oops —

Company says it’s improving testing processes to avoid a repeat.

CrowdStrike’s Falcon security software brought down as many as 8.5 million Windows PCs over the weekend. Credit: CrowdStrike

Security firm CrowdStrike has posted a preliminary post-incident report about the botched update to its Falcon security software that caused as many as 8.5 million Windows PCs to crash over the weekend, delaying flights, disrupting emergency response systems, and generally wreaking havoc.

The detailed post explains exactly what happened: At just after midnight Eastern time, CrowdStrike deployed “a content configuration update” to allow its software to “gather telemetry on possible novel threat techniques.” CrowdStrike says that these Rapid Response Content updates are tested before being deployed, and one of the steps involves checking updates using something called the Content Validator. In this case, “a bug in the Content Validator” failed to detect “problematic content data” in the update responsible for the crashing systems.
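
For a sense of what such a check involves, here is a minimal sketch of a structural validator for an update blob. It is an illustration of the idea only, not CrowdStrike’s actual Content Validator: the magic word, record size, and the “reject all-zero records” rule are assumptions invented for the example.

```cpp
// Hypothetical sketch of the kind of pre-release check a content validator
// might perform; not CrowdStrike's actual tool. The file layout (magic word,
// entry count, fixed-size records) is an illustrative assumption.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

constexpr std::uint32_t kMagic = 0x43530001;  // assumed channel-file marker
constexpr std::size_t kEntrySize = 64;        // assumed per-rule record size

// Returns true only if the update blob is structurally sound. A validator
// that skips one of these checks can let a malformed file ship to the fleet.
bool validateContentUpdate(const std::vector<std::uint8_t>& blob) {
    if (blob.size() < 8) return false;  // header missing
    std::uint32_t magic = 0, entries = 0;
    std::memcpy(&magic, blob.data(), 4);
    std::memcpy(&entries, blob.data() + 4, 4);
    if (magic != kMagic || entries == 0) return false;  // wrong type or empty
    if (blob.size() != 8 + std::size_t(entries) * kEntrySize)
        return false;  // truncated or padded
    // Reject all-zero records: zeroed fields would later be read as
    // null pointers or zero offsets by the sensor.
    for (std::uint32_t i = 0; i < entries; ++i) {
        const std::uint8_t* rec = blob.data() + 8 + std::size_t(i) * kEntrySize;
        bool allZero = true;
        for (std::size_t b = 0; b < kEntrySize && allZero; ++b)
            allZero = (rec[b] == 0);
        if (allZero) return false;
    }
    return true;
}

int main() {
    // A stand-in for a bad channel file: valid header, one all-zero record.
    std::vector<std::uint8_t> blob(8 + kEntrySize, 0);
    std::memcpy(blob.data(), &kMagic, 4);
    std::uint32_t one = 1;
    std::memcpy(blob.data() + 4, &one, 4);
    std::printf("bad file passes validation? %s\n",
                validateContentUpdate(blob) ? "yes" : "no");
    return 0;
}
```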

CrowdStrike says it is making changes to its testing and deployment processes to prevent something like this from happening again. The company is specifically including “additional validation checks to the Content Validator” and adding more layers of testing to its process.

The biggest change will probably be “a staggered deployment strategy for Rapid Response Content” going forward. In a staggered deployment system, updates are initially released to a small group of PCs, and then availability is slowly expanded once it becomes clear that the update isn’t causing major problems. Microsoft uses a phased rollout for Windows security and feature updates after a couple of major hiccups during the Windows 10 era. To this end, CrowdStrike will “improve monitoring for both sensor and system performance” to help “guide a phased rollout.”
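
A staggered deployment amounts to a simple gate: push the update to a small cohort, watch crash telemetry, and only widen the rollout if the observed failure rate stays under a budget. The sketch below is a hypothetical illustration of that gate, not CrowdStrike’s or Microsoft’s actual pipeline; the stage fractions, crash-rate threshold, and telemetry shape are assumptions.

```cpp
// Hypothetical sketch of a staggered ("canary") rollout gate of the kind
// CrowdStrike describes, not its actual deployment system. Stage sizes, the
// crash-rate threshold, and the telemetry source are illustrative assumptions.
#include <cstdio>
#include <functional>
#include <vector>

struct StageTelemetry {
    long updated;   // hosts that received this stage of the update
    long crashed;   // hosts that blue-screened after receiving it
};

// Pushes the update to progressively larger fractions of the fleet and stops
// (triggering a rollback) as soon as observed crash rates exceed the budget.
bool stagedRollout(long totalHosts, double maxCrashRate,
                   const std::vector<double>& stages,
                   const std::function<StageTelemetry(long)>& deployAndObserve) {
    long done = 0;
    for (double fraction : stages) {
        long cohort = static_cast<long>(totalHosts * fraction) - done;
        if (cohort <= 0) continue;

        StageTelemetry t = deployAndObserve(cohort);
        done += t.updated;
        double crashRate = t.updated ? double(t.crashed) / t.updated : 1.0;
        std::printf("stage %5.1f%%: %ld hosts, crash rate %.4f\n",
                    fraction * 100.0, t.updated, crashRate);

        if (crashRate > maxCrashRate) {
            std::printf("halting rollout, rolling back\n");
            return false;   // the rest of the fleet never gets the bad update
        }
    }
    return true;
}

int main() {
    // Simulated fleet where the update crashes every host it touches: the
    // rollout halts after the first small cohort instead of hitting everyone.
    auto deploy = [](long cohort) { return StageTelemetry{cohort, cohort}; };
    stagedRollout(8500000, 0.001, {0.01, 0.10, 0.50, 1.00}, deploy);
    return 0;
}
```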

CrowdStrike says it will also give its customers more control over when Rapid Response Content updates are deployed so that updates that take down millions of systems aren’t deployed at (say) midnight when fewer people are around to notice or fix things. Customers will also be able to subscribe to release notes about these updates.

Recovery of affected systems is ongoing. Rebooting systems multiple times (as many as 15, according to Microsoft) can give them enough time to grab a new, non-broken update file before they crash, resolving the issue. Microsoft has also created tools that can boot systems via USB or a network so that the bad update file can be deleted, allowing systems to restart normally.

In addition to this preliminary incident report, CrowdStrike says it will release “the full Root Cause Analysis” once it has finished investigating the issue.



Microsoft says 8.5M systems hit by CrowdStrike BSOD, releases USB recovery tool

still striking —

When reboots don’t work, bootable USB sticks may help ease fixes for some PCs.

A bad update to CrowdStrike’s Falcon security software crashed millions of Windows PCs last week. Credit: CrowdStrike

By Monday morning, many of the major disruptions from the flawed CrowdStrike security update late last week had cleared up. Flight delays and cancellations were no longer front-page news, and multiple Starbucks locations near me are taking orders through the app once again.

But the cleanup effort continues. Microsoft estimates that around 8.5 million Windows systems were affected by the issue, which involved a buggy .sys file that was automatically pushed to Windows PCs running the CrowdStrike Falcon security software. Once downloaded, that update caused Windows systems to display the dreaded Blue Screen of Death and enter a boot loop.

“While software updates may occasionally cause disturbances, significant incidents like the CrowdStrike event are infrequent,” wrote Microsoft VP of Enterprise and OS Security David Weston in a blog post. “We currently estimate that CrowdStrike’s update affected 8.5 million Windows devices, or less than one percent of all Windows machines. While the percentage was small, the broad economic and societal impacts reflect the use of CrowdStrike by enterprises that run many critical services.”

The “easy” fix documented by both CrowdStrike (whose direct fault this is) and Microsoft (which has taken a lot of the blame for it in mainstream reporting, partly because of an unrelated July 18 Azure outage that had hit shortly before) was to reboot affected systems over and over again in the hopes that they would pull down a new update file before they could crash. For systems where that method hasn’t worked—and Microsoft has recommended customers reboot as many as 15 times to give computers a chance to download the update—the recommended fix has been to delete the bad .sys file manually. This allows the system to boot and download a fixed file, resolving the crashes without leaving machines unprotected.

To help ease the pain of that process, Microsoft over the weekend released a recovery tool that helps to automate the repair process on some affected systems; it involves creating bootable media using a 1GB-to-32GB USB drive, booting from that USB drive, and using one of two options to repair your system. For devices that can’t boot via USB—sometimes this is disabled on corporate systems for security reasons—Microsoft also documents a PXE boot option for booting over a network.

WinPE to the rescue

The bootable drive uses the WinPE environment, a lightweight, command-line-driven version of Windows typically used by IT administrators to apply Windows images and perform recovery and maintenance operations.

One repair option boots directly into WinPE and deletes the affected file without requiring administrator privileges. But if your drive is protected by BitLocker or another disk-encryption product, you’ll need to manually enter your recovery key so that WinPE can read data on the drive and delete the file. According to Microsoft’s documentation, the tool should automatically delete the bad CrowdStrike update without user intervention once it can read the disk.

If you are using BitLocker, the second recovery option attempts to boot Windows into Safe Mode using the recovery key stored in your device’s TPM to automatically unlock the disk, as happens during a normal boot. Safe Mode loads the minimum set of drivers that Windows needs to boot, allowing you to locate and delete the CrowdStrike driver file without running into the BSOD issue. The file is located at Windows/System32/Drivers/CrowdStrike/C-00000291*.sys on affected systems, or users can run “repair.cmd” from the USB drive to automate the fix.
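
In spirit, the manual fix comes down to finding and deleting any C-00000291*.sys file in that directory so Windows can boot and fetch a corrected one. The sketch below shows just that step; it is not Microsoft’s recovery tool or the repair.cmd script, and it assumes it runs from an environment such as WinPE where the system drive is already unlocked and mounted as C:.

```cpp
// Minimal sketch of the manual remediation step described above: remove the
// bad channel file so the machine can boot and pull a fixed one. Not
// Microsoft's recovery tool or CrowdStrike's repair.cmd; assumes the system
// drive is already readable (e.g., BitLocker unlocked from WinPE) as C:.
#include <filesystem>
#include <iostream>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

int main() {
    const fs::path driverDir = "C:/Windows/System32/drivers/CrowdStrike";
    const std::string prefix = "C-00000291";   // the affected channel file(s)

    // Collect matching files first, then delete, to avoid mutating the
    // directory while iterating over it.
    std::error_code ec;
    std::vector<fs::path> targets;
    for (fs::directory_iterator it(driverDir, ec), end; !ec && it != end; ++it) {
        const std::string name = it->path().filename().string();
        if (name.rfind(prefix, 0) == 0 && it->path().extension() == ".sys")
            targets.push_back(it->path());
    }
    if (ec) {
        std::cerr << "cannot read " << driverDir << ": " << ec.message() << '\n';
        return 1;
    }

    for (const auto& p : targets) {
        std::error_code rmEc;
        if (fs::remove(p, rmEc) && !rmEc)
            std::cout << "deleted " << p << '\n';
        else
            std::cerr << "failed to delete " << p << ": " << rmEc.message() << '\n';
    }
    return targets.empty() ? 2 : 0;
}
```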

For its part, CrowdStrike has set up a “remediation and guidance hub” for affected customers. As of Sunday, the company said it was “test[ing] a new technique to accelerate impacted system remediation,” but it hasn’t shared more details as of this writing. The other fixes outlined on that page include rebooting multiple times, manually deleting the affected file, or using Microsoft’s boot media to help automate the fix.

The CrowdStrike outage didn’t just delay flights and make it harder to order coffee. It also affected doctor’s offices and hospitals, 911 emergency services, hotel check-in and key card systems, and work-issued computers that were online and grabbing updates when the flawed update was sent out. In addition to providing fixes for client PCs and virtual machines hosted in its Azure cloud, Microsoft says it has been working with Google Cloud Platform, Amazon Web Services, and “other cloud providers and stakeholders” to provide fixes to Windows VMs running in its competitors’ clouds.



On the CrowdStrike Incident

Things went very wrong on Friday.

A bugged CrowdStrike update temporarily bricked quite a lot of computers, bringing down such fun things as airlines, hospitals and 911 services.

It was serious out there.

Ryan Peterson: Crowdstrike outage has forced Starbucks to start writing your name on a cup in marker again and I like it.

My understanding is that it was a rather stupid bug, a NULL pointer from the memory unsafe C++ language.

Zack Vorhies: Memory in your computer is laid out as one giant array of numbers. We represent these numbers here as hexadecimal, which is base 16 (hexadecimal) because it’s easier to work with… for reasons.

The problem area? The computer tried to read memory address 0x9c (aka 156).

Why is this bad?

This is an invalid region of memory for any program. Any program that tries to read from this region WILL IMMEDIATELY GET KILLED BY WINDOWS.

So why is memory address 0x9c trying to be read from? Well because… programmer error.

It turns out that C++, the language crowdstrike is using, likes to use address 0x0 as a special value to mean “there’s nothing here”, don’t try to access it or you’ll die.

And what’s bad about this is that this is a special program called a system driver, which has PRIVILEGED access to the computer. So the operating system is forced to, out of an abundance of caution, crash immediately.

This is what is causing the blue screen of death. A computer can recover from a crash in non-privileged code by simply terminating the program, but not a system driver. When your computer crashes, 95% of the time it’s because it’s a crash in the system drivers.

If the programmer had done a check for NULL, or if they used modern tooling that checks these sorts of things, it could have been caught. But somehow it made it into production and then got pushed as a forced update by Crowdstrike… OOPS!
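
To make the mechanics concrete, here is a small, purely illustrative program showing how a missing NULL check turns a malformed input into a read at a tiny address like 0x9c. The struct layout, field names, and offsets are assumptions for the example, not CrowdStrike’s code.

```cpp
// Illustrative only, not CrowdStrike's code: the class of bug described
// above. Dereferencing a struct field through a null pointer becomes a read
// at a small offset (0x9c here is an assumed field offset), which is an
// invalid address; in user mode the process dies, in a kernel-mode driver
// the whole machine blue-screens.
#include <cstddef>
#include <cstdio>

struct ThreatConfig {
    char header[0x9c];   // assumption: layout putting the next field at offset 0x9c
    int  ruleCount;      // cfg->ruleCount with cfg == nullptr reads address 0x9c
};

// Hypothetical parser that returns nullptr on malformed input.
const ThreatConfig* parseChannelFile(const unsigned char* data, std::size_t len) {
    if (data == nullptr || len < sizeof(ThreatConfig)) return nullptr;
    return reinterpret_cast<const ThreatConfig*>(data);
}

int applyUpdate(const unsigned char* data, std::size_t len) {
    const ThreatConfig* cfg = parseChannelFile(data, len);
    // The check the quoted analysis says was missing: without it, a bad
    // content file makes the next line read through a null pointer.
    if (cfg == nullptr) {
        std::fprintf(stderr, "malformed content update, skipping\n");
        return -1;
    }
    return cfg->ruleCount;
}

int main() {
    unsigned char tooShort[4] = {0, 0, 0, 0};   // simulates the bad channel file
    std::printf("rule count: %d\n", applyUpdate(tooShort, sizeof tooShort));
    return 0;
}
```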

Here is another technical breakdown.

A non technical breakdown would be:

  1. CrowdStrike is set up to run whenever you start the computer.

  2. Then someone pushed an update to a ton of computers.

  3. Which is something CrowdStrike was authorized to do.

  4. The update contained a stupid bug, that would have been caught if those involved had used standard practices and tests.

  5. With the bug, it tries to access memory in a way that causes a crash.

  6. Which also crashes the computer.

  7. So you have to do a manual fix to each computer to get around this.

  8. If this had been malicious it could probably have permawiped all the computers, or inserted Trojans, or other neat stuff like that.

  9. So we dodged a bullet.

  10. Also, your AI safety plan needs to take into account that this was the level of security mindset and caution at CrowdStrike, despite CrowdStrike having this level of access and being explicitly in the security mindset business, and that they were given this level of access to billions of computers, and that their stock was only down 11% on the day so they probably keep most of that access and we aren’t going to fine them out of existence either.

Yep.

EDIT, added 11:30 am 7/22/24: Ben Thompson has a post summarizing what happened. It broadly agrees with what is described here, and in particular highlights the EU’s role via the 2009 Microsoft consent decree that prevents Microsoft from locking down the Windows kernel space. I am convinced that without that decree, Microsoft would probably have done that.

George Kurtz (CEO CrowdStrike): CrowdStrike is actively working with customers impacted by a defect found in a single content update for Windows hosts. Mac and Linux hosts are not impacted. This is not a security incident or cyberattack. The issue has been identified, isolated and a fix has been deployed.

We refer customers to the support portal for the latest updates and will continue to provide complete and continuous updates on our website. We further recommend organizations ensure they’re communicating with CrowdStrike representatives through official channels. Our team is fully mobilized to ensure the security and stability of CrowdStrike customers.

Dan Elton: No apology. Many people have been wounded or killed by this. They are just invisible because we can’t point to them specifically. But think about it though — EMS services were not working. Doctors couldn’t access EMR & hospitals canceled medical scans.

Stock only down 8% [was 11% by closing].

I don’t think the full scope of this disaster has really sunk in. Yes, the problems will be fixed within a few days & everything will go back to normal. However, 911 services are down across the country. Think about that for a second. Hospitals around the world running on paper.

It’s hard to map one’s mind around since all the people who have been killed and will be killed by this — and I’m sure there are many — are largely invisible.

Claude’s median estimate is roughly 1,000 people died due to the outage, when given the hypothetical scenario of an update with this bug being pushed and no other info.

Where Claude got it wrong is it expected a 50%+ drop in share price for CrowdStrike. We should be curious why this did not happen. When told it was 11%, Claude came up with many creative potential explanations, and predicted that this small a drop would become an object of future study.

Then again, perhaps no one cares about reputation these days? You get to have massive security failures and people still let you into their kernels?

Anshel Sag: For those who don’t remember, in 2010, McAfee had a colossal glitch with Windows XP that took down a good part of the internet. The man who was McAfee’s CTO at that time is now the CEO of Crowdstrike. The McAfee incident cost the company so much they ended up selling to Intel.

I mean, sure, it looks bad, now, in hindsight.

At this rate, the third time will be an AGI company.

So do we blame George Kurtz? Or do we blame all of you who let it happen?

Aside from ‘letting a company run by George Kurtz access your kernel,’ that is.

It happened because various actors did not do deeply standard things they should obviously have been doing.

A fun game is to watch everyone say ‘the real problem is X and Y is a distraction’ with various things being both X and Y in different statements. It can all be ‘real’ problems.

Owen Lynch: Everyone is talking about how memory safety would have stopped the crowdstrike thingy. Seems to me that’s a distraction; the real problem is that the windows security model is reactive (try to write software that detects hacks) rather than proactive (run processes in sealed sandboxes with permissions granted by-need instead of by-default). Then there’s little need for antivirus in the same sense.

Of course, the kernel managing these sandboxes needs to be memory safe, but this is a low bar, ideally it should be either exhaustively fuzzed (like SQLite) or actually formally verified.

But most software should be allowed to be horrendously incorrect or actually malicious, but only in its little box.

Here is a thread where they debate whether to blame CrowdStrike or Microsoft.

Luke Parrish: Microsoft designed their OS to run driver files without even a checksum and you say they aren’t responsible? They literally tried to execute a string of zeroes!

Jennifer Marriott: Still the issue is CrowdStrike. If I buy a program and install it on my computer and it bricks my computer I blame the program not the computer.

Luke Parrish: CrowdStrike is absolutely to blame, but so is Microsoft. Microsoft’s software, Windows, is failing to do extremely basic checks on driver files before trying to load them and give them full root access to see and do everything on your computer.

This is analogous to the fire safety triangle: Heat, fuel, and oxygen. Any one of those can be removed to prevent combustion. Multiple failures led to this outcome. Microsoft could have prevented this with good engineering practices, just as CrowdStrike could have.

The market did not think Microsoft would suffer especially adverse effects. The Wall Street Journal might say this was the ‘latest woe for Microsoft’ but their stock on Friday was down less than the Nasdaq. That seems right to me. Yes, Microsoft could and should have prevented this, but ultimately it will not cause people to switch.

The Wall Street Journal also attempts to portray this as a failure of Microsoft to have a ‘closed ecosystem’ the way Apple does (in a limited way on a Mac, presumably, as this is not a phone). This, they say, is what happens when you let others actually do things for real on your machine, the horrors. There are a minimum of two ways this is Obvious Nonsense, even if you grant a bunch of other absurd assumptions.

  1. Linux exists.

  2. Microsoft is barred from not giving this access by a 2009 EU consent decree.

Did Microsoft massively screw up by not guarding against this particular failure mode? Oh, absolutely, everyone agrees on that. But they failed (as I understand essentially everyone) by not having proper safety checks and failure modes, not by failing to deny access.

There was a clear pattern where ‘critical infrastructure’ that is vitally important to keep online like airlines and banks and hospitals went down, while the software companies providing other non-critical services had no such issues.

‘Too important to improve’ (or ‘too vital to allow?’) is remarkably common.

Where you cannot faround, you cannot find out. And where you cannot do either, it is hard to find good help.

Microsoft Worm: In retrospect it’s pretty ~funny how most shitware SaaS companies & social media companies exclusively run Real Software for Grown-Ups while critical infrastructure (airlines, hospitals, etc.) all uses dotcom-era software from comically incompetent zombie firms with 650 PE ratios.

Gallabytes: We used to explain this bifurcation as a function of size but with most of the biggest companies being tech giants now that explanation has been revealed as cope. what’s the real cause?

Sarah Constantin: My guess would be it’s “do any good software engineers work there or not?” Good software engineers work at both startups and Big Tech cos but I have *one* smart programmer friend who works at a bank, and zero at hospitals, airlines, etc.

Gallabytes: This is downstream I think and not universal – plenty of good programmers in gaming industry but it’s still full of this kind of madness. So far the most accurate classifier I’ve got is actually “does this company run on Windows?”

Scott Leibrand: I think it comes down to whether they hire mostly nerd vs. normie employees.

Illiane: Pretty sure it’s just a result of these tech companies starting out with a « cleaner » blank slate than critical infra that’s been here for decades and relies on mega legacy system which would be very hard and risky to replace. Banks still largely run on COBOL mainframes!

Tech companies at least started out able to find out and hire good help, and built their engineering cultures and software stacks around that. Banks do not have that luxury.

Why else might we have had this stunning display of incompetence?

Lina Khan, head of the FTC, has no sense of irony.

Lina Khan: All too often these days, a single glitch results in a system-wide outage, affecting industries from healthcare and airlines to banks and auto-dealers. Millions of people and businesses pay the price.

These incidents reveal how concentration can create fragile systems.

Concentrating production can concentrate risk, so that a single natural disaster or disruption has cascading effects.

This fragility has contributed to shortages in areas ranging from IV bags to infant formula.

Another area where we may lack resiliency is cloud computing.

In response to @FTC’s inquiry, market participants shared concerns about widespread reliance on a handful of cloud providers, noting that consolidation can create single points of failure.

And we’re continuing to collect public comment on serial acquisitions and roll-up strategies across the economy.

If you’ve encountered an area where a series of deals has consolidated a market, we welcome your input.

Yes. The problem is too much concentration in cloud providers, says Lina Khan. We must Do Something about that. I mean, how could this possibly have happened? That all the major cloud providers went down at the same time over the same software bug?

Must be a lack of regulation.

Except, well, actually, says Mark Atwood.

Mark Atwood: If you are in a regulated industry, you are required to install something like Crowdstrike on all your machines. If you use Crowdstrike, your auditor checks a single line and moves on. If you use anything else, your auditor opens up an expensive new chapter of his book.

The real culprit here is regulatory capture. Notice that everybody getting hit hard by this is in a heavily regulated industry: finance, airlines, healthcare, etc. That’s because those regulations include IT security mandates, and Crowdstrike has positioned themselves as the only game in town for compliance. Hence you get this software monoculture prone to everything getting hit at once like this.

Andres Sandberg: A good point. I saw the same in the old FHI-Amlin systemic risk of risk modelling project: regulators inadvertently reduce model diversity, making model-mediated systemic risk grow. “Sure, you can use a model other than RMS, but it will be painful for both of us…”

Ray Taylor: what if you use Mac / Linux?

Andres Sandberg: You will have to use the right operating system to run the industry standard software. Even if it is Windows XP in 2017.

Some disputed this. I checked with Claude Sonnet 3.5. It looks like there are plenty of functional alternative services, and yes they will work, but CrowdStrike does automated compliance reporting and is widely recognized, and this is actually core to their pitch of why companies should use them – to reduce compliance costs.

I also checked with two friends who know about such things. It seems CrowdStrike did plausibly have a superior product to the alternatives, even discounting the regulatory questions.

It was also pointed out that while a lot of installs were done to please auditors, much of what the auditors were checking for was not formal government regulation but industry standards without legal enforcement, standards you nonetheless need to meet to win contracts, like SOC 2 or ISO 27001.

In the end, is there a functional difference? In some ways, probably not.

So given the increasing number of requirements Claude was able to list off, and the costs of non-compliance, everyone in these ‘critical infrastructure’ businesses ended up turning to the company whose main differential, and perhaps to them main product offering, was ‘regulatory compliance.’

That then set us up with additional single points of failure. It also meant that the company in charge of those failure points had a culture built around checking off boxes on government forms rather than actual computer security or having a security mindset.

You know who did not use CrowdStrike? Almost anyone who did not face this regulatory burden. It was on only 8.5 million Windows machines.

Byrne Hobart: <1% penetration. This Crowdstrike company seems like it's got a nice TAM to go after, just have to make sure they don't do anything to mess it up.

Another nice bit that I presume is a regulatory compliance issue: Rules around passwords and keys are reliably absurd.

Dan Elton: Many enterprises in healthcare use disk encryption like Bitlocker which complicates #CrowdStrike cleanup.

This is what one IT admin reports:

“We can’t boot into safe mode because our BitLocker keys are stored inside of a service that we can’t login to because our AD is down.”

Another says “Most of our comms are down, most execs’ laptops are in infinite BSOD boot loops, engineers can’t get access to credentials to servers.”

Would it be better if the disaster were worse, such as what likely happens to a crypto project in this spot? Crypto advocate says yes, Gallabytes points out actually no.

Dystopia Breaker: in crypto, when a project has a large incompetence event (hack, insider compromise, whatever), the project loses all of their money and is dead forever in tradtech/bureautech, when a project has a large incompetence event, they do a ‘post mortem’ and maybe get some nastygrams.

Consider for a moment the incentives that this dynamic creates and the outcomes that arise by dialing out these two incentive gradients into the future.

It’s actually worse than ‘they get some nastygrams’, what usually happens is that regulators (who usually know less than nothing about the technosphere) demand band-aid solutions (surveillance, usually) that increase systemic risk [e.g. CrowdStrike itself].

Gallabytes: And that’s a huge downside of crypto!

Most systems will be back to normal by Monday, while in crypto many would be irreversibly broken.

It’d be better still if our institutions learned from this failure but I’m not holding my breath. You basically only see this kind of failure in over regulated oligopolistic markets, so the case for massive deregulation is much clearer than migration to crypto.

As George Carlin famously said, somewhere in the middle, the truth lies. Letting CrowdStrike off the hook because they ‘are the standard’ is insufficiently strong incentives. Taking everything involved down hard is worse.

What about the role of AI?

Andrej Karpathy: What a case study of systemic risk with CrowdStrike outage… that a few bits in the wrong place can brick ~1 billion computers and all the 2nd, 3rd order effects of it. What other single points of instantaneous failure exist in the technosphere and how do we design against it.

Davidad: use LLMs to reimplement all kernel-mode software with formal verification.

How about we use human software engineers to do the rebuild, instead?

It is great that we can use AIs to write code faster, and enable people to skill up. For jobs like ‘rewrite the kernel,’ I am going to go ahead and say I want to stick with the humans. There are many overdetermined reasons.

Patrick Collison (responding to Karpathy): I’ve always thought that we should run scheduled internet outages.

Andrej Karpathy: National bit flip day.

Indomitable American Soul: It’s crazy when you think that this could have all been avoided by testing the release on a single sandbox machine.

Andrej Karpathy: I just feel like this is the particular problem but not the *actual* deeper problem. Any part of the system should be allowed to go *crazy*, randomly or even adversarially, and the rest of it should be robust to that. This is what you want, even if robustness is very often at tension with efficiency.

There are two problems.

  1. This error should not have been able to bring down the system.

  2. This error should never have happened even if it couldn’t crash the system.

Either of these on its own should establish that we have a terrible situation that poses catastrophic risks even without AI, and which AI will make a lot worse, and urgently needs fixing.

Together, they are terrifying.

The obvious failure mode is not malicious. It is exactly what happened this time, except in the future, with AI.

  1. AI accidentally outputs buggy code.

  2. Human does not catch it.

  3. What do you mean ‘unit tests’ and ‘canaries’?

  4. Whoops.

Or the bug is more subtle than this, so we do run the standard tests, and it passes. That happens all the time, it is not usually quite this stupid and obvious.

The next failure mode is that the AI intentionally outputs bugged or malicious code, whether or not a human instructed it otherwise (explicitly, implicitly, or by unfortunate implication).

And of course the other failure mode is that the AI, or someone with an AI, intentionally seeks out the attack vector in order to deploy such code.

Shako: A rogue AI could probably brick every computer in the world indefinitely with ongoing zero days to exploit things like we saw today. Probably not too far from the capability either.

Arthur: It won’t need zero days, we’ll have given it root power globally because it’s convenient.

Leo Gao (OpenAI, distinct thread): Thankfully, it’s unimaginable that an AGI could ever compromise a large fraction of internet connected computers.

Jeffrey Ladish: Fortunately there are no single points of failure or over reliances on a single service provider with system level access to a large fraction of the computers that run, uh, everything.

Everyone: “Oh no the AGI will be able to discover 0days in every piece of software, we’ll be totally pwned”

AGI: “Why would I need 0days? 🙄”

Where should we worry about concentration? Is this a reason to want everyone to be using different AIs from different providers, instead of the same AI?

That depends on what constitutes the single point of failure (SPOF).

If the SPOF is ‘all the AIs turn rogue or go crazy or shut off at the same time’ then you want AI diversity.

If the SPOF is ‘every distinct frontier AI is itself an SPOF, because if even one of them goes fully off the rails then that is a catastrophe’ then you do not want AI diversity.

These questions can have very different answers for catastrophic or existential risk, versus mundane risk.

For mundane risk, you by default want your systems to fail at different times in distinct ways, but you need to worry about long dependency chains where you are only as strong as the weakest link. So if you are (for example) combining five different AI systems that each are the best at a particular subtask, and cannot easily swap them out in time, then you are vulnerable if any of them go haywire.
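To make the weakest-link point concrete, here is a minimal sketch with made-up availability numbers and an independence assumption that real systems rarely satisfy; it is an illustration, not a claim about any particular stack.

```python
# Illustrative sketch only: hypothetical availability figures, assuming
# independent failures. A chain where every subsystem must work is only
# as strong as its weakest link, and usually weaker.

def chain_availability(availabilities):
    """Probability the whole pipeline works when every link must work."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

# Five hypothetical AI subsystems, each 'best at its subtask'.
subsystems = [0.999, 0.995, 0.99, 0.98, 0.97]

print(f"Weakest single link: {min(subsystems):.3f}")                 # 0.970
print(f"Whole chain:         {chain_availability(subsystems):.3f}")  # ~0.935
# The chain is noticeably less reliable than any individual component,
# so any one subsystem going haywire takes the whole workflow with it.
```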

For existential or catastrophic risk, it depends on your threat model.

Any single rogue agent under current conditions, be it human or AI, could potentially have set off the CrowdStrike bug, or a version of it that was far worse. There are doubtless many such cases. So do you think that ‘various good guys with various AIs’ could then defend against that? Would ‘some people defend and some don’t’ be sufficient, or do you need to almost always (or actual always) successfully defend?

I am very skeptical of the ‘good guy with an AI’ proposal, even if such defenses are physically possible (and I am skeptical of that too). Why didn’t a ‘good guy with a test machine or a debugger’ stop the CrowdStrike update? Because even if there was a perfectly viable way to act responsibly, that does not mean we are going to do that if it is trivially inconvenient or is not robustly checked.

Again, yes, if we allow it, you are going to give the AI root access and take yourself out of the loop, because not doing so is going to be annoying, and expensive, and you are in competition with people who are willing to do such things. If you don’t, someone else will, and their AIs will end up with the market share and the power.

Indeed, the very fact that these many AIs are allowed to be in this intense competition with each other, with rapid iteration, makes it all but certain that corners will be cut to absurd degrees, especially when it comes to things like collective security.

Another thing that can happen is that the one dangerous AI suddenly becomes a lot of dangerous AIs, because it can be copied, or it can scale its resources to similar effect. Or, by having many such potentially dangerous AIs, you place authority over them into many hands, and what happens if even one of those hands chooses to be sufficiently irresponsible or malicious?

What about the risk of regulatory capture happening with safety in AI, the way it happened here with mundane computer security and CrowdStrike? What happens if everyone is hiring a company, Acme Safety Compliance (ASC), to handle all their ‘AI safety’ needs, and ASC’s actual product is regulatory compliance?

Well, then we’re in very big trouble. As in dead.

Every time I look at an AI lab’s scaling policy, I say some form of:

  1. If they implement the spirit of a good version of this document, I don’t know if that is good enough, but that would be a big help.

  2. If they implement the letter of even a good version of this document, and game the requirements, then that is worth very little if anything.

  3. If they don’t even implement the letter of it in the breach, it’s totally worthless.

  4. We cannot rely on their word that they will implement even the letter of this.

This is another reason most of the value, right now, is in disclosure and information requirements on the largest frontier models. If you have to tell me what you are doing, then that is not an easy thing to meaningfully ‘capture.’

But yeah, this is going to be tough and a real danger. It always is. And it always needs to be balanced against the alternative options available, and what happens if you do nothing.

It can also be pointed out that this is another logical counter to ‘but you need to tell me exactly what constitutes compliance, and if I technically do that then I should have full safe harbor,’ as many demand for themselves in many contexts. That is a very good way to get exactly what is written down, and no more, to get the letter only and not the spirit. That works if there is a risk that can indeed be taken out of the room by adhering to particular rules. But if the risk is inherent in the system and not so easy to deal with, you cannot make the situation non-risky on one side of a line.

One thing to note is that CrowdStrike was an active menace. It was de facto mandatory that they be given this level of access. If CrowdStrike was (for example) instead a red teaming service that attempted to break into your computers, it would have been much harder (but not, indirectly, impossible) for it to cause this disaster.

Another key insight is that you do not only have to design around the things that might go wrong when everyone does their job properly and you face a genuinely hard problem.

Your solution must also be designed anticipating the stupidest failures.

Because that is what you probably first get.

And saying ‘oh there are like 5 ways someone would take action such that this would obviously not happen’ is a surprisingly weak defense.

Then, later, you also get the failures that happen when the AI is smarter than you.

And again, then, whatever happens, there is a good chance many will say ‘it would have been fine if we hadn’t acted like completely incompetent idiots and had instead followed even a modicum of best practices,’ and on this exact set of events they will have been right. But that will also be why that particular set of events happened, rather than something harder to fathom.

Also down were the banks. Anything requiring computer access was stopped cold.

Patrick McKenzie: In “could have come out of a tabletop exercise”, sudden surge by many customers of ATM transactions has them flagging customers as likely being fraud impacted.

Good news: you have an automated loop which allows a customer to recognize a transaction.

Bad news: Turns out that subdomain is running on Windows.

I’m not trying to grind their nose in it. Widespread coordinated outages are terrible and the few things that knock out all the PCs are always going to be nightmares.

I do have to observe that some people who write regulations which effectively mandate a monoculture don’t know what SPOF stands for and our political process is unlikely to put two and two together for them.

Same story at three banks, two GSFIs and one large regional, for anyone wanting a data point. Well I guess I know next week’s Bits about Money topic.

It was only a single point of failure for Windows machines that trusted CrowdStrike. But in a corporate context, that is likely to either be all or none of them.

That created some obvious issues, and offered opportunity for creative solutions.

Patrick McKenzie: Me: *cash*

Tradesman: Wait how did you get that with the banks down?

Me: *explains*

Tradesman: Oh that’s creative.

Me: Nah. Next plan was creative.

Tradesman: What was that?

Me: Going to the church and buying all cash on hand with a check.

Tradesman: What.

Me: I don’t drink.

Tradesman: What.

Me: The traditional business to use in this situation is the local bar, but I don’t drink and so the local bar doesn’t know me, so that’s right out.

Tradesman: What.

Me: Though come to think of it I certainly know someone who knows both me and the bar owner, so I could probably convince them to give me a workweek’s take on a handshake.

Tradesman: This is effed up.

Me: I mean money basically always works like this, in a way.

Called someone who I (accurately) assumed would have sufficient cash on hand and said “I need a favor.”, then he did what I’d do on receiving the same phone call.

Another obvious solution is ‘keep an emergency cash fund around.’ In a world where one’s bank accounts might all get frozen at once, or the banks might go down for a while, it seems sensible to have such a reserve somewhere you can access it in this kind of emergency. You are not giving up much in interest.

This is also a damn good reason to not ban or eliminate physical cash, in general.



CrowdStrike fixes start at “reboot up to 15 times” and get more complex from there

turning it off and back on again, and again, and again —

Admins can also restore backups or manually delete CrowdStrike’s buggy driver.


Airlines, payment processors, 911 call centers, TV networks, and other businesses have been scrambling this morning after a buggy update to CrowdStrike’s Falcon security software caused Windows-based systems to crash with a dreaded blue screen of death (BSOD) error message.

We’re updating our story about the outage with new details as we have them. Microsoft and CrowdStrike both say that “the affected update has been pulled,” so what’s most important for IT admins in the short term is getting their systems back up and running again. According to guidance from Microsoft, fixes range from annoying but easy to incredibly time-consuming and complex, depending on the number of systems you have to fix and the way your systems are configured.

Microsoft’s Azure status page outlines several fixes. The first and easiest is simply to try to reboot affected machines over and over, which gives affected machines multiple chances to try to grab CrowdStrike’s non-broken update before the bad driver can cause the BSOD. Microsoft says that some of its customers have had to reboot their systems as many as 15 times to pull down the update.

Early guidance for fixing the CrowdStrike bug is simply to reboot systems over and over again so that they can try to grab a non-broken update. Credit: Microsoft
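As a rough illustration of why “reboot repeatedly” can be sensible advice at all, here is a minimal sketch: each reboot is another chance for the machine to pull the fixed file before the bad driver loads, so the attempts compound. The per-boot success probability is an assumption for illustration, not a figure from Microsoft or CrowdStrike.

```python
# Illustrative sketch only: the per-boot probability is a made-up assumption.
# Each reboot is an independent race between downloading the fixed channel
# file and the bad driver crashing the machine, so repeated reboots compound.

per_boot_success = 0.15  # assumed chance a single boot grabs the fix in time

for attempts in (1, 5, 10, 15):
    p_recovered = 1 - (1 - per_boot_success) ** attempts
    print(f"{attempts:>2} reboots -> ~{p_recovered:.0%} chance of recovery")
# Under this assumption: 1 reboot ~15%, 5 ~56%, 10 ~80%, 15 ~91%.
```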

If rebooting doesn’t work

If rebooting multiple times isn’t fixing your problem, Microsoft recommends restoring your systems using a backup from before 4:09 UTC on July 19 (just after midnight on Friday, Eastern time), when CrowdStrike began pushing out the buggy update. CrowdStrike says a reverted version of the file was deployed at 5:27 UTC.

If these simpler fixes don’t work, you may need to boot your machines into Safe Mode so you can manually delete the file that’s causing the BSOD errors. For virtual machines, Microsoft recommends attaching the virtual disk to a known-working repair VM so the file can be deleted, then reattaching the virtual disk to its original VM.

The file in question is a CrowdStrike driver located at Windows/System32/Drivers/CrowdStrike/C-00000291*.sys. Once it’s gone, the machine should boot normally and grab a non-broken version of the driver.
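For illustration only, here is a minimal sketch of that deletion step as a script. It assumes you have already reached the disk (Safe Mode, or a repair VM with the affected virtual disk attached), that Python happens to be available there, and that you are running with administrator rights; in practice most admins did the same thing from a command prompt, and the drive letter is an assumption.

```python
# Illustrative sketch only, not official CrowdStrike or Microsoft tooling.
# Deletes the buggy channel file described above so the machine can boot
# normally and fetch a non-broken version of the driver.

from pathlib import Path

# The drive letter is an assumption; on a repair VM the attached disk
# may be mounted under a different letter.
drivers_dir = Path(r"C:\Windows\System32\Drivers\CrowdStrike")

for bad_file in drivers_dir.glob("C-00000291*.sys"):
    print(f"Deleting {bad_file}")
    bad_file.unlink()
```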

Deleting that file on each and every one of your affected systems individually is time-consuming enough, but it’s even more time-consuming for customers using Microsoft’s BitLocker drive encryption to protect data at rest. Before you can delete the file on those systems, you’ll need the recovery key that unlocks those encrypted disks and makes them readable (normally, this process is invisible, because the system can just read the key stored in a physical or virtual TPM module).

This can cause problems for admins who aren’t using key management to store their recovery keys, since (by design!) you can’t access a drive without its recovery key. Cryptography and infrastructure engineer Tony Arcieri compared the situation on Mastodon to a “self-inflicted ransomware attack,” where an attacker encrypts the disks on your systems and withholds the key until they get paid.

And even if you do have a recovery key, your key management server might also be affected by the CrowdStrike bug.

We’ll continue to track recommendations from Microsoft and CrowdStrike about fixes as each company’s respective status pages are updated.

“We understand the gravity of the situation and are deeply sorry for the inconvenience and disruption,” wrote CrowdStrike CEO George Kurtz on X, formerly Twitter. “We are working with all impacted customers to ensure that systems are back up and they can deliver the services their customers are counting on.”



Major outages at CrowdStrike, Microsoft leave the world with BSODs and confusion

Y2K24 —

Nobody’s sure who’s at fault for each outage: Microsoft, CrowdStrike, or both.

A passenger sits on the floor as long queues form at the check-in counters at Ninoy Aquino International Airport, on July 19, 2024 in Manila, Philippines. Credit: Ezra Acayan / Getty Images

Millions of people outside the IT industry are learning what CrowdStrike is today, and that’s a real bad thing. Meanwhile, Microsoft is also catching blame for global network outages, and between the two, it’s unclear as of Friday morning just who caused what.

After cybersecurity firm CrowdStrike shipped an update to its Falcon Sensor software that protects mission-critical systems, blue screens of death (BSODs) started taking down Windows-based systems. The problems started in Australia and followed the dateline from there.

TV networks, 911 call centers, and even the Paris Olympics were affected. Banks and financial systems in India, South Africa, Thailand, and other countries fell as computers suddenly crashed. Some individual workers discovered that their work-issued laptops were booting to blue screens on Friday morning. The outages took down not only Starbucks mobile ordering, but also a single motel in Laramie, Wyoming.

Airlines, never the most agile of networks, were particularly hard-hit, with American Airlines, United, Delta, and Frontier among the US airlines overwhelmed Friday morning.

CrowdStrike CEO “deeply sorry”

Fixes suggested by both CrowdStrike and Microsoft for endlessly crashing Windows systems range from “reboot it up to 15 times” to individual driver deletions within detached virtual OS disks. The presence of BitLocker drive encryption on affected devices further complicates matters.

CrowdStrike CEO George Kurtz posted on X (formerly Twitter) at 5:45 am Eastern time that the firm was working on “a defect found in a single content update for Windows hosts,” with Mac and Linux hosts unaffected. “This is not a security incident or cyberattack. The issue has been identified, isolated and a fix has been deployed,” Kurtz wrote. Kurtz told NBC’s Today Show Friday morning that CrowdStrike is “deeply sorry for the impact that we’ve caused to customers.”

As noted on Mastodon by LittleAlex, Kurtz was the Chief Technology Officer of security firm McAfee when, in April 2010, that firm shipped an update that deleted a crucial Windows XP file, causing widespread outages and requiring system-by-system file repair.

The costs of such an outage will take some time to be known, and will be hard to measure. Cloud cost analyst CloudZero estimated mid-morning Friday that the CrowdStrike incident had already cost $24 billion, based on a previous estimate.

Multiple outages, unclear blame

Microsoft services were, in a seemingly terrible coincidence, also down overnight Thursday into Friday. Multiple Azure services went down Thursday evening, with the cause cited as “a backend cluster management workflow [that] deployed a configuration change causing backend access to be blocked between a subset of Azure Storage clusters and compute resources in the Central US region.”

A spokesperson for Microsoft told Ars in a statement Friday that the CrowdStrike update was not related to its July 18 Azure outage. “That issue has fully recovered,” the statement read.

News reporting on these outages has so far blamed either Microsoft, CrowdStrike, or an unclear mixture of the two as the responsible party for various outages. It may be unavoidable, given that the outages are all happening on one platform, Windows. Microsoft itself issued an “Awareness” regarding the CrowdStrike BSOD issue on virtual machines running Windows. The firm was frequently updating it Friday, with a fix that may or may not surprise IT veterans.

“We’ve received feedback from customers that several reboots (as many as 15 have been reported) may be required, but overall feedback is that reboots are an effective troubleshooting step at this stage,” Microsoft wrote in the bulletin. Alternatively, Microsoft recommends that customers who have a backup from “before 19:00 UTC on the 18th of July” restore it, or attach the OS disk to a repair VM and then delete the file (Windows/System32/Drivers/CrowdStrike/C-00000291*.sys) at the heart of the boot loop.

Security consultant Troy Hunt was quoted as describing the dual failures as “the largest IT outage in history,” saying, “basically what we were all worried about with Y2K, except it’s actually happened this time.”

United Airlines told Ars that it was “resuming some flights, but expect schedule disruptions to continue throughout Friday,” and had issued waivers for customers to change travel plans. American Airlines posted early Friday that it had re-established its operations by 5 am Eastern, but expected delays and cancellations throughout Friday.

Ars has reached out to CrowdStrike for comment and will update this post with any response.

This is a developing story and this post will be updated as new information is available.
