outage

Massive Cloudflare outage was triggered by file that suddenly doubled in size

Cloudflare’s proxy service has limits to prevent excessive memory consumption, with the bot management system having “a limit on the number of machine learning features that can be used at runtime.” This limit is 200, well above the actual number of features used.

“When the bad file with more than 200 features was propagated to our servers, this limit was hit—resulting in the system panicking” and outputting errors, Prince wrote.
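To illustrate the failure mode Prince describes, here is a minimal sketch of a loader that enforces the 200-feature limit but rejects an oversized file and keeps the previous one instead of crashing. It is not Cloudflare's code: `FeatureStore`, `parse_feature_file`, and the one-feature-per-line file format are all hypothetical names and assumptions made for the example.

```python
MAX_FEATURES = 200  # the runtime limit cited in the article


class FeatureFileError(Exception):
    """Raised when a feature file fails validation."""


def parse_feature_file(lines, max_features=MAX_FEATURES):
    """Return the feature names in `lines`, or raise FeatureFileError.

    Assumes (hypothetically) one feature name per non-empty line.
    """
    features = [line.strip() for line in lines if line.strip()]
    if len(features) > max_features:
        raise FeatureFileError(
            f"{len(features)} features exceeds limit of {max_features}"
        )
    return features


class FeatureStore:
    """Holds the active feature set and survives a bad update."""

    def __init__(self, initial_lines):
        self.active = parse_feature_file(initial_lines)

    def try_update(self, lines):
        """Swap in a new file if valid; otherwise keep the last good one."""
        try:
            self.active = parse_feature_file(lines)
            return True
        except FeatureFileError:
            return False  # caller can log and alert; traffic keeps flowing
```

In this sketch, an oversized file makes `try_update` return `False` and leaves `store.active` untouched, so the limit still protects memory without taking the service down.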

Worst Cloudflare outage since 2019

The number of 5xx error HTTP status codes served by the Cloudflare network is normally “very low” but soared after the bad file spread across the network. “The spike, and subsequent fluctuations, show our system failing due to loading the incorrect feature file,” Prince wrote. “What’s notable is that our system would then recover for a period. This was very unusual behavior for an internal error.”

This unusual behavior was explained by the fact “that the file was being generated every five minutes by a query running on a ClickHouse database cluster, which was being gradually updated to improve permissions management,” Prince wrote. “Bad data was only generated if the query ran on a part of the cluster which had been updated. As a result, every five minutes there was a chance of either a good or a bad set of configuration files being generated and rapidly propagated across the network.”

This fluctuation initially “led us to believe this might be caused by an attack. Eventually, every ClickHouse node was generating the bad configuration file and the fluctuation stabilized in the failing state,” he wrote.
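The on-and-off pattern can be reproduced with a toy simulation. The sketch below assumes, purely for illustration, that each five-minute query lands on a uniformly random node, so the chance of a bad file in a given cycle equals the fraction of nodes already updated. The function name and the rollout schedule are hypothetical.

```python
import random


def simulate_cycles(updated_fraction_by_cycle, seed=0):
    """Return 'good' or 'bad' for each five-minute generation cycle.

    updated_fraction_by_cycle: fraction of nodes (0.0 to 1.0) that have
    received the update when each cycle's query runs. Assumes the query
    hits a uniformly random node, which the article does not state.
    """
    rng = random.Random(seed)
    return [
        "bad" if rng.random() < fraction else "good"
        for fraction in updated_fraction_by_cycle
    ]


# Hypothetical gradual rollout: 0% of nodes updated, rising to 100%.
rollout = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.0]
print(simulate_cycles(rollout))
```

While the fraction sits between 0 and 1, the output flips between good and bad files; once it reaches 1.0, every cycle is bad, matching the point at which the fluctuation "stabilized in the failing state."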

Prince said that Cloudflare “solved the problem by stopping the generation and propagation of the bad feature file and manually inserting a known good file into the feature file distribution queue,” and then “forcing a restart of our core proxy.” The team then worked on “restarting remaining services that had entered a bad state” until the 5xx error code volume returned to normal later in the day.

Prince said the outage was Cloudflare’s worst since 2019 and that the firm is taking steps to protect against similar failures in the future. Cloudflare will work on “hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input; enabling more global kill switches for features; eliminating the ability for core dumps or other error reports to overwhelm system resources; [and] reviewing failure modes for error conditions across all core proxy modules,” according to Prince.
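Two of those measures, treating internally generated files like untrusted input and adding kill switches, could be combined on the publishing side roughly as follows. This is a sketch under stated assumptions, not Cloudflare's design: `choose_file_to_publish`, the `kill_switch` flag, and the 1.5x growth threshold are hypothetical.

```python
def choose_file_to_publish(candidate, last_good, kill_switch=False, max_growth=1.5):
    """Pick which feature list to propagate to the network.

    candidate, last_good: lists of feature entries.
    kill_switch: operator flag that freezes propagation on the last good file.
    max_growth: hypothetical sanity bound on size growth between versions.
    """
    if kill_switch:
        return last_good  # operators have frozen updates
    if not candidate:
        return last_good  # an empty file is never a valid update
    if len(candidate) > max_growth * len(last_good):
        return last_good  # e.g. a file that suddenly doubled in size
    return candidate


last_good = [f"feature_{i}" for i in range(60)]
doubled = last_good + last_good  # same entries repeated: twice the size
assert choose_file_to_publish(doubled, last_good) is last_good
```

A check like this stops a suspicious file before it is distributed, rather than relying on every consumer to reject it.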

While Prince can’t promise that Cloudflare will never have another outage of the same scale, he said that previous outages have “always led to us building new, more resilient systems.”

“Nightmare” Zipcar outage is a warning against complete app dependency

Zipcar’s rep declined to specify how many people were affected by the outage.

A warning against total app reliance

Zipcar’s app problems have not only cost it money but also traumatized some users who may think twice before using Zipcar again. The convenience of using apps to control physical products only exists if said apps are functioning and prepared for high-volume time periods, such as Thanksgiving weekend.

Despite Zipcar’s claims of a “small percentage” of users being affected, the company’s customer support system seemed overwhelmed. Long wait times, coupled with misinformation about things like fees, made already perturbed customers feel even more deserted.

Those are the pitfalls of completely relying on apps for basic functionality. There was a time when Zipcar members automatically received physical “Zipcards” for opening doors. Now, they’re not really advertised, and users have to request one.

A Zipcard. Credit: Getty

Zipcars also used to include keys inside of locked cars more frequently. Reducing these physical aspects may have saved the company money but effectively put all of Zipcar’s eggs in one basket.

Nightmarish app problems like the one Zipcar experienced can be a deal-breaker. Just look at Sonos, whose botched app update is costing it millions. Further, turning something like car rentals into a virtually app-only service is a risky endeavor that can quickly overcomplicate simple tasks. Some New Zealand gas stations were out of luck earlier this year, for example, when a Leap Day glitch caused payment processing software to stop working. Gas stations that needed apps for payments weren’t able to make sales, and drivers were inconvenienced.

Apps can simplify and streamline while delivering ingenuity. But that doesn’t mean traditional, app-free measures should be eliminated as backups.

Verizon customers face mass-scale outage across the US

5Gpocalypse

More than 100,000 reports appeared on Downdetector.

A Downdetector map showing where Verizon outages are reported, with hotspots primarily on the East Coast and in the central US, plus some in California.

Wireless customers of Verizon and AT&T have found that they cannot make calls, send or receive text messages, or download any mobile data. As of this article’s publication, it appears the problem has yet to be resolved.

Users took to social media throughout the morning to complain that their phones were showing “SOS” mode, which allows emergency calls but nothing else. This is what phones sometimes offer when the user has no SIM registered on the device. Resetting the device and other common solutions do not resolve the issue. For much of the morning, Verizon offered no response to the reports.

Within hours, more than 100,000 users reported problems on the website Downdetector. The problem does not appear isolated to any particular part of the country; users in California reported problems, and so did users on the East Coast and in Chicago, among other places.

By 10 am, some AT&T users also began reporting problems. Outage maps based on user-reported data found that the outages were especially common in parts of the country otherwise affected by Hurricane Helene.

After a period of silence, Verizon acknowledged the problem in a public statement. “We are aware of an issue impacting service for some customers,” a spokesperson told NBC News and others. “Our engineers are engaged and we are working quickly to identify and solve the issue.”

However, the spokesperson did not specify why the outage was occurring. It’s not the first major online service outage this year, though. AT&T experienced an outage previously, and the CrowdStrike-related outage of Microsoft services caused chaos and made headlines in July.

Update, 5:37 PM ET: Some users are reporting they have regained service, and Verizon confirmed this in another statement: “Verizon engineers are making progress on our network issue and service has started to be restored. We know how much people rely on Verizon and apologize for any inconvenience some of our customers experienced today. We continue to work around the clock to fully resolve this issue.”
