this post was submitted on 19 Nov 2025
358 points (98.4% liked)

Technology


The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems' permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.

The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
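As a rough illustration of the failure mode described above (not Cloudflare's actual code; the names and the limit are made up), a loader with a hard cap on entries turns an unexpectedly doubled file into an outright failure instead of a graceful fallback:

```python
# Illustrative sketch only, not Cloudflare's actual code: a loader with a
# hard cap on feature entries fails outright when the file unexpectedly
# doubles in size, instead of falling back to a last-known-good copy.
MAX_FEATURES = 200  # hypothetical preallocated limit in the routing software

def load_feature_file(path: str) -> list[str]:
    with open(path) as f:
        features = [line.strip() for line in f if line.strip()]
    if len(features) > MAX_FEATURES:
        # This is the failure mode described in the post-mortem: the file
        # exceeded the built-in limit and the software failed.
        raise RuntimeError(
            f"feature file has {len(features)} entries, limit is {MAX_FEATURES}"
        )
    return features
```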

top 50 comments
[–] JcbAzPx@lemmy.world 12 points 10 hours ago

This is just the beginning of the coming vibe code apocalypse.

[–] melsaskca@lemmy.ca 25 points 15 hours ago

We are going to see a lot more of this type of bullshit now that there are no standards anymore. Fuck everything else and make that money people!

[–] MonkderVierte@lemmy.zip 42 points 19 hours ago* (last edited 19 hours ago) (2 children)

Meaning an internal error, like the two prior ones.

Almost like one big provider with 99.9999% availability is worse than 10 with maybe 99.9%

[–] Jason2357@lemmy.ca 13 points 13 hours ago

Except, if you chose the wrong 1 of that 10 and your company is the only one down for a day, you get fire-bombed. If "TEH INTERNETS ARE DOWN" and your website is down for a day, no one even calls you.

[–] jj4211@lemmy.world 10 points 14 hours ago (1 children)

Note that this outage by itself, based on their chart, was kicking out errors over a span of about 8 hours. This one outage alone would have almost entirely blown their downtime allowance under a 99.9% availability target.

If one big provider actually delivered 99.9999%, that would be about 30 seconds of total outage over a typical year. Not even long enough for most users to be sure there was an 'outage'. That wouldn't be bad at all.
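For reference, the arithmetic behind those figures is simple; a quick sketch (plain Python, nothing Cloudflare-specific):

```python
# Downtime allowed per year at a given availability target.
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

def downtime_per_year(availability_pct: float) -> float:
    return SECONDS_PER_YEAR * (1 - availability_pct / 100)

print(downtime_per_year(99.9999))  # ~31.5 seconds
print(downtime_per_year(99.9))     # ~31,536 seconds, i.e. roughly 8.8 hours
```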

[–] mech@feddit.org 152 points 1 day ago (2 children)

A permissions change in one database can bring down half the Internet now.

[–] CosmicTurtle0@lemmy.dbzer0.com 9 points 20 hours ago

tbf IAM is the bastard child of many cloud providers.

It exists to give CISOs and BROs a level of assurance that no one person has access to their entire infrastructure. So if a company decides that system A should no longer have access to system B, they can do that quickly.

IAM is so complex now that it's a field all in itself.
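As a purely illustrative sketch of the kind of machine-to-machine access control being described (the policy format and system names are hypothetical, not any real provider's IAM):

```python
# Purely illustrative sketch of machine-to-machine access control; the
# policy format and system names are hypothetical, not any real provider's IAM.
ACCESS_POLICY = {
    ("system-a", "system-c"): "allow",
    # ("system-a", "system-b") was removed when access was revoked.
}

def is_allowed(caller: str, target: str) -> bool:
    # Deny by default: anything not explicitly allowed is blocked.
    return ACCESS_POLICY.get((caller, target)) == "allow"

assert is_allowed("system-a", "system-c")
assert not is_allowed("system-a", "system-b")  # revoked quickly, in one place
```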

[–] SidewaysHighways@lemmy.world 16 points 1 day ago

certainly brought my Audiobookshelf to its knees when I decided that that LXC was gonna go ahead and be the Jellyfin server also

[–] echodot@feddit.uk 57 points 21 hours ago (3 children)

So I work in the IT department of a pretty large company. One of the things that we do on a regular basis is staged updates, so we'll get a small number of computers and we'll update the software on them to the latest version or whatever. Then we leave it for about a week, and if the world doesn't end we update the software onto the next group and then the next and then the next until everything is upgraded. We don't just slap it onto production infrastructure and then go to the pub.

But apparently our standards are slightly higher than those of an international organisation whose whole purpose is cyber security.
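The staged rollout described above could look roughly like this sketch (group names, soak time, and the deploy/health-check functions are all hypothetical stand-ins):

```python
import time

# Rough sketch of a staged rollout: push to a small group first, wait,
# verify health, then widen. Everything here is a stand-in for the real thing.
ROLLOUT_GROUPS = ["canary", "early", "broad", "everything-else"]
SOAK_SECONDS = 7 * 24 * 3600  # "leave it for about a week"

def deploy(group: str, version: str) -> None:
    print(f"deploying {version} to {group}")  # stand-in for the real deploy step

def group_is_healthy(group: str) -> bool:
    return True  # stand-in for monitoring / automated checks

def staged_rollout(version: str) -> None:
    for group in ROLLOUT_GROUPS:
        deploy(group, version)
        time.sleep(SOAK_SECONDS)  # soak before widening the blast radius
        if not group_is_healthy(group):
            raise RuntimeError(f"halting rollout: {group} looks unhealthy")
```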

[–] IphtashuFitz@lemmy.world 10 points 13 hours ago

You would do well to go read up on the 1990 AT&T long distance network collapse. A single line of changed code, rolled out months earlier, ultimately triggered what you might call these days a DDoS attack that took down all 114 long distance telephone switches in their global network. Over 50 million long distance calls were blocked in the 9 hours it took them to identify the cause and roll out a fix.

AT&T prided itself on the thoroughness of their testing & rollout strategy for any code changes. The bug that took them down was both timing-dependent and load-dependent, making it extremely difficult to test for, and required fairly specific real world conditions to trigger. That’s how it went unnoticed for months before it triggered.

[–] floquant@lemmy.dbzer0.com 33 points 20 hours ago (3 children)

Their motivation is that that file has to change rapidly to respond to threats. If a new botnet pops up and starts generating a lot of malicious traffic, they can't just let it run for a week

[–] unexposedhazard@discuss.tchncs.de 5 points 19 hours ago* (last edited 19 hours ago) (1 children)

How about an hour? 10 minutes? Would have prevented this. I very much doubt that their service is so unstable and flimsy that they need to respond to stuff on such short notice. It would be worthless to their customers if that were true.

Restarting and running some automated tests on a server should not take more than 5 minutes.
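A pre-propagation gate along those lines might look something like this sketch (the file format, limit, hosts, and copy command are all made up for illustration, not Cloudflare's actual pipeline):

```python
import subprocess

# Hypothetical pre-propagation gate: validate a new Bot Management feature
# file before fanning it out to the fleet. All names and paths are made up.
MAX_FEATURES = 200

def feature_file_is_valid(path: str) -> bool:
    with open(path) as f:
        features = [line.strip() for line in f if line.strip()]
    return 0 < len(features) <= MAX_FEATURES

def propagate(path: str, hosts: list[str]) -> None:
    if not feature_file_is_valid(path):
        print("rejecting bad feature file; machines keep the last known-good copy")
        return
    for host in hosts:
        subprocess.run(["scp", path, f"{host}:/etc/botmgmt/features.txt"], check=True)
```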

[–] SMillerNL@lemmy.world 10 points 17 hours ago (2 children)

5 minutes of uninterrupted DDoS traffic from a bot farm would be pretty bad.

[–] ramble81@lemmy.zip 11 points 15 hours ago* (last edited 14 hours ago) (1 children)

5 hours of unintended downtime from an update is even worse.

Edited for those who didn’t get the original point.

[–] SMillerNL@lemmy.world 5 points 14 hours ago (1 children)

It wasn’t an unintentional update though, it was an intentional update with a bug.

[–] ramble81@lemmy.zip 1 points 14 hours ago

Edited. My point still stands.

[–] dafta@lemmy.blahaj.zone 7 points 15 hours ago (1 children)

Significantly better than several hours of most of the internet being down.

[–] SMillerNL@lemmy.world 4 points 14 hours ago

Maybe not updating bot mitigation fast enough would cause an even bigger outage. We don’t know from the outside.

load more comments (2 replies)
[–] codemankey@programming.dev 19 points 21 hours ago (2 children)

My assumption is that the pattern you describe is possible/doable at certain scales and with certain combinations of technologies. But doing this across a distributed system with as many nodes, and as many different kinds of nodes, as Cloudflare has, while still having a system that can be updated quickly (to respond to DDoS attacks, for example), is a lot harder.

If you really feel like you have a better solution please contact them and consult for them, the internet would thank you for it.

load more comments (2 replies)
[–] edgemaster72@lemmy.world 68 points 22 hours ago (1 children)
[–] Whimsical418@aussie.zone 35 points 21 hours ago (2 children)

Wasn’t it CrowdStrike? Close enough though.

[–] edgemaster72@lemmy.world 3 points 14 hours ago

Shit, you're right. Oh well.

[–] cepelinas@sopuli.xyz 8 points 20 hours ago

The crowd was in the cloud.

[–] dan@upvote.au 62 points 22 hours ago (2 children)

When are people going to realise that routing a huge chunk of the internet through one private company is a bad idea? The entire point of the internet is that it's a decentralized network of networks.

[–] Jason2357@lemmy.ca 4 points 13 hours ago (1 children)

Someone always chimes into these discussions with the experience of being DDOSed and Cloudflare being the only option to prevent it.

Sounds a lot like a protection racket to me.

[–] dan@upvote.au 2 points 11 hours ago

Companies like OVH have good DDoS protection too.

[–] echodot@feddit.uk 7 points 21 hours ago (2 children)

I hate it, but there really isn't much in the way of an alternative, which is why they're dominant: they're the only game in town.

[–] Capricorn_Geriatric@lemmy.world 35 points 20 hours ago (1 children)

How come?

You can route traffic without Cloudflare.

You can use CDNs other than Cloudflare's.

You can use tunneling from other providers.

There are providers of DDOS protection and CAPTCHA other than Cloudflare.

Sure, Cloudflare is probably the closest to a single, integrated solution for the full web delivery stack. It's also not prohibitively expensive, depending on who needs what.

So the true explanation, as always, is laziness.

[–] lena 4 points 14 hours ago (1 children)

I'm lazy and use Cloudflare, so that checks out. Due to recent events I'll switch to another CDN; the centralization of the internet is very concerning.

[–] dan@upvote.au 2 points 8 hours ago* (last edited 8 hours ago) (1 children)

I'm a fan of BunnyCDN - somehow they're one of the fastest while also being one of the cheapest, and they're based in Europe (Slovenia).

KeyCDN is good too, and they're also Europe-based (Switzerland), but they have a higher minimum monthly spend: $4, versus $1 at Bunny.

Fastly have a free tier with 100GB per month, but bandwidth pricing is noticeably higher than Bunny and KeyCDN once you exceed that.

https://www.cdnperf.com/ is useful for comparing performance. They don't list every CDN though.

Some CDN providers are focused only on large enterprise customers, and it shows in their pricing.

[–] lena 1 points 8 hours ago

Wow, bunny is second in query speed, just below cloudflare. Impressive!

[–] dan@upvote.au 10 points 19 hours ago* (last edited 8 hours ago)

> there really isn't much in the way of an alternative

Bunny.net covers some of the use cases, like DNS and CDN. I think they just rolled out a WAF too.

There's also the "traditional" providers like AWS, Akamai, etc. and CDN providers like KeyCDN and CDN77.

I guess one of the appeals of Cloudflare is that it's one provider for everything, rather than having to use a few different providers?

[–] Nighed@feddit.uk 8 points 21 hours ago (1 children)

Somewhere, that dev who was told that having clustered databases in nonprod was too expensive and not needed is now updating the deploy scripts.

[–] choopeek@lemmy.world 4 points 12 hours ago

Sadly, in my case, even after almost destroying a production cluster, they still decided a test cluster is too expensive and they'll just live with the risk.

[–] thisbenzingring@lemmy.sdf.org 14 points 23 hours ago

really reminds me of the self-owned CrowdStrike bullshit

[–] ranzispa@mander.xyz 7 points 22 hours ago (1 children)

> Before today, ClickHouse users would only see the tables in the default database when querying table metadata from ClickHouse system tables such as system.tables or system.columns.
>
> Since users already have implicit access to underlying tables in r0, we made a change at 11:05 to make this access explicit, so that users can see the metadata of these tables as well.

I'm no expert, but this feels like something you'd need to ponder very carefully before deploying. You're basically changing the results of queries against your DB. I don't work there, but I'm sure in plenty of places in the codebase there's a bunch of "query this and pick column 5 from the result".
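To illustrate that point (the queries and table names below are made up, not Cloudflare's actual code): a metadata query that doesn't pin the database name returns one row per database the moment tables in a second database become visible, which is exactly how a result set can silently double underneath code that indexes into it by position.

```python
# Made-up illustration only. A metadata query without a database filter
# returns one row per database once the same table is visible in another
# database (e.g. "r0"), silently doubling the result set.
UNFILTERED = """
    SELECT name, type
    FROM system.columns
    WHERE table = 'some_features_table'
"""  # duplicates appear once the table is visible in a second database

FILTERED = """
    SELECT name, type
    FROM system.columns
    WHERE table = 'some_features_table'
      AND database = 'default'
"""  # pinning the database keeps the result set stable
```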

[–] felbane@lemmy.world 2 points 16 hours ago

"Claude said it was fine, ship it."
