So I work in the IT department of a pretty large company. One of the things we do on a regular basis is staged updates: we take a small group of computers and update the software on them to the latest version. Then we leave it for about a week, and if the world doesn't end we roll the update out to the next group, and then the next, and then the next, until everything is upgraded. We don't just slap it onto production infrastructure and then go to the pub.
But apparently our standards are slightly higher than those of an international organisation whose whole purpose is cyber security.
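In code, that staged-rollout loop might look roughly like the sketch below. This is a minimal illustration, not anyone's actual tooling: the wave list, the week-long bake period, and the `update_host`/`is_healthy` placeholders are all assumptions.

```python
import time

# Hypothetical fleet split into rollout waves: a small pilot group first,
# then progressively larger groups until everything is covered.
WAVES = [
    ["pilot-01", "pilot-02"],
    ["group-a-01", "group-a-02", "group-a-03"],
    ["group-b-01", "group-b-02", "group-b-03", "group-b-04"],
]
BAKE_SECONDS = 7 * 24 * 3600  # "leave it for about a week"


def update_host(host: str, version: str) -> None:
    """Placeholder for whatever actually pushes the new version to a host."""
    print(f"updating {host} to {version}")


def is_healthy(host: str) -> bool:
    """Placeholder health check: monitoring, smoke tests, user complaints, etc."""
    return True


def staged_rollout(version: str) -> None:
    for wave in WAVES:
        for host in wave:
            update_host(host, version)
        time.sleep(BAKE_SECONDS)  # let the wave soak before judging it
        if not all(is_healthy(host) for host in wave):
            # "The world ended": stop here, investigate, roll back this wave.
            print(f"wave {wave} failed health checks; halting rollout of {version}")
            return
    print(f"{version} is now on every host")
```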
You would do well to go read up on the 1990 AT&T long-distance network collapse. A single changed line of code, rolled out months earlier, ultimately triggered what you might these days call a DDoS attack that took down all 114 long-distance telephone switches in their network. Over 50 million long-distance calls were blocked in the 9 hours it took them to identify the cause and roll out a fix.
AT&T prided itself on the thoroughness of its testing and rollout strategy for any code change. The bug that took them down was both timing-dependent and load-dependent, which made it extremely difficult to test for, and it required fairly specific real-world conditions to trigger. That's how it went unnoticed for months before it triggered.
Their motivation is that the file in question has to change rapidly to respond to threats. If a new botnet pops up and starts generating a lot of malicious traffic, they can't just let it run for a week.
How about an hour? Ten minutes? Either would have prevented this. I very much doubt that their service is so unstable and flimsy that they need to respond on such short notice; it would be worthless to their customers if that were true.
Restarting and running some automated tests on a server should not take more than 5 minutes.
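As a sketch of what even a short automated gate could look like: validate the new rules file, push it to a small canary pool, soak for ten minutes, then go wide. Everything here is assumed for illustration (the file format, the size limit, the canary host names, the health check), not Cloudflare's actual pipeline.

```python
import json
import time

CANARY_HOSTS = ["edge-canary-01", "edge-canary-02"]  # hypothetical canary pool
CANARY_BAKE_SECONDS = 10 * 60                        # the "10 minutes" above


def validate_rules(raw: bytes) -> dict:
    """Cheap sanity checks before the file goes anywhere near production."""
    assert len(raw) < 5 * 1024 * 1024, "rules file unexpectedly large"
    rules = json.loads(raw)  # must at least parse
    assert rules.get("version"), "rules file missing a version field"
    return rules


def push_to(hosts: list[str], rules: dict) -> None:
    """Placeholder for whatever distributes the file to edge hosts."""
    for host in hosts:
        print(f"pushing rules {rules['version']} to {host}")


def canaries_are_healthy() -> bool:
    """Placeholder: error rates, crash counters, latency on the canary pool."""
    return True


def deploy_rules(raw: bytes, all_hosts: list[str]) -> None:
    rules = validate_rules(raw)
    push_to(CANARY_HOSTS, rules)
    time.sleep(CANARY_BAKE_SECONDS)  # a short soak, not a week
    if not canaries_are_healthy():
        raise RuntimeError("canaries rejected the new rules; aborting global push")
    push_to(all_hosts, rules)
```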
5 minutes of uninterrupted DDoS traffic from a bot farm would be pretty bad.
5 hours of unintended downtime from an update is even worse.
Edited for those who didn’t get the original point.
It wasn't an unintentional update, though; it was an intentional update with a bug.
Edited. My point still stands.
Significantly better than several hours of most of the internet being down.
Maybe not updating bot mitigation fast enough would cause an even bigger outage. We don’t know from the outside.
There are technical solutions to this. You update half your servers, and if they die you just disconnect them from the network while you fix them, and have your unaffected servers take up the load. Now yes, this doesn't get a fix out quickly, but if your update kills your entire system, you're not going to get the fix out quickly anyway.
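That "update half, drain whatever breaks" idea could be sketched like this. The `LoadBalancer` class and the helpers are imaginary stand-ins; a real fleet would do this through an actual load balancer or service-discovery layer.

```python
class LoadBalancer:
    """Imaginary stand-in for a real load balancer / service-discovery layer."""

    def __init__(self, backends: list[str]) -> None:
        self.active = set(backends)

    def drain(self, backend: str) -> None:
        """Stop sending traffic to one backend without touching the others."""
        self.active.discard(backend)
        print(f"drained {backend}; {len(self.active)} backends still serving")


def update_backend(backend: str, version: str) -> bool:
    """Placeholder: push the update and report whether the backend survived."""
    print(f"updating {backend} to {version}")
    return True


def update_half(lb: LoadBalancer, backends: list[str], version: str) -> None:
    first_half = backends[: len(backends) // 2]
    for backend in first_half:
        if not update_backend(backend, version):
            lb.drain(backend)  # a broken node stops taking traffic immediately
    # Whether the untouched half can actually absorb the full load is a
    # separate question; that's exactly the objection raised in the reply below.
```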
Congratulations, now your "good" servers are dead from the extra load, and you also have a queue of shit to work through once you're back up, making the problem worse. Running a terabit-scale proxy network isn't exactly easy; the number of moving parts interacting with each other is insane. I highly suggest reading some of their postmortems. They're usually really well written and very informative if you want to learn more about the failures they've encountered, the processes for handling them, and their immediate remediations.
My assumption is that the pattern you describe is doable at certain scales and with certain combinations of technologies. But doing it across a distributed system with as many nodes, and as many different kinds of nodes, as CloudFlare has, while still having a system that can be updated quickly (to respond to DDoS attacks, for example), is a lot harder.
If you really feel you have a better solution, please contact them and consult for them; the internet would thank you for it.
They know this; it's not like any of this is a revelation. But the company has been lazy and would rather just test in production, because that's cheaper and, most of the time, perfectly fine.
It looks like you've never read their blog. They do a lot of research and make a lot of upstream contributions to improve their stack.