this post was submitted on 19 Nov 2025

281 points (99.3% liked)

Cloudflare blames massive internet outage on 'latent bug' (techcrunch.com)

submitted 2 days ago by tonytins@pawb.social to c/technology@lemmy.world

Around the same time, Cloudflare’s chief technology officer Dane Knecht explained in an apologetic X post that a latent bug was responsible.

“In short, a latent bug in a service underpinning our bot mitigation capability started to crash after a routine configuration change we made. That cascaded into a broad degradation to our network and other services. This was not an attack,” Knecht wrote, referring to a bug that went undetected in testing and had not previously caused a failure.

[–] Kolanaki@pawb.social 136 points 2 days ago* (last edited 2 days ago) (1 children)

[–] eager_eagle@lemmy.world 66 points 2 days ago* (last edited 2 days ago) (1 children)

thanks for illustrating the corpo speak

I hope the bug is fine

[–] moseschrute@lemmy.world 46 points 2 days ago* (last edited 2 days ago) (2 children)

Nobody ever asks if the bug is ok

[–] aeronmelon@lemmy.world 13 points 2 days ago (1 children)

Fun fact time:

That’s why they’re called computer bugs.

In 1947, the Harvard Mark II computer was malfunctioning. Engineers eventually found a dead moth wedged between two relay points, causing a short. Removing it fixed the problem. They saved the moth and it’s on display at a museum to this day.

The moth was not okay.

And to be fair, the word bug had been used to describe little problems and glitches before that incident, but this was the first case of a computer bug.

[–] FauxLiving@lemmy.world 6 points 2 days ago

The moth was not okay.

They didn't tell us this part when they taught it in school #RIP Bug, the OG bug who died to the OG pull request.

[–] tdawg@lemmy.world 7 points 2 days ago (1 children)

Poor guy :(

[–] Ceruleum@lemmy.wtf 1 points 11 hours ago

Nah, he was the first computer criminal.

[–] FauxLiving@lemmy.world 40 points 2 days ago (1 children)

If you want a technical breakdown that isn't "lol AI bad":

https://blog.cloudflare.com/18-november-2025-outage/

Basically, a permission change caused an automated query to return more data than was planned for. The query resulted in a configuration file with a large number of duplicate entries, which was pushed to production. The size of the file went over the preallocated memory limit for a downstream system, which died due to an unhandled error state resulting from the large configuration file. This caused a thread panic leading to the 5xx errors.

It seems that Crowdstrike isn't alone this year in the 'A bad config file nearly kills the Internet' club.
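To make the failure mode above concrete, here is a rough Python sketch of a loader with a fixed entry limit that rejects an oversized file and keeps the last known good config instead of dying. This is not Cloudflare's code: `MAX_FEATURES`, `load_features`, and `reload_config` are hypothetical names, the limit value is made up, and dropping duplicates before the size check is my own assumption about what a tolerant loader might do.

```python
MAX_FEATURES = 200  # hypothetical preallocated limit, not taken from the postmortem


class ConfigTooLargeError(Exception):
    """Raised when a config has more entries than the preallocated limit."""


def load_features(entries, limit=MAX_FEATURES):
    # Drop duplicate entries while keeping first-seen order (an assumption,
    # chosen because the bad file reportedly contained many duplicates).
    unique = list(dict.fromkeys(entries))
    if len(unique) > limit:
        # Report the problem to the caller instead of crashing the process.
        raise ConfigTooLargeError(
            f"{len(unique)} entries exceeds limit of {limit}"
        )
    return unique


def reload_config(current, new_entries, limit=MAX_FEATURES):
    """Return (active_config, error). Keeps `current` if the new file is rejected."""
    try:
        return load_features(new_entries, limit), None
    except ConfigTooLargeError as exc:
        # Last known good config stays active; the error can be alerted on.
        return current, str(exc)
```

The point of returning the error alongside the old config is that the caller can raise an alert while requests keep being served with the previous settings.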

[–] AldinTheMage@ttrpg.network -1 points 2 days ago (1 children)

So the actual outage comes down to pre-allocating memory, but not actually having error handling to gracefully fail if that limit is or will be exceeded... Bad day for whoever shows up on the git blame for that function

[–] witx@lemmy.sdf.org 13 points 2 days ago (2 children)

This is the wrong take. Git blame only shows who wrote the line. What about the people who reviewed the code?

[–] floquant@lemmy.dbzer0.com 8 points 2 days ago* (last edited 2 days ago) (1 children)

Plus the guys who are hired to ensure that systems don't fail even under inexperienced or malicious employees, management who designs and enforces the whole system, etc... "one guy fucked up and needs to be fired" is just a toxic mentality that doesn't actually address the chain of conditions that led to the situation

[–] AldinTheMage@ttrpg.network 4 points 2 days ago

That should also come up in reviews. Not trying to imply one guy should get fired as a scapegoat, just talking from experience about how much it sucks to know your code caused major issues.

[–] sugar_in_your_tea@sh.itjust.works 1 points 1 day ago

If you have reasonable practices, git blame will show you the original ticket, a link to the code review, and relevant information about the change.

[–] PattyMcB@lemmy.world 15 points 2 days ago (1 children)

Blame it on the massive tech sector layoffs

[–] sunbeam60@feddit.uk 3 points 2 days ago (1 children)

Evidence or speculation?

[–] iglou@programming.dev 0 points 2 days ago

Obviousness? If you mass layoff your tech staff, you take the risk of more technical failures.

A smaller staff cannot do the same work as a larger one, and I guarantee you they're being asked to progress at the same speed. So, the tradeoff is on the quality of the product and the testing, not on the speed of development.

[–] floquant@lemmy.dbzer0.com 4 points 2 days ago

Did they though? Aside from the "every outage is a latent bug" angle, from their postmortem it doesn't seem to me like they tried to blame it on anything but their failure to contain the spread of (and timely diagnose) the issue

[–] A_norny_mousse@feddit.org 4 points 2 days ago (3 children)

a routine configuration change

Honest question (I don't work in IT): this sounds like a contradiction or at the very least deliberately placating choice of words. Isn't a config change the opposite of routine?

[–] monkeyslikebananas2@lemmy.world 7 points 2 days ago (2 children)

Not really. Sometimes there are processes designed where engineers will make a change as a reaction or in preparation for something. They could have easily made a mistake when making a change like that.

[–] 123@programming.dev 3 points 2 days ago* (last edited 2 days ago)

E.g.: companies that advertise on a large sporting event might preemptively scale up (maybe warm up depending on language) their servers in preparation for a large load increase following some ad or mention of a coupon or promo code. Failure to capture the market it could generate would be seen as wasted $$$

Edit: auto-scale does not count on non essential products, people would not come back if the website failed to load on the first attempt.

[–] NotMyOldRedditName@lemmy.world 1 points 2 days ago* (last edited 2 days ago) (1 children)

I don't think it was a bug making the configuration change, I think there was a bug as a result of that change.

That specific combination of changes may not have been tested, or applied in production for months, and it just happened to happen today when they were needed for the first time since an update some time ago, hence the latent part.

But they do changes like that routinely.

[–] monkeyslikebananas2@lemmy.world 1 points 2 days ago

Yeah, I just read the postmortem. My response was more about the confusion that any configuration change is inherently non-routine.

[–] floquant@lemmy.dbzer0.com 2 points 2 days ago

No, in "DevOps" environments "configuration changes" is most of what you do every day

[–] fushuan@lemmy.blahaj.zone 2 points 2 days ago (1 children)

They probably mean that they made a change in a config file that is uploaded in their weekly or bi-weekly change window, and that the file was malformed for whatever reason, which made the process that reads it crash. The main process depends on said process, and the whole chain failed.

Things to improve:

Make the pipeline more resilient: if you have a "bot detection module" that expects a file, and that file is malformed, it shouldn't crash the whole thing. If the bot detection module crashes, control it, fire an alert, but accept the request until fixed.
Have a control on updated files to ensure that nothing outside of the expected values and form is uploaded: if the file does not comply with the expected format, the upload fails and the prod environment doesn't crash.
Have proper validation of updated config files to ensure that if something is amiss, nothing crashes and the program makes a controlled decision: if the file is wrong, instead of crashing, the module returns an informed value and lets the main program decide whether to keep going or not.
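A minimal sketch of the last two points, assuming a JSON config with a `features` list of strings. The file format, `validate_config`, `score_request`, and `ModuleResult` are all made up for illustration and are not how Cloudflare's module works: the validator can gate the upload, and the module hands back a result object rather than crashing.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModuleResult:
    ok: bool
    score: Optional[int]
    reason: str = ""


def validate_config(raw: str, max_entries: int = 100) -> List[str]:
    """Return a list of problems; an empty list means the file may be published."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    features = data.get("features") if isinstance(data, dict) else None
    if not isinstance(features, list) or not all(
        isinstance(f, str) for f in features
    ):
        return ["'features' must be a list of strings"]
    problems = []
    if len(features) > max_entries:
        problems.append(f"too many entries: {len(features)} > {max_entries}")
    if len(set(features)) != len(features):
        problems.append("duplicate entries")
    return problems


def score_request(raw_config: str, request_path: str) -> ModuleResult:
    """Toy 'bot score': count configured features that appear in the path."""
    problems = validate_config(raw_config)
    if problems:
        # Informed value instead of a crash; the caller decides what to do.
        return ModuleResult(ok=False, score=None, reason="; ".join(problems))
    features = json.loads(raw_config)["features"]
    score = sum(1 for f in features if f in request_path)
    return ModuleResult(ok=True, score=score)
```

The same `validate_config` could run in the upload pipeline, so a bad file is rejected before it ever reaches prod.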

I'm sure they have several of these and sometimes shit happens, but for something as critical as CloudFlare to not have automated integration tests in a testing environment before anything touches prod is pretty bad.

[–] groet@feddit.org 5 points 2 days ago (1 children)

it shouldn't crash the whole thing. If the bot detection module crashes, control it, fire an alert, but accept the request until fixed.

Fail open vs fail closed. Bot detection is a security feature. If the security feature fails, do you disable it and allow unchecked access to the client data? Or do you value Integrity over Availability

Imagine the opposite: they disable the feature and during that timeframe some customers get hacked. The hacks could have been prevented by the Bot detection (that the customer is paying for).

Yes, bot detection is not the most critical security feature and probably not the reason someone gets hacked, but having "fail closed" as the default for all security features is absolutely a valid policy. Changing this policy should not be the lesson from this disaster.
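For anyone unfamiliar with the terms, a small Python sketch of the difference. `FailurePolicy` and `allow_request` are hypothetical names, and a real system would log and alert rather than silently swallow the exception:

```python
from enum import Enum
from typing import Callable


class FailurePolicy(Enum):
    OPEN = "open"      # if the check breaks, let traffic through
    CLOSED = "closed"  # if the check breaks, block traffic


def allow_request(
    check: Callable[[str], bool], request: str, policy: FailurePolicy
) -> bool:
    try:
        # Normal case: the security check gives a verdict.
        return check(request)
    except Exception:
        # The check itself failed, so the policy decides the outcome.
        return policy is FailurePolicy.OPEN


def broken_check(request: str) -> bool:
    raise RuntimeError("bot detection module is down")


def healthy_check(request: str) -> bool:
    return "bot" not in request
```

With `broken_check`, `FailurePolicy.OPEN` keeps the site reachable without protection, while `FailurePolicy.CLOSED` protects the data at the cost of availability.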

[–] fushuan@lemmy.blahaj.zone 1 points 2 days ago (1 children)

You don't get hacking protection from bots, you get protection from DDoS attacks. Yeah some customers would have gone down, instead everyone went down... I said that instead of crashing the system they should have something that takes an intentional decision and informs properly about what's happening. That decision might have been to clo

You can keep the policy and inform everyone much better about what's happening. Half a day is a wild amount of downtime if it were properly managed.

Yes, bot detection is not the most critical....

So you agree that if this had been controlled instead of crashing everything, with them being able to make an informed decision about opening or closing things (with the suggestion of opening in the case of bot detection), that would be the correct approach. What's the point of your complaint if you do agree? C'mon.

[–] groet@feddit.org 1 points 2 days ago

You don't get hacking protection from bots

I disagree. I don't know the details of Cloudflare's bot detection, but there are many automated vulnerability scanners that this could protect against.

I said that instead of crashing the system they should have something that takes an intentional decision and informs properly about what's happening.

I agree. Every crash is a failure by the designers. Instead it should be caught by the program and result in a useful error state. They probably have something like that but it didn't work because the crash was too severe.

What's the point of your complaint if you do agree?

I am not complaining. I am informing you that you are missing an angle in your consideration. You can never prevent every crash ever. So when designing your product you have to consider what should happen if every safeguard fails and you get an uncontrolled crash. In that case you have to design for "fail open" or "fail closed". Cloudflare fucked up. The crash should not have happened, and if it did, it should have been caught. It wasn't. They fucked up. But I agree with the result of the fuck up being a fail closed state.

[–] DaMummy@lemmy.world 4 points 2 days ago (2 children)

Why's he saying it's not an attack? Sounds like he's protesting too much.

[–] tonytins@pawb.social 15 points 2 days ago* (last edited 2 days ago)

It's not the first time Cloudflare has shot themselves in the foot.

[–] grumpasaurusrex@lemmy.world 11 points 2 days ago (1 children)

There's nothing to be gained from Cloudflare lying about this. It honestly makes them look worse if the outage was caused internally vs if it had been due to an attack

[–] DaMummy@lemmy.world 0 points 2 days ago

Unless it's from a government they're not allowed to criticize.