this post was submitted on 02 Jul 2025

154 points (97.0% liked)

Millions of websites to get 'game-changing' AI bot blocker (www.bbc.com)

submitted 17 hours ago by Davriellelouna@lemmy.world to c/technology@lemmy.world

top 20 comments

[–] Scrollone@feddit.it 4 points 6 hours ago

I wish there were an alternative (possibly European) to Cloudflare, because it's so scary to put all your eggs in one basket.

[–] isVeryLoud@lemmy.ca 6 points 8 hours ago

So... Proprietary Anubis?

[–] Concave1142@lemmy.world 45 points 16 hours ago (3 children)

Until the AI companies find a way around it. Love the idea so hopefully it causes at least 3 days of struggle for the AI crawlers.

Having said that... can someone else put this in place, so we don't have Cloudflare hosting everything and end up just one intern away from a global outage? Please? Pretty please?

[–] auraithx@piefed.social 2 points 9 hours ago (1 children)

Yeah, this will have absolutely no impact on gathering training data.

I assumed it was to block AI agents crawling it during requests, which they’d be unlikely to bypass in the web UI.

But no company spending millions on training will hesitate to have an agent appear as a regular desktop user to scrape data.

[–] boonhet@sopuli.xyz 2 points 8 hours ago (1 children)

Does Cloudflare still look at the agent? I thought they had more reliable data points.

[–] auraithx@piefed.social 1 points 8 hours ago (1 children)

I meant an AI agent, not the browser agent. All data points can be spoofed, and if not, they’ll pay a human to scrape before they pay for content.

[–] boonhet@sopuli.xyz 1 points 7 hours ago

Okay, fair enough, I thought you meant just the user agent. The trouble with having a bot make it look like an actual user is looking at the data is that it's slow and inefficient. The trouble with paying humans to scrape the data is that it's slow and inefficient too. These companies want to ingest data ridiculously fast because there's so much of it. If all else fails, they'll resort to paying the content creators, but only if it's data they really do think gives their model a competitive edge in some metric and they can't pirate it. E.g., I can see them paying for scientific research they can't get from libgen, but not some rando's blog post or local news website.

[–] orclev@lemmy.world 10 points 16 hours ago (2 children)

The problem is that the biggest service Cloudflare provides is DDoS protection, and doing that requires that you have more bandwidth available than your attacker. Having enough bandwidth to withstand modern botnet-powered DDoS attacks is ridiculously expensive (and it's also a finite resource: there's only so much backbone infrastructure). Basically, it's economically infeasible to have multiple companies providing the service Cloudflare does. You might be able to get away with two companies doing so, but it's unlikely you could manage more than that without some of them starting to go bankrupt.

[–] acosmichippo@lemmy.world 16 points 15 hours ago (1 children)

when a critical service is not economical for more than one business to do (natural monopoly), that's when govt should be stepping in.

[–] antonim@lemmy.dbzer0.com 9 points 13 hours ago (1 children)

Which govt? I'm not comfortable with the idea of the current US govt having control over this sort of service.

[–] acosmichippo@lemmy.world 2 points 10 hours ago* (last edited 10 hours ago) (1 children)

are you comfortable with a single corporation having control over this sort of service? the current government is obviously not ideal but that shouldn’t stop us from regulating monopolies.

[–] antonim@lemmy.dbzer0.com 3 points 7 hours ago

"are you comfortable with a single corporation having control over this sort of service?"

Honestly? A tiny bit more than a single country. I have at least some minuscule control over the corporation through voting and local regulations that international corporations must follow, whereas I have absolutely no formal influence on the US govt.

[–] Kowowow@lemmy.ca 8 points 16 hours ago (1 children)

I wonder if it would be a good investment for a country to have their own, then down the line expand to sell the same service to others.

[–] altkey@lemmy.dbzer0.com 8 points 15 hours ago

It's OurFlare, comrade.

[–] baduhai@sopuli.xyz 2 points 13 hours ago

Proof of work seems to be working pretty well for many websites.
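For anyone wondering what "proof of work" means here, a minimal hashcash-style sketch in Python follows. It is not how Anubis, Cloudflare, or any particular site implements it; the function names, the challenge string, and the "leading zero hex digits" difficulty rule are all assumptions made for illustration. The idea is that the visitor has to burn CPU finding a nonce, while the server checks the answer with a single hash.

```python
import hashlib
import itertools


def solve(challenge: str, difficulty: int) -> int:
    """Brute-force a nonce so that SHA-256(challenge + nonce) starts with
    `difficulty` zero hex digits. This is the expensive, client-side part."""
    prefix = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Check a submitted nonce with one hash. This is the cheap, server-side part."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


if __name__ == "__main__":
    # Hypothetical challenge value; a real deployment would issue a fresh one per visitor.
    challenge = "example-session-token"
    nonce = solve(challenge, 4)
    print(nonce, verify(challenge, nonce, 4))
```

Each extra hex digit of difficulty multiplies the expected solving work by 16, which is negligible for one human page view but adds up for a crawler fetching millions of pages.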

[–] chunes@lemmy.world 11 points 12 hours ago

I can't wait to be denied access to websites because of it. Even more than I already am, that is.

[–] yesman@lemmy.world 15 points 13 hours ago (1 children)

This is not about stopping bot-scrapers, it's about charging them.

[–] spankmonkey@lemmy.world 3 points 11 hours ago

Hopefully people will price their content out of reach of the bot-scrapers, effectively stopping them.

[–] Imgonnatrythis@sh.itjust.works 9 points 16 hours ago (1 children)

I really wish the answer was a legally enforced robots.txt file that let any organization or individual user easily spell out the permissions for the web data they post. I often use an LLM as a search engine, and most of the time the citations are pretty decent and I use those to link out to source content. I run a small blog and I'd love to get indexed in an LLM, not blocked, as long as I was assured a reference link for any content used and had some legal recourse if I found my data was being misused. I don't love the answer being another mega corporation posing as a white knight looking to skim some money off of the "loophole" that is AI copyright infringement.

[–] drmoose@lemmy.world 0 points 2 hours ago

How would you legally enforce robots.txt? It's not a legally sound system.
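To make the enforcement problem concrete, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the bot name `ExampleAIBot`, and the helper `is_allowed` are made up for illustration. It shows that the file is only a set of rules a crawler may choose to consult: the check runs on the crawler's side, keyed on whatever user-agent name the crawler decides to report.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return whether robots.txt permits `user_agent` to fetch `url`.
    Nothing here stops a request; it only reports what the file asks for."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/blog/some-post"
    # A crawler that identifies itself honestly is told to stay out...
    print(is_allowed(ROBOTS_TXT, "ExampleAIBot", url))
    # ...but the same crawler reporting a browser-like name is not.
    print(is_allowed(ROBOTS_TXT, "Mozilla/5.0", url))
```

Any legal backing would have to come from outside the file itself, since the format has no way to verify who is asking.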