this post was submitted on 24 May 2025
1273 points (98.9% liked)

Science Memes

Welcome to c/science_memes @ Mander.xyz!

A place for majestic STEMLORD peacocking, as well as memes about the realities of working in a lab.



Rules

  1. Don't throw mud. Behave like an intellectual and remember the human.
  2. Keep it rooted (on topic).
  3. No spam.
  4. Infographics welcome, get schooled.

This is a science community. We use the Dawkins definition of meme.



[–] antihumanitarian@lemmy.world 19 points 1 hour ago (1 children)

Some details. One of the major players doing the tar pit strategy is Cloudflare. They're a giant in networking and infrastructure, and they use AI (more traditional, not LLMs) ubiquitously to detect bots. So it is an arms race, but one where both sides have massive incentives.

Generated nonsense is indeed detectable, but that misunderstands the purpose: economics. Scraping bots are used because they're a cheap way to get training data. If you make a nonzero portion of the training data poisonous, scrapers have to spend more and more resources filtering it out. The better the nonsense, the harder it is to detect. Cloudflare is known to use small LLMs to generate the nonsense, so telling it apart requires systems at least that complex.

So in short, the tar pit with garbage data actually decreases the average value of scraped data for bots that ignore do-not-scrape instructions.
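To put rough numbers on the economics argument, here is a toy model of the expected net value of one scraped page. Everything in it is an assumption made for illustration: the function name, the parameter names, the default values, and the simplification that a filter never throws away clean pages. It is a sketch of the reasoning, not a measurement of any real scraper or of Cloudflare's system.

```python
def net_value_per_page(poison_fraction, value_clean=1.0, harm_poison=2.0,
                       fetch_cost=0.1, filter_cost=0.0, filter_recall=0.0):
    """Expected net value of one scraped page under a toy model.

    All parameters are hypothetical units, not real measurements.
    Simplification: the filter never discards clean pages.
    """
    # Poisoned pages that slip past the filter end up in the training set.
    leaked = poison_fraction * (1.0 - filter_recall)
    clean = 1.0 - poison_fraction
    return clean * value_clean - leaked * harm_poison - fetch_cost - filter_cost


if __name__ == "__main__":
    # Compare no filtering against a filter that costs something per page.
    for fraction in (0.0, 0.1, 0.2, 0.4):
        unfiltered = net_value_per_page(fraction)
        filtered = net_value_per_page(fraction, filter_cost=0.05, filter_recall=0.9)
        print(f"poison={fraction:.1f} unfiltered={unfiltered:.2f} filtered={filtered:.2f}")
```

In this model the scraper loses either way once some pages are poisoned: without a filter the leaked garbage drags the value down, and with a filter the per-page filtering cost is paid on every page, clean or not.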

[–] fossilesque@mander.xyz 2 points 30 minutes ago

The fact the internet runs on lava lamps makes me so happy.

[–] mlg@lemmy.world 6 points 1 hour ago

--recurse-depth=3 --max-hits=256
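For anyone wondering what those two limits would do, here is a minimal sketch of a crawler that caps both link depth and total fetches, run against a made-up infinite link maze. The parameter names `recurse_depth` and `max_hits` just mirror the flags in the comment above, they are not options of any real tool that I know of. `maze_links` and `crawl` are hypothetical helpers, and no network requests are made.

```python
import hashlib
from collections import deque


def maze_links(path, fanout=4):
    """Hypothetical tar pit: every path links to `fanout` new, deeper paths."""
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
    base = path.rstrip("/")
    # The index prefix guarantees the child paths are distinct.
    return [f"{base}/{i}-{digest[:8]}" for i in range(fanout)]


def crawl(start, get_links, recurse_depth=3, max_hits=256):
    """Breadth-first crawl bounded by link depth and by total pages fetched."""
    seen = {start}
    queue = deque([(start, 0)])
    fetched = []
    while queue and len(fetched) < max_hits:
        url, depth = queue.popleft()
        fetched.append(url)  # stands in for an actual HTTP request
        if depth >= recurse_depth:
            continue  # do not follow links any deeper
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return fetched


if __name__ == "__main__":
    pages = crawl("/trap", maze_links)
    print(len(pages), "pages fetched before the limits stopped the crawl")
```

The maze never ends, but the crawl does: the depth cap bounds how far into the trap the bot walks, and the hit cap bounds the total cost even if the depth cap is set generously. The trade-off is that the same limits also cut off legitimately deep sites.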

[–] stm@lemmy.dbzer0.com 28 points 6 hours ago

Such a stupid title, great software!

[–] Iambus@lemmy.world 13 points 7 hours ago

Typical bluesky post

[–] Zacryon@feddit.org 51 points 13 hours ago (5 children)

I suppose this will become an arms race, just like with ad-blockers and ad-blocker detection/circumvention measures.
There will be solutions for scraper-blockers/traps. Then those become more sophisticated. Then the scrapers become better again and so on.

I don't really see an end to this madness. Such a huge waste of resources.

[–] arararagi@ani.social 7 points 7 hours ago

Well, the adblockers are still winning. Even on Twitch, where the ads come from the same pipeline as the stream, people made solutions that still block them, since uBlock Origin couldn't by itself.

[–] enbiousenvy@lemmy.blahaj.zone 9 points 8 hours ago

The rise of LLM companies scraping the internet is also, I noticed, the moment YouTube started getting harsher against adblockers and third-party viewers.

The Piped and Invidious instances I used to use no longer work, and neither do many other instances. NewPipe has been breaking more frequently. youtube-dl and yt-dlp sometimes cannot fetch higher-resolution video. And sometimes the main YouTube site is broken on Firefox with uBlock Origin.

Not just YouTube: z-library, and especially sci-hub & libgen, have also been harder to use sometimes.

[–] pyre@lemmy.world 19 points 12 hours ago

There is an end: you legislate it out of existence. Unfortunately, US politicians are instead trying to outlaw any regulations regarding AI. I'm sure it's not about the money.

[–] ZeffSyde@lemmy.world 9 points 9 hours ago (1 children)

I'm imagining a bleak future where, in order to access data from a website, you have to pass a three-tiered system of tests that make 'click here to prove you aren't a robot' and 'select all of the images that have a traffic light' seem like child's play.

[–] Tiger_Man_@lemmy.blahaj.zone 2 points 8 hours ago (1 children)

All you need to protect data from AI is to use a non-HTTP protocol, at least for now.

[–] Bourff@lemmy.world 7 points 8 hours ago

Easier said than done. I know of IPFS, but how widespread and easy to use is it?

[–] Tiger_Man_@lemmy.blahaj.zone 3 points 8 hours ago (1 children)

How can I make something like this?
