mindbleach@sh.itjust.works 28 points 2 months ago
Gullible@sh.itjust.works 20 points 2 months ago

I mean, yeah. AI companies are nearly universally, and objectively, piloted by fucksticks. Lemmy instances also get scraped constantly, when the scrapers could just spin up an instance of their own and federate to pull in the entire threadiverse.

baltakatei@sopuli.xyz 11 points 2 months ago

I have a public gitweb repository. I am constantly being hit by dumb crawlers that, left to their own devices, request every single diff of every single commit, simply because the links for those operations are there to be followed. All of it is unnecessary: if they would just do a simple git pull, my server would happily hand over the entire 50 MB of repo history. Instead, they download gigabytes of HTML boilerplate, probably never assemble a full commit history, and probably can't even use what they do scrape, since they're just grabbing random commits in between blocks and bans.
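For comparison, a well-behaved crawler would only need something along these lines. This is a rough sketch; the repo URL and local path are placeholders, not the actual repository described above:

```python
import subprocess
from pathlib import Path

# Placeholder URL standing in for any public git repository served via gitweb.
REPO_URL = "https://example.org/git/project.git"
MIRROR_DIR = Path("project.git")

def sync_repo():
    """Fetch the complete history once, then transfer only incremental updates."""
    if MIRROR_DIR.exists():
        # Subsequent runs move only new objects -- a few KB on a quiet repo.
        subprocess.run(
            ["git", "--git-dir", str(MIRROR_DIR), "fetch", "--prune"],
            check=True,
        )
    else:
        # First run: one transfer gets every commit, diff, and tag (~50 MB here).
        subprocess.run(
            ["git", "clone", "--mirror", REPO_URL, str(MIRROR_DIR)],
            check=True,
        )

if __name__ == "__main__":
    sync_repo()
```

Every diff and history view the scrapers are reconstructing from HTML falls out of that one transfer.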

All of this only became an issue around a year ago. Since then, I've just accepted that my public-facing static pages are the only thing that stays reliable anymore.

mycodesucks@lemmy.world 26 points 2 months ago

Doesn't Wikipedia host torrents of its entire library? Why are they scraping?

bamboo@lemmy.blahaj.zone 14 points 2 months ago

I would imagine the torrents are a snapshot in time; I don't think they can be updated after being created. Also, picture the average dev. Half of them are too lazy to deal with RSS or torrents when you can just fire off thousands of redundant GET requests to use for training data.

theunknownmuncher@lemmy.world 13 points 2 months ago

This makes no sense; the snapshots are updated regularly, and Wikipedia isn't even that big. Like 25 GB.
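For reference, grabbing the official dump is a single streamed request. A minimal sketch, assuming the standard dumps.wikimedia.org layout for the English-language articles dump (check the current file listing before relying on the exact URL):

```python
import shutil
import urllib.request

# Conventional "latest" location for the English Wikipedia articles dump;
# verify against the dumps.wikimedia.org index before depending on it.
DUMP_URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
OUT_FILE = "enwiki-latest-pages-articles.xml.bz2"

def download_dump():
    """Stream the compressed dump to disk in one request instead of millions of page fetches."""
    with urllib.request.urlopen(DUMP_URL) as response, open(OUT_FILE, "wb") as out:
        shutil.copyfileobj(response, out, length=1024 * 1024)  # 1 MiB chunks

if __name__ == "__main__":
    download_dump()
```

One download, refreshed whenever a new dump is published, versus hammering the live site article by article.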

Gullible@sh.itjust.works 19 points 2 months ago

The answer is simpler than you could ever conceive. Companies piloted by incompetent, selfish pricks are just scraping the entire internet to grab every niblet of data they can. Writing code that does what they're doing in a less destructive fashion would require effort they are entirely unwilling to put in. If that weren't the case, the overwhelming majority of scrapers wouldn't ignore robots.txt files. I hate AI companies so fucking much.

pivot_root@lemmy.world 8 points 2 months ago

"robots.txt files? You mean those things we use as part of the site index when scraping it?"

— AI companies, probably
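Respecting robots.txt is not exactly heavy lifting, either; Python ships a parser for it in the standard library. A minimal sketch, with a made-up user agent string and example URLs:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler identity and target site; the point is how little code compliance takes.
USER_AGENT = "ExampleBot"
ROBOTS_URL = "https://en.wikipedia.org/robots.txt"

parser = RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()  # one extra GET per site

def allowed(url: str) -> bool:
    """Return True only if the site's robots.txt permits USER_AGENT to fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)

if __name__ == "__main__":
    # Result depends on the site's live robots.txt rules.
    print(allowed("https://en.wikipedia.org/wiki/Special:Random"))
```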

taiyang@lemmy.world 2 points 2 months ago

Close. I think the average dev can't even imagine a product that isn't for profit and is available to everyone as a public service. Scraping uses far more resources than just downloading the snapshot on a schedule, so it genuinely makes no sense. It's like shoplifting from a soup kitchen.

But... I know tech people personally. They're really that dumb.