this post was submitted on 21 Aug 2024
321 points (100.0% liked)

top 34 comments
[–] SteveFromMySpace@lemmy.blahaj.zone 113 points 2 years ago (1 children)

but not the misuse of public content

HA

[–] unrelatedkeg@lemmy.sdf.org 15 points 2 years ago* (last edited 2 years ago) (1 children)

but not the misuse of public content

Is that an admission that they don't own the content others posted on their site?

[–] GroupNebula563@lemmy.world 4 points 2 years ago

you would be a good lawyer

[–] shikogo@pawb.social 71 points 2 years ago (3 children)

I am confused, does this mean Reddit is not going to be searchable on search engines anymore?

[–] Aeri@lemmy.world 67 points 2 years ago (4 children)

oh no, Reddit is like, the only way to have google still be useful.

[–] germanatlas@lemmy.blahaj.zone 55 points 2 years ago

Funnily enough, google is also the only way to have Reddit be useful.

Their own search function has been nothing but garbage.

[–] morgunkorn@discuss.tchncs.de 44 points 2 years ago (2 children)

That's the catch: Google made a deal with Reddit and remains the only search engine allowed to access its data for indexing. It cuts off every other search engine.

[–] Vorticity@lemmy.world 28 points 2 years ago (1 children)

Tell me that there is an anti trust suit over this.

[–] GroupNebula563@lemmy.world 27 points 2 years ago

There's a suit over google in general so this may well be part of it

[–] TriflingToad@lemmy.world 3 points 2 years ago (1 children)

really? ddg will show me reddit links, did they have to make a webscraper or something

[–] morgunkorn@discuss.tchncs.de 4 points 2 years ago

There's a cutoff date: anything indexed before the robots.txt was changed stays in the index.

[–] riodoro1@lemmy.world 31 points 2 years ago (1 children)

We fucked the internet. It’s proprietary now.

[–] GroupNebula563@lemmy.world 11 points 2 years ago* (last edited 2 years ago) (1 children)

we fucked the internet

kinky

[–] pupbiru@aussie.zone 8 points 2 years ago (1 children)
[–] Swedneck@discuss.tchncs.de 2 points 2 years ago

cat5-o-nine-tails

[–] princessnorah@lemmy.blahaj.zone 9 points 2 years ago (1 children)

Good news! Google paid up and still has access I'm pretty sure.

[–] GroupNebula563@lemmy.world 1 points 2 years ago (1 children)

That's bad news, that means the internet is dying

[–] princessnorah@lemmy.blahaj.zone 1 points 2 years ago (1 children)

Sorry, the /s was sort of implied.

[–] GroupNebula563@lemmy.world 2 points 2 years ago

Ah, sorry. I have trouble with that sometimes :P

[–] GroupNebula563@lemmy.world 9 points 2 years ago (1 children)

Perhaps, likely depends on the crawler though

Yeah, I don't think ignoring robots.txt is even illegal. They can of course just block your crawler's IP, but that would be a cat-and-mouse game that they would lose in the end.
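
To make the "ignoring robots.txt" point concrete: the file is only advisory, so the check happens inside the crawler's own code, if it happens at all. Below is a minimal sketch using Python's standard `urllib.robotparser`. The blanket-disallow file, the `ExampleBot` name, and the `example.com` URL are assumptions made up for illustration, not a copy of any real site's robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks every crawler to stay out entirely.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite crawler runs this check and backs off when it returns False.
# Nothing on the server side enforces the answer; a rude crawler just skips it.
blocked = not is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/some/thread")
```

Actually keeping a crawler out needs server-side measures such as the IP blocking mentioned above, which is where the cat-and-mouse game starts.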

[–] JusticeForPorygon@lemmy.world 54 points 2 years ago (1 children)

Not gonna lie, this seems like ultimately a win for the Internet. The years of troubleshooting solutions Reddit provided can be archived (hopefully), but the fewer people rely on the site itself, the better. At least in my opinion.

[–] TriflingToad@lemmy.world 2 points 2 years ago

I disagree, kinda. Stack Overflow is the other option for questions, which is a lot less user friendly, and Lemmy has never shown up in search results for me. If something comes along and makes it simple, great! However, I just see a lot more ad-filled hellhole sites in the meantime.

[–] Kojichan@lemmy.world 52 points 2 years ago

I remember finding Google's robots.txt when they first came out. It was a cute little text ASCII art of a robot with a heart that said, "We love robots!"

[–] jabathekek@sopuli.xyz 50 points 2 years ago (1 children)

An ancient text from the before-fore.

[–] GroupNebula563@lemmy.world 60 points 2 years ago (1 children)

this is actually quite recent. the old one was much funnier and clearly had actual soul put into it.

[–] AsudoxDev@programming.dev 6 points 2 years ago

my shiny metal ass

[–] itsnicodegallo@lemm.ee 8 points 2 years ago (4 children)

As annoying as this is, it's to prevent LLMs from training themselves using Reddit content, and that's probably the greater of the two evils.

[–] GroupNebula563@lemmy.world 37 points 2 years ago (1 children)

That's all well and good, but how many LLMs do you think actually respect robots.txt?

[–] colin@lemmy.uninsane.org 14 points 2 years ago

from my limited experience, about half? i had to finally set up a robots.txt last month after Anthropic decided it would be OK to crawl my Wikipedia mirror from about a dozen different IP addresses simultaneously, non-stop, without any rate limiting, and bring it to its knees. fuck them for it, but at least it stopped once i added robots.txt.

Facebook, Amazon, and a few others are ignoring that robots.txt, on the other hand. they have the decency to do it slowly enough that i'd never notice unless i checked the logs, at least.
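
For anyone wondering what "adding a robots.txt" does from the crawler's side, here is a rough sketch of a file that bans one bot outright and asks everyone else to slow down, parsed with Python's standard `urllib.robotparser`. The bot names (`HypotheticalBot`, `OtherBot`), the paths, and the 10-second delay are invented for illustration and are not taken from the comment above. Note that `Crawl-delay` is a non-standard directive that only some crawlers honor.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is banned, everyone else is rate limited.
ROBOTS_TXT = """\
User-agent: HypotheticalBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "https://example.org/wiki/Some_Page"

# The banned bot is told to stay away from every path.
banned_ok = parser.can_fetch("HypotheticalBot", page)

# Any other bot may fetch public pages, but is asked to wait between requests.
other_ok = parser.can_fetch("OtherBot", page)
other_delay = parser.crawl_delay("OtherBot")  # seconds, or None if not specified

# A well-behaved crawler would sleep for other_delay seconds between requests;
# as with the Disallow rules, nothing forces it to.
```

A crawler that ignores the file sees none of this, which matches the experience described above: some bots stopped once the file existed, others carried on.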

[–] jbk@discuss.tchncs.de 32 points 2 years ago

I thought major LLMs ignored robots.txt

[–] anas@lemmy.world 12 points 2 years ago

It’s to prevent LLMs from training themselves using reddit content, unless they pay the party that took no part in creating said content

FTFY