this post was submitted on 13 Jan 2026
I have a GitHub/GitLab mirror of around 10-20 GB. I am constantly under attack from crawlers run by top US technology corporations and LLM startups. Whenever I ban one IP range they switch to another. I don't know if those fuckers have tickets in their systems to do it manually or if they just deploy this shit all over the planet. From what I observe during the attacks I mitigate, the best way to poison them is to create a Gitea instance with a poisoned code repository and a couple hundred revisions, because what they are most interested in is the HTML representation of the diff between two git revisions.
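For anyone wanting to try the decoy-repository idea, here is a rough sketch of how the "couple hundred revisions" could be generated. It is not the poster's actual setup: the word list, the file name `module.py`, the committer identity, and the helper names `junk_line`, `make_revisions`, and `commit_revisions` are all made up for illustration. Each revision changes only a few lines of the previous one, so the diffs a crawler fetches look like ordinary small commits.

```python
import random
import subprocess
from pathlib import Path

# Hypothetical vocabulary for junk identifiers; any word list would do.
WORDS = ["buf", "node", "ctx", "item", "count", "index", "value", "state"]


def junk_line(rng: random.Random) -> str:
    """Return one syntactically valid but meaningless Python assignment."""
    name = f"{rng.choice(WORDS)}_{rng.randrange(1000)}"
    left = rng.randrange(10**6)
    right = rng.randrange(1, 10**6)
    return f"{name} = {left} {rng.choice('+-*')} {right}"


def make_revisions(count: int, lines: int = 40, seed: int = 0) -> list[str]:
    """Build `count` file versions, each rewriting a few lines of the previous one."""
    rng = random.Random(seed)
    current = [junk_line(rng) for _ in range(lines)]
    revisions = []
    for _ in range(count):
        for _ in range(3):  # up to three changed lines keeps every diff small
            current[rng.randrange(lines)] = junk_line(rng)
        revisions.append("\n".join(current) + "\n")
    return revisions


def commit_revisions(repo: Path, revisions: list[str]) -> None:
    """Write each revision to one file and commit it (assumes the git CLI is installed)."""
    repo.mkdir(parents=True, exist_ok=True)
    # Placeholder identity passed per command so no global git config is needed.
    git = ["git", "-C", str(repo),
           "-c", "user.name=decoy", "-c", "user.email=decoy@example.invalid"]
    subprocess.run(git + ["init", "--quiet"], check=True)
    for number, text in enumerate(revisions, start=1):
        (repo / "module.py").write_text(text)
        subprocess.run(git + ["add", "module.py"], check=True)
        subprocess.run(git + ["commit", "--quiet", "-m", f"revision {number}"], check=True)
```

Calling `commit_revisions(Path("decoy"), make_revisions(200))` would produce a local repository with 200 commits; pushing it to a Gitea instance is left out here because that depends on how the instance is set up.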
Why isn't there anything in the DMCA for stopping crawlers? They have stuff about requiring crawlers to follow attribution and whatnot, but nothing for not allowing crawlers in the first place. Stupid as shit.
I can get a 50Gb/s residential link where I am, and have a whole rack of servers.
Sounds like a good opportunity to crowdfund thousands and thousands of common scrapeable instances that serve randomly poisoned content.