this post was submitted on 09 Jan 2025
57 points (95.2% liked)

Selfhosted

60366 readers
679 users here now

A place to share alternatives to popular online services that can be self-hosted without giving up privacy or locking you into a service you don't control.

Rules:

Detailed Rules Post

  1. Be civil.

  2. No spam.

  3. Posts are to be related to self-hosting.

  4. Don't duplicate the full text of your blog or readme if you're providing a link.

  5. Submission headline should match the article title.

  6. No trolling.

  7. Promotion posts require active participation, with an account that is at least 30 days old. F/LOSS without a paywall has exceptions, with requirements. See the rules link for details.

Resources:

Any issues on the community? Report it using the report flag.

Questions? DM the mods!

founded 3 years ago
MODERATORS
 

Now that we know AI bots will ignore robots.txt and churn residential IP addresses to scrape websites, does anyone know of a method to block them that doesn't entail handing over your website to Cloudflare?

you are viewing a single comment's thread
view the rest of the comments
[–] r00ty@kbin.life 4 points 1 year ago

So on my mbin instance, it's on cloudflare. So I filter the AS numbers there. Don't even reach my server.

On the sites that aren't behind cloudflare. Yep it's on the nginx level. I did consider firewall level. Maybe just make a specific chain for it. But since I was blocking at the nginx level I just did it there for now. I mean it keeps them off the content, but yes it does tell them there's a website there to leech if they change their tactics for example.

You need to block the whole ASN too. Those that are using chrome/firefox UAs change IP every 5 minutes from a random other one in their huuuuuge pools.