Is there a list of things to add to a robots.txt file to prevent AI scraping and such? (Or at least tell the scrapers to F off, I am well aware they mostly don’t care.)

In reply to Jan Beta

That's actually quite an interesting question, never thought about it. I guess this repo I just googled is already known? github.com/ai-robots-txt/ai.ro…

Looks kind of promising to me.
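
For a rough idea of what that repo produces, the rules boil down to robots.txt entries of this shape (only a few agent names shown here as an illustration; the full, current list is in the repo):

    # robots.txt - ask known AI crawlers to stay away (shortened, illustrative list)
    User-agent: GPTBot
    User-agent: CCBot
    User-agent: ClaudeBot
    User-agent: Google-Extended
    Disallow: /

A group of User-agent lines followed by a single Disallow applies that rule to every listed agent.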

In reply to datort

@datort Looks good, thanks! I have something similar in place already but haven’t updated it in a while. This seems to be more up to date.
In reply to Jan Beta

Many of the AI bots don't respect Robots.txt 🫤

They have to be blocked with agent strings or by IP range.

I have something configured for my WordPress site. I'll have to dig out the details when I'm home from work 🙂
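
For a WordPress site on Apache, the usual shape of such a block is a rewrite rule on the User-Agent header, roughly like this (a sketch assuming Apache with mod_rewrite; the agent list is illustrative, not the actual config mentioned above):

    # .htaccess - reject requests whose User-Agent matches known AI crawlers
    <IfModule mod_rewrite.c>
      RewriteEngine On
      RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|ClaudeBot|Bytespider) [NC]
      RewriteRule .* - [F,L]
    </IfModule>

The [F] flag answers with a 403; IP ranges would be handled separately, for example with Require not ip directives or at the firewall.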

In reply to Papierzeit

@Papierzeit 😅 I’m not sure if that tip is viable for me at this point. (Although admittedly it’s the safest option.)
In reply to Jan Beta

Of course you can decide for yourself. I've just reached the point where I don't feel like sharing anymore. I've enjoyed it ever since my BBS days in the 90s, but now I'm at the point where I say: I don't share, and I feel better for it. Why? Because it's not like it used to be. Mass instead of class. You understand.
In reply to Papierzeit

@Papierzeit I totally understand. I miss the times when you could basically control every single bit you sent out into the world. And just hang up when you were done communicating.
In reply to Jan Beta

You know, it's not just that. Back then it was like this:

I have a thirst for knowledge - I drink a glass of water.

Today: this damn fire hose is too much for me. It drives me crazy.

In reply to Jan Beta

Perhaps add a “please don't scrape me” if it makes you feel good.

As you say, it will not be respected anyway.

In reply to Jan Beta

Adding a reply from Mastodon too, because it seems my freshly installed Friendica server isn't all that functional yet, but there's this: github.com/ai-robots-txt/ai.ro… - I've turned those user agents into rules in my web server config that return a 404 whenever a request with one of them comes in, because as you said, what scraper respects robots.txt?
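
If the web server happens to be nginx, a minimal sketch of that kind of rule could look like this (server name and agent list are placeholders, not the actual config):

    # nginx - answer 404 to requests from known AI crawler user agents (illustrative list)
    # the map block belongs in the http context
    map $http_user_agent $ai_bot {
        default 0;
        ~*(GPTBot|CCBot|ClaudeBot|Bytespider) 1;
    }

    server {
        listen 80;
        server_name example.org;   # placeholder

        if ($ai_bot) {
            return 404;
        }
    }

Returning 404 instead of 403 has the side effect of not telling the scraper it was blocked on purpose.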
In reply to Jan Beta

Robots.txt won't work.
I have a long list of user agents and IP ranges in my nginx config that I gathered from various sources. Any request matching those is an instant 403. If any IP accumulates enough 403s, fail2ban will ban it for a week.
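
A sketch of that combination, assuming the 403s end up in the standard nginx access log (jail name, paths and thresholds here are made up, not the actual setup):

    # /etc/fail2ban/jail.local - ban IPs that collect too many 403s
    [nginx-403]
    enabled  = true
    port     = http,https
    filter   = nginx-403
    logpath  = /var/log/nginx/access.log
    maxretry = 10
    findtime = 10m
    bantime  = 1w

    # /etc/fail2ban/filter.d/nginx-403.conf
    [Definition]
    failregex = ^<HOST> -.*" 403 \d+
    ignoreregex =

The bantime of 1w matches the one-week ban mentioned above; fail2ban 0.10 and later understands these time abbreviations.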

Apparently Google hates me for it and decided that I won't get on the first page anymore, even if the search term is my website's domain. Then again, Google results are shitty already and keep getting shittier, so that might not matter in the long run.

In reply to Jan Beta

darkvisitors.com/ is one I bookmarked a while back to check out, but I have not implemented it on my site yet.