Is there a list of things to add to a robots.txt file to prevent AI scraping and such? (Or at least tell the scrapers to F off; I'm well aware they mostly don't care.)
That's actually quite an interesting question; I'd never thought about it. I guess this repo I just googled is already known? github.com/ai-robots-txt/ai.ro… Looks kind of promising to me.
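For a rough idea of what's in a list like that: the entries boil down to naming known AI-crawler user agents and disallowing everything for them, along these lines (only a handful of example agents here; the actual list tracks far more):

# Excerpt-style sketch of an AI-blocking robots.txt; example agents only.
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
User-agent: Bytespider
Disallow: /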
Of course you can decide that for yourself. I've just reached the point where I don't feel like sharing anymore. I've loved sharing since my BBS days in the 90s, but now I've gotten to the point where I say: don't share, and I feel better for it. Why? It's not like it used to be. Mass instead of class. You understand.
@Papierzeit I totally understand. I miss the times when you could basically control every single bit you sent out into the world. And just hang up when you were done communicating.
Adding a reply from Mastodon too, because it seems my freshly installed Friendica server isn't all that functional yet, but there's this: github.com/ai-robots-txt/ai.ro… I've turned those user agents into rules in my web server config that return a 404 whenever a request with one of them comes in, because as you said, what scraper respects robots.txt?
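In nginx, that kind of rule is roughly this shape (a minimal sketch; the agent names are just examples from such a list, and example.org is a placeholder):

# Map known AI-crawler user agents to a flag, then refuse to serve them.
# The map block belongs in the http context, outside the server block.
map $http_user_agent $ai_scraper {
    default 0;
    "~*(GPTBot|ClaudeBot|CCBot|Google-Extended|Bytespider)" 1;
}

server {
    listen 80;
    server_name example.org;   # placeholder

    # Pretend the content doesn't exist for matching agents.
    if ($ai_scraper) {
        return 404;
    }

    # ... normal site config continues here ...
}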
Robots.txt won't work. I have a long list of user agents and IP ranges in my nginx config, gathered from various sources. Any request matching those is an instant 403. If any IP accumulates enough 403s, fail2ban will ban it for a week.
Apparently Google hates me for it and has decided that I won't get on the first page anymore, even if the search term is the domain of my website. Then again, Google results are shitty already and keep getting shittier, so that might not matter in the long run.
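That kind of setup is roughly this shape (a sketch only; the IP ranges and agent names below are placeholders, not an actual blocklist):

# geo flags requests by source IP range, map flags them by user agent.
# Both blocks belong in the http context.
geo $blocked_ip {
    default         0;
    203.0.113.0/24  1;   # placeholder range (TEST-NET-3)
    198.51.100.0/24 1;   # placeholder range (TEST-NET-2)
}

map $http_user_agent $blocked_agent {
    default 0;
    "~*(GPTBot|ClaudeBot|CCBot|Bytespider)" 1;
}

server {
    listen 80;
    server_name example.org;   # placeholder

    # Instant 403 for anything matching either list.
    if ($blocked_ip)    { return 403; }
    if ($blocked_agent) { return 403; }
}

A fail2ban jail can then tail the access log, count 403s per client IP, and ban an address for a week once it crosses a threshold.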
Akseli, in reply to Jan Beta:
ZADZMO code
zadzmo.org
Jenny753, in reply to Jan Beta:
ai.robots.txt/robots.txt at main · ai-robots-txt/ai.robots.txt
Simon Zerafa, in reply to Jan Beta:
Many of the AI bots don't respect robots.txt 🫤
They have to be blocked by user agent string or by IP range.
I have something configured for my WordPress site. I'll have to dig out the details when I'm home from work 🙂
Papierzeit, in reply to Jan Beta:
You know, it's not just that. Back then it was like this:
I have a thirst for knowledge, so I drink a glass of water.
Today: this damn fire hose is too much for me. It drives me crazy.
Martin Gausby, in reply to Jan Beta:
Perhaps “please don't scrape me”, if it makes you feel good.
As you say, it will not be respected anyway.
Koen Martens, in reply to Jan Beta:
GitHub - ai-robots-txt/ai.robots.txt: A list of AI agents and robots to block.
KungFuDiscoMonkey, in reply to Jan Beta:
Dark Visitors - Track AI Agents and Control Bot Traffic