Profile: 671d0e04...
nostr:npub1vuwsup87kt73vczp64mztq3txmkr9mhprlmwk7a9q7fmxzvlgxzqlsyv2p
I bet a few other non-profit/free websites also keep a good index. Archive.org is surprisingly permissive. They were my next guess.
I, for one, tend to like a robots.txt like the one your instance has.
nostr:npub1ygrf5pf0ahkyyyrngv3twymqcz7pav6ssymrhdj90luscm9llszq56cg6z That might be a default tbh but it is effective and terse!
I recently got into using my web server configs to redirect known bad user-agent strings and paths (like wp-login.php when I’m not using WordPress) to a CGI script that prints about a gig of text.
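For anyone curious what that kind of CGI script might look like, here is a rough sketch in Python. It is not the actual script from this post. The filler text, the `stream_junk` name, and the chunk size are all made up for illustration, and it assumes the web server already routes the unwanted requests to it and sets the standard `GATEWAY_INTERFACE` CGI variable.

```python
#!/usr/bin/env python3
"""Hypothetical CGI tarpit: answers unwanted requests with a large plain-text body."""
import os
import sys

# Illustrative filler; any bytes would do.
CHUNK = b"nothing to see here\n" * 512
TOTAL_BYTES = 1024 ** 3  # "about a gig"


def stream_junk(out, total_bytes=TOTAL_BYTES, chunk=CHUNK):
    """Write a CGI header, then filler until total_bytes of body have been sent.

    Returns the number of body bytes written.
    """
    out.write(b"Content-Type: text/plain\r\n\r\n")
    sent = 0
    while sent < total_bytes:
        # Trim the last piece so the body never exceeds total_bytes.
        piece = chunk[: total_bytes - sent]
        out.write(piece)
        sent += len(piece)
    out.flush()
    return sent


# Only stream when actually invoked by a web server as a CGI program.
if __name__ == "__main__" and os.environ.get("GATEWAY_INTERFACE"):
    try:
        stream_junk(sys.stdout.buffer)
    except BrokenPipeError:
        # The client hung up early, which is fine.
        pass
```

Writing in chunks keeps memory use flat, since the full gigabyte never has to exist in memory at once.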
As I’m typing this though, I’m finally working on a more robust CAPTCHA solution. I found a module that takes a number and returns a Unicode Roman numeral, so I’m thinking of a “What number does this Roman numeral represent?” kinda challenge (idk how to phrase that more intelligently rn)
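A minimal sketch of how that Roman numeral challenge could work, in Python. The module mentioned above isn't named, so this uses a hand-rolled converter that emits plain ASCII letters rather than the dedicated Unicode numeral characters. `to_roman`, `make_challenge`, and `check_answer` are hypothetical names, and the 1 to 50 range is an arbitrary choice.

```python
import random

# Value/symbol pairs in descending order, including the subtractive forms.
_NUMERALS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(n):
    """Convert an integer in 1..3999 to a Roman numeral string."""
    if not 1 <= n <= 3999:
        raise ValueError("n must be between 1 and 3999")
    parts = []
    for value, symbol in _NUMERALS:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


def make_challenge(rng=random, low=1, high=50):
    """Return (question, expected_answer) for a randomly chosen number."""
    answer = rng.randint(low, high)
    question = f"What number does the Roman numeral {to_roman(answer)} represent?"
    return question, answer


def check_answer(expected, submitted):
    """Compare the visitor's text input against the expected integer."""
    try:
        return int(submitted.strip()) == expected
    except ValueError:
        return False
```

In a real form the expected answer would have to stay server-side (a session, for example) instead of being sent along with the question.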
nostr:npub1vuwsup87kt73vczp64mztq3txmkr9mhprlmwk7a9q7fmxzvlgxzqlsyv2p
I like the person who wrote the comments for their robots.txt. I can feel their cynicism and it feels so good.
nostr:npub1ygrf5pf0ahkyyyrngv3twymqcz7pav6ssymrhdj90luscm9llszq56cg6z At no point did I actually think to look at their robots.txt… Honestly this is a good ledger of ‘known bad crawlers’ lol.
I found out years ago that (at that time anyways) they had a bunch of regular expressions saved off somewhere to match known spammy/URL shortener URLs so I grabbed those to use to filter my guestbook spam lol
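A rough sketch of that kind of guestbook filter in Python. The three patterns below are placeholders picked for illustration, not the regular expressions mentioned above, and `looks_like_spam` and `filter_entries` are hypothetical names.

```python
import re

# Illustrative patterns only; a real list would be much longer.
SHORTENER_PATTERNS = [
    r"https?://(?:www\.)?bit\.ly/\S+",
    r"https?://(?:www\.)?tinyurl\.com/\S+",
    r"https?://t\.co/\S+",
]

# Combine everything into one compiled, case-insensitive alternation.
_SPAM_RE = re.compile(
    "|".join(f"(?:{p})" for p in SHORTENER_PATTERNS),
    re.IGNORECASE,
)


def looks_like_spam(message):
    """True if the message contains a URL matching any known pattern."""
    return _SPAM_RE.search(message) is not None


def filter_entries(entries):
    """Keep only the guestbook entries that do not look like spam."""
    return [entry for entry in entries if not looks_like_spam(entry)]
```

Keeping the patterns in a plain list makes it easy to load them from a file later if the list grows.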
