I forgot to hit send on this earlier today, but speaking of David vs. Goliath, if you haven’t done it already, make sure to tarpit Alexandria, TheForest1, GitCitadel, and anything else you’re deploying that is expected to hold a lot of data. Otherwise, scrapers will come for the goods, overload the infrastructure and cost you quite a few sats.

https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/

https://zadzmo.org/code/nepenthes/

https://blog.cloudflare.com/ai-labyrinth/
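
For anyone who hasn't looked at what these actually do: the core trick is just serving bytes very slowly so every crawler connection ties itself up instead of failing fast. A minimal sketch in Go, assuming a plain net/http server (handler name, page content, and timings are mine, not how Nepenthes or AI Labyrinth implement it):

```go
package main

import (
	"net/http"
	"time"
)

// tarpit drips a response one byte at a time so a scraper's
// connection stays open, slow, and expensive instead of failing fast.
func tarpit(w http.ResponseWriter, r *http.Request) {
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "text/html")
	for _, b := range []byte("<html><body><a href=\"/next\">more</a></body></html>") {
		select {
		case <-r.Context().Done():
			return // client gave up
		case <-time.After(2 * time.Second): // one byte every two seconds
		}
		if _, err := w.Write([]byte{b}); err != nil {
			return
		}
		flusher.Flush()
	}
}

func main() {
	http.HandleFunc("/", tarpit)
	http.ListenAndServe(":8080", nil)
}
```

Nepenthes and AI Labyrinth pair this with endless generated link mazes so the crawler never runs out of pages to waste time on; the drip-feed above is just the simplest half of the trick.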

Discussion

Yeah, nostr:npub1qdjn8j4gwgmkj3k5un775nq6q3q7mguv5tvajstmkdsqdja2havq03fqm7 is looking into that. The scrapers attack his home servers, as well.

this stuff annoys the hell out of me too, i run my test #realy fairly infrequently but within minutes some bot is trying to scrape it for WoT-relevant events, and i wish there was an effective way to slow that down

i tried adding one kind of rate limiter but that didn't really work out so well. in the past what i've seen work best is where the relay just stops answering and drops everything that comes in... probably if that included pings the other side would automatically drop
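
something like this is what i mean by going silent — a rough sketch assuming gorilla/websocket, where isFlagged stands in for whatever heuristic marks the abuser:

```go
package main

import (
	"net/http"

	"github.com/gorilla/websocket"
)

var upgrader = websocket.Upgrader{}

// isFlagged is a placeholder for whatever marks an abusive client
// (request rate, user agent, WoT score, etc.).
func isFlagged(remoteAddr string) bool { return false }

func handleWS(w http.ResponseWriter, r *http.Request) {
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		return
	}
	defer conn.Close()

	if isFlagged(r.RemoteAddr) {
		// go silent: swallow pings so no pong ever goes back, and
		// read (then drop) every frame without answering. most
		// clients will time out and disconnect on their own.
		conn.SetPingHandler(func(string) error { return nil })
		for {
			if _, _, err := conn.ReadMessage(); err != nil {
				return
			}
		}
	}
	// ... normal relay handling ...
}

func main() {
	http.HandleFunc("/", handleWS)
	http.ListenAndServe(":8080", nil)
}
```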

i've now added plain HTML and i think that needs something too, but maybe simpler: if it gets a query more than once every 5 seconds, for 5 such periods, it steadily adds more and more delay to processing. the difference from sockets is that to do it over http the bookkeeping has to be associated with an IP address
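
a sketch of that http side, keyed to the client IP (the names and numbers are just for illustration, not what's in #realy):

```go
package main

import (
	"net"
	"net/http"
	"sync"
	"time"
)

// penalty remembers when an IP last queried and how many times it
// came back faster than the 5-second window allows.
type penalty struct {
	last    time.Time
	strikes int
}

var (
	mu   sync.Mutex
	seen = map[string]*penalty{}
)

// slowdown adds a steadily growing processing delay for any IP that
// keeps querying more than once every 5 seconds.
func slowdown(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ip, _, _ := net.SplitHostPort(r.RemoteAddr)

		mu.Lock()
		p := seen[ip]
		if p == nil {
			p = &penalty{}
			seen[ip] = p
		}
		if time.Since(p.last) < 5*time.Second {
			p.strikes++ // faster than one query per 5 seconds
		} else if p.strikes > 0 {
			p.strikes-- // behaving again, ease the penalty off
		}
		p.last = time.Now()
		delay := time.Duration(p.strikes) * time.Second
		mu.Unlock()

		time.Sleep(delay) // the "more and more delay in processing"
		next.ServeHTTP(w, r)
	})
}

func main() {
	http.Handle("/", slowdown(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})))
	http.ListenAndServe(":8080", nil)
}
```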

I'm assuming your first-level WoT can auth and get around delays. Otherwise, uploading publications would collapse.

yeah, they are only reading, not writing, so all that would really be required is to slow them down

probably could also split direct follows vs follows-of-follows into two tiers, so that the second level of depth gets lower service quality
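
could be as simple as a tier lookup like this — the helpers are hypothetical, standing in for whatever the relay's WoT index actually answers:

```go
package relay

import "time"

// placeholders for the relay's real WoT index.
func isDirectFollow(pubkey string) bool   { return false }
func isFollowOfFollow(pubkey string) bool { return false }

// delayFor maps WoT depth to service quality: direct follows get
// full speed, second-depth follows get a small fixed delay, and
// everyone else lands in the slow lane.
func delayFor(pubkey string) time.Duration {
	switch {
	case isDirectFollow(pubkey):
		return 0
	case isFollowOfFollow(pubkey):
		return 500 * time.Millisecond
	default:
		return 5 * time.Second
	}
}
```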

probably should work on it today, because this problem with scrapers is fairly annoying

it's only nostr WoT spiders at this point. i wish i could suggest to the spider people to just open a subscription for these and catch all the new stuff live, that would be more efficient for them and less expensive for relay operators

at this point it's fairly early in the game for that stuff so maybe some education of WoT devs might help...

cc: nostr:npub176p7sup477k5738qhxx0hk2n0cty2k5je5uvalzvkvwmw4tltmeqw7vgup there is no point in doing sliding-window spidering on events once you have the history, from then on just open a subscription
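
for anyone building one of these spiders: backfill once, then hold open a standing REQ with a since filter and the relay pushes every new event to you. sketch with gorilla/websocket (relay URL and subscription id are examples; kind 3 is the follow-list kind the WoT crawlers are after):

```go
package main

import (
	"fmt"
	"time"

	"github.com/gorilla/websocket"
)

func main() {
	conn, _, err := websocket.DefaultDialer.Dial("wss://relay.example.com", nil)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// one standing subscription instead of sliding-window re-crawls:
	// the relay pushes every matching event from now on.
	req := fmt.Sprintf(`["REQ","wot-live",{"kinds":[3],"since":%d}]`, time.Now().Unix())
	if err := conn.WriteMessage(websocket.TextMessage, []byte(req)); err != nil {
		panic(err)
	}
	for {
		_, msg, err := conn.ReadMessage()
		if err != nil {
			return
		}
		fmt.Printf("%s\n", msg) // ["EVENT","wot-live",{...}] as they arrive
	}
}
```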

this would be friendlier of the AI spiders too, if they had a way to just catch new updates, but unless they pay me! i'm just gonna tarpit them

Yeah, I don't get the point of spidering data you can just stream. Like, they spider Nostr websites and it's like... just subscribe to the relay. *roll eyes*

Yeah, right now I just have a user-agent list and that does most of the work; the bots hitting me mostly have "~bot" in the user-agent field, so it's been staving them off for now. I have to hop in and update the list every few days, but it helps.
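
For what it's worth, the check itself is only a few lines of middleware; the maintenance burden is entirely in keeping the list current. A sketch (the entries below are examples, not my actual list):

```go
package main

import (
	"net/http"
	"strings"
)

// uaBlocklist holds lowercase substrings matched against the
// User-Agent header; example entries, not a maintained list.
var uaBlocklist = []string{"bot", "spider", "crawl"}

func blockByUA(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ua := strings.ToLower(r.UserAgent())
		for _, bad := range uaBlocklist {
			if strings.Contains(ua, bad) {
				http.Error(w, "forbidden", http.StatusForbidden)
				return
			}
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	http.Handle("/", blockByUA(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})))
	http.ListenAndServe(":8080", nil)
}
```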

I was interested in rDNS lookups for blocking, since most of the IPs I've dealt with reverse-resolve to known bot networks like ByteDance, OpenAI, and so on.
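
A sketch of that idea, with a forward-confirmation step so a scraper can't dodge the check by publishing a fake PTR record (the suffix list is illustrative):

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// blockedRDNS holds reverse-DNS suffixes of known crawler networks;
// illustrative entries, not a vetted list.
var blockedRDNS = []string{".bytedance.com.", ".openai.com."}

// isBlockedByRDNS does a PTR lookup on the IP and forward-confirms
// the name, so only genuinely delegated rDNS matches.
func isBlockedByRDNS(ip string) bool {
	names, err := net.LookupAddr(ip)
	if err != nil {
		return false // no rDNS; fall through to other checks
	}
	for _, name := range names {
		for _, suffix := range blockedRDNS {
			if !strings.HasSuffix(name, suffix) {
				continue
			}
			// forward-confirm: the PTR name must resolve back to this IP
			addrs, err := net.LookupHost(strings.TrimSuffix(name, "."))
			if err != nil {
				continue
			}
			for _, a := range addrs {
				if a == ip {
					return true
				}
			}
		}
	}
	return false
}

func main() {
	fmt.Println(isBlockedByRDNS("203.0.113.7")) // example IP
}
```

In practice you'd want to cache the verdicts per IP; doing two DNS lookups on every request would add its own latency.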