At this point the vast majority of traffic to my network is from AI scraping bots, and it's getting much more difficult to restrict them. I think I need to build a way to start doing reverse DNS lookups. It seems to be the best way to reveal the bots now, because when I rDNS many of the IPs they come back as openai, bytedance, etc.

If anyone has any ideas on how to do that myself (not using a 3rd-party service), please let me know.
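Roughly what I have in mind is a PTR lookup, a forward confirmation so a spoofed PTR doesn't count, and a suffix match against the hostnames that keep showing up. A minimal sketch using Python's standard `socket` module (the suffix list here is illustrative, not exhaustive):

```
import socket

# Hostname suffixes seen in the logs -- illustrative placeholders, not exhaustive.
BOT_SUFFIXES = (".openai.com", ".bytedance.com")

def rdns_is_bot(ip: str) -> bool:
    try:
        host = socket.gethostbyaddr(ip)[0]              # PTR lookup
        forward_ips = socket.gethostbyname_ex(host)[2]  # forward-resolve the PTR hostname
    except OSError:                                     # no PTR record or lookup failure
        return False
    if ip not in forward_ips:
        return False                                    # spoofed or stale PTR
    return host.lower().endswith(BOT_SUFFIXES)

if __name__ == "__main__":
    import sys
    for ip in sys.argv[1:]:
        print(ip, "-> bot" if rdns_is_bot(ip) else "-> not flagged")
```

The forward confirmation matters because anyone can publish a PTR record claiming to be whoever they like; only hostnames that resolve back to the same IP should count.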


Discussion

I have a personal git instance and I use the nginx testcookie module to show a JS challenge to everyone, but I have whitelisted the paths used by CLI tooling like `git pull`.

Interesting. Yeah, most of the traffic is to my git website (using CGit), so they're just scraping all of my code and git history, chewing up a bunch of CPU time building the diffs and whatnot.

I also host a few small-business things and nip05 stuff that I can't have blocked, but that's really not the issue.

The nginx module I'm talking about: https://github.com/kyprizel/testcookie-nginx-module

The paths are relevant to Gitea/Forgejo, but I guess it's similar for CGit:

```
map $uri $giteach_testcookie_pass {
    default 0;
    "~*^/api/" 1;
    "~*^/assets/" 1;
    "~*^/avatars/" 1;
    "~*^/\S+/\S+\.git/" 1;
    "~*^/\S+/\S+/raw/branch/" 1;
}
```

Then use `testcookie_pass $giteach_testcookie_pass;` inside the `location / { ... }` block to selectively disable testcookie for those paths.

That's awesome, thank you!

I don't really have a solution, more or less just commiserating here. If I see any decent solutions I'll let you know, because they absolutely do not respect robots.txt files.

Yeah, 0 shits given to the robots in my experience. I'm thinking rDNS might be the best option, but it's reactive, and I'd have to write a module XD
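One way to stay reactive without writing a full nginx module might be a cron job that scans the access log, runs the same rDNS check over the heavy hitters, and writes out a plain `deny` list for nginx to include. This is just a sketch; the log path, output path, suffix list, and hit threshold are all assumptions for a default setup:

```
import socket
from collections import Counter

ACCESS_LOG = "/var/log/nginx/access.log"          # assumed combined log format
DENY_FILE = "/etc/nginx/conf.d/bot-deny.conf"     # assumed include path
BOT_SUFFIXES = (".openai.com", ".bytedance.com")  # illustrative placeholders

def is_bot(ip: str) -> bool:
    """PTR lookup, forward-confirm it, then match against known crawler suffixes."""
    try:
        host = socket.gethostbyaddr(ip)[0]
        confirmed = ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
    return confirmed and host.lower().endswith(BOT_SUFFIXES)

def main(threshold: int = 100) -> None:
    hits = Counter()
    with open(ACCESS_LOG) as log:
        for line in log:
            hits[line.split(" ", 1)[0]] += 1      # client IP is the first field
    with open(DENY_FILE, "w") as out:
        for ip, count in hits.most_common():
            if count < threshold:
                break                             # most_common() is sorted, so stop here
            if is_bot(ip):
                out.write(f"deny {ip};  # {count} hits\n")

if __name__ == "__main__":
    main()
```

Once the file is written, an `include` of it in the server block plus an `nginx -s reload` picks up the new denies. Still reactive, but no module to maintain.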

Is your network the type that you could charge 1 sat / request?

I'd say 99.999% of the traffic to my network wouldn't know what a sat is. :/

I may be a bit green, but sats per request seems like the future; otherwise it becomes a game of increasingly difficult whack-a-mole, no?