So much work!
Could this be useful?
https://spam.nostr.band/spam_api?method=get_current_spam
I collect events from all relays for the last hour, group events with common words/ngrams, and find clusters of >100 events.
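A rough sketch of the grouping step, not the actual code behind the API. It assumes events are dicts with the standard `id` and `content` fields. The names `ngrams` and `find_clusters` and the choice of word bigrams are made up for illustration, and "cluster" is simplified to "all events sharing one ngram".

```python
import re
from collections import defaultdict


def ngrams(text, n=2):
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def find_clusters(events, n=2, min_size=100):
    """Group event ids by shared n-gram; keep groups with more than min_size events.

    Hypothetical helper: a real pipeline would probably merge overlapping
    groups and cap memory use, which this sketch does not attempt.
    """
    by_ngram = defaultdict(set)
    for event in events:
        for gram in ngrams(event["content"], n):
            by_ngram[gram].add(event["id"])
    return {gram: ids for gram, ids in by_ngram.items() if len(ids) > min_size}
```

Running `find_clusters(last_hour_events)` on one hour of collected events would return the ngrams that appear in more than 100 distinct events, with the ids of the events in each group.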
This API prints the stop-words for big clusters: if an event contains all of the words for a cluster, it's most likely spam. Relays/clients could proactively match new events against these words, or periodically delete specific events/pubkeys.
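Matching on the relay/client side could look roughly like this. The response format of the API isn't described here, so this sketch assumes the stop-words have already been fetched and parsed into a list of word sets, one set per cluster. `tokenize` and `is_probable_spam` are hypothetical names.

```python
import re


def tokenize(text):
    """Split text into a set of lowercase words."""
    return set(re.findall(r"\w+", text.lower()))


def is_probable_spam(content, spam_word_sets):
    """Return True if content contains every word of at least one cluster's set.

    spam_word_sets is an assumed, already-parsed structure: an iterable of
    sets of lowercase words. Empty sets are skipped so they never match.
    """
    words = tokenize(content)
    return any(word_set and word_set <= words for word_set in spam_word_sets)
```

A relay could call `is_probable_spam(event["content"], spam_word_sets)` before storing a new event and refresh `spam_word_sets` every few minutes. Requiring all words of a set to be present keeps false positives lower than matching on any single word.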
Was playing with this today, will be using it in my relay. It's updated close to real-time.
Also
https://spam.nostr.band/spam_api?method=get_current_spam&view=pubkeys
https://spam.nostr.band/spam_api?method=get_current_spam&view=events - BIG!
Discussion
You don't like it?
It’s out of spec and creates a dependency on one relay. There must be a way to decentralize this.
AI based decentralized spam detection
#[3]’s approach seems reasonably lightweight and could be added to any relay. classifying using larger models and frequently retraining them would be heavier but surely possible on larger nodes.
also, federated training is a possibility but makes everything a bit more complicated. i guess the most important question is: what features turn out to be predictive. maybe the protocol itself can be leveraged more (network/metadata).