Subnostr

Most of the global feed spam in Damus (and likely in most Nostr clients) comes from around 50-100 quasi-template messages. Filter those out and the noise drops back to very little.
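To illustrate the "quasi-template" idea, here is a minimal sketch of how near-identical messages could be grouped. It assumes that copies of one template differ mainly in URLs, numbers, letter case and whitespace. The helper names `template_key` and `top_templates` are made up for this example and are not taken from any existing tool.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
DIGITS_RE = re.compile(r"\d+")
SPACE_RE = re.compile(r"\s+")


def template_key(content: str) -> str:
    """Collapse the parts of a message that tend to vary between template copies."""
    text = content.lower()
    text = URL_RE.sub("<url>", text)    # assumption: links rotate between copies
    text = DIGITS_RE.sub("<num>", text)  # assumption: amounts and counters rotate too
    return SPACE_RE.sub(" ", text).strip()


def top_templates(contents, limit=100):
    """Return the most frequent normalized messages with their counts."""
    return Counter(template_key(c) for c in contents).most_common(limit)
```

Running `top_templates` over the `content` of recent kind=1 events would surface the most repeated message shapes, which could then be reviewed by hand before anything is filtered.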

I’ve almost finished a labeled dataset of around 100k events for spam detection. By volume, the current spam is biased toward Asian languages and the non-spam content toward English. Even so, my relay testing looks promising.

I’ll try to do more testing this weekend with the latest events and see how it performs.

If it works well, perhaps relays could use it before accepting published events. I'm not sure yet how best to do continual training, though I did use kind=1984 report events to help identify and tag spam events.
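As a rough sketch of how kind=1984 reports could feed the labeling, the snippet below collects the ids of notes that at least one report flags as spam. It assumes events are plain dicts in the usual Nostr JSON shape and that reports follow NIP-56, where the report type is the third entry of the `e` tag. The function names are hypothetical, and this is not necessarily how the dataset was actually built.

```python
def spam_reported_ids(events):
    """Return ids of events that at least one kind=1984 report flags as spam."""
    reported = set()
    for event in events:
        if event.get("kind") != 1984:
            continue
        for tag in event.get("tags", []):
            # NIP-56 report tag on a note: ["e", <event id>, <report type>]
            if len(tag) >= 3 and tag[0] == "e" and tag[2] == "spam":
                reported.add(tag[1])
    return reported


def label_events(events):
    """Pair every kind=1 note id with a provisional spam label (True = reported)."""
    spam_ids = spam_reported_ids(events)
    return [(e["id"], e["id"] in spam_ids) for e in events if e.get("kind") == 1]
```

A single report is a weak signal, since anyone can publish one, so labels produced this way would probably still need manual review or a minimum number of independent reporters.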

Discussion

someone 2y ago

If that were a bot that anyone could follow, relays could follow it too and help reduce spam.

Blake 2y ago

Following up with a repo and some results from my Nostr spam detection experiments. It looks like we can achieve upwards of 98% accuracy.

The dataset is somewhat targeted rather than generic, but it appears to perform well today. The repo has a README with more details. Happy to experiment and collaborate with anyone interested.

Even if Nostr relays don't filter these events, embedding an event key such as meta.spam_score=0.00 into the events they serve would let clients choose to ignore it or set their own score thresholds for event visibility.
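On the client side, the threshold idea might look something like the sketch below. It assumes a relay attaches a `meta` object with a `spam_score` between 0 and 1 to each event it serves. That key is hypothetical: it is not part of the NIP-01 event format and would sit outside the signed fields, so it is only as trustworthy as the relay that sets it.

```python
def visible_events(events, threshold=0.5):
    """Keep events whose relay-provided spam score is below the client's threshold."""
    kept = []
    for event in events:
        # "meta" / "spam_score" are assumed names for the relay's annotation.
        score = event.get("meta", {}).get("spam_score")
        # Events without a score are kept, since not every relay would annotate.
        if score is None or score < threshold:
            kept.append(event)
    return kept
```

A cautious user could set a low threshold such as 0.2 and accept more false positives, while others could raise it or ignore the score entirely.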

https://github.com/blakejakopovic/nostr-spam-detection

RE: #[0]

CC: #[1] #[2] #[3] #[4] #[5] #[6] #[7]

someone 2y ago

This is great work! Is NSFW or impersonation content classified as spam in this?

Mazin 2y ago

Awesome work! Thank you for providing this.

phil 2y ago

This is great! I was thinking along the same lines that it would be good for relays to be able to tag a note as potential spam with a spam score so that clients can filter based on a user’s personal preferences. That way we don’t completely lose notes due to false positives.

Lurking Cat 2y ago

Interesting stuff, machine learning for spam detection. By the way, is there a high false positive rate on additional data outside the test data? Could it be a case of overfitting?

Maybe it can be tested more on ham data from a paid relay.
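The check suggested here could be sketched as follows: run the classifier over notes assumed to be ham (for example, from a paid relay) and measure how many get flagged. `predict_is_spam` is a placeholder for whatever model is being tested, not a function from the repo.

```python
def false_positive_rate(ham_texts, predict_is_spam):
    """Fraction of known-good (ham) notes that the classifier flags as spam."""
    if not ham_texts:
        raise ValueError("need at least one ham sample")
    flagged = sum(1 for text in ham_texts if predict_is_spam(text))
    return flagged / len(ham_texts)
```

A rate on this out-of-sample ham that is much higher than on the held-out test split would be a hint that the model has overfit to the original dataset.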

@note1ldntl5ll238w5gav0sedrya2knnh9feu8aj3trh7yrq8zvt5c2lsm52rqr