OK, so for those of you who were keen for the latest Nostr spam ML models and training data, I've updated the GitHub repo. 92k examples, with 13k labelled as spam. It sits at around 98% accuracy - in practise I’ve found it eliminates all effective spam for kinds 1/42.

I still review non-spam that scores high but under 90%, to help expand the training set.
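One way to read that review step is as a three-way triage on the model's spam score. The sketch below is only an illustration: it assumes 90% is the cut-off above which an event is treated as spam, and the `review_floor` value and the `triage` name are hypothetical, not taken from the repo.

```python
def triage(spam_score: float,
           spam_threshold: float = 0.90,   # assumed: the "90%" mentioned in the note
           review_floor: float = 0.50) -> str:  # hypothetical lower bound for manual review
    """Sort an event into 'spam', 'review', or 'accept' by its spam score."""
    if spam_score >= spam_threshold:
        return "spam"      # confident enough to filter automatically
    if spam_score >= review_floor:
        return "review"    # high-scoring non-spam: candidate for labelling
    return "accept"        # low score: pass through untouched


for score in (0.97, 0.72, 0.10):
    print(score, triage(score))
```

Events that land in the "review" bucket are the ones worth labelling by hand, since they sit closest to the decision boundary.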

CC: #[0]

#[1]

#[2] #[3]

https://github.com/blakejakopovic/nostr-spam-detection

Discussion

Thank you to #[1] for his dedicated work analyzing and combating spam on Nostr. Anyone (devs, relay operators, end users) can use these resources to minimize spam. 🙏

#[0]

Thank you for your dedicated work 🙏

Nice work! My only nitpick here: don't report accuracy in a vacuum, because we don't know the accuracy of a dumb model that always makes the same prediction, which is the baseline we're trying to improve upon.
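To put a rough number on that baseline: going by the figures in the original note (92k examples, 13k labelled as spam), a model that always answers "not spam" would already be right about 86% of the time. A minimal sketch, assuming those rounded counts describe the whole dataset:

```python
# Majority-class baseline from the rounded counts quoted in the note.
total_examples = 92_000   # "92k examples" (rounded, per the note)
spam_examples = 13_000    # "13k labelled as spam" (rounded, per the note)


def majority_baseline_accuracy(total: int, positives: int) -> float:
    """Accuracy of a model that always predicts the more common class."""
    negatives = total - positives
    return max(positives, negatives) / total


baseline = majority_baseline_accuracy(total_examples, spam_examples)
print(f"always-'not spam' baseline: {baseline:.1%}")
```

So the reported 98% is best read against roughly 86%, and per-class numbers such as precision and recall on the spam class would say more than accuracy alone.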

It's all a moving target, and the accuracy is measured against a random 10% sample split off from the data before training.
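For anyone unfamiliar with that kind of evaluation, here is a minimal hold-out sketch using only the standard library. The function names, the seed, and the toy data are assumptions for illustration; the repo's actual training code may do this differently.

```python
import random


def holdout_split(examples, test_fraction=0.1, seed=42):
    """Shuffle a copy of the data and split off a test set before training."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical toy data: (text, is_spam) pairs standing in for real events.
data = [(f"note {i}", i % 7 == 0) for i in range(1000)]
train, test = holdout_split(data)
print(len(train), len(test))
```

The model is fitted on `train` only, and `accuracy` is then computed on predictions for `test`, so the score reflects events the model has not seen.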

The real-world test has been the last couple of weeks, where I've seen that accuracy hold up against significant volume. I pre-filter 6,000 spam events/minute at present.

Can it be beaten? Yes. Does it prevent flooding or other spam attacks? No. But content-based spam, like email spam, should be manageable to some degree with spam detection models.

I think something like this might be a service relay operators (mainly paid ones) would be willing to pay for. Pay x number of SATs for an API key.

I’m happy to chat to any relay operators who would like a service for this.

Aggregators will have the best data to build and train datasets - and detect spam sooner than relays. It’s certainly a space where one can add value.

To add value for relays, why not extend the NIPs so pictures/videos can be stored directly on the relays, with clients paying for storage (pay for writes only)?

Currently it is very hard to fetch the content behind a TW/YT link if the client sits in an ocean of censorship. With the above feature, a single relay-to-relay sync would make things easy.

I definitely see a revenue model here, especially for video hosting.

Dunno why I keep having problems zapping WoS users. 🤔