Nice.
Spam definitely is tough to define. When I made the dataset I mostly used two tests: “is this content intended to deceive” (is there a victim?) or “is this promotional in an excessive way” (basically like Adblock). This means that explicit content, strong opinions and the like weren’t labelled as spam.
One example I saw a lot of with kind=1984 was “marked as spam because foreign language”. I translated a lot of that content and it certainly wouldn’t fall under the definitions above - reporting it was more a form of censorship.
There will never be a common standard or definition for Nostr spam that suits everyone - however something closer to Adblock and your spam email inbox is where I see this being most useful. There will be the occasional false positive, but it’s not a big deal - especially when you can whitelist people as contacts.
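
Roughly what I have in mind, as a minimal sketch rather than anything a real client does: a note is hidden only if a spam scorer is confident and the author isn't one of your contacts. The names `should_hide`, `toy_score` and the 0.9 threshold are made up for illustration; the only things borrowed from Nostr are the `pubkey` and `content` fields of an event.

```python
from typing import Callable


def toy_score(content: str) -> float:
    """Placeholder scorer (an assumption, not a real model):
    the fraction of words that look promotional."""
    promo = {"free", "airdrop", "giveaway"}
    words = content.lower().split()
    return sum(w in promo for w in words) / len(words) if words else 0.0


def should_hide(
    event: dict,
    contacts: set[str],
    spam_score: Callable[[str], float] = toy_score,
    threshold: float = 0.9,  # hypothetical cut-off, tune to taste
) -> bool:
    """Return True if the note should be filtered out of the feed."""
    # Contacts are whitelisted: a false positive never hides them.
    if event["pubkey"] in contacts:
        return False
    # Everyone else is hidden only when the scorer is confident.
    return spam_score(event["content"]) >= threshold


note = {"pubkey": "abc", "content": "free airdrop giveaway"}
hidden_for_stranger = should_hide(note, contacts=set())
hidden_for_contact = should_hide(note, contacts={"abc"})
```

Here the same note is filtered when it comes from a stranger and shown when it comes from a contact, which is why the occasional misclassification is cheap to recover from.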