It’s super advanced

parallel -a dump.json --pipepart --block $(bc <<<"$(stat -c %s dump.json) / 15") -j15 "jq -r 'select(.kind == 1) | .content' | tr ' ' '\n'" > tokens.txt

Discussion

Sorting is the slowest part, so I ended up doing:

parallel -a tokens.txt --block 370899363 --pipepart 'sort > tokenstore/{#}'

sort -m tokenstore/* > tokens-sorted.txt

uniq -c tokens-sorted.txt | sort -S 80% -n > spammy.txt
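To see what those three steps do, here is a minimal sketch on a toy input. It is not the commands above: `split` plus a loop stands in for the `parallel --pipepart` step so it runs without GNU parallel, and the token list, the `chunk-` prefix and the scratch directory are made up for illustration.

```shell
# Work in a scratch directory so no real files are overwritten.
work=$(mktemp -d)
cd "$work"

# Hypothetical miniature stand-in for tokens.txt (one token per line).
printf '%s\n' buy now free buy cheap now buy > tokens.txt

# Stand-in for the parallel step: cut into 3-line chunks and sort each chunk.
mkdir tokenstore
split -l 3 tokens.txt chunk-
for f in chunk-*; do
  sort "$f" > "tokenstore/$f"
done

# Merge the already-sorted chunks (no full re-sort needed),
# then count duplicates and rank them by frequency, rarest first.
sort -m tokenstore/* > tokens-sorted.txt
uniq -c tokens-sorted.txt | sort -n > spammy.txt
cat spammy.txt
```

The last line of spammy.txt is the most frequent token, here "buy" with a count of 3. `sort -m` only works because every chunk was sorted beforehand; it merges and never re-sorts.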

Correct me if I’m misunderstanding, but this finds not the spammiest tokens per se, but the /commonest/ ones, correct? The admixture of innocuous tokens will make it tricky and require more work. What other criteria could we add as a pre-filter?
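One possible pre-filter, purely as a sketch and not something anyone in this thread has tested: drop a stopword list and very short tokens before counting. The `prefilter_tokens` helper, the four-word stopword list and the minimum length of 3 are all assumptions for illustration.

```shell
# Hypothetical helper: drop innocuous tokens before counting.
# Reads one token per line on stdin; $1 is a stopword file (one word per line).
prefilter_tokens() {
  # grep: -v invert, -x whole-line match, -i ignore case,
  #       -F fixed strings, -f read patterns from file.
  # awk: keep only tokens of at least 3 characters (arbitrary cutoff).
  grep -v -x -i -F -f "$1" | awk 'length($0) >= 3'
}

# Illustrative stopword list; a real one would be far longer.
stopwords=$(mktemp)
printf '%s\n' the and for you > "$stopwords"

# Toy token stream, filtered and then counted as in the thread.
printf '%s\n' the airdrop is free for you airdrop |
  prefilter_tokens "$stopwords" | sort | uniq -c | sort -n
```

On this toy input only "free" (once) and "airdrop" (twice) survive. It would slot in between tokens.txt and the sort step, but it only removes obviously innocuous words; it does not by itself tell spam from common vocabulary.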

I think it would be interesting to see the more general trend of spammy keywords across popular relays. I'm not sure, but maybe #[6] has already done this as part of his research on spam filtering in Nostr (see his GitHub).

I will generate the raw data like this:

#[3]