It’s super advanced
parallel -a dump.json --pipepart --block $(bc <<<"$(stat -c %s dump.json) / 15") -j15 "jq -r 'select(.kind == 1) | .content' | tr ' ' '\n'" > tokens.txt
Sorting is the slowest part; I ended up doing:
parallel -a tokens.txt --block 370899363 --pipepart 'sort > tokenstore/{#}'
sort -m tokenstore/* > tokens-sorted.txt
uniq -c tokens-sorted.txt | sort -S 80% -n > spammy.txt
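The counts come out ascending, so the heaviest hitters land at the bottom of spammy.txt; a quick peek at them (just an example, pick whatever N you like):

tail -n 25 spammy.txt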
Correct me if I’m misunderstanding, but this is not the spammiest per se, but the /commonest/, correct? The admixture of innocuous tokens will make it tricky and require more work. What other criteria could we add as a pre-filter?
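For example (just a sketch, and stopwords.txt is a hypothetical hand-made list, not something from the pipeline above): dropping very short tokens and exact stopword matches before counting might keep the innocuous filler out:

grep -E '^[[:alnum:]]{4,}$' tokens-sorted.txt | grep -vxF -f stopwords.txt | uniq -c | sort -S 80% -n > spammy-filtered.txt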
I think it would be interesting to see a more general trend of spammy keywords across popular relays. Although I'm not sure, maybe #[6] has already done this based on his research into spam filtering in Nostr (see his GitHub).
I'll generate this raw data like this:
#[3]