Sorting is the slowest part; I ended up doing:
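# split tokens.txt into ~370 MB chunks (at line boundaries) and sort each chunk in parallel;
# {#} is the job number, so each chunk lands in its own file under tokenstore/ (mkdir -p tokenstore first)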
parallel -a tokens.txt --block 370899363 --pipepart 'sort > tokenstore/{#}'
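# merge the already-sorted chunks; a merge is far cheaper than re-sorting the whole file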
sort -m tokenstore/* > tokens-sorted.txt
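# count adjacent duplicate lines, then sort numerically by count (-S 80% lets sort use 80% of RAM);
# the most frequent tokens end up at the bottom of spammy.txt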
uniq -c tokens-sorted.txt | sort -S 80% -n > spammy.txt
Correct me if I’m misunderstanding, but this is not spammiest per se, but /commonest/, correct? The admixture of innocuous tokens will make it tricky and require more work. What other criteria could we add as a pre-filter?
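One cheap pre-filter (just a sketch, assuming a hand-curated stopwords.txt of known-innocuous tokens, one per line) would be to drop those before counting:

# remove lines that exactly match a stopword, then count and rank as before
grep -vxFf stopwords.txt tokens-sorted.txt | uniq -c | sort -S 80% -n > spammy.txt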
I think it would be interesting to see a more general trend of spammy keywords across popular relays. Although I'm not sure, maybe #[6] has already done this based on his research on spam filtering in Nostr (see his GitHub).
I'll generate the raw data like this:
#[3]