the word index i've been building could probably help you find this kind of clustering, at least at a fairly crude level of precision. it was quite funny trying to figure out how to make it language-agnostic: i found a nice library for segmenting Unicode (UTF-8) text that did a pretty good job, and then i just had to filter out common noise like filename extensions, nostr entities and whatnot
i gotta finish building that thing... i'm actually done with the draft now and really just need to hook it up to a query endpoint
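
for anyone curious, here's a rough sketch of that segment-then-filter idea in Python. it is not the actual code: a plain `\w+` regex stands in for the segmentation library (which isn't named here), and the extension list, the entity pattern and the names `tokenize` and `build_index` are all made up for illustration.

```python
import re
from collections import defaultdict

# Stand-in for a real Unicode segmentation library (assumption): \w+ matches
# runs of letters and digits in any script, but it will NOT split text in
# languages written without spaces (Chinese, Japanese, Thai) into words.
WORD_RE = re.compile(r"\w+")

# Hypothetical noise filters, applied after lowercasing.
# Bech32-style nostr entities such as npub1..., note1..., optionally
# prefixed with "nostr:".
NOSTR_ENTITY_RE = re.compile(
    r"\b(?:nostr:)?(?:npub|nsec|note|nevent|nprofile|naddr)1[02-9ac-hj-np-z]+"
)
# A few common filename extensions, only when they follow a dot.
FILE_EXT_RE = re.compile(r"\.(?:jpe?g|png|gif|webp|mp4|pdf|json)\b")


def tokenize(text):
    """Lowercase the text, strip the noise patterns, return the words."""
    text = text.lower()
    text = NOSTR_ENTITY_RE.sub(" ", text)
    text = FILE_EXT_RE.sub(" ", text)
    return WORD_RE.findall(text)


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


if __name__ == "__main__":
    notes = {
        "a": "Check photo.JPG from nostr:npub1qqqsyqcyq5rqwzqf",
        "b": "Привет, check this out",
    }
    index = build_index(notes)
    print(sorted(index["check"]))
```

a query endpoint would then mostly be a lookup in `index` plus intersecting the sets for multi-word queries.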