Yeah, I figured they would have different data and computational penalties. Even some of the scanning ones may be too heavy. For example, one additional rule I left out (or it could stand alone as an alternative) is similarity to other notes: compute the Levenshtein distance between notes after removing punctuation and whitespace and making all letters uppercase. The resulting distance matrix gives you each note's uniqueness measured against all other notes, so you could show only the most unique notes from the time window. I figured that would be too heavy on the CPU, but perhaps not?
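A minimal sketch of that idea, assuming a small list of hypothetical notes (the `notes` data and the "sum of distances" uniqueness score are my assumptions; other scores, like minimum distance to any other note, would work too):

```python
import string

def normalize(text):
    # Drop punctuation and whitespace, uppercase what remains.
    keep = set(string.ascii_letters + string.digits)
    return "".join(c for c in text if c in keep).upper()

def levenshtein(a, b):
    # Classic O(len(a) * len(b)) dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

notes = ["Hello, world!", "hello world", "Something completely different."]
norm = [normalize(n) for n in notes]

# Full pairwise distance matrix; uniqueness = total distance to all others.
matrix = [[levenshtein(a, b) for b in norm] for a in norm]
uniqueness = [sum(row) for row in matrix]
most_unique = max(range(len(notes)), key=uniqueness.__getitem__)
print(notes[most_unique])  # the note farthest from all the others
```

The matrix is O(n²) pairs and each pair costs O(len(a) × len(b)), which is where the CPU worry comes from; in practice you'd probably cache normalized strings and only recompute pairs involving new notes in the window.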


Discussion

You could also do unique-word comparisons with a similar weight matrix; that might be faster?

Strip a note down to its set of unique words, then compare the sets in an all-against-all matrix of notes: count the words two sets have in common, and divide the count by the number of unique words (or something else) to normalize.
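Normalizing the overlap count by the total unique words across both sets is Jaccard similarity, which might be what you're reaching for. A quick sketch with made-up notes (the `notes` data is a placeholder):

```python
def word_set(text):
    # Reduce a note to its set of unique uppercase words.
    return set(text.upper().split())

def jaccard(a, b):
    # Words in common, normalized by total unique words across both sets.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

notes = ["the quick brown fox", "the lazy brown dog", "something else entirely"]
sets = [word_set(n) for n in notes]

# Pairwise similarity matrix; a note with low average similarity is "unique".
sim = [[jaccard(a, b) for b in sets] for a in sets]
```

Set intersection is roughly linear in the set sizes, versus the quadratic inner loop of edit distance, so this should indeed be cheaper per pair, at the cost of ignoring word order and near-miss spellings.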

There are a lot of directions you could go in.