Replying to fiatjaf

Can anyone teach me how to do this? https://emschwartz.me/binary-vector-embeddings-are-so-cool/

There is so much jargon about this stuff I don't even know where to start.

Basically I want to do what https://scour.ing/ is doing, but with Nostr notes/articles only, and expose all of it through custom feeds on a relay like wss://algo.utxo.one/ -- or if someone else knows how to do it, please do it, or talk to me, or both.

Also I don't want to pay a dime to any third-party service, and I don't want to have to use any super computer with GPUs.

Thank you very much.

The encoding (a function string -> number vector) is part of the LLM magic, a common first step of various ANN enchantments (which the magicians also don't understand, don't worry). The point is: you download a pre-trained model containing the encoder function and use it as is. On this thread @sersleppy posted a blog with an example:

https://huggingface.co/blog/embedding-quantization#retrieval-speed

Embeddings are supposed to reflect content (semantics, though that may be too strong a word), to the point where encoding("king") - encoding("man") + encoding("woman") ~~ encoding("queen"), if you think of 'encoding' as a function string -> vector in a high-dimensional space and do + and - on the vectors.
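The king/queen arithmetic can be seen with tiny hand-made vectors. This is a toy sketch (the three dimensions and their "meanings" are invented for illustration; real models produce hundreds of dimensions you can't interpret directly):

```python
import numpy as np

# Toy 3-dimensional "embeddings" with invented axes: royalty, male, female.
# Hand-made so the arithmetic works out; real embeddings only approximate this.
vocab = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means unrelated.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = vocab["king"] - vocab["man"] + vocab["woman"]
best = max(vocab, key=lambda w: cosine(vocab[w], target))
print(best)  # queen
```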

Then, once you choose an encoding, you apply it to every text and calculate its dissimilarity against the encodings of the user's key phrases, to find similar content.
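That whole loop fits in a few lines. Here the `encode` function is a stand-in (a trivial bag-of-words over a made-up vocabulary, just so the example runs anywhere); in practice you'd swap in a pre-trained model's encode function, as in the linked blog post:

```python
import numpy as np

# Stand-in encoder: tiny bag-of-words over a fixed vocabulary.
# Replace with a real pre-trained model's encode() in practice.
VOCAB = ["bitcoin", "relay", "nostr", "soup", "recipe", "embedding"]

def encode(text: str) -> np.ndarray:
    words = text.lower().split()
    return np.array([float(words.count(w)) for w in VOCAB])

def cosine_distance(a, b):
    # Dissimilarity: 0.0 means identical direction, 1.0 means unrelated.
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

notes = [
    "nostr relay with custom feeds",
    "soup recipe from my grandmother",
    "embedding vectors on a nostr relay",
]

# Rank every note by dissimilarity to the user's key phrase.
query = encode("nostr embedding")
ranked = sorted(notes, key=lambda n: cosine_distance(encode(n), query))
print(ranked[0])  # the note closest to the key phrase
```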

Conceptually, the binary encoding is the same. The point is to find a way to approximate the encodings with a coarser, simpler, smaller number vector, in such a way that the dissimilarity calculations run faster without compromising accuracy.
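A minimal sketch of the binary trick (using random vectors in place of real embeddings): keep only the sign of each dimension, pack the bits, and compare with Hamming distance, which is just XOR plus a bit count. A 256-float vector shrinks to 32 bytes:

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(size=256)                 # pretend float embedding
b = a + rng.normal(scale=0.1, size=256)  # a near-duplicate of a
c = rng.normal(size=256)                 # an unrelated vector

def binarize(v: np.ndarray) -> np.ndarray:
    # Keep only the sign bit of each dimension: 256 floats -> 32 bytes.
    return np.packbits(v > 0)

def hamming(x: np.ndarray, y: np.ndarray) -> int:
    # XOR the packed bytes, then count the differing bits.
    return int(np.unpackbits(x ^ y).sum())

qa, qb, qc = binarize(a), binarize(b), binarize(c)
# The near-duplicate differs from a in far fewer bits than the unrelated vector.
print(hamming(qa, qb), hamming(qa, qc))
```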

If you want to go deep down the LLM rabbit hole, read the 'Attention Is All You Need' paper. It is also hermetic (full of jargon), and it is just the entrance to the rabbit hole; a more comprehensive review of ANNs in general and of deep learning will be needed.
