Hey! Yes, you could follow what nostr:npub12262qa4uhw7u8gdwlgmntqtv7aye8vdcmvszkqwgs0zchel6mz7s6cgrkj is recommending. Basically, create a synthetic dataset with two columns: one for questions and one for answers. You can use an LLM to generate this dataset, then embed the answers. I would also recommend an embedding model like Nomic ( https://ollama.com/library/nomic-embed-text ), since it has an interesting prefix system that generally improves the performance and accuracy of queries ( https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#usage ). I can also share the code for Beating Heart, a RAG system that ingests Markdown documents, chunks them semantically, and then embeds them: https://github.com/gzuuus/beating-heart-nostr . Additionally, I find the videos by Matt Williams very instructive: https://www.youtube.com/watch?v=76EIC_RaDNw . Finally, I would say that generating a synthetic dataset isn't strictly necessary if you embed the data smartly.
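To make the prefix idea concrete, here's a minimal sketch of how it could look in Python. The prefix strings come from the Nomic model card linked above (`search_document:` for texts you index, `search_query:` for questions); the helper names and the cosine-similarity function are my own illustration, and the commented-out Ollama call assumes a local Ollama server with `nomic-embed-text` pulled:

```python
import math

# Nomic's prefix system: documents and queries get different task
# prefixes so the model embeds them into compatible spaces.
DOC_PREFIX = "search_document: "
QUERY_PREFIX = "search_query: "

def prefix_document(text: str) -> str:
    # Prepend the document prefix before embedding an answer/chunk.
    return DOC_PREFIX + text

def prefix_query(text: str) -> str:
    # Prepend the query prefix before embedding a user question.
    return QUERY_PREFIX + text

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Standard cosine similarity for ranking retrieved chunks.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# With a local Ollama server you would embed like this (requires
# `ollama pull nomic-embed-text` and the `ollama` Python package):
#
#   import ollama
#   vec = ollama.embeddings(
#       model="nomic-embed-text",
#       prompt=prefix_document("Some answer text to index"),
#   )["embedding"]
```

The key point is that the same question embedded with and without the `search_query:` prefix lands in different places, so you should prefix consistently at both index time and query time.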
Lots of things to study, I will take a look and experiment!