Locally, you mean? Maybe Gemma 3 12B.

That’s my default model in my local setup. Haven’t tried it on video though.

Not on actual video files, on VTT files (they are essentially text files with timestamps and content).

I think you could convert the VTT file to a txt file and then load that into Msty desktop as a knowledge stack and then make queries about it. I can try it when I get home.
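Stripping a VTT down to plain text is mostly a matter of dropping the header, cue numbers, timestamp lines, and speaker tags. A rough sketch, assuming a standard WEBVTT layout (the filenames are just examples):

```python
import re
import sys

def vtt_to_text(vtt_path: str) -> str:
    """Strip a WEBVTT transcript down to its spoken text.

    Drops the header, bare cue numbers, timestamp lines, and <v Speaker>
    voice tags, keeping only the caption text itself.
    """
    lines = []
    with open(vtt_path, encoding="utf-8") as f:
        for raw in f:
            line = raw.strip()
            # Skip header/notes, blank lines, cue numbers, and timestamp lines.
            if not line or line.startswith(("WEBVTT", "NOTE")) or line.isdigit() or "-->" in line:
                continue
            # Remove speaker/voice tags such as <v Alice> ... </v>.
            line = re.sub(r"</?v[^>]*>", "", line).strip()
            if line:
                lines.append(line)
    return "\n".join(lines)

if __name__ == "__main__":
    # Example usage: python vtt_to_txt.py talk.vtt > talk.txt
    print(vtt_to_text(sys.argv[1]))
```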

I cleaned up the VTT file, removing the timestamps and the author, but it seems that Llama 3.1 8B cannot handle a 100 KB file; it's too much for its 128K-token context window.

It's a shame.

Yes, I'm looking for something to run locally.

I will try it, thanks.

I normally use Ollama for that 👌

Me too.

I'm testing some GUIs to see whether they let me tweak the text splitting and chunking.

Gemma gives me results similar to Llama's: totally wrong.

I'm going to share some steps I would take if I were doing what you're doing (as far as I can tell, summarizing a large text document with local models). Some of these might help you improve your results.

- First, I would check that the context window configuration in Ollama meets your needs and is not set to the default 2048. If it is, the issue is that the model will only perceive what fits within that context window (see the sketch after this list).

- I would ensure that I'm providing a good system prompt to clearly specify the role, task, and traits.
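For the first point, here is roughly how you can raise the context window per request with the Ollama Python client; the model tag, filename, and num_ctx value are just examples, and you can alternatively bake a `PARAMETER num_ctx` line into a Modelfile:

```python
import ollama

# Ollama defaults to a 2048-token context unless told otherwise; pass a larger
# num_ctx explicitly so the whole document (plus the prompt) fits.
response = ollama.chat(
    model="gemma3:12b",  # example tag; use whatever model you have pulled
    messages=[
        {"role": "system", "content": "You are a careful summarizer of meeting transcripts."},
        {"role": "user", "content": "Summarize the following transcript:\n\n" + open("talk.txt").read()},
    ],
    options={"num_ctx": 32768},  # raise from the 2048 default; costs RAM/VRAM
)
print(response["message"]["content"])
```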

If you've already considered these points and are still not getting good results, I would then try more sophisticated methods to feed the data to the model. For example, I would chunk the large text into contextual segments, summarize each chunk, and then summarize all the outputs.
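A rough sketch of that chunk-then-summarize (map-reduce) approach, again with the Ollama Python client; the chunk size, model tag, and prompt wording are assumptions you would tune:

```python
import ollama

MODEL = "gemma3:12b"   # example tag
CHUNK_CHARS = 8000     # rough chunk budget; tune it to fit your num_ctx

def ask(prompt: str) -> str:
    resp = ollama.generate(model=MODEL, prompt=prompt, options={"num_ctx": 8192})
    return resp["response"]

def chunk(text: str, size: int = CHUNK_CHARS) -> list[str]:
    # Naive paragraph-aware chunking: pack paragraphs until the size budget is hit.
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > size:
            chunks.append(current)
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current)
    return chunks

def summarize(text: str) -> str:
    # Map: summarize each chunk independently.
    partials = [ask(f"Summarize this transcript excerpt:\n\n{c}") for c in chunk(text)]
    # Reduce: summarize the concatenated partial summaries.
    return ask("Combine these partial summaries into one coherent summary:\n\n" + "\n\n".join(partials))

if __name__ == "__main__":
    print(summarize(open("talk.txt").read()))  # filename is a placeholder
```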

If that still doesn't work, I would focus on creating an agentic workflow as a data pipeline to experiment with these chunking and summarization steps in a more controlled manner. I would also consider using DSPy.
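Just as an illustration of the DSPy route, a minimal sketch (not a full pipeline; the model tag and the signature field names are assumptions):

```python
import dspy

# Point DSPy at a local Ollama model (the tag is an example).
lm = dspy.LM("ollama_chat/gemma3:12b", api_base="http://localhost:11434", api_key="")
dspy.configure(lm=lm)

# Declare each step as a signature; DSPy builds and can later optimize the prompts.
summarize_chunk = dspy.ChainOfThought("chunk -> summary")
merge = dspy.ChainOfThought("partial_summaries -> final_summary")

def summarize(chunks: list[str]) -> str:
    partials = [summarize_chunk(chunk=c).summary for c in chunks]
    return merge(partial_summaries="\n\n".join(partials)).final_summary
```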

Just keep in mind that a 12B model is significantly smaller than the enormous proprietary models. To achieve similar results, we'll need to be smarter in how we use the local models.

> and is not set to the default value of 2048.

You got it, that's the bottom line! Thanks!

I set the context to 64K and the response is MUCH better, quite similar to Claude's; it's just quite slow on my PC, I have to wait a few minutes for every query. But it's promising.

Now I need to think about a chunking and summarization workflow to manage dozens of similar files.

Awesome! I'm glad 🤙

Just to understand: your idea is to query that document using prompts? Like chat with your docs? You might get much better results using RAG, if that's the case.

Exactly.

I already tested RAG using AnythingLLM without any success, but I suppose the underlying problem was the context length. Now I will test it again using the modified models.

I tested RAG, but I'm getting poor results. I played a little with chunk size and chunk overlap, but it doesn't seem to help. I only got decent results (though no better than the standard query) with Open WebUI by enabling "Full Context Mode" (so the whole document is fed), but it took 30% more time to reply compared to the standard mode.

Any suggestions?

Since you are dealing with content that may be non-self-descriptive and probably isn't what the embedding model was trained on, consider feeding your text to an LLM first to summarize it and turn it into more self-explanatory content.

Then feed that to the embedding model.
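A sketch of that summarize-then-embed idea; the model tags and segments are placeholders, and ollama.embeddings is the Ollama client's embedding call:

```python
import ollama

def enrich(segment: str) -> str:
    # Rewrite each raw transcript segment into a short, self-explanatory summary
    # before embedding, so the embedding model gets clearer input.
    resp = ollama.generate(
        model="gemma3:12b",  # example generation model
        prompt="Rewrite this transcript segment as a short, self-contained explanation:\n\n" + segment,
    )
    return resp["response"]

def embed(text: str) -> list[float]:
    resp = ollama.embeddings(model="nomic-embed-text", prompt=text)
    return resp["embedding"]

segments = ["...raw transcript segment 1...", "...raw transcript segment 2..."]  # placeholders
index = [(seg, embed(enrich(seg))) for seg in segments]  # keep both the raw text and its vector
```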

I will try that, thanks.

You can also build sequential embeddings this way:

> The summary of the last segment was as follows: [previous summary]
>
> The current segment is: [current segment text]
>
> Please return a summary for the current segment, using the previous segment for context, and also return the current context.
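A rough sketch of that rolling-summary loop (the model tag is an example; the prompt follows the template above):

```python
import ollama

def rolling_summaries(segments: list[str], model: str = "gemma3:12b") -> list[str]:
    """Summarize each segment while carrying the previous segment's summary as context."""
    summaries, prev_summary = [], "(none, this is the first segment)"
    for seg in segments:
        prompt = (
            f"The summary of the last segment was as follows:\n{prev_summary}\n\n"
            f"The current segment is:\n{seg}\n\n"
            "Please return a summary for the current segment, using the previous segment for context."
        )
        resp = ollama.generate(model=model, prompt=prompt)
        prev_summary = resp["response"]
        summaries.append(prev_summary)
    return summaries  # embed these instead of (or alongside) the raw segments
```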

Uhm, this is hardcore, I need to understand all this pipeline stuff.

Hey! Yes, you could follow what nostr:npub12262qa4uhw7u8gdwlgmntqtv7aye8vdcmvszkqwgs0zchel6mz7s6cgrkj is recommending. Basically, create a synthetic dataset with two columns: col 1 for questions and col 2 for answers. You can use an LLM to generate this dataset, then embed the answers.

I would also recommend using an embedding model like Nomic ( https://ollama.com/library/nomic-embed-text ), since they have an interesting prefix system that generally improves the performance and accuracy of queries ( https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#usage ). I can also share the code for the Beating Heart, a RAG system that ingests MD documents, chunks them semantically, and then embeds them: https://github.com/gzuuus/beating-heart-nostr . Additionally, I find the videos by Matt Williams very instructive: https://www.youtube.com/watch?v=76EIC_RaDNw .

Finally, I would say that generating a synthetic dataset is not strictly needed if you embed the data smartly.
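To show what that prefix system looks like in practice, here is a minimal retrieval sketch; the chunks, question, and chat model tag are placeholders, and cosine similarity is computed by hand to keep it dependency-free:

```python
import math
import ollama

EMBED_MODEL = "nomic-embed-text"

def embed(text: str) -> list[float]:
    return ollama.embeddings(model=EMBED_MODEL, prompt=text)["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Index time: documents get the "search_document:" prefix.
chunks = ["...chunk 1...", "...chunk 2..."]  # placeholders for your transcript chunks
index = [(c, embed("search_document: " + c)) for c in chunks]

# Query time: questions get the "search_query:" prefix.
question = "What did the speaker say about X?"  # placeholder question
q_vec = embed("search_query: " + question)

# Retrieve the most similar chunks and feed them to the chat model as context.
top = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)[:3]
context = "\n\n".join(c for c, _ in top)
answer = ollama.generate(
    model="gemma3:12b",  # example model tag
    prompt=f"Answer using only this context:\n\n{context}\n\nQuestion: {question}",
)
print(answer["response"])
```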

Lots of things to study, I will take a look and experiment!