I built a deep document search tool running off ChatGPT.
It basically sends a bunch of documents, page by page, through an LLM and asks whether the query is addressed on that page, then collects the relevant pages by running regexes over the model's replies.
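A minimal sketch of that loop. The `ask_llm` callable, the YES/NO prompt format, and the regex are all assumptions for illustration, not the actual implementation — `ask_llm` could wrap the OpenAI client or a local Llama server.

```python
import re
from typing import Callable, Iterable

# The model is asked for YES/NO per page; a regex over the reply tolerates
# extra chatter in the output. (Hypothetical prompt format, not the original.)
RELEVANT_RE = re.compile(r"\bYES\b", re.IGNORECASE)

def find_relevant_pages(pages: Iterable[str], query: str,
                        ask_llm: Callable[[str], str]) -> list[int]:
    """Send each page through the LLM; return indices of relevant pages."""
    hits = []
    for i, page in enumerate(pages):
        prompt = (
            f"Question: {query}\n\n"
            f"Page text:\n{page}\n\n"
            "Does this page contain information relevant to the question? "
            "Answer YES or NO."
        )
        answer = ask_llm(prompt)
        if RELEVANT_RE.search(answer):
            hits.append(i)
    return hits
```

Swapping `ask_llm` between a remote API call and a local model is what makes the ChatGPT-to-local refactor described below a small change.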
I refactored it to run 100% locally off Llama 3.1 8B (or any other model), and the speed was crazy: ~0.5s response times.
With OpenAI it was more like a 20s round trip. You can batch or send parallel requests, but it still doesn't come close to an 8B running on the same machine.