With the rise of really good transcription models, TTS that’s actually enjoyable to listen to, and LLMs that can carry a conversation and understand complex commands, why haven’t we seen an explosion of genuinely good voice interfaces?

It seems obvious to me, but I’ve only seen Apple make a serious attempt, with the latest Siri update. There are so many times when I’m doing something with my hands, driving, etc., and wish I could give commands to my RSS reader or just chat with an LLM that has arXiv and Wikipedia connected via RAG.

Discussion

Really good question, maybe it’s because LLMs can only complete your sentences and not “carry a conversation and understand complex commands”? 🤔

Intent recognition (see https://cloud.google.com/dialogflow/cx/docs/concept/intent) is a somewhat different topic from LLMs.

Can LLMs mimic intent recognition? Absolutely! If-else statements can do that too, though (see the sketch below).
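
To make the if-else point concrete, here is a minimal sketch of keyword-based intent matching for the voice-command use case above. The intent names and trigger phrases are made up for illustration; real intent-recognition systems like Dialogflow also extract parameters, handle paraphrases, and track conversation context.

```python
# Toy "intent recognition" with plain if-else logic. Intent names and trigger
# phrases are invented for this example; it obviously breaks on any rewording.

def recognize_intent(utterance: str) -> str:
    text = utterance.lower()
    if "summary" in text or "summarize" in text:
        return "summarize_article"
    if "save" in text and "later" in text:
        return "save_for_later"
    if "next" in text:
        return "next_article"
    return "unknown"

print(recognize_intent("What's the summary of this article?"))  # summarize_article
print(recognize_intent("Alright save that for later"))          # save_for_later
```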

“LLMs can only complete sentences” is true of base models, but instruction fine-tuning with RLHF has been a thing for 3-4 years at this point. I’m talking about reading news with something like “What’s the summary of this article?”, “Alright save that for later and go to the next one.”

You don’t necessarily have to use LLMs, but so far they seem like the easiest way to understand relatively complex commands and call functions.
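
Concretely, the “call functions” part can be as simple as asking an instruction-tuned model to emit a structured tool call and dispatching on it. A rough, self-contained sketch; the llm_complete stub, tool names, and JSON format are all assumptions for illustration, not any particular vendor’s API:

```python
import json

# Hypothetical stand-in for whatever chat client you actually use (hosted API or
# a local instruction-tuned model). Hard-coded here so the sketch runs on its own.
def llm_complete(system_prompt: str, user_utterance: str) -> str:
    return json.dumps({"function": "save_for_later",
                       "arguments": {"then": "next_article"}})

# The functions the voice interface exposes to the model (made-up RSS-reader actions).
TOOLS = {
    "summarize_article": lambda **kw: "Summary: ...",
    "save_for_later":    lambda **kw: f"Saved. Next step: {kw.get('then', 'none')}",
    "next_article":      lambda **kw: "Moved to the next article.",
}

SYSTEM = (
    "You control an RSS reader by voice. Reply only with JSON of the form "
    '{"function": "summarize_article" | "save_for_later" | "next_article", '
    '"arguments": {...}}'
)

def handle_voice_command(utterance: str) -> str:
    # Let the model pick the function, then dispatch to plain Python.
    call = json.loads(llm_complete(SYSTEM, utterance))
    return TOOLS[call["function"]](**call.get("arguments", {}))

print(handle_voice_command("Alright, save that for later and go to the next one."))
```

The interesting bit is that the model handles the paraphrasing and the compound command (“save that, then go to the next one”), while the actual actions stay ordinary, testable functions.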