Look at LocalAI and Open WebUI, especially the latter's integrations. The former is basically YAML-based model specs, with presets and control over what gets loaded when. RAG is annoying as fuck to set up, but it's totally doable. ^^
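For a taste, a model spec in LocalAI is just a YAML file dropped into the models dir - roughly like this sketch (model name and weights file are placeholders, and the exact keys can differ between LocalAI versions, so check the docs):

```yaml
# models/mistral.yaml - one file per model LocalAI should know about
name: mistral                 # the name you request via the OpenAI-style API
backend: llama-cpp            # inference backend for this model
parameters:
  model: mistral-7b-instruct.Q4_K_M.gguf  # weights file in the models dir (placeholder)
  temperature: 0.7
context_size: 4096
template:
  chat: mistral-chat          # prompt template preset to apply
```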
Working on the exact same thing, but based on the Milk-V Oasis. Currently deciding whether to get a Tenstorrent Wormhole, or to tinker with amdgpu to get ROCm working so an RX 7000 card can be the basis...
Alternatively, Ampere + NVIDIA works, because NVIDIA has ARM drivers - partial ones, at least, but CUDA is included. Why Ampere? Look at the TDP; pairing that with lots of RAM lets you configure LocalAI to use both CPU and GPU, and you can specify exactly which model goes where and how many layers get offloaded.
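The split looks something like this in the model YAML (the layer and thread counts are made up - tune them to your model size and VRAM):

```yaml
# Offload the first N transformer layers to the GPU, run the rest on the CPU.
# Raise gpu_layers until VRAM is full, then give the leftover layers threads.
name: big-model
backend: llama-cpp
parameters:
  model: some-70b.Q4_K_M.gguf   # placeholder filename
gpu_layers: 40                  # layers offloaded to the GPU
threads: 32                     # CPU threads for the remaining layers
f16: true
```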
This way, you can allocate several models with some kind of priority: keep the embeddings model, Whisper and other tiny things loaded all the time, while bigger models get swapped in and out depending on which Pipeline you end up running. :)
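The swapping part is basically LocalAI's watchdog unloading idle backends - small models reload fast enough that it works out in practice. A rough docker-compose sketch (the env var names are from memory, so double-check them against the LocalAI docs before relying on them):

```yaml
# docker-compose.yaml (sketch) - models dir holds one YAML per model;
# the watchdog frees memory by unloading backends that sit idle too long,
# so big models make room while tiny ones come back quickly on demand.
services:
  localai:
    image: quay.io/go-skynet/local-ai:latest
    environment:
      - LOCALAI_WATCHDOG_IDLE=true         # enable idle-unloading (name assumed)
      - LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m  # unload after 15 min without requests
    volumes:
      - ./models:/models                   # the YAML specs from above live here
    ports:
      - "8080:8080"
```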