end to end it goes roughly like this:
1. user uploads an audio file
2. backend uses ffmpeg to convert it into HLS files (chunked mp3s + text manifest)
3. HLS files get served from a basic fileserver over HTTP
4. HLS manifest url goes into note metadata
5. client gets the note, fetches the manifest (which lists time markers + mp3 chunk urls), and loads chunks into the player just in time as playback reaches them
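
To make step 5 a bit more concrete, here is a rough Python sketch of the client side: parse the manifest text into chunks with start times, then look up which chunk a given playback position needs. Everything named here is an assumption for illustration, not taken from the actual setup: the `Chunk` type, the `parse_manifest` and `chunk_for` helpers, the example manifest contents, and the `files.example` URL. It only handles the basic `#EXTINF` + URI pairs of an HLS media playlist and ignores every other tag.

```python
from __future__ import annotations

from dataclasses import dataclass
from urllib.parse import urljoin


@dataclass
class Chunk:
    start: float     # seconds from the beginning of the track
    duration: float  # seconds, taken from the #EXTINF tag
    url: str         # absolute URL of the mp3 chunk


def parse_manifest(text: str, manifest_url: str) -> list[Chunk]:
    """Turn HLS media playlist text into chunks with cumulative start times."""
    chunks: list[Chunk] = []
    start = 0.0
    duration = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#EXTINF:"):
            # Tag format: "#EXTINF:<duration>,<optional title>"
            duration = float(line[len("#EXTINF:"):].split(",", 1)[0])
        elif line and not line.startswith("#") and duration is not None:
            # A non-tag line after #EXTINF is the chunk URI; it may be
            # relative, so resolve it against the manifest URL.
            chunks.append(Chunk(start, duration, urljoin(manifest_url, line)))
            start += duration
            duration = None
    return chunks


def chunk_for(chunks: list[Chunk], position: float) -> Chunk | None:
    """Return the chunk covering `position` seconds, or None if past the end."""
    for chunk in chunks:
        if chunk.start <= position < chunk.start + chunk.duration:
            return chunk
    return None


# Hypothetical manifest, roughly what a 24.5 second upload might produce.
EXAMPLE_MANIFEST = """\
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
chunk_000.mp3
#EXTINF:10.0,
chunk_001.mp3
#EXTINF:4.5,
chunk_002.mp3
#EXT-X-ENDLIST
"""

if __name__ == "__main__":
    chunks = parse_manifest(
        EXAMPLE_MANIFEST, "http://files.example/notes/42/index.m3u8"
    )
    needed = chunk_for(chunks, 12.0)
    print(needed.url if needed else "past end of track")
```

A real player would fetch `needed.url` (and probably prefetch the next chunk or two) only when playback or a seek gets close to it, which is the "just in time" part. In a browser this is usually handled by the player library or native HLS support rather than hand-written code like this.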