whisper for speech to text, then a model like OpenVoice V2 that can render text to speech in a wide range of accents and styles. distortion alone is not enough.
as for the identities, SimpleX has a nice model. no permanent identities, just rendezvous strings.
could be a fun hacking project. the latency is going to be horrible though.
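
a rough sketch of the speech -> text -> speech chain, mostly to show where the latency piles up. `revoice`, `fake_transcribe` and `fake_synthesize` are made-up names for illustration; in a real build the two callables would wrap whisper and whatever tts model you pick. neither model is actually called here.

```python
import time
from typing import Callable, Dict, Tuple


def revoice(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    synthesize: Callable[[str], bytes],
) -> Tuple[bytes, Dict[str, float]]:
    """Run speech -> text -> speech and report seconds spent per stage."""
    t0 = time.perf_counter()
    text = transcribe(audio)      # stage 1: speech to text (e.g. whisper)
    t1 = time.perf_counter()
    out = synthesize(text)        # stage 2: text to speech in a new voice
    t2 = time.perf_counter()
    timings = {"stt": t1 - t0, "tts": t2 - t1, "total": t2 - t0}
    return out, timings


# hypothetical stand-ins so the sketch runs without any model installed
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_synthesize(text: str) -> bytes:
    return text.upper().encode("utf-8")


if __name__ == "__main__":
    voice, timings = revoice(b"hello there", fake_transcribe, fake_synthesize)
    print(voice, timings)
```

the stages run back to back, so the total is at least the sum of both model latencies, and the speech to text stage generally wants a whole chunk of audio before it can produce anything. that is where i'd expect most of the pain to come from.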