I think the key component we're still missing is training without requiring ever-increasing amounts of energy (and specialized hardware). Right now it seems to scale horribly, which is probably why I only see small incremental improvements over previous models.
I don't know if there's a difference between training on text and training on imagery, but both use the "neural" approach as I understand it, so they're probably about equal in that respect.
Anyway, this is more about a speculative future than a prediction of when (or if) it will take place.