To me this is a critical step in generative AI. We ask it to reason about the world without any intuitive model of what the world is. Training should involve translations between text, audio, images, video, 3D models, and 3D animations. That would force it to develop an intermediate understanding of the world.