Subnostr

Why would a language model make good visuals? Diffusion models are what make good visuals.

this new model does audio, images and text all at the same time, so was hoping it would be smarter here. not sure what their architecture looks like

Reply to this note

Please Login to reply.

Discussion

🐢 1y ago

Surely they will eventually combine the various types of models into a single product to create a cohesive product, but maybe instead of waiting for that you should just move over to a diffusion model to make your desired imagery.

GPT is primarily a language model. OpenAI is working on diffusion models, primarily their video generation model called sora, but language models will never do imagery as good as diffusion models, so until they have something that’s combining both types of models I wouldn’t expect anything close to perfection or even “good”