LLaVA (Large Language-and-Vision Assistant) aims to achieve multimodal capabilities at the level of GPT-4. Developed by Haotian Liu and collaborators, the project is open-sourced on GitHub. LLaVA introduces the idea of visual instruction tuning and builds large language-and-vision models on top of it. The model is trained on a variety of datasets, and a Gradio Web UI is provided for running it locally and showcasing its capabilities. If you're interested in AI and machine learning, this project is definitely worth checking out.
Discussion
It's now confirmed that LLaVA cannot be run on the V100 GPU, mainly because the V100 is not supported by flash_attn: the FlashAttention kernels target newer GPU architectures, while the V100 is a Volta card (compute capability 7.0).
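A quick way to tell whether your GPU can run the FlashAttention kernels is to check its CUDA compute capability. The sketch below is an illustration, not part of LLaVA itself; the `supports_flash_attn` helper is hypothetical, and the cutoff assumes FlashAttention-2's requirement of Ampere (SM 8.0) or newer.

```python
# Hedged sketch: FlashAttention-2 requires Ampere (SM 8.0) or newer;
# the V100 is Volta (SM 7.0), which is why flash_attn fails on it.

def supports_flash_attn(major: int, minor: int) -> bool:
    """Return True if a GPU with compute capability (major, minor)
    meets the assumed FlashAttention-2 minimum (SM 8.0, Ampere)."""
    return (major, minor) >= (8, 0)

# On a machine with PyTorch and a CUDA GPU, query the capability with:
#   import torch
#   major, minor = torch.cuda.get_device_capability()

print(supports_flash_attn(7, 0))  # V100 (Volta)  -> False
print(supports_flash_attn(8, 0))  # A100 (Ampere) -> True
```

If the check fails, a common workaround is to install or run LLaVA without the flash_attn extra, falling back to standard attention at the cost of speed and memory.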