2023
Visual Instruction Tuning
Introducing LLaVA, a method for aligning large language models (LLMs) with visual information by instruction tuning on multimodal instruction-following data generated with GPT-4.
Introducing BLIP-2, a vision-language model that bridges a frozen image encoder and a frozen large language model with a lightweight Querying Transformer, sharply reducing trainable parameters while improving performance across multimodal tasks.
Introducing CLIP, a neural network trained on 400 million image-text pairs that learns to connect images with their textual descriptions, enabling zero-shot image classification and other powerful capabilities.
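CLIP's zero-shot classification reduces to comparing a normalized image embedding against normalized embeddings of candidate label prompts. A minimal sketch of that mechanism is below; the embeddings here are random stand-ins (not actual CLIP encoder outputs), and the label prompts are illustrative only.

```python
import numpy as np

# Random stand-ins for CLIP's encoder outputs; in the real model these
# come from a vision tower and a text tower trained contrastively.
rng = np.random.default_rng(0)
dim = 8

def normalize(x):
    # CLIP compares embeddings by cosine similarity, so project onto
    # the unit sphere first.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# One text embedding per candidate label prompt.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = normalize(rng.normal(size=(len(labels), dim)))

# Nudge the image embedding toward the "cat" prompt so the toy
# example has a predictable winner.
image_emb = normalize(text_embs[0] + 0.1 * rng.normal(size=dim))

# Zero-shot classification: cosine similarity (dot product of unit
# vectors), scaled by a temperature, then softmax over the labels.
logits = 100.0 * text_embs @ image_emb
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(labels[int(np.argmax(probs))])  # prints "a photo of a cat"
```

The same recipe generalizes: classifying against a new label set only requires embedding new text prompts, with no retraining, which is what makes CLIP zero-shot.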