2023
Visual Instruction Tuning
Introducing LLaVA, a method for aligning large language models (LLMs) with visual information by instruction tuning on multimodal instruction-following data generated with GPT-4.
Introducing BLIP-2, a vision-language model that bridges a frozen image encoder and a frozen large language model with a lightweight Querying Transformer, sharply reducing trainable parameters while improving performance across multimodal tasks.
Introducing CLIP, a neural network trained on 400 million image-text pairs that learns to connect images with their textual descriptions, enabling zero-shot image classification and other powerful capabilities.
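CLIP's zero-shot classification reduces to comparing a normalized image embedding against normalized embeddings of candidate label prompts. A minimal sketch of that mechanism is below; the embeddings here are random stand-ins (not actual CLIP encoder outputs), and the label prompts are illustrative only.

```python
import numpy as np

# Random stand-ins for CLIP's encoder outputs; in the real model these
# come from a vision tower and a text tower trained contrastively.
rng = np.random.default_rng(0)
dim = 8

def normalize(x):
    # CLIP compares embeddings by cosine similarity, so project onto
    # the unit sphere first.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# One text embedding per candidate label prompt.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = normalize(rng.normal(size=(len(labels), dim)))

# Nudge the image embedding toward the "cat" prompt so the toy
# example has a predictable winner.
image_emb = normalize(text_embs[0] + 0.1 * rng.normal(size=dim))

# Zero-shot classification: cosine similarity (dot product of unit
# vectors), scaled by a temperature, then softmax over the labels.
logits = 100.0 * text_embs @ image_emb
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(labels[int(np.argmax(probs))])  # prints "a photo of a cat"
```

The same recipe generalizes: classifying against a new label set only requires embedding new text prompts, with no retraining, which is what makes CLIP zero-shot.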