Upcoming Talk

Understanding Multimodal Models: A Brief History and How They Work

Explore how AI can integrate text, images, audio, and video. Learn about architecture innovations in CLIP and Gemma, and discover real-world applications in robotics and beyond.

30 minutes
Bangalore, India
Advanced

Want to book this talk?

Understanding Multimodal Models: CLIP and PaLI Explained Clearly

Introduction to Multimodal Models

Multimodal models integrate multiple data types (such as text, images, audio, or video) to perform tasks more effectively than single-modal models. They are powerful because they capture richer context and improve performance by leveraging complementary information from different modalities.

Starting with CLIP: Why it Matters

OpenAI's CLIP (Contrastive Language-Image Pre-training) was one of the first models to successfully align text and images into a single, shared embedding space. CLIP's simplicity, combined with its effectiveness, makes it an ideal entry point into multimodal learning.

Technical Breakdown of CLIP

Core Architecture

CLIP consists of two parallel encoders:

  • Image Encoder: Typically implemented using architectures such as ResNet or Vision Transformers (ViT), converting image pixels into embedding vectors.
  • Text Encoder: Often a Transformer-based model (similar to GPT), converting text inputs into embedding vectors.
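
To make this structure concrete, here is a minimal PyTorch sketch of a CLIP-style dual encoder. The linear "backbones" are placeholders standing in for a real ResNet/ViT and text Transformer, and the dimensions (2048, 768, 512) are illustrative rather than CLIP's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Illustrative CLIP-style dual encoder (not the original implementation)."""

    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=512):
        super().__init__()
        # Placeholders for the real backbones (ResNet/ViT and a text Transformer).
        self.image_backbone = nn.Linear(img_dim, img_dim)
        self.text_backbone = nn.Linear(txt_dim, txt_dim)
        # Learned projections into the shared embedding space.
        self.image_proj = nn.Linear(img_dim, embed_dim, bias=False)
        self.text_proj = nn.Linear(txt_dim, embed_dim, bias=False)

    def forward(self, image_feats, text_feats):
        img = self.image_proj(self.image_backbone(image_feats))
        txt = self.text_proj(self.text_backbone(text_feats))
        # L2-normalize so cosine similarity becomes a plain dot product.
        return F.normalize(img, dim=-1), F.normalize(txt, dim=-1)

model = DualEncoder()
img_emb, txt_emb = model(torch.randn(4, 2048), torch.randn(4, 768))
print(img_emb.shape, txt_emb.shape)  # torch.Size([4, 512]) torch.Size([4, 512])
```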

How Contrastive Loss Works

The essence of CLIP's training involves contrastive learning:

  • Goal: Bring related image-text pairs close in embedding space and push unrelated pairs far apart.

Mathematically:

$$\text{Loss} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\,\mathrm{sim}(I_i,\,T_i)}}{\sum_{j=1}^{N} e^{\,\mathrm{sim}(I_i,\,T_j)}}$$

Where:

  • Iᵢ: embedding of the i-th image
  • Tⱼ: embedding of the j-th text
  • sim(·, ·): cosine similarity between two embeddings

CLIP thus efficiently learns to place matching image-text pairs close together in the shared space while pushing unrelated pairs apart.
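
Below is a minimal PyTorch sketch of this contrastive objective. Following the original paper, it scales similarities by a temperature and averages the image-to-text and text-to-image directions; the batch of random embeddings and the fixed temperature value are illustrative only.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of L2-normalized embeddings.

    img_emb, txt_emb: [N, D] tensors where row i of each forms a matching pair.
    """
    logits = img_emb @ txt_emb.t() / temperature     # [N, N] similarity matrix
    targets = torch.arange(img_emb.size(0))          # true pairs sit on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)      # each image vs. all texts
    loss_t2i = F.cross_entropy(logits.t(), targets)  # each text vs. all images
    return (loss_i2t + loss_t2i) / 2

img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(clip_contrastive_loss(img, txt))
```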

CLIP Image Tokenization & Projection

How CLIP processes images into tokens and projects them to the shared space

Figure: Image projection as a linear-algebra operation. The image encoder outputs a vector x = [x₁, x₂, …, x_d]ᵀ ∈ ℝᵈ, where d depends on the encoder (ViT-B: 768, ViT-L: 1024, ResNet-50: 2048). A learned projection matrix W ∈ ℝ^(512×d) maps it into the shared space, y = Wx ∈ ℝ⁵¹², with yₖ = wₖ,₁x₁ + wₖ,₂x₂ + … + wₖ,d·x_d, and the result is L2-normalized: z = y / ‖y‖₂, where ‖y‖₂ = √(y₁² + y₂² + … + y₅₁₂²).

Key mathematical properties:

  • The linear projection reduces dimensionality (d → 512) while preserving the relative distances between similar vectors.
  • L2 normalization maps every embedding onto the unit sphere, so ‖z‖₂ = 1 for all vectors.
  • On the unit sphere, cosine similarity equals the dot product: cos(θ) = z₁ᵀz₂.
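
As a quick numerical check of the projection step, here is a NumPy sketch with random weights and made-up inputs: it maps an encoder output of width d into the 512-dimensional shared space, L2-normalizes it, and verifies that cosine similarity then equals a plain dot product.

```python
import numpy as np

rng = np.random.default_rng(0)
d, out_dim = 768, 512                   # e.g. ViT-B width -> shared-space width

x = rng.normal(size=d)                  # image encoder output, x in R^d
W = rng.normal(size=(out_dim, d))       # learned projection matrix, W in R^(512 x d)

y = W @ x                               # linear projection: y = Wx in R^512
z = y / np.linalg.norm(y)               # L2 normalization onto the unit sphere
print(np.linalg.norm(z))                # 1.0

# A second embedding to compare against.
y2 = W @ rng.normal(size=d)
z2 = y2 / np.linalg.norm(y2)

# On the unit sphere, cosine similarity equals the dot product.
cosine = (z @ z2) / (np.linalg.norm(z) * np.linalg.norm(z2))
print(np.isclose(cosine, z @ z2))       # True
```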

Shared Latent Space in CLIP

The CLIP model learns a joint embedding space for text and images.

Understanding CLIP's Shared Latent Space

The key to cross-modal understanding

Figure: CLIP's shared latent space across training stages (pre-training, early, mid, and fully trained). As training progresses, L2-normalized image and text embeddings on the unit hypersphere organize into concept clusters (animals: cat, dog, bird; vehicles: car, bus, boat; plants: tree, flower; landscapes: mountain, beach), with each image embedding aligned to its matching text embedding. Contrastive learning pulls matching pairs together and pushes non-matching pairs apart.

Now, let's compare CLIP with PaLI and its successor, PaliGemma.

Beyond CLIP: Introducing PaLI

PaLI (Pathways Language and Image model), developed by Google, significantly expands upon CLIP's foundational approach. Unlike CLIP, PaLI handles a variety of vision-language tasks beyond simple embedding alignment, including visual question-answering (VQA), detailed image captioning, and object detection.

PaLI Architecture: A Technical Dive

PaLI leverages:

  • Vision Transformer (ViT) for processing images into embeddings.
  • Transformer-based Encoder-Decoder frameworks for language understanding.
  • Cross-Attention mechanisms in transformers for effective integration between textual and visual data.
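
To make the cross-attention piece concrete, here is a minimal PyTorch sketch in which text-token states (queries) attend to image-patch features (keys and values) through `nn.MultiheadAttention`. The sequence lengths, dimensions, and random inputs are illustrative and not PaLI's actual configuration.

```python
import torch
import torch.nn as nn

dim, heads = 768, 8
batch, n_patches, n_tokens = 2, 196, 32

# Stand-ins for the ViT patch features and the text model's token states.
image_feats = torch.randn(batch, n_patches, dim)   # [B, H*W patches, D]
text_feats = torch.randn(batch, n_tokens, dim)     # [B, sequence length, D]

# Text tokens are the queries; image patches provide keys and values.
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)
fused, attn_weights = cross_attn(query=text_feats, key=image_feats, value=image_feats)

print(fused.shape)         # torch.Size([2, 32, 768]) - visually grounded text states
print(attn_weights.shape)  # torch.Size([2, 32, 196]) - token-to-patch attention
```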

Detailed Comparison: CLIP vs. PaLI

Aspect           | CLIP                            | PaLI
Modalities       | Text-image pairs                | Text-image, plus advanced tasks (captioning, VQA, detection)
Architecture     | Separate encoders (contrastive) | Unified multimodal transformer architecture
Training Methods | Contrastive learning            | Cross-entropy, masked language modeling
Task Complexity  | Pairwise matching (basic)       | Multitask learning (complex vision-language tasks)

CLIP vs PaliGemma2 Architecture Comparison

Comparing contrastive dual-encoder vs generative multimodal approaches

Figure: CLIP vs PaliGemma2 architecture comparison.

  • CLIP (OpenAI, 2021): two independent encoders with no cross-attention, so the modalities are processed separately. The vision Transformer encodes 16×16-pixel patches with position embeddings and multi-head self-attention; the 12-layer text Transformer uses BPE tokenization (≈49K vocabulary, 77-token maximum). Both outputs are projected to [batch, 512] embeddings, L2-normalized, and trained with a temperature-scaled contrastive loss (τ = 0.07) that pulls matching image-text pairs together in the similarity matrix and pushes mismatches apart. The objective is representation learning: efficient for retrieval, but with no generation capability.
  • PaliGemma2 (Google, 2024): an enhanced, high-resolution vision Transformer extracts dense image features ([batch, h, w, dim]) that feed a Gemma decoder-only LLM (multi-query attention, SwiGLU activations). Cross-modal attention, softmax(QKᵀ/√d)·V computed across 8 or 12 heads in parallel, lets text tokens attend to the image features (O(hw × L) complexity), producing visually enhanced text representations. A unified multimodal decoder is trained with an autoregressive language-modeling loss, L_LM = −Σᵢ log P(xᵢ | x₁, …, xᵢ₋₁, Image). The objective is conditional generation: text generation grounded in the image and complex visual-to-text reasoning, at higher computational cost.

Multimodal Integration Techniques Explained

Integration typically occurs through:

  • Early Fusion: Combining inputs at the raw data stage (rare, computationally intensive).
  • Intermediate Fusion: Encoders first process each modality independently; then embeddings are integrated.
  • Cross-Attention (Transformer Fusion): Allows dynamic and contextualized interactions between modalities, common in advanced multimodal transformers like PaLI.
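
As a toy illustration of intermediate fusion, the sketch below takes precomputed per-modality embeddings, concatenates them, and mixes them with a small fusion MLP; all module names and dimensions are made up for illustration.

```python
import torch
import torch.nn as nn

class IntermediateFusion(nn.Module):
    """Toy intermediate fusion: each modality is encoded separately upstream,
    then the embeddings are concatenated and mixed by a small MLP."""

    def __init__(self, img_dim=512, txt_dim=512, hidden=256, n_classes=10):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, img_emb, txt_emb):
        # Concatenate the per-modality embeddings and let the MLP mix them.
        return self.fuse(torch.cat([img_emb, txt_emb], dim=-1))

head = IntermediateFusion()
logits = head(torch.randn(4, 512), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 10])
```
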
Figure: Cross-modal attention in multimodal AI. Text token embeddings and image region embeddings (patch features) are combined through a cross-modal attention block into fused features; a decoder network then produces the final output representation used for downstream tasks such as visual understanding, text generation, and other cross-modal tasks.

Above is the attention visualization for the PaliGemma model.

How Cross-Modal Attention Works (as shown in the diagram above)

  1. Calculates similarity between text and image features
  2. Normalizes scores via softmax
  3. Weights visual features based on relevance to text representation
  4. Produces a fused feature vector
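
These four steps are exactly scaled dot-product attention. The NumPy sketch below walks through them for a single attention head, with made-up dimensions and no learned projections, purely to show the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                   # per-head feature dimension
n_text, n_patches = 5, 9                 # toy sequence lengths

Q = rng.normal(size=(n_text, d))         # queries from the text tokens
K = rng.normal(size=(n_patches, d))      # keys from the image patches
V = rng.normal(size=(n_patches, d))      # values from the image patches

# 1. Similarity between text queries and image keys, scaled by sqrt(d).
scores = Q @ K.T / np.sqrt(d)            # [n_text, n_patches]

# 2. Softmax-normalize each row into attention weights.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# 3. & 4. Weight the visual features by relevance and sum into fused vectors.
fused = weights @ V                      # [n_text, d] visually grounded features

print(weights.sum(axis=-1))              # each row sums to 1
print(fused.shape)                       # (5, 64)
```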

Figure: PaliGemma cross-modal attention architecture.

  • Image input pathway: a visual encoder (Vision Transformer) turns the image into a sequence of patch features of shape [B, H*W, D] with spatial information preserved; these supply the keys (K) and values (V).
  • Text input pathway: the Gemma language model encodes the prompt into token features of shape [B, Seq_Len, D] with semantic understanding; these supply the queries (Q).
  • Cross-modal attention: Attention(Q, K, V) = softmax(QKᵀ/√d)·V. Text queries against image keys yield attention weights, which weight the image values into visual context, combining information across modalities. Multiple heads attend to different aspects in parallel (e.g., an object region, the background, position) before being concatenated and projected.
  • Unified multimodal decoder: self-attention and feed-forward blocks feed a language-model head for autoregressive next-token prediction, generating text with visual context.
  • Training methodology: pre-training on image-text paired data, then fine-tuning on task-specific instruction data.

A side panel in the figure contrasts CLIP (dual-encoder, no cross-attention, contrastive loss, retrieval focus, separate pathways) with PaliGemma (encoder-decoder, cross-modal attention, LM loss, generation focus, deep integration).

Conclusion

CLIP laid critical groundwork for multimodal AI by demonstrating the power of embedding alignment via contrastive learning. PaLI and its successor PaliGemma further extend this capability, providing robust performance on complex multimodal tasks. Future advancements will likely incorporate additional modalities such as audio and video, further enriching AI's capability to interact naturally and effectively in real-world scenarios.

If you have any questions or feedback on this topic, please let me know. I would love to hear from you.


Interested in booking this talk?

I'd love to bring this topic to your event! Get in touch to discuss logistics, timing, and any specific areas you'd like me to focus on.

About the Speaker

Abhik Sarkar

AI researcher and engineer specializing in machine learning systems. Passionate about making complex AI concepts accessible.

Mastodon