
Multi-Head Attention in Vision Transformers

Explore how multi-head attention enables Vision Transformers (ViT) and other Transformer models to build context-aware representations by dynamically weighing the relevance of every token to every other token.

Deconstructing Multi-Head Self-Attention

The Multi-Head Self-Attention layer is arguably the most critical component driving the success of Transformer models in domains like Natural Language Processing (NLP). It allows the model to dynamically weigh the importance of different tokens in a sequence when updating the representation of a specific token, thereby creating highly context-aware embeddings.

The "Multi-Head" aspect enables the model to perform this attention process multiple times in parallel, each time potentially focusing on different types of relationships or information subspaces.

This page provides an interactive, step-by-step walkthrough of this mechanism. Use the visualization below to follow the calculations and build your intuition.

Setting the Scene: Input Embeddings & Query Token

  • Input: The attention layer receives a sequence of input embeddings (vectors). These typically represent tokens (like words or subwords) and often already include positional encoding information from previous steps.
  • Query Token: We focus the calculation from the perspective of one token at a time, referred to as the Query token. The goal is to compute an updated, contextualized embedding for this specific token.
  • Interaction: In the component below, select one of the example text sequences (e.g., "Simple", "Question"). The initial input embeddings are displayed. You can click on a token's embedding in the "Input Embeddings" step to select it as the Query token (its border will highlight).
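As a rough sketch of what such input embeddings look like, the snippet below builds them in NumPy. The token list and dimensions mirror the example in the visualization, the "learned" embeddings are random stand-ins, and sinusoidal positional encodings are one common (assumed) choice for the positional information mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["I", "like", "machine", "learning"]  # example sequence from the visualization
d_model = 8                                    # embedding dimension used in the example

# Learned token embeddings (random stand-ins here)
token_emb = rng.normal(size=(len(tokens), d_model))

# Sinusoidal positional encodings (one common choice; an assumption here)
pos = np.arange(len(tokens))[:, None]
i = np.arange(d_model // 2)[None, :]
angles = pos / (10000 ** (2 * i / d_model))
pos_enc = np.empty((len(tokens), d_model))
pos_enc[:, 0::2] = np.sin(angles)
pos_enc[:, 1::2] = np.cos(angles)

# Input to the attention layer: token embeddings plus positional information
X = token_emb + pos_enc
print(X.shape)  # (4, 8)
```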

The Multi-Head Attention Calculation: Step-by-Step Exploration

Now, let's walk through the process. Use the step indicator (1, 2, 3...) or the 'Next'/'Prev' buttons in the component below to advance through each stage.

  1. Input Embeddings: The starting point. Each token has an associated input vector. (Observe the initial vectors in the component).

  2. Project Q, K, V: This is where the "Multi-Head" aspect begins. For each attention head, the input embeddings are linearly projected using separate learned weight matrices (Wq, Wk, Wv) to create Query (Q), Key (K), and Value (V) vectors specific to that head. This allows each head to potentially focus on different aspects of the input. (Watch the component generate Q, K, V vectors for each head – notice the different colors representing different heads.)

  3. Head-Specific Attention Calculation (Repeated for each head): The core scaled dot-product attention happens independently within each head. Follow the steps for a single head (e.g., Head 1, then Head 2, etc.):

    • Scores (Q·Kᵀ): The Query vector of our chosen token is compared against the Key vectors of all tokens in the sequence using a dot product. This yields raw attention scores, indicating the relevance of each token's Key to the Query token's Query. (Observe the score calculation for the current head).
    • Scale Scores: The raw scores are scaled down by dividing by the square root of the head's dimension (√d_k). This scaling helps stabilize training, especially with larger dimensions. (See the scores being scaled).
    • Softmax: The scaled scores are passed through a softmax function. This converts the scores into attention weights – a probability distribution where weights sum to 1. Tokens deemed more relevant get higher weights. (Observe the weights visualized in the bar chart and numerical values).
    • Weight Values: Each token's Value vector is multiplied by its corresponding attention weight. This effectively amplifies the information from relevant tokens and diminishes information from less relevant ones. (See the Value vectors being scaled by the weights).
    • Sum Values (Output Z): The weighted Value vectors are summed together element-wise. This produces the final output vector Z for this specific head, representing an aggregated summary of information relevant to the Query token, as determined by this head's attention pattern. (Observe the summation resulting in the head's output vector Z).
  4. Concatenate Heads: After each head has independently computed its output vector (Z1, Z2,... Zh), these vectors are concatenated together into one larger vector. This combines the different perspectives learned by each head. (See the Z vectors from all heads being combined).

  5. Final Projection: The large concatenated vector is passed through one final linear projection layer (using weight matrix Wo). This mixes the information from all heads and projects it back down to the original embedding dimension, producing the final output embedding for the Query token. (Observe the final projection step).

  6. Output Embedding: This final vector is the contextualized embedding for the original Query token. It now incorporates information gathered from other relevant tokens in the sequence, weighted according to the attention mechanism across multiple heads. Compare this visually to the original input embedding.
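The six steps above can be sketched end-to-end in NumPy. This is an illustrative implementation, not the code behind the component: random matrices stand in for the learned weights Wq, Wk, Wv, and Wo, and the shapes follow the 4-token, dimension-8 example:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 4, 8, 2
d_k = d_model // n_heads                        # per-head dimension

X = rng.normal(size=(n_tokens, d_model))        # step 1: input embeddings

# Step 2: per-head projections (random stand-ins for the learned Wq, Wk, Wv)
Wq = rng.normal(size=(n_heads, d_model, d_k))
Wk = rng.normal(size=(n_heads, d_model, d_k))
Wv = rng.normal(size=(n_heads, d_model, d_k))
Wo = rng.normal(size=(n_heads * d_k, d_model))  # final projection matrix

head_outputs = []
for h in range(n_heads):                        # step 3, repeated per head
    Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]
    scores = Q @ K.T / np.sqrt(d_k)             # scores, then scaling by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    head_outputs.append(weights @ V)            # weight the Values and sum -> Z_h

Z = np.concatenate(head_outputs, axis=-1)       # step 4: concatenate heads
output = Z @ Wo                                 # step 5: final projection
print(output.shape)                             # step 6: one contextualized
                                                # embedding per token -> (4, 8)
```

Each row of `output` is the contextualized embedding for the corresponding token, exactly as described in step 6.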

Multi-Head Attention Visualization

An interactive exploration of how attention mechanisms work in transformer models

[Interactive component. Example input: the sequence "I like machine learning" as 4 token embeddings (dim = 8), written X = [x₁, x₂, ..., xₙ]. Each token is converted to an 8-dimensional vector representation, and token 0 ("I") is selected as the Query token.]

Multi-Head Attention Concepts

Scaled Dot-Product Attention

The core mechanism computes attention scores between a Query (Q) and all Keys (K): `Attention(Q, K, V) = softmax( (Q * K^T) / sqrt(d_k) ) * V`. The scaling factor keeps the dot products from growing with the key dimension; without it, large scores push the softmax into saturated regions where gradients become vanishingly small.
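The formula above translates directly into a few lines of NumPy. This is a minimal sketch of the scaled dot-product step in isolation (the function name is my own):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # raw scores, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights                     # output and attention weights
```

Subtracting the row-wise maximum before exponentiating is a standard numerical-stability trick; it does not change the softmax result.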

Multiple Heads

Instead of one large attention calculation, multi-head attention projects Q, K, and V multiple times with different learned matrices. This allows the model to jointly attend to information from different representation subspaces.

Interactive Multi-Head Attention Visualization

Key Concepts & Further Exploration

  • Scaled Dot-Product Attention: The core calculation involving Q, K, V, scaling, and softmax.
  • Multi-Head: The strategy of running the attention mechanism multiple times in parallel with different projections (Wq, Wk, Wv) to capture diverse relationships. Concatenation and final projection integrate these diverse perspectives.
  • Contextualization: The key outcome is an output embedding that reflects not just the token itself, but also its context within the sequence, as determined by the attention weights.

For concise definitions of these concepts, expand the 'Multi-Head Attention Concepts' section within the interactive tool.

Conclusion: The Engine of Context

Multi-Head Self-Attention is the sophisticated engine that allows Transformers to understand context. By projecting inputs into multiple Query, Key, and Value spaces (heads) and calculating weighted sums based on relevance, it produces rich, contextualized representations for each element in a sequence. This ability to dynamically weigh information across the entire input is fundamental to the power of Transformer models in various domains. Exploring the step-by-step calculation interactively helps demystify this complex but crucial mechanism.

If you found this explanation helpful, consider sharing it with others.