
Hierarchical Attention in Vision Transformers

Explore how Vision Transformers (ViT) process an image as a sequence of patches, and how a special CLS token aggregates information across those patches for classification.

ViT Explained: The Role of the CLS Token

Figure: Vision Transformer (ViT) with CLS token. The diagram walks through the CLS (classification) token being added to the image patches, gathering information through attention, and finally predicting the class of the image.
Step 1: Adding the CLS Token
[Diagram: a special learnable CLS token is prepended to the sequence of image patches P1 through P9.]
Inspired by BERT, a special [CLS] token is added to the start of the image patch sequence. Its goal is to aggregate information from all patches and represent the entire image for classification.
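To make this concrete, here is a minimal sketch in PyTorch of how a learnable CLS token can be prepended to a batch of patch embeddings. The class name, embedding size, and 3x3 patch grid are illustrative assumptions, not the original article's code.

import torch
import torch.nn as nn

class PatchSequenceWithCLS(nn.Module):
    def __init__(self, embed_dim=768, num_patches=9):
        super().__init__()
        # Learnable [CLS] token, shared across the batch
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learnable positional embeddings for the CLS token plus all patches
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, patch_embeddings):
        # patch_embeddings: (batch, num_patches, embed_dim)
        batch_size = patch_embeddings.shape[0]
        cls = self.cls_token.expand(batch_size, -1, -1)   # (batch, 1, embed_dim)
        x = torch.cat([cls, patch_embeddings], dim=1)      # prepend CLS: (batch, num_patches + 1, embed_dim)
        return x + self.pos_embed

# Example: a 3x3 grid of patches (P1..P9), each embedded to 768 dimensions
patches = torch.randn(2, 9, 768)
tokens = PatchSequenceWithCLS()(patches)
print(tokens.shape)  # torch.Size([2, 10, 768])

After the transformer encoder runs, the output at position 0 (the CLS token) is typically fed to a small classification head, since attention has let it gather information from every patch.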
