
Hierarchical Attention in Vision Transformers

Explore how Vision Transformers (ViT) process an image as a sequence of patches, and how a special CLS token aggregates information across those patches for classification.

ViT Explained: The Role of the CLS Token

Figure: Vision Transformer (ViT) with CLS token. The diagram walks through the CLS (classification) token being added to the image patches, gathering information through attention, and finally predicting the class of the image.
Step 1: Adding the CLS Token
[Diagram: a special learnable CLS token is prepended to the sequence of image patches P1 through P9.]
Inspired by BERT, a special [CLS] token is added to the start of the image patch sequence. Its goal is to aggregate information from all patches and represent the entire image for classification.
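To make this concrete, here is a minimal sketch in PyTorch of how a learnable CLS token can be prepended to a batch of patch embeddings. The class name, embedding size, and 3x3 patch grid are illustrative assumptions, not the original article's code.

import torch
import torch.nn as nn

class PatchSequenceWithCLS(nn.Module):
    def __init__(self, embed_dim=768, num_patches=9):
        super().__init__()
        # Learnable [CLS] token, shared across the batch
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learnable positional embeddings for the CLS token plus all patches
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, patch_embeddings):
        # patch_embeddings: (batch, num_patches, embed_dim)
        batch_size = patch_embeddings.shape[0]
        cls = self.cls_token.expand(batch_size, -1, -1)   # (batch, 1, embed_dim)
        x = torch.cat([cls, patch_embeddings], dim=1)      # prepend CLS: (batch, num_patches + 1, embed_dim)
        return x + self.pos_embed

# Example: a 3x3 grid of patches (P1..P9), each embedded to 768 dimensions
patches = torch.randn(2, 9, 768)
tokens = PatchSequenceWithCLS()(patches)
print(tokens.shape)  # torch.Size([2, 10, 768])

After the transformer encoder runs, the output at position 0 (the CLS token) is typically fed to a small classification head, since attention has let it gather information from every patch.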
