Hierarchical Attention Visualization
Illustrating multi-stage processing with local attention and feature merging.
Processing Steps
Input Image
The process starts with the input image. It's conceptually divided into patches (like ViT), but attention will operate on windows of these patches.
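As a rough illustration of this patch layout, the sketch below splits an image into non-overlapping patch tokens. The sizes (224x224 input, 4x4 patches) and the use of PyTorch are assumptions for the example, not values taken from the visualization.

```python
import torch

# Hypothetical sizes: 224x224 RGB image, 4x4-pixel patches (assumed for illustration).
B, C, H, W = 1, 3, 224, 224
patch_size = 4

img = torch.randn(B, C, H, W)

# Unfold height and width into non-overlapping patch_size x patch_size tiles.
patches = img.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
# -> (B, C, H/ps, W/ps, ps, ps); flatten each tile into one token vector.
tokens = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, (H // patch_size) * (W // patch_size), -1)
print(tokens.shape)  # torch.Size([1, 3136, 48]): 56*56 patches, each 3*4*4 = 48 values
```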
Technical Concepts
Local Window Attention
Instead of computing attention across all patches (as in standard ViT), attention is restricted to non-overlapping local windows (e.g., 7x7 patches). For a fixed window size, this reduces the computational complexity from quadratic to linear in the number of patches, while still capturing local interactions effectively. Models like Swin Transformer additionally shift the windows in alternating layers so information can flow across window boundaries (shifted windows are conceptually important but not drawn in this simplified visualization).
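The sketch below shows the core idea in PyTorch: the feature map is partitioned into windows, and self-attention is computed independently inside each window. The window size, token grid, and channel count are illustrative assumptions, and learned projections, multiple heads, the relative position bias, and window shifting are omitted.

```python
import torch
import torch.nn.functional as F

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows of shape
    (num_windows * B, window_size * window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

def local_window_attention(x, window_size=7):
    """Single-head self-attention computed independently inside each window
    (a simplified sketch; real implementations add learned projections, heads,
    and a relative position bias)."""
    windows = window_partition(x, window_size)                  # (B*nW, M*M, C)
    q, k, v = windows, windows, windows                         # no learned projections here
    scale = windows.shape[-1] ** -0.5
    attn = F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)   # (B*nW, M*M, M*M)
    return attn @ v                                             # attention never crosses window borders

# Illustrative sizes: 56x56 tokens with 96 channels, 7x7 windows (assumed, not from the text).
x = torch.randn(1, 56, 56, 96)
out = local_window_attention(x, window_size=7)
print(out.shape)  # torch.Size([64, 49, 96]): 8*8 = 64 windows of 49 tokens each
```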
Hierarchical Structure & Merging
As the network goes deeper (through stages), patch merging layers reduce the number of tokens (i.e., the spatial resolution) while increasing the feature dimension. For example, the features of each 2x2 group of neighboring patches/tokens can be concatenated (giving 4x the channels) and then linearly projected down, typically to 2x the original dimension. This creates a hierarchical representation, similar to CNNs, allowing the model to learn features at different scales, and the effective receptive field of the attention windows grows at deeper stages.
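A minimal patch-merging sketch, again in PyTorch with illustrative sizes, following the 2x2 concatenation plus linear projection described above (the 4C -> 2C reduction mirrors Swin Transformer, but the exact dimensions here are assumptions):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 group of neighboring tokens (4C channels) and
    project linearly, here to 2C channels."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):               # x: (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]        # top-left token of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]        # bottom-left
        x2 = x[:, 0::2, 1::2, :]        # top-right
        x3 = x[:, 1::2, 1::2, :]        # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)    # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))        # (B, H/2, W/2, 2C)

# Illustrative: 56x56 tokens with 96 channels become 28x28 tokens with 192 channels.
x = torch.randn(1, 56, 56, 96)
print(PatchMerging(96)(x).shape)  # torch.Size([1, 28, 28, 192])
```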
Benefits
- Linear computational complexity w.r.t. image size (vs. quadratic for ViT); a rough comparison follows this list.
- Suitable for high-resolution images and dense prediction tasks (segmentation, detection).
- Captures multi-scale features naturally through hierarchy.
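To make the first benefit concrete, here is a back-of-the-envelope comparison of the attention cost for global versus windowed self-attention, using the standard per-layer estimates 2(hw)^2*C and 2M^2*hw*C; the token and channel counts are illustrative assumptions.

```python
# Rough FLOP comparison of global vs. windowed self-attention.
# The shared projection cost (4*hw*C^2) is identical for both and omitted.
h = w = 56          # tokens per side after 4x4 patching of a 224x224 image (illustrative)
C = 96              # channel dimension (illustrative)
M = 7               # window size

global_attn = 2 * (h * w) ** 2 * C       # grows quadratically with the number of tokens
window_attn = 2 * M ** 2 * (h * w) * C   # grows linearly with the number of tokens
print(f"global:   {global_attn / 1e9:.2f} GFLOPs")
print(f"windowed: {window_attn / 1e9:.2f} GFLOPs ({global_attn / window_attn:.0f}x fewer)")
```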