Hierarchical Attention Visualization
Illustrating multi-stage processing with local attention and feature merging.
Processing Steps
Input Image
The process starts with the input image. It's conceptually divided into patches (like ViT), but attention will operate on windows of these patches.
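As a rough illustration of this patch layout, the sketch below splits an image into non-overlapping patch tokens. The sizes (224x224 input, 4x4 patches) and the use of PyTorch are assumptions for the example, not values taken from the visualization.

```python
import torch

# Hypothetical sizes: 224x224 RGB image, 4x4-pixel patches (assumed for illustration).
B, C, H, W = 1, 3, 224, 224
patch_size = 4

img = torch.randn(B, C, H, W)

# Unfold height and width into non-overlapping patch_size x patch_size tiles.
patches = img.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
# -> (B, C, H/ps, W/ps, ps, ps); flatten each tile into one token vector.
tokens = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, (H // patch_size) * (W // patch_size), -1)
print(tokens.shape)  # torch.Size([1, 3136, 48]): 56*56 patches, each 3*4*4 = 48 values
```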
Technical Concepts
Local Window Attention
Instead of computing attention across all patches (as in standard ViT), attention is restricted to non-overlapping local windows (e.g., 7x7 patches). For a fixed window size, this reduces the computational complexity from quadratic to linear in the number of patches, while still capturing local interactions effectively. Models like Swin Transformer additionally shift the windows in alternating layers so information can flow across window boundaries (shifted windows are conceptually important but not drawn in this simplified visualization).
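The sketch below shows the core idea in PyTorch: the feature map is partitioned into windows, and self-attention is computed independently inside each window. The window size, token grid, and channel count are illustrative assumptions, and learned projections, multiple heads, the relative position bias, and window shifting are omitted.

```python
import torch
import torch.nn.functional as F

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows of shape
    (num_windows * B, window_size * window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

def local_window_attention(x, window_size=7):
    """Single-head self-attention computed independently inside each window
    (a simplified sketch; real implementations add learned projections, heads,
    and a relative position bias)."""
    windows = window_partition(x, window_size)                  # (B*nW, M*M, C)
    q, k, v = windows, windows, windows                         # no learned projections here
    scale = windows.shape[-1] ** -0.5
    attn = F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)   # (B*nW, M*M, M*M)
    return attn @ v                                             # attention never crosses window borders

# Illustrative sizes: 56x56 tokens with 96 channels, 7x7 windows (assumed, not from the text).
x = torch.randn(1, 56, 56, 96)
out = local_window_attention(x, window_size=7)
print(out.shape)  # torch.Size([64, 49, 96]): 8*8 = 64 windows of 49 tokens each
```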
Hierarchical Structure & Merging
As the network goes deeper (through stages), patch merging layers reduce the number of tokens (i.e., the spatial resolution) while increasing the feature dimension. For example, the features of each 2x2 group of neighboring patches/tokens can be concatenated (giving 4x the channels) and then linearly projected down, typically to 2x the original dimension. This creates a hierarchical representation, similar to CNNs, allowing the model to learn features at different scales, and the effective receptive field of the attention windows grows at deeper stages.
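A minimal patch-merging sketch, again in PyTorch with illustrative sizes, following the 2x2 concatenation plus linear projection described above (the 4C -> 2C reduction mirrors Swin Transformer, but the exact dimensions here are assumptions):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 group of neighboring tokens (4C channels) and
    project linearly, here to 2C channels."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):               # x: (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]        # top-left token of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]        # bottom-left
        x2 = x[:, 0::2, 1::2, :]        # top-right
        x3 = x[:, 1::2, 1::2, :]        # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)    # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))        # (B, H/2, W/2, 2C)

# Illustrative: 56x56 tokens with 96 channels become 28x28 tokens with 192 channels.
x = torch.randn(1, 56, 56, 96)
print(PatchMerging(96)(x).shape)  # torch.Size([1, 28, 28, 192])
```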
Benefits
- Linear computational complexity w.r.t. image size (vs. quadratic for ViT); a rough comparison follows this list.
- Suitable for high-resolution images and dense prediction tasks (segmentation, detection).
- Captures multi-scale features naturally through hierarchy.
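To make the first benefit concrete, here is a back-of-the-envelope comparison of the attention cost for global versus windowed self-attention, using the standard per-layer estimates 2(hw)^2*C and 2M^2*hw*C; the token and channel counts are illustrative assumptions.

```python
# Rough FLOP comparison of global vs. windowed self-attention.
# The shared projection cost (4*hw*C^2) is identical for both and omitted.
h = w = 56          # tokens per side after 4x4 patching of a 224x224 image (illustrative)
C = 96              # channel dimension (illustrative)
M = 7               # window size

global_attn = 2 * (h * w) ** 2 * C       # grows quadratically with the number of tokens
window_attn = 2 * M ** 2 * (h * w) * C   # grows linearly with the number of tokens
print(f"global:   {global_attn / 1e9:.2f} GFLOPs")
print(f"windowed: {window_attn / 1e9:.2f} GFLOPs ({global_attn / window_attn:.0f}x fewer)")
```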