A Survey of Techniques for Optimizing Transformer Inference

Transformers · Inference Optimization · Pruning · Quantization · Knowledge Distillation · Neural Architecture Search · Hardware Acceleration
(2023)

Paper Overview

This paper presents a comprehensive survey of techniques for optimizing the inference phase of transformer networks. Transformers have achieved remarkable success in natural language processing (NLP) and computer vision (CV), but their growing size and computational cost make efficient inference essential. This survey explores a wide range of optimization techniques, spanning algorithmic approaches such as knowledge distillation, pruning, and quantization, as well as hardware-level optimizations and the design of dedicated hardware accelerators.

Key Contributions

  1. Motivation and Challenges:

    • The paper emphasizes the growing computational and memory demands of transformers, highlighting the need for optimization to enable their deployment in resource-constrained environments and real-time applications. [cite: 156]
    • It also discusses the specific challenges of optimizing transformers, such as their complex architecture, the irregular distributions of weights and activations, and the need for specialized hardware to fully realize performance benefits. [cite: 169, 170]
  2. Algorithmic Optimization Techniques:

    • Knowledge Distillation (KD): The paper reviews various KD methods that transfer knowledge from a large "teacher" transformer to a smaller "student" network, enabling model compression while preserving accuracy. It covers different KD approaches, including task-agnostic and task-specific distillation, as well as distillation at different granularities (network-level, layer-level, attention-based, etc.). A minimal distillation-loss sketch appears after this list. [cite: 180, 181]
    • Pruning: This section explores pruning techniques that identify and remove redundant parameters in transformer models. It categorizes pruning methods based on saliency quantification (zeroth-, first-, and second-order), sparsity patterns (unstructured, semi-structured, and structured), and pruning granularity (element-wise, row/column, block, head, layer, etc.). The paper also discusses post-training pruning and hardware-aware pruning techniques. A magnitude-pruning sketch follows this list. [cite: 246, 247, 248]
    • Quantization: This part delves into quantization methods that reduce the precision of model parameters (weights and activations) to lower bitwidths, such as 8-bit integers. It covers various quantization approaches, including static vs. dynamic, uniform vs. mixed precision, and post-training quantization vs. quantization-aware training. The paper also discusses techniques for handling outliers and combining pruning with quantization. A minimal uniform-quantization sketch follows this list. [cite: 525, 526, 528, 529]
  3. Efficient Transformer Design:

    • The paper reviews efficient transformer architectures designed to reduce computational complexity and memory footprint. This includes techniques like LSH attention, linear attention, and the use of positive random features to approximate softmax attention. A linear-attention sketch appears after this list. [cite: 709, 710]
    • For computer vision, the paper discusses lightweight transformers like MobileViT, which combine convolution and attention mechanisms for efficient local-global feature representation. [cite: 746, 747]
  4. Neural Architecture Search (NAS):

    • This section explores the application of NAS to automate the design of efficient transformer architectures. It covers different search spaces (attention-only, hybrid attention-convolution), search strategies (reinforcement learning, one-shot, evolutionary, etc.), and the use of NAS for model compression and mixed-precision quantization. A toy evolutionary-search sketch appears after this list. [cite: 808, 809]
  5. Hardware Optimization Techniques:

    • The paper reviews hardware-level optimization techniques for transformers, including pipelining to overlap computations, optimizing matrix multiplication operations, and skipping redundant or trivial computations. [cite: 946, 978, 1012, 1013]
    • It also discusses dataflows that exploit reuse opportunities and the use of block-circulant matrices to compress weight storage. A block-circulant sketch appears after this list. [cite: 1134, 1208]
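
The knowledge-distillation methods in item 2 ultimately reduce to training the student against softened teacher outputs. The sketch below is a minimal, hypothetical PyTorch example of a response-level (logit) distillation loss; the temperature T, mixing weight alpha, and tensor shapes are illustrative, and the layer-level and attention-based variants covered in the survey add further terms.

```python
# Sketch of a response-level knowledge-distillation loss (illustrative
# hyperparameters and shapes; finer-grained KD variants add more terms).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft-target KL term (teacher -> student) with the usual hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients stay comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage: a batch of 4 examples over 10 classes with random logits.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```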
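
For the pruning taxonomy in item 2, the simplest zeroth-order criterion ranks weights by absolute magnitude. The sketch below builds an unstructured keep/prune mask at a target sparsity; the layer size and the 90% sparsity level are arbitrary, and structured or hardware-aware variants would instead score whole rows, heads, or blocks.

```python
# Minimal sketch of zeroth-order (magnitude-based) unstructured pruning:
# zero out the weights with the smallest absolute value. Real pipelines
# usually prune iteratively and fine-tune between steps.
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a 0/1 mask keeping the largest-magnitude (1 - sparsity) fraction of weights."""
    k = int(weight.numel() * sparsity)          # number of weights to remove
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()   # 1 = keep, 0 = prune

# Toy usage: prune 90% of a linear layer's weights.
layer = torch.nn.Linear(768, 768)
mask = magnitude_prune(layer.weight.data, sparsity=0.9)
layer.weight.data *= mask
print(f"remaining nonzeros: {mask.mean().item():.2%}")
```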
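
The quantization techniques in item 2 map weights and activations to low-bitwidth integers. Below is a minimal asymmetric uniform 8-bit post-training quantizer with a single per-tensor scale and zero-point; per-channel scales, mixed precision, and the outlier handling the survey discusses are omitted.

```python
# Sketch of asymmetric uniform 8-bit quantization of a weight tensor, as used
# in post-training quantization. Scale and zero-point come from the observed
# min/max of the tensor; calibration and outlier handling are left out.
import numpy as np

def quantize_uint8(x: np.ndarray):
    """Map float values to uint8 codes plus (scale, zero_point) for dequantization."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0 or 1.0      # avoid division by zero for constant tensors
    zero_point = round(-x_min / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# Toy usage: quantize random weights and measure reconstruction error.
w = np.random.randn(768, 768).astype(np.float32)
q, scale, zp = quantize_uint8(w)
err = np.abs(w - dequantize(q, scale, zp)).max()
print(f"max abs quantization error: {err:.4f}")
```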
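
Item 3 mentions linear attention as a way around the quadratic cost of softmax attention. The NumPy sketch below contrasts the two: applying a positive feature map (here the common elu(x)+1 choice) and reordering the matrix products avoids ever materializing the n-by-n attention matrix. This is a generic kernel-attention recipe under illustrative shapes, not any specific paper's implementation.

```python
# Sketch contrasting quadratic softmax attention with kernelized "linear"
# attention, which reorders the matmuls so cost scales linearly with
# sequence length.
import numpy as np

def softmax_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])               # (n, n): quadratic in sequence length
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, eps=1e-6):
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1 keeps features positive
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                                          # (d, d_v): independent of sequence length
    z = Qp @ Kp.sum(axis=0)                                # (n,): normalizer
    return (Qp @ kv) / (z[:, None] + eps)

# Toy usage: 1024 tokens with 64-dimensional heads.
n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```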
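
To make the evolutionary search strategy in item 4 concrete, here is a toy loop over a hypothetical transformer search space. The fitness function is an invented proxy; practical NAS systems score candidates with trained accuracy predictors, weight-sharing supernets, or measured hardware latency.

```python
# Toy sketch of evolutionary search over a small transformer search space.
# The fitness function is a made-up proxy, not a trained predictor.
import random

SEARCH_SPACE = {
    "layers": [4, 6, 8, 12],
    "heads": [4, 8, 12],
    "hidden": [256, 384, 512, 768],
}

def sample():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def mutate(arch):
    child = dict(arch)
    key = random.choice(list(SEARCH_SPACE))
    child[key] = random.choice(SEARCH_SPACE[key])
    return child

def fitness(arch):
    # Hypothetical proxy: reward capacity, penalize an estimated parameter budget.
    params = arch["layers"] * (12 * arch["hidden"] ** 2)   # rough per-block estimate
    capacity = arch["layers"] * arch["heads"] * arch["hidden"]
    return capacity / 1e4 - params / 1e8

# Simple (mu + lambda)-style loop: keep the top 5, refill with mutations.
population = [sample() for _ in range(20)]
for _ in range(50):
    population.sort(key=fitness, reverse=True)
    parents = population[:5]
    population = parents + [mutate(random.choice(parents)) for _ in range(15)]

print("best architecture:", max(population, key=fitness))
```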
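
Item 5 cites block-circulant matrices as a hardware-friendly way to compress weight storage. In the sketch below (with illustrative block sizes), each b-by-b block is generated from a single length-b vector, so storage shrinks by a factor of b and each block-vector product becomes an FFT-based circular convolution; the result is checked against an explicitly materialized dense matrix.

```python
# Sketch of block-circulant weight compression: each b x b block of the weight
# matrix is a circulant matrix defined by one length-b vector, and the
# block-vector product is computed as a circular convolution via FFT.
import numpy as np

def circulant(c):
    """Materialize the circulant matrix whose first column is c (for checking only)."""
    n = len(c)
    return np.stack([np.roll(c, j) for j in range(n)], axis=1)

def block_circulant_matvec(blocks, x, b):
    """y = W x, where blocks[i][j] is the defining vector of the (i, j) circulant block."""
    rows, cols = len(blocks), len(blocks[0])
    x_parts = x.reshape(cols, b)
    y = np.zeros(rows * b)
    for i in range(rows):
        acc = np.zeros(b)
        for j in range(cols):
            # Circulant block times vector = circular convolution, done with FFTs.
            acc += np.fft.ifft(np.fft.fft(blocks[i][j]) * np.fft.fft(x_parts[j])).real
        y[i * b:(i + 1) * b] = acc
    return y

# Toy check against the explicitly materialized dense matrix.
b, rows, cols = 4, 3, 2
blocks = [[np.random.randn(b) for _ in range(cols)] for _ in range(rows)]
W = np.block([[circulant(blocks[i][j]) for j in range(cols)] for i in range(rows)])
x = np.random.randn(cols * b)
assert np.allclose(W @ x, block_circulant_matvec(blocks, x, b))
print("block-circulant product matches dense product")
```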

Conclusion

This survey provides a comprehensive overview of the state-of-the-art in optimizing transformer inference. By covering a wide range of algorithmic and hardware-level techniques, it offers valuable insights for researchers and practitioners seeking to deploy transformers in various applications and hardware platforms. The paper also highlights future research directions, including the development of more efficient hardware architectures, the exploration of co-design approaches, and the need for standardized benchmarks to facilitate fair comparison and reproducibility.
