Transformers: from self-attention to performance optimizations
The purpose of this post is to understand what happens under the hood, and which performance factors matter, when fine-tuning and running Transformer models locally, keeping multi-modality in mind and with an emphasis on decoder-only transformers (e.g. the GPT series). To that end, we first give a brief account of the transformer architecture, covering its design intuitions and the underlying mathematics, made concrete with illustrative diagrams and code snippets. We then work toward a comprehensive understanding of the performance optimizations widely adopted on top of the original transformer architecture...