Transformer Model

Transformer is a deep learning model based on the attention mechanism, originally proposed by Vaswani et al. in the 2017 paper "Attention Is All You Need". It has revolutionized the field of Natural Language Processing (NLP) and has gradually expanded to almost all areas of AI, including computer vision.

> The core idea of Transformer is to abandon the traditional word-by-word processing of RNNs and instead use the attention mechanism to let the model see the entire sentence at once, determining the relationship between each word and every other word. This enables faster training and a stronger ability to model context.

* * *

## Why Do We Need Transformer?

Before Transformer appeared, the NLP field mainly relied on the RNN (Recurrent Neural Network) family of models, such as LSTM and GRU, which process text sequentially and have two key limitations.

### Limitations of RNN

An RNN processes text the way a human reads word by word, which brings the following problems:

* **Vanishing Gradient:** When processing long texts, the model tends to "forget" earlier information. For example, in a long sentence that begins "I borrowed a book about quantum physics from the library yesterday", the signal from the subject "I" may have faded by the time the model reaches the later words, making long-distance dependencies difficult to capture.
* **Inability to Parallelize:** An RNN must process the words in order, because each step depends on the hidden state of the previous step. It therefore cannot fully use the parallel computing capability of GPUs, which makes training on long texts slow.

### Transformer's Solution

Through the self-attention mechanism, the model processes all words simultaneously and dynamically calculates the correlation strength between each pair of words, addressing both of the problems above.

Figure: RNN sequential processing vs. Transformer parallel processing with global attention.

* * *

## Overall Architecture of Transformer

Transformer consists of two major parts, the Encoder and the Decoder, each composed of multiple stacked layers of identical modules.
> Analogy: The Encoder is like a "reader" that turns the input Chinese sentence into a set of semantically rich vectors; the Decoder is like a "translator" that refers to these vectors and generates the English output word by word.

Figure: Complete Transformer Encoder + Decoder architecture, with the Encoder on the left and the Decoder on the right (orange arrows show the cross-attention information flow).

### Encoder

The Encoder is composed of N identical layers, each containing two sub-layers:

* **Multi-Head Self-Attention:** Calculates the correlation between each word and the other words in the input sequence.
* **Feed-Forward Neural Network:** Applies the same non-linear transformation to each word independently.

Each sub-layer is followed by a residual connection and layer normalization.

### Decoder

The Decoder is also composed of N identical layers, each containing three sub-layers:

* **Masked Multi-Head Self-Attention:** Calculates the correlation between each word and the previous words in the output sequence (masking prevents information from future positions leaking in).
* **Encoder-Decoder Attention:** Calculates the correlation between the output sequence and the input sequence; this is the cross-attention shown in the figure.
* **Feed-Forward Neural Network:** Applies the same non-linear transformation to each word independently.

Similarly, each sub-layer is followed by a residual connection and layer normalization.

Before the Transformer model appeared, the mainstream models in the NLP field were RNN-based architectures, such as Long Short-Term Memory networks (LSTM) and Gated Recurrent Units (GRU). These models capture dependencies in sequences by processing the input sequentially, but have the following problems:

1. **Vanishing Gradient Problem:** Long-distance dependencies are difficult to capture.
2. **Limitations of Sequential Computation:** They cannot fully use the parallel computing capability of modern hardware, resulting in low training efficiency.

Transformer addresses these problems by introducing the self-attention mechanism, which lets the model process the entire input sequence simultaneously and dynamically assign different weights to each position in the sequence.

* * *

## Core: Self-Attention Mechanism

The self-attention mechanism is the most important component of Transformer. It answers one question: "When processing this word, which other words in the sentence should I focus on?"

### What are Q, K, V?

Each word's vector is linearly transformed into three roles: Query, Key, and Value.

Figure: Source of Q, K, V and the attention calculation process.

> Understanding Q/K/V by analogy with a search engine:
>
> Q (Query) = The keyword you type into the search box, representing "What am I looking for?"
>
> K (Key) = The title tag of each web page, representing "Which words can I match?"
>
> V (Value) = The actual content of each web page, representing "What information can I provide?"
>
> Attention weight = Similarity between Q and each K; final result = Weighted sum of all V using those weights.

### Attention Formula

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V $$

Where:

* $Q$ is the query matrix, $K$ is the key matrix, and $V$ is the value matrix.
* $d_{k}$ is the dimension of the key vectors. Dividing by $\sqrt{d_{k}}$ keeps the dot products from growing too large, which would push softmax into saturated regions where gradients become extremely small.
* softmax converts the raw scores into a probability distribution: each weight lies between 0 and 1, and the weights for each query sum to 1.

### Multi-Head Attention

A single attention perspective is limited, like looking at a problem from only one angle. Multi-head attention projects the input into h lower-dimensional subspaces; each "head" independently learns a different attention pattern, and the results are concatenated and passed through a final linear projection.
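
To make the attention formula and the multi-head idea concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not a reference implementation: the function names (`softmax`, `scaled_dot_product_attention`, `multi_head_attention`), the weight matrices `W_q`, `W_k`, `W_v`, `W_o`, and the toy sizes are all hypothetical; the weights are random rather than learned; masking is omitted; and the output projection `W_o` follows the original paper rather than anything stated in this article.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (..., n_q, d_k), K: (..., n_k, d_k), V: (..., n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # one distribution per query
    return weights @ V, weights

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    # X: (n, d_model); every W_*: (d_model, d_model); h must divide d_model.
    n, d_model = X.shape
    d_head = d_model // h

    def split_heads(M):
        # (n, d_model) -> (h, n, d_head)
        return M.reshape(n, h, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    heads, _ = scaled_dot_product_attention(Q, K, V)       # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate heads
    return concat @ W_o

# Toy example: 4 words, model dimension 8, 2 heads (illustrative sizes).
rng = np.random.default_rng(0)
n, d_model, h = 4, 8, 2
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
_, weights = scaled_dot_product_attention(X, X, X)
print(out.shape)      # (4, 8)
print(weights.shape)  # (4, 4)
```

Each row of `weights` is one word's attention distribution over all words in the sequence, which is the "similarity between Q and each K" from the search-engine analogy.
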
* * *

## Positional Encoding

Transformer processes all words simultaneously, so it has no built-in "sense of order": without extra information, "cat eats fish" and "fish eats cat" would be treated as the same. Positional encoding adds a "seat number" to each word, telling the model where each word sits in the sentence.

> Analogy: Just like marking "Question 1, Question 2" on an exam paper, positional encoding lets Transformer know that "I" is the 1st word and "love" is the 2nd word.

Since Transformer has no explicit sequence information (such as the time steps in an RNN), positional encoding is used to add position information to each word in the input sequence. In the original paper, positional encoding is generated using sine and cosine functions:

$$ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right) $$

$$ PE_{(pos, 2i + 1)} = \cos\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right) $$

Where $pos$ is the position of the word in the sequence, $i$ indexes the sine/cosine dimension pairs, and $d_{\text{model}}$ is the embedding dimension.

Positional encoding is added to the word embeddings to inject position information into the model.

### Encoder-Decoder Architecture

The Transformer model consists of two parts, the Encoder and the Decoder:

* **Encoder:** Converts the input sequence into a series of hidden representations. Each encoder layer contains a self-attention mechanism and a feed-forward neural network.
* **Decoder:** Generates the target sequence based on the encoder's output. Each decoder layer contains two attention mechanisms (self-attention and encoder-decoder attention) and a feed-forward neural network.

* * *

## Residual Connection and Layer Normalization

After each sub-layer (self-attention or feed-forward network), two operations are applied to its output: a residual connection and layer normalization. Together they help stabilize the training of deep networks.
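
The two operations can be sketched in NumPy as follows. This is a minimal sketch under assumptions, not a definitive implementation: the helper names (`layer_norm`, `feed_forward`, `add_and_norm`), the ReLU non-linearity, the hidden size `d_ff`, and the random weights are all hypothetical choices for illustration, and the order shown, LayerNorm(x + Sublayer(x)), is the "followed by" arrangement described in this article.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each position (row) across its feature dimension,
    # then apply a per-feature scale (gamma) and shift (beta).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward network; ReLU is an assumed non-linearity.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def add_and_norm(x, sublayer_out, gamma, beta):
    # Residual connection (x + F(x)) followed by layer normalization.
    return layer_norm(x + sublayer_out, gamma, beta)

# Toy example: 4 words, model dimension 8, hidden size 32 (illustrative).
rng = np.random.default_rng(0)
n, d_model, d_ff = 4, 8, 32
x = rng.normal(size=(n, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
gamma, beta = np.ones(d_model), np.zeros(d_model)

y = add_and_norm(x, feed_forward(x, W1, b1, W2, b2), gamma, beta)
print(y.shape)  # (4, 8)
```

With `gamma` set to ones and `beta` to zeros, every row of `y` has a mean close to 0 and a standard deviation close to 1, regardless of how large the feed-forward output was.
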
Figure: Residual connections let gradients take a "shortcut" path, while layer normalization stabilizes training.

* **Residual Connection:** Adds the sub-layer's input directly to its output (output = F(x) + x), which mitigates vanishing gradients in deep networks and lets the model effectively skip a layer's transformation when it is not useful.
* **Layer Normalization:** Normalizes the activations of each position across the feature dimension, which makes training more stable and helps it converge faster.

* * *

## Advantages of Transformer

Compared with traditional RNN architectures, Transformer has the following significant advantages:

#### Parallel Computing
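
As a rough illustration of the contrast drawn earlier between sequential RNN processing and processing all words at once, the sketch below compares the two computation patterns in NumPy. Everything in it is an assumption for illustration: the simple tanh recurrence stands in for an RNN cell, the weights `W_x` and `W_h` are random, the sizes are arbitrary, and the attention side shows only the score computation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                  # sequence length and vector size (illustrative)
X = rng.normal(size=(n, d))  # one row per word

# RNN-style: step t needs the hidden state from step t-1, so the loop
# over positions has to run in order.
W_x, W_h = rng.normal(size=(d, d)), rng.normal(size=(d, d))
h = np.zeros(d)
hidden_states = []
for x_t in X:
    h = np.tanh(x_t @ W_x + h @ W_h)
    hidden_states.append(h)
hidden_states = np.stack(hidden_states)

# Attention-style: the scores for every pair of positions come from one
# matrix product, with no dependency between rows.
scores = X @ X.T / np.sqrt(d)

print(hidden_states.shape)  # (6, 4)
print(scores.shape)         # (6, 6)
```

The loop carries `h` from one iteration to the next, while the matrix product has no such dependency, which is why hardware such as GPUs can compute all rows of `scores` at the same time.
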
