At its heart, the Transformer replaces recurrence with a mechanism called Attention. The intuition is that for any given word in a sentence, the model should 'attend' to the other words that provide necessary context. For instance, in the phrase 'The animal didn't cross the street because it was too tired,' the representation of 'it' should draw mostly on 'animal' rather than 'street.' This is achieved by projecting the input embeddings into three sets of vectors, Queries, Keys, and Values, which lets the model perform a fuzzy lookup: each query is compared against every key, and the resulting similarities determine how much of each value is retrieved.
Mathematically, we begin with an input matrix $X \in \mathbb{R}^{n \times d}$, where $n$ is the sequence length and $d$ is the embedding dimension. We project $X$ into three spaces using learned weight matrices $W^Q, W^K, W^V \in \mathbb{R}^{d \times d_k}$, giving $Q = XW^Q$, $K = XW^K$, and $V = XW^V$. The Scaled Dot-Product Attention is defined as: $$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$ Here, entry $(i, j)$ of $QK^T$ is the dot product between the query of token $i$ and the key of token $j$, so the matrix scores the alignment between every pair of tokens. The softmax is applied row-wise, so each token's weights over the sequence sum to 1. We divide by $\sqrt{d_k}$ because the dot products tend to grow in magnitude as $d_k$ increases, which would push the softmax function into saturated regions with extremely small gradients.
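As a rough illustration of the formula above, here is a minimal NumPy sketch. The function names, the toy sizes, and the randomly initialized weights are assumptions made for this example; in a real model the weights are learned, and this is not a reference implementation.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) pairwise alignment scores
    weights = softmax(scores, axis=-1)   # row-wise: each row sums to 1
    return weights @ V, weights

# Toy sizes and random weights, chosen for illustration only.
rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 3
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(output.shape, weights.shape)
```

Row $i$ of `weights` is the distribution that token $i$ places over all tokens, and row $i$ of `output` is the corresponding weighted average of the value vectors.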
Single-head attention is limiting because each token receives only one set of attention weights, so a single weighted average has to serve every kind of relationship at once. Multi-Head Attention (MHA) allows the model to jointly attend to information from different representation subspaces. With $h$ parallel attention heads, each with its own set of projection matrices $W_i^Q, W_i^K, W_i^V$, different heads are free to specialize: one head might track syntactic dependencies (like subject-verb agreement) while another tracks semantic relations (like coreference).
The formal process for MHA involves computing the output of each head independently and then concatenating them. Let $\text{head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V)$, where $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times d_k}$ (a common choice is $d_k = d/h$). The final output is computed by multiplying the concatenated heads by a final linear projection matrix $W^O \in \mathbb{R}^{hd_k \times d}$: $$ \text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O $$ The concatenation has shape $n \times hd_k$, and $W^O$ mixes the per-head outputs and maps them back into the original model dimension $d$.
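The head-wise computation and concatenation can be sketched in NumPy as follows. This is a minimal sketch for the self-attention case, assuming the per-head projection matrices are stacked into arrays of shape `(h, d, d_k)`; that layout, the function names, and the random weights are choices made for this example only.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """Multi-head self-attention.

    X: (n, d); W_Q, W_K, W_V: (h, d, d_k); W_O: (h * d_k, d).
    """
    h, _, d_k = W_Q.shape
    n = X.shape[0]
    # matmul broadcasts X against the stack of h matrices: (h, n, d_k).
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # (h, n, n)
    heads = softmax(scores, axis=-1) @ V              # (h, n, d_k)
    # Concat(head_1, ..., head_h) along the feature axis: (n, h * d_k).
    concat = heads.transpose(1, 0, 2).reshape(n, h * d_k)
    return concat @ W_O                               # (n, d)

# Toy sizes and random weights, chosen for illustration only.
rng = np.random.default_rng(0)
n, d, h = 5, 8, 2
d_k = d // h
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(h, d, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)
```

Stacking the heads along a leading axis lets one batched matrix product compute all heads at once. Looping over the heads and concatenating the results would give the same output.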
Following the attention mechanism, the Transformer employs a Position-wise Feed-Forward Network (FFN). While attention allows tokens to interact with each other, the FFN applies the same transformation to each token independently, with no mixing across positions. The intuition is to project the token into a higher-dimensional space to extract complex features and then project it back. The FFN is often interpreted as a localized 'knowledge base' where the model stores patterns learned during training.
The FFN consists of two linear transformations with a non-linear activation function (typically ReLU or GELU) in between. Given an input $x$, the operation with ReLU is expressed as: $$ \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 $$ where $W_1 \in \mathbb{R}^{d \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times d}$, with $d_{ff}$ typically being much larger than $d$ (e.g., 2048 vs 512). This 'expansion-contraction' architecture applies a non-linear transformation to each position independently, using the same weights at every position in the sequence.
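The ReLU form of the FFN can be sketched in a few lines of NumPy. The function name, the small sequence length, and the randomly initialized parameters (with an arbitrary 0.02 scale) are assumptions for illustration; the dimensions follow the 512 and 2048 example from the text.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each row of x."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # expand to d_ff, then ReLU
    return hidden @ W2 + b2                 # contract back to d

# Random parameters for illustration only; real weights are learned.
rng = np.random.default_rng(0)
n, d, d_ff = 3, 512, 2048
x = rng.normal(size=(n, d))
W1 = rng.normal(size=(d, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d)) * 0.02
b2 = np.zeros(d)

y = position_wise_ffn(x, W1, b1, W2, b2)

# Position-wise check: processing token 0 alone matches row 0 of the
# full result, since no information flows between positions.
y0 = position_wise_ffn(x[:1], W1, b1, W2, b2)
print(y.shape, np.allclose(y[0], y0[0]))
```

The final check makes the 'position-wise' property concrete: each row of the output depends only on the corresponding row of the input.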