To understand the Transformer, one must first grasp the intuition of 'Attention'. At its core, attention is a differentiable mechanism for calculating a weighted average of values, where the weights are determined by the similarity between a query and a set of keys. Imagine a library: the 'Query' is what you are looking for, the 'Keys' are the labels on the spines of books, and the 'Values' are the actual contents of those books. By calculating the alignment between your query and the keys, the system decides which values to aggregate to form the most relevant representation of the current token.
Mathematically, we begin with Scaled Dot-Product Attention. Given an input matrix $X$, we project it into three distinct spaces using learned weight matrices $W^Q$, $W^K$, and $W^V$. The Query, Key, and Value matrices are computed as $Q = XW^Q$, $K = XW^K$, and $V = XW^V$. The attention scores are the dot products of queries with keys, scaled by $1/\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors. Without this scaling, the dot products grow in magnitude as $d_k$ increases, which pushes the softmax into saturated regions where its gradients become vanishingly small. The full operation is: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ The softmax is applied row-wise, so each token's output is a weighted average of the value vectors, with the largest weights on the keys that best match that token's query.
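The formula above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: it uses nested lists instead of an array library, the helper names (`matmul`, `transpose`, `softmax`, `attention`) are chosen here for illustration, and the toy matrices stand in for the already-projected $Q$, $K$, and $V$.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d_k)) V and return (output, weights)."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    # Scale every score by 1 / sqrt(d_k), then normalise each row.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy example: 3 tokens with d_k = d_v = 2 (values chosen arbitrarily).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1, so each row of `output` is a weighted average of the rows of `V`. A query of all zeros scores every key equally and therefore returns the plain mean of the values.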
Single-head attention is limited because it produces only one set of attention weights per token, so every kind of relationship must be blended into a single average. Multi-Head Attention (MHA) addresses this by running multiple attention mechanisms in parallel. Each 'head' uses different learned projections, allowing the model to attend to different types of information simultaneously. For example, one head might capture syntactic relationships while another captures semantic dependencies. The outputs of these $h$ heads are concatenated and projected back to the original model dimension: $$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$ where each $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$. In this formula, $Q$, $K$, and $V$ denote the inputs to each head's own projections; in self-attention all three are the input $X$.
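As a rough sketch of the concatenate-and-project step, the following plain-Python example runs two heads over the same input and merges them with $W^O$. Several things here are assumptions made for illustration: the weights are random stand-ins for learned parameters, `random_matrix` and `multi_head` are hypothetical helper names, each head has width $d_{model}/h$, and the self-attention case $Q = K = V = X$ is used.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, [list(col) for col in zip(*k)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def random_matrix(rows, cols, rng):
    """Stand-in for a learned weight matrix (hypothetical helper)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head(x, heads, w_o):
    """Concat(head_1, ..., head_h) W^O for self-attention on x."""
    head_outputs = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate along the feature axis: one combined row per token.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)

rng = random.Random(0)
seq_len, d_model, h = 4, 8, 2
d_head = d_model // h  # assumed per-head width
x = random_matrix(seq_len, d_model, rng)
heads = [
    tuple(random_matrix(d_model, d_head, rng) for _ in range(3))  # (W_i^Q, W_i^K, W_i^V)
    for _ in range(h)
]
w_o = random_matrix(h * d_head, d_model, rng)
y = multi_head(x, heads, w_o)  # seq_len rows, d_model columns
```

Because each head projects down to `d_head` columns, the concatenation has `h * d_head` columns and $W^O$ maps it back to `d_model`, so the output has the same shape as the input.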
Following the attention mechanism, the Transformer employs a Position-wise Feed-Forward Network (FFN). While attention captures global relationships between tokens, the FFN operates on each position independently and identically: the same weights are applied to every token, and no information passes between positions. Its purpose is to process the information gathered by the attention heads. It expands each token's representation into a higher-dimensional space, applies a non-linear activation, and projects the result back, which increases the model's capacity to learn complex patterns. In effect, it is a local transformation that refines the representation of each token based on the context gathered by the attention layer.
The mathematical structure of the FFN consists of two linear transformations separated by a non-linear activation function, typically the Rectified Linear Unit (ReLU) or GELU. For a given input vector $x$, the ReLU version is defined as: $$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$ Here, $W_1$ projects the input from dimension $d_{model}$ to a larger inner dimension $d_{ff}$ (often 4 times larger), and $W_2$ projects it back down to $d_{model}$. This 'expansion-contraction' architecture lets the network map input features into a high-dimensional latent space, where they can be easier to separate, before projecting them back.
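A small sketch of the ReLU form of this equation follows, using plain Python lists. The weights are hand-picked toy values rather than trained parameters, and the sizes ($d_{model} = 2$, $d_{ff} = 4$) are chosen only to keep the example readable.

```python
def ffn(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single token vector x."""
    # First linear map followed by ReLU: d_model -> d_ff.
    hidden = [
        max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
        for col, b in zip(zip(*w1), b1)
    ]
    # Second linear map: d_ff -> d_model.
    return [
        sum(hi * w for hi, w in zip(hidden, col)) + b
        for col, b in zip(zip(*w2), b2)
    ]

# Toy parameters (illustrative values, not trained weights).
W1 = [[1.0, -1.0, 0.5, 0.0],
      [0.0, 1.0, -0.5, 2.0]]
b1 = [0.0, 0.0, 0.0, -1.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0],
      [0.5, -0.5]]
b2 = [0.1, -0.1]

# The same function is applied to every position independently.
tokens = [[1.0, 2.0], [1.0, 2.0], [-1.0, 0.5]]
outputs = [ffn(t, W1, b1, W2, b2) for t in tokens]
```

The first two tokens are identical, so their outputs are identical as well: the FFN has no access to a token's position or to its neighbours.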
Integrating these two components, MHA and FFN, creates a complementary pairing. MHA allows for 'communication' across the sequence, enabling tokens to exchange information based on relevance. The FFN then performs 'computation' on each token's resulting representation. This cycle of communication and computation, repeated across multiple layers, allows the Transformer to build hierarchical representations, moving from simple token-level meanings to complex, context-aware semantic embeddings.
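The communication/computation split can be checked with a deliberately simplified sketch. The assumptions are significant: attention here uses $Q = K = V = X$ with no learned projections and a single head, the FFN uses fixed toy weights without biases, and the function names (`self_attention`, `relu_ffn`, `layer`) are hypothetical. The example changes only the first token and looks at what happens to the third.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Communication step: each output row mixes all rows of x (Q = K = V = x)."""
    scale = math.sqrt(len(x[0]))
    out = []
    for q in x:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in x])
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(len(q))])
    return out

def relu_ffn(x):
    """Computation step: the same toy two-layer map applied to every token."""
    w1 = [[1.0, -1.0, 0.5, 0.0], [0.0, 1.0, -0.5, 2.0]]
    w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
    out = []
    for token in x:
        hidden = [max(0.0, sum(t * w for t, w in zip(token, col))) for col in zip(*w1)]
        out.append([sum(h * w for h, w in zip(hidden, col)) for col in zip(*w2)])
    return out

def layer(x):
    """One communication-then-computation cycle."""
    return relu_ffn(self_attention(x))

base = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edited = [[-2.0, 3.0], [0.0, 1.0], [1.0, 1.0]]  # only token 0 differs

# FFN alone: token 2 is unchanged, so its output is unchanged.
ffn_only_same = relu_ffn(base)[2] == relu_ffn(edited)[2]
# Attention: token 2 reads from token 0, so its output changes.
attention_same = self_attention(base)[2] == self_attention(edited)[2]
# Repeating the cycle, as in a multi-layer model.
stacked = layer(layer(base))
```

Here `ffn_only_same` is `True` and `attention_same` is `False`: only the attention step lets a change in one token reach another position, while the FFN transforms each position in isolation.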