To understand Multi-Head Attention (MHA), we must first grasp the intuition of 'attention' as a soft-lookup mechanism. In a typical sequence, a token's meaning depends on its context; for instance, the word 'bank' differs in meaning depending on whether 'river' or 'money' appears nearby. Mathematically, we represent this by mapping each input vector into three distinct spaces: Queries ($Q$), Keys ($K$), and Values ($V$). The Query represents what the token is looking for, the Key represents what the token contains, and the Value is the information to be extracted if a match is found.
The core operation is Scaled Dot-Product Attention. For queries $Q \in \mathbb{R}^{n \times d_k}$, keys $K \in \mathbb{R}^{m \times d_k}$, and values $V \in \mathbb{R}^{m \times d_v}$, the alignment between each query and each key is computed via the dot product, which measures similarity. As $d_k$ grows, these dot products grow in magnitude and push the softmax into saturated regions where its gradients become very small, so the scores are divided by $\sqrt{d_k}$, the square root of the head dimension. The formal operation is defined as: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ Here, the softmax is applied row-wise, so each query's attention weights sum to 1 and its output is a weighted average of the Value vectors in $V$.
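The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `softmax` and `scaled_dot_product_attention`, the value dimension `d_v`, and the random test shapes are assumptions made for the example, and masking and batching are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the row maximum avoids overflow and does not change the result.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v) -> output (n, d_v), weights (n, m)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, m): similarity of each query to each key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted average of the rows of V

# Illustrative shapes only: n = 3 queries, m = 5 keys/values, d_k = 8, d_v = 4.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 4))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)  # (3, 4) (3, 5)
```

When all keys are identical, every query sees the same score for every key, so the weights become uniform and each output row is simply the mean of the rows of $V$. That is a handy sanity check for any implementation.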
A single attention head produces only one set of attention weights per query, so it must blend every kind of relationship into a single weighted average. Multi-Head Attention addresses this by performing the attention process $h$ times in parallel. Each 'head' uses its own weight matrices $W_Q^{(i)}$, $W_K^{(i)}$, and $W_V^{(i)}$, allowing the model to attend simultaneously to different types of relationships, such as syntactic dependencies and semantic similarities. The outputs of the heads are concatenated and then projected back to the original dimension using a final weight matrix $W^O$:
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$ where $\text{head}_i = \text{Attention}(QW_Q^{(i)}, KW_K^{(i)}, VW_V^{(i)})$. This architectural choice allows the model to jointly attend to information from different representation subspaces at different positions, which a single head tends to average away. In the original Transformer each head works in a reduced dimension $d_k = d_{model}/h$, so the total computational cost is similar to that of a single head with full dimensionality.
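As a rough sketch of the two equations above, the following NumPy code loops over the heads explicitly so that it mirrors the formula term by term. Production implementations usually fuse the heads into a single batched matrix multiplication instead. The helper names, the choice $d_k = d_v = d_{model}/h$, and the random weights are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V: lists of h per-head projections; W_O: (h * d_v, d_model)."""
    heads = [
        attention(Q @ wq, K @ wk, V @ wv)        # head_i from the formula
        for wq, wk, wv in zip(W_Q, W_K, W_V)
    ]
    return np.concatenate(heads, axis=-1) @ W_O  # Concat(head_1..head_h) W^O

# Assumed toy sizes: 4 tokens, d_model = 16, h = 4 heads, d_k = d_v = d_model / h.
rng = np.random.default_rng(0)
n, d_model, h = 4, 16, 4
d_k = d_model // h
scale = 1.0 / np.sqrt(d_model)
X = rng.normal(size=(n, d_model))
W_Q = [rng.normal(scale=scale, size=(d_model, d_k)) for _ in range(h)]
W_K = [rng.normal(scale=scale, size=(d_model, d_k)) for _ in range(h)]
W_V = [rng.normal(scale=scale, size=(d_model, d_k)) for _ in range(h)]
W_O = rng.normal(scale=scale, size=(h * d_k, d_model))

# Self-attention: queries, keys, and values all come from the same sequence X.
Y = multi_head_attention(X, X, X, W_Q, W_K, W_V, W_O)
print(Y.shape)  # (4, 16)
```

Passing `X` for all three arguments gives self-attention. Passing a different sequence for the keys and values would give cross-attention with the same function.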
Following the attention mechanism, the Transformer employs a Position-wise Feed-Forward Network (FFN). While attention manages the interactions *between* tokens, the FFN processes each token *individually* and identically. This ensures that the model can apply a non-linear transformation to the integrated context. The FFN consists of two linear transformations with a non-linear activation—typically ReLU or GELU—sandwiched between them.
With a ReLU activation, the mathematical formulation of the FFN is: $$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$ Here, $W_1$ typically projects the input into a higher-dimensional space (e.g., from $d_{model} = 512$ to $d_{ff} = 2048$), and $W_2$ projects it back to $d_{model}$. This 'expansion-contraction' cycle allows the network to learn complex patterns. The linear layers are also often interpreted as a key-value memory that stores factual knowledge in their weights and is queried by the features the attention heads extract.
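A minimal NumPy sketch of this formula, using the ReLU variant and the example sizes quoted above. The function name `ffn`, the small random initialization, and the sequence length of 10 are assumptions chosen for the example.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """x: (n, d_model) -> (n, d_model); every row (token) is processed independently."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # expand to d_ff, then apply ReLU
    return hidden @ W2 + b2                # contract back to d_model

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(10, d_model))  # 10 token positions
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (10, 512)

# Position-wise: running a single token alone gives the same row as in the batch.
print(np.allclose(ffn(x[3:4], W1, b1, W2, b2), y[3:4]))
```

The final check illustrates the "position-wise" property: because no operation mixes rows, the result for a token does not depend on which other tokens are present.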
The synergy between MHA and FFN is critical: MHA dynamically routes information across the sequence, while the FFN refines that information at each position. By utilizing residual connections and layer normalization, formulated as $\text{LayerNorm}(x + \text{Sublayer}(x))$, the architecture maintains stable gradients across many layers. This combination enables the Transformer to scale to billions of parameters while remaining computationally efficient due to the inherent parallelism of these linear algebra operations.
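The wrapper $\text{LayerNorm}(x + \text{Sublayer}(x))$ can be sketched as follows. The learnable scale `gamma`, shift `beta`, and the constant `eps` are the usual layer-normalization parameters but are not defined in the text above, so treat them as assumptions. The `sublayer` argument is a stand-in for either MHA or the FFN, and `np.tanh` is used here only as a placeholder.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token vector (last axis) to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def residual_block(x, sublayer, gamma, beta):
    # LayerNorm(x + Sublayer(x)): the residual path carries x through unchanged.
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
d_model = 32
x = rng.normal(size=(6, d_model))
gamma, beta = np.ones(d_model), np.zeros(d_model)

# np.tanh is a placeholder sublayer; a real block would pass MHA or the FFN.
y = residual_block(x, np.tanh, gamma, beta)
print(y.shape)  # (6, 32)
```

With `gamma` set to ones and `beta` to zeros, every output row has approximately zero mean and unit variance, regardless of how large the sublayer's output was. This keeps activations on a similar scale from layer to layer.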