The Mathematical Foundations of Multi-Head Attention and Position-wise Feed-Forward Networks

The core intuition behind the Transformer is the ability to dynamically weigh the importance of different parts of an input sequence. Consider a word in a sentence: its meaning depends heavily on the surrounding context. Multi-Head Attention (MHA) lets the model attend to several types of relationships at once, such as syntactic dependencies and semantic associations, by projecting the input into multiple lower-dimensional subspaces and running attention in each. This keeps distinct contextual signals from being averaged into a single, blurry representation.

Mathematically, we begin with a sequence of embeddings $X \in \mathbb{R}^{n \times d_{model}}$. For each head $i$, we define three learnable weight matrices: $W_{Q,i}, W_{K,i}, W_{V,i} \in \mathbb{R}^{d_{model} \times d_k}$. The input $X$ is projected into Queries ($Q$), Keys ($K$), and Values ($V$) as follows: $Q_i = XW_{Q,i}$, $K_i = XW_{K,i}$, and $V_i = XW_{V,i}$. The attention mechanism then computes a weighted sum of the values, where the weights are determined by the scaled dot-product between the queries and the keys: $\text{Attention}(Q_i, K_i, V_i) = \text{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right) V_i$.
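
The projections and the attention formula above can be sketched for a single head in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the sizes, the random weights, and the names `softmax` and `attention_head` are assumptions chosen for the example, and the softmax is taken row-wise so that each query position gets its own distribution over the keys.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    """Scaled dot-product attention for one head (hypothetical helper).

    X: (n, d_model); W_Q, W_K, W_V: (d_model, d_k).
    Returns the head output (n, d_k) and the attention weights (n, n).
    """
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # each row sums to 1
    return weights @ V, weights

# Illustrative sizes and random weights (assumptions, not from the text).
rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 2
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = attention_head(X, W_Q, W_K, W_V)
print(out.shape, weights.shape)
```

Row $j$ of `weights` holds the attention that position $j$ pays to every position in the sequence, and row $j$ of `out` is the corresponding weighted sum of the value vectors.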

The factor $\sqrt{d_k}$ keeps the softmax in a well-behaved range. If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$, so its typical magnitude grows as $\sqrt{d_k}$. Large logits push the softmax into saturated regions where gradients are vanishingly small. Dividing by $\sqrt{d_k}$ brings the variance of the softmax inputs back to approximately $1$, which keeps gradients usable during backpropagation. Because each head $i$ has its own projection matrices, it learns a different linear transformation, allowing the model to 'look' at the sequence through different lenses.
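
The variance argument can be checked empirically. The sketch below assumes that query and key components are independent standard normal draws, which is an idealization and not something a trained model guarantees. The sample count and the name `results` are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples = 20_000

results = {}
for d_k in (16, 64, 256):
    # Idealized queries and keys: i.i.d. zero-mean, unit-variance components.
    q = rng.standard_normal((num_samples, d_k))
    k = rng.standard_normal((num_samples, d_k))
    dots = np.sum(q * k, axis=1)             # one dot product per sample
    raw_var = dots.var()                     # expected to be close to d_k
    scaled_var = (dots / np.sqrt(d_k)).var() # expected to be close to 1
    results[d_k] = (raw_var, scaled_var)
    print(f"d_k={d_k:4d}  raw variance={raw_var:8.2f}  scaled variance={scaled_var:.3f}")
```

Under these assumptions the raw variance tracks $d_k$ while the scaled variance stays near $1$ regardless of the head dimension.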

To combine the insights from all $h$ heads, we concatenate the results and apply a final linear projection. Let $\text{head}_i = \text{Attention}(Q_i, K_i, V_i)$ be the output of the $i$-th attention head. The multi-head output is formulated as: $\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O$, where $W^O \in \mathbb{R}^{hd_k \times d_{model}}$ is a learnable projection matrix. This projection maps the concatenated $n \times hd_k$ matrix back to $n \times d_{model}$, so the output has the same shape as the input and the residual connection $X + \text{MultiHead}(X)$ is well defined.
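
The concatenation and output projection can be sketched as follows. The example assumes $d_k = d_{model} / h$, a common choice that the text above does not require, and stacks the per-head weights into arrays of shape `(h, d_model, d_k)`. The function name `multi_head_attention` and all sizes are hypothetical.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """Multi-head self-attention (illustrative sketch).

    X: (n, d_model); W_Q, W_K, W_V: (h, d_model, d_k); W_O: (h * d_k, d_model).
    """
    heads = []
    for W_q, W_k, W_v in zip(W_Q, W_K, W_V):  # one iteration per head
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        d_k = Q.shape[-1]
        heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)  # (n, d_k)
    concat = np.concatenate(heads, axis=-1)                # (n, h * d_k)
    return concat @ W_O                                    # (n, d_model)

# Illustrative sizes and random weights (assumptions, not from the text).
rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
d_k = d_model // h
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))

Y = X + multi_head_attention(X, W_Q, W_K, W_V, W_O)  # residual connection
print(Y.shape)
```

The explicit loop over heads is written for readability. Practical implementations usually batch the heads into a single tensor operation, which computes the same result.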

Once the attention mechanism has aggregated contextual information, the model requires a way to process these features independently at each position. This is achieved by the Position-wise Feed-Forward Network (FFN). The intuition is that while MHA handles interaction between tokens, the FFN handles the refinement of the representation for each individual token. It acts as a local, non-linear transformation that projects the combined attention signal into a higher-dimensional space to extract more complex patterns.

The FFN consists of two linear transformations separated by a non-linear activation function, typically ReLU or GELU. With ReLU, the operation on a single position vector $x \in \mathbb{R}^{d_{model}}$ is defined as: $\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$, where $W_1 \in \mathbb{R}^{d_{model} \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}$. In matrix form for the entire sequence $X$, this is $\text{FFN}(X) = \text{ReLU}(XW_1 + b_1)W_2 + b_2$, with the same weights applied to every row. Usually, the inner dimension $d_{ff}$ is significantly larger than $d_{model}$ (e.g., $2048$ vs $512$), creating an expand-then-contract structure that increases the model's capacity to memorize and transform patterns.
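
A minimal NumPy sketch of the ReLU variant makes the position-wise property concrete. The weights here are random and the `0.02` scale is an arbitrary assumption made for the example. The final check confirms that applying the network to one row alone gives the same result as the corresponding row of the full-sequence output.

```python
import numpy as np

def ffn(X, W1, b1, W2, b2):
    """Position-wise feed-forward network with ReLU (illustrative sketch).

    X: (n, d_model) or (d_model,); W1: (d_model, d_ff); W2: (d_ff, d_model).
    """
    hidden = np.maximum(0.0, X @ W1 + b1)  # expand to d_ff, then ReLU
    return hidden @ W2 + b2                # contract back to d_model

# Sizes from the text; random weights are an assumption for the example.
rng = np.random.default_rng(0)
n, d_model, d_ff = 3, 512, 2048
X = rng.normal(size=(n, d_model))
W1 = rng.normal(size=(d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

Y = ffn(X, W1, b1, W2, b2)
single = ffn(X[1], W1, b1, W2, b2)  # the same network on one position only
print(Y.shape, np.allclose(Y[1], single))
```

Because no term mixes rows of $X$, any exchange of information between positions has to come from the attention sublayer.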