The Mathematical Foundation of Multi-Head Attention and Position-wise Feed-Forward Networks

At its core, the attention mechanism is designed to relieve the 'bottleneck' found in sequential models, which must carry information about the whole sequence through a fixed-size hidden state. While Recurrent Neural Networks (RNNs) process tokens one by one, attention allows a model to dynamically weight the importance of every token in a sequence, regardless of distance. The intuition is akin to a retrieval system: given a specific 'query', the model compares it against a set of 'keys' to determine which 'values' are most relevant. By computing a weighted sum of these values, the model creates a context-aware representation of the current token.

Mathematically, we begin with Scaled Dot-Product Attention. Given an input matrix $X \in \mathbb{R}^{n \times d}$, we project it into three distinct spaces using weight matrices $W^Q, W^K, W^V \in \mathbb{R}^{d \times d_k}$. This yields the Query, Key, and Value matrices: $Q = XW^Q$, $K = XW^K$, and $V = XW^V$. The attention scores are the dot products $QK^T$, scaled by $\sqrt{d_k}$. Without this scaling, the magnitude of the scores grows with $d_k$ (for components with zero mean and unit variance, each dot product has variance $d_k$), which pushes the softmax into saturated regions where its gradients become extremely small. The softmax is applied row-wise, so each token's attention weights sum to one: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
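
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the sizes `n`, `d`, `d_k` and the random weights are assumptions chosen only to make the shapes visible, and the helper names `softmax` and `scaled_dot_product_attention` are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V together with the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (n, n) pairwise query-key scores
    weights = softmax(scores)        # each row sums to 1
    return weights @ V, weights

# Illustrative sizes and random weights (assumptions, not from the text).
rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 3
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(output.shape)   # (4, 3)
print(weights.shape)  # (4, 4)
```

Each row of `weights` is one token's distribution over all `n` tokens, and the matching row of `output` is the corresponding weighted sum of value vectors.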

A single attention head produces only one set of attention weights per layer, so every kind of relationship it captures must share that one weighting. Multi-Head Attention (MHA) addresses this by running $h$ attention mechanisms in parallel. Each 'head' uses its own learned linear projections, allowing the model to attend to different aspects of the sequence at once; for example, one head might capture syntactic dependencies while another captures semantic references. The outputs of these $h$ heads are concatenated and projected back to the original dimension: $$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$ where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$. In this formula $Q$, $K$, and $V$ denote the inputs to the layer before any projection (in self-attention all three equal $X$), the per-head matrices $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times d_k}$ play the role of $W^Q$, $W^K$, $W^V$ above, and $W^O \in \mathbb{R}^{h d_k \times d}$.
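
As a rough sketch of the concatenate-and-project step, the snippet below runs $h$ heads over the same input and mixes them with $W^O$. It assumes self-attention (queries, keys, and values all come from `X`) and the common but not mandatory choice $d_k = d / h$; the function names and random weights are illustrative only.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for 2-D inputs."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) tuples, one tuple per head."""
    head_outputs = [attention(X @ W_Q, X @ W_K, X @ W_V)
                    for W_Q, W_K, W_V in heads]
    # Concatenate along the feature axis, then project back to dimension d.
    return np.concatenate(head_outputs, axis=-1) @ W_O

# Illustrative sizes (assumptions): h heads, each of width d_k = d // h.
rng = np.random.default_rng(0)
n, d, h = 5, 8, 2
d_k = d // h
X = rng.normal(size=(n, d))
heads = [tuple(rng.normal(size=(d, d_k)) for _ in range(3)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d))

out = multi_head_attention(X, heads, W_O)
print(out.shape)  # (5, 8)
```

Practical implementations usually fuse the per-head projections into a single batched matrix multiply; the explicit loop here is kept only because it mirrors the formula term by term.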

Following the attention mechanism, the Transformer employs a Position-wise Feed-Forward Network (FFN). While attention handles the interaction *between* tokens, the FFN handles the transformation *of* each token independently. One common intuition is that the FFN acts as a local knowledge base, processing the information gathered by the attention heads to refine each token's representation. Because the same FFN, with the same weights, is applied to every position, it is termed 'position-wise'. Treating every position identically also means the FFN preserves the permutation equivariance of unmasked self-attention: reordering the input tokens simply reorders the outputs in the same way.
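
The equivariance claim can be checked numerically. The sketch below makes simplifying assumptions: self-attention uses identity projections for brevity, and the position-wise step is a single affine map with ReLU rather than a full FFN. All names are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(X):
    """Unmasked self-attention with identity projections (a simplification)."""
    return softmax(X @ X.T / np.sqrt(X.shape[-1])) @ X

def position_wise(X, W, b):
    """Apply the same affine map followed by ReLU to every row of X."""
    return np.maximum(0.0, X @ W + b)

rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.normal(size=(n, d))
W, b = rng.normal(size=(d, d)), rng.normal(size=d)

def layer(X):
    return position_wise(self_attention(X), W, b)

perm = rng.permutation(n)
# Permuting the input rows permutes the output rows in the same way.
print(np.allclose(layer(X[perm]), layer(X)[perm]))  # True
```

Adding a causal mask would break this property, because the set of positions each token may attend to would then depend on its index.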

The FFN consists of two linear transformations with a non-linear activation function (typically ReLU or GELU) in between. Specifically, the input $x \in \mathbb{R}^{d_{model}}$ is projected into a higher-dimensional space of size $d_{ff}$ (often $4 \times d_{model}$) and then projected back. This 'expansion-contraction' architecture gives the non-linearity a wider feature space to act on, in which features are more likely to be linearly separable. With a ReLU activation, the mathematical expression is: $$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$ where $W_1 \in \mathbb{R}^{d_{model} \times d_{ff}}$, $b_1 \in \mathbb{R}^{d_{ff}}$, $W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}$, and $b_2 \in \mathbb{R}^{d_{model}}$.
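
A minimal NumPy sketch of this expression follows, using the ReLU form and the $d_{ff} = 4 \times d_{model}$ convention mentioned above. The sizes and random parameters are assumptions for illustration, and `ffn` is a hypothetical helper name.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each row of x."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # expand to d_ff, then ReLU
    return hidden @ W2 + b2                # contract back to d_model

# Illustrative sizes and random parameters (assumptions, not from the text).
rng = np.random.default_rng(0)
n, d_model = 3, 8
d_ff = 4 * d_model
W1, b1 = rng.normal(size=(d_model, d_ff)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), rng.normal(size=d_model)

X = rng.normal(size=(n, d_model))
Y = ffn(X, W1, b1, W2, b2)
print(Y.shape)  # (3, 8)

# Position-wise: processing the rows one at a time gives the same result.
row_by_row = np.stack([ffn(x, W1, b1, W2, b2) for x in X])
print(np.allclose(Y, row_by_row))  # True
```

The second check makes the 'position-wise' property concrete: no row of `Y` depends on any other row of `X`.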

In summary, Multi-Head Attention and the position-wise FFN play complementary roles in the Transformer. MHA provides the 'global' view, mixing information across positions and letting the model shift its focus across the sequence, while the FFN provides the 'local' capacity to transform the features at each position. Together, these components turn a raw sequence of embeddings into a rich, context-aware representation, built from the interplay of linear projections and non-linear activations.