The Mathematical Foundation of Multi-Head Attention and Position-wise Feed-Forward Networks

At its core, the Attention mechanism is designed to solve the 'bottleneck' problem in sequence processing by allowing each element in a sequence to dynamically weigh its relationship with every other element. Instead of compressing a whole sentence into a single vector, attention treats the sequence as a database. By using three distinct roles—Query, Key, and Value—the model can perform a differentiable 'lookup' to determine which parts of the input are most relevant to the current processing step.

The computation begins with scaled dot-product attention. Given input embeddings $X$, we project them into three spaces using weight matrices $W^Q, W^K, W^V$ to obtain $Q = XW^Q$, $K = XW^K$, and $V = XW^V$. The attention scores are the dot products of queries and keys, scaled by $1/\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors. Without this scaling, the dot products grow in magnitude as $d_k$ increases and push the softmax into saturated regions where its gradients become vanishingly small. The full operation is $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$.
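The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the sizes (`n`, `d_model`, `d_k`), the random weights, and the helper names `softmax` and `scaled_dot_product_attention` are assumptions chosen for the example, and batching and masking are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n, d_k), K: (n, d_k), V: (n, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (n, n) pairwise scores
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ V, weights

# Illustrative sizes and random weights (assumptions, not values from the text).
rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
```

Each row of `weights` is a probability distribution over the input positions, so each row of `output` is a weighted average of the value vectors.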

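While a single attention head captures one type of relationship, Multi-Head Attention (MHA) allows the model to attend to information from different representation subspaces simultaneously. We define $h$ heads, where each head $i$ has its own set of linear projections. The output is the concatenation of these heads, followed by a final linear transformation $W^O$: $\text{MHA}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$, where each $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$. In this formula $Q$, $K$, and $V$ denote the inputs to the layer before any projection (in self-attention, all three are the embeddings $X$), and the per-head matrices $W_i^Q, W_i^K, W_i^V$ take over the role of the single-head projections $W^Q, W^K, W^V$ above.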
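As a rough sketch of the multi-head formula, the NumPy example below treats the self-attention case, where the query, key, and value inputs are all the same matrix `X`. The head count `h`, the per-head sizes `d_k = d_v = d_model / h`, and the function names are assumptions made for illustration. Practical implementations usually fuse the per-head projections into a single matrix multiply instead of looping over heads.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention for a single head.
    d_k = K.shape[-1]
    return softmax(Q @ K.swapaxes(-1, -2) / np.sqrt(d_k), axis=-1) @ V

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # W_Q, W_K, W_V: lists of h per-head projection matrices.
    # W_O: (h * d_v, d_model) output projection.
    heads = [
        attention(X @ w_q, X @ w_k, X @ w_v)
        for w_q, w_k, w_v in zip(W_Q, W_K, W_V)
    ]
    # Concatenate the heads along the feature axis, then mix them with W_O.
    return np.concatenate(heads, axis=-1) @ W_O

# Illustrative sizes and random weights (assumptions, not values from the text).
rng = np.random.default_rng(0)
n, d_model, h = 4, 8, 2
d_k = d_v = d_model // h
X = rng.normal(size=(n, d_model))
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
W_O = rng.normal(size=(h * d_v, d_model))

mha_out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
```

With `d_k = d_model / h`, the concatenated heads have width `d_model` again, so the layer's output has the same shape as its input.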

Following the attention mechanism, the Transformer utilizes a Position-wise Feed-Forward Network (FFN). While attention handles the interactions between different positions, the FFN processes each position independently and identically. This acts as a local transformation that allows the model to process the 'meaning' extracted by the attention heads into a higher-level representation, effectively functioning as a per-token neural network.

Mathematically, the FFN consists of two linear transformations with a non-linear activation function (typically ReLU or GELU) in between. For an input $x$, with ReLU as the activation, the operation is defined as $\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$. Here, $W_1$ typically projects the vector into a higher-dimensional space (often $4\times$ the model dimension) to allow for a more expressive transformation, which is then projected back to the original dimension by $W_2$.
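A small NumPy sketch of the ReLU form of this formula is given below. The sizes, the random weights, and the name `ffn` are assumptions for the example, with the hidden width set to four times the model dimension as described above. The final lines check the position-wise property: applying the network to the whole sequence gives the same result as applying it to one position on its own.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Expand to the hidden width, apply ReLU, then project back to d_model.
    hidden = np.maximum(0.0, x @ W1 + b1)
    return hidden @ W2 + b2

# Illustrative sizes and random weights (assumptions, not values from the text).
rng = np.random.default_rng(0)
n, d_model = 4, 8
d_ff = 4 * d_model
X = rng.normal(size=(n, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

ffn_out = ffn(X, W1, b1, W2, b2)          # all positions at once
single_position = ffn(X[2], W1, b1, W2, b2)  # one position in isolation

# Each position is transformed independently with the same weights.
rows_match = np.allclose(ffn_out[2], single_position)
```

Because no row of the output depends on any other row of the input, all mixing of information between positions is left to the attention layer.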

The synergy between MHA and FFNs creates a powerful duality: MHA provides the 'context' (where to look), and the FFN provides the 'computation' (what it means). By stacking these layers, the model can learn a hierarchy of features, tending to move from simple lexical associations in the lower layers to more complex semantic abstractions in the higher layers. Throughout, the path between any two tokens stays at a constant length regardless of how far apart they are, because the attention matrix connects every pair of positions directly within a single layer.