The Mathematical Foundations of Multi-Head Attention and Position-wise Feed-Forward Networks

The fundamental intuition behind Attention is the ability of a model to dynamically focus on different parts of an input sequence based on their relevance to a specific query. In a sequence of tokens, not all words are equally important for understanding the context of a current word. By calculating a weighted sum of all input representations, the model can 'attend' to the most salient information, effectively creating a dynamic connectivity pattern that transcends the fixed-distance constraints of Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs).

Mathematically, the core primitive is Scaled Dot-Product Attention. Given an input matrix $X$, we derive three linear projections: Queries $Q = XW^Q$, Keys $K = XW^K$, and Values $V = XW^V$. The attention score is computed as the dot product of the query with all keys, scaled by the square root of the head dimension $d_k$ to prevent gradient vanishing in the softmax function. The formulation is: $\text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V$. Here, the softmax operation transforms raw alignment scores into a probability distribution that weights the contribution of each value vector.

While single-head attention is powerful, Multi-Head Attention (MHA) allows the model to jointly attend to information from different representation subspaces at different positions. Instead of one large attention mechanism, we split the $d_{model}$ dimension into $h$ heads, each with dimension $d_k = d_{model}/h$. Each head $i$ performs its own scaled dot-product attention. The outputs are then concatenated and projected back to the original dimension using a weight matrix $W^O$: $\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$, where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$.

After the global dependencies are captured by the attention mechanism, the model requires a way to process this information locally and introduce non-linearity. This is achieved via the Position-wise Feed-Forward Network (FFN). Unlike the attention layer, which interacts across positions, the FFN is applied to each position independently and identically. It consists of two linear transformations with a non-linear activation function (typically ReLU or GeLU) in between, acting as a point-wise associative memory that refines the representations learned by the attention heads.

The mathematical structure of the FFN is expressed as: $\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$. In a standard Transformer, $W_1$ typically projects the vector into a higher-dimensional space (e.g., from $d_{model} = 512$ to $d_{ff} = 2048$) and $W_2$ projects it back. This 'expansion-contraction' architecture allows the model to map the input into a high-dimensional manifold where features can be more easily separated and transformed before being compressed back into the residual stream.

To ensure stability during training and enable the flow of gradients in deep architectures, both MHA and FFN layers are wrapped in a residual connection and followed by layer normalization. The output of a transformer block can be summarized as: $\text{Output} = \text{LayerNorm}(x + \text{MultiHead}(x)) \rightarrow \text{LayerNorm}(\text{Output} + \text{FFN}(\text{Output}))$. This additive structure prevents the vanishing gradient problem and allows the network to learn identity mappings, ensuring that the model can grow in depth without degrading performance.