The Mathematical Foundation of Multi-Head Attention and Position-wise Feed-Forward Networks

The core intuition behind the Transformer architecture is the ability to dynamically weight the importance of different parts of an input sequence. Unlike Recurrent Neural Networks, which process tokens one at a time, attention allows the model to 'look' at the entire sequence simultaneously. This is achieved through a retrieval mechanism in which a 'query' is matched against a set of 'keys' to extract relevant information from the corresponding 'values'. Mathematically, this is a soft lookup table: rather than returning a single value, it returns a weighted average of all values, with weights determined by the similarity between the query and key vectors.

At the heart of this process is Scaled Dot-Product Attention. Given an input matrix $X \in \mathbb{R}^{n \times d}$ whose rows are the $n$ token representations, we derive three matrices: Queries $Q = XW^Q$, Keys $K = XW^K$, and Values $V = XW^V$, where $W^Q, W^K, W^V \in \mathbb{R}^{d \times d_k}$. The attention output is computed as: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, with the softmax applied row-wise. The term $QK^T$ computes the pairwise similarity between all tokens. Dividing by $\sqrt{d_k}$ keeps the magnitude of these dot products from growing with the dimensionality; without it, large scores push the softmax into saturated regions where its gradients become extremely small.
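
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the sizes `n`, `d`, `d_k`, the random weight matrices, and the helper names `softmax` and `scaled_dot_product_attention` are assumptions chosen for the example.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) pairwise similarities
    weights = softmax(scores, axis=-1)   # row i: how token i weights every token
    return weights @ V, weights

# Illustrative sizes and random weights (assumptions, not from the text).
rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 8
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(output.shape)          # (4, 8)
print(weights.sum(axis=-1))  # each row sums to 1 (up to rounding)
```

Each row of `weights` is a probability distribution over the sequence, which is what makes the operation a soft lookup: every output row is a weighted average of the rows of `V`.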

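Multi-Head Attention (MHA) extends this concept by allowing the model to attend to information from different representation subspaces simultaneously. Instead of performing a single attention function over $d_{model}$-dimensional queries, keys, and values, we linearly project them $h$ times into lower-dimensional spaces of size $d_k = d_{model}/h$. The formula becomes: $\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$, where each $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$. Here $Q$, $K$, and $V$ denote the unprojected inputs to the layer (all equal to $X$ in self-attention), $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d_{model} \times d_k}$ are per-head projections, and $W^O \in \mathbb{R}^{hd_k \times d_{model}}$ mixes the concatenated heads back into the model dimension. Because each head has its own projections, different heads can specialize in different relationships, for example one tracking syntactic structure while another tracks semantic similarity.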
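
As a rough sketch of the multi-head computation for the self-attention case, the loop below runs one attention function per head and concatenates the results. The function name `multi_head_attention`, the stacked weight layout of shape `(h, d_model, d_k)`, and the sizes are assumptions made for this example; practical implementations usually batch the heads into a single tensor operation instead of looping.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """Multi-head self-attention over the rows of X.

    W_Q, W_K, W_V: per-head projections, shape (h, d_model, d_k).
    W_O: output projection, shape (h * d_k, d_model).
    """
    h, _, d_k = W_Q.shape
    heads = []
    for i in range(h):
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]   # each (n, d_k)
        weights = softmax(Q @ K.T / np.sqrt(d_k))      # (n, n)
        heads.append(weights @ V)                      # (n, d_k)
    concat = np.concatenate(heads, axis=-1)            # (n, h * d_k)
    return concat @ W_O                                # (n, d_model)

# Illustrative sizes and random weights (assumptions, not from the text).
rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
d_k = d_model // h
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (5, 16)
```

Since $d_k = d_{model}/h$, the concatenated heads have the same width as the input, so the total cost stays close to that of a single full-width head.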

After the attention mechanism has aggregated global context, the model needs a way to process this information at each position independently. This is the role of the Position-wise Feed-Forward Network (FFN). The FFN consists of two linear transformations with a ReLU activation in between, applied to the representation $x$ of each position separately and identically: $\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$. In this structure, $W_1$ typically projects the data into a higher-dimensional space (expansion), and $W_2$ projects it back to the model dimension (contraction).
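
The position-wise behaviour can be checked with a short NumPy sketch. The hidden width `d_ff = 4 * d_model`, the zero biases, and the random weights are illustrative assumptions; the point is only that applying the FFN to the whole matrix gives the same result as applying it to each row on its own.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, for one row or a stack of rows."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # expansion followed by ReLU
    return hidden @ W2 + b2                 # contraction back to d_model

# Illustrative sizes and parameters (assumptions, not from the text).
rng = np.random.default_rng(0)
n, d_model = 3, 8
d_ff = 4 * d_model
X = rng.normal(size=(n, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

full = ffn(X, W1, b1, W2, b2)                                # all positions at once
row_by_row = np.stack([ffn(x, W1, b1, W2, b2) for x in X])   # one position at a time
print(full.shape)                     # (3, 8)
print(np.allclose(full, row_by_row))  # True
```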

One useful interpretation of the FFN is that it acts as a set of 'key-value' memories. The columns of $W_1$ serve as keys: the ReLU activation is non-zero only where the token's representation matches the corresponding 'pattern', and the second layer adds the associated rows of $W_2$, the values, weighted by those activations. Because the same FFN parameters are applied to every token in the sequence, it behaves like a $1 \times 1$ convolution. The FFN is therefore permutation-equivariant, as is self-attention: if no positional encodings are added, reordering the input tokens simply reorders the outputs in the same way, so the model has no information about token order.
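
The permutation property can be verified numerically: shuffling the input rows and then applying a layer should match applying the layer and then shuffling its output rows the same way (often called permutation equivariance). The sketch below checks this for an FFN and for a single-head self-attention layer without positional encodings. All sizes, weights, and function names are assumptions for illustration.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention, no positional encodings."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def ffn(X, W1, b1, W2, b2):
    """Position-wise feed-forward network with ReLU."""
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2

# Illustrative sizes and random parameters (assumptions, not from the text).
rng = np.random.default_rng(0)
n, d, d_ff = 6, 8, 32
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
W1, b1 = rng.normal(size=(d, d_ff)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), rng.normal(size=d)

perm = rng.permutation(n)  # a random reordering of the token positions

# Permute-then-apply versus apply-then-permute.
ffn_ok = np.allclose(ffn(X[perm], W1, b1, W2, b2), ffn(X, W1, b1, W2, b2)[perm])
attn_ok = np.allclose(self_attention(X[perm], W_Q, W_K, W_V),
                      self_attention(X, W_Q, W_K, W_V)[perm])
print(ffn_ok, attn_ok)  # True True
```

The outputs move with the tokens, so nothing in these layers can distinguish one ordering from another; that information has to come from positional encodings.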

To ensure stability during the training of these deep networks, the attention and FFN blocks are each wrapped in a residual connection followed by Layer Normalization. The output of a sublayer is defined as $\text{LayerNorm}(x + \text{Sublayer}(x))$. The additive structure mitigates the vanishing gradient problem by providing a direct path for the gradient to flow backward through the network, while Layer Normalization rescales each position's activation vector to zero mean and unit variance across its features (before a learned gain and bias), keeping activation scales comparable from layer to layer.
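
A minimal sketch of this wrapper, assuming a Layer Normalization with gain `gamma`, bias `beta`, and a small `eps` for numerical safety (none of which are specified in the text). The `sublayer` used here is a hypothetical stand-in for an attention or FFN block.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row over its features, then apply gain and bias."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def residual_block(x, sublayer, gamma, beta):
    """LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x), gamma, beta)

# Illustrative input with a deliberately large scale (assumption).
rng = np.random.default_rng(0)
n, d = 4, 8
X = 10.0 * rng.normal(size=(n, d))
W = rng.normal(size=(d, d))

def sublayer(x):
    """Hypothetical stand-in for an attention or FFN sublayer."""
    return np.tanh(x @ W)

gamma, beta = np.ones(d), np.zeros(d)   # identity gain and zero bias
Y = residual_block(X, sublayer, gamma, beta)
print(Y.mean(axis=-1))  # approximately 0 for every position
print(Y.var(axis=-1))   # approximately 1 for every position
```

Whatever the scale of the input, each position leaves the block with roughly zero mean and unit variance, which is the sense in which normalization keeps activations comparable across layers.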