The Mathematical Foundation of Multi-Head Attention and Position-wise Feed-Forward Networks

At its core, the Attention mechanism is designed to solve the problem of static representation. With static word embeddings, a word is represented by a single vector regardless of context, and a traditional RNN must carry all context forward through a single fixed-size hidden state. Attention allows a model to dynamically 'attend' to different parts of the input sequence based on the current focus. Mathematically, this is treated as a retrieval system: a Query ($\text{Q}$) is compared against a set of Keys ($\text{K}$) using a similarity metric, and the resulting relevance scores determine how the corresponding Values ($\text{V}$) are aggregated to form the output.

The foundation of this process is Scaled Dot-Product Attention. Given queries $\text{Q} \in \mathbb{R}^{n \times d_k}$, keys $\text{K} \in \mathbb{R}^{m \times d_k}$, and values $\text{V} \in \mathbb{R}^{m \times d_v}$, we compute the alignment scores using the dot product $\text{QK}^T$. As the dimensionality $d_k$ increases, these dot products grow in magnitude and push the softmax into saturated regions where its gradients become extremely small. To counteract this, we scale the product by $\frac{1}{\sqrt{d_k}}$. The operation is defined as: $\text{Attention}(\text{Q}, \text{K}, \text{V}) = \text{softmax}\left(\frac{\text{QK}^T}{\sqrt{d_k}}\right)\text{V}$, where the softmax is applied to each row so that the weights for a given query sum to 1. The result is a weighted sum of values, where the weights are determined by the compatibility of the query with the keys.
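The formula can be traced step by step on a toy input. The following is a minimal sketch in plain Python, not an optimized implementation; the helper names (`matmul`, `transpose`, `softmax`, `scaled_dot_product_attention`) and the toy matrices are illustrative assumptions, and the values are assumed to have shape $m \times d_v$.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*A)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for Q (n x d_k), K (m x d_k), V (m x d_v)."""
    d_k = len(Q[0])
    scores = matmul(Q, transpose(K))                          # n x m alignment scores
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]                # each row sums to 1
    return matmul(weights, V), weights                        # output is n x d_v

# Toy example (hypothetical values): n = 2 queries, m = 3 keys/values, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = scaled_dot_product_attention(Q, K, V)

for w in weights:
    print([round(x, 3) for x in w])
```

In this toy input the first query has the same dot product with the first and third keys, so those two keys receive equal weight and the second key receives less.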

Multi-Head Attention (MHA) extends this concept by allowing the model to jointly attend to information from different representation subspaces. For example, one attention head may come to focus on syntactic relationships while another focuses on semantic meaning. In self-attention, we project the input $\text{X}$ into $h$ different sets of $\text{Q, K, V}$ matrices using learnable weights $\text{W}_i^Q, \text{W}_i^K, \text{W}_i^V$. The output of each head is computed independently: $\text{head}_i = \text{Attention}(\text{X}\text{W}_i^Q, \text{X}\text{W}_i^K, \text{X}\text{W}_i^V)$. These heads are then concatenated along the feature dimension and projected back to the original dimension with a learnable matrix $\text{W}^O$: $\text{MultiHead}(\text{X}) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\text{W}^O$.
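The project, attend, concatenate, and project-back sequence can be sketched as follows. This is a rough illustration in plain Python with random, untrained weights. The per-head width $d_k = d_{model} / h$ is an assumption (a common choice, not something stated above), and all function and variable names are hypothetical.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(Q[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rng, rows, cols):
    """Stand-in for learned weights: uniform random values in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one per head."""
    head_outputs = [
        attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
        for W_Q, W_K, W_V in heads
    ]
    # Concatenate along the feature axis: row t joins head_1[t], ..., head_h[t].
    concat = [sum((out[t] for out in head_outputs), []) for t in range(len(X))]
    return matmul(concat, W_O)  # project back to d_model

rng = random.Random(0)
n, d_model, h = 3, 4, 2
d_k = d_model // h  # assumed per-head width
X = random_matrix(rng, n, d_model)
heads = [tuple(random_matrix(rng, d_model, d_k) for _ in range(3)) for _ in range(h)]
W_O = random_matrix(rng, h * d_k, d_model)
Y = multi_head_attention(X, heads, W_O)

print(len(Y), len(Y[0]))  # the output has the same shape as X: n rows, d_model columns
```

Each head works in its own $d_k$-dimensional projection of the input, which is what "different representation subspaces" refers to; the final multiplication by `W_O` mixes the heads back into a single $d_{model}$-dimensional vector per position.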

While MHA captures global relationships, its only non-linearity is the softmax that computes the attention weights; given those weights, the output is a linear combination of the values. To introduce the capacity for complex non-linear mapping, the Transformer employs a Position-wise Feed-Forward Network (FFN). This network is applied to each position identically and independently. It consists of two linear transformations separated by a non-linear activation function (typically ReLU or GELU). With ReLU, the mathematical form is $\text{FFN}(\text{x}) = \max(0, \text{x}\text{W}_1 + \text{b}_1)\text{W}_2 + \text{b}_2$, where $\text{W}_1$ expands the dimensionality to a higher-dimensional space (often $d_{model} \to 4d_{model}$) and $\text{W}_2$ projects it back to $d_{model}$.
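As a minimal sketch of the ReLU form of this formula, the snippet below applies the two linear maps to a small matrix whose rows are positions. The weights are random placeholders rather than trained parameters, and the sizes ($d_{model} = 2$, hidden width $8$, named `d_ff` here) are assumptions chosen only to mirror the 4x expansion.

```python
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def ffn(X, W1, b1, W2, b2):
    """Compute max(0, X W1 + b1) W2 + b2 for every row (position) of X."""
    # First linear map plus ReLU: d_model -> d_ff.
    hidden = [[max(0.0, v + b) for v, b in zip(row, b1)] for row in matmul(X, W1)]
    # Second linear map: d_ff -> d_model.
    return [[v + b for v, b in zip(row, b2)] for row in matmul(hidden, W2)]

def random_matrix(rng, rows, cols):
    """Stand-in for learned weights: uniform random values in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d_model = 3, 2
d_ff = 4 * d_model  # assumed hidden width, following the 4x expansion

W1 = random_matrix(rng, d_model, d_ff)
b1 = [0.0] * d_ff
W2 = random_matrix(rng, d_ff, d_model)
b2 = [0.1, -0.1]

X = random_matrix(rng, n, d_model)
Y = ffn(X, W1, b1, W2, b2)

print(len(Y), len(Y[0]))  # same shape as X: n rows, d_model columns
```

Swapping ReLU for GELU would only change the activation applied to the hidden layer; the expand-then-project structure stays the same.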

The 'Position-wise' nature of the FFN is critical. Unlike the attention layer, which mixes information across the sequence dimension $n$, the FFN operates solely on the feature dimension $d$. If we view the input as a matrix $\text{X} \in \mathbb{R}^{n \times d}$, the FFN treats each row $\text{x}_i$ as an independent sample. This allows the model to process the 'contextualized' tokens generated by the attention heads and refine the internal representation of each token autonomously.
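One way to see the difference is to perturb a single position and check which outputs change. The sketch below is a simplified illustration: the attention uses identity projections ($\text{Q} = \text{K} = \text{V} = \text{X}$), and the FFN weights are small hand-picked placeholders, all of which are assumptions made for the example.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head self-attention with identity projections (Q = K = V = X)."""
    d = len(X[0])
    X_T = [list(col) for col in zip(*X)]
    scores = matmul(X, X_T)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, X)

def ffn(X, W1, b1, W2, b2):
    """Position-wise FFN: max(0, X W1 + b1) W2 + b2, applied row by row."""
    hidden = [[max(0.0, v + b) for v, b in zip(row, b1)] for row in matmul(X, W1)]
    return [[v + b for v, b in zip(row, b2)] for row in matmul(hidden, W2)]

# Hand-picked placeholder weights: d = 2, hidden width 4.
W1 = [[1.0, -1.0, 0.5, 0.0], [0.0, 1.0, -0.5, 1.0]]
b1 = [0.0, 0.1, 0.0, -0.1]
W2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 1.0]]
b2 = [0.0, 0.0]

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
X_changed = [row[:] for row in X]
X_changed[2] = [5.0, -3.0]  # perturb only the last position

# Compare the output at position 0 before and after the change.
ffn_same = ffn(X, W1, b1, W2, b2)[0] == ffn(X_changed, W1, b1, W2, b2)[0]
attn_same = self_attention(X)[0] == self_attention(X_changed)[0]

print("FFN output at position 0 unchanged:", ffn_same)
print("Attention output at position 0 unchanged:", attn_same)
```

The FFN output at position 0 depends only on row 0 of the input, so it is unaffected by the change to position 2, whereas the attention output at position 0 is a weighted sum over all rows and therefore shifts.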

Integrating these two components creates a powerful duality: Multi-Head Attention handles *inter-token* communication (routing information between positions), while the FFN handles *intra-token* processing (feature transformation). Together, they let the Transformer map an input sequence to contextualized representations in which semantic relationships can become easier to separate. Because both components reduce largely to matrix multiplications that run in parallel across positions, the architecture trains efficiently on massive datasets, and because attention connects any two positions directly, it retains the ability to resolve nuanced dependencies across long distances.