The Mathematical Foundation of Multi-Head Attention and Position-wise Feed-Forward Networks

At its core, the attention mechanism is designed to allow a model to focus on different parts of an input sequence when processing a specific token. The intuition is akin to a database retrieval system: given a 'query', the model searches for the most relevant 'keys' and retrieves the corresponding 'values'. By computing a compatibility score between queries and keys, the model can dynamically weigh the importance of every other token in the sequence, effectively creating a context-specific representation of the data.

Mathematically, we represent the input sequence as a matrix $X \in \mathbb{R}^{n \times d_{model}}$, where $n$ is the sequence length. To compute attention, we project $X$ into three distinct spaces using learned weight matrices $W^Q, W^K, W^V \in \mathbb{R}^{d_{model} \times d_k}$. This yields the Query, Key, and Value matrices: $Q = XW^Q$, $K = XW^K$, and $V = XW^V$. Scaled Dot-Product Attention is then defined as: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, where the softmax is applied row-wise so that each token's weights over the sequence sum to 1. The division by $\sqrt{d_k}$ is critical: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$, so unscaled scores grow with the dimension and push the softmax into saturated regions with extremely small gradients.
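The formula above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the toy input `X`, the weight matrices `W_Q`, `W_K`, `W_V`, and the helper names are all assumptions chosen for readability, and matrices are stored as lists of rows so that no external library is needed.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over a single row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))  # shape n x n
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy sizes (illustrative assumptions): n = 3 tokens, d_model = 4, d_k = 2.
X = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 2.0],
     [1.0, 1.0, 1.0, 1.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
W_V = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]

Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
output, weights = scaled_dot_product_attention(Q, K, V)

for row in weights:
    print([round(w, 3) for w in row])  # each row is a distribution over the 3 tokens
```

Each row of `weights` describes how strongly one token attends to every token in the sequence, and the corresponding row of `output` is the resulting weighted average of the value vectors.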

Multi-Head Attention (MHA) extends this concept by performing the attention process multiple times in parallel. The intuition is that a single attention head might only capture one type of relationship (e.g., syntactic dependency), while multiple heads allow the model to simultaneously attend to different types of information (e.g., semantic meaning, temporal order, and coreference). By splitting the embedding dimension $d_{model}$ into $h$ heads, each with dimension $d_k = d_{model}/h$, the model gains the ability to project the input into multiple representation subspaces.
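As a worked example of the split, assume $d_{model} = 512$ and $h = 8$ (the head count is an illustrative assumption, not a value fixed by the discussion above). Each head then uses one projection matrix of shape $d_{model} \times d_k$ per role (query, key, or value):

$$d_k = \frac{d_{model}}{h} = \frac{512}{8} = 64, \qquad h \cdot (d_{model} \cdot d_k) = 8 \cdot (512 \cdot 64) = 262{,}144 = d_{model}^2.$$

Under these assumptions, the $h$ per-head projections for one role together hold exactly as many parameters as a single full-width $d_{model} \times d_{model}$ projection, so splitting into heads redistributes capacity across subspaces rather than adding to it.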

The formulation for MHA involves calculating $h$ independent attention outputs. Each head has its own projection matrices $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d_{model} \times d_k}$ and, for self-attention over the input $X$, computes $\text{head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V)$. The heads are then concatenated along the feature dimension and projected back into the original model dimension using a final weight matrix $W^O \in \mathbb{R}^{hd_k \times d_{model}}$. The complete operation is expressed as: $\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$. This final linear projection mixes the outputs of all heads so that their combined information is written back into the residual stream of the Transformer.
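The concatenate-then-project structure can be sketched as follows for the self-attention case, where queries, keys, and values are all derived from the same input. This is a minimal sketch under stated assumptions: the sizes, the randomly initialised weights, and the helper names (`random_matrix`, `multi_head_attention`) are hypothetical and chosen only to make the shapes visible.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over a single row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head_attention(x, heads, w_o):
    """heads is a list of (W_q, W_k, W_v) triples, one triple per head."""
    head_outputs = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate along the feature axis: each row becomes h * d_k wide.
    concat = [[v for head in head_outputs for v in head[i]] for i in range(len(x))]
    return matmul(concat, w_o)

def random_matrix(rng, rows, cols):
    """Hypothetical initialiser used only for this illustration."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Toy sizes (illustrative assumptions): n = 5 tokens, d_model = 8, h = 2 heads.
rng = random.Random(0)
n, d_model, h = 5, 8, 2
d_k = d_model // h

X = random_matrix(rng, n, d_model)
heads = [tuple(random_matrix(rng, d_model, d_k) for _ in range(3)) for _ in range(h)]
W_O = random_matrix(rng, h * d_k, d_model)

out = multi_head_attention(X, heads, W_O)
print(len(out), len(out[0]))  # rows and columns of the result
```

The output has the same shape as the input, $n \times d_{model}$, which is what allows it to be added back to the residual stream.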

Following the attention mechanism, the Transformer employs a Position-wise Feed-Forward Network (FFN). While attention handles the interaction between different tokens, the FFN is designed to process each token independently. The intuition here is to apply a non-linear transformation to the representations, allowing the model to project the attention-weighted features into a higher-dimensional space to learn more complex patterns before projecting them back down.

The FFN consists of two linear transformations with a non-linear activation function (typically ReLU or GELU) in between. With ReLU, the operation applied at each position is: $\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$. Here, $W_1 \in \mathbb{R}^{d_{model} \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}$, with biases $b_1 \in \mathbb{R}^{d_{ff}}$ and $b_2 \in \mathbb{R}^{d_{model}}$, where $d_{ff}$ is typically much larger than $d_{model}$ (e.g., 2048 vs 512). The same weights are shared across all positions. This 'expansion-contraction' architecture can be interpreted as a key-value memory in which the first layer detects features and the second layer reconstructs the output representation.
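A minimal sketch of the ReLU variant, applied row by row, is shown below. The sizes ($d_{model} = 4$, $d_{ff} = 16$, mirroring the 4x expansion of 512 to 2048), the random weights, and the helper names are assumptions made for illustration only.

```python
import random

def linear(x, w, b):
    """Compute x W + b for one row vector x, with W stored as a list of rows."""
    return [sum(xi * wi for xi, wi in zip(x, col)) + bj for col, bj in zip(zip(*w), b)]

def relu(v):
    """Element-wise max(0, v)."""
    return [max(0.0, t) for t in v]

def ffn(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position."""
    return linear(relu(linear(x, w1, b1)), w2, b2)

def position_wise_ffn(seq, w1, b1, w2, b2):
    """Apply the same FFN, with shared weights, to every position independently."""
    return [ffn(x, w1, b1, w2, b2) for x in seq]

def random_matrix(rng, rows, cols):
    """Hypothetical initialiser used only for this illustration."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Toy sizes (illustrative assumptions): n = 3 tokens, d_model = 4, d_ff = 16.
rng = random.Random(0)
n, d_model, d_ff = 3, 4, 16

X = random_matrix(rng, n, d_model)
W1, b1 = random_matrix(rng, d_model, d_ff), [0.1] * d_ff
W2, b2 = random_matrix(rng, d_ff, d_model), [0.0] * d_model

Y = position_wise_ffn(X, W1, b1, W2, b2)
print(len(Y), len(Y[0]))  # rows and columns of the result
```

Because each row is processed on its own, changing one token's representation leaves the FFN output at every other position unchanged; any mixing between tokens happens only in the attention step.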