The Mathematical Foundation of Multi-Head Attention and Position-wise Feed-Forward Networks

An exploration of the linear algebraic operations and projection spaces that enable Transformers to process global dependencies. This lesson decomposes the attention mechanism into its constituent query, key, and value transformations.

The core intuition behind the Attention mechanism is the ability to dynamically assign weight to different parts of an input sequence based on their relevance to a specific token. Rather than treating a sequence as a static vector, attention allows the model to 'look' at other tokens to derive context. Mathematically, this is framed as a retrieval system where a 'Query' seeks information from a set of 'Keys', and the resulting alignment determines how much of the corresponding 'Value' is extracted.

To formalize this, we start with Scaled Dot-Product Attention. Given an input matrix $X \in \mathbb{R}^{n \times d}$, we project it into three distinct spaces using learned weight matrices $W^Q, W^K, W^V \in \mathbb{R}^{d \times d_k}$. This yields the Query, Key, and Value matrices: $Q = XW^Q$, $K = XW^K$, and $V = XW^V$. The attention weights are computed via the softmax of the scaled dot product between queries and keys: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ Dividing by $\sqrt{d_k}$ is critical: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$, so unscaled scores grow in magnitude with the head dimension and push the softmax into saturated regions where gradients vanish.
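The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the sizes `n`, `d`, `d_k`, the random weights, and the helper names `softmax` and `scaled_dot_product_attention` are all assumptions chosen for the example.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (n, d_k); V: (n, d_k). Returns the output and the weight matrix."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) query-key alignment scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights

# Illustrative sizes and random weights (assumptions, not from the lesson).
rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 8
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) / np.sqrt(d) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.sum(axis=-1))
```

Each row of `weights` sums to one, so every output row is a convex combination of the rows of `V`.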

Multi-Head Attention (MHA) extends this concept by allowing the model to jointly attend to information from different representation subspaces. Instead of one large attention operation, we perform $h$ parallel attention operations (heads). Each head $i$ has its own set of projections $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times d_k}$. The output of the MHA layer is the concatenation of these heads, projected back into the original dimension $d$ via a final learnable matrix $W^O \in \mathbb{R}^{hd_k \times d}$. For self-attention over the input $X$: $$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$ where each $\text{head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V)$.
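A hedged NumPy sketch of the multi-head computation follows, restricted to the self-attention case where queries, keys, and values are all projected from the same input `X`. The per-head weights are stacked along a leading head axis so all heads run in one batched matrix product. The head width `d_k = d // h` is a common choice assumed here, and the function name `multi_head_self_attention` is hypothetical.

```python
import numpy as np

def softmax(scores, axis=-1):
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O):
    """X: (n, d); W_Q, W_K, W_V: (h, d, d_k); W_O: (h * d_k, d)."""
    n = X.shape[0]
    h, _, d_k = W_Q.shape
    # Matmul broadcasts X over the head axis, giving (h, n, d_k) per projection.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n)
    heads = softmax(scores, axis=-1) @ V               # (h, n, d_k)
    # Concatenate the heads along the feature axis: (n, h * d_k).
    concat = heads.transpose(1, 0, 2).reshape(n, h * d_k)
    return concat @ W_O                                # back to (n, d)

# Illustrative sizes and random weights (assumptions, not from the lesson).
rng = np.random.default_rng(0)
n, d, h = 4, 8, 2
d_k = d // h
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(h, d, d_k)) / np.sqrt(d) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d)) / np.sqrt(h * d_k)

out = multi_head_self_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)
```

Stacking the heads into one array is only an implementation convenience; looping over the heads and concatenating the results gives the same output.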

While MHA handles the global interaction between tokens, it applies no position-wise non-linear transformation to the individual token representations. This is where the Position-wise Feed-Forward Network (FFN) comes into play. The FFN is applied to each position identically and independently. It consists of two linear transformations with a non-linear activation (typically ReLU or GELU) sandwiched between them. With ReLU, this reads: $$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$ The first transformation projects the token into a higher-dimensional space (often $4 \times d$) to extract complex features, and the second projects it back to the model dimension $d$.
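As a small sketch of the ReLU form of the FFN, the NumPy code below applies the same two-layer network to every row of `X`. The hidden width `4 * d`, the random weights, and the name `position_wise_ffn` are assumptions made for illustration.

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """X: (n, d); W1: (d, d_ff); W2: (d_ff, d). The same weights serve every row."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # expand to d_ff and apply ReLU
    return hidden @ W2 + b2                 # project back to d

# Illustrative sizes and random weights (assumptions, not from the lesson).
rng = np.random.default_rng(0)
n, d = 4, 8
d_ff = 4 * d
X = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, d_ff)) / np.sqrt(d), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)) / np.sqrt(d_ff), np.zeros(d)

Y = position_wise_ffn(X, W1, b1, W2, b2)

# Position-wise: processing one token alone gives the same row as the full pass.
single = position_wise_ffn(X[2:3], W1, b1, W2, b2)[0]
print(Y.shape, np.allclose(Y[2], single))
```

Because no operation mixes rows, the FFN can be evaluated on all positions in parallel as a single pair of matrix products.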

The synergy between MHA and FFNs is fundamental to the Transformer's power. MHA acts as a 'communication' layer, where tokens exchange information across the sequence, while the FFN acts as a 'processing' layer, refining the representation of each token based on the information it gathered during the attention phase. Stacked in alternation, the two give the model enough expressive power to capture both local textual patterns and long-range semantic dependencies.
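One way to see the 'communication' versus 'processing' split is to perturb a single token and check which output rows change. The sketch below uses a single attention head and a ReLU FFN with random weights; all sizes, weights, and function names are assumptions for illustration only.

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def ffn(X, W1, b1, W2, b2):
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2

def changed_rows(A, B):
    # True for every row where the two outputs differ beyond float tolerance.
    return np.any(~np.isclose(A, B), axis=-1)

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
W1, b1 = rng.normal(size=(d, 4 * d)) / np.sqrt(d), np.zeros(4 * d)
W2, b2 = rng.normal(size=(4 * d, d)) / np.sqrt(4 * d), np.zeros(d)

# Perturb only the first token.
X_perturbed = X.copy()
X_perturbed[0] += 1.0

attn_changed = changed_rows(self_attention(X, W_Q, W_K, W_V),
                            self_attention(X_perturbed, W_Q, W_K, W_V))
ffn_changed = changed_rows(ffn(X, W1, b1, W2, b2),
                           ffn(X_perturbed, W1, b1, W2, b2))
print("attention rows changed:", attn_changed)
print("FFN rows changed:      ", ffn_changed)
```

The attention output changes at positions other than the perturbed one, because the altered token contributes a different key and value to every query. The FFN output changes only in the perturbed row.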

Finally, to stabilize training and ease the flow of gradients through deep architectures, these components are wrapped in residual connections and Layer Normalization. Given a sublayer function $f(x)$ (either MHA or the FFN), the output is $\text{LayerNorm}(x + f(x))$. The residual path gives gradients a direct route to earlier layers, which mitigates the vanishing gradient problem, and it makes identity mappings easy to represent, so the original input signal is preserved while the network learns the necessary incremental refinements.
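The wrapper $\text{LayerNorm}(x + f(x))$ can be sketched as follows. The learnable scale `gamma` and shift `beta`, the `eps` constant, and the stand-in sublayer `f` are assumptions for this example; in a real model `f` would be the attention or feed-forward sublayer.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token (row) across its feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def residual_sublayer(x, f, gamma, beta):
    # Add the sublayer output to its input, then normalize the sum.
    return layer_norm(x + f(x), gamma, beta)

# Illustrative sizes and a stand-in sublayer (assumptions, not from the lesson).
rng = np.random.default_rng(0)
n, d = 4, 8
x = rng.normal(size=(n, d))
W = rng.normal(size=(d, d)) / np.sqrt(d)
f = lambda v: np.tanh(v @ W)

gamma, beta = np.ones(d), np.zeros(d)
y = residual_sublayer(x, f, gamma, beta)
print(y.mean(axis=-1), y.std(axis=-1))
```

With `gamma` set to ones and `beta` to zeros, every output row has approximately zero mean and unit standard deviation, regardless of the scale of `f(x)`.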