The Mathematical Foundations of Multi-Head Attention and Position-wise Feed-Forward Networks

An analytical exploration of the Transformer's core components, focusing on the linear projections and non-linearities that enable global context capture. We examine the transition from single-head scaled dot-product attention to the parallelized multi-head architecture.

At its core, the Transformer architecture replaces recurrence with a mechanism called Attention. The intuition is simple: for any given word (or token) in a sequence, the model should dynamically determine which other tokens are most relevant to its meaning. Mathematically, this is treated as a retrieval problem where a 'Query' searches for matching 'Keys' to extract information from corresponding 'Values'. This allows the model to capture global dependencies regardless of the distance between tokens in the input sequence.

The fundamental building block is Scaled Dot-Product Attention. Given an input matrix $X \in \mathbb{R}^{n \times d}$, we project it into three distinct spaces using learned weight matrices $W^Q, W^K, W^V \in \mathbb{R}^{d \times d_k}$. The resulting matrices are $Q = XW^Q$, $K = XW^K$, and $V = XW^V$. The attention scores are given by the matrix product $QK^T$, whose entry $(i, j)$ is the dot product of the $i$-th query with the $j$-th key, scaled by $1/\sqrt{d_k}$. Without this scaling, the scores grow in magnitude with $d_k$ and push the softmax into saturated regions where its gradients become extremely small: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
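The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the sizes `n`, `d`, `d_k`, the random weights, and the helper names `softmax` and `scaled_dot_product_attention` are assumptions chosen for the example.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (n, d_k). Returns the (n, d_k) output and the (n, n) weights.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # entry (i, j) = q_i . k_j / sqrt(d_k)
    weights = softmax(scores, axis=-1)   # softmax over keys, one row per query
    return weights @ V, weights

# Illustrative sizes and random weights (assumptions, not from the text).
rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 8
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))

Z, A = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(Z.shape, A.shape)  # (4, 8) (4, 4)
```

The function returns the attention weights alongside the output so that they can be inspected; a production implementation would typically also handle batching and masking, which are omitted here.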

Single-head attention is limiting because it forces the model to average all relationship types into one distribution. Multi-Head Attention (MHA) solves this by running $h$ attention 'heads' in parallel. Each head $i$ has its own set of weight matrices $W_i^Q, W_i^K, W_i^V$ and computes $\text{head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V)$. This allows the model to simultaneously attend to different types of information. For instance, one head might focus on syntactic dependencies while another tracks semantic coreference. The outputs of all heads are concatenated and projected back to the original dimension by a learned output matrix $W^O$: $$\text{MHA}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$
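A minimal NumPy sketch of the concatenate-and-project step follows. It assumes the common convention $d_k = d / h$ (which the text does not state), stores the per-head weights as stacked arrays of shape `(h, d, d_k)`, and loops over heads for readability rather than speed. All names and sizes are illustrative.

```python
import numpy as np

def softmax(scores, axis=-1):
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # X: (n, d); W_Q, W_K, W_V: (h, d, d_k); W_O: (h * d_k, d).
    heads = []
    for W_q, W_k, W_v in zip(W_Q, W_K, W_V):
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        heads.append(A @ V)                       # one head: (n, d_k)
    concat = np.concatenate(heads, axis=-1)       # (n, h * d_k)
    return concat @ W_O                           # back to (n, d)

# Illustrative sizes (assumptions): d_k = d / h.
rng = np.random.default_rng(0)
n, d, h = 5, 16, 4
d_k = d // h
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(h, d, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (5, 16)
```

With $d_k = d / h$, the concatenated heads have the same width as the input, so the total cost is comparable to a single head of full dimension.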

Crucially, the softmax is applied row-wise, so each row of the attention matrix is a probability distribution over the $n$ positions in the sequence. Let $A = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})$. Each element $A_{ij}$ is the weight the $i$-th token assigns to the $j$-th token, with $A_{ij} \geq 0$ and $\sum_{j=1}^{n} A_{ij} = 1$. The output for token $i$ is a weighted sum of the value vectors: $Z_i = \sum_{j=1}^{n} A_{ij} V_j$, where $V_j$ is the $j$-th row of $V$. This linear combination transforms the static embedding into a context-aware representation, effectively 'mixing' information across the sequence.
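The two properties described here, that each row of $A$ sums to one and that $Z_i$ is a weighted sum of the rows of $V$, can be checked numerically. The sketch below uses small random matrices as stand-ins for $Q$, $K$, and $V$; the sizes and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_k = 3, 4
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))

# Row-wise softmax of the scaled scores.
scores = Q @ K.T / np.sqrt(d_k)
scores -= scores.max(axis=-1, keepdims=True)
A = np.exp(scores)
A /= A.sum(axis=-1, keepdims=True)

row_sums = A.sum(axis=-1)          # each row should sum to 1

# Explicit weighted sum for token i, compared with the matrix form A @ V.
i = 0
Z_i = sum(A[i, j] * V[j] for j in range(n))
Z = A @ V

print(np.allclose(row_sums, 1.0))  # True
print(np.allclose(Z_i, Z[i]))      # True
```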

Following the attention mechanism, the Transformer employs a Position-wise Feed-Forward Network (FFN). While attention handles the interactions *between* tokens, the FFN processes each token *individually* and identically. The intuition is to apply a non-linear transformation to the context-aware embeddings, allowing the model to learn complex higher-order features. It consists of two linear transformations with a ReLU activation in between: $$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$
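The FFN formula translates almost directly into NumPy. The sketch below is illustrative only: the function name `ffn`, the sizes, and the random parameters are assumptions, and `x` may be a single token vector or a matrix with one token per row.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each row of x.
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU after the first projection
    return hidden @ W2 + b2

# Illustrative sizes and random parameters (assumptions).
rng = np.random.default_rng(0)
n, d_model, d_ff = 4, 8, 32
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

Z = rng.normal(size=(n, d_model))   # stand-in for the attention output
out = ffn(Z, W1, b1, W2, b2)
print(out.shape)  # (4, 8)
```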

The FFN is typically designed to project the dimension $d_{model}$ up to a larger hidden dimension $d_{ff}$ (often $4\times$ larger) and then project it back down, so $W_1 \in \mathbb{R}^{d_{model} \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}$. This 'expansion-contraction' cycle is the reverse of a bottleneck: the wider hidden layer gives the non-linearity more room to form features, and the second projection condenses them back into $d_{model}$ dimensions. Because this operation is applied to each position independently, it can be computed as a single large matrix multiplication over all positions in the batch, which maps efficiently onto modern GPU hardware.
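To illustrate the position-wise property, the following sketch compares a per-token loop with a single flattened matrix multiplication over every position in a batch. The batch size, sequence length, and the choice $d_{ff} = 4 \cdot d_{model}$ are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n, d_model = 2, 5, 8
d_ff = 4 * d_model                      # assumed 4x expansion

W1, b1 = rng.normal(size=(d_model, d_ff)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), rng.normal(size=d_model)
X = rng.normal(size=(batch, n, d_model))

# Batched form: flatten all positions into rows and use one matmul per layer.
flat = X.reshape(-1, d_model)                               # (batch * n, d_model)
batched = (np.maximum(0.0, flat @ W1 + b1) @ W2 + b2).reshape(batch, n, d_model)

# Reference form: apply the same FFN to each position separately.
looped = np.empty_like(X)
for b in range(batch):
    for t in range(n):
        looped[b, t] = np.maximum(0.0, X[b, t] @ W1 + b1) @ W2 + b2

print(np.allclose(batched, looped))  # True
```

Because no position reads from any other, the two forms agree up to floating-point rounding, and the flattened version is the one that benefits from hardware-accelerated matrix multiplication.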