
The mathematical foundation of Multi-Head Attention and Position-wise Feed-Forward Networks

An exploration into the linear algebraic operations and dimensionality transformations that enable the Transformer architecture to capture global dependencies. This lesson dissects the interaction between query, key, and value projections and the role of point-wise non-linearity.


At its core, the Attention mechanism is designed to allow a model to focus on different parts of an input sequence dynamically. Instead of compressing a sequence into a single fixed vector, we treat it as a set of 'conceptual lookups.' Imagine a library: you have a query (what you are looking for), keys (the labels on the spines of books), and values (the actual content inside those books). The goal is to calculate a weighted sum of values based on how well your query matches the available keys. Because every position can attend directly to every other position, the model can capture long-range dependencies regardless of their distance in the sequence, avoiding the long chains of recurrent steps that cause 'vanishing gradient' issues in recurrent architectures.

Mathematically, we begin with Scaled Dot-Product Attention. Given an input matrix $X \in \mathbb{R}^{n \times d}$, we project it into three distinct spaces using weight matrices $W^Q, W^K, W^V \in \mathbb{R}^{d \times d_k}$. This yields the Query, Key, and Value matrices: $Q = XW^Q, K = XW^K, V = XW^V$. The attention scores are the pairwise dot products $QK^T$, divided by $\sqrt{d_k}$. Without this scaling, the dot products grow in magnitude as $d_k$ increases and push the softmax into saturated regions where its gradients become extremely small. The formulation is expressed as: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
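The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the sizes `n`, `d`, `d_k`, the random weights, and the function names `softmax` and `scaled_dot_product_attention` are all assumptions chosen for the example, and masking and batching are omitted.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row-wise max for numerical stability; the result is unchanged.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)    # (n, n) query-key similarities
    weights = softmax(scores, axis=-1)   # each row is a distribution over positions
    return weights @ V, weights          # weighted sum of values, plus the weights

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 3
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(output.shape)          # one d_k-dimensional vector per position
print(weights.sum(axis=-1))  # every row of attention weights sums to 1
```

Each row of `weights` tells you how strongly one position draws on every other position when building its output vector.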

Multi-Head Attention (MHA) extends this concept by performing the attention process multiple times in parallel. The intuition is that a single attention head produces only one set of attention weights per position, so it tends to capture one kind of relationship at a time (e.g., syntactic dependency). By using $h$ different heads, the model can simultaneously attend to different representation subspaces: one head might track subject-verb agreement while another tracks coreference. Each head $i$ has its own set of learnable projections $W_i^Q, W_i^K, W_i^V$. The outputs of these heads are concatenated and then projected back to the original model dimension using a final weight matrix $W^O \in \mathbb{R}^{h d_k \times d}$: $$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$ where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$. In this formula the arguments $Q$, $K$, and $V$ denote the unprojected inputs to the layer; in self-attention all three are equal to $X$.
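A minimal NumPy sketch of the concatenate-and-project step for self-attention follows. It assumes each head works in a subspace of size `d_k = d_model // h`, a common convention that the text above does not state; the function names, sizes, and random weights are hypothetical and only serve the illustration.

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over the last axis.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax((Q @ K.T) / np.sqrt(d_k)) @ V

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # W_Q, W_K, W_V: lists of h per-head matrices, each of shape (d_model, d_k).
    # W_O: output projection of shape (h * d_k, d_model).
    heads = [attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O

# Hypothetical sizes; d_k = d_model // h is an assumption of this sketch.
rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
d_k = d_model // h
X = rng.normal(size=(n, d_model))
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)  # same shape as X: (n, d_model)
```

The explicit loop over heads keeps the correspondence with the formula visible. Practical implementations usually fuse the per-head projections into single larger matrices and reshape instead.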

Following the attention mechanism, the Transformer employs a Position-wise Feed-Forward Network (FFN). While Attention allows tokens to 'communicate' with each other, the FFN allows the model to 'process' the information extracted from that communication. It is 'position-wise' because the same transformation, with the same weights, is applied to each token independently across the sequence dimension. This effectively acts as a local feature extractor that transforms the representation of each token based on the context it gathered during the attention phase.

The FFN consists of two linear transformations with a non-linear activation function (typically ReLU or GELU) in between. If $x$ is the input vector for a single position, the ReLU variant of the transformation is defined as: $$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$ Here, $W_1$ typically projects the dimension from $d_{model}$ to a larger hidden dimension $d_{ff}$ (often $4 \times d_{model}$), and $W_2$ projects it back. This expansion-contraction architecture allows the model to project the data into a higher-dimensional space to find more complex patterns before compressing it back.
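The ReLU form of the FFN and its position-wise behaviour can be checked with a short NumPy sketch. The sizes and random weights are assumptions for illustration, with `d_ff = 4 * d_model` following the common ratio mentioned above.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # ReLU variant: expand to d_ff, apply max(0, .), then project back to d_model.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n, d_model = 5, 16
d_ff = 4 * d_model
X = rng.normal(size=(n, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = ffn(X, W1, b1, W2, b2)       # all positions at once: (n, d_model)
single = ffn(X[2], W1, b1, W2, b2) # position 2 processed on its own

print(out.shape)
print(np.allclose(out[2], single))  # the row does not depend on other positions
```

Because each output row depends only on the matching input row, the whole sequence can be processed as one matrix multiplication without any interaction between tokens.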

Integrating these components creates a powerful duality: MHA handles the global relationships between positions (the 'where' and 'what' of the sequence), while the FFN handles the point-wise semantic transformation (the 'meaning' of the resulting vector). Together, these operations give the network high representational capacity. On their own, both operations are permutation-equivariant rather than order-aware: reordering the input tokens simply reorders the outputs in the same way, which is why Transformers add positional information to their inputs. This mathematical structure allows the Transformer to be highly parallelizable, moving away from the sequential bottleneck of RNNs and enabling the scale of modern Large Language Models.