The Mathematical Foundation of Multi-Head Attention and Position-wise Feed-Forward Networks

At its core, the Transformer architecture replaces recurrence with a mechanism called Attention. The intuition is simple: not all parts of an input sequence are equally relevant to a specific word. By projecting the input into three distinct spaces—Queries, Keys, and Values—the model can dynamically determine which parts of the sequence to 'attend' to. Imagine a database retrieval system where the Query is what you are looking for, the Key is the index of the stored information, and the Value is the actual content you retrieve based on the match between the Query and the Key.

Mathematically, we begin with an input matrix $X \in \mathbb{R}^{n \times d_{model}}$, where $n$ is the sequence length and $d_{model}$ is the embedding dimension. We define three weight matrices $W^Q, W^K, W^V \in \mathbb{R}^{d_{model} \times d_k}$. The projections are computed as $Q = XW^Q$, $K = XW^K$, and $V = XW^V$. The core operation is the Scaled Dot-Product Attention, formulated as: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ Here, the dot product $QK^T$ computes the raw similarity between every pair of tokens, and the row-wise softmax turns each token's similarities into weights that sum to 1. We divide by $\sqrt{d_k}$ because the magnitude of the dot products grows with $d_k$; without scaling, large scores push the softmax into saturated regions where its gradients become extremely small.
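The formula above can be traced by hand on a tiny example. The following is a minimal sketch, not a production implementation: it uses plain Python lists to stay dependency-free, and the helper names (`matmul`, `transpose`, `softmax`, `attention`) as well as the identity projection matrices are assumptions chosen for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d_k)) V; also return the attention weights."""
    d_k = len(q[0])
    scores = matmul(q, transpose(k))  # (n, n) raw pairwise similarities
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy input: n = 3 tokens, d_model = d_k = 2.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Illustrative assumption: identity projections, so Q = K = V = X.
identity = [[1.0, 0.0], [0.0, 1.0]]
q, k, v = matmul(x, identity), matmul(x, identity), matmul(x, identity)

output, weights = attention(q, k, v)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's distribution over the sequence. With these inputs, the first token assigns equal weight to itself and to the third token (both have dot product 1 with it) and less weight to the second (dot product 0).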

Single-head attention is limited because it produces only one set of attention weights per layer, so the different relationships a token participates in are averaged into a single distribution. Multi-Head Attention (MHA) addresses this by running $h$ attention mechanisms in parallel. Each 'head' possesses its own set of learnable weights $W_i^Q, W_i^K, W_i^V$ and computes $\text{head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V)$. This allows the model to simultaneously attend to different aspects of the sequence: for instance, one head might track syntactic dependencies while another tracks semantic relationships. The outputs of these heads are concatenated and projected back to the original dimension: $$\text{MHA}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$ where $W^O \in \mathbb{R}^{hd_k \times d_{model}}$.
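The concatenate-and-project step is easiest to follow by tracking shapes. The sketch below is an illustration under stated assumptions rather than a reference implementation: weights are random instead of learned, $d_k = d_{model} / h$ is used as a common choice, and the helper names (`random_matrix`, `multi_head_attention`) are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix (assumption for this sketch)."""
    return [[rng.gauss(0.0, 1.0) / math.sqrt(rows) for _ in range(cols)]
            for _ in range(rows)]

def multi_head_attention(x, heads, w_o):
    """heads is a list of (W_i^Q, W_i^K, W_i^V) triples; w_o has shape (h * d_k, d_model)."""
    head_outputs = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate along the feature axis: each position ends up with h * d_k features.
    concat = [[value for head in head_outputs for value in head[i]]
              for i in range(len(x))]
    return matmul(concat, w_o)

rng = random.Random(0)
n, d_model, h = 4, 8, 2
d_k = d_model // h  # assumed split of the model dimension across heads

x = random_matrix(n, d_model, rng)
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
w_o = random_matrix(h * d_k, d_model, rng)

out = multi_head_attention(x, heads, w_o)
print(len(out), len(out[0]))  # sequence length and model dimension are preserved
```

Because each head works in a $d_k$-dimensional subspace and $hd_k = d_{model}$ here, the output has the same shape as the input.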

Once the attention mechanism has aggregated information from across the sequence, the model needs to process the result at each position independently. This is achieved via the Position-wise Feed-Forward Network (FFN). While the attention layer allows tokens to 'talk' to each other, the FFN transforms each token's resulting representation non-linearly, without looking at any other token. It is applied to each position $i$ identically and independently, meaning it operates as a single MLP whose weights are shared across all sequence positions.

The FFN consists of two linear transformations with a non-linear activation function (typically ReLU or GELU) in between. With ReLU, the expression for the FFN applied to the vector $x$ at position $i$ is: $$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$ Here, $W_1 \in \mathbb{R}^{d_{model} \times d_{ff}}$ expands the dimensionality (typically by a factor of 4, so $d_{ff} = 4d_{model}$), and $W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}$ projects it back. This expansion-contraction structure allows the model to project the data into a higher-dimensional space to capture more complex patterns before compressing it back.
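To make "identically and independently" concrete, here is a small sketch of the ReLU form of the FFN applied row by row. The weights are hand-picked for readability rather than learned, $d_{ff}$ is only twice $d_{model}$ to keep the example short, and the function names are assumptions for illustration.

```python
def linear(x, w, b):
    """Compute x W + b for a single position vector x."""
    return [sum(xi * wij for xi, wij in zip(x, col)) + bj
            for col, bj in zip(zip(*w), b)]

def ffn(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for one position."""
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]  # ReLU
    return linear(hidden, w2, b2)

def position_wise_ffn(seq, w1, b1, w2, b2):
    """Apply the same FFN, with the same weights, to every position."""
    return [ffn(x, w1, b1, w2, b2) for x in seq]

# Hand-picked weights with d_model = 2 and d_ff = 4 (illustrative values only).
w1 = [[1.0, -1.0, 0.5, 0.0],
      [0.0, 1.0, -0.5, 2.0]]
b1 = [0.0, 0.0, 0.0, -1.0]
w2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0],
      [0.5, -0.5]]
b2 = [0.1, -0.1]

# Positions 0 and 2 hold the same vector, so they must map to the same output.
seq = [[1.0, 2.0], [3.0, -1.0], [1.0, 2.0]]
out = position_wise_ffn(seq, w1, b1, w2, b2)
print(out)
```

Since no information crosses between rows, identical input vectors always yield identical outputs regardless of where they sit in the sequence; any dependence on context has to come from the attention layer.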

The synergy between MHA and FFNs is the key to the Transformer's power. MHA acts as a dynamic routing mechanism that re-weights information based on context, while the FFN acts as a per-token processor that transforms each position's representation on its own. Together, wrapped in residual connections and layer normalization, they help the signal flow efficiently through the network, allowing the model to learn deep, hierarchical representations of language while largely avoiding the vanishing gradient problems seen in RNNs.
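The wrapping mentioned above can be sketched as $\text{LayerNorm}(x + \text{Sublayer}(x))$. This is a simplified illustration under several assumptions: it normalizes after the residual addition (one common arrangement), omits layer normalization's learnable scale and shift, and uses two toy stand-ins (`toy_mixing`, `toy_ffn`) in place of real attention and FFN sublayers.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one position vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(seq, sublayer):
    """Compute LayerNorm(x + sublayer(x)) for every position."""
    transformed = sublayer(seq)
    return [layer_norm([a + b for a, b in zip(x, t)])
            for x, t in zip(seq, transformed)]

def toy_mixing(seq):
    """Stand-in for attention: every position receives the sequence average."""
    n = len(seq)
    mean = [sum(col) / n for col in zip(*seq)]
    return [list(mean) for _ in seq]

def toy_ffn(seq):
    """Stand-in for the FFN: an element-wise ReLU applied per position."""
    return [[max(0.0, v) for v in x] for x in seq]

seq = [[1.0, 2.0, 3.0, 4.0], [4.0, 0.0, -2.0, 1.0]]

# One layer: a token-mixing sublayer followed by a per-position sublayer.
out = residual_block(residual_block(seq, toy_mixing), toy_ffn)
for row in out:
    print([round(v, 3) for v in row])
```

The residual addition gives each sublayer's input a direct path to its output, which is what lets gradients bypass the sublayer when many such layers are stacked.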