The Mathematical Foundation of Multi-Head Attention and Position-wise Feed-Forward Networks

At its core, the attention mechanism solves the problem of 'contextualization'. In traditional RNNs, everything the model knows about the preceding sequence must be compressed into a single fixed-size hidden state, creating a bottleneck. Attention instead allows a model to dynamically weigh the importance of different parts of the input sequence. Imagine a library where you don't just look at the last book you touched, but can instantly glance at every relevant page across all books to answer a specific query. This is achieved by projecting each input token into three vectors with distinct roles: the Query ($Q$), the Key ($K$), and the Value ($V$).

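Mathematically, for a given input matrix $X \in \mathbb{R}^{n \times d}$, where $n$ is the sequence length and $d$ is the embedding dimension, we derive $Q = XW^Q$, $K = XW^K$, and $V = XW^V$ using learned weight matrices $W^Q, W^K, W^V \in \mathbb{R}^{d \times d_k}$. The Scaled Dot-Product Attention is then defined as: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ where the softmax is applied to each row. The term $QK^T$ computes the pairwise similarity between all queries and keys. Dividing by $\sqrt{d_k}$ keeps these scores at a comparable scale as $d_k$ grows: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$, and unscaled scores of that magnitude push the softmax into saturated regions where its gradients become vanishingly small.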
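The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the random weights, the sizes `n`, `d`, `d_k`, and the helper names `softmax` and `scaled_dot_product_attention` are assumptions made for the example.

```python
import numpy as np

def softmax(scores):
    # Subtract the row-wise maximum for numerical stability;
    # this does not change the result of the softmax.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n) pairwise query-key similarities
    weights = softmax(scores)         # each row is a distribution over positions
    return weights @ V, weights       # weighted average of the value vectors

# Illustrative sizes and random (untrained) weights, assumed for this sketch.
rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 8
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(output.shape)    # (4, 8)
print(weights.shape)   # (4, 4)
```

Each row of `weights` sums to 1, so every output row is a convex combination of the rows of `V`.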

Single-head attention is limiting because each query produces only one attention distribution per layer, so different kinds of relationships are averaged together. Multi-Head Attention (MHA) overcomes this by running $h$ attention mechanisms in parallel. Each 'head' uses a different set of learned projection matrices $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times d_k}$ (commonly with $d_k = d/h$), so that, for example, one head can attend to syntactic dependencies (e.g., subject-verb agreement) while another attends to semantic relationships (e.g., coreference). Writing $Q$, $K$, $V$ here for the inputs to the layer before any projection (in self-attention, all three are $X$), the output of each head $i$ is calculated as $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$, and the final result is a linear projection of the concatenation by $W^O \in \mathbb{R}^{h d_k \times d}$: $$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$.
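As a rough sketch of the multi-head formula for the self-attention case, the per-head projections can be stored as stacked arrays and evaluated in one batched matrix product. The stacked layout `(h, d, d_k)`, the choice `d_k = d / h`, the random weights, and the function name `multi_head_attention` are all assumptions of this example, not something the text prescribes.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """Self-attention with h heads.

    X: (n, d); W_Q, W_K, W_V: (h, d, d_k); W_O: (h * d_k, d).
    """
    h, _, d_k = W_Q.shape
    # matmul broadcasts X across the head axis: each result is (h, n, d_k).
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n)
    heads = softmax(scores) @ V                        # (h, n, d_k)
    # Concat(head_1, ..., head_h): place the heads side by side per token.
    concat = heads.transpose(1, 0, 2).reshape(X.shape[0], h * d_k)
    return concat @ W_O                                # (n, d)

# Illustrative sizes and random (untrained) weights, assumed for this sketch.
rng = np.random.default_rng(0)
n, d, h = 5, 8, 2
d_k = d // h
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(h, d, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)   # (5, 8)
```

With $d_k = d/h$, the total cost stays close to that of a single head operating on the full dimension $d$, which is one reason this split is a common choice.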

Once the attention mechanism has aggregated global context, the model requires a way to process this information locally for each token. This is the role of the Position-wise Feed-Forward Network (FFN). While attention handles the interaction *between* tokens, the FFN handles the transformation *within* each token. It is 'position-wise' because the same transformation, with the same weights, is applied independently to every vector in the sequence: it behaves like a small multilayer perceptron that sees one token at a time and never mixes information across positions.

The FFN consists of two linear transformations with a non-linear activation function (typically ReLU or GELU) in between. With ReLU, it is expressed as: $$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$. Here, $W_1 \in \mathbb{R}^{d \times d_{ff}}$ projects the dimension $d$ to a higher-dimensional space $d_{ff}$ (often $4 \times d$), and $W_2 \in \mathbb{R}^{d_{ff} \times d}$ projects it back down to $d$. This 'expansion-contraction' architecture lets the model map each token into a higher-dimensional space where non-linear patterns are easier to separate and extract.
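The ReLU form of the FFN translates almost directly into NumPy. The sketch below uses random weights and illustrative sizes (`n`, `d`, and `d_ff = 4 * d` are assumptions of the example), and then checks the position-wise property: applying the network to one token alone gives the same result as that token's row when the whole sequence is processed.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Expand to d_ff, apply ReLU, then contract back to d.
    hidden = np.maximum(0.0, x @ W1 + b1)
    return hidden @ W2 + b2

# Illustrative sizes and random (untrained) weights, assumed for this sketch.
rng = np.random.default_rng(0)
n, d = 5, 8
d_ff = 4 * d
X = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)

Y = ffn(X, W1, b1, W2, b2)
print(Y.shape)   # (5, 8)

# Position-wise check: token 2 processed alone matches row 2 of the batch,
# and permuting the tokens simply permutes the outputs.
single = ffn(X[2], W1, b1, W2, b2)
perm = rng.permutation(n)
print(np.allclose(Y[2], single))
print(np.allclose(ffn(X[perm], W1, b1, W2, b2), Y[perm]))
```

Because no operation in `ffn` mixes rows of `X`, the sequence dimension acts purely as a batch dimension.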

The synergy between MHA and FFNs creates the Transformer's power. MHA acts as a dynamic routing system, rearranging information based on current context, while FFNs can be thought of as a knowledge base, processing that routed information into a more refined representation. Together, these components give the model a global receptive field in every layer while performing complex, position-wise non-linear feature transformations, and because neither component processes tokens sequentially, both remain highly parallelizable on hardware such as GPUs.