The Mathematical Foundation of Multi-Head Attention and Position-wise Feed-Forward Networks

The core intuition behind Multi-Head Attention (MHA) is the ability to attend to different parts of a sequence simultaneously from different 'representational subspaces.' With a single attention head, each token has only one set of attention weights with which to relate to the others. However, a word might have a syntactic relationship with one token (e.g., a verb following a noun) and a semantic relationship with another (e.g., a pronoun referring back to a subject). By using multiple 'heads,' the model can represent these distinct relationship types in parallel, allowing the network to capture a richer set of dependencies than a single attention head could easily encode.

Mathematically, we begin with Scaled Dot-Product Attention. Given an input matrix $X$, we project it into three distinct spaces, Queries ($Q = XW^Q$), Keys ($K = XW^K$), and Values ($V = XW^V$), using learned weight matrices $W^Q, W^K, W^V$. The attention scores are the dot products of queries and keys, divided by $\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors. Without this scaling, the dot products grow in magnitude with $d_k$ and push the softmax into saturated regions where its gradients become vanishingly small: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ Here, the term $QK^T$ generates a compatibility matrix in which each entry $(i, j)$ scores the relevance of token $j$ to token $i$. The softmax is applied row-wise, so the weights that token $i$ assigns to all tokens sum to 1.
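
The formula can be traced step by step in a small example. The following is a minimal sketch in plain Python, using lists of rows instead of a tensor library to stay dependency-free; the helper names (`matmul`, `softmax`, `scaled_dot_product_attention`) and the toy values of $Q$, $K$, and $V$ are illustrative assumptions, not part of any particular implementation.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]          # K^T
    scores = matmul(q, k_t)                       # entry (i, j): query i . key j
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]    # row-wise softmax
    return matmul(weights, v), weights

# Toy input: 3 tokens, d_k = 2 (values chosen arbitrarily for illustration).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = scaled_dot_product_attention(Q, K, V)
for i, row in enumerate(weights):
    print(f"token {i} attends with weights {[round(w, 3) for w in row]}")
```

Each row of `weights` is a probability distribution over the tokens, and each row of `output` is the corresponding weighted average of the rows of `V`.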

Multi-Head Attention extends this by splitting the model dimension $d_{model}$ across $h$ heads, each with dimension $d_k = d_{model} / h$. Each head $i$ has its own projection matrices $W_i^Q, W_i^K, W_i^V$ and performs the attention operation independently: $$\text{head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V)$$ The outputs of these $h$ heads are then concatenated along the feature dimension and projected back into the original model dimension using a final linear transformation $W^O$. This ensures that the output remains compatible with subsequent layers while combining the distinct perspectives captured by each head: $$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$
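
The split, attend, concatenate, and project steps can be sketched as follows. This is an illustrative example in plain Python under stated assumptions: the weights are random stand-ins for learned parameters, the sizes (`seq_len = 4`, `d_model = 8`, `h = 2`) are arbitrary, and the helper names are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for a single head."""
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)

def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix (assumption for the demo)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, w_q, w_k, w_v, w_o):
    """w_q, w_k, w_v: one (d_model x d_k) matrix per head; w_o: (h*d_k x d_model)."""
    heads = [
        attention(matmul(x, wq), matmul(x, wk), matmul(x, wv))
        for wq, wk, wv in zip(w_q, w_k, w_v)
    ]
    # Concatenate the head outputs along the feature axis, token by token.
    concat = [[val for head in heads for val in head[t]] for t in range(len(x))]
    return matmul(concat, w_o)

rng = random.Random(0)
seq_len, d_model, h = 4, 8, 2
d_k = d_model // h

x = random_matrix(seq_len, d_model, rng)
w_q = [random_matrix(d_model, d_k, rng) for _ in range(h)]
w_k = [random_matrix(d_model, d_k, rng) for _ in range(h)]
w_v = [random_matrix(d_model, d_k, rng) for _ in range(h)]
w_o = random_matrix(h * d_k, d_model, rng)

out = multi_head_attention(x, w_q, w_k, w_v, w_o)
print(len(out), len(out[0]))  # output has the same shape as the input x
```

Because $h \cdot d_k = d_{model}$, the concatenated heads have the same width as the input, and the output can be passed to the next layer unchanged in shape.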

Once the attention mechanism has aggregated global context, the Transformer employs a Position-wise Feed-Forward Network (FFN). While attention captures relationships *between* tokens, the FFN is responsible for processing the information *within* each token. The 'position-wise' nature means the same two-layer network, with the same weights, is applied independently to every position in the sequence. This acts as a local processing unit that transforms the aggregated context into a higher-level representation, a role that some analyses liken to a key-value memory store for the model's learned knowledge.

The FFN consists of two linear transformations with a non-linear activation function (typically ReLU or GELU) in between. Mathematically, for a vector $x$ at a given position, the operation with a ReLU activation is: $$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$ The first transformation projects the input into a higher-dimensional space (typically $4 \times d_{model}$), and the second projects it back down to $d_{model}$. This expand-then-contract structure (the reverse of a bottleneck, since the hidden layer is wider than the input) allows the model to learn complex non-linear mappings and separates the 'attention-gathering' phase from the 'information-processing' phase.
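
The position-wise behaviour is easiest to see in code. The sketch below, in plain Python, applies the ReLU form of the FFN to each position in a short sequence; the random weights, the zero biases, and the sizes (`d_model = 4`, `d_ff = 16`) are assumptions made for illustration, and the function names are hypothetical.

```python
import random

def linear(x, w, b):
    """Return x W + b for one row vector x; W is stored as a list of rows."""
    return [
        sum(xi * wij for xi, wij in zip(x, col)) + bj
        for col, bj in zip(zip(*w), b)
    ]

def relu(v):
    """Element-wise max(0, v)."""
    return [max(0.0, a) for a in v]

def ffn(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position."""
    return linear(relu(linear(x, w1, b1)), w2, b2)

def position_wise_ffn(seq, w1, b1, w2, b2):
    """Apply the same FFN, with shared weights, to every position independently."""
    return [ffn(x, w1, b1, w2, b2) for x in seq]

rng = random.Random(0)
d_model = 4
d_ff = 4 * d_model  # the typical 4x expansion

# Random stand-ins for learned parameters (assumption for the demo).
w1 = [[rng.gauss(0.0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]
b1 = [0.0] * d_ff
w2 = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
b2 = [0.0] * d_model

seq = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(3)]
out = position_wise_ffn(seq, w1, b1, w2, b2)
print(len(out), len(out[0]))  # one d_model-sized vector per position
```

Since `position_wise_ffn` is just a loop over positions, the output at one position cannot depend on any other position; all mixing between tokens happens in the attention layer.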

In summary, the combination of MHA and FFNs allows the Transformer to balance global connectivity against local specialization. MHA computes context-dependent weighted averages of value vectors across the entire sequence, while the FFN applies the same learned non-linear transformation to each position to refine those features. Together, they form a powerful architecture where $X$ is first dynamically re-weighted based on context and then transformed by fixed, input-independent weights to extract deeper semantic meaning.