The Mathematical Foundation of Multi-Head Attention and Position-wise Feed-Forward Networks

The core intuition behind Multi-Head Attention (MHA) is that the model can attend to different parts of a sequence from different 'perspectives' simultaneously. A single attention head computes a weighted sum of values based on a single compatibility metric; multiple heads let the model capture diverse relationships, for example syntactic dependencies in one head and semantic associations in another. Mathematically, this is achieved by projecting the input embeddings into multiple subspaces, performing attention in each subspace in parallel, and concatenating the results.

The fundamental building block is Scaled Dot-Product Attention. Given an input matrix $X ∈ ℝ^{n × d}$, we derive three matrices, Queries ($Q = XW^Q$), Keys ($K = XW^K$), and Values ($V = XW^V$), using learned weight matrices $W^Q, W^K, W^V ∈ ℝ^{d × d_k}$. The attention mechanism computes compatibility scores via dot products between queries and keys, normalizes them with a row-wise softmax, and uses the result to weight the values: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ The scaling factor $\frac{1}{\sqrt{d_k}}$ is critical. If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$, so for large $d_k$ the unscaled scores grow large in magnitude and push the softmax into saturated regions where its gradients are extremely small.
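The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the sizes (`n = 5`, `d = 8`, `d_k = 4`), the random weights, and the helper names `softmax` and `scaled_dot_product_attention` are assumptions chosen for the example.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row-wise maximum before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) compatibility scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over positions
    return weights @ V, weights

# Illustrative sizes and random weights (assumptions, not values from the text).
rng = np.random.default_rng(0)
n, d, d_k = 5, 8, 4
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))

out, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
```

Here `out` has shape `(n, d_k)` and each row of `weights` sums to 1, so every output row is a convex combination of the value rows.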

Multi-Head Attention extends this by employing $h$ parallel heads. Each head $i$ has its own projection matrices $W_i^Q, W_i^K, W_i^V ∈ ℝ^{d × d_k}$. The operation is defined as: $$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$ where each $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$. In this formula $Q$, $K$, and $V$ denote the unprojected inputs to the layer (all equal to $X$ in self-attention). The final linear projection $W^O ∈ ℝ^{hd_k × d}$ maps the concatenated head outputs back to the model dimension $d$, allowing the network to combine information gathered from all heads.
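A small NumPy sketch of the multi-head computation follows. It loops over heads for readability, whereas practical implementations usually batch all heads into one tensor operation. The sizes, the choice $d_k = d/h$, the random weights, and the function names are assumptions made for illustration.

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax with max-subtraction for numerical stability.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V: lists of h matrices of shape (d, d_k); W_O: (h * d_k, d)."""
    heads = [
        attention(Q @ wq, K @ wk, V @ wv)   # head_i, shape (n, d_k)
        for wq, wk, wv in zip(W_Q, W_K, W_V)
    ]
    # Concatenate along the feature axis, then project back to dimension d.
    return np.concatenate(heads, axis=-1) @ W_O

# Illustrative sizes (assumptions): n = 5 tokens, d = 16, h = 4 heads, d_k = d / h.
rng = np.random.default_rng(0)
n, d, h = 5, 16, 4
d_k = d // h
X = rng.normal(size=(n, d))
W_Q = [rng.normal(size=(d, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d, d_k)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d))

# Self-attention: queries, keys, and values all come from the same input X.
Y = multi_head_attention(X, X, X, W_Q, W_K, W_V, W_O)
```

The output `Y` has the same shape as `X`, namely `(n, d)`, which is what allows the layer to be stacked.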

Following the attention mechanism, the Transformer applies a Position-wise Feed-Forward Network (FFN). While the attention mechanism handles the interaction between different tokens (inter-token communication), the FFN operates on each position independently (intra-token processing). Think of the MHA as the 'context gatherer' and the FFN as the 'knowledge processor' that transforms the gathered context into a higher-level representation.

Mathematically, the FFN consists of two linear transformations with a non-linear activation function (typically ReLU or GELU) in between. For an input $x ∈ ℝ^d$, the ReLU version is: $$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$ Here, $W_1 ∈ ℝ^{d × d_{ff}}$ typically projects the input to a larger space (e.g., $d_{ff} = 4d$), and $W_2 ∈ ℝ^{d_{ff} × d}$ projects it back to the original dimension, with biases $b_1 ∈ ℝ^{d_{ff}}$ and $b_2 ∈ ℝ^d$. The same weights are applied at every position. This expansion and contraction allows the model to learn complex non-linear mappings, and the weights of $W_1$ and $W_2$ are often interpreted as a place where factual knowledge is stored.
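The ReLU form of the FFN is short enough to write out directly. The sketch below uses assumed sizes (`n = 3`, `d = 8`, `d_ff = 4 * d`) and random weights, and also checks the position-wise property: applying the network to the whole sequence gives the same result as applying it to each row separately.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Expand to d_ff, apply ReLU, then project back to d.
    hidden = np.maximum(0.0, x @ W1 + b1)
    return hidden @ W2 + b2

# Illustrative sizes and random weights (assumptions, not values from the text).
rng = np.random.default_rng(0)
n, d = 3, 8
d_ff = 4 * d
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)
X = rng.normal(size=(n, d))

# Applied to the whole sequence at once: shape (n, d).
Y = ffn(X, W1, b1, W2, b2)

# Applied to one position at a time, then stacked back into a matrix.
row_wise = np.stack([ffn(x, W1, b1, W2, b2) for x in X])
```

Because no term in the FFN mixes rows of `X`, `Y` and `row_wise` agree up to floating-point error; all interaction between tokens happens in the attention sublayer.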

The synergy between MHA and FFN is what gives the Transformer its power. MHA provides the 'where to look' logic by computing dynamic weights based on the current input, while the FFN provides the 'what to do' logic by applying static, learned transformations to those gathered features. Because self-attention connects every pair of positions directly, the path length between any two positions stays constant regardless of how far apart they are, which makes long-range dependencies easier to learn. Stacked together, these two sublayers form a highly expressive function approximator over sequences.