The Mathematical Foundation of Multi-Head Attention and Position-wise Feed-Forward Networks

At its core, the attention mechanism addresses a fundamental limitation of sequential processing: the inability to efficiently capture long-range dependencies. The intuition is analogous to a soft database lookup in which a 'Query' is compared against a set of 'Keys' to determine how much of each entry's 'Value' should be extracted. By computing a compatibility score between the query and every key, the model can dynamically focus on relevant parts of the input sequence regardless of their distance, replacing recurrence with a weighted average of the value vectors over all positions.

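Mathematically, we define the Scaled Dot-Product Attention. Given input matrices $Q$ (Queries), $K$ (Keys), and $V$ (Values), the attention is computed as: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ Here, $QK^T$ computes the raw similarity scores via dot products, $d_k$ is the dimension of the query and key vectors, and the softmax is applied row-wise so that each query's weights sum to 1. We divide by $\sqrt{d_k}$ because the magnitude of the dot products grows with $d_k$: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$. Without the scaling, large scores would push the softmax into saturated regions with extremely small gradients, hindering convergence during training.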
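
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: it assumes a single unbatched sequence with no masking, and the function names and toy shapes (3 queries, 5 keys, $d_k = 8$, $d_v = 4$) are chosen here for the example only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the max along the axis avoids overflow and does not change the result.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).

    Returns the attended values (n_q, d_v) and the attention weights (n_q, n_k).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # raw similarities, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Toy inputs (illustrative shapes, random values).
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 4))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)  # (3, 4) (3, 5)
```

Each row of `weights` is a probability distribution over the five keys, so each row of `out` is a weighted average of the rows of `V`.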

Multi-Head Attention (MHA) extends this concept by allowing the model to jointly attend to information from different representation subspaces. Instead of one large attention function, we use $h$ independent 'heads'. Each head $i$ has its own learnable projection matrices $W_i^Q, W_i^K, W_i^V$. The output is the concatenation of all heads, projected back into the original dimension via a final linear layer $W^O$: $$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$ where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$. This enables the model to capture diverse relationships, such as syntactic dependencies and semantic associations, simultaneously.
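
As a rough sketch of the multi-head formula, the NumPy code below loops over the heads explicitly so that it mirrors the notation. Several things here are assumptions made for illustration: the per-head projections are stored as stacked arrays of shape $(h, d_{model}, d_k)$, the per-head width is set to $d_k = d_v = d_{model}/h$, the weights are random rather than learned, and the input is used as $Q$, $K$, and $V$ at once (self-attention). Practical implementations usually fuse the heads into a single batched matrix multiply instead of looping.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention for one head.
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O):
    """W_Q, W_K: (h, d_model, d_k); W_V: (h, d_model, d_v); W_O: (h * d_v, d_model)."""
    h = W_Q.shape[0]
    # head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
    heads = [attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i]) for i in range(h)]
    # Concat(head_1, ..., head_h) W^O
    return np.concatenate(heads, axis=-1) @ W_O

# Illustrative sizes: 6 tokens, d_model = 16, 4 heads of width 4.
rng = np.random.default_rng(0)
n, d_model, h = 6, 16, 4
d_k = d_v = d_model // h

X = rng.normal(size=(n, d_model))
W_Q = rng.normal(size=(h, d_model, d_k))
W_K = rng.normal(size=(h, d_model, d_k))
W_V = rng.normal(size=(h, d_model, d_v))
W_O = rng.normal(size=(h * d_v, d_model))

Y = multi_head_attention(X, X, X, W_Q, W_K, W_V, W_O)
print(Y.shape)  # (6, 16)
```

The output has the same shape as the input, which is what lets the block be stacked with residual connections.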

Following the attention mechanism, the Transformer employs a Position-wise Feed-Forward Network (FFN). While MHA aggregates information across the sequence (inter-token communication), the FFN processes each token's representation independently (intra-token refinement). The intuition is that MHA identifies *where* to look, and the FFN transforms *what* has been found into a more useful higher-level feature. This creates a powerful duality between global communication and local processing.

The FFN is implemented as two linear transformations with a non-linear activation function, typically ReLU or GELU, applied in between. For a given vector $x$, the ReLU version of the operation is: $$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$ With GELU, the $\max(0, \cdot)$ is replaced by the GELU function. The first linear layer typically projects the vector into a higher-dimensional space (e.g., from $d_{model} = 512$ to $d_{ff} = 2048$), and the second layer projects it back. This expand-then-contract design gives the non-linearity a wider space to work in, so that features which are not linearly separable in the $d_{model}$-dimensional input can be separated in the hidden layer before being projected back.
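
A minimal NumPy sketch of the ReLU form of the FFN follows, using the example sizes $d_{model} = 512$ and $d_{ff} = 2048$ from the text. The random initialization (normal with scale 0.02, zero biases) is an arbitrary choice for illustration, not something the formula prescribes.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, with ReLU as the activation."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # expand to d_ff and apply ReLU
    return hidden @ W2 + b2                # contract back to d_model

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)

# Illustrative random parameters; in a trained model these are learned.
W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(d_model,))  # one token's representation
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (512,)
```

With these sizes the two weight matrices hold $2 \times 512 \times 2048 \approx 2.1$ million parameters.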

Crucially, because the same FFN, with the same parameters $W_1, b_1, W_2, b_2$, is applied to every position in the sequence independently, it is termed 'position-wise'. In matrix notation, for an input matrix $X$ of shape $(n, d_{model})$, the operation $\max(0, XW_1 + b_1)W_2 + b_2$ (with the biases broadcast across rows) is the same as applying $\text{FFN}$ to each row separately, and the output again has shape $(n, d_{model})$. No information is mixed between positions at this stage, so the model preserves the structural alignment of the sequence while enriching each token's latent representation through learned non-linearities. Together with the attention sublayer, this gives the transformer block the capacity to approximate complex functions.
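
The position-wise property can be checked numerically. The sketch below, with small illustrative sizes and random parameters chosen for the example, compares the matrix form against a row-by-row loop and also confirms that permuting the input rows simply permutes the output rows.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Works for a single vector (d_model,) or a matrix (n, d_model);
    # the biases broadcast across rows in the matrix case.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
n, d_model, d_ff = 5, 8, 32  # illustrative sizes

W1 = rng.normal(size=(d_model, d_ff))
b1 = rng.normal(size=(d_ff,))
W2 = rng.normal(size=(d_ff, d_model))
b2 = rng.normal(size=(d_model,))
X = rng.normal(size=(n, d_model))

# Matrix form: the whole sequence at once.
batched = ffn(X, W1, b1, W2, b2)

# Row-by-row form: the same function applied to each position separately.
row_by_row = np.stack([ffn(X[t], W1, b1, W2, b2) for t in range(n)])

# Shuffling the positions only shuffles the outputs in the same way.
perm = rng.permutation(n)
permuted = ffn(X[perm], W1, b1, W2, b2)

print(np.allclose(batched, row_by_row))      # True
print(np.allclose(permuted, batched[perm]))  # True
```

The second check is the reason the FFN alone cannot tell positions apart: any dependence on token order has to come from the attention sublayer and the positional information in its inputs.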