The Mathematical Architecture of Transformer Components: Multi-Head Attention and Position-wise Feed-Forward Networks

The core intuition behind Multi-Head Attention is that the model can jointly attend to information from different representation subspaces at different positions. Instead of performing a single attention function with $d_{model}$-dimensional keys, queries, and values, we linearly project the queries, keys, and values $h$ times with different, learned linear projections to $d_k$, $d_k$, and $d_v$ dimensions respectively. Each projection can then capture a different syntactic or semantic relationship, much like a committee of experts each focusing on a different aspect of the input sequence.

Mathematically, the foundation is the Scaled Dot-Product Attention mechanism. Given a query matrix $Q$, a key matrix $K$, and a value matrix $V$, the attention output is computed as: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ The scaling factor $\sqrt{d_k}$ is critical. If the components of a query and a key are independent with mean 0 and variance 1, their dot product has mean 0 and variance $d_k$. For large values of $d_k$, unscaled dot products therefore grow large in magnitude, pushing the softmax function into saturated regions where it has extremely small gradients and severely slowing learning during backpropagation. Dividing by $\sqrt{d_k}$ brings the variance of the scores back to roughly 1.
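
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: it handles a single unbatched sequence with no masking, and the function names, the shapes, and the random inputs are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)  # each row is a distribution over keys
    return weights @ V, weights         # output: (n_q, d_v)

# Illustrative sizes (assumed): 4 queries, 6 key/value pairs, d_k = 64, d_v = 32.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 64))
K = rng.standard_normal((6, 64))
V = rng.standard_normal((6, 32))
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to 1, so every output row is a convex combination of the rows of `V`.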

Multi-Head Attention extends this by concatenating the outputs of $h$ parallel attention heads and applying a final linear projection. If we denote the $i$-th head's projection matrices as $W_i^Q, W_i^K, W_i^V$, the operation is defined as: $$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$ where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$, with $W_i^Q, W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, and $W^O \in \mathbb{R}^{hd_v \times d_{model}}$. Because each head has its own projections, different heads can specialize, for example some on local dependencies and others on long-range context, within the same layer.
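
As a rough sketch of the definition above, the following NumPy code loops over the heads explicitly, which is slower than a batched implementation but mirrors the formula term by term. The sizes ($d_{model} = 512$, $h = 8$, $d_k = d_v = d_{model}/h$), the random weight initialization, and the function names are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    # W_q, W_k: (h, d_model, d_k); W_v: (h, d_model, d_v); W_o: (h * d_v, d_model)
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        q, k, v = Q @ Wq_i, K @ Wk_i, V @ Wv_i   # per-head projections
        scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot products
        heads.append(softmax(scores) @ v)        # head_i: (n, d_v)
    # Concatenate along the feature axis, then apply the output projection.
    return np.concatenate(heads, axis=-1) @ W_o

# Assumed illustrative configuration.
rng = np.random.default_rng(0)
n, d_model, h = 10, 512, 8
d_k = d_v = d_model // h
X = rng.standard_normal((n, d_model))
W_q = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
W_k = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
W_v = rng.standard_normal((h, d_model, d_v)) / np.sqrt(d_model)
W_o = rng.standard_normal((h * d_v, d_model)) / np.sqrt(h * d_v)

# Self-attention: queries, keys, and values all come from the same sequence.
out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)
```

With $d_k = d_v = d_{model}/h$, the concatenated heads have width $h d_v = d_{model}$, so the layer maps a sequence of shape `(n, d_model)` to another of the same shape.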

Following the attention mechanism, the Position-wise Feed-Forward Network (FFN) applies a non-linear transformation to each position separately and identically. While attention mixes information across positions, the FFN processes information within each position independently. It consists of two linear transformations with a ReLU activation in between: $$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$ Here, $W_1 \in \mathbb{R}^{d_{model} \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}$ are weight matrices, and typically $d_{ff}$ is significantly larger than $d_{model}$ (e.g., 2048 vs 512) to create a high-dimensional intermediate representation.
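
The FFN formula translates almost directly into NumPy. The sketch below uses the 512 and 2048 dimensions mentioned above; the weight scale, the zero biases, and the sequence length are assumptions made for the example. It also checks the "position-wise" property: applying the function to one row alone gives the same result as that row of the full output.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # x: (..., d_model); W1: (d_model, d_ff); W2: (d_ff, d_model)
    hidden = np.maximum(0, x @ W1 + b1)  # expand to d_ff, then ReLU
    return hidden @ W2 + b2              # project back to d_model

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
X = rng.standard_normal((10, d_model))          # 10 positions (assumed)
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

Y = ffn(X, W1, b1, W2, b2)        # all positions at once
row = ffn(X[3], W1, b1, W2, b2)   # position 3 on its own
same = np.allclose(Y[3], row)     # no information flows between positions
```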

The expansion to dimension $d_{ff}$ in the first linear layer of the FFN gives the network a wide, feature-rich intermediate layer, the opposite of a bottleneck. By projecting the input into a higher-dimensional space, applying a non-linearity, and then projecting back, the network can learn complex, non-linear interactions between the features extracted by the attention heads. This is loosely analogous to kernel methods such as SVMs, where data is mapped to a higher-dimensional feature space in which it may become linearly separable, though here the mapping is explicit and learned end-to-end via gradient descent.

Residual connections and Layer Normalization are applied around both the Multi-Head Attention and the FFN sub-layers to stabilize training. For a sub-layer denoted as $\text{SubLayer}(x)$, the output is $\text{LayerNorm}(x + \text{SubLayer}(x))$. The addition gives gradients a direct path around each sub-layer, which mitigates vanishing gradients even in very deep architectures. It also makes a near-identity mapping easy to learn: if the optimal transformation is close to the identity, the sub-layer only needs to output values close to zero, and the input passes through unchanged up to normalization.
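
A minimal NumPy sketch of this wrapper follows. It assumes the common form of Layer Normalization, which normalizes each position over its feature dimension and applies a learned gain and bias; the names `gamma`, `beta`, `eps`, and `residual_block` are assumptions introduced here, not taken from the text above. A sub-layer that outputs zeros stands in for the case where the sub-layer contributes nothing.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each position (row) over its feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def residual_block(x, sublayer, gamma, beta):
    # LayerNorm(x + SubLayer(x)), as in the formula above.
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
d_model = 512
X = rng.standard_normal((10, d_model))
gamma = np.ones(d_model)   # assumed initial gain
beta = np.zeros(d_model)   # assumed initial bias

# A sub-layer that outputs zeros: the block reduces to LayerNorm(x).
zero_sublayer = lambda x: np.zeros_like(x)
out = residual_block(X, zero_sublayer, gamma, beta)
```

With the zero sub-layer, `out` equals `layer_norm(X, gamma, beta)`, which shows that the input passes through the block unchanged up to normalization.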

In summary, the synergy between Multi-Head Attention and Position-wise Feed-Forward Networks creates a powerful mechanism for sequence modeling. The attention mechanism dynamically weights the importance of different tokens relative to each other, while the feed-forward network refines these representations through non-linear feature extraction. Together, they form the building blocks that allow Transformers to achieve state-of-the-art performance across natural language processing and beyond.