The Mathematical Foundation of Multi-Head Attention and Position-wise Feed-Forward Networks

At its core, the attention mechanism is designed to allow a model to dynamically weigh the importance of different parts of an input sequence regardless of their distance. While Recurrent Neural Networks (RNNs) process tokens sequentially, attention treats the sequence as a set of vectors in a high-dimensional space and relates all of them to one another at once. The intuition is akin to a retrieval system: we have a 'Query' ($Q$) representing what we are looking for, a 'Key' ($K$) representing the metadata of all available tokens, and a 'Value' ($V$) representing the actual content we wish to aggregate based on the match between the query and the key.

The fundamental unit of this operation is Scaled Dot-Product Attention. Given an input matrix $X \in \mathbb{R}^{n \times d}$, we project it into three distinct spaces using learned weight matrices $W^Q, W^K, W^V \in \mathbb{R}^{d \times d_k}$. The attention weights are computed by taking the dot products of the queries with the keys, dividing by $\sqrt{d_k}$ to keep the softmax out of its saturated regime, and normalizing each row with a softmax. Mathematically, this is expressed as: $$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$ where $Q=XW^Q$, $K=XW^K$, and $V=XW^V$. The $\sqrt{d_k}$ scaling is critical: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$, so without scaling the logits grow with the dimensionality and push the softmax into regions with extremely small gradients.
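The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names, the random weights, and the sizes `n`, `d`, and `d_k` are assumptions chosen for the example, and masking and batching are left out.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) query-key similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Illustrative sizes and random weights (assumptions, not from the text).
rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 8
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
```

Each row of `weights` is a probability distribution over the `n` positions, and each row of `output` is the corresponding weighted average of the rows of $V$.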

Multi-Head Attention (MHA) extends this concept by allowing the model to attend to information from different representation subspaces simultaneously. Instead of performing a single attention function, the model runs $h$ 'heads' in parallel. Each head $i$ has its own set of projections $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times d_k}$. The outputs of these heads are then concatenated and projected back into the original embedding dimension using a final weight matrix $W^O \in \mathbb{R}^{hd_k \times d}$: $$ \text{MHA}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O $$ where each $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$. In this formula $Q$, $K$, and $V$ denote the unprojected inputs to the layer (all equal to $X$ in self-attention), with the per-head projections applied inside each head. This allows the model to capture diverse relationships, such as syntactic dependencies and semantic associations, in a single layer.
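As a rough sketch of the multi-head computation in the self-attention case ($Q = K = V = X$), the heads can be computed in a loop and concatenated. The stacked weight layout `(h, d, d_k)`, the choice $d_k = d/h$, and all names below are assumptions made for illustration; practical implementations usually fuse the heads into a single batched matrix multiply.

```python
import numpy as np

def softmax(scores, axis=-1):
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V for a single head
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """Self-attention form of MHA.

    W_Q, W_K, W_V: per-head projections stacked as (h, d, d_k).
    W_O: output projection of shape (h * d_k, d).
    """
    h = W_Q.shape[0]
    heads = [attention(X @ W_Q[i], X @ W_K[i], X @ W_V[i]) for i in range(h)]
    # Concatenate along the feature axis: (n, h * d_k), then project to (n, d).
    return np.concatenate(heads, axis=-1) @ W_O

# Illustrative sizes (assumptions): d_k = d / h is a common choice.
rng = np.random.default_rng(0)
n, d, h = 5, 16, 4
d_k = d // h
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(h, d, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d))

Z = multi_head_attention(X, W_Q, W_K, W_V, W_O)
```

The output `Z` has the same shape as `X`, which is what lets the layer be stacked repeatedly.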

Following the attention mechanism, the Transformer employs Position-wise Feed-Forward Networks (FFN). While attention manages the interactions between different tokens, the FFN operates on each position independently and identically. The intuition here is to apply a non-linear transformation to the features extracted by the attention layer, effectively acting as a 'key-value memory' that transforms the attention-weighted representations into a more refined feature space.

Mathematically, the FFN consists of two linear transformations with a nonlinearity in between, shown here with ReLU (GELU is a common alternative). For a vector $x$ at a specific position, the operation is defined as: $$ \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 $$ where $W_1 \in \mathbb{R}^{d \times d_{ff}}$, $b_1 \in \mathbb{R}^{d_{ff}}$, $W_2 \in \mathbb{R}^{d_{ff} \times d}$, and $b_2 \in \mathbb{R}^{d}$. Typically, $d_{ff}$ is significantly larger than $d$ (often $d_{ff} = 4d$). This is the opposite of a bottleneck: the network first expands the dimensionality to capture complex patterns and then projects back to the original size.
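The "position-wise" property can be checked directly: applying the FFN to the whole sequence matrix gives the same result as applying it to each row separately. The sketch below uses the ReLU form from the equation above; the sizes and random weights are illustrative assumptions.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2
    hidden = np.maximum(0.0, x @ W1 + b1)   # expand to d_ff and apply ReLU
    return hidden @ W2 + b2                 # project back to d

# Illustrative sizes (assumptions): d_ff = 4 * d as mentioned in the text.
rng = np.random.default_rng(0)
n, d = 3, 8
d_ff = 4 * d
W1, b1 = rng.normal(size=(d, d_ff)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), rng.normal(size=d)
X = rng.normal(size=(n, d))

# Apply to all positions at once, then to each position on its own.
batched = ffn(X, W1, b1, W2, b2)
per_position = np.stack([ffn(X[i], W1, b1, W2, b2) for i in range(n)])
```

Because the same weights are used at every position and no row influences another, `batched` and `per_position` agree up to floating-point error.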

The synergy between MHA and FFN defines the Transformer's power. MHA provides the 'contextual' view, allowing tokens to communicate across the entire sequence length, while the FFN provides the 'individual' view, refining the features of each token based on the global context gathered. Together, these components ensure that the model can represent complex hierarchical structures in language, transforming an initial embedding $X$ into a highly contextualized representation $Z$ through repeated applications of these two blocks.