The Mathematical Foundations of Multi-Head Attention and Position-wise Feed-Forward Networks

At its core, the Attention mechanism is designed to solve the bottleneck of fixed-length vector representations in sequence modeling. Instead of compressing an entire sentence into a single vector, Attention allows the model to dynamically focus on different parts of the input sequence based on the current context. This is achieved by treating each token as a query that searches for relevant information among a set of keys, retrieving a weighted sum of corresponding values. This 'retrieval' process ensures that the model can capture long-range dependencies regardless of the distance between tokens in the sequence.

Mathematically, this is formalized as Scaled Dot-Product Attention. Given an input matrix $X$, we project it into three distinct spaces using learned weight matrices $W^Q$, $W^K$, and $W^V$. The Query, Key, and Value matrices are computed as $Q = XW^Q$, $K = XW^K$, and $V = XW^V$. The attention scores are the dot products of queries and keys, scaled by the square root of the key dimension $d_k$. Without this scaling, the dot products grow in magnitude as $d_k$ increases and push the softmax into saturated regions where its gradients become extremely small. The formulation is given by: $$Attention(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
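To make the formula concrete, here is a minimal NumPy sketch of single-head attention. The sizes (`n`, `d_model`, `d_k`), the random weights, and the function names are illustrative assumptions, not values from the text.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtracting the row-wise max does not change the result
    # but avoids overflow in exp.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, n) query-key similarities
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted sum of value vectors

# Hypothetical sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
```

Each row of `weights` is a probability distribution over the positions in the sequence, so each row of `output` is a convex combination of the rows of $V$.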

Multi-Head Attention (MHA) extends this concept by running multiple attention mechanisms in parallel. The intuition is that a single attention head may only focus on one type of relationship (e.g., syntactic dependency), whereas multiple heads allow the model to simultaneously attend to different representation subspaces (e.g., semantic meaning and temporal order). Each head $i$ has its own set of projections $W_i^Q, W_i^K, W_i^V$. The outputs of these heads are concatenated and then projected back to the original model dimension using a final weight matrix $W^O$: $$MultiHead(Q, K, V) = \text{Concat}(head_1, \dots, head_h)W^O$$ where $head_i = Attention(XW_i^Q, XW_i^K, XW_i^V)$.
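The concatenate-then-project step can be sketched in NumPy as follows. This is an illustration under stated assumptions: the per-head dimension is set to `d_model // h` (a common convention that the text above does not specify), and all weights are random placeholders.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(X, heads, W_O):
    # heads is a list of (W_Q_i, W_K_i, W_V_i) tuples, one per head.
    head_outputs = [
        attention(X @ W_Q, X @ W_K, X @ W_V) for W_Q, W_K, W_V in heads
    ]
    # Concatenate along the feature axis, then project back to d_model.
    return np.concatenate(head_outputs, axis=-1) @ W_O

# Hypothetical sizes; d_k = d_model // h is an assumption, not from the text.
rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
d_k = d_model // h
X = rng.normal(size=(n, d_model))
heads = [
    tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(h)
]
W_O = rng.normal(size=(h * d_k, d_model))

Y = multi_head_attention(X, heads, W_O)
```

The loop over heads keeps the correspondence with the formula explicit. Practical implementations usually batch all heads into a single tensor operation instead.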

While Multi-Head Attention handles the interaction between tokens, the Position-wise Feed-Forward Network (FFN) is responsible for processing the information extracted from those interactions. After the attention layer, the model applies a point-wise transformation to each token independently. The intuition is that while attention aggregates global context, the FFN provides the necessary non-linearity to transform these aggregated representations into higher-level features. It essentially acts as a local knowledge processor that operates on each position separately.

The FFN consists of two linear transformations with a non-linear activation function, typically the Rectified Linear Unit (ReLU) or GELU, sandwiched between them. For a given input vector $x$ at a specific position, the ReLU variant is defined as: $$FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2$$ Here, $W_1$ projects the data into a higher-dimensional latent space (often $4\times$ the model dimension $d_{model}$), and $W_2$ projects it back down to $d_{model}$. This expansion-contraction cycle allows the model to learn complex patterns within the feature space.
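The following NumPy sketch implements the ReLU form of the FFN and checks the position-wise property: applying it to the whole sequence at once gives the same result as applying it to each token separately. The sizes and random weights are illustrative assumptions, with the hidden width set to $4\times$ the model dimension as mentioned above.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Expand to the hidden width, apply ReLU, then contract back to d_model.
    hidden = np.maximum(0.0, x @ W1 + b1)
    return hidden @ W2 + b2

# Hypothetical sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
n, d_model = 3, 8
d_ff = 4 * d_model
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
X = rng.normal(size=(n, d_model))

# Apply the FFN to the whole sequence at once...
Y_all = ffn(X, W1, b1, W2, b2)
# ...and to each position on its own, using the same weights.
Y_rows = np.stack([ffn(X[i], W1, b1, W2, b2) for i in range(n)])

same = np.allclose(Y_all, Y_rows)
```

Because the same weights are shared across positions and no position reads from another, `same` evaluates to `True`. All mixing between tokens happens in the attention layer.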

The combination of these two components is central to the Transformer's design. The MHA layer computes, for every token, a weighted average of the value vectors across the sequence, effectively creating a 'context-aware' embedding for that token. The FFN then applies the same non-linear transformation to each of these embeddings. Together, they form a block that preserves the structure of the sequence (one output vector per input position) while enriching the semantic content. Because every position can be processed in parallel, this architecture is computationally efficient and avoids the sequential nature of Recurrent Neural Networks.