The Mathematical Foundation of Multi-Head Attention and Position-wise Feed-Forward Networks

At its core, the Attention mechanism is designed to solve the 'context' problem: how does a model determine which parts of an input sequence are most relevant to a specific token? The intuition is based on a retrieval system. Imagine a database where each entry has a 'Key' and a 'Value'. When you search for something using a 'Query', the system calculates the similarity between your query and all available keys to decide how much of each value to retrieve. In Transformers, these queries, keys, and values are not fixed but are learned linear projections of the input embeddings.

Mathematically, we begin with Scaled Dot-Product Attention. Given an input matrix $X \in \mathbb{R}^{n \times d}$, we project it into query, key, and value spaces using weight matrices $W^Q, W^K, W^V \in \mathbb{R}^{d \times d_k}$. The attention scores are the pairwise dot products between queries and keys, $QK^T$. When $d_k$ is large, these dot products grow in magnitude and push the softmax into saturated regions where its gradients become extremely small, so we divide the scores by $\sqrt{d_k}$ before applying the softmax. The formulation is expressed as: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ where $Q=XW^Q, K=XW^K, V=XW^V$.
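
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the sizes `n`, `d`, `d_k`, the random weights, and the helper names `softmax` and `scaled_dot_product_attention` are assumptions chosen for the example, and masking and batching are left out.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtracting the row maximum avoids overflow in exp and does not change the result.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (n, d_k). Returns the output and the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n): similarity of every query to every key
    weights = softmax(scores, axis=-1)   # each row is a distribution over positions
    return weights @ V, weights

# Illustrative sizes and random weights (assumptions, not values from the text).
rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 3
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(output.shape)   # (4, 3)
print(weights.shape)  # (4, 4)
```

Each row of `weights` sums to 1, so every output row is a weighted average of the rows of $V$, which matches the retrieval intuition described earlier.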

While a single attention head is powerful, it is limiting: it produces only one set of attention weights per token, so different kinds of relationships tend to be blended into a single weighted average. Multi-Head Attention (MHA) overcomes this by running $h$ attention mechanisms in parallel. Each head $i$ has its own set of learnable projections $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times d_k}$. This allows the model to simultaneously attend to different aspects of the sequence; for example, one head may focus on syntactic structure while another tracks semantic references. Each head produces an output $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$, where $Q$, $K$, and $V$ here denote the unprojected inputs to the block (all equal to $X$ in self-attention).

To integrate these parallel perspectives, the outputs of the $h$ heads are concatenated and transformed via a final linear projection matrix $W^O \in \mathbb{R}^{hd_k \times d}$. This step ensures that the output of the MHA block maintains the same dimensionality as the original input, enabling the use of residual connections. The complete MHA operation is defined as: $$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$ This architecture allows the model to capture complex, multi-faceted dependencies across the sequence.
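
As a rough sketch of the concatenate-and-project step, the snippet below runs $h$ heads of self-attention in a plain loop and checks that the result has the same shape as the input. The choice $d_k = d / h$, the random weights, and the function name `multi_head_self_attention` are assumptions made for illustration; practical implementations usually compute all heads in one batched operation instead of looping.

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax with the usual max subtraction for numerical stability.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O):
    """X: (n, d); W_Q, W_K, W_V: (h, d, d_k); W_O: (h * d_k, d)."""
    h, _, d_k = W_Q.shape
    heads = []
    for i in range(h):
        # Per-head projections of the same input X, each of shape (n, d_k).
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]
        weights = softmax(Q @ K.T / np.sqrt(d_k))  # (n, n)
        heads.append(weights @ V)                  # (n, d_k)
    concat = np.concatenate(heads, axis=-1)        # (n, h * d_k)
    return concat @ W_O                            # (n, d)

# Illustrative sizes (assumptions): 5 tokens, model width 8, 2 heads.
rng = np.random.default_rng(0)
n, d, h = 5, 8, 2
d_k = d // h
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(h, d, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d))

mha_out = multi_head_self_attention(X, W_Q, W_K, W_V, W_O)
residual = X + mha_out   # shapes match, so a residual connection is possible
print(mha_out.shape)     # (5, 8)
```

Because `mha_out` has the same shape as `X`, the addition on the second-to-last line is well defined, which is the property the output projection is meant to guarantee.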

Following the attention mechanism, the Transformer employs a Position-wise Feed-Forward Network (FFN). While attention manages the interaction between tokens, the FFN processes the information within each token independently. 'Position-wise' means that the same two-layer network, with the same weights, is applied separately to every position in the sequence. It acts as a local processing unit that maps each representation into a higher-dimensional space to extract more complex features before projecting it back to the original dimension.

The FFN consists of two linear transformations with a non-linear activation function (typically ReLU or GELU) in between. Mathematically, for a vector $x$ at a specific position, the ReLU variant is expressed as: $$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$ Here, $W_1$ expands the dimension from $d$ to $d_{ff}$ (often $4d$), and $W_2$ compresses it back to $d$. This expansion-contraction cycle allows the model to learn non-linear mappings of the feature space, and it is often credited with providing much of the 'capacity' needed to store factual knowledge acquired during training.
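
To make the 'position-wise' property concrete, here is a minimal NumPy sketch of the ReLU form of the FFN. The sizes, the random weights, the zero biases, and the function name `ffn` are illustrative assumptions. The final check confirms that applying the network to the whole sequence gives the same result for a token as applying it to that token alone.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Expand to d_ff, apply ReLU, then project back to d.
    hidden = np.maximum(0.0, x @ W1 + b1)
    return hidden @ W2 + b2

# Illustrative sizes (assumptions): 5 tokens, model width 8, d_ff = 4 * d.
rng = np.random.default_rng(0)
n, d = 5, 8
d_ff = 4 * d
X = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)

out_all = ffn(X, W1, b1, W2, b2)      # whole sequence at once: (n, d)
out_row = ffn(X[2], W1, b1, W2, b2)   # one position on its own: (d,)

print(out_all.shape)                     # (5, 8)
print(np.allclose(out_all[2], out_row))  # True: no information flows between positions
```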