The mathematical foundation of Multi-Head Attention and Position-wise Feed-Forward Networks

At its core, the Attention mechanism is designed to allow a model to focus on different parts of an input sequence dynamically. Instead of treating a sequence as a fixed vector, we treat it as a set of 'conceptual lookups.' Imagine a library: you have a query (what you are looking for), keys (the labels on the spines of books), and values (the actual content inside those books). The goal is to calculate a weighted sum of values based on how well your query matches the available keys. This allows the model to capture long-range dependencies regardless of their distance in the sequence, overcoming the 'vanishing gradient' issues found in recurrent architectures.

Mathematically, we begin with a Scaled Dot-Product Attention. Given an input matrix $X \\∈ \\ℝ^{n \\× d}$, we project it into three distinct spaces using weight matrices $W^Q, W^K, W^V \\∈ \\ℝ^{d \\× d_k}$. This yields the Query, Key, and Value matrices: $Q = XW^Q, K = XW^K, V = XW^V$. The attention scores are computed via the dot product of $Q$ and $K^T$, scaled by the square root of the dimension $d_k$ to prevent gradients from vanishing during the softmax operation. The formulation is expressed as: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Multi-Head Attention (MHA) extends this concept by performing the attention process multiple times in parallel. The intuition is that a single attention head can only focus on one type of relationship (e.g., syntactic dependency). By using $h$ different heads, the model can simultaneously attend to different representation subspaces—one head might track subject-verb agreement while another tracks coreference. Each head $i$ has its own set of learnable projections $W_i^Q, W_i^K, W_i^V$. The output of these heads is concatenated and then projected back to the original model dimension using a final weight matrix $W^O$: $$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$, where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$.

Following the attention mechanism, the Transformer employs a Position-wise Feed-Forward Network (FFN). While Attention allows tokens to 'communicate' with each other, the FFN allows the model to 'process' the information extracted from that communication. It is 'position-wise' because the same linear transformation is applied to each token independently across the sequence dimension. This effectively acts as a local feature extractor that transforms the representation of each token based on the context it gathered during the attention phase.

The FFN consists of two linear transformations with a non-linear activation function (typically ReLU or GeLU) in between. If $x$ is the input vector for a single position, the transformation is defined as: $$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$. Here, $W_1$ typically projects the dimension from $d_{model}$ to a larger hidden dimension $d_{ff}$ (often $4 \\× d_{model}$), and $W_2$ projects it back. This expansion-contraction architecture allows the model to project the data into a higher-dimensional space to find more complex patterns before compressing it back.

Integrating these components creates a powerful duality: MHA handles the global spatial relationships (the 'where' and 'what' of the sequence), while the FFN handles the point-wise semantic transformation (the 'meaning' of the resulting vector). Together, these operations ensure that the network is both permutation-invariant and capable of high-capacity representation. This mathematical structure allows the Transformer to be highly parallelizable, moving away from the sequential bottleneck of RNNs and enabling the scale of modern Large Language Models.