The Mathematical Foundation of Multi-Head Attention and Position-wise Feed-Forward Networks

At its core, the Attention mechanism is designed to solve the 'bottleneck' problem of sequential processing by allowing a model to focus on different parts of an input sequence regardless of their distance. The intuitive goal is to create a dynamic weighting system where each token in a sequence 'queries' all other tokens to determine which information is most relevant for its own representation. Mathematically, this is framed as a retrieval process from a key-value store, where a Query vector searches for matching Keys to extract associated Values.

To implement this, we define three weight matrices $W^Q, W^K, W^V \in \mathbb{R}^{d_{model} \times d_k}$. Given an input sequence $X \in \mathbb{R}^{n \times d_{model}}$, we compute the Query, Key, and Value matrices as $Q = XW^Q$, $K = XW^K$, and $V = XW^V$. The core operation is the Scaled Dot-Product Attention, formulated as: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ Here, $QK^T \in \mathbb{R}^{n \times n}$ holds the pairwise dot-product similarity between every query and every key, and the softmax is applied row-wise so that each token's weights over the sequence sum to one. We scale by $\sqrt{d_k}$ because, if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$. Without the scaling, a large $d_k$ would push the softmax into saturated regions where gradients vanish.
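
The formula can be traced step by step in a minimal sketch. It uses plain Python lists to stay dependency-free (an array library would be the usual choice in practice), takes $Q$, $K$, and $V$ as already projected, and uses arbitrary toy values. The helper names `matmul`, `softmax`, and `scaled_dot_product_attention` are illustrative, not taken from any library.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]   # K^T, shape d_k x n
    scores = matmul(q, k_t)                # n x n pairwise dot products
    # Scale each score by sqrt(d_k), then normalize every row with softmax.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy example (assumed values): n = 3 tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` is a probability distribution over the three tokens, so each row of `output` is a weighted average of the rows of `V`.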

While a single attention head is powerful, it is limited to one 'aspect' of the relationship between tokens. Multi-Head Attention (MHA) allows the model to jointly attend to information from different representation subspaces. We employ $h$ parallel heads, each with its own set of weight matrices $W_i^Q, W_i^K, W_i^V$. The output of the $i$-th head is denoted as $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$, where $Q$, $K$, and $V$ now denote the unprojected inputs to the layer (all equal to $X$ in self-attention), so that each head applies its own projections. These heads are then concatenated and projected back into the original dimensionality $d_{model}$ using a final linear transformation $W^O \in \mathbb{R}^{hd_k \times d_{model}}$: $$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$
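
As a rough illustration of how the heads are combined, the sketch below implements the self-attention case, in which queries, keys, and values all come from the same input $X$. Several things here are assumptions made for the example rather than taken from the text: the per-head width $d_k = d_{model} / h$, the toy sizes, the random Gaussian weights, and the helper names (`random_matrix`, `multi_head_attention`).

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(q[0])
    scores = matmul(q, [list(col) for col in zip(*k)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def random_matrix(rows, cols, rng):
    """Hypothetical initializer: small Gaussian weights, for illustration only."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, heads, w_o):
    """heads is a list of (W_i^Q, W_i^K, W_i^V) triples; w_o is W^O."""
    head_outputs = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate along the feature axis: row i becomes h * d_k values wide.
    concat = [
        [value for head in head_outputs for value in head[i]]
        for i in range(len(x))
    ]
    return matmul(concat, w_o)

rng = random.Random(0)
n, d_model, h = 4, 8, 2
d_k = d_model // h  # assumption: each head works in a d_model / h subspace
X = random_matrix(n, d_model, rng)
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
W_O = random_matrix(h * d_k, d_model, rng)
Y = multi_head_attention(X, heads, W_O)  # n x d_model, same shape as X
```

With this choice of $d_k$, the concatenated heads have exactly $d_{model}$ columns, so the total projection cost is comparable to that of a single head of full width.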

After the attention mechanism aggregates global context, the model requires a way to process this information locally and introduce additional non-linearity. This is the role of the Position-wise Feed-Forward Network (FFN). Unlike the attention layer, which allows tokens to interact, the FFN is applied to each position independently and identically. Because the same weights are shared across all $n$ positions, the transformation does not depend on where a token sits in the sequence. It acts as a per-token feature extractor that refines the representations learned during the attention phase.

The FFN consists of two linear transformations with a non-linear activation function (typically ReLU or GELU) in between. Mathematically, for a vector $x$ at a specific position, the FFN with a ReLU activation is defined as: $$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$ where $W_1 \in \mathbb{R}^{d_{model} \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}$. In the standard Transformer, the inner layer dimensionality $d_{ff}$ is much larger than $d_{model}$ (e.g., $2048$ vs $512$, a fourfold expansion). This expansion and subsequent contraction allow the network to project the data into a higher-dimensional space to uncover complex features before projecting them back.
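
The position-wise behaviour is easy to see in a small sketch of the ReLU form of the FFN. The weights, biases, and toy sizes ($d_{model} = 2$, $d_{ff} = 4$) are made-up values for illustration, and `linear` and `ffn` are hypothetical helper names.

```python
def linear(x, w, b):
    """Compute x W + b for a single row vector x."""
    return [
        sum(xi * wi for xi, wi in zip(x, col)) + bias
        for col, bias in zip(zip(*w), b)
    ]

def ffn(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for one position."""
    hidden = [max(0.0, value) for value in linear(x, w1, b1)]  # ReLU
    return linear(hidden, w2, b2)

# Assumed toy parameters: W1 expands 2 -> 4, W2 contracts 4 -> 2.
W1 = [[1.0, -1.0, 0.5, 0.0],
      [0.0, 1.0, -0.5, 2.0]]
b1 = [0.0, 0.0, 0.0, -1.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0],
      [0.5, -0.5]]
b2 = [0.1, -0.1]

# Positions 0 and 2 hold the same vector on purpose.
sequence = [[1.0, 2.0], [3.0, -1.0], [1.0, 2.0]]

# The same function, with the same weights, is applied at every position.
outputs = [ffn(x, W1, b1, W2, b2) for x in sequence]
```

Since positions 0 and 2 carry identical inputs, they receive identical outputs: the FFN sees one token at a time and has no access to its neighbours or its index.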

Ultimately, the synergy between MHA and FFNs creates a powerful duality: MHA handles the 'relational' synthesis across positions, while the FFN handles the 'channel-wise' refinement within each position. Together, they form the basic building block of the Transformer. By stacking these layers, the model can iteratively refine its representation of a token, a process often described as moving from simple syntactic associations in lower layers to more abstract semantic concepts in higher layers. Each layer needs only a constant number of sequential operations regardless of sequence length, although the cost of the attention computation itself grows quadratically with the sequence length $n$, since $QK^T$ has $n \times n$ entries.