The core intuition behind the Attention mechanism is the ability to dynamically assign weight to different parts of an input sequence based on their relevance to a specific token. Rather than treating a sequence as a static vector, attention allows the model to 'look' at other tokens to derive context. Mathematically, this is framed as a retrieval system where a 'Query' seeks information from a set of 'Keys', and the resulting alignment determines how much of the corresponding 'Value' is extracted.
To formalize this, we start with Scaled Dot-Product Attention. Given an input matrix $X \in \mathbb{R}^{n \times d}$, we project it into three distinct spaces using learned weight matrices $W^Q, W^K, W^V \in \mathbb{R}^{d \times d_k}$. This yields the Query, Key, and Value matrices: $Q = XW^Q$, $K = XW^K$, and $V = XW^V$. The attention weights are the row-wise softmax of the scaled dot products between queries and keys, and the output is the corresponding weighted sum of the values: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ The scaling factor $\sqrt{d_k}$ keeps the dot products from growing with the key dimension: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$, so dividing by $\sqrt{d_k}$ restores unit variance. Without this scaling, large scores would push the softmax into saturated regions where gradients vanish.
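The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names, the random weights, and the sizes `n`, `d`, `d_k` are assumptions chosen for the example, and no masking or batching is included.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row-wise max before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) query-key alignment scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights          # weighted sum of values, plus the weights

# Illustrative sizes and random weights (assumptions, not from the text).
rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 8
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))

out, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)  # (4, 8)
```

Each row of `weights` sums to one, so every output row is a convex combination of the rows of $V$.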
Multi-Head Attention (MHA) extends this concept by allowing the model to jointly attend to information from different representation subspaces. Instead of one large attention operation, we perform $h$ parallel attention operations (heads). Each head $i$ has its own set of projections $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times d_k}$, where $d_k = d/h$ is a common choice that keeps the total cost close to that of a single full-width head. The output of the MHA layer is the concatenation of these heads, projected back into the original dimension $d$ via a final learnable matrix $W^O \in \mathbb{R}^{h d_k \times d}$: $$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$ where each $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$. In this formula $Q$, $K$, and $V$ denote the unprojected inputs to the layer; in self-attention all three equal $X$.
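As a rough sketch of the multi-head formula for the self-attention case (all three inputs equal to $X$), the per-head projections can be stacked along a leading head axis and computed in one batched pass. The stacked weight layout, the function names, and the sizes below are assumptions made for this example.

```python
import numpy as np

def softmax(scores, axis=-1):
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """X: (n, d); W_Q, W_K, W_V: (h, d, d_k); W_O: (h * d_k, d)."""
    # Matmul broadcasts X over the head axis, giving (h, n, d_k) per projection.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n)
    heads = softmax(scores, axis=-1) @ V               # (h, n, d_k)
    # Concatenate head_1, ..., head_h along the feature axis: (n, h * d_k).
    concat = heads.transpose(1, 0, 2).reshape(X.shape[0], -1)
    return concat @ W_O                                # (n, d)

# Illustrative sizes (assumptions): d_k = d / h.
rng = np.random.default_rng(0)
n, d, h = 5, 16, 4
d_k = d // h
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(h, d, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (5, 16)
```

Batching over the head axis is only a convenience; looping over heads and concatenating the results gives the same output.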
While MHA handles the global interaction between tokens, it lacks a mechanism for non-linear transformation of the individual token representations. This is where the Position-wise Feed-Forward Network (FFN) comes into play. The FFN is applied to each position identically and independently. It consists of two linear transformations with a non-linear activation, typically ReLU or GELU, sandwiched between them. With ReLU, this reads: $$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$ Here $W_1 \in \mathbb{R}^{d \times d_{ff}}$ projects each token into a higher-dimensional space (often $d_{ff} = 4 \times d$) to extract complex features, and $W_2 \in \mathbb{R}^{d_{ff} \times d}$ projects it back to the model dimension.
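A minimal NumPy sketch of the ReLU form of the FFN follows. The hidden width `d_ff = 4 * d`, the weight scale, and the variable names are assumptions for illustration only.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    hidden = np.maximum(0.0, x @ W1 + b1)  # expand to d_ff and apply ReLU
    return hidden @ W2 + b2                # project back to the model dimension

# Illustrative sizes and random weights (assumptions, not from the text).
rng = np.random.default_rng(0)
n, d = 5, 16
d_ff = 4 * d
X = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)) * 0.1, np.zeros(d)

Y = ffn(X, W1, b1, W2, b2)
print(Y.shape)  # (5, 16)

# Position-wise: feeding a single token alone gives the same row as the full pass.
single = ffn(X[2], W1, b1, W2, b2)
print(np.allclose(single, Y[2]))  # True
```

Because no operation mixes rows of `X`, the FFN cannot move information between positions; that job is left to attention.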
The interplay between MHA and FFNs is fundamental to the Transformer's power. MHA acts as a 'communication' layer, where tokens exchange information across the sequence, while the FFN acts as a 'processing' layer, refining the representation of each token based on the information it gathered during the attention phase. Stacked in alternation, the two sublayers give the model enough expressive power to capture both local textual patterns and long-range semantic dependencies.
Finally, to stabilize training and ease the flow of gradients through deep architectures, each of these sublayers is wrapped in a residual connection followed by Layer Normalization. Given a sublayer function $f(x)$ (either MHA or the FFN), the output is $\text{LayerNorm}(x + f(x))$. The residual path gives gradients a direct route to earlier layers, which mitigates the vanishing gradient problem, and it makes identity mappings easy to learn, so the original input signal is preserved while the network learns incremental refinements. Layer Normalization then rescales each token's features to a consistent range.
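The wrapping $\text{LayerNorm}(x + f(x))$ can be sketched as below. The learned scale and shift parameters `gamma` and `beta`, the constant `eps`, and the stand-in sublayer are assumptions introduced for the example; in a real layer the sublayer would be MHA or the FFN.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token (row) over its feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def residual_block(x, f, gamma, beta):
    # Add the sublayer output to its input, then normalize.
    return layer_norm(x + f(x), gamma, beta)

# Illustrative input and a hypothetical stand-in sublayer.
rng = np.random.default_rng(0)
n, d = 5, 16
X = rng.normal(size=(n, d))
gamma, beta = np.ones(d), np.zeros(d)

def toy_sublayer(x):
    return 0.1 * np.tanh(x)

Y = residual_block(X, toy_sublayer, gamma, beta)
print(Y.shape)  # (5, 16)
```

With `gamma` set to ones and `beta` to zeros, each output row has approximately zero mean and unit variance. If the sublayer outputs zeros, the block reduces to `layer_norm(X, gamma, beta)`, which is the sense in which the residual path preserves the input signal.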