The core intuition behind the Transformer architecture is the ability to dynamically weigh the importance of different parts of an input sequence. Unlike Recurrent Neural Networks (RNNs) that process data sequentially, the Attention mechanism allows every token to 'attend' to every other token simultaneously. Mathematically, this is treated as a retrieval problem: we have a 'Query' representing what we are looking for, a 'Key' representing what each token offers, and a 'Value' representing the actual content to be extracted. By computing the similarity between queries and keys, the model generates a weighted sum of values.
The fundamental unit of this process is Scaled Dot-Product Attention. Given an input matrix $X ∈ ℝ^{n × d}$, we project it into three distinct spaces using weight matrices $W^Q, W^K, W^V ∈ ℝ^{d × d_k}$, giving $Q = XW^Q$, $K = XW^K$, and $V = XW^V$. The attention weights are computed from the dot products of queries and keys, scaled by $\sqrt{d_k}$. Without this scaling, the dot products grow in magnitude with $d_k$ and push the softmax into saturated regions where its gradients become vanishingly small: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
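The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names, the random weights, and the sizes `n = 4`, `d = 8`, `d_k = 8` are assumptions chosen only for the example.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row-wise max before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, n): every query against every key
    weights = softmax(scores, axis=-1)  # each row is a distribution over tokens
    return weights @ V, weights         # weighted sum of values, plus the weights

# Illustrative sizes and random weights (assumptions, not from the text).
rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 8
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(output.shape)          # one d_k-dimensional vector per token
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

Returning the weights alongside the output is a convenience for inspection; only the weighted sum of values is passed on in the architecture.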
Multi-Head Attention (MHA) extends this by allowing the model to jointly attend to information from different representation subspaces. A single attention head might focus on syntactic relationships, while another focuses on semantic consistency. We repeat the attention process $h$ times with different learned linear projections. The outputs of these $h$ heads are concatenated and projected back to the original dimension: $$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$ where each $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$.
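As a rough sketch of the multi-head computation, the example below runs $h$ independent heads in self-attention mode (queries, keys, and values all derived from the same input `X`) and concatenates the results. The per-head size `d_k = d_model // h` is a common choice assumed here, not something fixed by the formula, and all names, shapes, and random weights are illustrative.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # W_Q, W_K, W_V have shape (h, d_model, d_k); W_O has shape (h * d_k, d_model).
    heads = []
    for W_q, W_k, W_v in zip(W_Q, W_K, W_V):  # one set of projections per head
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        heads.append(weights @ V)             # (n, d_k)
    concat = np.concatenate(heads, axis=-1)   # (n, h * d_k)
    return concat @ W_O                       # back to (n, d_model)

# Illustrative sizes (assumptions): 5 tokens, model width 16, 4 heads.
rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
d_k = d_model // h
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)
```

The explicit loop is kept for readability. Practical implementations usually batch all heads into a single tensor operation, which computes the same result.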
While attention captures the relationships between tokens, its output is essentially a weighted average of linearly projected values. To introduce the capacity for complex, non-linear feature transformations, the Transformer employs Position-wise Feed-Forward Networks (FFN). The 'position-wise' aspect means the same network, with the same parameters, is applied to each token independently, which is equivalent to a $1 × 1$ convolution over the sequence. This lets the model transform the information gathered by the attention mechanism, temporarily projecting it into a higher-dimensional space where features may be easier to separate.
Mathematically, the FFN consists of two linear transformations with a non-linear activation function, typically ReLU or GELU, in between. With ReLU, the operation is defined as: $$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$ Here, $W_1 ∈ ℝ^{d_{model} × d_{ff}}$ projects the vector into a larger latent space (often $d_{ff} = 4 × d_{model}$), and $W_2 ∈ ℝ^{d_{ff} × d_{model}}$ projects it back to the model dimension. This 'expansion-contraction' architecture is a major source of the model's expressive power.
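A small NumPy sketch of the ReLU variant makes the 'position-wise' property concrete: applying the network to the whole sequence gives the same result as applying it to one token at a time. The sizes, random weights, and zero biases below are assumptions for illustration only.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    hidden = np.maximum(0.0, x @ W1 + b1)  # expand to d_ff and apply ReLU
    return hidden @ W2 + b2                # contract back to d_model

# Illustrative sizes (assumptions): 3 tokens, d_model = 8, d_ff = 4 * d_model.
rng = np.random.default_rng(0)
n, d_model = 3, 8
d_ff = 4 * d_model
X = rng.normal(size=(n, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

Y = ffn(X, W1, b1, W2, b2)        # all tokens at once
row0 = ffn(X[0], W1, b1, W2, b2)  # first token on its own

print(Y.shape)
print(np.allclose(Y[0], row0))    # no information flows between positions
```

Because no term in the FFN mixes rows of `X`, any interaction between tokens has to come from the attention sublayer.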
To ensure stability and facilitate the flow of gradients in deep architectures, both MHA and FFN are wrapped in residual connections followed by Layer Normalization. The output of each sublayer is formulated as $\text{LayerNorm}(x + \text{Sublayer}(x))$. The residual path preserves the original signal and gives gradients a direct route through the network, while the normalization keeps the scale of activations consistent from layer to layer. Together, the global context provided by Multi-Head Attention and the per-token non-linear processing of FFNs form the mathematical engine that allows Transformers to achieve state-of-the-art performance across diverse NLP tasks.
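The residual-plus-normalization wrapper can be sketched as follows. The learnable scale `gamma` and shift `beta`, the `eps` constant, and the `tanh` stand-in for the sublayer are assumptions made for this example. In a real block the sublayer would be MHA or the FFN.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def residual_block(x, sublayer, gamma, beta):
    # LayerNorm(x + Sublayer(x)): add the input back, then normalize.
    return layer_norm(x + sublayer(x), gamma, beta)

# Illustrative sizes and a stand-in sublayer (assumptions).
rng = np.random.default_rng(0)
n, d_model = 4, 8
X = rng.normal(size=(n, d_model))
W = rng.normal(size=(d_model, d_model))
gamma, beta = np.ones(d_model), np.zeros(d_model)

out = residual_block(X, lambda x: np.tanh(x @ W), gamma, beta)
print(out.shape)
print(out.mean(axis=-1))  # close to 0 for every token
print(out.std(axis=-1))   # close to 1 for every token
```

With `gamma` set to ones and `beta` to zeros the output is simply the standardized sum. During training these parameters are learned, so the model can rescale each feature as needed.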