At its core, the Attention mechanism is designed to solve the problem of 'contextual relevance.' In a sequence of tokens, a single word can have multiple meanings depending on the surrounding words. The intuition is to treat the input as a database: we have a 'Query' (what I am looking for), a 'Key' (what this token offers), and a 'Value' (the actual information). By calculating the similarity between the Query and the Key, the model can decide how much 'attention' to pay to the Value of a specific token, allowing it to dynamically weight the importance of different parts of the input sequence.
Mathematically, we start with Scaled Dot-Product Attention. Given an input matrix $X$, we project it into three subspaces using learned weight matrices $W^Q$, $W^K$, and $W^V$, giving $Q = XW^Q$, $K = XW^K$, and $V = XW^V$. The attention scores are the dot products of the queries with the keys, $QK^T$, scaled by the square root of the key (per-head) dimension $d_k$. Without this scaling, the dot products grow in magnitude with $d_k$ and push the softmax into saturated regions where its gradients become extremely small. The formula is expressed as: $$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$ Here, the softmax is applied row-wise, so the weights for each query sum to 1 and form a probability distribution over the positions in the sequence.
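The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the sizes (`n`, `d_model`, `d_k`), the random weights, and the function names are assumptions chosen for the example, and there is no masking or batching.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the row-wise max improves numerical stability
    # and does not change the result.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n): each query against each key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted average of the value rows

# Illustrative sizes and random weights (assumptions, not from the text).
rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
```

Each row of `weights` is the distribution one token places over all tokens in the sequence, and the matching row of `output` is the resulting mixture of value vectors.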
While a single attention head is powerful, it produces only one set of attention weights per position, so different kinds of relationships end up averaged together. Multi-Head Attention (MHA) allows the model to jointly attend to information from different representation subspaces. For instance, one head might focus on syntactic dependencies (verb-object), while another focuses on semantic references (pronoun-noun). We perform $h$ separate attention operations in parallel, each with its own set of projection weights, and then concatenate the results: $$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O $$ where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$ and $W^O$ is the final linear projection.
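As a rough sketch of the multi-head formula, the NumPy code below handles the self-attention case, where the queries, keys, and values are all derived from the same input $X$. The sizes, the random weights, and the choice to stack the per-head projections into arrays of shape `(h, d_model, d_k)` are assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """Self-attention with h heads.

    X: (n, d_model); W_Q, W_K, W_V: (h, d_model, d_k); W_O: (h * d_k, d_model).
    """
    h, _, d_k = W_Q.shape
    # matmul broadcasts X across the leading head axis: result is (h, n, d_k).
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n)
    heads = softmax(scores, axis=-1) @ V               # (h, n, d_k)
    # Concat(head_1, ..., head_h): place the heads side by side per position.
    concat = heads.transpose(1, 0, 2).reshape(X.shape[0], h * d_k)
    return concat @ W_O                                # (n, d_model)

# Illustrative sizes and random weights (assumptions, not from the text).
rng = np.random.default_rng(0)
n, d_model, h = 5, 8, 2
d_k = d_model // h
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
```

Setting $d_k = d_{\text{model}} / h$, as done here, is a common choice that keeps the total cost close to that of a single full-width head, but it is not required by the formula.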
After the MHA layer has aggregated contextual information, the model requires a mechanism to process this information individually for each position. This is the role of the Position-wise Feed-Forward Network (FFN). While MHA acts as a 'communication' layer (mixing information across tokens), the FFN acts as a 'computation' layer (processing information within each token). It consists of two linear transformations separated by a non-linear activation function, typically ReLU or GELU, and the same weights are applied to each position independently.
The mathematical structure of the FFN is characterized by an expansion and subsequent contraction of the dimensionality. Let $d_{\text{model}}$ be the embedding dimension and $d_{\text{ff}}$ be the inner layer dimension (usually $d_{\text{ff}} = 4 \times d_{\text{model}}$). With a ReLU activation, the operation is defined as: $$ \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 $$ where $W_1$ maps from $d_{\text{model}}$ to $d_{\text{ff}}$ and $W_2$ maps back from $d_{\text{ff}}$ to $d_{\text{model}}$. By projecting the data into a higher-dimensional space, the network can learn complex non-linear features before projecting it back to the original dimension for the next Transformer block.
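The expansion and contraction can be made concrete with a short NumPy sketch of the ReLU form of the FFN. The sizes, the random initialization, and the zero biases are assumptions for illustration only.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    hidden = np.maximum(0.0, x @ W1 + b1)  # expand to d_ff, then apply ReLU
    return hidden @ W2 + b2                # contract back to d_model

# Illustrative sizes and random weights (assumptions, not from the text).
rng = np.random.default_rng(0)
n, d_model = 4, 8
d_ff = 4 * d_model
W1 = rng.normal(size=(d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)

x = rng.normal(size=(n, d_model))   # one row per token position
y = ffn(x, W1, b1, W2, b2)          # (n, d_model)

# Position-wise: running a single row on its own gives the same result
# as running it alongside the other positions.
y_first = ffn(x[:1], W1, b1, W2, b2)
```

Because no operation in `ffn` mixes rows, the first row of `y` matches `y_first`, which is what "applied to each position independently" means in practice.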
In summary, the combination of MHA and FFNs allows the Transformer to handle both global context and per-token feature extraction. MHA uses the dot-product mechanism to move information between positions across the whole sequence, while the FFN applies the same learned non-linear transformation to each token to refine its individual representation. Together, they transform a static embedding into a rich, context-aware vector, forming the bedrock of modern Large Language Models.