Attention mechanisms trace their origins to the 2014 neural machine translation work of Bahdanau et al. and the 2017 "Attention Is All You Need" paper by Vaswani et al., which introduced the Transformer architecture. Before transformers, sequence-to-sequence models relied on RNNs that processed inputs sequentially, which made them slow to train and poor at capturing long-range dependencies. Attention replaced recurrence with parallelizable matrix operations: every query can simultaneously compare against all keys. The scaled dot-product attention variant (the one we implement here) became the standard because it is efficient to compute with optimized BLAS/CUDA kernels and parallelizes across all sequence positions (with positional encodings supplying order information). The "scaled" part (the $1/\sqrt{d_k}$ factor) prevents softmax saturation when embedding dimensions grow large. Post-2017, attention became ubiquitous: BERT (2018) for bidirectional language understanding, GPT-2/3 (2019-2020) for autoregressive generation, Vision Transformers (2020) for image classification, and Stable Diffusion (2022) for text-to-image synthesis. Multi-head attention (running multiple independent attention operations in parallel) emerged as the dominant pattern. Recent research explores efficient variants: sparse attention patterns (Longformer, BigBird), linear attention approximations (Performer, RWKV), and Flash Attention (optimized memory access patterns). The fundamental computation remains: compose matrix products to route information dynamically based on learned query-key similarities.
Make shapes and transposes feel inevitable, so you can reason about forward/backward passes and attention without memorizing formulas. Attention mechanisms are the computational heart of transformers, the architecture behind BERT, GPT, Vision Transformers, and modern diffusion models. Understanding attention as a sequence of matrix products (query-key matching $Q K^\top$, softmax normalization, and value aggregation $A V$) demystifies what seems like magic: how models learn which parts of the input to focus on. This example demonstrates scaled dot-product attention on tiny matrices (2 queries, 3 key-value pairs) to make the operations concrete and verifiable. The key insight is that attention is interpretable linear algebra: queries select relevant keys via inner products, softmax converts scores into probability distributions (convex weights), and weighted sums of values produce context-aware outputs. Mastering this computation pattern equips you to implement attention from scratch, debug shape mismatches (the most common transformer bug), design efficient variants (sparse attention, linear attention), and understand why attention scales quadratically with sequence length. This is not merely a deep learning curiosity: attention is the canonical example of how matrix products compose to create expressive, trainable information routing.
Compute scores, softmax weights, and outputs for tiny attention; confirm row sums of weights are 1.
Attention computes scores $QK^T/\sqrt{d}$, applies row-wise softmax to get row-stochastic weights, then outputs $AV$.
We use:
- Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
- Vectors are column vectors by default.
- $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
import numpy as np
from scripts.toy_data import softmax  # softmax helper from the accompanying scripts package

# Toy inputs: 2 queries, 3 key-value pairs, d_k = d_v = 2
Q = np.array([[1., 0.], [0., 1.]])
K = np.array([[1., 0.], [1., 1.], [0., 1.]])
V = np.array([[1., 0.], [0., 2.], [1., 1.]])

scores = Q @ K.T / np.sqrt(2)   # scaled scores, shape (2, 3); sqrt(2) = sqrt(d_k)
A = softmax(scores, axis=1)     # row-wise softmax over keys, shape (2, 3)
O = A @ V                       # weighted sum of values, shape (2, 2)
print("row sums:", A.sum(axis=1))
print("O:\n", O)

This code implements scaled dot-product attention, the core mechanism of transformer models. We define tiny query ($Q \in \mathbb{R}^{2 \times 2}$), key ($K \in \mathbb{R}^{3 \times 2}$), and value ($V \in \mathbb{R}^{3 \times 2}$) matrices to make the computation transparent and hand-verifiable. The algorithm follows three steps: (1) compute similarity scores via the matrix product $Q K^\top$ and scale by $1/\sqrt{d_k}$ to prevent softmax saturation, (2) apply row-wise softmax to convert scores into probability distributions (attention weights), and (3) aggregate value vectors via the weighted sum $A V$. Each output row is a convex combination of the value vectors, with weights determined by how strongly each query attends to each key. This pattern (query-key matching, softmax normalization, value aggregation) is the computational substrate of modern NLP (BERT, GPT), vision (ViT, CLIP), and multimodal models (DALL-E, Stable Diffusion). The code confirms that each row of $A$ sums to 1 and prints the output $O$; printing shapes and the attention matrix $A$ at each step is a useful extension, since shape mismatches are the most common source of transformer bugs and $A$ shows which keys each query focuses on. Understanding this operation as composed matrix products, rather than a black-box "attention" function, is essential for implementing custom attention variants (sparse, linear, cross-attention), debugging shape mismatches, and reasoning about computational complexity ($O(n^2 d)$ for self-attention on sequences of length $n$). The output demonstrates that attention dynamically routes information based on content: unlike fixed-weight layers, the effective transformation changes with the input.
Numerical and Shape Notes
- Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n \times d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
- Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈ 1.
- Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
- Stability eps: Add $\varepsilon$ for divisions/logs and $\varepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
- Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
- Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
- Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.
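The axis, broadcasting, and stability notes above can be illustrated with a small sketch. The helper name stable_softmax and the example scores are assumptions made for illustration; the point is only that subtracting the row maximum (with keepdims=True so the broadcast is explicit) keeps the exponentials finite where a naive version overflows.

```python
import numpy as np

def stable_softmax(scores, axis=-1):
    """Softmax along `axis`, subtracting the max first to avoid overflow."""
    # keepdims=True keeps a length-1 axis so (n_q, n_k) - (n_q, 1) broadcasts as intended
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

# Illustrative scores: the first row is large enough to overflow exp() in float64
scores = np.array([[1000.0, 1001.0, 1002.0],
                   [-5.0, 0.0, 5.0]])

# Naive softmax: exp(1000) overflows to inf, and inf / inf gives NaN
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

weights = stable_softmax(scores, axis=1)

print("naive contains NaN:", np.isnan(naive).any())
print("stable weights finite:", np.isfinite(weights).all())
print("stable row sums:", weights.sum(axis=1))
```

Softmax is unchanged by adding a constant to every score in a row, so the shifted version returns the same weights as the mathematical definition whenever the naive version is finite.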
Numerical and Implementation Notes
- Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
- Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n \times d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
- Stability: Use log-sum-exp for softmax; add small diagonal $\varepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
- Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
- Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
- Complexity & memory: Dense matmul and factorizations cost about $O(n^3)$; triangular solves and matrix-vector products cost $O(n^2)$. Prefer vectorization over Python loops; avoid materializing large intermediates.
- Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax (zeroing out after softmax requires renormalizing), then verify rows sum to ~1.
- Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x >= 0), and residual norms within tolerance (~1e-12 for float64).
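As a rough illustration of the sanity checks listed above, the following sketch builds a small SPD system and compares two solvers. The matrix size, seed, and jitter value are arbitrary choices for the example, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed for reproducibility
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + 1e-3 * np.eye(n)            # SPD by construction; the jitter keeps it positive definite
b = rng.standard_normal(n)

x_solve = np.linalg.solve(A, b)                   # no explicit inverse
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]    # reference solution

L = np.linalg.cholesky(A)                 # raises LinAlgError if A is not positive definite
Q_factor, _ = np.linalg.qr(M)             # orthogonal factor, used for the U.T @ U check

cond = np.linalg.cond(A)
residual = np.linalg.norm(A @ x_solve - b)
orth_err = np.abs(Q_factor.T @ Q_factor - np.eye(n)).max()

print("A shape:", A.shape, "cond(A):", cond)
print("residual ||Ax-b||:", residual)
print("max |Q^T Q - I|:", orth_err)
print("solve vs lstsq agree:", np.allclose(x_solve, x_lstsq, rtol=1e-6, atol=1e-6))
```

If the two solutions disagree or the residual is large, the condition number is the first thing to look at before suspecting a bug.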
Define query, key, value matrices: $Q \in \mathbb{R}^{n_q \times d_k}$ (here $2 \times 2$), $K \in \mathbb{R}^{n_k \times d_k}$ ($3 \times 2$), $V \in \mathbb{R}^{n_k \times d_v}$ ($3 \times 2$). In self-attention, all three are projections of the same input: $Q = XW_Q$, $K = XW_K$, $V = XW_V$ with separate learned matrices $W_Q, W_K, W_V$. For this toy example, we hardcode small matrices to make computations verifiable by hand.
Compute raw attention scores: $S = Q K^\top \in \mathbb{R}^{n_q \times n_k}$. Each element $S_{ij} = \sum_{l=1}^{d_k} Q_{il} K_{jl}$ is the dot product between query $i$ and key $j$. High scores indicate strong alignment. In code, this is scores = Q @ K.T.
Scale by embedding dimension: $S_{\text{scaled}} = S / \sqrt{d_k}$. This prevents softmax saturation: for components of roughly unit variance, the variance of a dot product grows in proportion to $d_k$ (so typical magnitudes grow like $\sqrt{d_k}$), pushing softmax into regimes where gradients vanish. The scaling factor stabilizes training. With $d_k = 2$ here, scores are divided by $\sqrt{2} \approx 1.414$.
Apply row-wise softmax: For each query $i$, compute $A_{ij} = \exp(S_{ij}) / \sum_{j'} \exp(S_{ij'})$, where $S$ now denotes the scaled scores. This converts scores into probability distributions: $A_{i\cdot}$ sums to 1, entries are non-negative. Use scipy.special.softmax(scores, axis=1) to handle numerical stability (it subtracts the row-max before exponentiation).
Aggregate values: Output $O = A V \in \mathbb{R}^{n_q \times d_v}$. Each output row $O_{i\cdot} = \sum_j A_{ij} V_{j\cdot}$ is a weighted combination of value vectors, where weights are attention probabilities. High-attention keys contribute more to the output.
Verify shapes throughout: Scores $\in \mathbb{R}^{2 \times 3}$, attention $\in \mathbb{R}^{2 \times 3}$, output $\in \mathbb{R}^{2 \times 2}$. Print shapes at each step to catch mismatches early. In multi-head attention, add head dimension: $Q,K,V \in \mathbb{R}^{B \times H \times N \times d}$ (batch, heads, sequence, dimension).
Inspect attention weights: Print the $A$ matrix to understand which keys each query focuses on. For example, if $A[0,2] = 0.8$, query 0 strongly attends to key 2. This interpretability is one of attention's key advantages over fully-connected layers, though "attention weights as explanations" remains debated.
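The steps above can be sketched end to end on the toy matrices, printing the shape of every intermediate. This version uses a small local softmax_rows helper (an assumption for self-containment, standing in for the softmax imported earlier or scipy.special.softmax) so it runs with NumPy alone.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax with the row max subtracted for stability."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

Q = np.array([[1., 0.], [0., 1.]])             # (n_q, d_k) = (2, 2)
K = np.array([[1., 0.], [1., 1.], [0., 1.]])   # (n_k, d_k) = (3, 2)
V = np.array([[1., 0.], [0., 2.], [1., 1.]])   # (n_k, d_v) = (3, 2)

d_k = Q.shape[1]
S = Q @ K.T                     # raw scores, (n_q, n_k)
S_scaled = S / np.sqrt(d_k)     # scaled scores, (n_q, n_k)
A = softmax_rows(S_scaled)      # attention weights, rows sum to 1
O = A @ V                       # outputs, (n_q, d_v)

for name, M in [("S", S), ("S_scaled", S_scaled), ("A", A), ("O", O)]:
    print(f"{name}: shape {M.shape}")
print("A =\n", np.round(A, 4))
print("O =\n", np.round(O, 4))
```

Checking by hand: query 0 scores $(1, 1, 0)/\sqrt{2}$ against the three keys, so its weights are approximately $(0.4011, 0.4011, 0.1978)$ and its output is approximately $(0.5989, 1.0)$; query 1 mirrors this with weights $(0.1978, 0.4011, 0.4011)$ and output approximately $(0.5989, 1.2033)$.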
Pedagogical Significance
- Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
- ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
- Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
- Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
- Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
- Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.
ML Examples and Patterns
- Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
- Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
- Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
- Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
- Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
- Query-key similarity as dot products: Attention scores are computed via $Q K^\top$, where each entry $(Q K^\top)_{ij} = q_i^\top k_j$ measures how aligned query $i$ is with key $j$. High dot product = high relevance. The scaling by $1/\sqrt{d_k}$ (embedding dimension) prevents gradients from vanishing when dimensions are large: without rescaling, softmax becomes sharper, making training unstable.
- Softmax creates convex weights: Row-wise softmax converts raw scores into probability distributions (non-negative, sum-to-one per query). This ensures outputs are convex combinations of value vectors, providing interpretable attention weights: "query 1 attends 70% to key 2, 30% to key 3."
- Value aggregation via matrix-vector products: Each output row is $\sum_j A_{ij} v_j$, a weighted sum of value vectors. Attention dynamically selects which values contribute most, unlike fixed-weight linear layers. This is why attention is called "content-based addressing": weights depend on input content.
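- Composition of matrix products: Forward pass is $\text{Attention}(Q,K,V) = \text{softmax}(Q K^\top / \sqrt{d_k}) V$, composing two matrix multiplications. Backward pass decomposes via the chain rule: $\nabla_V = A^\top \nabla_{\text{out}}$ and $\nabla_A = \nabla_{\text{out}} V^\top$. The gradient then passes through the row-wise softmax to give $\nabla_S$ (the gradient with respect to the scaled scores), from which $\nabla_Q = \nabla_S K / \sqrt{d_k}$ and $\nabla_K = \nabla_S^\top Q / \sqrt{d_k}$. Understanding these products is essential for implementing custom attention variants.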
- Shape discipline prevents bugs: Transformer codebases fail most often on shape mismatches (query dimension ≠ key dimension, attention weights incompatible with values). Tracking shapes ($Q \in \mathbb{R}^{n_q \times d_k}$, $K \in \mathbb{R}^{n_k \times d_k}$, $V \in \mathbb{R}^{n_k \times d_v}$, output $\in \mathbb{R}^{n_q \times d_v}$) makes correctness mechanical.
- Interpretability of attention weights: The matrix $A$ after softmax is directly inspectable: visualizing $A_{ij}$ shows which keys each query attends to. This has spawned entire subfields (attention visualization, probing studies) analyzing what models learn, though recent work questions whether high attention weights truly indicate "importance."
- Computational complexity: Attention is $O(n_q n_k d_k + n_q n_k d_v)$ for computing scores and aggregating values. For self-attention ($n_q = n_k = n$), this is $O(n^2 d)$, the quadratic bottleneck limiting transformers on long sequences (hence the proliferation of efficient attention variants).
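The backward pass mentioned in the list above can be checked numerically. The sketch below is one way to write the gradients of scaled dot-product attention by hand and compare them against central finite differences; the function names, random shapes, and the scalar loss $\sum_{ij} O_{ij} G_{ij}$ (with $G$ playing the role of the upstream gradient) are assumptions chosen for the test, not part of the original example.

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def attention(Q, K, V):
    """Return attention weights A and output O = A V."""
    A = softmax_rows(Q @ K.T / np.sqrt(Q.shape[1]))
    return A, A @ V

def attention_backward(Q, K, V, A, G):
    """Gradients of sum(O * G) with respect to Q, K, V (G is the upstream gradient)."""
    scale = np.sqrt(Q.shape[1])
    dV = A.T @ G                                         # (n_k, d_v)
    dA = G @ V.T                                         # (n_q, n_k)
    # Row-wise softmax Jacobian applied to dA
    dS = A * (dA - (dA * A).sum(axis=1, keepdims=True))  # (n_q, n_k)
    dQ = dS @ K / scale                                  # (n_q, d_k)
    dK = dS.T @ Q / scale                                # (n_k, d_k)
    return dQ, dK, dV

def numerical_grad(loss_fn, X, eps=1e-6):
    """Central finite differences; perturbs X in place and restores it."""
    grad = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        old = X[idx]
        X[idx] = old + eps
        f_plus = loss_fn()
        X[idx] = old - eps
        f_minus = loss_fn()
        X[idx] = old
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 5))
G = rng.standard_normal((2, 5))

A, O = attention(Q, K, V)
dQ, dK, dV = attention_backward(Q, K, V, A, G)

def loss():
    return np.sum(attention(Q, K, V)[1] * G)

err_Q = np.abs(dQ - numerical_grad(loss, Q)).max()
err_K = np.abs(dK - numerical_grad(loss, K)).max()
err_V = np.abs(dV - numerical_grad(loss, V)).max()
print("max abs error (Q, K, V):", err_Q, err_K, err_V)
```

Each gradient has the same shape as the matrix it belongs to, which is a quick way to catch a misplaced transpose before running any numerical check.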
Part 1: Query-Key Matching via Scaled Dot Products
The attention mechanism begins with computing pairwise similarities between queries and keys via the matrix product $Q K^\top \in \mathbb{R}^{n_q \times n_k}$. Each entry measures the dot product $q_i^\top k_j$: high values indicate strong alignment between query $i$ and key $j$. This is content-based addressing: the model dynamically decides which keys are relevant based on input content, unlike fixed indexing. The scaling by $1/\sqrt{d_k}$ prevents the softmax function from saturating when embedding dimensions grow large. Without scaling, dot products can become arbitrarily large (as dimension increases, random vectors have dot products with variance $\sim d_k$), pushing softmax into regions where gradients vanish. The $\sqrt{d_k}$ normalization keeps dot products at unit variance, ensuring stable gradients during training. This is why you'll see scores = (Q @ K.T) / np.sqrt(d_k) in every transformer implementation.
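The variance claim can be checked empirically. The sketch below assumes query and key components drawn independently from a standard normal distribution, which is an idealization rather than a property of trained models; the sample count and dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 5000          # number of random (query, key) pairs per dimension
results = {}

for d_k in (4, 64, 256):
    q = rng.standard_normal((n_pairs, d_k))
    k = rng.standard_normal((n_pairs, d_k))
    dots = np.sum(q * k, axis=1)               # one dot product per pair
    raw_var = dots.var()                       # expected to be close to d_k
    scaled_var = (dots / np.sqrt(d_k)).var()   # expected to be close to 1
    results[d_k] = (raw_var, scaled_var)
    print(f"d_k={d_k:4d}  var(q.k)={raw_var:8.2f}  var(q.k/sqrt(d_k))={scaled_var:5.2f}")
```

Under these assumptions the unscaled variance tracks $d_k$ while the scaled variance stays near 1, which is the motivation for dividing by $\sqrt{d_k}$ rather than $d_k$.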
Part 2: Softmax Normalization Creates Convex Weights
Row-wise softmax converts raw scores into probability distributions: $A_{ij} = \exp(S_{ij}) / \sum_{j'} \exp(S_{ij'})$. Each row $A_{i\cdot}$ has non-negative entries summing to 1, making them valid convex weights. This is crucial: the final output is a convex combination of value vectors, ensuring outputs lie in the convex hull of $V$'s rows. Softmax is differentiable everywhere (exponential smoothness enables clean gradients), temperature-controllable (scaling scores by $1/T$ adjusts sharpness), and invariant to adding constants to scores (subtracting row-max prevents overflow). The probabilistic interpretation is powerful: "query 1 attends 60% to key 0, 40% to key 2" is directly inspectable, unlike hidden activations in MLPs. However, recent work questions whether high attention weights truly indicate "importance" (attention might satisfy constraints without being causal), spawning research into attention flow, integrated gradients, and counterfactual probing.
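Two of the softmax properties described above, shift invariance and temperature control, are easy to see numerically. In this sketch the temperature argument and the example scores are illustrative assumptions; sharpness is measured simply by the largest weight in each row.

```python
import numpy as np

def softmax_rows(S, temperature=1.0):
    """Row-wise softmax of S / temperature, with the row max subtracted."""
    Z = S / temperature
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

S = np.array([[2.0, 1.0, 0.0],
              [0.5, 0.5, 3.0]])

A = softmax_rows(S)
A_shifted = softmax_rows(S + 100.0)            # add the same constant to every score
A_sharp = softmax_rows(S, temperature=0.1)     # low temperature: closer to one-hot
A_flat = softmax_rows(S, temperature=10.0)     # high temperature: closer to uniform

print("shift invariant:", np.allclose(A, A_shifted))
print("max weight per row, T=0.1 :", A_sharp.max(axis=1))
print("max weight per row, T=1   :", A.max(axis=1))
print("max weight per row, T=10  :", A_flat.max(axis=1))
```

Dividing scores by $\sqrt{d_k}$ in attention acts like a fixed temperature that grows with the key dimension.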
Part 3: Value Aggregation via Weighted Sum
The final output $O = A V$ performs a weighted sum: each output row $O_{i\cdot} = \sum_j A_{ij} V_{j\cdot}$. High-attention keys contribute more to the output. This is differentiable memory access: the model "reads" from a key-value store, with read weights determined by query-key similarity. Unlike hard attention (sampling one key) or linear layers (fixed weights), soft attention blends all values smoothly, enabling gradient flow to all parameters. In self-attention ($Q$, $K$, and $V$ are all projections of the same input), this creates context-aware representations: each token's output depends on all other tokens it attends to. For cross-attention (e.g., encoder-decoder), queries come from the decoder, keys/values from the encoder, enabling the decoder to "look at" encoder outputs. The value matrix $V$ doesn't need the same dimension as $Q,K$: $d_v$ may differ from $d_k$ (sometimes $d_v < d_k$ for efficiency), though standard transformers use $d_v = d_k = d_{\text{model}}/H$ (model dimension divided by number of heads).
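A small cross-attention-shaped sketch makes the shape bookkeeping and the convex-combination property concrete. The sizes below (4 "decoder" queries, 7 "encoder" key-value pairs, $d_k = 8$, $d_v = 3$) and the random inputs are hypothetical, chosen only so that every dimension is different.

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention; returns weights A and output O."""
    A = softmax_rows(Q @ K.T / np.sqrt(Q.shape[1]))
    return A, A @ V

rng = np.random.default_rng(1)
n_dec, n_enc, d_k, d_v = 4, 7, 8, 3        # hypothetical sizes
Q = rng.standard_normal((n_dec, d_k))      # queries (e.g., from a decoder)
K = rng.standard_normal((n_enc, d_k))      # keys (e.g., from an encoder)
V = rng.standard_normal((n_enc, d_v))      # values; d_v differs from d_k

A, O = attention(Q, K, V)
print("A:", A.shape, " O:", O.shape)       # (n_dec, n_enc) and (n_dec, d_v)

# Convex combination: every output coordinate lies between the column-wise
# min and max of V (a necessary consequence of lying in the convex hull).
tol = 1e-12
inside = (O >= V.min(axis=0) - tol) & (O <= V.max(axis=0) + tol)
print("outputs within value bounds:", inside.all())

# Hard attention for comparison: pick only the single highest-weight value row
O_hard = V[A.argmax(axis=1)]
print("hard-attention output shape:", O_hard.shape)
```

The number of queries sets the number of output rows and $d_v$ sets the output width; $d_k$ only has to agree between $Q$ and $K$.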
Part 4: Matrix Products as Composition of Linear Maps
Attention composes two linear maps: (1) $Q,K,V$ are typically obtained by projecting input embeddings $X$ through learned matrices $W_Q, W_K, W_V$, and (2) the attention output is projected again via $W_O$. The forward pass is:
\[\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_H) W_O\]
where each head computes $\text{Attention}(XW_Q^h, XW_K^h, XW_V^h)$. This is a cascade of matrix products: $X \to XW_Q \to (XW_Q)(XW_K)^\top \to \sigma(\cdot) \to A(XW_V) \to \text{Concat} \to (\cdots)W_O$. Understanding this chain is essential for backpropagation: gradients flow backward through each product via the chain rule, decomposing into Jacobian-vector products. For example, $\nabla_{W_Q} \mathcal{L} = X^\top (\nabla_Q \mathcal{L})$ where $\nabla_Q = \nabla_A \cdot \partial A/\partial Q$ involves the softmax Jacobian. Efficient implementations fuse these operations to minimize memory transfers (Flash Attention achieves 2-4× speedups by reordering computations to fit in SRAM).
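The cascade above can be sketched in a few lines of NumPy for a single unbatched sequence. This is a minimal illustration, assuming the per-head projections $W_Q^h, W_K^h, W_V^h$ are stored as contiguous column blocks of full-width matrices; the function name, sizes, and random weights are all hypothetical.

```python
import numpy as np

def softmax_last(S):
    """Softmax over the last axis (the key axis)."""
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    n, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (n, d_model) -> (H, n, d_head): head h gets columns h*d_head .. (h+1)*d_head - 1
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)      # (H, n, n)
    A = softmax_last(scores)                                 # (H, n, n)
    heads = A @ V                                            # (H, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)    # (n, d_model)
    return concat @ W_O, A

rng = np.random.default_rng(0)
n, d_model, H = 5, 8, 2                                      # hypothetical sizes
X = rng.standard_normal((n, d_model))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))

out, A = multi_head_attention(X, W_Q, W_K, W_V, W_O, H)
print("output:", out.shape, " attention:", A.shape)
print("rows sum to 1 in every head:", np.allclose(A.sum(axis=-1), 1.0))
```

Batched matmul over the leading head axis gives the same result as looping over heads and concatenating, while keeping the shapes explicit at every stage.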
Why This Matters for ML
Attention is the computational engine of modern AI: transformers power GPT (autoregressive language models), BERT (bidirectional encoders), Vision Transformers (image classification), DALL-E/Stable Diffusion (text-to-image), AlphaFold2 (protein structure), Perceiver (general-purpose architectures), and recent audio/video models. The quadratic complexity $O(n^2 d)$ for self-attention on sequences of length $n$ is the main bottleneck, and it has spawned entire research directions: sparse attention (Longformer, BigBird with predefined patterns), linear attention (Performer, RWKV approximating softmax with kernels), and Flash Attention (memory-optimized exact attention). Multi-head attention (running $H$ independent attention operations in parallel, then concatenating) allows models to attend to different representation subspaces simultaneously, which is empirically crucial for performance. Attention weights are often visualized to interpret model behavior (e.g., does a pronoun attend to its antecedent?), though this interpretability is debated. Beyond NLP, attention applies anywhere dynamic, content-based routing is needed: graph neural networks (nodes attend to neighbors), reinforcement learning (attend to relevant state components), and slot attention (object-centric representations).
Numerical and Implementation Notes
Use scipy.special.softmax(scores, axis=1) for numerically stable softmax (subtracts row-max before exponentiating to prevent overflow). For large-scale attention, consider: (1) mixed precision (FP16 or BF16 matrix multiplies, with FP32 accumulation and master weights), (2) gradient checkpointing (recompute activations during backward pass to save memory), (3) fused kernels (xFormers, Flash Attention optimize memory access patterns), and (4) causal masking for autoregressive models (set $S_{ij} = -\infty$ for $j > i$ to prevent attending to future tokens). In multi-head attention, reshape matrices to include head dimension: Q.shape = (batch, num_heads, seq_len, d_k), then compute attention independently per head. The output projection $W_O$ mixes information across heads. Dropout is typically applied to attention weights (zeroing out some $A_{ij}$ entries) to prevent overfitting. This breaks the exact "sum-to-one" property: standard inverted dropout rescales the surviving entries by $1/(1-p)$ rather than renormalizing each row, so rows sum to 1 only in expectation. For variable-length sequences, use padding masks (set the scores of padding keys to $-\infty$ before softmax so they receive zero weight) to prevent attending to meaningless positions.
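Causal and padding masks can be combined into a single boolean mask applied before softmax. The following is a minimal sketch: masked_softmax is a hypothetical helper, the scores are random, and the choice of which token is padding is made up for the example.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Row-wise softmax where entries with mask == False get zero weight.

    Assumes every row has at least one allowed entry; a fully masked row
    would produce NaN (-inf minus -inf) and needs separate handling.
    """
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    E = np.exp(masked)                       # exp(-inf) is exactly 0
    return E / E.sum(axis=-1, keepdims=True)

n = 4
rng = np.random.default_rng(0)
scores = rng.standard_normal((n, n))

causal = np.tril(np.ones((n, n), dtype=bool))    # query i may attend to keys j <= i
is_real = np.array([True, True, True, False])    # hypothetical: the last token is padding
mask = causal & is_real[None, :]                 # (n, n) & (1, n) broadcasts over queries

A = masked_softmax(scores, mask)
print(np.round(A, 3))
print("shape preserved:", A.shape == scores.shape)
print("masked entries are zero:", (A[~mask] == 0).all())
print("row sums:", A.sum(axis=1))
```

Masking before softmax keeps each row a valid probability distribution over the allowed keys, whereas zeroing weights after softmax would leave rows that no longer sum to 1.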
Shape Notes and Scaling Considerations
Standard dimensions: $Q,K \in \mathbb{R}^{n \times d_k}$, $V \in \mathbb{R}^{n \times d_v}$, scores $\in \mathbb{R}^{n \times n}$, output $\in \mathbb{R}^{n \times d_v}$. For multi-head attention with $H$ heads and model dimension $d_{\text{model}}$, typically $d_k = d_v = d_{\text{model}}/H$ (e.g., $d_{\text{model}}=512$, $H=8$, $d_k=64$). This keeps total parameter count constant while allowing diverse attention patterns. Batch dimensions prepend: (batch_size, seq_len, d_model). Memory usage is dominated by storing the attention matrix $A \in \mathbb{R}^{n \times n}$ per batch per head: for a 2048-token sequence with 32 heads and batch size 8, this requires $8 \times 32 \times 2048 \times 2048 \times 4 \text{ bytes} \approx 4.3 \text{ GB}$ just for attention weights (in FP32), not including activations/gradients. Flash Attention avoids materializing the full $A$ matrix in HBM (high-bandwidth memory), instead computing attention in tiles that fit in SRAM (on-chip cache), reducing memory from $O(n^2)$ to $O(n)$ while maintaining exact computation. For extremely long sequences ($n > 10^4$), sparse or linear attention variants replace dense $O(n^2)$ with $O(n \log n)$ or $O(n)$ complexity by restricting which positions can attend to each other (local windows, fixed patterns, learned sparsity, or kernel approximations).
- Interpretation: relate algebraic steps to geometry (subspaces, projections) and to ML behavior (generalization, stability).
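The attention-matrix memory estimate in this section is simple arithmetic and can be reproduced directly. The helper name below is an assumption; the numbers are the ones quoted in the text (batch 8, 32 heads, 2048 tokens, 4 bytes per FP32 element).

```python
def attention_matrix_bytes(batch, heads, seq_len, bytes_per_element=4):
    """Bytes needed to store one (seq_len x seq_len) weight matrix per batch item per head."""
    return batch * heads * seq_len * seq_len * bytes_per_element

n_bytes = attention_matrix_bytes(batch=8, heads=32, seq_len=2048)
print(n_bytes, "bytes")                    # 4294967296
print(n_bytes / 1e9, "GB (decimal)")       # about 4.29
print(n_bytes / 2**30, "GiB (binary)")     # 4.0

# Quadratic growth: doubling the sequence length quadruples the memory
ratio = attention_matrix_bytes(8, 32, 4096) / n_bytes
print("memory ratio for 2x sequence length:", ratio)
```

The product is exactly $2^{32}$ bytes, which is 4 GiB or about 4.3 GB in decimal units.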
Attention mechanisms emerged from the limitations of fixed-length encoder representations in sequence-to-sequence models. Bahdanau et al. (2015) introduced additive attention for neural machine translation, allowing decoders to dynamically focus on different encoder states at each time step, a breakthrough that enabled translating long sentences. The dot-product attention variant (simpler and faster due to optimized matrix multiplication) became standard with Luong et al. (2015). The true revolution came with Vaswani et al.'s "Attention Is All You Need" (2017), which discarded recurrence entirely: the Transformer architecture used only attention and feedforward layers, achieving state-of-the-art translation quality while being fully parallelizable. The key innovations were multi-head attention (running multiple attention operations in parallel to capture diverse patterns), positional encodings (sinusoidal functions or learned embeddings to inject sequence order), and scaled dot-product attention with the $1/\sqrt{d_k}$ normalization for training stability.
Post-2017, transformers proliferated across domains. BERT (Devlin et al., 2018) introduced bidirectional pretraining via masked language modeling, revolutionizing NLP transfer learning. GPT-2/3 (Radford et al., 2019; Brown et al., 2020) demonstrated that scaling autoregressive transformers (up to 175B parameters for GPT-3) unlocked few-shot and zero-shot learning capabilities: language models could solve tasks from prompts alone, without fine-tuning. Vision Transformers (ViT) (Dosovitskiy et al., 2020) applied transformers to image patches, matching or exceeding CNN performance on ImageNet with sufficient pretraining data. Transformers now dominate: CLIP (contrastive text-image pretraining), DALL-E/Stable Diffusion (text-to-image generation via cross-attention between text and image latents), AlphaFold2 (protein structure prediction with geometric attention), Perceiver (general-purpose architecture for arbitrary modalities), and Whisper (speech recognition).
The quadratic complexity of attention, $O(n^2 d)$ for sequence length $n$, motivated extensive research into efficient variants. Sparse attention patterns restrict which positions attend to each other (local windows, global tokens, random connections), reducing complexity to $O(n)$ in sequence length for Longformer and BigBird, or $O(n \sqrt{n})$ for Sparse Transformers. Linear attention methods (Performer, RWKV, RetNet) approximate softmax with kernel features or recurrent formulations, achieving $O(n d^2)$ complexity but often with quality degradation. Flash Attention (Dao et al., 2022) maintains exact attention while optimizing for the GPU memory hierarchy: it reorders operations to minimize HBM (high-bandwidth memory) access, achieving 2-4× speedups and enabling training on sequences 4× longer. Many modern large language models use multi-query or grouped-query attention (sharing keys/values across multiple heads) to reduce KV cache memory during inference; PaLM uses multi-query attention and LLaMA 2 uses grouped-query attention in its largest model.
Beyond standard attention, specialized variants address specific needs: cross-attention (queries from one sequence, keys/values from another) enables encoder-decoder models and image-conditioning in diffusion models; causal attention (masking future positions) ensures autoregressive generation; rotary positional embeddings (RoPE) encode relative positions directly into query-key products; ALiBi (attention with linear biases) avoids learned positional encodings; sparse mixture-of-experts (routing tokens to specialized subnetworks) scales model capacity without proportional compute. Attention's interpretability remains debated: visualizing attention weights spawned entire research directions (BERTology, attention flow, probing tasks), but recent work shows attention patterns can be misleading (models satisfy task constraints without attention weights indicating true causal importance). Nonetheless, attention's success is empirical: transformers achieve state-of-the-art performance across language, vision, speech, biology, and multimodal domains, making attention the dominant computational pattern in modern AI.
- Linear Maps (Chapter 4): Attention is a data-dependent linear map: $V \mapsto AV$, where $A = \sigma(QK^\top/\sqrt{d_k})$ is computed from the input. For a fixed $A$ the output is linear in $V$, but unlike a standard weight matrix, the attention weights are regenerated on each forward pass, enabling content-based routing.
- Inner Products (Chapter 5): Query-key matching uses dot products as similarity measures. High $\langle q_i, k_j \rangle$ indicates semantic alignment. This connects to kernel methods: attention can be viewed as a soft kernel density estimator over key-value memories.
- Orthogonality (Chapter 6): Multi-head attention projects $Q,K,V$ through independent weight matrices. Orthogonalizing these projections (via SVD or Gram-Schmidt) can reduce redundancy across heads, though standard implementations don't enforce this.
- SVD (Chapter 10): Low-rank approximations of attention matrices (factorizing $A \approx U \Sigma V^\top$ with truncated singular values) enable efficient attention for long sequences. Linformer and other linear attention variants exploit this structure.
- Least Squares (Chapter 12): Training attention parameters ($W_Q, W_K, W_V$) minimizes a task loss via gradient descent; each update can be loosely viewed as solving a local least-squares problem in weight space. The softmax-attention mechanism is differentiable everywhere (thanks to exponential smoothness), enabling clean backpropagation.
- Conditioning (Chapter 14): Ill-conditioned attention (e.g., all scores nearly equal) produces uniform weights, losing selectivity. Monitoring entropy of attention distributions ($H(A_{i\cdot}) = -\sum_j A_{ij} \log A_{ij}$) diagnoses this: high entropy = diffuse attention, low entropy = focused attention.
- Matrix Products (Chapter 16): Attention chains two matrix multiplications ($QK^\top$, then $AV$), demonstrating how non-commutative composition creates expressive transformations. Fused kernels (Flash Attention) exploit this structure to minimize memory transfers between GPU memory hierarchies.
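The entropy diagnostic mentioned in the Conditioning item above can be computed per query row. This is a small sketch on hand-picked weight matrices (uniform, focused, and one-hot rows); attention_entropy is a hypothetical helper, and zero weights are treated as contributing zero, following the usual convention $0 \log 0 = 0$.

```python
import numpy as np

def attention_entropy(A):
    """Row-wise Shannon entropy H(A_i) = -sum_j A_ij log A_ij (natural log)."""
    safe = np.where(A > 0, A, 1.0)      # log(1) = 0, so zero weights contribute nothing
    return -(A * np.log(safe)).sum(axis=1)

A = np.array([[1/3, 1/3, 1/3],          # diffuse: uniform over 3 keys
              [0.98, 0.01, 0.01],       # focused on key 0
              [1.0, 0.0, 0.0]])         # one-hot: fully focused

H = attention_entropy(A)
max_entropy = np.log(A.shape[1])        # entropy of the uniform distribution over n_k keys

print("entropies:", np.round(H, 4))
print("fraction of maximum:", np.round(H / max_entropy, 4))
```

Normalizing by $\log n_k$ makes rows comparable across different numbers of keys: values near 1 indicate diffuse attention and values near 0 indicate focused attention.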