Example number
21
Example slug
example_21_dot_products_norms_and_cosine_similarity_retrieval
Background

Dot products, norms, and angles are the backbone of geometric reasoning in linear algebra. For vectors $x, q \in \mathbb{R}^d$, the dot product $\langle x, q\rangle = x^\top q$ quantifies alignment and scales with the lengths of both vectors. The Euclidean norm $\|x\|_2 = \sqrt{\sum_i x_i^2}$ measures length, and for nonzero vectors the angle $\theta$ is defined by $\cos\theta = \frac{x^\top q}{\|x\|_2\,\|q\|_2}$; the Cauchy–Schwarz inequality guarantees that this ratio lies in $[-1, 1]$. Cosine similarity therefore isolates direction: two vectors that point the same way (up to positive scale) have cosine 1, even if one is very short.
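
The scale behavior described above can be checked numerically. The following is a minimal sketch; the vectors and the helper name `cosine` are illustrative assumptions, not part of the original example.

```python
import numpy as np

def cosine(x, q):
    """Cosine of the angle between two nonzero vectors."""
    return float(x @ q / (np.linalg.norm(x) * np.linalg.norm(q)))

q = np.array([1.0, 0.5])
x = np.array([2.0, 1.0])      # same direction as q, twice as long
x_small = 1e-3 * x            # same direction, much shorter

dot_big, dot_small = float(x @ q), float(x_small @ q)
cos_big, cos_small = cosine(x, q), cosine(x_small, q)

print(dot_big, dot_small)     # the dot product shrinks with the vector
print(cos_big, cos_small)     # the cosine stays at (approximately) 1
```

Shrinking `x` by a factor of 1000 shrinks the dot product by the same factor, while the cosine is unchanged up to rounding.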

In information retrieval, the vector space model ranks documents by similarity to a query, commonly using cosine because it discounts document length effects and term frequency scaling. In modern ML, embedding spaces for text, images, and users often use cosine to measure semantic closeness after normalization; dot product is preferred when magnitude carries signal (e.g., unnormalized counts or learned scaling in recommendation and attention). Understanding the trade-offs lets you design scoring functions that align with your data and training objectives.

Purpose

Build a crisp, ML-first mental model for similarity scoring that you can reuse across retrieval, recommendation, and embedding-based search. This example contrasts two ubiquitous choices: raw dot product (inner product) versus cosine similarity. The dot product mixes alignment with magnitude and is appropriate when vector length encodes importance (counts, confidence, scale). Cosine similarity removes magnitude to measure pure directional agreement, making it ideal when only orientation in feature space should matter (normalized embeddings).

You will learn when these scores agree, when they diverge, and how simple normalization flips rankings. The goal is to make you fluent in choosing the right similarity for your data and model, with clear shape discipline and numerically stable implementations.
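
One case where the two scores agree exactly is when every vector has unit length. The sketch below uses the query and documents from this example and shows that the dot product of unit-normalized vectors equals the cosine of the originals, so the rankings coincide; the variable names are illustrative.

```python
import numpy as np

q = np.array([1.0, 0.5])
docs = np.array([[10.0, 0.0], [0.5, 0.5], [-1.0, 0.0]])

# Scale every document row and the query to unit Euclidean length.
docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
q_unit = q / np.linalg.norm(q)

dot_unit = docs_unit @ q_unit   # dot product of the normalized vectors
cos = (docs @ q) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))

rank_raw_dot = np.argsort(-(docs @ q))   # ranking by raw dot product
rank_unit_dot = np.argsort(-dot_unit)    # ranking after normalization
rank_cos = np.argsort(-cos)              # ranking by cosine

print(rank_raw_dot, rank_unit_dot, rank_cos)
```

Here normalization changes the dot-product ranking from `[0, 1, 2]` to `[1, 0, 2]`, which matches the cosine ranking.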

Problem

Rank document vectors by dot product and cosine similarity against a query and compare rankings.

Solution (Math)

The dot product $q^\top x$ between the query $q$ and a document vector $x$ depends on both alignment and magnitude. Cosine similarity

\[ \frac{q^\top x}{\|q\|_2\,\|x\|_2} \]

depends only on the angle between them. Rankings can therefore differ when document norms vary.
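
As a worked check of this claim (a sketch, writing $x$ for a document vector and $\theta$ for its angle with $q$), the dot product factors as

\[ q^\top x = \|q\|_2\,\|x\|_2\cos\theta. \]

For a fixed query, $\|q\|_2$ is a positive constant, so ranking by dot product sorts documents by $\|x\|_2\cos\theta$, while ranking by cosine sorts by $\cos\theta$ alone. With $q = (1, 0.5)$, the document $x_1 = (10, 0)$ has $\|x_1\|_2\cos\theta_1 \approx 10 \times 0.894 = 8.94$, and $x_2 = (0.5, 0.5)$ has $\|x_2\|_2\cos\theta_2 \approx 0.707 \times 0.949 \approx 0.671$. Dot product therefore ranks $x_1$ first, whereas cosine ranks $x_2$ first because $0.949 > 0.894$.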

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

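import numpy as np

# Query vector, shape (d,) = (2,)
q = np.array([1., 0.5])
# Document matrix, shape (n, d) = (3, 2); one document per row
docs = np.array([[10., 0.], [0.5, 0.5], [-1., 0.]])

# Dot-product score per document: x_i^T q, shape (3,)
dot = docs @ q
# Cosine score: divide by each row norm and by the query norm
cos = dot / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))

# argsort is ascending, so negate the scores for a descending ranking
print("dot:", dot, "rank:", np.argsort(-dot))
print("cos:", cos, "rank:", np.argsort(-cos))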
Code Introduction

This snippet contrasts two ways to score how well each document vector aligns with a query: raw dot product versus cosine similarity. With $q \in \mathbb{R}^2$ and $docs \in \mathbb{R}^{3\times 2}$, the line dot = docs @ q computes per-row inner products $x^\top q$, yielding $[10.0,\;0.75,\;-1.0]$. Sorting in descending order via np.argsort(-dot) ranks the documents as [0, 1, 2], favoring the first document because its large magnitude along $q$ dominates.

Cosine similarity normalizes away vector lengths to measure pure directional agreement: \[ \cos(\theta) \,=\, \frac{x^\top q}{\|x\|\,\|q\|}. \] Here $\|q\|=\sqrt{1.25}\approx 1.118$, and the document norms are $[10,\;\sqrt{0.5}\approx 0.707,\;1]$, giving cosine scores $\approx [0.894,\;0.949,\;-0.894]$. Ranking by np.argsort(-cos) yields [1, 0, 2], so the second document is most parallel to $q$ despite its small length. Interpretation: dot product answers “how much signal along $q$, scaled by magnitude,” while cosine answers “how parallel to $q$,” independent of scale. Gotchas: guard against zero norms to avoid division by zero (add a small $\epsilon$ or prefilter), and remember argsort is ascending—negate to sort by descending scores.
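
The zero-norm gotcha can be handled as in the following minimal sketch. The function name `cosine_scores`, the `eps` default, and the added all-zero document row are assumptions made for illustration; clamping the denominator is one of several reasonable guards.

```python
import numpy as np

def cosine_scores(docs, q, eps=1e-12):
    """Cosine of each row of docs with q; zero-norm rows score 0.0."""
    docs = np.asarray(docs, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    denom = np.linalg.norm(docs, axis=1) * np.linalg.norm(q)
    # Clamp the denominator so an all-zero row (or query) gives 0/eps = 0
    # instead of 0/0 = NaN.
    return (docs @ q) / np.maximum(denom, eps)

q = np.array([1.0, 0.5])
# Same documents as the example, plus a hypothetical all-zero row.
docs = np.array([[10.0, 0.0], [0.5, 0.5], [-1.0, 0.0], [0.0, 0.0]])

cos = cosine_scores(docs, q)
rank = np.argsort(-cos)   # argsort is ascending, so negate for descending
print(cos, rank)
```

The clamp also distorts scores for vectors whose norm product is nonzero but smaller than `eps`, so prefiltering such rows may be preferable when tiny vectors are meaningful.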

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n \times d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $\varepsilon$ for divisions/logs and $\varepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.
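
The axis and broadcasting points above can be illustrated with row normalization. This is a sketch on a made-up matrix `X`; it shows why `keepdims=True` (or an explicit reshape) makes the intended broadcast unambiguous.

```python
import numpy as np

X = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])    # shape (n, d) = (3, 2)

norms_flat = np.linalg.norm(X, axis=1)                 # shape (3,)
norms_col = np.linalg.norm(X, axis=1, keepdims=True)   # shape (3, 1)

X_unit = X / norms_col   # (3, 2) / (3, 1): divides each row by its own norm
assert X_unit.shape == X.shape
assert np.allclose(np.linalg.norm(X_unit, axis=1), 1.0)

# X / norms_flat tries to broadcast (3, 2) with (3,), which fails here
# because the trailing dimensions (2 and 3) do not match. With a square X
# it would run silently and scale columns instead of rows.
try:
    X / norms_flat
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

print(X_unit, broadcast_failed)
```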

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n \times d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $\varepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
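  • Complexity & memory: For $n \times n$ matrices, dense matmul and factorizations cost about $O(n^3)$; triangular solves and matrix-vector products cost $O(n^2)$. Scoring $n$ documents of dimension $d$ against one query costs $O(nd)$. Prefer vectorization over Python loops; avoid materializing large intermediates.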
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x ≥ 0 for sampled x), and residual norms within tolerance (~1e-12 for float64).
  • Compute dot-product scores efficiently via a matrix–vector multiply: dot = docs @ q where docs.shape == (n, d) and q.shape == (d,).
  • Compute per-document norms with np.linalg.norm(docs, axis=1) and the query norm with np.linalg.norm(q); the document norms form a nonnegative vector of shape (n,) and the query norm is a nonnegative scalar.
  • Form cosine scores via safe normalization: cos = dot / (doc_norms * q_norm). Add a small eps or prefilter zero-norm rows to avoid division by zero.
  • Rank by descending score using np.argsort(-scores); confirm ranks differ between dot and cosine when magnitudes vary.
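  • Validate numerics: check that cosine scores lie in $[-1, 1]$ up to rounding, that they match an independent computation such as (docs / doc_norms[:, None]) @ (q / q_norm) under np.allclose, and that doc_norms >= 0. Track shapes explicitly.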
  • For large batches, compute in chunks or with BLAS-backed operations to maintain throughput; keep everything in float32/float64 consistently.
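
The chunking advice in the last bullet could look like the sketch below. The function name `cosine_scores_chunked`, the `chunk_size` and `eps` defaults, and the random test data are all assumptions for illustration.

```python
import numpy as np

def cosine_scores_chunked(docs, q, chunk_size=1024, eps=1e-12):
    """Cosine of each row of docs with q, computed chunk_size rows at a time."""
    q = np.asarray(q, dtype=np.float64)
    q_unit = q / max(np.linalg.norm(q), eps)
    out = np.empty(docs.shape[0], dtype=np.float64)
    for start in range(0, docs.shape[0], chunk_size):
        # Only one block of rows is converted and normalized at a time.
        block = np.asarray(docs[start:start + chunk_size], dtype=np.float64)
        block_norms = np.maximum(np.linalg.norm(block, axis=1), eps)
        out[start:start + chunk_size] = (block @ q_unit) / block_norms
    return out

rng = np.random.default_rng(0)   # fixed seed for reproducibility
docs = rng.normal(size=(5000, 16))
q = rng.normal(size=16)

chunked = cosine_scores_chunked(docs, q, chunk_size=512)
direct = (docs @ q) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
print(np.allclose(chunked, direct))
```

Each chunk is still a BLAS-backed matrix-vector product, so the Python loop runs once per chunk rather than once per document.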
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Dot vs cosine: dot product couples alignment and magnitude; cosine isolates direction via normalization.
  • Rankings can differ: normalization can re-order documents even when dot-product ordering is clear.
  • Shape and broadcasting discipline: docs @ q yields per-row scores; norms computed with axis=1 broadcast cleanly.
  • Practical ranking: np.argsort sorts ascending; negate scores for descending. Guard against zero norms.
  • Model choice guidance: use dot when length encodes importance; use cosine when only orientation should matter.
Notes

Part 1: Forward Pass — Computing the Linear Map. Scoring by dot product is a forward pass of a linear map: for docs ∈ ℝ^{n×d} and q ∈ ℝ^d, docs @ q produces an n-vector of per-document signals $x_i^\top q$. This mixes alignment with magnitude, which is desirable when lengths reflect importance (e.g., term counts, confidence). Shape discipline matters: ensure docs rows correspond to examples and q is aligned with columns.
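
The same linear map extends to several queries at once. In this sketch the second query row is a made-up addition; stacking queries as rows of a matrix and multiplying by its transpose gives an (n, m) score matrix with one column per query.

```python
import numpy as np

docs = np.array([[10.0, 0.0], [0.5, 0.5], [-1.0, 0.0]])   # (n, d) = (3, 2)
queries = np.array([[1.0, 0.5], [0.0, 1.0]])               # (m, d) = (2, 2)

scores = docs @ queries.T   # (n, m): scores[i, j] = docs[i] . queries[j]
assert scores.shape == (3, 2)

# Column 0 reproduces the single-query result docs @ q.
single = docs @ queries[0]
assert np.allclose(scores[:, 0], single)

best_per_query = np.argmax(scores, axis=0)   # top document for each query
print(scores, best_per_query)
```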

Part 2: Normalization and Cosine Similarity. Cosine similarity divides out the magnitudes, $\cos\theta = \frac{x^\top q}{\|x\|\,\|q\|}$, focusing strictly on directional alignment. This can flip rankings compared to dot product when a small, well-aligned vector beats a large but poorly aligned one. Numerically, guard against zero vectors by adding a small $\varepsilon$ to denominators or prefiltering.

Part 3: Ranking, Sorting, and Evaluation. Practical ranking uses np.argsort which sorts ascending; negate scores to obtain descending order. Verify ties and near-ties carefully, especially under normalization. Consider downstream metrics (precision@k, recall@k, NDCG) to evaluate whether dot or cosine better matches task semantics.
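
Tie handling and a simple evaluation metric can be sketched as follows. The helper names `rank_desc` and `precision_at_k` and the relevance label are hypothetical; the scores are the rounded values from this example.

```python
import numpy as np

def rank_desc(scores):
    """Indices sorted by descending score; ties keep original index order."""
    return np.argsort(-np.asarray(scores), kind="stable")

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked indices that are in the relevant set."""
    top_k = ranking[:k]
    return sum(int(i) in relevant for i in top_k) / k

dot = np.array([10.0, 0.75, -1.0])
cos = np.array([0.894, 0.949, -0.894])
relevant = {1}   # hypothetical relevance label, for illustration only

p_dot = precision_at_k(rank_desc(dot), relevant, k=1)
p_cos = precision_at_k(rank_desc(cos), relevant, k=1)
tied = rank_desc(np.array([0.5, 0.9, 0.5]))   # documents 0 and 2 tie

print(p_dot, p_cos, tied)
```

Under this assumed label, cosine places the relevant document first and dot product does not; with a different label the conclusion could reverse, which is why the choice should be evaluated against the task's own relevance data.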

Part 4: Practical Gotchas and Debugging. Watch for unit/scale mismatches across features; normalization (per-row or per-column) changes the geometry of similarity. Track shapes and dtypes, and add assertions that norms are nonnegative and scores are finite. In larger systems, start with small, interpretable examples like this one and compare dot and cosine rankings as a first debugging step for retrieval and embedding pipelines.

History and Applications

The geometric relationship between dot product, norm, and angle traces back to classical results like the Cauchy–Schwarz inequality. In information retrieval, the vector space model popularized cosine similarity for ranking documents because it mitigates raw length effects and emphasizes directional alignment in term-weight space (e.g., tf–idf). As datasets grew, cosine and inner-product scoring became staples of large-scale search and recommendation, often paired with approximate nearest neighbor indices.

In modern ML, embedding-based retrieval uses cosine to compare normalized representations of sentences, images, users, and items; dot product remains central when magnitude is informative or when scores feed into softmax layers (as in attention mechanisms). Word embeddings (e.g., GloVe, word2vec) and sentence embeddings (e.g., Sentence-BERT) routinely evaluate semantic similarity via cosine. Choosing between dot and cosine—and understanding how normalization changes rankings—remains a practical, high-impact decision across search, recommendation, and representation learning.

Connection to Broader Examples
  • Chapter 5 (inner products and norms): this is a direct application of $\langle x, q\rangle$ and $\|\cdot\|_2$ to retrieval.
  • Chapter 11 (PCA): angles and cosine relate to principal directions; normalization interacts with variance alignment.
  • Chapter 16 (attention): attention scores start from dot products; softmax then normalizes the scores across keys into convex weights, which is a different normalization from cosine's per-vector length scaling.
  • Chapter 19 (embeddings): cosine is a standard measure of embedding similarity; dot is used when magnitude encodes confidence.
  • Chapter 14 (conditioning): normalization can improve stability and comparability by reducing scale-induced sensitivity.
