Example number
5
Example slug
example_5_dot_products_norms_and_cosine_similarity_retrieval
Background

In information retrieval and embedding-based search, document and query vectors live in a shared vector space where similarity drives ranking. The dot product grows with both directional alignment and vector magnitude; cosine similarity rescales both vectors to unit length so that only the angle between them matters.

The dot product is natural for models where magnitude encodes confidence or frequency (e.g., unnormalized logits, term-frequency counts), while cosine is standard when vector scale is arbitrary or uninformative, or when we want length invariance. For representations that are already unit-norm (e.g., normalized embeddings, unit-norm features) the two scores coincide. This example connects these choices to linear classifiers, attention mechanisms, and nearest-neighbor search.
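
A minimal sketch of the magnitude effect, assuming hypothetical term-count vectors over a three-word vocabulary (the names `query`, `short_doc`, `long_doc`, and `cosine` are illustrative and not part of the example). Repeating a document's text doubles its counts, which doubles its dot score but leaves its cosine score unchanged.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two nonzero 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical term-count vectors over a 3-word vocabulary.
query = np.array([1.0, 1.0, 0.0])
short_doc = np.array([2.0, 1.0, 0.0])
long_doc = 2.0 * short_doc            # same text repeated twice

dot_short = float(query @ short_doc)  # 3.0
dot_long = float(query @ long_doc)    # 6.0
cos_short = cosine(query, short_doc)
cos_long = cosine(query, long_doc)

print("dot:", dot_short, dot_long)    # dot doubles with document length
print("cos:", cos_short, cos_long)    # cosine is unchanged
```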

Purpose

Build a reliable, ML-first intuition for similarity and scaling:

  • Understand when dot product (alignment × magnitude) is desired.
  • Understand when cosine similarity (pure angle) is preferred.
  • See how L2-normalization changes rankings and scores.
  • Maintain shape discipline: $X\in\mathbb{R}^{n\times d}$ and $q,x\in\mathbb{R}^{d}$, where $x$ denotes a document vector (a row of $X$).
Problem

Rank document vectors by dot product and cosine similarity against a query and compare rankings.

Solution (Math)

Dot product $q^\top x$ depends on both alignment and magnitude. Cosine similarity

\[ \frac{\langle q,x\rangle}{\|q\|\,\|x\|}=\frac{q^\top x}{\|q\|\,\|x\|} \]

depends only on angle. Thus rankings can differ when norms vary.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^\top y$.
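
As a short worked check of the scale-invariance claim (a sketch, with $\alpha,\beta>0$ introduced here as arbitrary positive scale factors): for a query $q$ and any nonzero vector $x\in\mathbb{R}^d$, rescaling multiplies the dot product by $\alpha\beta$ but cancels in the cosine,

\[ (\alpha q)^\top(\beta x)=\alpha\beta\,q^\top x, \qquad \frac{(\alpha q)^\top(\beta x)}{\|\alpha q\|\,\|\beta x\|}=\frac{\alpha\beta\,q^\top x}{\alpha\|q\|\,\beta\|x\|}=\frac{q^\top x}{\|q\|\,\|x\|}. \]

So two vectors with different norms can swap places in a dot-product ranking while keeping their order under cosine.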
Solution (Python)

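import numpy as np

q = np.array([1.0, 0.5])                                 # query, shape (2,)
docs = np.array([[10.0, 0.0], [0.5, 0.5], [-1.0, 0.0]])  # documents, shape (3, 2)

dot = docs @ q                                           # dot scores, shape (3,)
# divide each dot score by (document norm * query norm) to get cosine
cos = dot / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))

# argsort of the negated scores orders documents from best to worst
print("dot:", dot, "rank:", np.argsort(-dot))  # rank: [0 1 2]
print("cos:", cos, "rank:", np.argsort(-cos))  # rank: [1 0 2]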
Code Introduction

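We create a 2D query q and three document vectors docs. We compute dot scores via docs @ q, cosine scores by dividing each dot score by the product of the document norm and the query norm, and then rank by descending score using np.argsort(-scores). Shapes: docs is $(3\times 2)$, q is $(2)$, scores are $(3)$. With these vectors the dot ranking is [0, 1, 2] while the cosine ranking is [1, 0, 2]: document 0 wins on dot because of its large norm, but document 1 is better aligned with q.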
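
For retrieval over many documents, a full sort is often unnecessary because only the best few results are returned. The sketch below shows one possible approach, assuming a hypothetical helper `top_k` (not part of the example) built on `np.argpartition`; it reuses the same query and documents.

```python
import numpy as np

def top_k(scores, k):
    """Indices of the k largest scores, ordered from best to worst."""
    k = min(k, scores.shape[0])
    # argpartition puts the k largest scores (in no particular order) first
    candidates = np.argpartition(-scores, k - 1)[:k]
    # sort only those k candidates by descending score
    return candidates[np.argsort(-scores[candidates])]

q = np.array([1.0, 0.5])
docs = np.array([[10.0, 0.0], [0.5, 0.5], [-1.0, 0.0]])

dot = docs @ q
cos = dot / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))

top2_dot = top_k(dot, 2)
top2_cos = top_k(cos, 2)
print("top-2 by dot:", top2_dot)
print("top-2 by cos:", top2_cos)
```

The partition step is linear in the number of documents, so only the k retained candidates pay the cost of sorting.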

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n \times d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $\varepsilon$ for divisions/logs and $\varepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.
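
Several of the notes above (explicit axes, intended broadcasts, a stable softmax, shape-preserving masks, row-sum checks) can be sketched together for an attention-like score matrix. This is an illustrative sketch only: `masked_softmax` is a hypothetical helper, and it assumes every row keeps at least one unmasked key.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Row-wise softmax over keys; entries where mask is False get weight 0.

    scores: (n_queries, n_keys) float array
    mask:   boolean array broadcastable to scores; assumes each row has
            at least one True entry
    """
    masked = np.where(mask, scores, -np.inf)
    # subtract the row max (the same shift used in log-sum-exp) to avoid overflow
    shifted = masked - masked.max(axis=1, keepdims=True)   # (n, 1) broadcasts to (n, k)
    weights = np.exp(shifted)                              # exp(-inf) = 0 for masked keys
    return weights / weights.sum(axis=1, keepdims=True)

scores = np.array([[1000.0, 1001.0, 999.0],
                   [0.0, 1.0, 2.0]])
mask = np.array([[True, True, False],
                 [True, True, True]])

attn = masked_softmax(scores, mask)
print(attn.shape)          # same shape as scores
print(attn.sum(axis=1))    # each row sums to ~1
```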

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n \times d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $\varepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Dense $n\times n$ matrix products and factorizations cost about $O(n^3)$; triangular solves and matrix-vector products cost $O(n^2)$. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), positive semidefiniteness (x.T @ A @ x ≥ 0, with strict > 0 for nonzero x if A is positive definite), and residual norms within tolerance (~1e-12 for float64).
  • Compute dot scores in batch as $s= Xq$ with $X\in\mathbb{R}^{n\times d}$ and $q\in\mathbb{R}^d$.
  • Compute row norms once: $r=\|X\|_{2,\text{row}}\in\mathbb{R}^n$ via np.linalg.norm(X, axis=1).
  • Avoid division by zero: add a small $\varepsilon$ to $r$ or prefilter zero rows.
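  • Cosine scores: $c= s/(\|q\|\, r)$ using broadcasting; or normalize rows $\hat{X}=X/\|X\|_{2,\text{row}}$ and the query $\hat{q}=q/\|q\|$ and compute $\hat{X}\hat{q}$. Computing $\hat{X}q$ alone gives $\|q\|\,c$, which has the same ranking.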
  • Shapes: verify $X:(n\times d)$, $q:(d)$, $s,c:(n)$.
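
The last five bullets can be combined into one batched routine. The following is a minimal sketch, not a definitive implementation: `cosine_scores` and its `eps` default are assumptions introduced here, and it clamps the denominator with `np.maximum` rather than adding $\varepsilon$ to $r$, so that nonzero rows are not perturbed.

```python
import numpy as np

def cosine_scores(X, q, eps=1e-12):
    """Cosine similarity between each row of X (n, d) and q (d,); returns (n,)."""
    X = np.asarray(X, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    assert X.ndim == 2 and q.shape == (X.shape[1],), "expected X:(n,d), q:(d,)"

    s = X @ q                                        # dot scores, shape (n,)
    r = np.linalg.norm(X, axis=1)                    # row norms, computed once, shape (n,)
    denom = np.maximum(r * np.linalg.norm(q), eps)   # guard against zero norms
    return s / denom

X = np.array([[10.0, 0.0], [0.5, 0.5], [0.0, 0.0]])  # last row is all zeros
q = np.array([1.0, 0.5])
c = cosine_scores(X, q)
print(c.shape, c)
```

A zero row has dot score 0, so it receives cosine score 0 instead of producing NaN; prefiltering zero rows is an equally reasonable alternative.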
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Dot vs cosine can yield different rankings when norms vary.
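  • L2-normalizing both the rows and the query converts dot into cosine; normalizing the rows alone already reproduces the cosine ranking, because $\|q\|$ is a shared positive factor.
  • Efficient batching: compute dot with $Xq$; cosine with row-normalized $\hat{X}$ and unit-norm $\hat{q}$.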
  • Practical guidance: choose cosine for scale invariance; choose dot when magnitude carries meaning.
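
A quick numerical check of the normalization bullets, reusing the example's query and documents. This is a sketch that assumes no document row is all zeros; the names `X_hat`, `q_hat`, `both`, and `rows_only` are introduced here for illustration.

```python
import numpy as np

X = np.array([[10.0, 0.0], [0.5, 0.5], [-1.0, 0.0]])
q = np.array([1.0, 0.5])

cos = (X @ q) / (np.linalg.norm(X, axis=1) * np.linalg.norm(q))

# keepdims=True gives shape (n, 1), which broadcasts against X of shape (n, d)
X_hat = X / np.linalg.norm(X, axis=1, keepdims=True)
q_hat = q / np.linalg.norm(q)

both = X_hat @ q_hat     # rows and query normalized: equals cosine
rows_only = X_hat @ q    # rows normalized only: cosine times ||q||

print(np.allclose(both, cos))
print(np.allclose(rows_only, cos * np.linalg.norm(q)))
print(np.argsort(-rows_only), np.argsort(-cos))   # identical rankings
```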
Notes

Part 1: Core setup - Dot products, norms, and cosine similarity (retrieval)

State the objects, shapes, and target question for Dot products, norms, and cosine similarity (retrieval). Name the data matrices or vectors, specify their dimensions, and clarify the transformation or comparison this example develops.

Part 2: Geometry and algebraic insight - Dot products, norms, and cosine similarity (retrieval)

Describe the geometric picture (subspaces, projections, bases, or decompositions) and the algebraic identities that make Dot products, norms, and cosine similarity (retrieval) work. Highlight how these structures constrain solutions and connect to earlier linear algebra tools.

Part 3: Numerics and ML practice - Dot products, norms, and cosine similarity (retrieval)

Give the computational recipe for Dot products, norms, and cosine similarity (retrieval), note stability or conditioning checks, and tie to an ML use case. Mention parameter choices, common pitfalls, and quick sanity checks such as shape validation or reconstruction error.

  • Shape discipline: $X\in\mathbb{R}^{n\times d}$, $q\in\mathbb{R}^d$, outputs in $\mathbb{R}^n$.
  • Numerical note: prefer stable primitives and avoid explicit inverses; guard against zero norms.
  • Interpretation: use cosine when you want scale invariance; use dot when magnitude encodes importance.
History and Applications

Inner product and norms trace to Hilbert’s work on infinite-dimensional function spaces (early 20th century). The abstract notions of inner product space and normed space generalize the Euclidean dot product and length; metric spaces generalize distance further, without requiring an inner product.

Cosine similarity in information retrieval: While dot products measure signed alignment weighted by magnitude, cosine similarity (normalized dot product) gained prominence in IR and text search (Salton & McGill, 1983; Baeza-Yates & Ribeiro-Neto, 1999) because it is scale-invariant. Two documents with the same relative term frequencies but different lengths get the same score. This became the foundation for vector space models in search and recommendation systems.

Modern embeddings and inner products: With word2vec (Mikolov et al., 2013), fastText, and BERT, inner products became a common similarity measure for learned embeddings. Modern large-scale retrieval (e.g., dense passage retrieval, approximate nearest neighbors) relies on scoring a query against millions of candidate vectors. The dot product’s computational simplicity (one matrix-vector multiply) and its natural alignment with neural network training (gradients of dot products are simple) make it a widely used choice in industry, even though cosine similarity often yields more interpretable scores, bounded in $[-1, 1]$.

Connection to Broader Examples
  • Retrieval: cosine is the default for unit-normalized embeddings; dot is used when magnitude (e.g., frequency or confidence) should affect ranking.
  • Linear classifiers: logits are dot products $w^\top x$; length scaling changes margins unless inputs are normalized.
  • Attention: scores are dot products; softmax temperature controls how sharply the top-scoring keys dominate, and key norms can change which keys score highest.
  • Geometry: cosine measures angle (both vectors projected onto the unit sphere); dot equals the signed projection length of the document onto $q$, scaled by $\|q\|$.
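
The geometry and linear-classifier bullets can be checked numerically. This is an illustrative sketch: the vectors, the scale factor `alpha`, and the reuse of `q` as a weight vector `w` are assumptions made here for the demonstration.

```python
import numpy as np

q = np.array([1.0, 0.5])
x = np.array([0.5, 0.5])

q_norm = np.linalg.norm(q)
x_norm = np.linalg.norm(x)

dot = q @ x
cos = dot / (q_norm * x_norm)

# Orthogonal projection of x onto the line spanned by q, and its signed length
proj = (dot / (q @ q)) * q
proj_len = np.sign(dot) * np.linalg.norm(proj)

print(np.isclose(dot, q_norm * proj_len))    # dot = ||q|| * projection length
print(np.isclose(proj_len, x_norm * cos))    # projection length = ||x|| * cos

# Linear-classifier view: scaling the input by alpha scales the logit by alpha,
# while the cosine between w and x is unchanged.
w, alpha = q, 3.0
logit = w @ x
logit_scaled = w @ (alpha * x)
cos_scaled = logit_scaled / (np.linalg.norm(w) * np.linalg.norm(alpha * x))
print(logit, logit_scaled, np.isclose(cos_scaled, cos))
```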

Comments