Example number
53
Example slug
example_53_similarity_geometry_dot_vs_cosine_in_embedding_retrieval
Background

Historical context: The dot product (inner product) is a fundamental operation in linear algebra, formalizing “projection” and “work” in physics. Cosine similarity emerged in information retrieval (Salton’s vector space model, 1970s) to rank documents by relevance while normalizing for document length, a critical correction when comparing short abstracts to long articles. The geometric interpretation (cosine of the angle between vectors) made it intuitive: vectors pointing in the same direction are similar, regardless of magnitude. Modern ML inherited this: TF-IDF retrieval, Word2Vec and BERT embedding comparisons, and collaborative filtering commonly use cosine similarity because the vectors involved have heterogeneous norms (trained weights, vocabulary frequency, user activity levels).

Mathematical characterization: The dot product $\langle d, q \rangle = \sum_i d_i q_i$ is bilinear (it scales linearly with each argument). The Cauchy-Schwarz inequality bounds it: $|\langle d, q \rangle| \le \|d\|_2 \|q\|_2$, with equality exactly when one vector is a scalar multiple of the other. For nonzero vectors, cosine similarity normalizes the dot product to $[-1, 1]$: \[ \cos(\theta) = \frac{\langle d, q \rangle}{\|d\|_2 \|q\|_2}, \] where $\theta$ is the angle between $d$ and $q$. Geometrically, $\cos(\theta) = 1$ (parallel), $0$ (orthogonal), $-1$ (anti-parallel). Cosine is scale-invariant: replacing $d$ by $\alpha d$ with $\alpha > 0$ multiplies both the numerator and $\|d\|_2$ by $\alpha$, so $\cos(\theta)$ is unchanged (rescaling doesn’t change the angle).
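The properties above can be checked numerically. The following is a minimal sketch, assuming random test vectors and a helper named cosine (both are illustrative choices, not part of the original example):

```python
import numpy as np

rng = np.random.default_rng(0)
d = rng.normal(size=5)
q = rng.normal(size=5)

def cosine(u, v):
    """Cosine of the angle between two nonzero vectors (hypothetical helper)."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Cauchy-Schwarz: |<d, q>| <= ||d|| ||q||, so the cosine lies in [-1, 1].
bound = np.linalg.norm(d) * np.linalg.norm(q)
print(abs(d @ q) <= bound, -1.0 <= cosine(d, q) <= 1.0)

# Equality case: a positive multiple of q attains the bound (cosine 1).
print(np.isclose(cosine(3.0 * q, q), 1.0))

# Bilinearity vs. scale invariance for alpha > 0.
alpha = 10.0
print(np.isclose((alpha * d) @ q, alpha * (d @ q)))    # dot product scales by alpha
print(np.isclose(cosine(alpha * d, q), cosine(d, q)))  # cosine is unchanged
```

Every line should print True; the last two lines are the bilinearity and scale-invariance statements side by side.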

Prevalence in ML: Search engines, recommenders, semantic similarity, clustering, and contrastive learning commonly use cosine similarity to avoid magnitude bias. Attention mechanisms in transformers use scaled dot products ($QK^\top / \sqrt{d_k}$), but some variants normalize queries and keys to unit norm, which turns each dot product into a cosine. Understanding when each metric is appropriate is essential for retrieval, ranking, and embedding-based applications.
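As a rough illustration of the attention remark, the sketch below compares scaled dot-product scores with scores computed after unit-normalizing the rows of $Q$ and $K$. The matrices are random placeholders and the names Qn, Kn are assumptions for illustration; real attention variants typically add a learned scale or temperature on top of the cosine.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 4
Q = rng.normal(size=(2, d_k))   # 2 queries (rows)
K = rng.normal(size=(3, d_k))   # 3 keys (rows)

# Standard scaled dot-product scores: sensitive to the norms of the rows of Q and K.
scaled_dot = Q @ K.T / np.sqrt(d_k)

# Unit-normalize rows first; the same matrix product now yields cosines in [-1, 1].
Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
Kn = K / np.linalg.norm(K, axis=1, keepdims=True)
cos_scores = Qn @ Kn.T

print(scaled_dot.shape, cos_scores.shape)
print(np.all(np.abs(cos_scores) <= 1.0 + 1e-12))
```

Rescaling the keys (for example, K * 5) multiplies scaled_dot by 5 but leaves cos_scores unchanged, which is the magnitude sensitivity the paragraph describes.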

Purpose

Show how dot product and cosine similarity produce different document rankings when vector magnitudes vary, clarifying when to use which metric in ML applications. Demonstrate that dot product $\langle d, q \rangle$ mixes magnitude and alignment (favoring high-norm vectors), while cosine similarity $\cos(\theta) = \langle d, q \rangle / (\|d\| \|q\|)$ isolates angular alignment (scale-invariant). Build intuition for why information retrieval, recommendation systems, and embedding-based search almost always use cosine similarity—raw dot products bias toward verbosity (long documents, popular items, high-frequency words) rather than topical relevance. Emphasize that understanding this distinction is critical for debugging ranking pathologies and choosing appropriate similarity metrics in production ML systems.

Problem

Rank three document embeddings against a query using dot-product similarity and cosine similarity. Explain (in math) why rankings can differ when magnitudes differ.

Solution (Math)

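Dot-product similarity is $s_i=q^Td_i$, which mixes magnitude and alignment. Cosine similarity normalizes magnitudes:

\[ \cos\theta_i=\frac{q^Td_i}{\|q\|_2\|d_i\|_2}. \]

Rearranging gives $s_i=\|q\|_2\,\|d_i\|_2\cos\theta_i$. Since $\|q\|_2$ is the same for every document, dot-product ranking orders documents by $\|d_i\|_2\cos\theta_i$, while cosine ranking orders them by $\cos\theta_i$ alone. When $\|d_i\|_2$ varies widely, dot-product rankings can favor large-norm vectors even if the angle is worse; cosine removes that effect.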

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

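import numpy as np

# Query and three document embeddings (one document per row).
q = np.array([1., 0.5])
docs = np.array([
    [10., 0.],
    [ 0.5, 0.5],
    [-1., 0.],
])

# Dot-product scores: one per document, shape (3,).
dot = docs @ q
# Cosine scores: divide by the row-wise document norms and the query norm.
cos = dot / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))

# argsort on the negated scores gives indices in descending-score order.
print("dot:", dot, "rank:", np.argsort(-dot))
print("cos:", cos, "rank:", np.argsort(-cos))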
Code Introduction

This code demonstrates the distinction between dot product and cosine similarity for ranking documents by relevance to a query—a fundamental pattern in information retrieval and recommendation systems. It reveals why normalized similarity (cosine) often outperforms raw inner products when vectors have heterogeneous magnitudes.

Part 1: Query-Document Scoring via Dot Product. The code defines a query vector $q = [1, 0.5]^\top \in \mathbb{R}^2$ and three document vectors stored as rows of a matrix. The dot product scores are computed as $\text{dot} = \text{docs} @ q \in \mathbb{R}^3$ (matrix-vector product yielding one score per document). The dot product $\langle d_i, q \rangle = \sum_j d_{ij} q_j$ measures unnormalized alignment: it grows when vectors point in similar directions and have large magnitudes. Document 0 has $d_0 = [10, 0]$, yielding $\langle d_0, q \rangle = 10$—a high score driven by large magnitude, even though it ignores the second query dimension. Document 1 has $d_1 = [0.5, 0.5]$, yielding $\langle d_1, q \rangle = 0.75$—much smaller despite aligning with both query dimensions. Document 2 has $d_2 = [-1, 0]$, giving $\langle d_2, q \rangle = -1$ (anti-aligned). The ranking np.argsort(-dot) sorts by descending score.

Part 2: Cosine Similarity Normalization. The cosine similarity scores normalize by vector magnitudes: $\cos(\theta_i) = \langle d_i, q \rangle / (\|d_i\|_2 \|q\|_2) \in [-1, 1]$. The code computes this as cos = dot / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q)), where axis=1 computes row-wise norms (one per document). Cosine similarity measures the angle between vectors, independent of magnitude: a cosine of 1 means parallel (0°), 0 means orthogonal (90°), -1 means anti-parallel (180°). Unlike dot products, cosine similarity treats $d$ and $10d$ identically, since both make the same angle with $q$. The rankings differ here: dot product puts document 0 first (score 10, driven by its norm of 10), while cosine puts document 1 first ($\cos\theta_1 \approx 0.949$ versus $\cos\theta_0 \approx 0.894$), because $[0.5, 0.5]$ is better aligned with $q$ than $[10, 0]$ is. Document 2 is last under both metrics ($\cos\theta_2 \approx -0.894$). The ranking np.argsort(-cos) sorts by descending cosine similarity.
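To make the scale-invariance point concrete, the sketch below shrinks document 0 by a factor of 100 without changing its direction and recomputes both rankings. The helper name scores and the factor 0.01 are illustrative assumptions.

```python
import numpy as np

q = np.array([1.0, 0.5])
docs = np.array([[10.0, 0.0], [0.5, 0.5], [-1.0, 0.0]])

def scores(D, q):
    """Return (dot, cosine) scores of each row of D against q (hypothetical helper)."""
    dot = D @ q
    cos = dot / (np.linalg.norm(D, axis=1) * np.linalg.norm(q))
    return dot, cos

dot, cos = scores(docs, q)

shrunk = docs.copy()
shrunk[0] *= 0.01   # same direction as document 0, much smaller norm
dot_s, cos_s = scores(shrunk, q)

print("dot rank before/after:", np.argsort(-dot), np.argsort(-dot_s))
print("cos rank before/after:", np.argsort(-cos), np.argsort(-cos_s))
```

The dot-product ranking changes from [0 1 2] to [1 0 2] once document 0 loses its large norm, while the cosine ranking is [1 0 2] both times.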

Why This Matters for ML: Information retrieval uses cosine to avoid bias toward long documents (more words → larger term-frequency vectors). Recommendation systems use cosine to focus on rating patterns (correlation), not absolute counts. Word embeddings (Word2Vec, BERT) use cosine for semantic relatedness because embedding norms vary arbitrarily. Understanding this distinction is critical for search engines, recommender systems, and any ranking task over high-dimensional embeddings. Shape discipline: docs @ q is $(3, 2) \times (2,) \to (3,)$; np.linalg.norm(docs, axis=1) yields shape (3,) (one norm per document); element-wise division produces cosine scores (3,). Gotcha: axis=1 is essential for row-wise norms; omitting axis computes Frobenius norm (scalar, incorrect).
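The axis gotcha is easy to miss because the buggy version still runs: a scalar Frobenius norm broadcasts against the score vector without error. A small sketch, reusing the example's data (the names cos_right and cos_wrong are illustrative):

```python
import numpy as np

q = np.array([1.0, 0.5])
docs = np.array([[10.0, 0.0], [0.5, 0.5], [-1.0, 0.0]])
dot = docs @ q

row_norms = np.linalg.norm(docs, axis=1)   # shape (3,): one norm per document
fro_norm = np.linalg.norm(docs)            # 0-d scalar: Frobenius norm of the whole matrix

cos_right = dot / (row_norms * np.linalg.norm(q))
cos_wrong = dot / (fro_norm * np.linalg.norm(q))   # broadcasts without error

print("row_norms shape:", row_norms.shape, "| fro_norm ndim:", np.ndim(fro_norm))
print("correct rank:", np.argsort(-cos_right))
print("buggy rank:  ", np.argsort(-cos_wrong))
```

Dividing every score by the same scalar cannot change their order, so the buggy "cosine" ranking silently equals the dot-product ranking ([0 1 2]) instead of the correct [1 0 2].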

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n \times d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $\varepsilon$ for divisions/logs and $\varepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.
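Several of the notes above (explicit axis, masking that preserves shape, max-subtraction for stability, row sums near 1) can be combined in one short sketch. The function name masked_softmax and the test scores are assumptions for illustration, and the sketch assumes every row keeps at least one unmasked entry.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Row-wise softmax over the last axis; entries where mask is False get weight 0.

    Assumes each row has at least one True in mask.
    """
    masked = np.where(mask, scores, -np.inf)
    # Subtract the row maximum before exponentiating (the log-sum-exp trick).
    shifted = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.array([[1000.0, 1001.0, 999.0],
                   [0.5, -2.0, 3.0]])
mask = np.array([[True, True, False],
                 [True, True, True]])

probs = masked_softmax(scores, mask)
print(probs.shape, probs.sum(axis=1))
```

The output keeps the input shape (2, 3), each row sums to 1, and the masked entry is exactly 0 even though the raw scores are large enough to overflow a naive np.exp.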

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n \times d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $\varepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Dense $n \times n$ matrix products and factorizations cost about $O(n^3)$; triangular solves and matrix-vector products cost $O(n^2)$. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), positive semidefiniteness (x.T @ A @ x ≥ 0 for test vectors x), and residual norms within tolerance (around 1e-12 for well-conditioned float64 problems).
  1. Define query vector: $q = [1, 0.5]^\top \in \mathbb{R}^2$ represents a search query or user preference.
  2. Define document matrix: $\text{docs} \in \mathbb{R}^{3 \times 2}$ with 3 documents (rows), each 2-dimensional.
  3. Compute dot products: dot = docs @ q yields one score per document via matrix-vector product.
  4. Compute document norms: np.linalg.norm(docs, axis=1) computes $\|d_i\|_2$ for each row (document).
  5. Compute query norm: np.linalg.norm(q) yields scalar $\|q\|_2$.
  6. Normalize to cosine similarity: cos = dot / (norms_docs * norm_q) via element-wise division.
  7. Rank by dot product: np.argsort(-dot) sorts descending (negative for reverse order).
  8. Rank by cosine similarity: np.argsort(-cos) sorts descending; compare to dot product ranking.
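The eight steps above can be collected into a single function. This is a sketch under the assumption that all document rows and the query are nonzero; the name rank_documents is hypothetical.

```python
import numpy as np

def rank_documents(docs, q):
    """Return (dot, cos, dot_rank, cos_rank) for the rows of docs scored against q."""
    dot = docs @ q                               # step 3: one score per document
    norms_docs = np.linalg.norm(docs, axis=1)    # step 4: row-wise document norms
    norm_q = np.linalg.norm(q)                   # step 5: scalar query norm
    cos = dot / (norms_docs * norm_q)            # step 6: element-wise normalization
    return dot, cos, np.argsort(-dot), np.argsort(-cos)   # steps 7 and 8

q = np.array([1.0, 0.5])                                   # step 1
docs = np.array([[10.0, 0.0], [0.5, 0.5], [-1.0, 0.0]])    # step 2

dot, cos, dot_rank, cos_rank = rank_documents(docs, q)
print("dot rank:", dot_rank)
print("cos rank:", cos_rank)
```

For the example data this gives dot rank [0 1 2] and cos rank [1 0 2].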
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Dot product mixes magnitude and direction: $\langle d, q \rangle$ grows with both alignment and vector norms.
  • Cosine similarity isolates angle: $\cos(\theta) = \langle d, q \rangle / (\|d\| \|q\|)$ measures directional alignment only.
  • Rankings can differ dramatically: High-magnitude vectors dominate dot product rankings even if poorly aligned.
  • Normalization corrects length bias: Cosine similarity treats short and long documents (or embeddings) comparably.
  • Geometric interpretation: Cosine measures angle; dot product measures projection magnitude.
  • Scale invariance: Cosine is unchanged by rescaling vectors; dot product is not.
  • Bounded range: Cosine lies in $[-1, 1]$; dot product is unbounded.
  • ML applications: Information retrieval, collaborative filtering, semantic search, clustering all prefer cosine over dot product.
Notes

Part 1: Query-Document Scoring via Dot Product

  • Dot product: $\langle d_i, q \rangle = \sum_j d_{ij} q_j$ measures unnormalized alignment.
  • High scores when vectors point in similar directions and have large magnitudes.
  • Document 0: $\langle [10, 0], [1, 0.5] \rangle = 10 \cdot 1 + 0 \cdot 0.5 = 10$ (high magnitude dominates).
  • Document 1: $\langle [0.5, 0.5], [1, 0.5] \rangle = 0.5 \cdot 1 + 0.5 \cdot 0.5 = 0.75$ (smaller despite aligning with both dimensions).
  • Document 2: $\langle [-1, 0], [1, 0.5] \rangle = -1 \cdot 1 + 0 \cdot 0.5 = -1$ (anti-aligned).

Part 2: Cosine Similarity Normalization

  • Cosine: $\cos(\theta_i) = \langle d_i, q \rangle / (\|d_i\|_2 \|q\|_2) \in [-1, 1]$.
  • Measures angle between vectors, independent of magnitude.
  • $\|q\|_2 = \sqrt{1^2 + 0.5^2} \approx 1.118$; document norms vary widely.
  • After normalization, rankings may change: documents with better directional alignment rank higher even if magnitudes differ.
  • Cosine of 1 = parallel, 0 = orthogonal, -1 = anti-parallel.

Why This Matters for ML

  • Information retrieval: Dot products bias toward long documents; cosine normalizes by document length.
  • Recommendation systems: Cosine focuses on rating patterns (correlation), not absolute counts.
  • Word embeddings: Semantic relatedness uses cosine; embedding norms vary arbitrarily during training.
  • Attention mechanisms: Some variants normalize queries/keys to unit norm (converting dot products to cosine).
  • K-means clustering: Spherical K-means uses cosine similarity; standard K-means uses Euclidean distance (scale-sensitive).
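The long-document bias in the information-retrieval point can be seen with made-up term-count vectors over a three-word vocabulary. The counts below are invented for illustration only.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two nonzero vectors (hypothetical helper)."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical term counts: the query mentions terms 0 and 1 once each.
query = np.array([1.0, 1.0, 0.0])
short_doc = np.array([2.0, 2.0, 0.0])    # short, entirely on-topic
long_doc = np.array([30.0, 0.0, 40.0])   # long, mostly about an unrelated term

print("dot   :", short_doc @ query, long_doc @ query)
print("cosine:", cosine(short_doc, query), cosine(long_doc, query))

# Repeating the short document twice doubles its counts and its dot product,
# but leaves its cosine with the query unchanged.
print(np.isclose(cosine(2 * short_doc, query), cosine(short_doc, query)))
```

The raw dot product prefers the long document (30 versus 4), while cosine prefers the short, on-topic one (about 1.0 versus about 0.42).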

ML Examples and Patterns

  • TF-IDF retrieval: Compute cosine between query and document TF-IDF vectors; ranks by topical relevance, not length.
  • Collaborative filtering: User-user similarity via cosine on rating vectors; identifies correlated preferences.
  • BERT semantic similarity: Embed sentences, compute cosine; high cosine ($\sim 0.9$) indicates paraphrase.
  • Contrastive learning (CLIP, SimCLR): Maximize cosine between positive pairs, minimize for negatives.
  • Nearest-neighbor search: With $\ell_2$-normalized embeddings, cosine reduces to dot product (efficient MIPS).
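The nearest-neighbor point, that cosine reduces to a dot product once embeddings are $\ell_2$-normalized, can be checked directly on the example's data. The names docs_unit and q_unit are illustrative; in a real index the normalization would typically be done once when the index is built.

```python
import numpy as np

q = np.array([1.0, 0.5])
docs = np.array([[10.0, 0.0], [0.5, 0.5], [-1.0, 0.0]])

# Normalize once, up front.
docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
q_unit = q / np.linalg.norm(q)

cos_direct = (docs @ q) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
cos_via_dot = docs_unit @ q_unit   # plain inner product on unit vectors

print(np.allclose(cos_direct, cos_via_dot))
print(np.argsort(-cos_via_dot))
```

After normalization, a maximum-inner-product search over docs_unit returns the same ranking as a cosine search over the original vectors.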

Connection to Linear Algebra Theory

  • Inner product is bilinear: $\langle \alpha d, q \rangle = \alpha \langle d, q \rangle$; cosine is scale-invariant.
  • Cauchy-Schwarz: $|\langle d, q \rangle| \le \|d\| \|q\|$; dividing yields $|\cos(\theta)| \le 1$.
  • Projection onto unit sphere: Cosine similarity is dot product after normalizing to unit norm.
  • Orthogonality: $\langle d, q \rangle = 0 \Leftrightarrow \cos(\theta) = 0$ (angle 90°); magnitude-independent.

Numerical and Implementation Notes

  • Axis for document norms: axis=1 computes row-wise norms (one per document); axis=0 would be column-wise (wrong).
  • Zero-magnitude vectors: If $\|d_i\| = 0$, cosine is undefined (division by zero); add epsilon or filter out.
  • Negative cosines: Valid and meaningful (anti-alignment); e.g., antonyms in word embeddings.
  • Ranking via argsort: np.argsort(-scores) for descending (highest first); negative flips order.
  • Broadcasting: dot / (norms_docs * norm_q) broadcasts element-wise: (3,) / ((3,) * scalar).
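One way to handle zero-magnitude vectors is to divide only where the denominator is safely nonzero. The sketch below assigns a score of 0 to zero-norm documents; that convention, the name safe_cosine, and the threshold eps=1e-12 are assumptions for illustration, and filtering such rows out beforehand is an equally reasonable choice.

```python
import numpy as np

def safe_cosine(docs, q, eps=1e-12):
    """Cosine of each row of docs with q; rows with (near-)zero norm score 0."""
    dot = docs @ q
    denom = np.linalg.norm(docs, axis=1) * np.linalg.norm(q)
    # Divide only where the denominator exceeds eps; other entries keep the 0 from `out`.
    return np.divide(dot, denom, out=np.zeros_like(dot), where=denom > eps)

q = np.array([1.0, 0.5])
docs = np.array([[10.0, 0.0], [0.0, 0.0], [-1.0, 0.0]])   # middle row is a zero vector

cos = safe_cosine(docs, q)
print(cos, np.all(np.isfinite(cos)))
```

The zero row yields 0 rather than NaN, and no divide-by-zero warning is raised.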

Numerical and Shape Notes

  • $q \in \mathbb{R}^2$ (shape (2,)), $\text{docs} \in \mathbb{R}^{3 \times 2}$ (shape (3, 2)).
  • docs @ q $\to$ shape (3,) (one score per document).
  • np.linalg.norm(docs, axis=1) $\to$ shape (3,) (one norm per document).
  • np.linalg.norm(q) $\to$ scalar (query norm).
  • cos $\to$ shape (3,) after element-wise division.

Pedagogical Significance

  • Dot product vs. cosine is a core distinction: Most ML applications use cosine to avoid magnitude bias.
  • Geometric intuition: Dot product measures projection; cosine measures angle.
  • Practical impact: Rankings differ dramatically when magnitudes vary; choosing the wrong metric causes poor retrieval.
  • Scale invariance: Cosine treats $d$ and $10d$ identically; dot product does not.
  • Foundation for retrieval: Understanding this pattern is essential for search, recommendation, and embedding-based systems.

History and Applications

The dot product (inner product) dates to 19th-century vector algebra (Grassmann, Gibbs, Hamilton) and became a foundational operation in linear algebra and physics (work, projection, orthogonality). Cosine similarity emerged in information retrieval with Salton’s vector space model (1970s): representing documents as term-frequency vectors and ranking by angle (cosine) rather than raw dot product addressed the “long document problem”, in which verbose documents dominated rankings even when less topically relevant. The geometric interpretation (cosine of angle) made it intuitive and computationally tractable. In the 1990s, collaborative filtering systems (GroupLens, MovieLens) used normalized similarity measures such as cosine and Pearson correlation to find users with correlated preferences, normalizing for activity levels (users who rate many items vs. few). Word embeddings (Word2Vec 2013, GloVe 2014) revived cosine similarity for semantic relatedness: embeddings have heterogeneous norms due to word frequency and training dynamics, so cosine is usually preferred over the raw dot product when comparing meanings. Modern contrastive learning (SimCLR 2020, CLIP 2021) uses temperature-scaled cosine similarity in loss functions to maximize alignment between positive pairs while minimizing it for negatives. Today, cosine similarity is a standard metric in neural search (libraries such as FAISS and Annoy support it through inner product on normalized vectors or an angular metric), recommendation systems, semantic similarity (sentence-BERT), and clustering (spherical K-means). Understanding when to use dot product (attention mechanisms with controlled norms) vs. cosine (retrieval with heterogeneous scales) is a core ML engineering skill.

Connection to Broader Examples
  • Chapter 2 (Span/Linear combinations): Dot product $\langle d, q \rangle$ decomposes $d$ as projection onto $q$ plus orthogonal component.
  • Chapter 4 (Linear maps): Matrix-vector product docs @ q applies linear map row-wise.
  • Chapter 5 (Inner products/norms): Core chapter; this is the canonical application of inner product vs. normalized similarity.
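  • Chapter 6 (Projections): Normalizing a nonzero vector to unit norm maps it to the nearest point on the unit sphere (a radial, nonlinear projection, unlike the linear orthogonal projection onto a subspace).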
  • Chapter 10 (SVD): Singular values reveal document/query subspace structure; cosine measures alignment.
  • Chapter 11 (PCA): PCA maximizes variance (unnormalized); cosine focuses on direction.
  • Chapter 12 (Least-squares): Regression coefficients depend on dot products between features and targets; orthogonal features have zero cosine with one another.
  • Chapter 14 (Conditioning): Ill-conditioned embeddings have extreme magnitude variations; cosine normalizes this.
  • Chapter 16 (Matrix products): Both metrics use matrix-vector products; cosine adds normalization.
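  • Information retrieval: TF-IDF retrieval classically ranks by cosine similarity; BM25 instead builds document-length normalization into its own scoring function; neural search commonly ranks by cosine or by dot product over normalized embeddings.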
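The Chapter 2 connection, splitting $d$ into a component along $q$ plus an orthogonal remainder, can be sketched with document 1 from the example. The names coeff, d_par, and d_perp are illustrative.

```python
import numpy as np

d = np.array([0.5, 0.5])
q = np.array([1.0, 0.5])

# Component of d along q, and the remainder orthogonal to q.
coeff = (d @ q) / (q @ q)
d_par = coeff * q
d_perp = d - d_par

print(np.allclose(d_par + d_perp, d))   # the two pieces reconstruct d
print(np.isclose(d_perp @ q, 0.0))      # the remainder is orthogonal to q

# The signed length of the projection equals ||d|| * cos(theta).
cos_theta = (d @ q) / (np.linalg.norm(d) * np.linalg.norm(q))
print(np.isclose((d @ q) / np.linalg.norm(q), np.linalg.norm(d) * cos_theta))
```

This ties the two metrics together: the dot product is the projection length scaled by $\|q\|_2$, and the cosine is that same projection length divided by $\|d\|_2$.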
