Example number
69
Example slug
example_69_similarity_geometry_dot_vs_cosine_in_embedding_retrieval
Background

Inner products and norms define similarity geometry in ML. Dot products measure raw alignment (used in attention, linear layers, and unnormalized retrieval), while cosine similarity normalizes by magnitude to focus on direction. In classical information retrieval, cosine similarity over TF-IDF vectors is standard because document length shouldn’t bias relevance (BM25 addresses the same concern with its own length normalization). In embedding-based search (word2vec, BERT, CLIP), vectors are often L2-normalized so that dot product equals cosine, combining efficiency with magnitude invariance. Understanding when magnitude matters versus when it’s noise is essential for building effective retrieval and similarity systems.

Purpose

Build intuition for two fundamental similarity metrics in information retrieval and ML: dot-product similarity (magnitude-sensitive) and cosine similarity (magnitude-invariant). You should come away understanding when rankings differ, why normalization matters, and how to choose the right metric for embeddings, retrieval, and attention. This distinction is critical for search systems, recommendation, and any task where vector similarity drives decisions.

Problem

Rank three document embeddings against a query using dot-product similarity and cosine similarity. Explain (in math) why rankings can differ when magnitudes differ.

Solution (Math)

Dot-product similarity is $s_i = q^\top d_i$, mixing magnitude and directional alignment. Cosine similarity normalizes by norms:

\[ \cos\theta_i = \frac{q^\top d_i}{\|q\|_2 \|d_i\|_2}, \]

where $\theta_i$ is the angle between $q$ and $d_i$. When document norms $\|d_i\|$ vary widely, dot-product rankings can favor large-norm vectors even with worse alignment; cosine similarity removes that magnitude bias and measures pure directional similarity.
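
To make the magnitude bias explicit, the dot-product score can be factored using the cosine definition above. The following is a short worked step, assuming all norms are nonzero and both cosines are positive:

```latex
\[ s_i = q^\top d_i = \|q\|_2 \, \|d_i\|_2 \cos\theta_i . \]

\[ \cos\theta_i > \cos\theta_j > 0 \quad\text{but}\quad s_j > s_i
   \iff \frac{\|d_j\|_2}{\|d_i\|_2} > \frac{\cos\theta_i}{\cos\theta_j} . \]
```

Because $\|q\|_2$ is the same for every document, ranking by $s_i$ amounts to ranking by $\|d_i\|_2 \cos\theta_i$, while cosine ranks by $\cos\theta_i$ alone. The two orderings therefore agree whenever all document norms are equal, and a pair swaps places only when the norm ratio exceeds the cosine ratio.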

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is the Euclidean norm; $\langle x,y\rangle=x^\top y$.
Solution (Python)

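import numpy as np

# Query vector q in R^2 and three document vectors (rows of docs, shape (3, 2)).
q = np.array([1., 0.5])
docs = np.array([
    [10., 0.],    # large norm, partially aligned with q
    [ 0.5, 0.5],  # small norm, closest in direction to q
    [-1., 0.],    # points away from q
])

# Dot-product scores: one matrix-vector product, shape (3,).
dot = docs @ q
# Cosine scores: divide by row-wise document norms times the query norm.
cos = dot / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))

# argsort of the negated scores lists indices in descending-score order.
print("dot:", dot, "rank:", np.argsort(-dot))
print("cos:", cos, "rank:", np.argsort(-cos))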
Code Introduction

This snippet compares dot-product similarity versus cosine similarity for ranking documents against a query—a fundamental distinction in information retrieval and ML similarity search. The query $q \in \mathbb{R}^2$ is matched against three document vectors in docs $\in \mathbb{R}^{3 \times 2}$ (rows are documents, columns are features/terms).

The raw dot product dot = docs @ q computes $\langle d_i, q \rangle$ for each document row $d_i$, measuring unnormalized similarity: documents with larger magnitude in the query direction score higher. For the given data with $d_0 = [10, 0]^\top$, $d_1 = [0.5, 0.5]^\top$, $d_2 = [-1, 0]^\top$, and $q = [1, 0.5]^\top$, the dot products are $10$, $0.75$, and $-1$ respectively. Ranking by descending dot product yields: doc 0, doc 1, doc 2. Document 0 dominates because its norm of 10 outweighs its weaker directional alignment with $q$.

Cosine similarity normalizes by the Euclidean norms: $\text{cosine}(d_i, q) = \frac{\langle d_i, q \rangle}{\|d_i\|_2 \|q\|_2}$. This measures directional alignment independent of magnitude: vectors pointing in the same direction have cosine 1, and orthogonal vectors have cosine 0. With norms $\|d_0\|_2 = 10$, $\|d_1\|_2 \approx 0.707$, $\|d_2\|_2 = 1$, $\|q\|_2 \approx 1.118$, the cosine values become approximately $0.894$, $0.949$, and $-0.894$. Ranking by cosine yields: doc 1, doc 0, doc 2. Document 1 wins because its direction is closest to that of $q$ (about $18^\circ$ away, versus about $27^\circ$ for document 0).

Shape discipline: dot $\in \mathbb{R}^n$ from docs @ q; np.linalg.norm(docs, axis=1) computes row-wise norms (shape $(n,)$), broadcasting correctly in the division. Gotchas: zero-norm vectors cause nan; filter or floor norms. Sign matters: negative cosine indicates anti-alignment. For large corpora, use approximate nearest neighbor methods instead of exhaustive search.
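
The zero-norm gotcha mentioned above can be handled by flooring the norms before dividing. The following is a minimal sketch, not a definitive implementation: the helper name safe_cosine and the default eps=1e-12 are illustrative choices, and the third document is replaced by a zero vector purely to trigger the edge case.

```python
import numpy as np

def safe_cosine(docs, q, eps=1e-12):
    """Cosine similarity of each row of docs with q, flooring norms at eps.

    safe_cosine and eps are hypothetical names chosen for this sketch.
    """
    docs = np.asarray(docs, dtype=float)
    q = np.asarray(q, dtype=float)
    dot = docs @ q                                # shape (n,)
    doc_norms = np.linalg.norm(docs, axis=1)      # shape (n,)
    q_norm = np.linalg.norm(q)                    # scalar
    # Floor each norm so a zero vector gives a tiny denominator, not 0.
    denom = np.maximum(doc_norms, eps) * max(q_norm, eps)
    return dot / denom

q = np.array([1.0, 0.5])
docs = np.array([[10.0, 0.0], [0.5, 0.5], [0.0, 0.0]])  # last row is a zero vector
cos = safe_cosine(docs, q)
print(cos)  # the zero row scores 0.0 instead of nan
```

Flooring keeps the output finite and assigns zero vectors a neutral score of 0; filtering such rows out before ranking is an equally reasonable choice if a score of 0 would be misleading.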

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n \times d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $\varepsilon$ for divisions/logs and $\varepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.
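
Several of the notes above (explicit axes, max-shifted softmax, shape-preserving masks, row-sum checks) can be seen together in one attention-like sketch over dot-product scores. The names masked_softmax, Q, and K are assumptions made for illustration, the data is random, and the function assumes every row keeps at least one unmasked key.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Row-wise softmax over keys; mask is True where a key is allowed.

    Hypothetical helper: assumes each row has at least one allowed key.
    """
    scores = np.where(mask, scores, -np.inf)               # same shape as scores
    shifted = scores - scores.max(axis=1, keepdims=True)   # subtract row max for stability
    weights = np.exp(shifted)                              # masked entries become exactly 0
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries, d = 4
K = rng.normal(size=(3, 4))   # 3 keys, d = 4
scores = Q @ K.T              # dot-product scores, shape (2, 3)
mask = np.array([[True, True, False],
                 [True, True, True]])

attn = masked_softmax(scores, mask)
assert attn.shape == scores.shape
assert np.allclose(attn.sum(axis=1), 1.0)
```

Subtracting the row maximum leaves the softmax unchanged mathematically but keeps the exponentials from overflowing, which is the same idea as the log-sum-exp trick.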

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n \times d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $\varepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Dense factorizations and $n \times n$ matrix products cost ~ $O(n^3)$; triangular solves and matrix-vector products cost $O(n^2)$. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x ≥ 0 for all x; strictly > 0 for positive definite), and residual norms within tolerance (~1e-12 for float64).
  • Construct query $q \in \mathbb{R}^d$ and document matrix docs $\in \mathbb{R}^{n\times d}$ (rows are documents).
  • Compute dot products dot = docs @ q; shape $(n,)$.
  • Compute row-wise document norms np.linalg.norm(docs, axis=1); shape $(n,)$.
  • Compute query norm np.linalg.norm(q); scalar.
  • Compute cosine similarities cos = dot / (norm_docs * norm_q); broadcasting yields shape $(n,)$.
  • Rank by descending similarity: np.argsort(-dot) for dot product, np.argsort(-cos) for cosine.
  • Verify cosine values lie in $[-1, 1]$; inspect rankings to see how magnitude differences affect order.
  • Optional: pre-normalize embeddings and verify dot product on normalized vectors equals cosine.
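
The optional last step can be checked directly on the example data. This is a small sketch, assuming nonzero norms; the names docs_unit, q_unit, and dot_unit are illustrative and not part of the original snippet.

```python
import numpy as np

q = np.array([1.0, 0.5])
docs = np.array([[10.0, 0.0], [0.5, 0.5], [-1.0, 0.0]])

# Cosine computed directly from raw vectors.
cos = (docs @ q) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))

# Pre-normalize once; keepdims=True gives shape (n, 1) so the division
# broadcasts across each row of docs.
docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
q_unit = q / np.linalg.norm(q)
dot_unit = docs_unit @ q_unit      # plain dot product on unit vectors

assert np.allclose(dot_unit, cos)            # dot on unit vectors equals cosine
assert np.all(np.abs(cos) <= 1.0 + 1e-12)    # cosine stays in [-1, 1]
print(np.argsort(-dot_unit))  # [1 0 2]
```

Normalizing once up front means every later query needs only a matrix-vector product, which is why pre-normalized embeddings are convenient when the same corpus is searched repeatedly.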
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Dot-product similarity: $s_i = q^\top d_i$ mixes magnitude and alignment; large-norm documents dominate rankings.
  • Cosine similarity: $\cos\theta_i = \frac{q^\top d_i}{\|q\|_2 \|d_i\|_2}$ measures directional alignment only, independent of magnitude.
  • How rankings change: when document norms vary, dot product and cosine can produce different orderings.
  • Normalization as a design choice: pre-normalize embeddings once to make dot product equivalent to cosine.
  • Shape discipline: docs @ q computes all dot products in one matrix-vector multiply; norm broadcasting ensures correct cosine computation.
  • When to use which: dot product for magnitude-weighted scoring; cosine for magnitude-invariant similarity.
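
The same pattern extends from one query to a batch of queries, where a single matrix product yields the full score matrix. The sketch below assumes nonzero norms; the second query row is hypothetical data added only to show the batch shape.

```python
import numpy as np

docs = np.array([[10.0, 0.0], [0.5, 0.5], [-1.0, 0.0]])   # shape (n, d) = (3, 2)
queries = np.array([[1.0, 0.5],                            # the query from the example
                    [0.2, 1.0]])                           # hypothetical second query

# Magnitude-weighted scoring: one matrix product, shape (m, n).
dot_scores = queries @ docs.T

# Magnitude-invariant scoring: normalize rows once, then reuse the same product.
docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
queries_unit = queries / np.linalg.norm(queries, axis=1, keepdims=True)
cos_scores = queries_unit @ docs_unit.T

# Per-query rankings (descending score) along the document axis.
dot_rank = np.argsort(-dot_scores, axis=1)
cos_rank = np.argsort(-cos_scores, axis=1)
print("dot rank per query:\n", dot_rank)
print("cos rank per query:\n", cos_rank)
```

For both queries here the dot product places the large-norm document 0 first, while cosine places document 1 first, so the choice of metric changes the top result.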
Notes
  • Part 1: Dot-Product Scoring. Raw dot product $q^\top d_i$ measures unnormalized similarity; documents with larger magnitude can score higher even when their direction matches the query less well. For the given data, document 0 with $\|d_0\|_2=10$ dominates despite aligning only with the first query feature.
  • Part 2: Cosine Similarity Normalization. Cosine $\cos\theta_i = \frac{q^\top d_i}{\|q\|_2 \|d_i\|_2}$ measures directional alignment independent of magnitude. Document 1, with balanced alignment on both query features, wins under cosine despite having a smaller norm.
  • Part 3: Ranking changes and interpretation. When document norms vary widely, dot product favors large vectors; cosine reorders by direction. In retrieval, cosine often produces semantically better rankings because relevance shouldn’t depend on document length.
  • Why This Matters for ML. Every embedding-based retrieval system (search, recommendation, similarity search) must choose between dot product and cosine. Pre-normalized embeddings (a common practice with Sentence-BERT- and CLIP-style encoders) make dot product equivalent to cosine, combining efficiency with magnitude invariance.
  • ML Examples and Patterns. Fit: rank documents/items by similarity to a query; Project: cosine measures projection fraction onto unit-norm directions; Decompose: correlation matrices use cosine-like normalization; Solve: gradient alignment in multi-task learning uses cosine similarity.
  • Connection to Linear Algebra Theory. The identity $\langle q, d \rangle = \|q\|_2 \|d\|_2 \cos\theta$ ties magnitude to angle, and Cauchy-Schwarz guarantees $|\cos\theta| \le 1$, so the angle is well defined. Cosine isolates the $\cos\theta$ term. In high dimensions, random vectors are nearly orthogonal ($\cos\theta \approx 0$), so even small cosines can indicate meaningful alignment.
  • Numerical and Implementation Notes. Avoid division by zero: filter or floor norms to $\max(\|d\|, \epsilon)$. For large corpora, use approximate nearest neighbor (ANN) methods (HNSW, FAISS) instead of exhaustive search. Pre-normalize embeddings once if using cosine repeatedly.
  • Numerical and Shape Notes. Annotate $q \in \mathbb{R}^d$, docs $\in \mathbb{R}^{n\times d}$, dot $\in \mathbb{R}^n$, cos $\in \mathbb{R}^n$. Verify np.linalg.norm(docs, axis=1) has shape $(n,)$ and broadcasts correctly in division.
  • Pedagogical Significance. This is the canonical minimal example showing that similarity metric choice changes rankings. The pattern generalizes to all embedding-based retrieval, recommendation, and attention systems. Understanding dot vs. cosine is foundational for building effective ML similarity systems.
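
The claim in the notes that random vectors in high dimensions are nearly orthogonal can be checked empirically. The sketch below assumes independent standard Gaussian vectors; the helper name mean_abs_cosine, the seed, and the chosen dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(dim, n_pairs=2000):
    """Average |cosine| between independent Gaussian vector pairs (hypothetical helper)."""
    a = rng.normal(size=(n_pairs, dim))
    b = rng.normal(size=(n_pairs, dim))
    # Row-wise dot products divided by the product of row-wise norms.
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return np.mean(np.abs(cos))

low = mean_abs_cosine(2)       # low-dimensional baseline
high = mean_abs_cosine(1000)   # high-dimensional case
print("d=2:", low, "d=1000:", high)
```

The typical cosine between such vectors shrinks roughly like $1/\sqrt{d}$, which is why a cosine that looks small in absolute terms can still stand well above chance in a high-dimensional embedding space.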
History and Applications

Cosine similarity emerged in information retrieval (Salton & McGill 1983) as a solution to document-length bias in TF-IDF and vector space models. Modern embedding-based retrieval (word2vec 2013, BERT 2018, Sentence-BERT 2019) inherited cosine similarity but often pre-normalizes embeddings so dot product suffices, enabling efficient similarity search via maximum inner product search (MIPS) libraries like FAISS. Today, the dot-vs-cosine distinction underpins search engines, recommendation systems, embedding databases (Pinecone, Weaviate), and attention mechanisms in transformers. Understanding when magnitude matters versus when it’s noise remains foundational for building effective ML similarity systems.

Connection to Broader Examples
  • Links to Chapter 5 (inner products/norms): dot product is the canonical inner product; cosine is normalized.
  • Connects to Chapter 6 (projections): dot product measures projection magnitude; cosine measures projection fraction.
  • Relates to Chapter 16 (matrix products): batch similarity docs @ q is a matrix-vector product; attention uses similar patterns.
  • Bridges to Chapter 11 (PCA): variance (dot-product energy) versus correlation (cosine-like normalization).
  • Complements Chapter 12 (least squares): correlation coefficients in regression are cosine similarities of centered variables.
  • Reinforces shape discipline: annotate $q \in \mathbb{R}^d$, docs $\in \mathbb{R}^{n\times d}$, verify broadcasting.
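
The link to correlation noted above can be verified numerically: the Pearson correlation of two samples equals the cosine similarity of their mean-centered vectors. The sketch uses synthetic data, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(size=50)   # synthetic data, correlated with x by construction

# Center each variable, then take the cosine of the centered vectors.
xc = x - x.mean()
yc = y - y.mean()
cos_centered = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

# Reference value from NumPy's correlation matrix.
pearson = np.corrcoef(x, y)[0, 1]

assert np.isclose(cos_centered, pearson)
print(cos_centered, pearson)
```

Centering is the step that separates correlation from plain cosine; without it, a shared offset in both variables would inflate the similarity.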
