ex1.ai

Example number

Example slug

example_37_dot_products_norms_and_cosine_similarity_retrieval

Background

Inner products quantify alignment; norms quantify length. Dot product scores $q^\top d$ grow with both alignment and vector length. Cosine similarity normalizes by norms, isolating direction: $\cos(q,d) = \frac{q^\top d}{\|q\|\,\|d\|}$. In retrieval, this matters: for L2-normalized embeddings the dot product equals cosine, whereas for unnormalized embeddings the two scores can produce different rankings. Many embedding pipelines (e.g., CLIP and many sentence-transformer setups) L2-normalize outputs so that cosine and dot product coincide; raw word embeddings are typically unnormalized, and other components rely on magnitude (e.g., unnormalized logits in attention). Knowing when to normalize affects recall, precision, and numerical stability.

Purpose

Build intuition for when raw dot products and cosine similarity rank items differently. Dot products mix alignment and magnitude, so long vectors can dominate even if poorly aligned. Cosine similarity removes length, ranking purely by angle. Understanding this distinction is essential for retrieval (information retrieval, vector databases), recommendation, and attention mechanisms where score choice changes what is retrieved or attended.

Problem

Rank document vectors by dot product and cosine similarity against a query and compare rankings.

Solution (Math)

Dot product $q^Td$ depends on both alignment and magnitude. Cosine similarity

\[ \frac{q^Td}{\|q\|\|d\|} \]

depends only on angle. Thus rankings can differ when norms vary.

We use:

Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
Vectors are column vectors by default.
$\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.

Solution (Python)


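import numpy as np

# Query and three document vectors (rows of docs).
q = np.array([1.0, 0.5])
docs = np.array([[10.0, 0.0], [0.5, 0.5], [-1.0, 0.0]])

# Raw dot-product scores, shape (n,).
dot = docs @ q
# Cosine scores: divide by the product of document and query norms.
cos = dot / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))

# argsort on negated scores gives indices in descending score order.
print("dot:", dot, "rank:", np.argsort(-dot))
print("cos:", cos, "rank:", np.argsort(-cos))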

Code Introduction

This snippet compares raw dot product and cosine similarity for ranking vectors against a query. Dot scores reflect both magnitude and alignment; cosine scores normalize by vector lengths, isolating direction. The printed ranks show how a large-norm vector can dominate under dot product, while cosine similarity discounts length, favoring angular closeness.
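As a check on the printed scores, the arithmetic for the three document vectors in the snippet can be worked by hand. The labels $d_0, d_1, d_2$ are introduced here only to match the row indices of docs; values are rounded to four decimals.

```latex
\[
\|q\| = \sqrt{1^2 + 0.5^2} = \sqrt{1.25} \approx 1.1180
\]
\[
\begin{array}{llll}
d_0 = (10,\,0): & q^\top d_0 = 10, & \|d_0\| = 10, & \cos(q,d_0) = \frac{10}{10 \times 1.1180} \approx 0.8944 \\[4pt]
d_1 = (0.5,\,0.5): & q^\top d_1 = 0.75, & \|d_1\| \approx 0.7071, & \cos(q,d_1) = \frac{0.75}{0.7071 \times 1.1180} \approx 0.9487 \\[4pt]
d_2 = (-1,\,0): & q^\top d_2 = -1, & \|d_2\| = 1, & \cos(q,d_2) = \frac{-1}{1 \times 1.1180} \approx -0.8944
\end{array}
\]
```

Dot product ranks $d_0$ first ($10 > 0.75 > -1$), while cosine ranks $d_1$ first ($0.9487 > 0.8944 > -0.8944$): the long vector wins on raw score, the better-aligned one wins on angle.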

Numerical Implementation Details

Numerical and Shape Notes

Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n \times d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈ 1.
Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
Stability eps: Add $\varepsilon$ for divisions/logs and $\varepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n \times d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
Stability: Use log-sum-exp for softmax; add small diagonal $\varepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
Complexity & memory: Dense matmul and factorizations (LU, Cholesky, QR) of $n \times n$ matrices cost ~$O(n^3)$; triangular solves and matrix-vector products cost $O(n^2)$. Prefer vectorization over Python loops; avoid materializing large intermediates.
Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), positive semidefiniteness (x.T @ A @ x >= 0 for test vectors x), and residual norms within tolerance (~1e-12 for float64).
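The masking and log-sum-exp advice in the notes above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name masked_softmax and the example scores are hypothetical, and the sketch assumes every row keeps at least one unmasked key (an all-masked row would produce NaN).

```python
import numpy as np

def masked_softmax(scores, mask):
    """Row-wise softmax over keys; mask is True where a key is allowed."""
    # Excluded positions get -inf so they receive exactly zero weight.
    masked = np.where(mask, scores, -np.inf)
    # Subtract the row maximum (log-sum-exp trick) so exp cannot overflow.
    shifted = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)

# Hypothetical scores: the first row would overflow a naive exp().
scores = np.array([[1000.0, 1001.0, 999.0],
                   [0.5, -2.0, 3.0]])
mask = np.array([[True, True, False],
                 [True, True, True]])

weights = masked_softmax(scores, mask)
# Sanity checks: same shape as the scores, rows sum to ~1.
print(weights.shape, weights.sum(axis=1))
```

The output keeps the shape of the score matrix, and the masked entry in the first row is exactly zero.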

Define query $q \in \mathbb{R}^d$ and documents matrix $D \in \mathbb{R}^{n \times d}$ (rows are docs).
Compute dot scores: dot = D @ q (shape $(n,)$).
Compute norms: doc_norms = np.linalg.norm(D, axis=1), q_norm = np.linalg.norm(q).
Compute cosine scores: cos = dot / (doc_norms * q_norm); guard against zero norms if present.
Rank: np.argsort(-dot) for dot-product ranking; np.argsort(-cos) for cosine ranking.
Compare rankings to see where they differ; print both.
Optional: L2-normalize docs and query once (D_norm = D / doc_norms[:, None], q_normed = q / q_norm) and then use dot to equal cosine.
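The steps above can be sketched as one function with a zero-norm guard. The name rank_by_dot_and_cosine, the eps value, and the extra all-zero document are assumptions made for illustration; scoring a zero vector as cosine 0 is one possible convention, not the only one.

```python
import numpy as np

def rank_by_dot_and_cosine(D, q, eps=1e-12):
    """Return dot scores, cosine scores, and both descending rankings."""
    D = np.asarray(D, dtype=np.float64)   # (n, d), rows are docs
    q = np.asarray(q, dtype=np.float64)   # (d,)
    dot = D @ q                           # (n,)
    doc_norms = np.linalg.norm(D, axis=1) # (n,)
    q_norm = np.linalg.norm(q)            # scalar
    # Guard: clamp the denominator so a zero vector yields cosine 0, not NaN.
    denom = np.maximum(doc_norms * q_norm, eps)
    cos = dot / denom
    return dot, cos, np.argsort(-dot), np.argsort(-cos)

# Same query and documents as the example, plus a zero document to
# exercise the guard.
D = np.array([[10.0, 0.0], [0.5, 0.5], [-1.0, 0.0], [0.0, 0.0]])
q = np.array([1.0, 0.5])

dot, cos, dot_rank, cos_rank = rank_by_dot_and_cosine(D, q)
print("dot rank:", dot_rank)
print("cos rank:", cos_rank)
```

The two rankings disagree on the top item: the long vector leads under dot product, the better-aligned vector leads under cosine.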

What This Example Demonstrates

Pedagogical Significance

Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.

Dot product mixes magnitude and direction; cosine depends only on direction.
Rankings can diverge when document norms differ: a large-norm but misaligned vector may outrank a well-aligned small-norm vector under dot product, but not under cosine.
L2-normalization makes dot and cosine rankings identical: if $\|q\|=\|d_i\|=1$, then $q^\top d_i = \cos(q,d_i)$.
Shape discipline: $q \in \mathbb{R}^d$, docs $\in \mathbb{R}^{n\times d}$, scores $\in \mathbb{R}^n$, ranks from argsort on negative scores.
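Numerical stability: norm computations square the entries, so extreme magnitudes can overflow or underflow; np.linalg.norm is the usual choice, but rescale vectors with extreme entries before taking norms, and handle zero vectors explicitly because their cosine is undefined.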
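One way to keep a norm finite for extreme magnitudes is to factor out the largest absolute entry before squaring. The helper below is a hypothetical sketch (scaled_norm is not a NumPy function) and the test values are chosen so that their squares fall outside the float64 range.

```python
import numpy as np

def scaled_norm(x):
    """Euclidean norm computed after dividing by the largest |entry|."""
    x = np.asarray(x, dtype=np.float64)
    scale = np.max(np.abs(x))
    if scale == 0.0:
        # Zero vector: the norm is 0 and cosine against it is undefined.
        return 0.0
    # Entries of x / scale lie in [-1, 1], so their squares stay in range.
    return scale * np.sqrt(np.sum((x / scale) ** 2))

big = np.array([1e200, 1e200])     # squares (1e400) exceed the float64 maximum
tiny = np.array([1e-200, 1e-200])  # squares (1e-400) underflow to zero

print(scaled_norm(big), scaled_norm(tiny))
```

Both results come out as the largest entry times $\sqrt{2}$, which a plain sum of squares could not represent at these magnitudes.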

Notes

Part 1: Dot product scoring
Scores $q^\top d$ scale with both alignment and vector length. Large-norm docs can dominate even if angles are moderate.

Part 2: Cosine scoring
Cosine $\frac{q^\top d}{\|q\|\|d\|}$ removes magnitude, ranking purely by angle. It is invariant to rescaling either vector by a positive constant; a negative constant flips its sign.
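A quick numerical check of this property: cosine is unchanged when either vector is multiplied by a positive constant, a negative constant flips its sign, and the dot product scales with the constant. The helper name cosine and the scale factors are illustrative choices.

```python
import numpy as np

def cosine(q, d):
    """Cosine similarity of two nonzero 1-D vectors."""
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

q = np.array([1.0, 0.5])
d = np.array([0.5, 0.5])

base = cosine(q, d)
scaled = cosine(3.0 * q, 100.0 * d)  # positive rescaling: cosine unchanged
flipped = cosine(q, -2.0 * d)        # negative rescaling: sign flips
dot_scaled = float(q @ (100.0 * d))  # dot product grows with the scale

print(base, scaled, flipped, dot_scaled)
```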

Part 3: When rankings differ
If document norms vary, dot and cosine can disagree. If all vectors are L2-normalized, rankings coincide. In practice, embeddings are often normalized at serving time to make cosine and dot equivalent.
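The claim that L2-normalization makes the two scores coincide can be checked on synthetic data. The random documents, the seed, and the norm range below are arbitrary assumptions chosen only to give the documents widely varying lengths.

```python
import numpy as np

rng = np.random.default_rng(0)
# Six synthetic documents with deliberately different norms.
D = rng.normal(size=(6, 4)) * rng.uniform(0.1, 10.0, size=(6, 1))
q = rng.normal(size=4)

# Cosine computed directly from unnormalized vectors.
cos = (D @ q) / (np.linalg.norm(D, axis=1) * np.linalg.norm(q))

# Normalize once, then use a plain dot product.
D_unit = D / np.linalg.norm(D, axis=1, keepdims=True)
q_unit = q / np.linalg.norm(q)
dot_unit = D_unit @ q_unit

same_scores = np.allclose(dot_unit, cos)
same_ranking = np.array_equal(np.argsort(-dot_unit), np.argsort(-cos))
print(same_scores, same_ranking)
```

Normalizing once up front moves the division out of the scoring step, which is why serving systems often store unit-length vectors.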

Connection to ML Primitives
Dot vs. cosine choices appear in retrieval (vector DBs, nearest neighbors), recommendation (user/item embeddings), attention (scaled dot-product), and contrastive learning (InfoNCE uses cosine/dot with temperature). Normalization choices affect calibration and robustness.

Numerical and Implementation Notes
Avoid zero vectors (cosine undefined). Use stable norms; for FP16, accumulate norms in FP32. For large batches, precompute norms and reuse. When using ANN/vector DBs, note that some indexes assume normalized vectors to accelerate cosine via inner product.
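The FP16 advice can be illustrated with a small sketch: squaring 300 already exceeds the float16 maximum (65504), so the vectors are upcast to float32 before squaring and summing. The array contents and variable names are hypothetical; the norms are computed once and reused across queries.

```python
import numpy as np

# Hypothetical half-precision document vectors.
D16 = np.full((2, 8), 300.0, dtype=np.float16)

# Upcast before squaring so the sum of squares is accumulated in float32.
D32 = D16.astype(np.float32)
doc_norms = np.sqrt(np.sum(D32 * D32, axis=1))  # float32, shape (2,)

# Reuse the precomputed document norms for every query.
q_list = [np.ones(8, dtype=np.float32),
          np.arange(1, 9, dtype=np.float32)]
cos_per_query = [(D32 @ q) / (doc_norms * np.linalg.norm(q)) for q in q_list]

print(doc_norms, cos_per_query)
```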

Numerical and Shape Notes
Shapes: $q(d)$, $D(n\times d)$, dot/cos $(n,)$. argsort(-scores) ranks descending. Broadcasting: doc_norms * q_norm is length-$n$; ensure no division by zero. If batching queries ($Q \in \mathbb{R}^{m\times d}$), compute scores as $Q D^\top$ (shape $m\times n$) and normalize rows/columns appropriately.
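For batched queries, the normalization can be written with explicit broadcasting shapes. This is a sketch under the shapes stated above; the function name batched_cosine, the eps guard, and the second query row are assumptions added for illustration.

```python
import numpy as np

def batched_cosine(Q, D, eps=1e-12):
    """Cosine scores for m queries against n documents, shape (m, n)."""
    Q = np.asarray(Q, dtype=np.float64)                 # (m, d)
    D = np.asarray(D, dtype=np.float64)                 # (n, d)
    scores = Q @ D.T                                    # (m, n)
    q_norms = np.linalg.norm(Q, axis=1, keepdims=True)  # (m, 1)
    d_norms = np.linalg.norm(D, axis=1)[None, :]        # (1, n)
    # (m, 1) * (1, n) broadcasts to (m, n); eps guards zero norms.
    return scores / np.maximum(q_norms * d_norms, eps)

Q = np.array([[1.0, 0.5], [-1.0, 1.0]])
D = np.array([[10.0, 0.0], [0.5, 0.5], [-1.0, 0.0]])

cos = batched_cosine(Q, D)
ranks = np.argsort(-cos, axis=1)  # per-query descending ranking
print(cos.shape)
print(ranks)
```

Each row of ranks is the document ordering for one query, so the result can be sliced to a top-k per query without a Python loop.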

History and Applications

Cosine similarity has roots in information retrieval (Salton's vector space model, 1960s) as a length-invariant measure of query-document relevance. Dot product scoring rose with neural embeddings: Word2Vec (2013) and GloVe (2014) trained vectors where inner products captured semantic association. Modern contrastive models (e.g., CLIP, SimCLR) L2-normalize embeddings so cosine and dot coincide, stabilizing training and retrieval. In recommender systems, user/item factors are often normalized to control popularity bias; in attention, scaling by $1/\sqrt{d}$ and softmax balances magnitude and angular effects. Vector databases and ANN indexes frequently assume normalized vectors to accelerate cosine via inner-product search. Across IR, NLP, vision, and multimodal models, the choice between dot and cosine shapes ranking behavior, calibration, and robustness.

Connection to Broader Examples

Vector spaces (Ch. 1): Inner products live in vector spaces and induce norms.
Inner products (Ch. 5): This example is a direct application: dot vs. cosine derive from the same inner product.
Orthogonality & projections (Ch. 6): The scalar projection of $d$ onto $q$ is $q^\top d / \|q\|$; dividing it by $\|d\|$ gives the cosine, so for unit-length $q$ and $d$ the projection is the cosine itself.
SVD/PCA (Ch. 10/11): Principal directions maximize variance; cosine similarity is often used in those embedding spaces for retrieval.
Attention (Ch. 16): Attention scores are dot products; adding scaling/normalization (softmax over scaled dots) changes how length vs. angle affects weights.
Conditioning (Ch. 14): Large norm disparities can destabilize dot-based ranking; normalization mitigates this.
