ex1.ai

Example number

Example slug

example_85_similarity_geometry_dot_vs_cosine_in_embedding_retrieval

Background

Similarity metrics have underpinned information retrieval since the 1960s (Salton's vector space model). Classical TF-IDF (term frequency-inverse document frequency) represents documents as sparse vectors; cosine similarity became standard because document length (word count) was uninformative noise: longer documents shouldn't rank higher by default. This principle generalized to dense embeddings: word2vec (Mikolov, 2013) and GloVe (Pennington, 2014) learned dense representations where cosine similarity captured semantic relatedness ("king" - "man" + "woman" ≈ "queen"). BERT (Devlin, 2019) and sentence transformers extended this to contextual embeddings, enabling semantic search at scale. Modern retrieval systems (e.g., Google Search, chatbot memory) compare a query against billions of document embeddings in real time, using approximate nearest neighbor (ANN) libraries such as FAISS or ScaNN to find the highest-cosine matches. Contrastive learning (SimCLR, MoCo, CLIP) trains embeddings by maximizing cosine similarity between positive pairs and minimizing it for negatives. The key insight: normalizing to the unit sphere makes comparisons purely geometric (angles, not magnitudes), which is natural for high-dimensional data where scale often has no semantic meaning.

Purpose

Build an intuitive, geometry-driven understanding of similarity metrics in information retrieval and machine learning. Show why dot product (unnormalized) and cosine similarity (normalized) can produce drastically different rankings, and teach you to choose the right metric based on whether vector magnitude is informative. Emphasize that normalization is not a cosmetic tweak: it fundamentally changes the geometry from measuring "magnitude × alignment" to measuring "pure directional alignment," which is critical for tasks like text retrieval, semantic search, and contrastive learning where scale is irrelevant. Train your intuition to predict when rankings will flip based on norm variation, and connect this to the unit sphere geometry underlying embeddings in modern NLP and vision.

Problem

Rank three document embeddings against a query using dot-product similarity and cosine similarity. Explain (in math) why rankings can differ when magnitudes differ.

Solution (Math)

For query $q \in \mathbb{R}^d$ and document vectors $d_i \in \mathbb{R}^d$ (rows of $\text{docs} \in \mathbb{R}^{n \times d}$):

Dot product similarity: \[ s_i = d_i^\top q = \|d_i\| \|q\| \cos(\theta_i), \] mixes magnitude and alignment (angle $\theta_i$).

Cosine similarity: \[ \cos(\theta_i) = \frac{d_i^\top q}{\|d_i\|_2 \|q\|_2} \in [-1, 1], \] isolates directional alignment (scale-invariant).

Why rankings differ: When $\|d_i\|$ varies widely, dot-product rankings favor large-norm vectors even if $\theta_i$ is larger (worse alignment). Cosine normalization removes magnitude bias, ranking purely by angle. For example: $d_1 = [10, 0]^\top$ and $d_2 = [0.5, 0.5]^\top$ with $q = [1, 0.5]^\top$. Dot products: $d_1^\top q = 10$, $d_2^\top q = 0.75$ (ranks $d_1 > d_2$). Cosines: $\cos(\theta_1) \approx 0.894$, $\cos(\theta_2) \approx 0.949$ (ranks $d_2 > d_1$ because $\theta_2 < \theta_1$).

We use:
- $X \in \mathbb{R}^{n \times d}$ (rows = examples, $n$ = batch size)
- Column vectors by default
- Inner product $\langle x, y \rangle = x^\top y$; norm $\|x\|_2 = \sqrt{x^\top x}$
- Transpose $A^\top$ (not $A^T$)

Solution (Python)


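import numpy as np

# Query and document embeddings (each row of docs is one document)
q = np.array([1., 0.5])
docs = np.array([
    [10., 0.],    # d1: large norm, along the first axis
    [ 0.5, 0.5],  # d2: small norm, closest in angle to q
    [-1., 0.],    # d3: points away from q along the first axis
])

# Dot product: mixes magnitude and alignment
dot = docs @ q
# Cosine: divide by the row-wise document norms and the query norm
cos = dot / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))

# argsort on negated scores returns indices in descending score order
print("dot:", dot, "rank:", np.argsort(-dot))
print("cos:", cos, "rank:", np.argsort(-cos))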

Code Introduction

This snippet demonstrates the critical difference between dot product similarity (unnormalized) and cosine similarity (normalized) in information retrieval and machine learning. The code compares how a query vector ranks against a small document collection, showing how normalization by vector magnitude can drastically change relevance scores and rankings.

Part 1: Query-Document Inner Products. The setup models a simple retrieval task: given a query vector $q = [1, 0.5]^\top \in \mathbb{R}^2$ and three document vectors forming rows of $\text{docs} \in \mathbb{R}^{3 \times 2}$, compute similarity scores. The documents are: $d_1 = [10, 0]^\top$ (large magnitude, aligned with first query dimension), $d_2 = [0.5, 0.5]^\top$ (small magnitude, aligned with both query dimensions), $d_3 = [-1, 0]^\top$ (opposed to first query dimension). The dot product similarity is the standard inner product: $\text{dot}_i = d_i^\top q$. Computing explicitly: $\text{dot}_1 = 10$, $\text{dot}_2 = 0.75$, $\text{dot}_3 = -1$. The ranking by dot product (using np.argsort(-dot) to sort descending) is: Document 1 > Document 2 > Document 3. Document 1 dominates because its magnitude is large ($\|d_1\| = 10$), even though it has no component along the second query dimension.

Part 2: Cosine Similarity via Normalization. Cosine similarity removes the magnitude bias by normalizing both query and documents to unit length: $\cos(\theta_i) = \frac{d_i^\top q}{\|d_i\| \|q\|}$, where $\theta_i$ is the angle between $d_i$ and $q$. This measures directional alignment (the cosine of the angle), independent of vector length. Computing the norms: $\|d_1\| = 10$, $\|d_2\| \approx 0.707$, $\|d_3\| = 1$, $\|q\| \approx 1.118$. Normalizing the dot products: $\cos(\theta_1) \approx 0.894$, $\cos(\theta_2) \approx 0.949$, $\cos(\theta_3) \approx -0.894$. The ranking by cosine similarity is: Document 2 > Document 1 > Document 3. Document 2 now ranks highest because it's most aligned with the query direction (smallest angle $\theta_2 \approx 18°$), despite having much smaller magnitude than Document 1.
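
The angles quoted above can be recovered from the cosines directly. The following is a minimal sketch using the same q and docs as the solution code; the clipping step is an added precaution (not part of the original snippet) against round-off pushing a cosine slightly outside [-1, 1].

```python
import numpy as np

q = np.array([1.0, 0.5])
docs = np.array([[10.0, 0.0], [0.5, 0.5], [-1.0, 0.0]])

cos = (docs @ q) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))

# Clip before arccos: a value like 1.0000000002 would otherwise yield nan.
angles_deg = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

print("angles (degrees):", np.round(angles_deg, 1))
print("rank by smallest angle:", np.argsort(angles_deg))
```

Sorting by ascending angle gives the same order as sorting by descending cosine, because arccos is monotonically decreasing on [-1, 1].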

Why This Matters for ML.
Information retrieval and document ranking: Search engines and recommendation systems use cosine similarity to match queries to documents. The normalization ensures that longer documents (more words, larger feature magnitudes) don't automatically rank higher; what matters is conceptual alignment, not document length.
TF-IDF and embedding-based retrieval: Term frequency-inverse document frequency (TF-IDF) vectors have varying magnitudes (longer documents accumulate more term weight). Cosine similarity prevents bias toward documents with high word counts.
Embedding spaces (word2vec, BERT, sentence transformers): Neural embeddings represent text as dense vectors. Cosine similarity measures semantic closeness independent of embedding scale.
Attention mechanisms: Scaled dot-product attention uses unnormalized scores $QK^\top$; the softmax then normalizes the resulting weights so they sum to 1 across keys. Some variants (e.g., normalized attention) explicitly use cosine similarity.
Contrastive learning (SimCLR, MoCo): Loss functions maximize cosine similarity between augmented views of the same image.
Clustering (K-means variants): Spherical K-means uses cosine similarity instead of Euclidean distance, useful for text clustering where document length varies.
Recommendation systems (collaborative filtering): User-item preference vectors are compared via cosine similarity to find similar users or items, ignoring scale differences.

Shape Discipline (Critical).
Shapes in this example: $q \in \mathbb{R}^d$, $\text{docs} \in \mathbb{R}^{n \times d}$, $\text{dot} \in \mathbb{R}^n$, $\|\text{docs}\|_2 \in \mathbb{R}^n$ (row-wise norms via axis=1), $\|q\|_2$ is scalar, $\cos(\theta) \in \mathbb{R}^n$.
The axis parameter matters: np.linalg.norm(docs, axis=1) computes the norm of each row (each document), not columns.
Broadcasting in division: dot / (doc_norms * q_norm) broadcasts correctly because numerator and denominator both have shape (n,).
Ranking via argsort: np.argsort(-dot) returns indices that sort in descending order (negation flips ascending to descending).
Cosine similarity range: $\cos(\theta) \in [-1, 1]$. Values near 1 indicate alignment, near 0 indicate orthogonality, and near -1 indicate opposition.
Numerical stability: For very small norms (near-zero vectors), add epsilon: cos = dot / (doc_norms * q_norm + 1e-8) to avoid division by zero.

Numerical Implementation Details

Dot product: dot = docs @ q (matrix-vector multiply, $O(nd)$ for $n$ documents, $d$ dimensions).
Cosine similarity: cos = dot / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
1. Compute row-wise document norms: np.linalg.norm(docs, axis=1) → shape (n,)
2. Compute query norm: np.linalg.norm(q) → scalar
3. Divide dot products by product of norms (NumPy broadcasts correctly)
Ranking: np.argsort(-scores) sorts descending (negation flips order). Returns indices, not scores.
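Complexity: The dot products cost $O(nd)$; the row-wise norms cost another $O(nd)$, so computing cosine from scratch roughly doubles the arithmetic. Precomputing the document norms (or pre-normalizing the documents) reduces the per-query overhead to $O(n + d)$.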
Memory: Store docs and q; intermediate dot is $O(n)$.
Numerical stability: For very small norms, add epsilon: cos = dot / (doc_norms * q_norm + 1e-8) to avoid division by zero.
Alternative (pre-normalize): Store docs_normalized = docs / np.linalg.norm(docs, axis=1, keepdims=True) once, then cos = docs_normalized @ (q / np.linalg.norm(q)) (saves repeated norm computation for multiple queries).

What This Example Demonstrates

Dot product biases toward magnitude: Larger vectors dominate rankings even if poorly aligned (large $\|d_i\|$ compensates for large $\theta_i$).
Cosine similarity is scale-invariant: Normalization focuses purely on directional alignment (angle), ignoring length.
Rankings can flip dramatically: Same data, different metric → completely different orderings. Document 1 wins by dot product (large norm), Document 2 wins by cosine (small angle).
Unit sphere geometry: Cosine similarity is equivalent to dot product on normalized vectors (projecting onto the unit sphere).
When to use which: Dot product when magnitude is informative (e.g., attention logits, weighted voting); cosine when scale is arbitrary (e.g., text embeddings, image features).
Numerical stability: Division by norms requires care (add epsilon for near-zero vectors); broadcasting matters for batch computations.

Notes

Part 1: Dot Product Similarity (Unnormalized)

The dot product $d_i^\top q$ computes the scalar projection of document $d_i$ onto query $q$ (times $\|q\|$). From the geometric definition of the angle between two vectors: \[ d_i^\top q = \|d_i\| \|q\| \cos(\theta_i), \] so the dot product is the product of three factors: (1) document magnitude $\|d_i\|$, (2) query magnitude $\|q\|$, and (3) cosine of the angle $\cos(\theta_i)$. Implication: Doubling a document vector's norm doubles the similarity score, even if the direction (content) is unchanged. For text, this biases toward longer documents; for embeddings, it favors vectors with larger norms (which may encode confidence or frequency, but often are arbitrary). When to use: Dot product is appropriate when magnitude is informative (e.g., attention logits where larger values indicate stronger relevance, or when aggregating weighted votes where magnitude encodes weight).

Part 2: Cosine Similarity (Normalized, Scale-Invariant)

Cosine similarity $\frac{d_i^\top q}{\|d_i\| \|q\|}$ isolates the angular component, removing magnitude: \[ \cos(\theta_i) \in [-1, 1]. \] Values near 1 indicate alignment (small angle), near 0 indicate orthogonality (90°), and near -1 indicate opposition (180°). Implication: Doubling document length doesn't change the score; only the direction matters. This is equivalent to projecting all vectors onto the unit sphere (normalize to length 1), then computing dot products. When to use: Cosine is standard when scale is arbitrary or irrelevant: text retrieval (document length varies), image embeddings (CNN features have varying norms), contrastive learning (embeddings trained with L2 normalization).
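
The scale behavior described in Parts 1 and 2 can be checked in a few lines. This is a small illustrative sketch; the helper name cosine and the chosen scale factors are assumptions made for the demonstration.

```python
import numpy as np

def cosine(d, q):
    """Cosine similarity between two nonzero 1-D vectors."""
    return float(d @ q / (np.linalg.norm(d) * np.linalg.norm(q)))

q = np.array([1.0, 0.5])
d = np.array([0.5, 0.5])

for scale in (1.0, 2.0, 100.0):
    scaled = scale * d
    # The dot product grows linearly with the scale; the cosine stays put.
    print(f"scale={scale:6.1f}  dot={float(scaled @ q):8.3f}  cos={cosine(scaled, q):.6f}")
```

The dot column changes with every row while the cos column does not, which is the sense in which cosine similarity is scale-invariant.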

Part 3: Why Rankings Differ (Norm Variation)

In the example: $d_1 = [10, 0]^\top$, $d_2 = [0.5, 0.5]^\top$, $q = [1, 0.5]^\top$.

Dot products:
- $d_1^\top q = 10$: Large norm ($\|d_1\| = 10$) dominates despite $\theta_1 \approx 26.6°$.
- $d_2^\top q = 0.75$: Small norm ($\|d_2\| \approx 0.707$) suppresses score despite smaller angle $\theta_2 \approx 18.4°$.
- Ranking: Document 1 > Document 2 (norm wins).

Cosine similarities:
- $\cos(\theta_1) \approx 0.894$: Alignment moderate.
- $\cos(\theta_2) \approx 0.949$: Alignment best (smallest angle).
- Ranking: Document 2 > Document 1 (angle wins).

General principle: Dot product rankings mix magnitude and angle; cosine rankings are purely angular. When norms vary by orders of magnitude (common in sparse TF-IDF or unnormalized embeddings), dot product and cosine can produce opposite orderings.
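
To predict when a flip happens, note that $d_1$ outranks $d_2$ under the dot product exactly when $\|d_1\| \cos(\theta_1) > \|d_2\| \cos(\theta_2)$. The sketch below holds the direction of $d_1$ fixed and varies only its norm c; the variable names (u1, threshold) and the tested values of c are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

q = np.array([1.0, 0.5])
u1 = np.array([1.0, 0.0])   # unit direction of d1; d1 = c * u1
d2 = np.array([0.5, 0.5])

# d1 wins under dot product iff c * cos(theta_1) > ||d2|| * cos(theta_2)
threshold = np.linalg.norm(d2) * cosine(d2, q) / cosine(u1, q)
print("norm of d1 above which dot product prefers d1:", round(float(threshold), 4))

for c in (0.5, 1.0, 10.0):
    d1 = c * u1
    dot_prefers_d1 = bool(d1 @ q > d2 @ q)
    cos_prefers_d1 = bool(cosine(d1, q) > cosine(d2, q))
    print(f"c={c:5.1f}  dot prefers d1: {dot_prefers_d1}  cos prefers d1: {cos_prefers_d1}")
```

Cosine never prefers $d_1$ here because its angle to $q$ is larger regardless of c; the dot-product preference switches once c passes the threshold.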

Why This Matters for ML

Information retrieval: Search engines rank billions of documents against user queries. Cosine similarity prevents bias toward long documents (more words ≠ more relevant). Classical TF-IDF retrieval in the vector space model used cosine for this reason.

Semantic search (BERT, Sentence-BERT): Dense embeddings from transformers encode meaning. Cosine similarity finds semantically related sentences, even if phrasing differs ("The cat sits on the mat" vs. "Feline rests on a rug" should yield a high cosine despite different words).

Contrastive learning (SimCLR, MoCo, CLIP): The training objective maximizes cosine similarity for positive pairs (augmented views of the same image, or image-caption pairs) and minimizes it for negatives. Loss functions like NT-Xent (normalized temperature-scaled cross-entropy) normalize embeddings to the unit sphere: \[ \mathcal{L}_i = -\log \frac{\exp(\cos(z_i, z_j) / \tau)}{\sum_{k \neq i} \exp(\cos(z_i, z_k) / \tau)}, \] where $(z_i, z_j)$ is a positive pair, the sum runs over the other embeddings in the batch, $\tau$ is the temperature, and $\cos(\cdot, \cdot)$ is cosine similarity.
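
A minimal NumPy sketch of an NT-Xent-style loss follows. It is an illustration under stated assumptions, not a reference implementation: the batch layout (row i and row i + N form a positive pair), the exclusion of the k = i term from the denominator (as in SimCLR), the function name nt_xent, and the default temperature are all choices made here.

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent-style loss over 2N embeddings; rows i and (i + N) % 2N are positives."""
    two_n = z.shape[0]
    n = two_n // 2
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # project rows onto the unit sphere
    sim = (z @ z.T) / tau                              # cosine / temperature, shape (2N, 2N)
    np.fill_diagonal(sim, -np.inf)                     # drop self-similarity (k == i)
    pos = (np.arange(two_n) + n) % two_n               # index of each row's positive
    # Log-sum-exp with max subtraction for numerical stability.
    m = sim.max(axis=1, keepdims=True)
    log_denominator = m[:, 0] + np.log(np.exp(sim - m).sum(axis=1))
    return float(np.mean(log_denominator - sim[np.arange(two_n), pos]))

rng = np.random.default_rng(0)
views = rng.normal(size=(4, 8))
z = np.vstack([views, views + 0.05 * rng.normal(size=(4, 8))])  # two noisy views per item
print("loss:", round(nt_xent(z), 4))
print("loss after rescaling embeddings by 3x:", round(nt_xent(3.0 * z), 4))
```

Because the rows are normalized first, rescaling the embeddings leaves the loss unchanged; only directions matter.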

Clustering (K-means variants): Spherical K-means uses cosine similarity instead of Euclidean distance, clustering on the unit sphere. Better for text (document length is noise) and directional data (wind directions, orientations).
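
Spherical K-means can be sketched as ordinary K-means with two changes: points are normalized up front, and each centroid is re-normalized after averaging. The code below is a hedged illustration; the function name, the random initialization from data points, and the fixed iteration count are assumptions, and it presumes no all-zero rows.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=10, seed=0):
    """Cluster rows of X by direction; returns (labels, unit-norm centers)."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)        # rows on the unit sphere
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initialize from data points
    for _ in range(n_iter):
        labels = np.argmax(X @ centers.T, axis=1)           # assign by highest cosine
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue                                    # keep the old center if empty
            total = members.sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:
                centers[j] = total / norm                   # re-project onto the sphere
    labels = np.argmax(X @ centers.T, axis=1)
    return labels, centers

# Two directions, with norms spanning two orders of magnitude.
X = np.array([[1.0, 0.1], [10.0, 0.5], [0.2, 0.03],
              [0.1, 1.0], [0.3, 5.0], [0.02, 0.3]])
labels, centers = spherical_kmeans(X, k=2)
print(labels)
```

The first three rows share a cluster and the last three share the other, even though their norms differ widely, which is the property that makes this variant suited to text.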

Recommendation systems: Collaborative filtering compares user-item vectors. Cosine similarity finds similar users ("users who rated these items similarly") regardless of rating scale differences (one user rates 1, another rates 3).

ML Examples and Patterns

(Included in the Code Introduction section above for the detailed implementation examples)

Connection to Linear Algebra Theory

Inner product as projection: The dot product $d^\top q$ measures the projection of $d$ onto $q$ (times $\|q\|$). From linear algebra: \[ \text{proj}_q(d) = \frac{d^\top q}{\|q\|^2} q, \] whose signed length is the scalar projection $d^\top q / \|q\|$, so $d^\top q = \|q\| \cdot (\text{scalar projection of } d \text{ onto } q)$.
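
The relationship between the dot product and the projection can be verified numerically. This short sketch reuses $d_2$ and $q$ from the example; the variable names are illustrative.

```python
import numpy as np

d = np.array([0.5, 0.5])
q = np.array([1.0, 0.5])

scalar_proj = d @ q / np.linalg.norm(q)   # signed length of d along q
proj = (d @ q) / (q @ q) * q              # vector projection of d onto q
residual = d - proj                       # component of d orthogonal to q

print("scalar projection:", round(float(scalar_proj), 4))
print("vector projection:", np.round(proj, 4))
print("residual . q     :", round(float(residual @ q), 12))
print("||q|| * scalar projection:", round(float(np.linalg.norm(q) * scalar_proj), 4),
      " d . q:", float(d @ q))
```

The residual is orthogonal to $q$, and multiplying the scalar projection by $\|q\|$ recovers the dot product.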

Cauchy-Schwarz inequality: Bounds the dot product: \[ |d^\top q| \le \|d\| \|q\|, \] with equality when $d$ and $q$ are parallel. This implies: \[ \frac{d^\top q}{\|d\| \|q\|} \in [-1, 1], \] i.e., cosine similarity is always in the valid range for a cosine function.

Unit sphere normalization: Cosine similarity is equivalent to computing dot products on the unit sphere: \[ \hat{d} = \frac{d}{\|d\|}, \quad \hat{q} = \frac{q}{\|q\|}, \quad \cos(\theta) = \hat{d}^\top \hat{q}. \] All vectors of the same direction collapse to a single point on the unit sphere (scale-invariant representation).

Orthogonality: Vectors with $\cos(\theta) = 0$ (dot product zero) are orthogonal (perpendicular). In high dimensions, random vectors are approximately orthogonal (concentration of measure).
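
The near-orthogonality of random high-dimensional vectors is easy to observe empirically. The sketch below draws Gaussian vectors (an assumption; other isotropic distributions behave similarly) and reports the average absolute cosine for a few dimensions; the function name and sample count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(dim, n_pairs=2000):
    """Average |cos| between independent Gaussian vector pairs of a given dimension."""
    a = rng.normal(size=(n_pairs, dim))
    b = rng.normal(size=(n_pairs, dim))
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(np.mean(np.abs(cos)))

for dim in (2, 32, 768):
    print(f"dim={dim:4d}  mean |cos| = {mean_abs_cosine(dim):.3f}")
```

The average shrinks roughly like $1/\sqrt{d}$, so at embedding-sized dimensions a cosine far from zero is unlikely to arise by chance.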

Angular distance: Cosine similarity induces a distance metric (arc length on unit sphere): \[ d_{\text{angular}}(d, q) = \arccos\left( \frac{d^\top q}{\|d\| \|q\|} \right). \] This satisfies the triangle inequality (unlike raw dot product), making it a proper metric on directions, i.e., on unit vectors; two vectors that differ only in scale are at distance zero.
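
A quick numerical check of the angular distance, as a sketch: the helper name angular_distance, the clipping step, and the small tolerance used for floating-point comparisons are assumptions of this example.

```python
import numpy as np

def angular_distance(a, b):
    """Angle in radians between two nonzero vectors."""
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))  # clip guards against round-off

q = np.array([1.0, 0.5])
d1 = np.array([10.0, 0.0])
d2 = np.array([0.5, 0.5])

direct = angular_distance(d1, d2)
via_q = angular_distance(d1, q) + angular_distance(q, d2)
print("d(d1, d2)            :", round(np.degrees(direct), 2), "degrees")
print("d(d1, q) + d(q, d2)  :", round(np.degrees(via_q), 2), "degrees")
print("triangle inequality holds:", direct <= via_q + 1e-9)
```

In this 2-D example $q$ lies between $d_1$ and $d_2$, so the two sides are equal up to round-off; in general the direct angle is no larger than the detour.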

Numerical and Implementation Notes

Axis parameter in np.linalg.norm:
- axis=1: Compute norm of each row (per document). Correct for document-query similarity.
- axis=0: Compute norm of each column (per feature). Wrong for this task.
- axis=None (default): Compute Frobenius norm (flatten entire matrix). Also wrong.

Broadcasting in division:

cos = dot / (doc_norms * q_norm)

dot has shape (n,)
doc_norms has shape (n,)
q_norm is a scalar
doc_norms * q_norm broadcasts to shape (n,) (scalar multiplies each element)
Division is element-wise: cos[i] = dot[i] / (doc_norms[i] * q_norm)

Avoiding division by zero:

eps = 1e-8
cos = dot / (doc_norms * q_norm + eps)

Prevents nan or inf when norms are near zero (e.g., all-zeros document or query).
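
Wrapped as a function, the epsilon guard looks like the sketch below. The function name safe_cosine and the default eps are assumptions for illustration; a zero-norm document gets a score of exactly 0 rather than nan.

```python
import numpy as np

def safe_cosine(docs, q, eps=1e-8):
    """Cosine of each row of docs against q; eps keeps the denominator nonzero."""
    doc_norms = np.linalg.norm(docs, axis=1)   # shape (n,)
    q_norm = np.linalg.norm(q)                 # scalar
    return (docs @ q) / (doc_norms * q_norm + eps)

docs = np.array([[10.0, 0.0],
                 [0.0, 0.0]])                  # second row is an all-zeros document
q = np.array([1.0, 0.5])

scores = safe_cosine(docs, q)
print(scores)
```

One design caveat: eps slightly shrinks the scores of vectors whose norms are themselves close to eps, so it is worth choosing eps well below the smallest norm expected in real data.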

Pre-normalization for multiple queries:

# Normalize documents once (each row scaled to unit length)
docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)

# For each query: normalize it, then one matrix-vector product gives the cosines
q_unit = q / np.linalg.norm(q)
cos = docs_unit @ q_unit  # Already cosine similarity!

Saves repeated norm computation (amortizes cost over many queries).

Ranking descending (highest first):

indices = np.argsort(-scores)  # Negation flips to descending
top_k = indices[:k]  # Top k documents
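
When n is large and only a few results are needed, a full sort is more work than necessary. The sketch below uses np.argpartition to select the top k in linear time and then sorts only those k; the function name top_k and its handling of edge cases are illustrative choices.

```python
import numpy as np

def top_k(scores, k):
    """Indices of the k highest scores, ordered from highest to lowest."""
    k = min(k, len(scores))
    if k <= 0:
        return np.empty(0, dtype=int)
    # argpartition places the k smallest entries of -scores first, in no particular order.
    candidates = np.argpartition(-scores, k - 1)[:k]
    # Sort just those k candidates by descending score.
    return candidates[np.argsort(-scores[candidates])]

scores = np.array([10.0, 0.75, -1.0])
print(top_k(scores, 2))
```

For distinct scores this returns the same indices as np.argsort(-scores)[:k], while avoiding the $O(n \log n)$ cost of sorting the whole array.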

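Numerical stability for extreme values: Overflow in a norm computation depends on the magnitude of the entries and on the dtype, not on the dimension alone ($d = 768$ for BERT-base is unproblematic in float64). Squaring is the risky step: in float16, for example, $x^2$ overflows once $|x|$ exceeds roughly 255. If entries can be that large, rescale before squaring:

# Direct computation: squares every entry, so it can overflow for large values or narrow dtypes
norm = np.sqrt((x**2).sum())

# Rescaled computation: divide by the largest magnitude first so no square exceeds 1
scale = np.abs(x).max()
norm = scale * np.linalg.norm(x / scale) if scale > 0 else 0.0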

Numerical and Shape Notes

Shape discipline:
- $q \in \mathbb{R}^d$: query vector (e.g., $d=768$ for BERT-base)
- $\text{docs} \in \mathbb{R}^{n \times d}$: document matrix ($n$ documents, each $d$-dimensional)
- $\text{dot} = \text{docs} \, q \in \mathbb{R}^n$: dot products (one per document)
- $\|\text{docs}\|_2 \in \mathbb{R}^n$: row-wise norms (np.linalg.norm(docs, axis=1))
- $\|q\|_2 \in \mathbb{R}$ (scalar): query norm
- $\cos(\theta) \in \mathbb{R}^n$: cosine similarities (one per document)

Keepdims for broadcasting:

# Without keepdims: shape (n,), which does not broadcast against (n, d) for row-wise division
doc_norms = np.linalg.norm(docs, axis=1)  # shape (n,)

# With keepdims: shape (n, 1), broadcasts correctly
doc_norms = np.linalg.norm(docs, axis=1, keepdims=True)  # shape (n, 1)
docs_normalized = docs / doc_norms  # broadcasts to (n, d)

Transposing for batch queries:

# Multiple queries: Q is (m, d), docs is (n, d)
dots = Q @ docs.T  # (m, d) @ (d, n) = (m, n)
# dots[i, j] = similarity of query i to document j
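
The batch product above yields dot products; normalizing the rows of both matrices first turns it into a cosine matrix. The following is a sketch assuming no all-zero rows; the function name cosine_matrix and the second query row are illustrative.

```python
import numpy as np

def cosine_matrix(Q, docs):
    """Cosine similarity of every query row against every document row, shape (m, n)."""
    Q_unit = Q / np.linalg.norm(Q, axis=1, keepdims=True)           # (m, d)
    docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)  # (n, d)
    return Q_unit @ docs_unit.T                                     # (m, n)

Q = np.array([[1.0, 0.5],
              [0.0, 1.0]])
docs = np.array([[10.0, 0.0], [0.5, 0.5], [-1.0, 0.0]])

cos = cosine_matrix(Q, docs)
print(np.round(cos, 3))
print("best document per query:", np.argmax(cos, axis=1))
```

The first row reproduces the cosines from the single-query example; each additional query adds one row to the result.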

Memory efficiency for large corpora: For $n \sim 10^9$ documents (web-scale), computing every similarity for each query is too slow. Use approximate nearest neighbors (ANN):
- FAISS (Facebook AI Similarity Search): GPU-accelerated ANN with quantization
- ScaNN (Google): compressed embeddings via anisotropic vector quantization
- Annoy (Spotify): tree-based partitioning

Pedagogical Significance

This example is the foundational demonstration of similarity metrics in ML:

Key takeaways:
1. Dot product biases toward magnitude: Large vectors win, even if misaligned.
2. Cosine similarity is scale-invariant: Normalization focuses on direction (angle).
3. Rankings can flip: Same data, different metric → opposite orderings.
4. Unit sphere geometry: Cosine = dot product on normalized vectors.
5. Choose metric by task: Magnitude informative? Use dot product. Scale arbitrary? Use cosine.

Common misconceptions addressed:
- "Dot product and cosine are equivalent": No, dot product scales with magnitude; cosine doesn't.
- "Normalization is optional": No, it's critical when scale is uninformative (text, images).
- "Cosine is always better": No, it depends on whether magnitude encodes useful information.
- "Rankings are robust to metric choice": No, this example shows rankings can reverse entirely.

Connection to other examples:
- Example 80 (Attention): Attention uses unnormalized $QK^\top$; softmax provides implicit normalization.
- Example 84 (Backprop): Gradients involve inner products ($X^\top \frac{\partial L}{\partial Y}$); gradient clipping normalizes by magnitude.
- Example 83 (PCA): Principal components are unit vectors; projections use dot products but components are normalized.

Why this snippet is powerful: It isolates the core geometric distinction between magnitude-dependent and magnitude-independent similarity in minimal code. The ranking flip (Document 1 vs. Document 2) makes the difference visceral: not an abstract formula, but a concrete reversal of preferences. This is the gateway to understanding modern retrieval (Google Search, semantic search, recommender systems) and contrastive learning (SimCLR, CLIP), where cosine similarity is ubiquitous. Many embedding-based systems (BERT, GPT, diffusion models, face recognition) rely on these metrics. The principle (normalize when scale is noise) extends beyond ML to any high-dimensional data analysis (PCA, clustering, regression). This is the most important similarity metric pattern in the entire book.

History and Applications

Vector space model origins (1960s-1970s): Gerard Salton pioneered the vector space model for information retrieval at Cornell, representing documents as term-frequency vectors. The key innovation: use cosine similarity to rank documents against queries, normalizing out document length (longer documents have more terms, but aren't necessarily more relevant). This became the foundation of modern search engines and remains the default metric in retrieval systems. The TF-IDF weighting scheme (term frequency × inverse document frequency) further refined this by down-weighting common words, creating sparse high-dimensional vectors where cosine similarity measures topical overlap.

Word embeddings revolution (2013): Word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) learned dense representations where cosine similarity captured semantic relatedness. The famous example: king - man + woman ≈ queen used cosine similarity to find the nearest vector. These embeddings had arbitrary scale (training doesn't fix norm), so cosine was essential. This pattern extended to sentence embeddings (Sentence-BERT, 2019), image embeddings (ResNet features), and multimodal embeddings (CLIP, 2021, aligning text and images via cosine similarity in a shared space).

Contrastive learning era (2020+): SimCLR (Chen et al., 2020) and MoCo (He et al., 2020) trained visual representations by maximizing cosine similarity between augmented views of the same image. The NT-Xent loss (normalized temperature-scaled cross-entropy) normalizes embeddings to the unit sphere, then applies softmax to cosine similarities. CLIP (Radford et al., 2021) extended this to text-image pairs, achieving zero-shot transfer by learning a joint embedding space. Training objective: maximize cosine similarity for matched text-image pairs, minimize for mismatched. Contrastive objectives became a major strand of self-supervised learning, alongside related non-contrastive methods (DINO, MAE, data2vec).

Large-scale retrieval systems: Modern semantic search (Google Search, Bing, chatbot memory systems) compares queries against billions of document embeddings. Approximate nearest neighbor (ANN) libraries like FAISS (Facebook AI Similarity Search, 2017) and ScaNN (Google, 2020) enable low-latency retrieval at very large scale by quantizing embeddings and partitioning the index through clustering. These systems commonly pre-normalize embeddings (store them on the unit sphere), reducing cosine similarity to a single dot product (no division needed). Dense retrieval models (DPR, ColBERT) now complement or replace sparse TF-IDF in many production systems; they score with inner products, which coincide with cosine similarity whenever the embeddings are normalized.

Applications across domains: Cosine similarity underpins: (1) Recommender systems (collaborative filtering, content-based filtering), (2) Clustering (spherical K-means for text, image, and directional data), (3) Duplicate detection (near-duplicate documents, plagiarism detection), (4) Anomaly detection (outliers have low cosine similarity to cluster centers), (5) Face recognition (FaceNet, ArcFace normalize embeddings, compare via cosine), (6) Drug discovery (molecular embeddings, find similar compounds), (7) Code search (retrieving relevant code via cosine similarity to a query embedding). The principle (normalize when scale is noise) is universal in high-dimensional ML.

Connection to Broader Examples

Attention mechanisms (Ex 80): Attention uses unnormalized dot products $QK^\top$, but softmax provides implicit normalization of the weights (they sum to 1 across keys). Some variants (e.g., cosine attention) explicitly normalize: $\text{softmax}(\frac{QK^\top}{\|Q\| \|K\|})$, with the norms taken row-wise (one per query and one per key).
Inner products and orthogonality (Ch 5): Cosine similarity measures the angle via $\langle d, q \rangle = \|d\| \|q\| \cos(\theta)$. Orthogonal vectors ($\cos(\theta) = 0$) have zero dot product.
PCA and SVD (Ex 11, 10): Principal components are normalized eigenvectors (unit length). Projections onto them use dot products, but normalization ensures scale invariance.
Least squares and projections (Ex 12, 6): Projection onto a subspace involves inner products. Normalization (e.g., QR decomposition) produces orthonormal bases.
Backprop (Ex 84): Gradients involve transposes and inner products ($X^\top \frac{\partial L}{\partial Y}$). Normalized gradients (gradient clipping) use similar principles.
Sparse methods (Ex 15): Sparse TF-IDF vectors often use cosine similarity (most entries are zero; normalization handles varying document lengths).
Contrastive learning: SimCLR, MoCo, CLIP maximize cosine similarity between positive pairs, minimize for negatives. Loss functions like NT-Xent use normalized embeddings explicitly.
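
The cosine-attention variant mentioned in the first item of this list can be sketched in NumPy as follows. This is an illustrative sketch rather than any specific published layer: the function names, the scale factor (a hypothetical stand-in for a temperature, which may be fixed or learned in practice), and the eps guard are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cosine_attention(Q, K, V, scale=10.0, eps=1e-8):
    """Attention whose logits are scaled cosines between query and key rows."""
    Q_unit = Q / (np.linalg.norm(Q, axis=1, keepdims=True) + eps)  # (m, d)
    K_unit = K / (np.linalg.norm(K, axis=1, keepdims=True) + eps)  # (n, d)
    weights = softmax(scale * (Q_unit @ K_unit.T), axis=-1)        # (m, n), rows sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 5))

out, weights = cosine_attention(Q, K, V)
out_scaled, _ = cosine_attention(Q, 5.0 * K, V)
print("weights sum per query:", weights.sum(axis=1))
print("output unchanged when keys are rescaled:", np.allclose(out, out_scaled, atol=1e-6))
```

With plain dot-product logits, multiplying the keys by 5 would sharpen the attention distribution; with cosine logits the key norms no longer influence the weights.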