Part 1: Query-Document Scoring via Dot Product

- Dot product: $\langle d_i, q \rangle = \sum_j d_{ij} q_j$ measures unnormalized alignment.
- Scores are high when vectors point in similar directions and have large magnitudes.
- Document 0: $\langle [10, 0], [1, 0.5] \rangle = 10 \cdot 1 + 0 \cdot 0.5 = 10$ (high magnitude dominates).
- Document 1: $\langle [0.5, 0.5], [1, 0.5] \rangle = 0.5 \cdot 1 + 0.5 \cdot 0.5 = 0.75$ (smaller despite aligning with both dimensions).
- Document 2: $\langle [-1, 0], [1, 0.5] \rangle = -1 \cdot 1 + 0 \cdot 0.5 = -1$ (anti-aligned).

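
As a quick check of the three scores in Part 1, the following NumPy sketch computes them with a single matrix-vector product. The variable names `docs`, `q`, and `dot_scores` are assumptions chosen for illustration; the values are the ones from the worked example.

```python
import numpy as np

# Document vectors (one per row) and the query from the worked example.
docs = np.array([[10.0, 0.0],
                 [0.5, 0.5],
                 [-1.0, 0.0]])
q = np.array([1.0, 0.5])

# One dot product per document: shape (3,).
dot_scores = docs @ q
print(dot_scores.tolist())  # [10.0, 0.75, -1.0]
```
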
Part 2: Cosine Similarity Normalization

- Cosine: $\cos(\theta_i) = \langle d_i, q \rangle / (\|d_i\|_2 \|q\|_2) \in [-1, 1]$ for nonzero vectors.
- It depends only on the angle between the vectors, not on their magnitudes.
- $\|q\|_2 = \sqrt{1^2 + 0.5^2} \approx 1.118$; document norms vary widely ($\|d_0\|_2 = 10$, $\|d_1\|_2 \approx 0.707$, $\|d_2\|_2 = 1$).
- After normalization, rankings can change: here Document 1 ($\cos \approx 0.949$) overtakes Document 0 ($\cos \approx 0.894$) because it is better aligned with $q$ despite its smaller magnitude; Document 2 stays last ($\cos \approx -0.894$).
- Cosine of 1 = parallel, 0 = orthogonal, -1 = anti-parallel.

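
To make the ranking change concrete, here is a minimal sketch that normalizes the Part 1 scores and compares the two orderings. The variable names are illustrative assumptions, not part of the original exercise.

```python
import numpy as np

docs = np.array([[10.0, 0.0],
                 [0.5, 0.5],
                 [-1.0, 0.0]])
q = np.array([1.0, 0.5])

dot_scores = docs @ q                        # unnormalized scores
doc_norms = np.linalg.norm(docs, axis=1)     # one L2 norm per document
q_norm = np.linalg.norm(q)                   # scalar query norm
cos = dot_scores / (doc_norms * q_norm)      # cosine per document

print(np.round(cos, 3).tolist())             # [0.894, 0.949, -0.894]

# Descending rankings: indices of documents, best first.
rank_dot = np.argsort(-dot_scores)
rank_cos = np.argsort(-cos)
print(rank_dot.tolist(), rank_cos.tolist())  # [0, 1, 2] [1, 0, 2]
```

Document 0 wins under the dot product purely because of its norm; once norms are divided out, Document 1 ranks first.
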
Why This Matters for ML

- Information retrieval: Raw dot products favor long documents, since larger term counts give larger norms; cosine divides out the document norm.
- Recommendation systems: Cosine compares rating patterns rather than absolute counts; it equals Pearson correlation only when each rating vector is mean-centered.
- Word embeddings: Semantic relatedness is usually measured with cosine, because embedding norms can reflect factors such as word frequency rather than meaning.
- Attention mechanisms: Some variants normalize queries and keys to unit norm, which turns their dot products into cosines.
- K-means clustering: Spherical K-means uses cosine similarity; standard K-means uses Euclidean distance, which is scale-sensitive.

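
The recommendation bullet relates cosine to correlation. A small sketch, using two made-up rating vectors `u` and `v` (hypothetical data, not from the source), suggests how the two coincide once each vector is mean-centered, while the raw cosine gives a different value.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two nonzero 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical ratings of four items by two users.
u = np.array([5.0, 4.0, 1.0, 2.0])
v = np.array([4.0, 5.0, 2.0, 1.0])

raw_cos = cosine(u, v)                              # cosine on raw ratings
centered_cos = cosine(u - u.mean(), v - v.mean())   # cosine after mean-centering
pearson = float(np.corrcoef(u, v)[0, 1])            # Pearson correlation

print(raw_cos, centered_cos, pearson)
```
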
ML Examples and Patterns

- TF-IDF retrieval: Compute cosine between query and document TF-IDF vectors; this ranks by topical relevance, not length.
- Collaborative filtering: User-user similarity via cosine on rating vectors identifies users with similar preference patterns.
- BERT semantic similarity: Embed sentences and compute cosine; a high value (e.g. $\sim 0.9$) suggests a paraphrase, though useful thresholds depend on the model.
- Contrastive learning (CLIP, SimCLR): Maximize cosine between positive pairs and minimize it for negative pairs.
- Nearest-neighbor search: With $\ell_2$-normalized embeddings, cosine reduces to a dot product, so maximum inner product search (MIPS) can be used directly.

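
The nearest-neighbor point can be checked numerically. The sketch below uses random vectors as stand-ins for embeddings (the names `emb` and `query` and the sizes are assumptions) and compares explicit cosine with a plain dot product on unit-normalized copies.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))    # hypothetical table of 5 embeddings, dim 8
query = rng.normal(size=8)       # hypothetical query embedding

# Explicit cosine: divide each dot product by both norms.
cos = (emb @ query) / (np.linalg.norm(emb, axis=1) * np.linalg.norm(query))

# Normalize once up front; afterwards a plain dot product is the cosine.
emb_unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query)
dot_unit = emb_unit @ query_unit

best = int(np.argmax(dot_unit))  # nearest neighbor via inner product search
print(np.allclose(cos, dot_unit), best)
```

Normalizing the stored embeddings once moves the division out of the query path, which is why indexes built for inner products can serve cosine queries.
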
Connection to Linear Algebra Theory

- Inner product is bilinear: $\langle \alpha d, q \rangle = \alpha \langle d, q \rangle$; cosine is invariant to positive scaling ($\alpha > 0$), while a negative $\alpha$ flips its sign.
- Cauchy-Schwarz: $|\langle d, q \rangle| \le \|d\| \|q\|$; dividing by $\|d\| \|q\|$ (for nonzero vectors) yields $|\cos(\theta)| \le 1$.
- Projection onto unit sphere: Cosine similarity is the dot product after normalizing both vectors to unit norm.
- Orthogonality: For nonzero vectors, $\langle d, q \rangle = 0 \Leftrightarrow \cos(\theta) = 0$ (angle 90°), regardless of magnitude.

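
The scale behavior and the bound on cosine follow in one line each from bilinearity and Cauchy-Schwarz. Written out for nonzero $d$, $q$ and a scalar $\alpha \neq 0$, with $\cos(d, q)$ used here as shorthand for the cosine of the angle between $d$ and $q$:

$$
\cos(\alpha d, q)
= \frac{\langle \alpha d, q \rangle}{\|\alpha d\|_2 \, \|q\|_2}
= \frac{\alpha \, \langle d, q \rangle}{|\alpha| \, \|d\|_2 \, \|q\|_2}
= \operatorname{sign}(\alpha) \cos(d, q),
\qquad
|\cos(d, q)| = \frac{|\langle d, q \rangle|}{\|d\|_2 \, \|q\|_2} \le 1 .
$$

So the cosine is unchanged for $\alpha > 0$ and changes sign for $\alpha < 0$, whereas the dot product itself scales by $\alpha$.
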
Numerical and Implementation Notes

- Axis for document norms: axis=1 computes row-wise norms (one per document); axis=0 would compute one norm per feature column, which is wrong here.
- Zero-magnitude vectors: If $\|d_i\| = 0$, cosine is undefined (division by zero); add a small epsilon to the denominator or filter such documents out.
- Negative cosines: Valid and meaningful (anti-alignment); e.g., antonyms in word embeddings.
- Ranking via argsort: np.argsort(-scores) sorts in descending order (highest first), because negating the scores reverses the ascending sort.
- Broadcasting: dot / (norms_docs * norm_q) broadcasts element-wise: (3,) / ((3,) * scalar).

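
The zero-norm and ranking notes can be combined into one helper. This is a minimal sketch: the function name `cosine_scores` and the `eps` floor are assumptions for illustration, and clamping the denominator is only one of the two options mentioned above (it assigns a zero vector a cosine of 0 rather than removing it).

```python
import numpy as np

def cosine_scores(docs, q, eps=1e-12):
    """Cosine of each row of docs with q; zero-norm rows score 0."""
    dot = docs @ q                                           # shape (n_docs,)
    denom = np.linalg.norm(docs, axis=1) * np.linalg.norm(q) # row-wise norms
    return dot / np.maximum(denom, eps)                      # avoid divide-by-zero

# The worked example plus an all-zero document (added here as a test case).
docs = np.array([[10.0, 0.0],
                 [0.5, 0.5],
                 [-1.0, 0.0],
                 [0.0, 0.0]])
q = np.array([1.0, 0.5])

scores = cosine_scores(docs, q)
ranking = np.argsort(-scores)   # descending: highest cosine first
print(ranking.tolist())         # [1, 0, 3, 2]
```
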
Numerical and Shape Notes

- $q \in \mathbb{R}^2$ (shape (2,)), $\text{docs} \in \mathbb{R}^{3 \times 2}$ (shape (3, 2)).
- docs @ q $\to$ shape (3,) (one score per document).
- np.linalg.norm(docs, axis=1) $\to$ shape (3,) (one norm per document).
- np.linalg.norm(q) $\to$ scalar (query norm).
- cos $\to$ shape (3,) after element-wise division.

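
These shapes can be verified directly. The sketch below restates them as assertions on the worked example (variable names are illustrative) and adds the axis=0 case for contrast.

```python
import numpy as np

docs = np.array([[10.0, 0.0],
                 [0.5, 0.5],
                 [-1.0, 0.0]])
q = np.array([1.0, 0.5])

scores = docs @ q
doc_norms = np.linalg.norm(docs, axis=1)
q_norm = np.linalg.norm(q)
cos = scores / (doc_norms * q_norm)

assert docs.shape == (3, 2) and q.shape == (2,)
assert scores.shape == (3,)        # one score per document
assert doc_norms.shape == (3,)     # one norm per document
assert np.ndim(q_norm) == 0        # scalar query norm
assert cos.shape == (3,)           # element-wise division keeps (3,)

# axis=0 would instead give one norm per feature column: shape (2,).
col_norms = np.linalg.norm(docs, axis=0)
assert col_norms.shape == (2,)
```
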
Pedagogical Significance

- Dot product vs. cosine is a core distinction: many retrieval and embedding applications use cosine to avoid magnitude bias.
- Geometric intuition: The dot product is the length of the projection of $d$ onto $q$, scaled by $\|q\|$; cosine depends only on the angle.
- Practical impact: Rankings can differ substantially when magnitudes vary; choosing the wrong metric degrades retrieval quality.
- Scale invariance: Cosine treats $d$ and $10d$ identically; the dot product scales the score of $10d$ by 10.
- Foundation for retrieval: Understanding this pattern is essential for search, recommendation, and embedding-based systems.