Part 1: Convex Combination as Interpolation - Convex combination: $v = \alpha a + (1-\alpha) b$ with $\alpha \in [0,1]$ ensures $v$ lies on the line segment $[a, b]$. - For $\alpha = 0$, $v = b$; for $\alpha = 1$, $v = a$; for $\alpha = 0.5$, $v$ is the midpoint. - Affine combinations allow $\alpha \notin [0,1]$ (extrapolation beyond the segment), but may leave the convex hull. - ML applications: Model ensembles (average predictions), embedding arithmetic (analogy queries), data augmentation (Mixup). - Closure property: Any linear combination $\sum_i c_i v_i$ with $v_i \in \mathbb{R}^d$ yields a vector in $\mathbb{R}^d$ (vector space axiom).
Part 2: Centering via Projection Matrix - Projection matrix: $P = I - (1/n) \mathbf{1} \mathbf{1}^\top$ where $(1/n) \mathbf{1} \mathbf{1}^\top$ is the mean projector. - Action: $Px = x - \bar{x} \mathbf{1}$ subtracts the mean $\bar{x} = (1/n) \sum_i x_i$ from each entry. - Result: $\mathbf{1}^\top (Px) = 0$ (sum equals zero) and $\|Px\| \le \|x\|$ (projection contracts or preserves norm). - Idempotent: $P^2 = P$ because projecting a centered vector does nothing. - Symmetric: $P^\top = P$ because $(\mathbf{1} \mathbf{1}^\top)^\top = \mathbf{1} \mathbf{1}^\top$ (outer product is symmetric). - Computational cost: Forming $P$ explicitly is $O(n^2)$ memory; in practice, compute x - x.mean() directly ($O(n)$).
Part 3: Orthogonal Decomposition - Any $x \in \mathbb{R}^n$ decomposes uniquely as $x = \underbrace{\bar{x} \mathbf{1}}_{\text{constant}} + \underbrace{(x - \bar{x} \mathbf{1})}_{\text{zero-mean}}$. - The two components are orthogonal: $\langle \mathbf{1}, x - \bar{x} \mathbf{1} \rangle = 0$. - Direct sum: $\mathbb{R}^n = \text{span}(\mathbf{1}) \oplus S$ where $S = \ker(\mathbf{1}^\top)$. - Dimension: $\dim(\text{span}(\mathbf{1})) = 1$, $\dim(S) = n-1$, and $1 + (n-1) = n$. - Geometric intuition: In $\mathbb{R}^3$, $S$ is a plane through the origin perpendicular to $\mathbf{1} = [1,1,1]^\top$.
ML Examples - Word embeddings: Convex combinations interpolate semantics; $0.5 \cdot \text{king} + 0.5 \cdot \text{queen}$ yields a âroyalâ concept. - Mixup augmentation: $x' = \lambda x_i + (1-\lambda) x_j$ with $\lambda \sim \text{Beta}(\alpha, \alpha)$ creates synthetic training examples. - Stochastic Weight Averaging: Average model parameters over trajectory: $\theta_{\text{avg}} = (1/T) \sum_t \theta_t$ improves generalization. - PCA preprocessing: Center data via $X_c = X - \mu$ (equivalently, $Px$ per column) before computing eigenvectors of covariance. - Batch normalization: Normalize activations $\hat{x} = (x - \mu_B) / \sigma_B$ where $\mu_B$ is the batch mean (centering step). - Residual connections: ResNet blocks compute $y = x + F(x)$, an affine combination (not convex unless normalized).
Pedagogical Notes - Dual perspectives: Construction (build vectors via combinations) vs. decomposition (split into orthogonal components). - Why two unrelated blocks? Both are vector space operations, but highlight different aspects: closure under operations (Part 1) and subspace characterization (Part 2). - Minimal example: 4D embeddings and 7D signal are small enough to inspect manually but large enough to be non-trivial. - Verification: Printing shapes and sums provides sanity checks that operations succeeded. - Generalization: The pattern extends to $\mathbb{R}^{1000}$ embeddings, $n=10^6$ time series, and batch/channel dimensions in CNNs.
Connection to ML Applications - Embedding arithmetic: Analogy queries like âkingâ - âmanâ + âwomanâ â âqueenâ rely on vector space structure. - Model interpolation: Linear mode connectivity (Frankle et al.) shows that interpolating between trained models often yields low-loss paths. - Data centering: Standardization $(x - \mu) / \sigma$ combines centering (projection) and scaling (normalization). - Low-rank approximation: Centered data enables PCA/SVD to capture variance directions rather than the mean offset. - Gradient descent: Parameter updates $\theta \leftarrow \theta - \eta \nabla L$ are affine combinations (weight $1$, gradient weight $-\eta$).
Connection to Linear Algebra Theory - Vector space axioms: Closure under addition/scalar multiplication, associativity, commutativity, identity, inverses, distributivity. - Subspace criteria: $S \subseteq V$ is a subspace iff $u, v \in S \Rightarrow u+v \in S$ and $\alpha v \in S$ for all $\alpha$. - Kernel characterization: $S = \ker(\mathbf{1}^\top)$ defines $S$ via a linear functional; kernels are always subspaces. - Projection theorem: For any $x \in V$ and subspace $S$, there exists a unique $x_S \in S$ minimizing $\|x - x_S\|$ (the projection). - Direct sum: $V = S \oplus S^\perp$ decomposes $V$ into orthogonal subspaces (here, $S^\perp = \text{span}(\mathbf{1})$). - Gram-Schmidt preview: Orthogonalizing vectors starts by projecting each against the span of previous ones (centering is the first step).
Numerical and Implementation Notes - Avoid forming $P$ explicitly: For large $n$, compute x - x.mean() instead of P @ x to save $O(n^2)$ memory. - Column vector for $\mathbf{1}$: Use np.ones((n, 1)) (shape (n, 1)) to enable outer product one @ one.T â (n, n). - Broadcasting: P @ x works when x has shape (n,) or (n, 1); NumPy broadcasts appropriately. - Floating-point sum: Centered vector sum is $\approx 10^{-15}$, not exactly zero, due to rounding errors. - Numerical stability: Projection $P$ is numerically stable (subtracting mean doesnât amplify errors); explicit matrix formation is the only concern.
Numerical and Shape Notes - Embedding shapes: a, b are shape (4,); convex combination yields shape (4,) (preserved). - Signal shape: x is shape (7,); projection yields shape (7,) (preserved). - Projection matrix: $P \in \mathbb{R}^{7 \times 7}$ (shape (7, 7)); expensive to store for large $n$. - Outer product: one @ one.T with one shape (7, 1) yields shape (7, 7) (rank-1 matrix). - Identity matrix: np.eye(n) yields shape (n, n); subtracting outer product yields $P$.
ML Context: From Vector Spaces to Deep Learning - Embeddings as vectors: Word2Vec, GloVe, BERT embeddings all live in $\mathbb{R}^d$; distances and angles have semantic meaning. - Layer activations: Each hidden layer in a neural network maps $\mathbb{R}^{d_{in}} \to \mathbb{R}^{d_{out}}$; activations are vectors. - Parameter space: Model parameters $\theta \in \mathbb{R}^p$ form a high-dimensional vector space; optimization navigates this space. - Gradient flow: Backpropagation computes gradients $\nabla_{\theta} L \in \mathbb{R}^p$; updates are vector additions. - Normalization layers: Batch norm, layer norm, group norm all center activations (projection onto zero-mean subspace). - Residual connections: $y = x + F(x)$ exploits vector addition to enable gradient flow through 100+ layers. - Attention mechanisms: Queries, keys, values are vectors; scaled dot-product attention computes inner products and linear combinations.
Comments