Chapter 02 — Basis, Dimension, and Coordinate Systems

Overview

Purpose of the Chapter

This chapter formalizes the concept of dimension—the single most important numerical invariant of a vector space—and its foundation: the notion of a basis. While Chapter 1 introduced vector spaces, subspaces, and linear independence as abstract structures, this chapter makes those ideas concrete and computational by establishing that every finite-dimensional vector space admits a basis (a linearly independent spanning set) and that all bases have the same size, called the dimension. This universality is profound: dimension is an intrinsic property of the space, independent of how we choose to represent vectors. Understanding dimension transforms qualitative questions (“Can this model represent that function?”) into quantitative ones (“Does the parameter space have sufficient dimension?”), and clarifies when two spaces are isomorphic (same dimension implies same structure, up to relabeling).

The chapter develops the machinery for working with coordinates: given a basis \(\mathcal{B} = \{\mathbf{b}_1, \dots, \mathbf{b}_n\}\), every vector \(\mathbf{v}\) has a unique coordinate representation \([\mathbf{v}]_\mathcal{B} \in \mathbb{R}^n\), and changing basis corresponds to a linear transformation (change-of-basis matrix). This duality—abstract vectors versus concrete coordinate vectors—is the bridge between geometric intuition (vectors as arrows, subspaces as flats) and algebraic computation (vectors as tuples, linear maps as matrices). In machine learning, this duality appears everywhere: raw features (abstract vectors) versus their representation in a learned basis (PCA components, embeddings), model parameters in the “standard” parameterization versus reparameterized coordinates (natural gradient, whitening), and latent codes in autoencoders (coordinates in a learned basis of the data manifold).

The chapter progresses systematically: we prove the Basis Extension Theorem (every independent set extends to a basis), the Replacement Theorem (bases can be exchanged incrementally, controlling dimension), and the Dimension Theorem (all bases have the same size). These results establish dimension as well-defined and enable the fundamental rank-nullity theorem, which relates the dimensions of domain, image, and kernel of a linear transformation: \(\dim(\text{domain}) = \dim(\text{image}) + \dim(\text{kernel})\). This equation is the algebraic heart of linear algebra, encoding constraints on expressivity, redundancy, and information flow that govern everything from the solvability of linear systems to the representational capacity of neural network layers.

Concrete ML application: In Principal Component Analysis (PCA) applied to a dataset of \(n = 1000\) images with \(d = 784\) pixels (MNIST), the data matrix \(X \in \mathbb{R}^{1000 \times 784}\) has rank at most \(\min(1000, 784) = 784\), meaning the \(1000\) data vectors lie in a subspace of dimension at most \(784\). In practice, if we center the data (subtract the mean image from every row of \(X\)), compute the eigendecomposition of the covariance matrix \(\Sigma = \frac{1}{n} X^\top X\), and find that only \(k = 50\) eigenvalues are significantly non-zero (say, capturing 95% of variance), this reveals that the data lie close to a 50-dimensional subspace \(U \subseteq \mathbb{R}^{784}\). The top \(k\) eigenvectors \(\{\mathbf{u}_1, \dots, \mathbf{u}_{50}\}\) form an orthonormal basis for this subspace, and projecting data onto this basis via \(\tilde{\mathbf{x}}_i = U_k^\top \mathbf{x}_i \in \mathbb{R}^{50}\), where \(U_k = [\mathbf{u}_1 | \cdots | \mathbf{u}_{50}]\), yields 50-dimensional coordinate representations (the “principal component scores”). Each \(\tilde{\mathbf{x}}_i\) is the coordinate vector, with respect to the PCA basis \(\mathcal{B} = \{\mathbf{u}_1, \dots, \mathbf{u}_{50}\}\), of the orthogonal projection of \(\mathbf{x}_i\) onto \(U\); it equals \([\mathbf{x}_i]_{\mathcal{B}}\) exactly only when \(\mathbf{x}_i\) already lies in \(U\). This change of basis from the standard pixel basis (where each coordinate is a pixel value) to the PCA basis (where each coordinate is a component score) reduces dimensionality from 784 to 50, dramatically lowering storage and computation while preserving essential structure. The fact that this compressed representation works well for downstream tasks (classification, clustering) supports the assumption that the intrinsic dimension of the data (the dimension of the subspace they lie near) is indeed \(\approx 50\), far below the ambient 784 dimensions.
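The PCA change of basis can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data standing in for MNIST: the random seed, the noise level, and the names `U_k`, `scores`, and `X_hat` are assumptions made for the example, not part of the dataset described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 784, 50  # samples, ambient dimension, assumed intrinsic dimension

# Synthetic stand-in for MNIST: points near a k-dimensional subspace plus small noise.
latent = rng.normal(size=(n, k))
mixing = rng.normal(size=(k, d))
X = latent @ mixing + 0.01 * rng.normal(size=(n, d))

# PCA works with centered data, so subtract the column means first.
Xc = X - X.mean(axis=0)
Sigma = (Xc.T @ Xc) / n                       # d x d covariance matrix

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]             # re-sort to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

U_k = eigvecs[:, :k]                          # d x k, orthonormal basis of the principal subspace
scores = Xc @ U_k                             # n x k coordinates in the PCA basis
X_hat = scores @ U_k.T                        # projections written back in pixel coordinates

explained = eigvals[:k].sum() / eigvals.sum()
rel_error = np.linalg.norm(Xc - X_hat) / np.linalg.norm(Xc)
print(f"variance captured by top {k} components: {explained:.4f}")
print(f"relative reconstruction error: {rel_error:.4f}")
```

Because the synthetic data were built to lie near a 50-dimensional subspace, almost all variance is captured; on real images the captured fraction depends on the data and on the threshold chosen for \(k\).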

Conceptual Scope

The chapter covers four interrelated conceptual pillars, each building on the vector space foundations from Chapter 1. (1) Basis: Minimal Spanning Sets: A basis is simultaneously a spanning set (every vector is a linear combination of basis elements) and a linearly independent set (no element is redundant). This duality makes bases the “minimal complete coordinate systems”: they have just enough elements to specify any vector uniquely, with no waste. The existence of a basis is non-trivial for finite-dimensional spaces and requires proof (typically via the Basis Extension or Replacement Theorems). Standard examples include the standard basis \(\{\mathbf{e}_1, \dots, \mathbf{e}_n\}\) in \(\mathbb{R}^n\), where \(\mathbf{e}_i\) has 1 in position \(i\) and 0 elsewhere; orthonormal bases arising from Gram-Schmidt orthogonalization or eigendecompositions; and Fourier bases in function spaces (sines and cosines spanning periodic functions). Non-examples include redundant spanning sets (not independent) and insufficient independent sets (not spanning). The utility of a basis lies in coordinatization: given \(\mathcal{B} = \{\mathbf{b}_1, \dots, \mathbf{b}_n\}\), every \(\mathbf{v}\) has unique coordinates \(c_1, \dots, c_n\) satisfying \(\mathbf{v} = \sum_i c_i \mathbf{b}_i\), enabling computation via familiar \(n\)-tuples.

(2) Dimension: The Universal Invariant: Dimension is defined as the size of any basis. The Dimension Theorem proves that all bases of a given space have the same size, so dimension is an intrinsic property of the space, not dependent on the chosen basis. For \(\mathbb{R}^n\), \(\dim(\mathbb{R}^n) = n\); for polynomial spaces \(\mathcal{P}_k\) of degree at most \(k\), \(\dim(\mathcal{P}_k) = k+1\) (basis: \(\{1, x, x^2, \dots, x^k\}\)); for matrix spaces \(\mathbb{R}^{m \times n}\), \(\dim(\mathbb{R}^{m \times n}) = mn\). Dimension quantifies degrees of freedom: an \(n\)-dimensional space requires \(n\) parameters to specify a vector. Subspace dimensions satisfy \(\dim(W) \leq \dim(V)\) when \(W \subseteq V\), with equality implying \(W = V\). The rank-nullity theorem relates dimensions: for a linear map \(T: V \to W\), \(\dim(V) = \dim(\ker T) + \dim(\text{im} T)\), where \(\ker T\) is the null space (dimensions lost) and \(\text{im} T\) is the image (dimensions preserved). This formula is fundamental: it explains why underdetermined systems have infinitely many solutions (\(\dim(\ker A) > 0\)), why overdetermined systems typically have no exact solution (\(\dim(\text{im} A) < \dim(\text{codomain})\)), and why low-rank approximations compress information (reducing dimension of the image space).

(3) Coordinate Representations and Change of Basis: Once a basis \(\mathcal{B}\) is fixed, the coordinate map \([\cdot]_\mathcal{B} : V \to \mathbb{R}^n\) is a linear isomorphism: it preserves addition and scaling, and is bijective (one-to-one and onto). This means \(V\) and \(\mathbb{R}^n\) are “the same” from an algebraic perspective—any statement true in one translates to the other. However, different bases yield different coordinate systems. Changing from basis \(\mathcal{B}\) to basis \(\mathcal{B}'\) transforms coordinates via the change-of-basis matrix \(P_{\mathcal{B} \to \mathcal{B}'}\), whose columns are \([\mathbf{b}_i]_{\mathcal{B}'}\) (old basis vectors expressed in new coordinates). The transformation law is \([\mathbf{v}]_{\mathcal{B}'} = P_{\mathcal{B} \to \mathcal{B}'} [\mathbf{v}]_\mathcal{B}\). Choosing a basis is choosing a coordinate system, and different bases are convenient for different purposes: the standard basis makes vectors explicit as tuples, the eigenbasis diagonalizes operators (decoupling coordinates), and the Fourier basis decomposes signals into frequency components. In ML, choosing a good basis (feature representation) is often the determining factor in model performance: PCA chooses the variance-maximizing basis, Independent Component Analysis (ICA) seeks a statistically independent basis, and autoencoders learn a basis (encoder weights) optimized for reconstruction and downstream tasks.

(4) Rank, Nullity, and the Fundamental Theorem: The rank of a matrix \(A \in \mathbb{R}^{m \times n}\) is the dimension of its column space (equivalently, row space), and the nullity is the dimension of its null space. The rank-nullity theorem states \(n = \text{rank}(A) + \text{nullity}(A)\), partitioning the \(n\) input dimensions into those preserved (rank) and those annihilated (nullity). Full rank (\(\text{rank}(A) = \min(m,n)\)) means maximal preservation: if \(m \geq n\) and \(\text{rank}(A) = n\), then \(A\) is injective (one-to-one), so \(A\mathbf{x} = A\mathbf{y} \implies \mathbf{x} = \mathbf{y}\) (no information loss in encoding). If \(m \leq n\) and \(\text{rank}(A) = m\), then \(A\) is surjective (onto), so for every \(\mathbf{b} \in \mathbb{R}^m\) there is some \(\mathbf{x}\) with \(A\mathbf{x} = \mathbf{b}\) (every output is reachable). Rank deficiency (\(\text{rank}(A) < \min(m,n)\)) indicates degeneracy: columns (or rows) are linearly dependent, redundancy exists, and \(A\) fails to be invertible (if square). Computing rank (via Gaussian elimination or the SVD, or by counting nonzero eigenvalues when the matrix is symmetric) is a fundamental diagnostic in ML: feature matrices with dependent columns signal multicollinearity, weight matrices with low rank indicate bottleneck layers, and data matrices with low rank reveal low intrinsic dimension (justifying dimensionality reduction).

Concrete ML application: In linear regression with \(n = 100\) samples and \(d = 50\) features, the design matrix \(X \in \mathbb{R}^{100 \times 50}\) has rank at most \(\min(100, 50) = 50\). If two features are identical (say, \(X_{:,5} = X_{:,10}\)), the columns are dependent, so \(\text{rank}(X) \leq 49\). The least-squares solution \(\hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y}\) requires \(X^\top X \in \mathbb{R}^{50 \times 50}\) to be invertible, which demands \(\text{rank}(X) = 50\) (full column rank). With dependent columns, \(\text{rank}(X^\top X) = \text{rank}(X) < 50\), so \(X^\top X\) is singular (non-invertible), and the solution is non-unique: infinitely many weight vectors yield identical predictions. The dimension of the solution set is the nullity of \(X^\top X\), which equals the number of redundant features. In practice, regularization (ridge regression, \(\hat{\mathbf{w}}_\lambda = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}\)) adds a positive multiple of the identity, which makes the matrix positive definite and hence invertible for every \(\lambda > 0\), so a unique solution exists. The regularization parameter \(\lambda\) controls the strength of the shrinkage: large \(\lambda\) pulls solutions toward zero (the origin of parameter space), while small \(\lambda\) allows larger norms. From a dimension perspective, regularization implicitly reduces the effective dimension of the parameter space by shrinking the solution most strongly along eigenvector directions of \(X^\top X\) with small eigenvalues (near-null directions), stabilizing the solution and improving generalization by biasing toward lower-complexity (lower-effective-dimension) models.
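The duplicated-feature scenario can be reproduced numerically. The sketch below uses random data; the seed, the tolerance `1e-8`, the value `lam = 0.1`, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 50
X = rng.normal(size=(n, d))
X[:, 9] = X[:, 4]                 # 0-based indices: feature 10 duplicates feature 5
y = rng.normal(size=n)

gram = X.T @ X
rank_X = np.linalg.matrix_rank(X, tol=1e-8)
rank_gram = np.linalg.matrix_rank(gram, tol=1e-8)
print("rank(X) =", rank_X, " rank(X^T X) =", rank_gram)

# Non-uniqueness: moving along a null-space direction leaves predictions unchanged.
w_min_norm, *_ = np.linalg.lstsq(X, y, rcond=1e-10)  # minimum-norm least-squares solution
null_dir = np.zeros(d)
null_dir[4], null_dir[9] = 1.0, -1.0                 # X @ null_dir is the zero vector
w_other = w_min_norm + 3.0 * null_dir
print("same predictions:", np.allclose(X @ w_min_norm, X @ w_other))

# Ridge: X^T X + lam * I is positive definite for lam > 0, so the solve succeeds
# and returns the unique minimizer of ||y - Xw||^2 + lam * ||w||^2.
lam = 0.1
w_ridge = np.linalg.solve(gram + lam * np.eye(d), X.T @ y)
print("ridge gives the duplicates equal weight:", np.isclose(w_ridge[4], w_ridge[9]))
```

The ridge solution assigns the two identical columns the same weight, because the penalized objective is symmetric in them and has a single minimizer.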

Questions This Chapter Answers

This chapter addresses foundational questions about structure, representation, and computation in vector spaces, all revolving around the notion of dimension. What is a basis, and why does every finite-dimensional vector space have one? We define a basis as a linearly independent spanning set and prove existence via the Basis Extension Theorem: start with any independent set (the empty set is allowed; it is a basis of the zero space \(\{\mathbf{0}\}\)) and repeatedly add a vector lying outside the current span until the set spans the whole space. Choosing each new vector outside the current span is exactly what preserves independence, and finite-dimensionality guarantees that the process stops. The proof is constructive, providing an algorithm (though inefficient in practice), and establishes that bases are neither exotic nor optional: they are fundamental and ubiquitous. Are all bases of a given space the same size? Yes: the Dimension Theorem (also called the Basis Invariance Theorem) proves that any two bases of a finite-dimensional vector space \(V\) have the same number of elements. This non-trivial result relies on the Replacement Theorem (or the Steinitz Exchange Lemma), which shows that if \(\{\mathbf{v}_1, \dots, \mathbf{v}_m\}\) is independent and \(\{\mathbf{w}_1, \dots, \mathbf{w}_n\}\) spans, then \(m \leq n\). Applying this to two bases (each independent and spanning) gives \(m \leq n\) and \(n \leq m\), hence \(m = n\). Thus dimension is well-defined.
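The constructive idea behind basis extension can be sketched for \(\mathbb{R}^n\). The function name `extend_to_basis`, the use of standard basis vectors as candidates, and the rank-based independence test are choices made for this illustration; the theorem itself allows any vector outside the current span.

```python
import numpy as np

def extend_to_basis(independent, n, tol=1e-10):
    """Greedily extend linearly independent vectors in R^n to a basis of R^n.

    Candidates come from the standard basis, which spans R^n, so the loop
    always ends with exactly n vectors (assuming the input is independent).
    """
    basis = [np.asarray(v, dtype=float) for v in independent]
    for e in np.eye(n):
        if len(basis) == n:
            break
        candidate = np.column_stack(basis + [e])
        # e lies outside span(basis) exactly when adding it raises the rank.
        if np.linalg.matrix_rank(candidate, tol=tol) == len(basis) + 1:
            basis.append(e)
    return np.column_stack(basis)

start = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]   # two independent vectors in R^3
B = extend_to_basis(start, n=3)
print(B)                                      # columns: the two inputs plus one added vector
print("rank:", np.linalg.matrix_rank(B))
```

Starting from the empty list also works: `extend_to_basis([], n=3)` returns the standard basis, matching the remark that the empty set is a valid starting point.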

How do we compute coordinates, and what does changing basis mean? Given a basis \(\mathcal{B} = \{\mathbf{b}_1, \dots, \mathbf{b}_n\}\) and a vector \(\mathbf{v} = \sum_i c_i \mathbf{b}_i\), the coordinates are \([\mathbf{v}]_\mathcal{B} = (c_1, \dots, c_n)^\top\). To compute them when \(V = \mathbb{R}^n\), solve the linear system \(B \mathbf{c} = \mathbf{v}\), where \(B = [\mathbf{b}_1 | \cdots | \mathbf{b}_n]\) is the matrix with basis vectors as columns, yielding \(\mathbf{c} = B^{-1} \mathbf{v}\). Changing from basis \(\mathcal{B}\) to \(\mathcal{B}'\) transforms coordinates via \([\mathbf{v}]_{\mathcal{B}'} = P [\mathbf{v}]_\mathcal{B}\), where \(P\) is the change-of-basis matrix whose columns are the old basis vectors expressed in the new basis. This is a change of coordinates, not of the vector: \(\mathbf{v}\) itself and every linear relationship among vectors are preserved, while the numerical coordinate values change. (The matrix of a linear operator changes correspondingly by the similarity transformation \(A' = P A P^{-1}\).) In ML, this corresponds to feature transformations (PCA whitening, normalization) or reparameterization (gradient preconditioning).
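A small worked example in \(\mathbb{R}^2\) makes the recipe concrete. The two bases and the vector `v` below are arbitrary choices for illustration.

```python
import numpy as np

# Two bases of R^2, stored as matrix columns (illustrative choices).
B = np.array([[1.0, 1.0],
              [0.0, 1.0]])        # old basis: b1 = (1, 0), b2 = (1, 1)
B_new = np.array([[2.0, 0.0],
                  [0.0, 1.0]])    # new basis: b1' = (2, 0), b2' = (0, 1)

v = np.array([3.0, 2.0])

# Coordinates solve B c = v; solving is preferred to forming B^{-1} explicitly.
c_old = np.linalg.solve(B, v)        # [v]_B
c_new = np.linalg.solve(B_new, v)    # [v]_B'

# Column i of P is [b_i]_B', so P solves B_new @ P = B.
P = np.linalg.solve(B_new, B)

print("[v]_B  =", c_old)
print("[v]_B' =", c_new)
print("P @ [v]_B =", P @ c_old)      # agrees with [v]_B'
```

Both coordinate vectors describe the same \(\mathbf{v}\): multiplying either one by its own basis matrix recovers \((3, 2)\).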

What is the rank-nullity theorem, and why does it matter? For a linear map \(T: V \to W\) (or matrix \(A \in \mathbb{R}^{m \times n}\)), the theorem states \(\dim(V) = \dim(\ker T) + \dim(\text{im} T)\) (or \(n = \text{nullity}(A) + \text{rank}(A)\)). This partitions the domain dimension into directions annihilated by \(T\) (kernel) and directions preserved (image). It explains solution existence and uniqueness: for \(A\mathbf{x} = \mathbf{b}\), solutions exist iff \(\mathbf{b} \in \text{im}(A)\) (equivalently, \(\text{rank}([A | \mathbf{b}]) = \text{rank}(A)\)), and a solution, when one exists, is unique iff \(\ker(A) = \{\mathbf{0}\}\) (nullity = 0, equivalently, full column rank). The theorem also governs information flow in compositions: if \(T: U \to V\) and \(S: V \to W\), then \(\text{rank}(ST) \leq \min(\text{rank}(S), \text{rank}(T))\), and a bottleneck (low rank) anywhere in the chain limits total rank. In neural networks, each linear layer \(\mathbf{h}^{(\ell)} = W^{(\ell)} \mathbf{h}^{(\ell-1)}\) (ignoring the nonlinearity) has image dimension \(\leq \text{rank}(W^{(\ell)})\), so a single low-rank layer caps the rank of the composed linear map through every subsequent layer, bounding representational capacity.
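These statements can be checked numerically on a small matrix. In the sketch below the sizes, the seed, the tolerance `TOL`, and the helper `solvable` are all assumptions introduced for illustration; the matrix is built as a product so that its rank is known in advance.

```python
import numpy as np

rng = np.random.default_rng(2)
TOL = 1e-10
m, n, r = 6, 8, 3
A = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))   # rank r by construction (almost surely)

rank = np.linalg.matrix_rank(A, tol=TOL)

# Null-space basis: right singular vectors beyond the first `rank`.
_, _, Vt = np.linalg.svd(A)            # full SVD, Vt is n x n
N = Vt[rank:].T                        # n x (n - rank); columns span ker A
nullity = N.shape[1]
print(f"n = {n}, rank = {rank}, nullity = {nullity}")

def solvable(A, b):
    """A x = b has a solution iff appending b to A does not raise the rank."""
    augmented = np.column_stack([A, b])
    return np.linalg.matrix_rank(augmented, tol=TOL) == np.linalg.matrix_rank(A, tol=TOL)

b_in = A @ rng.normal(size=n)          # lies in the image by construction
b_out = rng.normal(size=m)             # generic vector, almost surely outside the 3-dim image
print("b_in solvable:", solvable(A, b_in), " b_out solvable:", solvable(A, b_out))

# Composition: the rank of S A cannot exceed the rank of either factor.
S = rng.normal(size=(5, m))
rank_SA = np.linalg.matrix_rank(S @ A, tol=TOL)
print("rank(S A) =", rank_SA)
```

Since the nullity here is positive, any solution of \(A\mathbf{x} = \mathbf{b}\) can be shifted by a combination of the columns of `N` without changing \(A\mathbf{x}\), which is the non-uniqueness described above.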

How does dimension relate to expressivity and model capacity? Dimension quantifies degrees of freedom: an \(n\)-dimensional parameter space admits \(n\) independent adjustments, corresponding to \(n\) “knobs” that can be tuned. In hypothesis classes, higher dimension typically implies greater expressivity (more functions can be represented), but also higher sample complexity (more data needed to generalize). For example, linear classifiers in \(\mathbb{R}^d\) have parameter space \(\mathbb{R}^{d+1}\) (weights + bias), dimension \(d+1\), and VC dimension \(d+1\) (related but distinct). Kernel methods embed data into high-(or infinite-)dimensional feature spaces, vastly increasing dimension and expressivity but risking overfitting unless regularized. Low-rank matrix factorization (embedding dimension \(k\)) reduces parameter count from \(mn\) to \(k(m+n)\) by restricting to a \(k\)-dimensional latent subspace, trading expressivity for generalization. Understanding dimension clarifies these trade-offs quantitatively: dimension bounds capacity, sample complexity, and computational cost.

What are the dimensions of common subspaces (null space, column space, row space)? For \(A \in \mathbb{R}^{m \times n}\), the dimensions are: \(\dim(\text{Col}(A)) = \text{rank}(A) \leq \min(m,n)\), \(\dim(\text{Nul}(A)) = n - \text{rank}(A)\) (by rank-nullity), \(\dim(\text{Row}(A)) = \text{rank}(A)\) (row rank equals column rank), and \(\dim(\text{Left-Nul}(A)) = m - \text{rank}(A)\) (dimension of \(\{\mathbf{y} : A^\top \mathbf{y} = \mathbf{0}\}\)). These four subspaces—column, null, row, left-null—form the fundamental theorem of linear algebra: they decompose domain and codomain into orthogonal complements (when equipped with inner products). In regression, \(\text{Col}(X)\) is the space of fitted values, \(\text{Nul}(X)\) parameterizes solution non-uniqueness, \(\text{Row}(X)\) represents “active directions” in feature space, and \(\text{Left-Nul}(X)\) corresponds to unmodeled variation in the response. Dimension counts for these subspaces diagnose problems: \(\dim(\text{Nul}(X)) > 0\) signals multicollinearity, \(\dim(\text{Col}(X)) < m\) means some responses are unattainable, and \(\text{rank}(X) < d\) indicates redundant features.
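Orthonormal bases for all four subspaces can be read off a full SVD. This is one convenient way to obtain them, not the only one; the matrix sizes, the seed, and the relative threshold `1e-10` are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 5, 7, 2
A = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))   # rank r by construction

U, s, Vt = np.linalg.svd(A)                  # full SVD: U is m x m, Vt is n x n
rank = int(np.sum(s > 1e-10 * s[0]))         # count singular values above a relative threshold

col_space = U[:, :rank]       # orthonormal basis of Col(A), a subspace of R^m
left_null = U[:, rank:]       # orthonormal basis of Left-Nul(A), a subspace of R^m
row_space = Vt[:rank].T       # orthonormal basis of Row(A), a subspace of R^n
null_space = Vt[rank:].T      # orthonormal basis of Nul(A), a subspace of R^n

dims = {
    "Col": col_space.shape[1],
    "Left-Nul": left_null.shape[1],
    "Row": row_space.shape[1],
    "Nul": null_space.shape[1],
}
print(dims)
```

The counts add up as the formulas predict: column space plus left null space gives \(m\), and row space plus null space gives \(n\).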

Concrete ML application: A practitioner building a recommender system uses collaborative filtering with a user-item matrix \(R \in \mathbb{R}^{1000 \times 500}\) (1000 users, 500 items), where most entries are missing (sparse observations). To impute missing entries, she factorizes \(R \approx UV^\top\) with \(U \in \mathbb{R}^{1000 \times k}, V \in \mathbb{R}^{500 \times k}\), assuming rank \(k = 20\) (latent factors like genres, themes). This assumes \(\dim(\text{Col}(R)) \leq 20\), i.e., all item rating profiles lie in a 20-dimensional subspace of \(\mathbb{R}^{1000}\) (spanned by \(k\) latent factors). Each item's rating column \(R_{:,j} \approx U V_{j,:}^\top\) is a linear combination of the \(k\) columns of \(U\) with coefficients \(V_{j,:}\), and each user's rating row \(R_{i,:} \approx U_{i,:} V^\top\) is a linear combination of the \(k\) columns of \(V\) (as row vectors) with coefficients \(U_{i,:}\). When \(U\) and \(V\) have full column rank, the columns of \(U\) therefore form a basis for the column space of \(UV^\top\), the columns of \(V\) form a basis for its row space, and the rows \(U_{i,:}\) and \(V_{j,:}\) are the coordinates of user \(i\) and item \(j\) in the \(k\)-dimensional latent space. The predicted rating \(\hat{R}_{ij} = U_{i,:} V_{j,:}^\top\) is the inner product of user \(i\)’s coordinates and item \(j\)’s coordinates. By reducing dimension from 500 (full item space) to 20 (latent subspace), the model dramatically cuts parameters (from \(1000 \times 500 = 500000\) to \(1000 \times 20 + 500 \times 20 = 30000\)), enabling generalization from sparse data. If the true rank of rating patterns is at most 20, this compression is lossless (in the limit of infinite data); if the true rank is higher, the model underrepresents complexity and suffers bias. Choosing \(k\) (the dimension of the latent basis) is thus a bias-variance trade-off: too small and the model is rigid (underfitting), too large and it overfits noise (poor generalization). Cross-validation on held-out data guides this choice, using reconstruction error as a proxy for generalization.
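The dimension bookkeeping in this example is easy to verify. In the sketch below, random matrices stand in for learned factors (no fitting to observed ratings is performed), and the indices `i`, `j` are arbitrary; these are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(4)
n_users, n_items, k = 1000, 500, 20

# Random stand-ins for learned factors; a real system would fit these to observed ratings.
U = rng.normal(size=(n_users, k))    # row i holds user i's k latent coordinates
V = rng.normal(size=(n_items, k))    # row j holds item j's k latent coordinates
R_hat = U @ V.T                      # 1000 x 500 matrix of predicted ratings

# A single prediction is the inner product of the two coordinate vectors.
i, j = 3, 7
pred = U[i] @ V[j]

full_params = n_users * n_items
factored_params = k * (n_users + n_items)
rank_R_hat = np.linalg.matrix_rank(R_hat, tol=1e-8)

print("rank of U V^T:", rank_R_hat)
print("parameters, full vs factored:", full_params, factored_params)
```

However the factors are chosen, the product \(UV^\top\) can never have rank above \(k\), which is the structural assumption the model imposes on the rating matrix.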

How This Chapter Fits Into the Full Book

This chapter is the bridge between abstract vector space theory (Chapter 1) and concrete matrix computations (Chapters 3 onward). Chapter 1 introduced subspaces, span, and independence as algebraic concepts; this chapter quantifies them via dimension and makes them computational via bases and coordinates. Chapter 3 (Linear Transformations and Matrices) relies heavily on coordinate representations: a linear transformation \(T: V \to W\) is represented by a matrix \([T]_{\mathcal{B}, \mathcal{C}}\) once bases \(\mathcal{B}\) (for \(V\)) and \(\mathcal{C}\) (for \(W\)) are chosen, and changing bases relates the matrix representations by \(A' = P^{-1} A Q\), where \(Q\) and \(P\) are the change-of-basis matrices on \(V\) and \(W\) respectively. In the special case \(V = W\) with the same basis used on both sides (\(Q = P\)), this becomes the similarity transformation \(A' = P^{-1} A P\). The rank-nullity theorem governs injectivity and surjectivity, key properties for understanding when transformations are invertible (isomorphisms).

Chapter 4 (Determinants and Inverses) uses dimension to characterize invertibility: a square matrix \(A \in \mathbb{R}^{n \times n}\) is invertible iff \(\text{rank}(A) = n\) (full rank) iff \(\det(A) \neq 0\). The determinant measures how \(A\) scales volumes (an invertible map multiplies every \(n\)-dimensional volume by \(|\det(A)|\), with the sign of \(\det(A)\) recording orientation), and zero determinant signals dimension collapse (the image is a lower-dimensional subspace). Chapter 5 (Eigenvalues and Eigenvectors) finds special bases (eigenbases) that diagonalize matrices, simplifying computations. Dimension counts eigenspaces: the geometric multiplicity of eigenvalue \(\lambda\) is \(\dim(\text{Null}(A - \lambda I))\), and diagonalizability requires the sum of geometric multiplicities to equal \(n\) (full dimension). The spectral theorem (Chapter 6) guarantees orthonormal eigenbases for symmetric matrices, underlying PCA and many other ML algorithms.
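Geometric multiplicity is a dimension count, so rank-nullity gives a direct way to compute it. The helper name `geometric_multiplicity` and the three small matrices are illustrative assumptions.

```python
import numpy as np

def geometric_multiplicity(A, lam, tol=1e-10):
    """dim Null(A - lam I), computed as n - rank(A - lam I) by rank-nullity."""
    n = A.shape[0]
    return n - np.linalg.matrix_rank(A - lam * np.eye(n), tol=tol)

J = np.array([[2.0, 1.0],
              [0.0, 2.0]])     # eigenvalue 2 repeated, only one independent eigenvector
D = 2.0 * np.eye(2)            # eigenvalue 2 repeated, two independent eigenvectors
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])     # singular: the second row is twice the first

print("J:", geometric_multiplicity(J, 2.0))   # less than n = 2, so J is not diagonalizable
print("D:", geometric_multiplicity(D, 2.0))   # equals n = 2, so D is diagonalizable
print("det(S) =", np.linalg.det(S), " rank(S) =", np.linalg.matrix_rank(S))
```

The matrix `S` shows the determinant criterion: its determinant is zero (up to rounding) and its image is a line, a one-dimensional subspace of \(\mathbb{R}^2\).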

Chapter 7 (Singular Value Decomposition) generalizes eigendecomposition to rectangular matrices, expressing \(A \in \mathbb{R}^{m \times n}\) as \(A = U \Sigma V^\top\), where the columns of \(U\) and \(V\) provide orthonormal bases for the codomain \(\mathbb{R}^m\) and the domain \(\mathbb{R}^n\), respectively, and \(\Sigma\) contains singular values (scaling factors). The rank \(r = \text{rank}(A)\) is the number of nonzero singular values, and the SVD decomposes \(A\) into \(r\) rank-1 components, each contributing one dimension to the column space. Low-rank approximation \(A_k = U_k \Sigma_k V_k^\top\) (keeping top \(k < r\) singular values) restricts the column space to the \(k\)-dimensional span of the top left singular vectors and minimizes the Frobenius norm error among all matrices of rank at most \(k\) (the Eckart-Young theorem), making it the optimal dimension reduction for matrices.
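A short numerical check of truncated SVD follows. The matrix size, the seed, and the choice \(k = 2\) are arbitrary assumptions; the error identity verified at the end is the standard consequence of the Eckart-Young theorem.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(8, 6))

# Reduced SVD: U is 8 x 6, s holds 6 singular values in descending order, Vt is 6 x 6.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]     # keep the top k rank-1 components

err = np.linalg.norm(A - A_k, "fro")
expected = np.sqrt(np.sum(s[k:] ** 2))       # Frobenius error = root of discarded sigma^2

print("rank(A)   =", np.linalg.matrix_rank(A))
print("rank(A_k) =", np.linalg.matrix_rank(A_k, tol=1e-10))
print("error:", err, " sqrt(sum of discarded sigma^2):", expected)
```

The error depends only on the discarded singular values, so a fast-decaying spectrum means a small \(k\) already captures most of the matrix.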

Chapter 8 (Orthogonality and Least Squares) uses orthogonal bases (orthonormal via Gram-Schmidt) to simplify projections: projecting \(\mathbf{y}\) onto \(\text{Col}(X)\) yields \(\hat{\mathbf{y}} = X(X^\top X)^{-1} X^\top \mathbf{y}\), the least-squares fit. The dimensions govern solution uniqueness: \(\dim(\text{Nul}(X)) = 0\) (full column rank) ensures unique \(\hat{\mathbf{w}}\). The QR decomposition \(X = QR\) (orthonormal \(Q\), upper-triangular \(R\)) encodes a change of basis from standard coordinates to an orthonormal basis spanning \(\text{Col}(X)\), stabilizing numerical computations.
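The projection formula and the QR route can be compared directly. The sketch below uses a random full-column-rank design matrix; sizes, seed, and names are assumptions, and `np.linalg.solve` is used on the triangular factor for brevity where a dedicated triangular solver would normally be preferred.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(20, 4))     # full column rank (almost surely)
y = rng.normal(size=20)

# Reduced QR: Q is 20 x 4 with orthonormal columns spanning Col(X), R is 4 x 4 upper triangular.
Q, R = np.linalg.qr(X)

w_qr = np.linalg.solve(R, Q.T @ y)            # solves R w = Q^T y
w_ne = np.linalg.solve(X.T @ X, X.T @ y)      # normal equations, for comparison

y_hat = Q @ (Q.T @ y)                         # orthogonal projection of y onto Col(X)
residual = y - y_hat

print("QR and normal equations agree:", np.allclose(w_qr, w_ne))
print("residual orthogonal to Col(X):", np.allclose(X.T @ residual, 0.0, atol=1e-10))
```

In the orthonormal basis given by the columns of `Q`, the projection needs no matrix inverse at all: the coordinates of \(\hat{\mathbf{y}}\) are simply \(Q^\top \mathbf{y}\).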

Chapter 9 onward (Optimization, Eigendecomposition, Matrix Calculus) assumes dimension and basis theory as background. Optimization in \(\mathbb{R}^n\) (gradient descent, Newton’s method) treats the parameter space as a coordinate space; natural gradient methods change to a basis aligned with the Fisher information metric, improving convergence. Eigenvalue problems (Chapter 10) find invariant subspaces (eigenspaces), each with a dimension (geometric multiplicity), and spectral clustering decomposes data space into orthogonal eigenspace components, each representing a mode of variation. Matrix calculus (Chapter 11) differentiates with respect to matrix variables, yielding gradients living in spaces of the same dimension (e.g., \(\nabla_W L \in \mathbb{R}^{m \times n}\) if \(W \in \mathbb{R}^{m \times n}\)).

In machine learning applications throughout Part 2 (Chapter 12: PCA and Dimensionality Reduction, Chapter 13: Regression and Regularization, Chapter 14: Support Vector Machines, Chapter 15: Neural Networks and Deep Learning), dimension is the central organizing concept: PCA identifies the \(k\)-dimensional principal subspace (basis: eigenvectors), regularization constrains effective dimension (ridge shrinks toward lower-dimensional subspaces), SVMs maximize margin in feature space (where the number of support vectors, rather than the ambient dimension of the feature space, governs model complexity), and neural network layers compose linear transformations between spaces of varying dimensions, with bottleneck layers (low dimension) enforcing information compression. Without dimension and basis theory (the tools to count degrees of freedom, characterize redundancy, and choose coordinate systems), these advanced topics would lack quantitative rigor, and practitioners would struggle to diagnose rank deficiency, choose embedding dimensions, or understand why certain architectures fail to learn. Thus, this chapter is the quantitative foundation: it turns vector spaces from abstract sets into structured, measurable, and computable objects.


Motivation

Why Coordinates Matter

Machine learning is fundamentally about computation with data, and computation requires concrete numerical representations. Abstract vectors in \(\mathbb{R}^d\) or feature spaces are conceptually clean but operationally useless until we represent them as arrays of numbers (tuples, vectors, tensors). A basis provides this representation: given \(\mathcal{B} = \{\mathbf{b}_1, \dots, \mathbf{b}_n\}\), every vector \(\mathbf{v}\) becomes a tuple \([\mathbf{v}]_\mathcal{B} = (c_1, \dots, c_n)^\top\), and operations (addition, scaling, dot products) become arithmetic on tuples. This coordinatization bridges theory and implementation: theorems about abstract spaces translate into algorithms on \(\mathbb{R}^n\), and computational results (numerics) inform theoretical understanding. Without coordinates, we have no way to store vectors in memory, no way to execute matrix-vector products, and no way to optimize over parameter spaces—coordinates are the interface between mathematics and machines.

However, the choice of basis matters immensely. Different bases yield different coordinate systems, and some bases are far more useful than others for specific tasks. The standard basis \(\{\mathbf{e}_1, \dots, \mathbf{e}_n\}\) (unit vectors along coordinate axes) is the default: coordinates are simply the components of the vector, and operations are straightforward. But this basis has no special relationship to the data or problem structure. An orthonormal basis (mutually perpendicular unit vectors) preserves lengths and angles, simplifying geometry: dot products become sums of products of coordinates, norms become Euclidean lengths, and projections become coordinate drops. An eigenbasis (eigenvectors of a matrix) diagonalizes that matrix, decoupling coordinates: each coordinate evolves independently under the transformation, enabling efficient exponentiation, inversion, and iteration (powers of diagonal matrices are trivial). A Fourier basis decomposes signals into frequency components, separating low-frequency trends from high-frequency noise—critical in signal processing and time series analysis. In machine learning, choosing the right basis is often the difference between success and failure: raw pixel bases obscure structure (nearby pixels are correlated), while learned bases (PCA components, convolutional filters, word embeddings) expose latent patterns that align with the learning task.

Concrete ML example: In a time series forecasting problem with daily temperature data (365 days per year over 10 years, yielding ten vectors in \(\mathbb{R}^{365}\), one per year), the standard basis (each coordinate = one day’s temperature) treats all days equally, offering no insight into seasonal patterns. However, transforming to a Fourier basis (sum of sine and cosine functions at various frequencies) decomposes the signal into periodic components: a low-frequency component captures the annual cycle (winter-summer variation), medium frequencies capture monthly fluctuations, and high frequencies capture daily noise. In this Fourier basis, the first few coordinates (low frequencies) contain most of the signal energy, while later coordinates (high frequencies) are dominated by noise. Forecasting becomes simple: predict the low-frequency coordinates (smooth trends) using historical data, and ignore or regularize the high-frequency coordinates (unforecastable noise). A model trained in the Fourier basis generalizes better because the coordinates align with the problem structure (periodicity), whereas a model in the standard basis must learn this structure from scratch (requiring vastly more data). This demonstrates a principle: the right basis choice reduces effective dimension (concentrates information in fewer coordinates), simplifies models, and improves generalization. In practice, the discrete Fourier transform (DFT) is a change-of-basis matrix applied to the data, converting from standard coordinates to Fourier coordinates. The fast Fourier transform (FFT) is the algorithm that computes this matrix-vector product in \(O(n \log n)\) operations instead of the \(O(n^2)\) of a dense multiplication; it is essential in signal processing, image compression (JPEG uses the Discrete Cosine Transform, a related basis), and many ML preprocessing pipelines.
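The claim that the Fourier transform is a change of basis can be checked by building the DFT matrix explicitly and comparing it with the FFT. The temperature series below is synthetic (a mean, one annual cosine, and Gaussian noise); its parameters, the seed, and the choice to keep only three coefficients are assumptions for illustration.

```python
import numpy as np

n = 365
days = np.arange(n)
rng = np.random.default_rng(7)

# Synthetic year: mean 10, annual cycle of amplitude 12, noise with standard deviation 1.
temps = 10 + 12 * np.cos(2 * np.pi * (days - 200) / n) + rng.normal(scale=1.0, size=n)

# DFT matrix: F[k, t] = exp(-2*pi*i*k*t / n). F / sqrt(n) is unitary, hence invertible,
# so multiplying by F is a genuine change of coordinates.
F = np.exp(-2j * np.pi * np.outer(days, days) / n)
coeffs_matrix = F @ temps            # O(n^2) dense matrix-vector product
coeffs_fft = np.fft.fft(temps)       # same coordinates via the O(n log n) algorithm

# Keep the constant term and the annual frequency (index 1 and its mirror n - 1).
keep = np.zeros(n, dtype=bool)
keep[[0, 1, n - 1]] = True
smooth = np.fft.ifft(np.where(keep, coeffs_fft, 0)).real

energy_fraction = np.sum(np.abs(coeffs_fft[keep]) ** 2) / np.sum(np.abs(coeffs_fft) ** 2)
print(f"energy in 3 of {n} Fourier coordinates: {energy_fraction:.4f}")
```

Three Fourier coordinates carry nearly all of the energy of this synthetic signal, which is the concentration of information described above; real temperature records would need more coefficients.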

Minimal Generating Sets

A key insight of basis theory is the tension between spanning (completeness) and independence (minimality). A spanning set \(S\) generates the entire space (every vector is a linear combination of elements in \(S\)), ensuring that we can represent anything, but \(S\) may contain redundant elements: some vectors in \(S\) might be expressible as combinations of others, offering no new directions. An independent set contains no such redundancy—each element adds a new dimension—but may fail to span, leaving some vectors unrepresentable. A basis reconciles these: it is a spanning set with no redundancy (independent) or, equivalently, a maximal independent set (adding any vector breaks independence). This makes bases optimal coordinate systems: they have just enough elements to cover the space (spanning) without wasting any (independence).

Why does minimality matter? In computational terms, representing a vector in basis \(\mathcal{B}\) requires \(|\mathcal{B}|\) coordinates (one per basis element). A redundant spanning set would require more coordinates than necessary, and those extra coordinates would not be unique (infinitely many coordinate tuples represent the same vector). Non-uniqueness destroys the coordinate map as an isomorphism: the map from coordinate tuples to vectors is no longer one-to-one, so it cannot be inverted (going from a vector back to its coordinates is ambiguous). This makes computations ill-defined: which coordinate tuple should we use for \(\mathbf{v}\)? In optimization, non-unique parameterizations cause identifiability failures: many parameter values yield identical model predictions, preventing gradient-based methods from converging to a unique solution. In statistical estimation, non-identifiability inflates standard errors and makes parameter interpretations meaningless. Thus, minimality (independence) is not an aesthetic preference: it is a computational and statistical necessity.

Conversely, a non-spanning independent set is insufficient: some vectors have no coordinate representation at all, and operations involving those vectors are undefined. This limits expressivity: models parameterized with an insufficient basis cannot represent all possible functions in the target class, guaranteeing approximation error no matter how much data or compute is available. In neural networks, insufficient hidden layer dimensions (too few neurons) create a representational bottleneck: the layer’s output subspace has lower dimension than required for the task, and no amount of training can overcome this (it’s a capacity limitation, not a training failure). Thus, spanning (completeness) is equally essential: bases must have enough elements to cover the space.

Concrete ML example: In polynomial regression, we model \(y\) as a polynomial function of \(x\): \(\hat{y} = w_0 + w_1 x + w_2 x^2 + \cdots + w_k x^k\). The feature space is \(\mathcal{P}_k\) (polynomials of degree \(\leq k\)), and the natural basis is \(\mathcal{B} = \{1, x, x^2, \dots, x^k\}\). This basis has \(k+1\) elements, so \(\dim(\mathcal{P}_k) = k+1\). Model complexity is determined by \(k\): higher \(k\) means higher dimension, more parameters, and greater flexibility (but also higher variance). Now suppose a practitioner mistakenly includes redundant features: \(\{1, x, x^2, 2x^2\}\) (the last is redundant since \(2x^2\) is a multiple of \(x^2\)). This spanning set has 4 elements but spans only a 3-dimensional space (\(\mathcal{P}_2\)). Attempting to fit \(w_0 + w_1 x + w_2 x^2 + w_3 (2x^2) = w_0 + w_1 x + (w_2 + 2w_3) x^2\) yields non-unique coefficients: many \((w_0, w_1, w_2, w_3)\) tuples produce the same polynomial (e.g., \(w_3 = 0, w_2 = 1\) and \(w_3 = 0.5, w_2 = 0\) both give \(x^2\)). The normal equations \(X^\top X \mathbf{w} = X^\top \mathbf{y}\) have \(X^\top X\) singular (non-invertible), so the least-squares problem has an infinite family of minimizers. The pseudoinverse selects the minimum-norm member of that family and a ridge penalty selects a nearby shrunken solution, in both cases by convention rather than because the data prefer it. This ambiguity, caused by using a redundant (non-independent) generating set, makes the model uninterpretable (what does \(w_3\) mean?) and numerically fragile. Removing the redundancy (using the basis \(\{1, x, x^2\}\)) restores uniqueness, stability, and interpretability. This illustrates the power of minimal spanning sets (bases): they eliminate redundancy, ensure unique parameterization, and stabilize computation.
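The redundant feature set can be examined numerically. In this sketch the target polynomial \(1 + 2x + 3x^2\), the grid of 30 points, and the explicit cutoff `rcond=1e-10` are assumptions chosen so the outcome is easy to verify by hand; the data are noise-free for clarity.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 30)
y = 1.0 + 2.0 * x + 3.0 * x**2            # noise-free target

# Redundant design: columns 1, x, x^2, 2x^2. The first three columns are the basis {1, x, x^2}.
X_red = np.column_stack([np.ones_like(x), x, x**2, 2 * x**2])
X_basis = X_red[:, :3]

print("rank of redundant design:", np.linalg.matrix_rank(X_red, tol=1e-10))

# Two different weight vectors that define the same polynomial (w2 + 2*w3 = 3 in both).
w_a = np.array([1.0, 2.0, 3.0, 0.0])
w_b = np.array([1.0, 2.0, 0.0, 1.5])
print("identical predictions:", np.allclose(X_red @ w_a, X_red @ w_b))

# The pseudoinverse picks the minimum-norm member of the solution family.
w_pinv = np.linalg.pinv(X_red, rcond=1e-10) @ y
print("pseudoinverse solution:", np.round(w_pinv, 6))

# With the basis {1, x, x^2} the coefficients are unique.
w_unique, *_ = np.linalg.lstsq(X_basis, y, rcond=None)
print("unique solution in the basis:", np.round(w_unique, 6))
```

The pseudoinverse splits the quadratic coefficient between the two proportional columns so as to minimize \(w_2^2 + w_3^2\) subject to \(w_2 + 2w_3 = 3\), which gives \(w_2 = 0.6\) and \(w_3 = 1.2\); neither number has meaning on its own.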

Redundancy and Expressivity

In machine learning, model capacity is often quantified by the number of parameters, but the true measure of expressivity is the dimension of the hypothesis class (the space of functions representable by the model). These can diverge when parameters are redundant: a model with 1000 parameters that are linearly dependent may have effective dimension \(\ll 1000\). For example, a linear layer \(\mathbf{h} = W\mathbf{x}\) with \(W \in \mathbb{R}^{100 \times 100}\) has 10,000 parameters, but if \(\text{rank}(W) = 50\), the image \(\{W\mathbf{x} : \mathbf{x} \in \mathbb{R}^{100}\}\) is a 50-dimensional subspace of \(\mathbb{R}^{100}\). The effective output dimension is 50, not 100: for this \(W\), every output lies in that 50-dimensional subspace, whatever the input. This redundancy limits expressivity below what the nominal parameter count suggests.
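A quick numerical illustration of the rank-50 layer follows. It is a sketch under assumptions: the weight matrix is built as a product of two random thin factors (one convenient way to obtain rank 50, not something the text prescribes), and NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rank-50 weight matrix: product of a 100x50 and a 50x100 factor.
A = rng.standard_normal((100, 50))
B = rng.standard_normal((50, 100))
W = A @ B

n_params = W.size                      # nominal parameter count: 10,000
rank_W = np.linalg.matrix_rank(W)      # dimension of the image of W

# Push 1000 random inputs through the layer; the outputs span only rank_W directions.
X = rng.standard_normal((100, 1000))
H = W @ X
rank_outputs = np.linalg.matrix_rank(H)
```

However many inputs are fed through the layer, the rank of the output matrix never exceeds the rank of \(W\).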

Redundancy arises from several sources: (1) Overparameterization: more parameters than necessary to represent the function class (common in deep learning, often intentional for optimization reasons). (2) Symmetries: different parameter settings yielding identical functions (e.g., swapping neurons in a hidden layer). (3) Linear dependencies: constraints or relationships among parameters that reduce effective degrees of freedom. Redundancy is not inherently bad—in fact, overparameterization can improve trainability (wider loss landscapes, easier gradient flow)—but understanding the effective dimension (true expressivity) is crucial for generalization analysis. A model with high parameter count but low effective dimension may generalize well despite classical theories (like VC dimension bounds) suggesting overfitting, because the true capacity is lower than the parameter count suggests.

Conversely, expressivity is maximized when the parameter space has no redundancy: a basis for the hypothesis class. For example, representing functions in \(\mathcal{P}_k\) via the monomial basis \(\{1, x, \dots, x^k\}\) uses \(k+1\) parameters (coefficients), achieving full expressivity with minimal parameterization. Any redundant features (e.g., including \(x^2\) and \(2x^2\)) add parameters without increasing expressivity. Thus, basis theory provides a language for discussing expressivity-redundancy trade-offs: bases minimize parameters for a given expressivity (dimension), while redundant parameterizations may offer other benefits (optimization landscapes, interpretability, ease of constraints) at the cost of identifiability.

Concrete ML example: In collaborative filtering, a movie recommendation system models user \(i\)’s rating of movie \(j\) as \(\hat{R}_{ij} = \mathbf{u}_i^\top \mathbf{v}_j\), where \(\mathbf{u}_i \in \mathbb{R}^k\) is user \(i\)’s latent profile and \(\mathbf{v}_j \in \mathbb{R}^k\) is movie \(j\)’s latent profile. The parameter space is \(\mathbb{R}^{nk + mk}\) (\(n\) users, \(m\) movies), seemingly \(k(n+m)\)-dimensional. However, the model is invariant to rotations: if we rotate both \(\mathbf{u}_i \to Q\mathbf{u}_i\) and \(\mathbf{v}_j \to Q\mathbf{v}_j\) for any orthogonal \(Q \in \mathbb{R}^{k \times k}\), the predictions remain unchanged (since \((Q\mathbf{u}_i)^\top (Q\mathbf{v}_j) = \mathbf{u}_i^\top Q^\top Q \mathbf{v}_j = \mathbf{u}_i^\top \mathbf{v}_j\)). This symmetry alone accounts for \(k(k-1)/2\) redundant parameters (the dimension of the orthogonal group \(O(k)\)). Without regularization the invariance is larger still: for any invertible \(A \in \mathbb{R}^{k \times k}\), replacing \(\mathbf{u}_i \to A^\top \mathbf{u}_i\) and \(\mathbf{v}_j \to A^{-1} \mathbf{v}_j\) also preserves every prediction, which removes \(k^2\) degrees of freedom and leaves an effective dimension of \(k(n+m) - k^2 = k(n+m-k)\), the dimension of the set of rank-\(k\) matrices of size \(n \times m\). (An L2 penalty on the factors is invariant only under orthogonal \(Q\), so among regularized solutions the remaining ambiguity is the rotational one.) The redundancy causes identifiability issues: infinitely many \((\mathbf{u}, \mathbf{v})\) pairs produce identical predictions. This doesn’t prevent prediction (we can find one solution), but it complicates interpretation (which latent factors are “real”?) and makes maximum likelihood estimation non-unique. To resolve this, practitioners often fix a specific orthogonal basis for the latent space (e.g., via SVD initialization or orthogonality constraints), removing redundancy and ensuring unique parameter estimates. This trade-off, freedom (redundancy) versus identifiability (unique basis), is pervasive in ML: overparameterized models train easily but resist interpretation, while minimal parameterizations (bases) are interpretable but may be harder to optimize.
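The invariance of the predictions can be verified directly. The sketch below assumes NumPy and small hypothetical sizes (6 users, 5 movies, \(k = 3\)); latent profiles are stored as rows, so right-multiplying by \(Q\) applies \(Q^\top\) to each profile, which is again orthogonal. It also checks the more general invertible reparameterization, which holds for the unregularized model.

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_movies, k = 6, 5, 3            # hypothetical sizes

U = rng.standard_normal((n_users, k))     # row i is u_i^T
V = rng.standard_normal((n_movies, k))    # row j is v_j^T
R_hat = U @ V.T                           # predicted ratings u_i^T v_j

# Random orthogonal matrix from the QR factorization of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((k, k)))
R_hat_rotated = (U @ Q) @ (V @ Q).T       # = U Q Q^T V^T = U V^T

# A well-conditioned invertible (non-orthogonal) matrix near the identity.
A = np.eye(k) + 0.1 * rng.standard_normal((k, k))
R_hat_general = (U @ A) @ (V @ np.linalg.inv(A).T).T   # = U A A^{-1} V^T

# Degrees-of-freedom bookkeeping for the unregularized model.
nominal_dim = k * (n_users + n_movies)
effective_dim = k * (n_users + n_movies - k)
```

The factors change under both transformations while every predicted rating stays the same, which is the non-identifiability described above.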

Dimension as Capacity

Dimension quantifies the complexity or capacity of a model: how many degrees of freedom it has to fit data. In classical learning theory, higher dimension generally implies higher capacity, enabling the model to fit more complex patterns but also increasing the risk of overfitting. The VC (Vapnik-Chervonenkis) dimension of a hypothesis class \(\mathcal{H}\) measures the maximum number of points that can be shattered (all \(2^n\) labelings realized) by \(\mathcal{H}\); for linear classifiers in \(\mathbb{R}^d\), \(\text{VC-dim} = d+1\), directly related to parameter dimension. Generalization bounds depend on VC dimension: sample complexity scales as \(O(\text{VC-dim} / \epsilon^2)\) to achieve \(\epsilon\)-accuracy. Thus, controlling dimension controls generalization: lower dimension (simpler models) generalize with less data but may underfit; higher dimension (complex models) fit training data better but require more data to avoid overfitting.

Dimension also governs representational capacity: the set of functions a model can represent. A neural network with layers of dimensions \(d_0 \to d_1 \to \cdots \to d_L\) (ignoring nonlinearities momentarily) can represent linear maps \(\mathbb{R}^{d_0} \to \mathbb{R}^{d_L}\) of rank at most \(\min(d_0, d_1, \dots, d_L)\). A bottleneck layer (small \(d_\ell\)) limits rank globally: all subsequent representations lie in a \(d_\ell\)-dimensional subspace. Thus, layer dimensions \(d_\ell\) set capacity ceilings. Choosing \(d_\ell\) is a design decision balancing expressivity (large \(d_\ell\) captures more complexity) and generalization (small \(d_\ell\) reduces overfitting). Empirically, wide layers (large dimensions) often train faster and generalize better than narrow layers, even when total parameter counts are matched, suggesting that dimension per se (not just parameter count) impacts trainability—likely because higher dimension provides more paths for gradient flow and reduces the chance of rank collapse during training.
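The rank ceiling imposed by a bottleneck can be seen by composing random linear layers. This is a minimal sketch assuming NumPy and a hypothetical width sequence \(64 \to 32 \to 8 \to 32 \to 64\) with nonlinearities ignored, as in the paragraph above.

```python
import numpy as np

rng = np.random.default_rng(2)
dims = [64, 32, 8, 32, 64]     # hypothetical widths d_0 -> ... -> d_L, bottleneck of 8

# Compose the layers: M = W_L ... W_2 W_1, starting from the identity on R^{d_0}.
M = np.eye(dims[0])
for d_in, d_out in zip(dims[:-1], dims[1:]):
    W = rng.standard_normal((d_out, d_in))
    M = W @ M

rank_M = np.linalg.matrix_rank(M)   # bounded by the narrowest layer
bottleneck = min(dims)
```

Although the composite map is a \(64 \times 64\) matrix, its rank equals the bottleneck width; widening the layers after the bottleneck cannot restore the lost dimensions.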

Concrete ML example: A practitioner designs a convolutional autoencoder for compressing 128×128 grayscale images (\(16384 = 128^2\) pixels). The encoder progressively reduces spatial dimensions while increasing channel count: \(128 \times 128 \times 1 \to 64 \times 64 \times 32 \to 32 \times 32 \times 64 \to 16 \times 16 \times 128\), then flattens and projects to a bottleneck code \(\mathbf{z} \in \mathbb{R}^{512}\) (dimension 512). The decoder reverses this, reconstructing \(\hat{\mathbf{x}} \in \mathbb{R}^{16384}\). The bottleneck dimension (512) determines the compression ratio (\(16384 / 512 = 32\times\)) and the representational capacity. If natural images lie near a manifold of intrinsic dimension \(d_{\text{intrinsic}} \approx 400\), a 512-dimensional bottleneck is sufficient (it exceeds the intrinsic dimension), and reconstruction loss is limited by noise and model capacity, not intrinsic dimensionality. But if the practitioner reduces the bottleneck to 128 dimensions (compression 128×), and the intrinsic dimension is indeed 400, the autoencoder cannot faithfully represent all image variations: it underrepresents, incurring irreducible approximation error. Training will converge to a lossy encoding that captures the “most important” 128 dimensions (directions of highest variance), discarding the rest. This manifests as blurriness or loss of detail in reconstructions. To quantify, she computes reconstruction MSE as a function of bottleneck dimension \(k\): plotting MSE vs. \(k\), she observes a steep decrease up to \(k \approx 400\), then gradual improvement (diminishing returns). The “elbow” at \(k \approx 400\) suggests the intrinsic dimension, and choosing \(k \geq 400\) suffices for high-fidelity reconstruction. This illustrates dimension as capacity: too little (low \(k\)) and the model cannot represent the data distribution (underfitting due to insufficient capacity), too much (high \(k\)) and the model overfits noise or wastes computation (unnecessary capacity). Optimal \(k\) matches the intrinsic dimension, balancing expressivity and efficiency, a principle guided by dimension theory.

Common Misconceptions About Basis

Several misconceptions about bases persist, even among experienced practitioners, often stemming from familiarity with the standard basis in \(\mathbb{R}^n\) (where coordinates are explicit) and neglecting that bases are choices, not inherent properties of the space. Misconception 1: “There is one canonical basis for a vector space.” False: every real vector space of dimension at least 1 has infinitely many bases (even in dimension 1, any nonzero vector is a basis; only the zero space has a unique basis, the empty set). In \(\mathbb{R}^n\), the standard basis \(\{\mathbf{e}_1, \dots, \mathbf{e}_n\}\) is conventional but not special mathematically: any set of \(n\) linearly independent vectors is equally valid. Orthonormal bases (obtained via Gram-Schmidt) are convenient for geometric calculations, but still non-unique (infinitely many). The choice of basis is a modeling decision: different bases suit different tasks (standard for simplicity, eigenbases for diagonalization, Fourier for periodicity, PCA for variance capture). Failing to recognize this choice leads to rigid thinking (“we must use raw features as coordinates”) and misses opportunities for preprocessing (basis transformation) that can dramatically improve performance.

Misconception 2: “Dimension is the number of features/parameters/data points.” Partially true, partially false: dimension is the number of basis elements (equivalently, the size of a minimal spanning set), which for \(\mathbb{R}^d\) equals \(d\) (number of features). However, if features are linearly dependent (multicollinearity), the effective dimension is less than \(d\): the data lie in a lower-dimensional subspace. For parameters, the dimension of the parameter space is the number of independent parameters, which may be less than the total count if constraints or symmetries exist. For data points, \(n\) samples do not determine dimension directly—they span a subspace of dimension at most \(\min(n, d)\), often much less if data lie near a manifold. Dimension is intrinsic to the space, not to specific representations or samples.

Misconception 3: “Coordinates are unique.” True only for a fixed basis: given basis \(\mathcal{B}\), coordinates \([\mathbf{v}]_\mathcal{B}\) are unique (by linear independence in the basis). But change the basis, and coordinates change: \([\mathbf{v}]_{\mathcal{B}'} \neq [\mathbf{v}]_\mathcal{B}\) in general. There is no “true” coordinate representation—all bases are valid, and coordinates are relative to the choice of basis. This is conceptually similar to changing units (meters to feet) or coordinate systems (Cartesian to polar): the object (vector) is invariant, but its numerical description depends on the frame. Practical implication: when reporting numbers (e.g., PCA component scores, embedding dimensions), always specify the basis (e.g., “first three principal components” = coordinates in PCA basis, not standard basis).
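A small worked instance makes the dependence on the basis concrete. The sketch assumes NumPy; the vector \((3, 1)\) and the alternative basis \(\{(1, 1), (1, -1)\}\) are hypothetical choices for illustration. Coordinates relative to a basis are found by solving \(B\mathbf{c} = \mathbf{v}\), where the columns of \(B\) are the basis vectors.

```python
import numpy as np

v = np.array([3.0, 1.0])                 # the vector, written in standard coordinates

B_std = np.eye(2)                        # standard basis e_1, e_2 as columns
B_alt = np.array([[1.0, 1.0],
                  [1.0, -1.0]])          # columns b_1 = (1, 1), b_2 = (1, -1)

# Coordinates of v relative to each basis: solve B c = v.
coords_std = np.linalg.solve(B_std, v)   # (3, 1)
coords_alt = np.linalg.solve(B_alt, v)   # (2, 1), since 2*b_1 + 1*b_2 = (3, 1)

# Both coordinate vectors describe the same underlying vector.
reconstructed = B_alt @ coords_alt
```

The numbers differ between the two bases, but multiplying the coordinates by the corresponding basis matrix returns the same vector in both cases.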

Misconception 4: “You can’t have a basis for infinite-dimensional spaces.” False: infinite-dimensional spaces have bases, although the word covers two distinct notions. A Hamel (algebraic) basis, in which every vector is a finite linear combination of basis elements, exists for every vector space by Zorn’s Lemma (equivalently, the Axiom of Choice), though it can rarely be written down explicitly. In separable Hilbert spaces such as \(\ell^2\) and \(L^2\), one works instead with a countable orthonormal basis, which can be constructed explicitly and in which vectors are expanded as infinite series converging in norm. For example, the Fourier system \(\{\sin(nx), \cos(nx)\}_{n=1}^\infty \cup \{1\}\) is, after normalization, an orthonormal basis of \(L^2([0, 2\pi])\): its finite linear combinations are dense, and Hilbert space theory extends finite-dimensional linear algebra to such countably infinite bases (with convergence in norm replacing finite sums). In ML, kernel methods implicitly use infinite-dimensional feature spaces (feature maps \(\phi(\mathbf{x})\) in reproducing kernel Hilbert spaces), though computations remain finite-dimensional via the kernel trick (dot products in feature space, not explicit coordinates). Infinite dimension does introduce subtleties (convergence, completeness), but the core concepts (spanning, independence, coordinates) generalize.

Misconception 5: “Increasing basis size always increases model capacity.” True if the added elements are linearly independent of those already present, false if they are redundant. Adding a redundant element (dependent on existing basis elements) to a basis does not increase dimension or expressivity; it merely creates a non-minimal spanning set (no longer a basis). True capacity is determined by dimension (number of independent elements), not parameter count (which may include redundancies). In neural networks, adding neurons to a hidden layer increases nominal parameters but may not increase effective dimension if the added neurons’ outputs lie in the span of existing neurons (e.g., if they learn similar features due to symmetry or initialization). Measuring true capacity requires checking rank or independence, not counting parameters naively.

Concrete ML example: A data scientist uses PCA to reduce dimensionality of text data (bag-of-words with \(d = 10000\) vocabulary). She computes the top \(k = 100\) principal components and projects data into this 100-dimensional PCA basis. Later, a colleague suggests “adding back a few original words that might be important” by appending those raw feature coordinates to the PCA coordinates, yielding a 105-dimensional representation. They assume this adds five full dimensions of capacity. Whether it does depends on linear independence, which must be checked rather than assumed. If the centered data have rank at most 100 (so that the 100 components capture all of the variance), every raw feature is an exact linear combination of the PCA coordinates, and the 105 columns still span a 100-dimensional subspace. If the data have higher rank, the appended columns are generally not exact combinations, but for words that load heavily on the leading components most of their variation is already captured, so the new columns are strongly correlated with the existing ones and contribute only a small residual of new information. In both cases the mixed coordinate system causes trouble: (1) exactly or nearly non-unique coordinate representations (many 105-dimensional coefficient vectors give the same, or almost the same, predictions), (2) multicollinearity in downstream models (regression coefficients become unstable), (3) extra computation with little or no expressivity gain. To fix this, they should either use the 100 PCA components alone (a basis for the retained subspace), or explicitly compute a new basis spanning the PCA components together with the selected raw features (via Gram-Schmidt or QR), checking the rank and keeping only the residual directions that are genuinely independent. This example shows the danger of ignoring basis theory: naively mixing coordinate systems (PCA coordinates + raw coordinates) without checking linear independence leads to redundancy, instability, and confusion.


ML Connection

Feature Engineering and Redundancy

Feature engineering—transforming raw inputs into a representation suitable for learning—is fundamentally a basis choice problem. Raw data (pixels, text tokens, sensor readings) arrive in a default (often standard) basis that may not align with the underlying structure of the problem. For example, raw pixel values treat neighboring pixels independently, ignoring spatial correlations; raw word counts treat synonyms as distinct, losing semantic similarity. Feature engineering designs new features (transformations, combinations, kernels) that capture relevant structure, effectively defining a new basis for the feature space. From a linear algebra perspective, this is a change of basis: we start with coordinates in the standard basis (raw features) and transform to coordinates in a custom basis (engineered features), often via a linear transformation (though nonlinear preprocessing is also common).

Redundancy (linearly dependent features) is a major concern in feature engineering. If two features are perfectly correlated (e.g., temperature in Celsius and Fahrenheit), they provide no independent information: \(T_F = (9/5) T_C + 32\), an affine relation that becomes an exact linear dependency among the columns of the design matrix once an intercept column is included. Including both increases parameter count but not model expressivity (dimension stays the same), and causes multicollinearity (non-unique parameter estimates, inflated variances). More subtly, near-dependencies (features with correlation close to 1 or -1) cause numerical instability: the design matrix \(X\) has near-zero singular values, and solutions to \(X^\top X \mathbf{w} = X^\top \mathbf{y}\) amplify numerical errors. Basis theory provides a remedy: identify redundancies via rank checks (compute \(\text{rank}(X)\); if less than \(d\), features are dependent), remove redundant features (keeping a basis for \(\text{Col}(X)\)), or regularize (ridge regression adds \(\lambda I\) to \(X^\top X\), which shrinks the solution along directions with small eigenvalues and keeps the system well-conditioned).
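The rank check described above can be sketched for the Celsius and Fahrenheit example. The code assumes NumPy and 50 hypothetical temperature readings. Because the conversion includes the offset 32, the dependency among columns is exact only when an intercept (all-ones) column is present, so the sketch inspects the design matrix both with and without it.

```python
import numpy as np

rng = np.random.default_rng(3)
t_c = rng.uniform(-10.0, 35.0, size=50)     # hypothetical readings in Celsius
t_f = 9.0 / 5.0 * t_c + 32.0                # the same readings in Fahrenheit

# Design matrix with intercept, Celsius, and Fahrenheit columns.
X = np.column_stack([np.ones_like(t_c), t_c, t_f])
rank_X = np.linalg.matrix_rank(X)           # 2, not 3: one column is redundant
sing_vals = np.linalg.svd(X, compute_uv=False)

# Without the intercept, the two temperature columns are independent as vectors.
rank_no_intercept = np.linalg.matrix_rank(np.column_stack([t_c, t_f]))

# Dropping the Fahrenheit column leaves a basis for the column space.
X_reduced = X[:, :2]
rank_reduced = np.linalg.matrix_rank(X_reduced)
cond_reduced = np.linalg.cond(X_reduced)    # finite and moderate
```

The smallest singular value of the three-column design is zero up to rounding, which is the numerical signature of the dependency; after dropping one temperature column the matrix has full column rank.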

Good feature engineering produces a feature set that is (approximately) an independent set, spanning the relevant information in the data. For instance, PCA constructs an orthonormal basis (principal components) ordered by variance explained: the first \(k\) components form a basis for the top-\(k\) variance subspace. Dropping low-variance components removes noise and computational burden while retaining signal, effectively reducing dimensionality to the intrinsic dimension of the signal subspace. This is explicit basis design: we choose a basis (PCA) optimized for a criterion (variance), then work in those coordinates. Other basis choices (ICA for independence, sparse coding for sparsity, learned embeddings in neural networks) optimize different criteria, but all share the goal of finding a basis that makes the learning problem simpler—fewer dimensions, more interpretable coordinates, better aligned with model assumptions.

Concrete ML example: In a credit risk model predicting loan default, raw features include annual income \(I\), monthly income \(I/12\), total debt \(D\), debt-to-income ratio \(D/I\), and log-income \(\log(I)\). These 5 features are not linearly independent: monthly income is \((1/12)\) times annual income, an exact linear dependence. The debt-to-income ratio \(D/I\) is a rational (nonlinear) function of \(D\) and \(I\), and \(\log(I)\) is a nonlinear function of \(I\); neither creates an exact linear dependence, although both may be strongly correlated with the other features. Including both \(I\) and \(I/12\), however, is redundant: the two columns span a 1-dimensional subspace (the line \(\text{span}\{(I, I/12)\}\)), not a 2-dimensional space. Training a linear model on a design matrix that contains both columns produces a singular \(X^\top X\) matrix, whatever other features are present. In practice, the practitioner notices a high condition number and runs a variance inflation factor (VIF) analysis, identifying \(I\) and \(I/12\) as perfectly collinear (VIF = ∞). She removes \(I/12\), reducing from 5 to 4 features, and checks rank: now \(\text{rank}(X) = 4\) (assuming the remaining features are independent), so the feature set is a basis for a 4-dimensional subspace. The model trains stably, coefficient variances decrease, and interpretation improves (each coefficient corresponds to one independent direction in feature space). By removing redundancy (ensuring independence), she obtained a basis for the feature subspace, enabling unique parameter estimates and stable predictions, which demonstrates that careful feature engineering (basis selection) is essential for robust models.

Intrinsic vs Ambient Dimension in Data

A central assumption in modern machine learning is the manifold hypothesis: high-dimensional data (large ambient dimension) often concentrate near a low-dimensional manifold or subspace (small intrinsic dimension). For instance, natural images in \(\mathbb{R}^{d}\) (\(d \approx 10^6\) pixels for megapixel images) do not fill \(\mathbb{R}^d\) uniformly—they occupy a tiny region near a manifold of dimension perhaps \(d_{\text{intrinsic}} \sim 10^2\)-\(10^3\), capturing degrees of freedom like object pose, lighting, texture. This discrepancy—ambient dimension \(d \gg d_{\text{intrinsic}}\)—is the foundation for dimensionality reduction, compression, and generalization: if data truly live in a low-dimensional subspace, we can represent them efficiently using a basis for that subspace, discarding the \(d - d_{\text{intrinsic}}\) orthogonal dimensions as noise or irrelevant variation.

Estimating intrinsic dimension from finite samples is a statistical challenge. PCA provides one estimate: plot singular values (or eigenvalues of covariance) in decreasing order and identify an “elbow” where values drop sharply; the number of large singular values approximates \(d_{\text{intrinsic}}\). More sophisticated methods (local PCA, manifold learning algorithms like Isomap or t-SNE, intrinsic dimension estimators based on nearest-neighbor distances) refine this estimate, accounting for nonlinearity (if data lie on a curved manifold, linear methods like PCA overestimate intrinsic dimension, because a curved \(d_{\text{intrinsic}}\)-dimensional manifold generally spans more than \(d_{\text{intrinsic}}\) linear directions). Regardless of method, the key insight is that effective model capacity should match intrinsic dimension, not ambient dimension: models with \(d_{\text{intrinsic}}\) parameters (or a \(d_{\text{intrinsic}}\)-dimensional latent space) suffice to capture data structure, while models with \(d\) parameters tend to fit noise.
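The PCA-based estimate can be sketched on synthetic data. The example assumes NumPy and a hypothetical setup: 500 samples in \(\mathbb{R}^{50}\) lying near a 5-dimensional linear subspace, plus small isotropic noise. The "elbow" is located with a simple largest-ratio heuristic between consecutive singular values, which is one of several possible rules and is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, d_true = 500, 50, 5                    # hypothetical sizes

# Orthonormal basis (d x d_true) for a random 5-dimensional subspace of R^50.
basis, _ = np.linalg.qr(rng.standard_normal((d, d_true)))

# Data = low-dimensional signal embedded in R^50, plus small noise.
Z = rng.standard_normal((n, d_true))
X = Z @ basis.T + 0.01 * rng.standard_normal((n, d))

# Singular values of the centered data, in decreasing order.
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)

# Elbow heuristic: the largest drop between consecutive singular values.
ratios = s[:-1] / s[1:]
d_est = int(np.argmax(ratios)) + 1
```

On real data the gap is rarely this clean, and for data on a curved manifold this linear estimate can exceed the true intrinsic dimension.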

Dimension mismatch causes problems in both directions: (1) Underestimating intrinsic dimension (model dimension \(< d_{\text{intrinsic}}\)): the model cannot represent all data variations, incurring approximation error (bias). For example, a 10-dimensional autoencoder bottleneck on 50-dimensional-intrinsic data will compress excessively, losing information. (2) Overestimating intrinsic dimension (model dimension \(> d_{\text{intrinsic}}\)): the model has excess capacity, fitting noise and overfitting. For example, a 1000-dimensional bottleneck on 50-dimensional-intrinsic data wastes computation and risks overfitting (the extra 950 dimensions encode noise). Optimal model dimension aligns with intrinsic dimension: this “Goldilocks” regime maximizes expressivity (no underfitting) while minimizing overfitting risk (no excess capacity).

Concrete ML example: A speech recognition system processes audio spectrograms of dimension \(d = 80\) (80 mel-frequency bins per frame). Frames are windowed over time, yielding sequences of \(d\)-dimensional vectors. To denoise and reduce computation, the practitioner applies PCA to each frame, retaining \(k\) components. Plotting cumulative explained variance vs. \(k\), she observes that \(k = 20\) captures 95% of variance, \(k = 40\) captures 99%, and \(k = 80\) captures 100% (trivially, since retaining all components is just an orthogonal change of basis and discards nothing). She infers that the intrinsic dimension of speech spectra is approximately 20-40 (most signal energy concentrates in 20-40 directions, corresponding to formant structure and harmonic content), while the remaining 40-60 dimensions are mostly noise (sensor artifacts, background hiss). Using \(k = 30\) as a compromise, she projects spectrograms to \(\mathbb{R}^{30}\), reducing the input dimension to a recurrent neural network (RNN) downstream. The RNN, now operating on 30-dimensional inputs instead of 80, trains faster (fewer parameters in the input layer) and generalizes better (less noise). Phoneme recognition accuracy improves by 2%, suggesting that the ambient dimension (80) overrepresented the data, and that working in the intrinsic-dimensional space (30) was more effective. This example illustrates the power of dimension estimation: by identifying intrinsic dimension and choosing a basis (PCA components) for that subspace, she optimized feature representation, aligning model capacity with data structure.

Representation Compression

Compression—reducing data size while preserving essential information—is a direct application of dimension theory. Lossless compression (e.g., Huffman coding, LZ77) exploits statistical redundancy (symbol frequencies) but does not reduce dimension (the representation remains in the original space, just shorter on average). Lossy compression (e.g., JPEG for images, MP3 for audio, video codecs) explicitly reduces dimension by projecting data onto a lower-dimensional subspace or manifold, discarding components deemed “unimportant” (low variance, high frequency noise, perceptually irrelevant). From a linear algebra perspective, lossy compression is a low-rank approximation: approximate \(\mathbf{x} \in \mathbb{R}^d\) by \(\hat{\mathbf{x}} = P_k(\mathbf{x})\), the projection onto a \(k\)-dimensional subspace \(U_k \subseteq \mathbb{R}^d\), where \(k \ll d\).

The optimal subspace (minimizing reconstruction error) is given by PCA (or SVD for matrices): for centered data, the top \(k\) principal components \(\mathbf{u}_1, \dots, \mathbf{u}_k\) span the subspace \(U_k\) that maximizes variance retained (equivalently, minimizes mean squared error \(\mathbb{E}[\|\mathbf{x} - P_k(\mathbf{x})\|^2]\)). Writing \(U_k\) also for the \(d \times k\) matrix whose columns are these orthonormal vectors, the compressed representation is the coordinate vector \(\mathbf{z} = U_k^\top \mathbf{x} \in \mathbb{R}^k\) (the coordinates of the projection in the basis \(\{\mathbf{u}_1, \dots, \mathbf{u}_k\}\)), requiring \(k\) numbers instead of \(d\). Reconstruction is \(\hat{\mathbf{x}} = U_k \mathbf{z} = U_k U_k^\top \mathbf{x}\), an orthogonal projection. The compression ratio is \(d / k\) (ignoring storage of the basis itself, which can be amortized over many samples). For example, compressing 784-dimensional MNIST digits to 50 dimensions achieves \(784/50 \approx 15.7\times\) compression, with mean squared reconstruction error equal to the sum of the discarded covariance eigenvalues (so a variance threshold controls it directly: retaining 95% of the variance means the error is 5% of the total variance).
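The projection, reconstruction, and error identity can be checked numerically. This is a sketch assuming NumPy and a hypothetical correlated dataset (200 samples in \(\mathbb{R}^{10}\), compressed to \(k = 3\)); the covariance is normalized by \(n\) so that the mean squared error matches the eigenvalue sum exactly.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, k = 200, 10, 3                         # hypothetical sizes

# Correlated synthetic data, then centering (PCA assumes centered data).
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))
Xc = X - X.mean(axis=0)

# Eigendecomposition of the covariance, sorted by decreasing eigenvalue.
cov = Xc.T @ Xc / n
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

U_k = eigvecs[:, :k]                         # d x k matrix of top-k components
Z = Xc @ U_k                                 # codes: row i is z_i^T = (U_k^T x_i)^T
X_hat = Z @ U_k.T                            # reconstructions U_k U_k^T x_i

mse = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))
discarded = eigvals[k:].sum()                # sum of the d - k smallest eigenvalues
compression_ratio = d / k
```

The two quantities `mse` and `discarded` agree to rounding error, which is the identity stated above.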

Autoencoders generalize PCA to nonlinear compression: the encoder \(E: \mathbb{R}^d \to \mathbb{R}^k\) maps data to a \(k\)-dimensional latent code (coordinates in a learned basis, possibly nonlinear), and the decoder \(D: \mathbb{R}^k \to \mathbb{R}^d\) reconstructs (maps coordinates back to data space). A linear autoencoder (\(E\) and \(D\) linear) recovers PCA (the learned encoder basis is the PCA basis, up to rotation). Nonlinear autoencoders (neural networks) learn curved manifolds, achieving better compression if data lie on a nonlinear manifold. In both cases, the bottleneck dimension \(k\) is the intrinsic dimension of the representation, and the basis (linear or nonlinear) is learned from data. Dimension theory provides the conceptual framework: we seek a \(k\)-dimensional coordinate system that captures data structure, trading exact representation (\(d\) dimensions) for approximate representation (\(k\) dimensions, \(k \ll d\)) with minimal information loss.

Concrete ML example: A company stores millions of product images (512×512 RGB, \(d = 512 \times 512 \times 3 = 786432\) pixels each) for an e-commerce site. Storage costs and transmission bandwidth are prohibitive. They train a convolutional autoencoder with a 1024-dimensional bottleneck (\(k = 1024\)), achieving \(768\times\) compression (\(786432 / 1024\)). Each image is encoded as \(\mathbf{z} \in \mathbb{R}^{1024}\) (the latent code, coordinates in the learned autoencoder basis), stored/transmitted as 1024 floats, then decoded on-demand to \(\hat{\mathbf{x}} \in \mathbb{R}^{786432}\). Perceptual quality (SSIM, human ratings) is indistinguishable from the original for most images, indicating the intrinsic dimension of product images is \(\lesssim 1024\), much lower than the ambient 786432. This dramatic compression works because product images are highly structured: backgrounds are uniform, objects are centered, lighting is controlled—this structure concentrates information in a low-dimensional manifold. The autoencoder learns a basis for this manifold (implicitly: the decoder weights define a coordinate frame), enabling near-lossless compression. Compared to JPEG (a fixed basis, Discrete Cosine Transform), the learned basis adapts to the specific data distribution (product images), achieving better compression ratios at the same quality. This is dimension reduction via basis learning: the autoencoder discovers the intrinsic-dimensional subspace (or manifold) where data live and represents them compactly in that basis, drastically reducing storage while preserving perceptual information.

Identifiability in Parameter Spaces

Identifiability is the property that distinct parameter values correspond to distinct model behaviors (predictions, distributions, etc.). Non-identifiable models suffer from redundancy: multiple parameter settings yield identical outputs, making parameter estimation ambiguous (infinitely many optimal solutions) and parameter interpretation meaningless (which parameters are the “true” ones?). Basis theory clarifies identifiability by connecting it to independence: a model is identifiable if its parameter space is spanned by an independent set of parameters (a basis for the hypothesis class), ensuring unique coordinates. Conversely, redundant parameterizations (dependent parameters) cause non-identifiability.

Examples of non-identifiability: (1) Multicollinearity in regression: perfectly correlated features lead to non-unique coefficient estimates (infinitely many \(\mathbf{w}\) achieve the same fit). (2) Rotation symmetry in matrix factorization: \(UV^\top = (UQ)(Q^\top V^\top)\) for any orthogonal \(Q\), so \((U,V)\) and \((UQ, VQ^{-\top})\) are observationally equivalent (for orthogonal \(Q\), \(Q^{-\top} = Q\)). (3) Neuron permutation in neural networks: swapping two neurons in a hidden layer (together with their incoming and outgoing weights) produces an identical function, so weight matrices are identifiable at best up to permutation. (4) Scale ambiguity: in models like \(f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}\) where \(\mathbf{x}\) is itself a learned representation, scaling \(\mathbf{w} \to c\mathbf{w}\) and \(\mathbf{x} \to \mathbf{x}/c\) for any \(c \neq 0\) leaves predictions unchanged, causing non-identifiability.

Restoring identifiability requires constraints that eliminate redundancy, effectively choosing a canonical basis for the parameter space. Common approaches: (1) Remove redundant parameters: drop dependent features, fix rotation to a canonical orientation (e.g., align first basis vector with largest variance direction). (2) Orthogonality constraints: require weight matrices to have orthonormal columns, breaking rotation symmetry. (3) Sparsity or norm constraints: regularization penalties (L1, L2) bias toward a unique solution (e.g., smallest-norm solution in ridge regression). (4) Initialization and optimization dynamics: in practice, gradient descent breaks symmetries via random initialization, converging to one among many equivalent solutions—identifiability is “soft” (not guaranteed by model structure, but achieved by training procedure). Understanding non-identifiability through dimension and basis theory helps diagnose parameter estimation issues and design constraints that ensure unique, interpretable solutions.

Concrete ML example: In word embeddings (Word2Vec, GloVe), words are represented as vectors \(\mathbf{w}_i \in \mathbb{R}^d\), and similarity is measured by dot products \(\mathbf{w}_i^\top \mathbf{w}_j\) (or cosine similarity). The embedding matrix \(W \in \mathbb{R}^{V \times d}\) (vocabulary size \(V\), embedding dimension \(d\)) is learned by optimizing a loss function that depends only on dot products, not on individual coordinates. This makes \(W\) non-identifiable up to orthogonal transformations: \(W\) and \(WQ\) for any orthogonal \(Q \in \mathbb{R}^{d \times d}\) yield identical dot products (since \((WQ) (WQ)^\top = WQQ^\top W^\top = WW^\top\)). Thus, infinitely many embedding matrices \(W\) represent the same “semantic space” (all related by rotations). This non-identifiability is not a bug—it reflects that the choice of coordinate axes (basis) in embedding space is arbitrary. However, it complicates interpretation: if dimension 1 of \(W\) corresponds to “gender” for one random seed but dimension 3 for another, comparing embeddings across runs is difficult. To address this, researchers sometimes post-process embeddings to a canonical basis: align the first axis with the “gender” direction (found via PCA on gender-related word pairs), the second with location, etc., using Procrustes analysis or rotation to a fixed interpretation frame. This basis alignment restores interpretability at the cost of imposing a human-chosen coordinate system. The lesson: non-identifiability (redundant parameterizations) arises when choosing a basis (coordinate system) is part of the learning problem; making models identifiable requires selecting a basis (via constraints, post-processing, or convention).
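Both halves of this example, the invariance of dot products and the alignment of two runs, can be sketched with NumPy. The setup is hypothetical: a random \(20 \times 4\) embedding matrix stands in for one training run, and a rotated copy stands in for a second run. Alignment uses the standard orthogonal Procrustes solution, \(Q = UV^\top\) from the SVD of \(W_2^\top W_1\), which minimizes \(\|W_2 Q - W_1\|_F\) over orthogonal \(Q\).

```python
import numpy as np

rng = np.random.default_rng(6)
vocab_size, d = 20, 4                        # hypothetical sizes

W1 = rng.standard_normal((vocab_size, d))    # embeddings from "run 1"
Q_true, _ = np.linalg.qr(rng.standard_normal((d, d)))
W2 = W1 @ Q_true                             # "run 2": same geometry, rotated axes

# All pairwise dot products agree even though the coordinates differ.
same_gram = np.allclose(W1 @ W1.T, W2 @ W2.T)
same_coords = np.allclose(W1, W2)

# Orthogonal Procrustes: the orthogonal Q minimizing ||W2 Q - W1||_F.
U, _, Vt = np.linalg.svd(W2.T @ W1)
Q_align = U @ Vt
W2_aligned = W2 @ Q_align                    # run 2 expressed in run 1's axes
```

In this noiseless setup the alignment recovers the first run exactly; with real embeddings from different seeds the fit would only be approximate.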

Coordinate Choices in Neural Networks

Neural networks are compositions of linear transformations (layers) interspersed with nonlinear activations. Each linear layer \(\mathbf{h}^{(\ell)} = W^{(\ell)} \mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)}\) can be viewed as a change of basis (followed by a translation and nonlinearity): the weight matrix \(W^{(\ell)} \in \mathbb{R}^{n_\ell \times n_{\ell-1}}\) maps coordinates in the layer \(\ell-1\) basis to coordinates in the layer \(\ell\) basis. Choosing layer dimensions \(n_\ell\) and initialization (the initial basis) profoundly affects training dynamics and representation learning. From a basis perspective, each layer learns a new coordinate system for representing data: early layers discover low-level features (edges, textures), middle layers combine them into mid-level features (parts, shapes), and late layers form high-level abstractions (objects, categories). This hierarchical basis-building is the key to deep learning’s success: each layer refines the basis, progressively aligning coordinates with the task structure.

Different coordinate choices (bases) in intermediate layers affect optimization landscapes and generalization. For example, whitening (normalizing activations to have identity covariance) is a change-of-basis transformation that decorrelates features, accelerating gradient descent (similar to preconditioning in optimization). Batch normalization (standardizing activations per mini-batch) effectively rescales and shifts coordinates, stabilizing training by preventing internal covariate shift. Layer normalization and weight normalization make similar coordinate adjustments. All can be interpreted as adaptively choosing better bases during training: bases where gradients flow smoothly, parameters have similar scales, and loss surfaces are well-conditioned. Explicitly designing networks to operate in good coordinate systems (e.g., orthogonal weight matrices, using rotation-equivariant architectures) can improve convergence and generalization.

Coordinate choices also affect interpretability: in a standard basis (raw pixel inputs), neuron activations have clear meanings (brightness at specific pixels), but in learned bases (hidden layers), activations are linear combinations of learned features (e.g., “edge at 45° + medium blue + slight curvature”). Interpreting these requires understanding the learned basis (visualizing weight matrices, activation maximization, feature inversion). In some architectures (e.g., disentangled variational autoencoders), the goal is to learn a basis where each coordinate has an independent, interpretable meaning (one coordinate = “pose,” another = “lighting,” etc.)—a basis aligned with causal factors of variation, often requiring explicit regularization or structured priors.

Concrete ML example: A practitioner trains a fully connected network for MNIST classification with architecture \(784 \to 256 \to 128 \to 10\). After training, she examines the learned weight matrix \(W^{(1)} \in \mathbb{R}^{256 \times 784}\) of the first hidden layer. Each row \(W^{(1)}_{i,:}\) is a linear functional (a “detector”) that looks for a particular pattern in the input. Visualizing these rows as 28×28 images, she observes that many resemble oriented edge filters (Gabor-like patterns), stroke detectors, and blob detectors—the network has learned a basis for the image space that decomposes images into edge components. This is a change of basis from the pixel basis (where coordinates are pixel intensities) to an “edge basis” (where coordinates are edge filter responses). Subsequent layers build on this: the second layer \(W^{(2)} \in \mathbb{R}^{128 \times 256}\) combines edge responses into stroke patterns (digit parts), and the third layer \(W^{(3)} \in \mathbb{R}^{10 \times 128}\) combines strokes into digit classifiers. Each layer’s weight matrix defines a change-of-basis, progressively transforming representations from raw pixels to task-relevant abstractions (digit identities). This hierarchical basis learning is automatic (learned by backpropagation), but understanding it as a sequence of basis changes clarifies why deep networks work: they discover coordinate systems (bases) that make the classification problem linear in the final layer (digit classes are linearly separable in the learned high-level basis, even though they are not in the raw pixel basis). 
Explicitly encouraging good basis properties (orthogonality, sparsity, independence) via regularization can accelerate this learning and improve robustness, as seen in techniques like weight orthogonalization (constraining \(W^\top W \approx I\), making layers near-isometries, preserving gradient magnitudes) and sparse coding (encouraging few nonzero coordinates in learned bases, improving interpretability and noise robustness).
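As a rough illustration of the orthogonality constraint \(W^\top W \approx I\) mentioned above, the sketch below compares a random weight matrix with one that has orthonormal columns. The layer shape and the helper name `orth_penalty` are hypothetical; the penalty shown is one simple choice, not a prescribed regularizer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in = 256, 128            # hypothetical layer shape with n_out >= n_in

W_rand = rng.normal(size=(n_out, n_in)) / np.sqrt(n_in)
W_orth, _ = np.linalg.qr(rng.normal(size=(n_out, n_in)))  # orthonormal columns

def orth_penalty(W):
    """Squared Frobenius distance of W^T W from the identity."""
    k = W.shape[1]
    return np.linalg.norm(W.T @ W - np.eye(k), "fro") ** 2

x = rng.normal(size=n_in)
ratio_orth = np.linalg.norm(W_orth @ x) / np.linalg.norm(x)  # isometry: ratio is 1
ratio_rand = np.linalg.norm(W_rand @ x) / np.linalg.norm(x)  # generally not 1

print(orth_penalty(W_orth), orth_penalty(W_rand), ratio_orth, ratio_rand)
```

A matrix with orthonormal columns preserves the length of every input vector, which is the sense in which such a layer is a near-isometry.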


Worked Examples

Example 1 — Basis of \(\mathbb{R}^n\)

The standard basis of \(\mathbb{R}^3\) is \(\mathcal{B}_{\text{std}} = \{\mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3\}\) where \(\mathbf{e}_1 = (1, 0, 0)^\top\), \(\mathbf{e}_2 = (0, 1, 0)^\top\), \(\mathbf{e}_3 = (0, 0, 1)^\top\). Every vector \(\mathbf{v} = (v_1, v_2, v_3)^\top \in \mathbb{R}^3\) is uniquely expressed as \(\mathbf{v} = v_1 \mathbf{e}_1 + v_2 \mathbf{e}_2 + v_3 \mathbf{e}_3\), so the coordinates \([\mathbf{v}]_{\mathcal{B}_{\text{std}}} = (v_1, v_2, v_3)^\top\) are simply the entries themselves. Linear independence follows because if \(c_1 \mathbf{e}_1 + c_2 \mathbf{e}_2 + c_3 \mathbf{e}_3 = \mathbf{0}\), then \((c_1, c_2, c_3)^\top = (0, 0, 0)^\top\), so \(c_1 = c_2 = c_3 = 0\). Spanning is obvious: any vector lies in \(\text{span}(\mathcal{B}_{\text{std}}) = \mathbb{R}^3\). Thus \(\mathcal{B}_{\text{std}}\) is a basis, and \(\dim(\mathbb{R}^3) = 3\). The standard basis is the most natural coordinate system: it aligns with physical axes and causes coordinates to be transparent (no computation needed). However, natural doesn’t mean optimal: if data are unevenly distributed (elongated along one direction, compressed in others), the standard basis is inefficient. One might ask: what if we use an alternative basis, like \(\mathcal{B}_{\text{alt}} = \{(1, 1, 1)^\top, (1, -1, 0)^\top, (1, 0, -1)^\top\}\)? To verify it’s a basis, check independence (construct a \(3 \times 3\) matrix with \(\mathcal{B}_{\text{alt}}\) as columns, confirm full rank via non-zero determinant) and confirm spanning (any \(\mathbb{R}^3\) vector is in the span). If both conditions hold, \(\mathcal{B}_{\text{alt}}\) is also a basis, and \(\dim(\mathbb{R}^3) = 3\) remains—dimension is unchanged. Now reexpress the vector \(\mathbf{v} = (1, 2, 3)^\top\) in \(\mathcal{B}_{\text{alt}}\): solve \((1, 2, 3)^\top = c_1 (1, 1, 1)^\top + c_2 (1, -1, 0)^\top + c_3 (1, 0, -1)^\top\) for \(c_1, c_2, c_3\). 
This gives the linear system \(c_1 + c_2 + c_3 = 1\), \(c_1 - c_2 = 2\), \(c_1 - c_3 = 3\), whose solution \(c_1 = 2\), \(c_2 = 0\), \(c_3 = -1\) is the coordinate vector \([\mathbf{v}]_{\mathcal{B}_{\text{alt}}} = (2, 0, -1)^\top\), different from \([\mathbf{v}]_{\mathcal{B}_{\text{std}}} = (1, 2, 3)^\top\). The vector \(\mathbf{v}\) (as an abstract entity) is identical; only its representation changes. Common misconception: students often conflate coordinates with vectors. The coordinates \((c_1, c_2, c_3)\) depend on the basis choice; the vector \(\mathbf{v}\) does not. What-if scenario: if we have only the single vector \(\{(1, 2, 3)^\top\}\), its span is a one-dimensional subspace (a line through the origin), not all of \(\mathbb{R}^3\). The set is a basis for that line, but vectors off the line have no representation in it at all. To obtain a basis of \(\mathbb{R}^3\) we must extend the set by adding two more independent vectors.
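The basis check and the coordinate computation for \(\mathcal{B}_{\text{alt}}\) can be carried out numerically. This is a minimal sketch using NumPy; the variable names are illustrative.

```python
import numpy as np

# Basis vectors of B_alt as the columns of a matrix.
B_alt = np.column_stack([(1.0, 1.0, 1.0), (1.0, -1.0, 0.0), (1.0, 0.0, -1.0)])
v = np.array([1.0, 2.0, 3.0])

det = np.linalg.det(B_alt)          # nonzero determinant => columns form a basis
coords = np.linalg.solve(B_alt, v)  # coordinates of v with respect to B_alt

print(det)      # approximately 3
print(coords)   # approximately [ 2.  0. -1.]

# Reassembling the vector from its coordinates gives back v.
assert np.allclose(B_alt @ coords, v)
```

Solving the linear system is preferable to forming the inverse of `B_alt` explicitly, although both give the same coordinates here.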

ML relevance: Standard basis is a design choice: Raw measurements (pixel intensities, sensor readings, counts) define a standard basis, but this basis is rarely optimal for learning. Models trained directly in the raw basis often see ill-conditioned loss surfaces: features are correlated, gradients are skewed, and optimization is slow. Changing basis is a practical preprocessing decision, not just a mathematical nicety. For example, pixel-space classifiers struggle with rotated or shifted images because the basis is tied to fixed pixel locations. A more suitable basis (edges, textures) makes the problem more linear and stable.

Data-driven basis discovery: PCA, ICA, and whitening discover bases aligned with data structure. PCA rotates the basis to maximize variance along leading axes, which often correspond to meaningful modes (shape, illumination, pose). Truncating to the top components yields a low-rank approximation that preserves signal and discards noise. This is fundamental for compression (image and audio), denoising, and improving generalization when sample size is limited. PCA basis selection is also an implicit regularizer: it reduces the effective dimension, lowering variance in downstream models.

Learned bases in modern models: Neural networks learn basis transforms end-to-end. Early layers in CNNs act like oriented edge bases; deeper layers form higher-level basis vectors aligned with semantics. Embedding models (word2vec, sentence transformers) create learned bases where semantic relationships are linear, enabling simple linear probes to recover attributes (sentiment, topic). In practice, basis alignment matters: embeddings trained with different random seeds are equivalent up to rotation, so alignment (Procrustes, CCA) is needed for cross-run comparison. The lesson is consistent: choosing or learning a good basis is often the difference between a model that trains well and one that fails to generalize.

Example 2 — Non-Standard Basis Construction

Consider \(\mathbb{R}^2\) and the candidate basis \(\mathcal{B} = \{(1, 1)^\top, (1, -1)^\top\}\). To verify it’s a basis: (1) Independence: if \(c_1 (1, 1)^\top + c_2 (1, -1)^\top = (0, 0)^\top\), then \(c_1 + c_2 = 0\) and \(c_1 - c_2 = 0\), yielding \(c_1 = c_2 = 0\). ✓ (2) Spanning: for any \((a, b)^\top \in \mathbb{R}^2\), solve \(c_1 (1, 1)^\top + c_2 (1, -1)^\top = (a, b)^\top\), giving \(c_1 + c_2 = a\) and \(c_1 - c_2 = b\). Solving yields \(c_1 = (a+b)/2\) and \(c_2 = (a-b)/2\), so every vector is a linear combination of \(\mathcal{B}\). ✓ Thus \(\mathcal{B}\) is a basis. Now express \(\mathbf{v} = (3, 1)^\top\) in \(\mathcal{B}\): set \((3, 1)^\top = c_1 (1, 1)^\top + c_2 (1, -1)^\top\), solving to get \(c_1 = 2\) and \(c_2 = 1\). So \([\mathbf{v}]_{\mathcal{B}} = (2, 1)^\top\). The standard basis gives \([\mathbf{v}]_{\mathcal{B}_{\text{std}}} = (3, 1)^\top\), which differs. Geometrically, \(\mathcal{B}\) consists of a vector along the direction \((1, 1)\) (45° from the \(x\)-axis) and one along \((1, -1)\) (−45°). These are orthogonal (dot product is \(1 \cdot 1 + 1 \cdot (-1) = 0\)), making \(\mathcal{B}\) an orthonormal basis (after normalizing to unit length). Reasoning: constructing a basis from scratch requires ensuring independence and spanning via an explicit algorithm (e.g., Gram-Schmidt orthogonalization, which sequentially orthogonalizes vectors while preserving span). Interpretation: choosing non-standard bases reveals hidden structure: in this case, \(\mathcal{B}\) aligns with diagonals of the \((x, y)\)-plane, useful if data vary along those directions (e.g., ridge regression problems where features have correlation structure along diagonals). Common misconceptions: orthogonal bases are not required for a basis (orthogonality is a bonus property, not a necessity), but they simplify computations (change-of-basis matrices are orthogonal, resembling rotations rather than general invertible transformations). 
What-if scenario: if we add a third vector like \((2, 2)^\top = 2(1, 1)^\top\) to \(\{(1, 1)^\top, (1, -1)^\top\}\), the enlarged set still spans \(\mathbb{R}^2\) but becomes linearly dependent (\((2, 2)^\top\) is redundant), so recovering a basis requires dropping it (or dropping \((1, 1)^\top\) instead).
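The computations in this example can be reproduced in a few lines. The sketch below (variable names are illustrative) finds the coordinates of \((3, 1)^\top\) in \(\mathcal{B}\), repeats the computation in the normalized basis where the inverse is just the transpose, and checks that appending \((2, 2)^\top\) does not raise the rank.

```python
import numpy as np

B = np.column_stack([(1.0, 1.0), (1.0, -1.0)])   # basis vectors as columns
v = np.array([3.0, 1.0])

coords = np.linalg.solve(B, v)                   # coordinates in B: [2, 1]

# Normalizing the columns gives an orthonormal basis; then Q^{-1} = Q^T.
Q = B / np.linalg.norm(B, axis=0)
coords_q = Q.T @ v                               # [2*sqrt(2), sqrt(2)]

# Appending the redundant vector (2, 2) leaves the rank at 2.
S = np.column_stack([B, (2.0, 2.0)])
rank_with_extra = np.linalg.matrix_rank(S)

print(coords, coords_q, rank_with_extra)
```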

ML relevance: PCA is a basis choice: Principal Component Analysis constructs a basis by finding directions of maximum variance. In image compression, the leading components encode coarse structure, while trailing components capture fine details and noise. Truncating to the leading components yields a low-rank approximation, reducing storage and improving generalization when samples are limited.

Domain-specific bases: Fourier bases are ideal for periodic signals (audio, time series), wavelets for localized patterns (images, transient events), and dictionary learning bases for sparse coding. In ML pipelines, these bases appear as feature extractors (STFT, MFCCs, wavelet coefficients) and shape the downstream learning problem by making salient structure sparse or separable.

Orthonormality and optimization stability: Orthonormal bases preserve distances and angles, which helps optimization. If the basis is orthonormal, change-of-basis matrices are orthogonal, so gradients are not distorted and the loss landscape is more isotropic. This underpins techniques like whitening, orthogonal initialization, and spectral normalization, which keep learned representations well-conditioned and improve training stability.

Example 3 — Verifying Linear Independence

To check whether \(\mathbf{v}_1 = (1, 2, 1)^\top\), \(\mathbf{v}_2 = (2, 1, 0)^\top\), \(\mathbf{v}_3 = (1, -1, -1)^\top\) form a linearly independent set, we solve \(c_1 \mathbf{v}_1 + c_2 \mathbf{v}_2 + c_3 \mathbf{v}_3 = \mathbf{0}\), yielding the system: \[\begin{align} c_1 + 2c_2 + c_3 &= 0 \\ 2c_1 + c_2 - c_3 &= 0 \\ c_1 - c_3 &= 0 \end{align}\] From the third equation, \(c_1 = c_3\). Substituting into the first: \(c_1 + 2c_2 + c_1 = 0 \implies 2c_1 + 2c_2 = 0 \implies c_2 = -c_1\). Substituting both into the second: \(2c_1 - c_1 - c_1 = 0 \implies 0 = 0\) ✓. So the system is consistent with \(c_1 = t\), \(c_2 = -t\), \(c_3 = t\) for any scalar \(t\). For linear independence, we need only the trivial solution \(t = 0\) (i.e., \(c_1 = c_2 = c_3 = 0\)). Since non-trivial solutions exist, the vectors are linearly dependent, and one can be expressed as a combination of the others. Indeed: \(\mathbf{v}_1 - \mathbf{v}_2 + \mathbf{v}_3 = (1, 2, 1)^\top - (2, 1, 0)^\top + (1, -1, -1)^\top = (0, 0, 0)^\top\) ✓. Setup: to verify independence of \(n\) vectors in \(\mathbb{R}^m\), construct the \(m \times n\) matrix with vectors as columns and reduce to row echelon form via Gaussian elimination. Independence holds iff there are no free variables (every column is a pivot column). Reasoning: the rank of the matrix equals the maximum number of linearly independent vectors in the set; if \(\text{rank} < n\), dependence exists. Interpretation: dependent vectors mean at least one is redundant (superfluous for spanning), so a basis for their span has fewer than \(n\) elements. Common misconceptions: students sometimes confuse linear dependence with “almost parallel” (similar directions). Linear dependence is exact: one vector is exactly a linear combination of the others. What-if scenario: changing \(\mathbf{v}_3\) to \((1, -1, 0)^\top\) turns the third equation into \(c_1 = 0\); the first two then reduce to \(2c_2 + c_3 = 0\) and \(c_2 - c_3 = 0\), forcing \(c_2 = c_3 = 0\), so the modified set is independent.
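The rank test described in the setup can be run directly. This minimal sketch (names are illustrative) computes the rank of the matrix whose columns are \(\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3\) and reads a dependence relation off the singular value decomposition; it then repeats the rank test for the modified third vector \((1, -1, 0)^\top\).

```python
import numpy as np

A = np.column_stack([(1.0, 2.0, 1.0), (2.0, 1.0, 0.0), (1.0, -1.0, -1.0)])
rank = np.linalg.matrix_rank(A)          # 2 < 3, so the columns are dependent

# The right singular vector for the zero singular value spans the null space.
_, s, Vt = np.linalg.svd(A)
null_vec = Vt[-1] / Vt[-1][0]            # rescale so that c1 = 1

print(rank)       # 2
print(null_vec)   # approximately [ 1. -1.  1.], i.e. v1 - v2 + v3 = 0

# What-if: replace v3 by (1, -1, 0); the set becomes independent.
A_mod = np.column_stack([(1.0, 2.0, 1.0), (2.0, 1.0, 0.0), (1.0, -1.0, 0.0)])
rank_mod = np.linalg.matrix_rank(A_mod)  # 3
```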

ML relevance: Multicollinearity and identifiability: When features are linearly dependent, the design matrix \(X\) is rank-deficient, so the Gram matrix \(X^\top X\) in the normal equations is singular and \((X^\top X)^{-1}\) does not exist. This yields non-unique parameter estimates and unstable coefficients: many weight vectors fit the data equally well, so interpretability collapses.

Detection and mitigation: Practitioners use SVD, condition number, or variance inflation factors (VIF) to detect near-dependence. Remedies include feature pruning, feature aggregation, and regularization. Ridge regression adds \(\lambda I\) to \(X^\top X\), which makes the matrix invertible and yields a unique solution that approaches the minimum-norm least-squares solution as \(\lambda \to 0\); LASSO selects a sparse subset of features, effectively choosing an independent basis.
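The effect of an exactly dependent feature, and the two remedies above, can be seen on synthetic data. The sketch below is illustrative only: the data, the coefficients, and the value of `lam` are assumptions chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2                              # exactly dependent feature
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 - 1.0 * x2 + 0.1 * rng.normal(size=n)

rank = np.linalg.matrix_rank(X)           # 2, not 3
cond = np.linalg.cond(X)                  # enormous: X is numerically singular

# lstsq returns the minimum-norm least-squares solution for rank-deficient X.
w_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)

# Moving along the null direction (1, 1, -1) changes w but not the predictions.
null_dir = np.array([1.0, 1.0, -1.0])
w_other = w_min_norm + 3.0 * null_dir
same_fit = np.allclose(X @ w_min_norm, X @ w_other)

# Ridge with a small lambda (assumed value) is unique and close to w_min_norm.
lam = 1e-4
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print(rank, same_fit, w_min_norm, w_ridge)
```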

Redundancy beyond linear models: Redundant directions also appear in deep networks (similar filters or neurons). Pruning and low-rank factorization remove dependent directions without hurting predictions, improving compute efficiency and reducing overfitting. Linear independence is thus a practical compression and stability principle, not just a theoretical property.

Example 4 — Constructing a Basis from a Spanning Set

Suppose \(\mathcal{S} = \{(1, 2, 1)^\top, (2, 4, 2)^\top, (1, 0, -1)^\top, (0, 1, 0)^\top\}\) spans a subspace \(U \subseteq \mathbb{R}^3\). Observe that \((2, 4, 2)^\top = 2(1, 2, 1)^\top\), so this vector is redundant. We can extract a basis by identifying linearly independent vectors within the set. Algorithm: arrange vectors as columns of a matrix and row-reduce to echelon form: \[M = \begin{pmatrix} 1 & 2 & 1 & 0 \\ 2 & 4 & 0 & 1 \\ 1 & 2 & -1 & 0 \end{pmatrix} \xrightarrow{\text{row reduce}} \begin{pmatrix} 1 & 2 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}\] The pivot columns are columns 1, 3, and 4 of the original matrix, corresponding to \((1, 2, 1)^\top\), \((1, 0, -1)^\top\), and \((0, 1, 0)^\top\). These form a basis for \(U\): they are independent (each contributes a pivot) and they span \(U\) (the only discarded vector, \((2, 4, 2)^\top\), is a multiple of the first). Reasoning: row operations preserve the linear relations among the columns, so the pivot columns of the original matrix form a basis for its column space. Non-pivot columns are linear combinations of the pivot columns to their left, so they are redundant. Interpretation: dimension \(\dim(U) = 3\) (number of pivot columns), so here \(U = \mathbb{R}^3\). Discarding the redundant vector reduces the spanning set to a basis without losing the ability to represent \(U\). Common misconceptions: the basis consists of the pivot columns of the original matrix \(M\), not the columns of the reduced matrix (row operations change the column space). Any maximal independent subset of \(\mathcal{S}\) is a basis for \(U\), but finding one by testing subsets is expensive; row reduction automates this. What-if scenario: if we list the vectors in a different order, the pivot positions may change and a different basis may be selected, but the dimension and the space spanned remain the same.
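The pivot-column idea can be sketched without an explicit row reduction: a column is kept exactly when it is not a combination of the columns already kept, which is the same criterion that defines pivot columns. The helper `independent_columns` below is a hypothetical name, and repeated rank computations are used for clarity rather than efficiency.

```python
import numpy as np

def independent_columns(M, tol=1e-10):
    """Indices of columns that are not combinations of earlier kept columns."""
    chosen = []
    for j in range(M.shape[1]):
        candidate = M[:, chosen + [j]]
        if np.linalg.matrix_rank(candidate, tol=tol) > len(chosen):
            chosen.append(j)
    return chosen

# The spanning set S from the example, one vector per column.
M = np.column_stack([(1.0, 2.0, 1.0), (2.0, 4.0, 2.0),
                     (1.0, 0.0, -1.0), (0.0, 1.0, 0.0)])

pivots = independent_columns(M)   # zero-based indices of the kept columns
basis = M[:, pivots]              # basis vectors taken from the ORIGINAL matrix
dim = len(pivots)

print(pivots, dim)
print(basis)
```

For large or noisy matrices a rank-revealing factorization (pivoted QR or SVD) would be the more robust choice; this loop is only meant to mirror the hand computation.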

ML relevance: Feature selection as basis extraction: In high-dimensional datasets, many features are redundant. Rank-revealing QR or SVD identifies a minimal independent subset (a basis) that preserves the span of the data. This improves numerical stability, reduces overfitting, and simplifies interpretation by keeping only independent directions.

PCA as automatic basis pruning: PCA is a principled way to discard redundant directions. If the covariance matrix has low rank, PCA identifies the pivot directions (principal components) and throws out the rest, yielding a compact basis that preserves most variance. This is the standard recipe for compression and denoising in ML pipelines.

Operational impact: Removing redundant features speeds training, reduces storage, and improves generalization. In practice, feature pruning is often the first step before model selection: remove dependencies, then fit the model on the cleaned basis. This is essential in tabular ML, where collinearity is common (one-hot features, ratios, engineered metrics).

Example 5 — Dimension of Polynomial Spaces

The space \(\mathcal{P}_2\) of polynomials of degree at most 2 is \(\mathcal{P}_2 = \{a_0 + a_1 x + a_2 x^2 : a_0, a_1, a_2 \in \mathbb{R}\}\). The standard basis is \(\mathcal{B} = \{1, x, x^2\}\). Verification of basis: (1) spanning: any polynomial \(p(x) = a_0 + a_1 x + a_2 x^2\) is a combination \(a_0 \cdot 1 + a_1 \cdot x + a_2 \cdot x^2\). ✓ (2) independence: if \(c_0 + c_1 x + c_2 x^2 = 0\) for all \(x\), evaluating at \(x = 0, 1, -1\) gives \(c_0 = 0\), \(c_0 + c_1 + c_2 = 0\), and \(c_0 - c_1 + c_2 = 0\). With \(c_0 = 0\), adding the last two equations gives \(c_2 = 0\) and subtracting them gives \(c_1 = 0\). ✓ Thus \(\dim(\mathcal{P}_2) = 3\). More generally, \(\dim(\mathcal{P}_n) = n+1\) (basis: \(\{1, x, \ldots, x^n\}\), with \(n+1\) elements). The dimension grows with the degree allowed; restricting to degree exactly 2 (not “at most”) does not give a vector space (the zero polynomial has degree \(-\infty\) by convention, not 2, and the sum \(x^2 + (-x^2) = 0\) leaves the set), so dimension applies only to genuine vector spaces. Setup: a basis for a function space is often discovered by exploiting orthogonality or by recognizing a natural spanning set and verifying minimality. Reasoning: polynomial spaces are infinite-dimensional if no degree bound is imposed (\(\mathcal{P} = \mathbb{R}[x]\), all polynomials), but finite-dimensional if degree is capped. Interpretation: a polynomial of degree at most \(n\) has \(n+1\) degrees of freedom (the \(n+1\) coefficients), mirroring the dimension count. Common misconceptions: students might think all infinite-dimensional spaces have the same “size”. They do not: \(\mathcal{P}\) has a countably infinite basis \(\{1, x, x^2, \ldots\}\), whereas any algebraic (Hamel) basis of the function space \(C([0,1])\) is uncountable.
What-if scenario: if we restrict to polynomials with zero constant term, \(\{a_1 x + a_2 x^2 : a_1, a_2 \in \mathbb{R}\}\), this is a subspace of \(\mathcal{P}_2\) with basis \(\{x, x^2\}\) and dimension 2.

ML relevance: Polynomial features as capacity control: Polynomial regression expands the feature space so a linear model can fit nonlinear patterns. A degree-\(d\) polynomial yields a \((d+1)\)-dimensional hypothesis space, so choosing \(d\) is choosing model capacity. Too small \(d\) underfits; too large \(d\) overfits and becomes numerically unstable.

Multivariate explosion and kernel trick: With multiple features, the number of polynomial terms grows combinatorially (e.g., \(\binom{p+d}{d}\)). This quickly becomes infeasible, which motivates the kernel trick: polynomial kernels implicitly compute dot products in the expanded space without explicitly constructing it, enabling high-degree fits at manageable cost.

Regularization and basis choice: Regularization (ridge, LASSO) controls high-degree coefficients, effectively shrinking the model toward lower-degree subspaces. Orthogonal polynomial bases (Legendre, Chebyshev) improve numerical stability compared to monomials, especially for larger \(d\). In practice, cross-validation selects \(d\) or \(\lambda\) to balance bias and variance.
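Two of the claims above lend themselves to a quick numerical check: the combinatorial growth of the multivariate feature count, and the better conditioning of a Chebyshev basis compared with monomials. The sketch below uses assumed values (degree 15, 200 equispaced points on \([-1, 1]\), and \(p = 10\), \(d = 3\) for the count); both design matrices span the same polynomial space and differ only in the basis.

```python
import numpy as np
from math import comb
from numpy.polynomial import chebyshev as C

# Number of monomials of total degree <= d in p variables: C(p + d, d).
n_terms = comb(10 + 3, 3)                      # p = 10, d = 3 gives 286

x = np.linspace(-1.0, 1.0, 200)
d = 15
V_mono = np.vander(x, d + 1, increasing=True)  # columns 1, x, ..., x^d
V_cheb = C.chebvander(x, d)                    # columns T_0(x), ..., T_d(x)

cond_mono = np.linalg.cond(V_mono)
cond_cheb = np.linalg.cond(V_cheb)

print(n_terms)
print(f"monomial basis condition number:  {cond_mono:.3e}")
print(f"Chebyshev basis condition number: {cond_cheb:.3e}")
```

A smaller condition number means least-squares fits in that basis are less sensitive to rounding and noise, which is the practical meaning of "improved numerical stability" here.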

Example 6 — Dimension of Function Subspaces

Consider the subspace \(U = \{f \in C([0, 1]) : f(0) = 0, f(1) = 0\}\) of continuous functions on \([0, 1]\) that vanish at the boundaries. To understand its structure, note that for every continuous \(g: [0, 1] \to \mathbb{R}\), the product \(f(x) = x(1-x)g(x)\) lies in \(U\), since \(x(1-x)\) vanishes at \(x=0\) and \(x=1\). The map \(g \mapsto x(1-x)g\) is linear and injective, so \(U\) contains an isomorphic copy of \(C([0, 1])\) (the full space). (The map is not onto \(U\): for example, \(f(x) = \sqrt{x(1-x)}\) lies in \(U\), but \(f(x)/(x(1-x))\) is unbounded near the endpoints.) Intuitively, \(U\) is almost as “large” as the full space: we impose only two constraints (the values at the boundaries), which is negligible in an infinite-dimensional space. More formally, equipping \(U\) with an inner product such as \(\langle f, g \rangle = \int_0^1 f(x) g(x)\, dx\), we can use the Gram-Schmidt process to construct an orthonormal sequence in \(U\). For instance, start with polynomials \(\{x(1-x), x^2(1-x), x^3(1-x), \ldots\}\) satisfying the boundary conditions, orthogonalize them, and normalize. This yields infinitely many orthonormal (hence linearly independent) elements of \(U\). Setup: identifying bases and dimensions in infinite-dimensional spaces (like function spaces) requires functional analysis tools (Hilbert space theory); finite-dimensional techniques do not directly apply. Reasoning: \(U\) is infinite-dimensional because it contains infinitely many linearly independent vectors, so no finite set can span it. Interpretation: although constraints are imposed, the lost degrees of freedom are finitely many, leaving infinite-dimensional residual freedom. This is a hallmark of PDE and functional analysis: imposing pointwise constraints (like boundary conditions) typically preserves infinite dimension. Common misconceptions: students might think adding constraints reduces dimension proportionally (e.g., one constraint reduces dimension by 1). 
In finite dimensions, this is true (\(\dim(U) = n - \text{(number of independent linear constraints)}\)), but in infinite dimensions, adding finitely many constraints leaves the dimension infinite. What-if scenario: if instead of \(f(0) = f(1) = 0\) we impose \(f(0) = f(1) = f'(0) = 0\) on differentiable functions (more constraints, but still finitely many), the resulting subspace remains infinite-dimensional. We must impose infinitely many independent linear constraints (e.g., requiring all Fourier coefficients beyond some index to vanish) to reach a finite-dimensional subspace. ML relevance: in Gaussian process regression, the space of functions associated with a given covariance structure (kernel) is infinite-dimensional, but after conditioning on \(n\) data points the posterior mean lies in the span of the \(n\) kernel functions \(k(\cdot, x_i)\) centered at the data points, a subspace of dimension at most \(n\). This is why GPs can represent complex functions with finitely many coefficients: they leverage the infinite-dimensional function space structure while the data-dependent basis functions suffice to capture the posterior mean.
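The Gram-Schmidt construction on the polynomials \(x^k(1-x)\) can be sketched with NumPy's polynomial class, which integrates polynomials exactly. The sketch assumes the \(L^2\) inner product \(\langle f, g \rangle = \int_0^1 f g \, dx\) and stops after four functions; the helper name `inner` is illustrative.

```python
import numpy as np
from numpy.polynomial import Polynomial as P

def inner(f, g):
    """L2 inner product on [0, 1], exact for polynomials."""
    F = (f * g).integ()
    return float(F(1.0) - F(0.0))

x = P([0.0, 1.0])
start = [x**k * (1 - x) for k in range(1, 5)]   # x(1-x), x^2(1-x), ...

ortho = []
for f in start:
    for q in ortho:
        f = f - q * inner(f, q)                 # remove the component along q
    ortho.append(f / np.sqrt(inner(f, f)))      # normalize to unit length

gram = np.array([[inner(p, q) for q in ortho] for p in ortho])
boundary = np.array([[q(0.0), q(1.0)] for q in ortho])

print(np.round(gram, 8))      # approximately the 4x4 identity
print(np.round(boundary, 8))  # all zeros: every function still lies in U
```

The process never terminates for lack of new independent functions: each \(x^k(1-x)\) has higher degree than all earlier ones, which is a concrete way to see that \(U\) is infinite-dimensional.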

Example 7 — Change of Basis Matrix Derivation

Let \(\mathcal{B} = \{(1, 1)^\top, (1, -1)^\top\}\) be a basis for \(\mathbb{R}^2\), and let \(\mathcal{B}' = \{(1, 0)^\top, (0, 1)^\top\}\) be the standard basis. A vector \(\mathbf{v} \in \mathbb{R}^2\) has coordinates \([\mathbf{v}]_\mathcal{B} = (c_1, c_2)^\top\) with respect to \(\mathcal{B}\) and \([\mathbf{v}]_{\mathcal{B}'} = (a_1, a_2)^\top\) with respect to \(\mathcal{B}'\). The change-of-basis matrix \(P_{\mathcal{B} \to \mathcal{B}'} \in \mathbb{R}^{2 \times 2}\) satisfies \([\mathbf{v}]_{\mathcal{B}'} = P_{\mathcal{B} \to \mathcal{B}'} [\mathbf{v}]_\mathcal{B}\). To find \(P_{\mathcal{B} \to \mathcal{B}'}\), express each basis vector of \(\mathcal{B}\) in the basis \(\mathcal{B}'\): \((1, 1)^\top\) is already in \(\mathcal{B}'\) coordinates (being the sum of the two standard basis vectors), so \([(1, 1)^\top]_{\mathcal{B}'} = (1, 1)^\top\). Similarly, \([(1, -1)^\top]_{\mathcal{B}'} = (1, -1)^\top\). The change-of-basis matrix has these as columns: \(P_{\mathcal{B} \to \mathcal{B}'} = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}\). Now, to transform a coordinate vector \([\mathbf{v}]_\mathcal{B} = (c_1, c_2)^\top\) to \([\mathbf{v}]_{\mathcal{B}'}\), multiply: \([\mathbf{v}]_{\mathcal{B}'} = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} c_1 \\ c_2 \end{pmatrix} = \begin{pmatrix} c_1 + c_2 \\ c_1 - c_2 \end{pmatrix}\). Verification: \(\mathbf{v} = c_1 (1, 1)^\top + c_2 (1, -1)^\top = (c_1 + c_2, c_1 - c_2)^\top\) in standard coordinates, confirming \([\mathbf{v}]_{\mathcal{B}'} = (c_1 + c_2, c_1 - c_2)^\top\). ✓ Setup: the change-of-basis matrix is constructed by stacking basis vectors of the source basis, expressed in the target basis. Reasoning: matrix multiplication directly applies the linear transformation inherent in changing coordinates. Interpretation: different bases yield different coordinate representations of the same vector; the change-of-basis matrix is the conversion map. 
Common misconceptions: students sometimes confuse rows vs. columns in \(P\) or mix up the direction (from \(\mathcal{B}\) to \(\mathcal{B}'\) vs. reverse). The rule “columns are old basis vectors in new coordinates” is a reliable mnemonic. What-if scenario: if \(\mathcal{B}\) and \(\mathcal{B}'\) are both non-standard (say, two different rotations of axes), the computation is unchanged: express the vectors of \(\mathcal{B}\) using the basis \(\mathcal{B}'\) and stack them as columns. ML relevance: whitening (transforming centered data so that its covariance is the identity) is a change of basis. If \(X\) is a centered data matrix (columns are samples, rows are features), computing the covariance \(\Sigma = X X^\top / n\) and its Cholesky decomposition \(L L^\top = \Sigma\) gives the whitening matrix \(L^{-1}\): the transformed features \(L^{-1}\mathbf{x}\) have covariance \(L^{-1} \Sigma L^{-\top} = I\). (Another common choice is the symmetric inverse square root \(\Sigma^{-1/2}\); the two differ by an orthogonal factor.) The result is uncorrelated, unit-variance features. This change accelerates optimization (loss surfaces become more isotropic) and improves numerical stability.
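The rule "columns are old basis vectors in new coordinates" translates into solving \(B' P = B\), where the columns of \(B\) and \(B'\) hold the two bases in standard coordinates. The following minimal sketch (names are illustrative, and the 30° rotation is an assumed example of a non-standard target basis) covers both the standard-target case from the example and the what-if case.

```python
import numpy as np

B = np.column_stack([(1.0, 1.0), (1.0, -1.0)])   # source basis, as columns
B_prime = np.eye(2)                              # target basis: standard

# P_{B -> B'} solves B' P = B (each column: a B-vector in B' coordinates).
P = np.linalg.solve(B_prime, B)

c = np.array([2.0, 1.0])                         # coordinates in B
a = P @ c                                        # coordinates in B': [3, 1]
c_back = np.linalg.solve(P, a)                   # reverse direction uses P^{-1}

# What-if: a non-standard target basis (axes rotated by 30 degrees, assumed).
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
P2 = np.linalg.solve(R, B)

# Both coordinate vectors describe the same abstract vector.
assert np.allclose(R @ (P2 @ c), B @ c)
print(P, a, c_back)
```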

Example 8 — Coordinate Transformation in Feature Space

A machine learning practitioner has a dataset of 2D points (features \(x_1, x_2\)) drawn from a distribution elongated along the line \(y = x\) (positive correlation). Plotting reveals that variance is much larger along the diagonal \((1, 1)\) direction than along \((1, -1)\). Using the standard basis causes the covariance matrix to have off-diagonal terms, making gradient-based optimization inefficient. Applying a change of basis to \(\mathcal{B} = \{(1, 1)^\top / \sqrt{2}, (1, -1)^\top / \sqrt{2}\}\) (unit eigenvectors of the covariance matrix) decorrelates the data: in the new coordinates \((u, v)\), samples have zero covariance and variance concentrated in \(u\) (the direction of largest variance). Mathematically, if \(\Sigma = U \Lambda U^\top\) is the eigendecomposition (eigenvalues \(\Lambda\), eigenvectors \(U\)), the rotated coordinates \(U^\top \mathbf{x}\) have diagonal covariance \(\Lambda\), and the whitened features \(\mathbf{z} = \Lambda^{-1/2} U^\top \mathbf{x}\) have identity covariance. Setup: recognizing that covariance has off-diagonal structure (correlation) motivates seeking a basis that diagonalizes the covariance (an eigenbasis). Reasoning: eigendecomposition finds such a basis; eigenvectors are basis vectors where covariance is diagonal (features decorrelated), and eigenvalues quantify variance along each eigenvector direction. Interpretation: in the new coordinates, the data live in an “aligned” space (axes match principal directions), simplifying downstream tasks. Common misconceptions: whitening is not dimensionality reduction; it preserves all dimensions while changing their scales. PCA rotates to the eigenbasis (decorrelation) and, when used for dimensionality reduction, truncates the low-variance directions; whitening adds the rescaling by \(\Lambda^{-1/2}\). What-if scenario: if a feature has very low variance (small eigenvalue), whitening scales it up (dividing by a small number), potentially amplifying noise. In practice, one might use “robust whitening” or regularize small eigenvalues (add \(\lambda I\) before inverting) to avoid numerical issues. 
ML relevance: in deep learning, batch normalization applies a coordinate transformation at each layer, standardizing each activation to zero mean and unit variance (within each mini-batch). This is a per-coordinate shift and rescaling rather than a full whitening (correlations between activations remain), but it stabilizes training (its original motivation was reducing internal covariate shift), accelerates convergence, and often improves generalization. The transformation is data-dependent (statistics estimated from mini-batches) rather than fixed, making it an adaptive coordinate system that evolves during training.
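The decorrelation and whitening steps of this example can be sketched on synthetic data. Everything below is an assumption made for illustration: the sample size, the standard deviations 3.0 and 0.5 along the two diagonals, and the small regularizer `eps`. Per-feature standardization (the batch-norm-style rescaling) is included for contrast, since it fixes the scales but leaves the correlation in place.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
u = rng.normal(scale=3.0, size=n)          # large spread along (1, 1)
v = rng.normal(scale=0.5, size=n)          # small spread along (1, -1)
diag_basis = np.column_stack([(1.0, 1.0), (1.0, -1.0)]) / np.sqrt(2.0)

X = diag_basis @ np.vstack([u, v])         # rows = features, columns = samples
X = X - X.mean(axis=1, keepdims=True)      # center each feature
Sigma = X @ X.T / n

evals, U = np.linalg.eigh(Sigma)           # eigenvalues in ascending order
rotated = U.T @ X                          # decorrelated coordinates
eps = 1e-8                                 # guards against tiny eigenvalues (assumed)
Z = np.diag(1.0 / np.sqrt(evals + eps)) @ rotated   # whitened coordinates

cov_rotated = rotated @ rotated.T / n      # diagonal, entries = eigenvalues
cov_white = Z @ Z.T / n                    # approximately the identity

# Contrast: standardizing each feature separately keeps the correlation.
S = X / X.std(axis=1, keepdims=True)
corr_after_standardizing = (S @ S.T / n)[0, 1]

print(np.round(cov_rotated, 4))
print(np.round(cov_white, 4))
print(round(float(corr_after_standardizing), 3))
```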

Example 9 — Intrinsic vs Ambient Dimension in Data

Suppose a dataset consists of 1000 samples of 3D points \((x, y, z) \in \mathbb{R}^3\), but all samples lie on a 2D plane up to small measurement noise (e.g., points measured on the surface of a tabletop in 3D space). The ambient dimension is 3 (the surrounding space \(\mathbb{R}^3\)), but the intrinsic dimension is 2 (after centering, the data occupy a 2D subspace). Computing PCA on the covariance matrix reveals three eigenvalues, say \(\lambda_1 = 10\), \(\lambda_2 = 8\), \(\lambda_3 = 0.0001\). The first two eigenvalues are large (capturing about 99.999% of variance), while the third is negligible, indicating that data variance is concentrated in two directions. The corresponding eigenvectors \(\{\mathbf{u}_1, \mathbf{u}_2\}\) form a basis for the intrinsic 2D subspace. Projecting the centered data onto these two eigenvectors yields \(\tilde{\mathbf{x}} = U_2^\top \mathbf{x} \in \mathbb{R}^2\), where \(U_2 = [\mathbf{u}_1 \; \mathbf{u}_2]\) (discarding the dimension with negligible variance), essentially flattening the data with negligible information loss. Setup: identifying intrinsic dimension requires analyzing variance across directions; small eigenvalues signal low-variance (noise) directions. Reasoning: data concentration near a low-dimensional manifold (e.g., a plane, curve, or surface) is captured by PCA’s interpretation: the top eigenvectors span a low-dimensional subspace explaining most variance, while discarded eigenvectors correspond to noise orthogonal to the true data structure. Interpretation: intrinsic dimension quantifies true degrees of freedom in data; ambient dimension is a (potentially wasteful) implicit representation. Common misconceptions: intrinsic dimension is not always obvious from the data matrix shape (1000 samples in 3D could have intrinsic dimension anywhere from 1 to min(1000, 3) = 3, depending on the data). What-if scenario: if eigenvalues decay gradually (rather than having a clear elbow), discerning intrinsic dimension is harder. 
Cumulative explained variance plots (sum of leading eigenvalues / total) help: one might choose \(k\) such that the cumulative variance reaches 95% or 99%, depending on tolerance for information loss. ML relevance: autoencoders with a bottleneck of dimension \(k\) are trained to compress data, discovering a \(k\)-dimensional latent representation. If \(k\) matches the true intrinsic dimension, compression is near-lossless; if \(k\) is too small, the bottleneck underrepresents complexity (reconstruction error high); if \(k\) is too large, the model wastes capacity (no compression benefit). Cross-validation on reconstruction error guides the choice of \(k\), effectively discovering the intrinsic dimension empirically.
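The procedure in Example 9 can be sketched numerically. The following is a minimal illustration, assuming synthetic data: the plane directions, noise level, and 99% threshold are hypothetical choices, not values from the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 1000 points on a plane in R^3 plus tiny off-plane noise.
n_samples = 1000
plane_basis = np.array([[1.0, 0.0, 1.0],
                        [0.0, 1.0, 1.0]])           # two directions spanning the plane
coeffs = rng.normal(size=(n_samples, 2)) * np.array([3.0, 2.0])
X = coeffs @ plane_basis + 0.01 * rng.normal(size=(n_samples, 3))

# PCA: eigendecomposition of the sample covariance matrix.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (n_samples - 1)
eigvals, eigvecs = np.linalg.eigh(cov)              # eigh returns ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending

# Cumulative explained variance and the smallest k reaching the 99% threshold.
cumulative = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(cumulative, 0.99) + 1)

# Project onto the top-k eigenvectors and measure the reconstruction error.
U_k = eigvecs[:, :k]
Z = Xc @ U_k                                        # coordinates in R^k
rel_error = np.linalg.norm(Xc - Z @ U_k.T) / np.linalg.norm(Xc)

print("eigenvalues:", eigvals)
print("cumulative explained variance:", cumulative)
print("chosen k:", k, "relative reconstruction error:", rel_error)
```

With a clear gap in the spectrum the threshold rule selects the plane's dimension; with gradually decaying eigenvalues the chosen \(k\) would depend strongly on the threshold.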

Example 10 — Rank-Nullity in Linear Models (Preview)

Consider a linear regression task: fit \(\mathbf{y} = X \mathbf{w} + \boldsymbol{\epsilon}\) with \(X \in \mathbb{R}^{100 \times 50}\) (100 samples, 50 features) and \(\mathbf{y} \in \mathbb{R}^{100}\). The normal equations are \((X^\top X) \hat{\mathbf{w}} = X^\top \mathbf{y}\). The matrix \(X^\top X \in \mathbb{R}^{50 \times 50}\) has rank at most 50, and \(\text{rank}(X^\top X) = \text{rank}(X)\) because the two matrices have the same null space. If \(\text{rank}(X) = 50\) (full column rank), then \(X^\top X\) is invertible, yielding a unique solution \(\hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y}\). If \(\text{rank}(X) < 50\) (column deficiency due to redundant or dependent features), then \(X^\top X\) is singular. The rank-nullity theorem applies: \(\text{rank}(X) + \text{nullity}(X) = 50\). With \(\text{rank}(X) = r < 50\), we have \(\text{nullity}(X) = 50 - r\) free dimensions in the solution space. The set of solutions of \((X^\top X) \mathbf{w} = X^\top \mathbf{y}\) is an affine subspace of dimension \(50 - r\): infinitely many weight vectors, all with the same fit. Ridge regression, \(\hat{\mathbf{w}}_\lambda = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}\) with \(\lambda > 0\), restores uniqueness because \(X^\top X + \lambda I\) is always invertible; as \(\lambda \to 0^+\), \(\hat{\mathbf{w}}_\lambda\) converges to the minimum-norm element of this affine subspace. Setup: rank-nullity connects matrix structure (rank, nullity) to solution properties (uniqueness, existence). Reasoning: the rank counts independent columns (preserved information), and the nullity counts the remaining redundant directions; nullity measures the dimension of non-uniqueness. Interpretation: in machine learning, low rank in the design matrix signals exact multicollinearity (linearly dependent features), causing parameter non-identifiability (multiple weight vectors yield the same predictions). 
Common misconceptions: students sometimes think low rank is always bad. A rank-deficient or nearly rank-deficient design matrix is bad for least-squares parameter estimation (non-unique or high-variance \(\hat{\mathbf{w}}\)), but deliberately imposed low rank can help learning: it acts as a form of regularization, biasing toward simpler models with a lower-dimensional parameter space. What-if scenario: if \(n < d\) (fewer samples than features), the design matrix automatically has rank \(\leq n < d\), and least squares is underdetermined (when \(\text{rank}(X) = n\), infinitely many solutions fit the data perfectly). Regularization (ridge, LASSO) breaks ties by imposing structure (small norm, sparsity). ML relevance: the linear part of a neural network layer maps \(\mathbb{R}^{n_{\ell-1}} \to \mathbb{R}^{n_\ell}\) via \(\mathbf{h}^{(\ell)} = W^{(\ell)} \mathbf{h}^{(\ell-1)}\). If \(\text{rank}(W^{(\ell)}) < n_{\ell-1}\), the map has a nontrivial kernel, so the layer is a bottleneck (dimensionality reduction) and information is lost: inputs that differ by a kernel vector become indistinguishable. Stacking linear layers limits representational capacity: the rank of the composed map is bounded by the minimum of the individual ranks, since \(\text{rank}(AB) \leq \min(\text{rank}(A), \text{rank}(B))\), constraining the network’s ability to fit complex functions. Intentional bottlenecks (autoencoders) compress; unintentional ones (poor initialization, dead neurons) reduce expressivity and should be avoided.
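The non-uniqueness in Example 10 can be checked on a synthetic design matrix. This is a sketch under stated assumptions: the rank-45 construction, the random targets, and the value of `lam` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical design matrix: 100 samples, 50 features, where the last 5
# columns are linear combinations of the first 45 (so the rank is 45).
n, d, r = 100, 50, 45
X_indep = rng.normal(size=(n, r))
X = np.hstack([X_indep, X_indep @ rng.normal(size=(r, d - r))])
y = rng.normal(size=n)

rank = np.linalg.matrix_rank(X)
nullity = d - rank                        # rank-nullity: rank + nullity = d

# Minimum-norm least-squares solution (lstsq uses an SVD-based solver).
w_min, *_ = np.linalg.lstsq(X, y, rcond=None)

# Adding a null-space vector changes the weights but not the predictions.
_, _, Vt = np.linalg.svd(X)
null_vec = Vt[-1]                         # right singular vector with sigma ~ 0
w_other = w_min + 3.0 * null_vec

# Ridge regression has a unique solution for any lam > 0; for small lam it
# lies close to the minimum-norm least-squares solution.
lam = 1e-6
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print("rank:", rank, "nullity:", nullity)
print("prediction gap:", np.linalg.norm(X @ w_min - X @ w_other))
print("ridge vs min-norm:", np.linalg.norm(w_ridge - w_min))
```

Both `w_min` and `w_other` produce the same predictions, which is the parameter non-identifiability described above; only the norm distinguishes them.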

Example 11 — Isomorphic Vector Spaces

Two vector spaces are isomorphic if there exists a linear bijection (one-to-one and onto, preserving addition and scaling) between them. A key theorem: all \(n\)-dimensional vector spaces over a field \(\mathbb{F}\) are isomorphic to \(\mathbb{F}^n\). Example: the space \(\mathcal{P}_2\) of degree-\(\leq 2\) polynomials and \(\mathbb{R}^3\) are isomorphic. An explicit isomorphism is the coordinate map \(\phi: \mathcal{P}_2 \to \mathbb{R}^3\) sending \(p(x) = a_0 + a_1 x + a_2 x^2 \mapsto (a_0, a_1, a_2)^\top\). This map is linear (addition and scaling are preserved: \(\phi(p + q) = \phi(p) + \phi(q)\), \(\phi(c p) = c \phi(p)\)), bijective (every polynomial maps to a unique triple, and every triple comes from a unique polynomial), and dimension-preserving (\(\dim(\mathcal{P}_2) = \dim(\mathbb{R}^3) = 3\)). Consequently, any purely linear property of \(\mathbb{R}^3\) (linear independence, span, rank) translates to \(\mathcal{P}_2\) via \(\phi\). Properties that depend on an inner product (orthogonality, lengths, angles) translate only once the inner product is transported as well. For instance, if the inner product on \(\mathcal{P}_2\) is \(\langle p, q \rangle = \int_0^1 p(x) q(x) \, dx\), the corresponding inner product on \(\mathbb{R}^3\) (via the coordinate map) is \(\langle (a_0, a_1, a_2), (b_0, b_1, b_2) \rangle = \int_0^1 (a_0 + a_1 x + a_2 x^2)(b_0 + b_1 x + b_2 x^2) \, dx\), which is not the standard dot product. Setup: isomorphism is a classification tool: it groups vector spaces with identical algebraic structure, regardless of their “material” (polynomials, matrices, tuples, functions). Reasoning: dimension is the complete invariant (the only thing distinguishing spaces up to isomorphism), so two finite-dimensional spaces over the same field are isomorphic iff they have the same dimension. Interpretation: isomorphic spaces are “the same” from a linear algebra perspective; choosing an isomorphism is choosing a representation (a basis). Common misconceptions: isomorphic does not mean identical; \(\mathcal{P}_2\) and \(\mathbb{R}^3\) are different as sets (one consists of polynomials, the other of tuples), but the same in linear structure. 
What-if scenario: if we change the inner product on \(\mathcal{P}_2\) (e.g., to \(\langle p, q \rangle = \sum_{i=0}^2 a_i b_i\), which compares coefficients rather than integrating), the isomorphism \(\phi\) still works, but the induced inner product on \(\mathbb{R}^3\) changes (here it becomes the standard dot product). This illustrates that structure beyond dimension matters for applications. ML relevance: learned representations are loosely analogous to coordinate maps from raw data space to a representation space, although networks are nonlinear and generally not bijective, so the analogy to isomorphism is heuristic. The hidden layer activations \(\mathbf{h}^{(\ell)}\) act as coordinates for the features that layer can represent. Choosing the dimension and basis (via architecture and training) affects expressivity: too low a dimension (few hidden units) tends to underfit, while too high a dimension (many units) can overfit without regularization. Isomorphism theory says all \(n\)-dimensional representation spaces are equivalent as vector spaces, so their utility depends on how well the learned basis aligns with the problem structure (which basis is good for the task).
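The induced inner product in Example 11 can be made explicit. Since \(\int_0^1 x^{i+j} \, dx = 1/(i+j+1)\), the integral inner product on \(\mathcal{P}_2\) corresponds to \(\mathbf{a}^\top G \mathbf{b}\) on \(\mathbb{R}^3\), where \(G\) is the Gram matrix of the monomial basis. The sketch below checks this numerically; the polynomials `p` and `q` and the helper names are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Gram matrix of the monomial basis {1, x, x^2} under <p, q> = integral_0^1 p q dx.
# Entry (i, j) is integral_0^1 x^(i+j) dx = 1 / (i + j + 1).
G = np.array([[1.0 / (i + j + 1) for j in range(3)] for i in range(3)])

def coords(p):
    """Coordinate map phi: coefficients (a0, a1, a2) of a polynomial of degree <= 2."""
    c = np.zeros(3)
    c[:len(p.coef)] = p.coef
    return c

def integral_inner(p, q):
    """Inner product on P_2, computed directly by integrating p * q over [0, 1]."""
    antiderivative = (p * q).integ()
    return antiderivative(1.0) - antiderivative(0.0)

p = Polynomial([1.0, 2.0, -1.0])    # 1 + 2x - x^2
q = Polynomial([0.0, 1.0, 3.0])     # x + 3x^2

lhs = integral_inner(p, q)          # computed in P_2
rhs = coords(p) @ G @ coords(q)     # computed in R^3 via the induced inner product
plain_dot = coords(p) @ coords(q)   # standard dot product: a different inner product

print(lhs, rhs, plain_dot)
```

The first two numbers agree while the third differs, which shows that the coordinate map preserves linear structure but carries the integral inner product to a non-standard one.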

Example 12 — Basis Choice and Model Interpretability

A cancer diagnosis model uses 10 genomic features (gene expression levels). In the standard basis (raw expression values), a trained linear classifier has weights \(\mathbf{w} = (0.3, -0.1, 0.05, \ldots)^\top\): it weights genes individually. Interpretation: “Gene 1 increases risk by 0.3 units.” But raw expression values correlate strongly (genes in similar pathways co-express), making individual gene coefficients unstable (small changes in data cause large changes in \(\mathbf{w}\), i.e., high variance). Applying PCA, we find that 90% of variance is captured by 3 principal components (directions of maximum variance in the gene co-expression structure). Retraining the classifier in the PC basis, with \(\mathbf{z} = U^\top \mathbf{x} \in \mathbb{R}^3\) (where \(U \in \mathbb{R}^{10 \times 3}\) is the matrix of top-3 eigenvectors), the model becomes \(\hat{y} = (\mathbf{w}^{\text{PC}})^\top \mathbf{z}\) with \(\mathbf{w}^{\text{PC}} = (w_1^{\text{PC}}, w_2^{\text{PC}}, w_3^{\text{PC}})^\top\). Interpretation: “PC1 (a weighted combination of correlated genes in pathway A) increases risk, PC2 (genes in pathway B) decreases risk, PC3 (genes in pathway C) has a small effect.” The basis change achieves two goals: (1) Reduced variance: fewer parameters (\(\mathbf{w}^{\text{PC}} \in \mathbb{R}^3\) vs. \(\mathbf{w} \in \mathbb{R}^{10}\)), making estimates more stable (and often lowering test error). (2) Improved interpretability: when the top PCs align with known gene co-expression patterns, they correspond to biological pathways, making the model’s decision rule more transparent. The trade-off: when they do not align, a PC weight is harder to interpret than a gene weight (a PC is a weighted combination of genes, not a single gene). Setup: basis choice affects both numerical stability and interpretability; standard bases (raw features) are not optimal for all tasks. Reasoning: bases aligned with data structure (e.g., PCA) reduce parameter variance; bases aligned with domain knowledge (e.g., biological pathways) improve interpretability. 
Interpretation: the same function (classifier) has different representations in different bases; choosing a basis is choosing a language for expressing the model. Common misconceptions: students might assume raw features are always the right basis. In practice, feature engineering (including choosing a good basis) often determines model performance more than algorithm choice. What-if scenario: if the domain expert specifies 10 interpretable pathways (basis vectors), we can project data into that basis and train in that space, sacrificing some expressivity (if the expert basis does not align perfectly with data variance) for maximal interpretability. ML relevance: explainable AI and model transparency depend on the representation (basis choice). Models with complex nonlinear relationships between raw features are hard to interpret; models in a basis aligned with domain concepts (extracted via PCA, clustering, or expert knowledge) are more transparent. Autoencoders and variational autoencoders learn a basis (the decoder’s weights) that balances reconstruction accuracy (aligning with data structure) and interpretability (if regularized to have independent factors, e.g., disentangled autoencoders). Choosing the right basis is fundamental to building models that are not only accurate but also understandable.


Summary

Key Ideas Consolidated

This chapter established dimension as the fundamental intrinsic property of finite-dimensional vector spaces, defined as the size of any basis (and proved invariant via the Dimension Theorem). The Basis Extension Theorem guarantees existence of bases: start with any independent set and repeatedly add vectors until spanning is achieved. The Replacement Theorem (Steinitz Exchange Lemma) provides the mechanism: given an independent set and a spanning set, the independent set has at most as many elements, enabling precise control over dimension as bases are incrementally “exchanged.” Together, these theorems establish that dimension is well-defined and quantifies degrees of freedom. The coordinate map (for a fixed basis) is a linear isomorphism to \(\mathbb{R}^n\), making abstract vectors concrete as \(n\)-tuples. Change-of-basis matrices (whose columns are old basis vectors in new coordinates) connect different representations of the same vector: a unification of geometry (vectors as abstract entities) and algebra (vectors as tuples). The rank-nullity theorem (\(\dim(V) = \dim(\ker T) + \dim(\text{im} T)\) for a linear map \(T: V \to W\)) partitions domain dimension into annihilated (kernel) and preserved (image) directions, governing solution existence (does \(\mathbf{b} \in \text{im}(T)\)?), uniqueness (is \(\ker(T) = \{\mathbf{0}\}\)?), and information flow (ranks compose). Dimension bounds capacity: \(n\)-dimensional parameter spaces have \(n\) degrees of freedom.

What the Reader Should Now Be Able To Do

Upon completing this chapter, you should be able to:

Theoretical Competencies:

  1. Verify a basis: given a finite set of vectors, determine if it is linearly independent and spanning (or equivalently, a minimal spanning set, or a maximal independent set) via Gaussian elimination (row reduction).

  2. Compute coordinates: express a vector as a linear combination of basis vectors by solving a linear system, yielding its coordinate representation with respect to that basis.

  3. Change bases: given coordinates in one basis, compute coordinates in another using the change-of-basis matrix, or equivalently, transform the coordinate vector via matrix multiplication.

  4. Find dimension: for a subspace (given by a spanning set or as the null space, column space, or solution set of a linear system), compute the dimension by identifying a basis (via row reduction or eigendecomposition) and counting its size.

  5. Diagnose rank and nullity: for a matrix \(A \in \mathbb{R}^{m \times n}\), determine its rank (dimension of column space) and nullity (dimension of null space), verify the rank-nullity identity \(n = \text{rank}(A) + \text{nullity}(A)\), and interpret the ranks in terms of solution uniqueness/existence in \(A\mathbf{x} = \mathbf{b}\).
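Several of the competencies above (verifying a basis, computing coordinates, changing bases, and checking rank-nullity) can be sketched in a few lines of NumPy. The bases `B` and `B_new`, the vector `v`, and the matrix `A` are hypothetical examples chosen for illustration.

```python
import numpy as np

# Two hypothetical bases of R^3, stored as columns.
B = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])
B_new = np.array([[2.0, 0.0, 0.0],
                  [0.0, 1.0, 1.0],
                  [0.0, 1.0, -1.0]])

# Verify both are bases: three vectors in R^3 form a basis iff the matrix has rank 3.
assert np.linalg.matrix_rank(B) == 3 and np.linalg.matrix_rank(B_new) == 3

v = np.array([3.0, 2.0, 1.0])

# Coordinates: solve B c = v for c = [v]_B.
v_B = np.linalg.solve(B, v)

# Change-of-basis matrix from B to B_new: column j holds the B_new-coordinates
# of the j-th vector of B, so P solves B_new P = B.
P = np.linalg.solve(B_new, B)
v_new = P @ v_B

# Rank and nullity of a 3 x 4 matrix whose third row is the sum of the first two.
A = np.array([[1.0, 2.0, 3.0, 0.0],
              [0.0, 1.0, 1.0, 1.0],
              [1.0, 3.0, 4.0, 1.0]])
rank = np.linalg.matrix_rank(A)
nullity = A.shape[1] - rank               # rank-nullity: n = rank + nullity

print("[v]_B     =", v_B)
print("[v]_B_new =", v_new)
print("rank, nullity:", rank, nullity)
```

Multiplying `B_new @ v_new` recovers `v`, confirming that the two coordinate vectors describe the same abstract vector.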

Practical Competencies:

  1. Recognize isomorphic spaces: identify when two finite-dimensional vector spaces have the same dimension (hence are isomorphic, structurally equivalent) and leverage isomorphism to translate properties between them.

  2. Apply dimension reasoning to solve multicollinearity in regression: diagnose dependent columns in feature matrices through rank analysis, understand solution non-uniqueness through the dimension of the null space, and use rank-nullity reasoning to determine when least-squares solutions are unique or require regularization.

  3. Use dimension concepts to design matrix factorizations: choose latent dimension in factorizations (e.g., \(R \approx UV^\top\) in recommender systems) and autoencoders by reasoning about intrinsic versus ambient dimension, and predict reconstruction fidelity based on how chosen latent dimension compares to the data’s intrinsic dimensionality.

  4. Design neural network architectures by reasoning about layer capacity: set layer widths and bottleneck sizes using dimension theory, understanding how rank constraints in composed linear layers limit information flow and when dimension collapse (low-rank intermediate layers) constrains downstream representational capacity.

  5. Select and validate features based on dimension and independence: identify redundant features through rank analysis of the feature matrix, choose features that form a basis aligned with problem structure and task-relevant subspace, and validate that the feature matrix achieves the expected rank for full expressivity.

Structural Assumptions for Later Chapters

This chapter builds on prior foundational knowledge and makes assumptions for future extensions:

Assumptions from Earlier Chapters (Prerequisite Knowledge):

  • Chapter 1 (Vector Spaces and Subspaces): Definitions of vector spaces, subspaces, linear independence, spanning sets, and linear combinations form the foundation; this chapter constructs basis and dimension from those primitives.

  • Linear systems and Gaussian elimination: Row reduction, row echelon form, solution structure for homogeneous and non-homogeneous systems.

Structural Assumptions Made in This Chapter:

  1. Finite-dimensionality: All spaces are assumed isomorphic to \(\mathbb{R}^n\) for some finite \(n\). Infinite-dimensional spaces (function spaces, Hilbert spaces) are excluded; their rigorous treatment requires functional analysis and topology.

  2. Concrete coordinate representations: We work with finite-dimensional subspaces of \(\mathbb{R}^m\) represented via spanning sets or linear constraints. Abstract spaces are understood through coordinate maps to \(\mathbb{R}^n\).

  3. Finitely generated subspaces: All subspaces are generated by finite spanning sets, ensuring they possess finite dimension and computable bases via Gaussian elimination.

Assumptions for Later Chapters (Forward Requirements):

  • Chapter 3 (Linear Transformations and Matrices): Dimension becomes fully computational, representing transformations as matrices and showing how bases transform under change-of-basis (similarity transformations).

  • Chapter 4 (Determinants and Inverses): Invertibility is characterized by full rank (\(\text{rank}(A) = n\)); determinants measure volume scaling with \(\det(A) = 0\) signaling dimension collapse.

  • Chapter 5 (Eigenvalues and Eigenvectors): Eigenbases diagonalize matrices; each eigenspace has dimension (geometric multiplicity), and diagonalizability requires total geometric multiplicity equal to \(n\).

  • Chapter 6 (Symmetry, Orthogonality, and Spectrum): The Spectral Theorem guarantees orthonormal eigenbases for symmetric matrices, enabling efficient spectral computation and geometric insights via eigenvalue decay.

  • Chapters 7–15 (Decompositions and Applications): All subsequent work depends on dimension and rank as organizing principles. The SVD reveals rank and enables low-rank approximation; optimization uses dimension to analyze gradient flow; ML applications treat dimension as capacity constraint.

Limitations and Caveats Acknowledged:

  • Row reduction yields one of many possible bases: Gaussian elimination produces a particular basis determined by pivoting strategy, not a geometrically “optimal” basis. Other bases (orthonormal, eigenbases, sparsity-aligned) may better serve specific tasks, and basis choice profoundly affects downstream computation and interpretation.

  • Dimension is invariant but basis choice is deeply problem-dependent: While dimension (the count) is unique, the basis elements themselves are not canonical. Different choices of basis reveal different structures—selecting the right basis often determines whether a problem is tractable or intractable.

  • Rank estimation is numerically fragile: In practice, rank is determined by thresholding singular values near machine precision. Small perturbations create ambiguity about whether near-zero singular values are truly zero or numerical noise, and numerical linear algebra requires careful conditioning analysis.

  • Intrinsic data dimensionality is difficult to determine empirically: While PCA, manifold learning, and information-theoretic measures estimate intrinsic dimension, estimates depend critically on sample size, noise level, and measurement choice. Real data dimensionality is often scale-dependent or ambiguous.
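The caveat about numerically fragile rank estimation can be illustrated with a small experiment. This is a sketch, assuming a synthetic rank-3 matrix and a hypothetical noise level of \(10^{-8}\); `numerical_rank` is an illustrative helper, not a library function.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical matrix of exact rank 3, then perturbed by small noise.
A_exact = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 20))
A_noisy = A_exact + 1e-8 * rng.normal(size=A_exact.shape)

def numerical_rank(A, rel_tol):
    """Count singular values above rel_tol times the largest singular value."""
    s = np.linalg.svd(A, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

s_noisy = np.linalg.svd(A_noisy, compute_uv=False)
print("leading singular values:", s_noisy[:5])

# The answer depends on the threshold: the noise makes the matrix full rank,
# but a tolerance above the noise level recovers rank 3.
print("rank at rel_tol=1e-12:", numerical_rank(A_noisy, 1e-12))
print("rank at rel_tol=1e-6: ", numerical_rank(A_noisy, 1e-6))
print("matrix_rank default:  ", np.linalg.matrix_rank(A_noisy))
```

The default tolerance of `np.linalg.matrix_rank` is tied to machine precision, so a matrix with measurement noise usually needs an explicit tolerance chosen from the expected noise level.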


In Context

Algorithmic Development History

The theory of basis and dimension emerged from the study of linear independence in 19th and early-20th century algebra. Hermann Grassmann (1844) introduced the concept of extension (Ausdehnung, a predecessor to the modern vector space) and explored the dimension of linear systems, recognizing that a set of linear equations defines a lower-dimensional subspace in a higher-dimensional ambient space. Grassmann’s work was radical for its time, proposing abstract algebraic structures before linear algebra terminology was standardized; his insights (expressed in geometric language) anticipated dimension as an invariant and the role of maximal independent sets.

The Steinitz Exchange Lemma (Ernst Steinitz, 1912) formalized the relationship between spanning sets and independent sets with precision: if \(\mathcal{I} = \{\mathbf{v}_1, \ldots, \mathbf{v}_m\}\) is linearly independent and \(\mathcal{S} = \{\mathbf{w}_1, \ldots, \mathbf{w}_n\}\) spans a vector space \(V\), then \(m \leq n\). The lemma’s constructive proof (the “exchange” mechanism) shows how to iteratively replace elements of the spanning set with independent vectors while maintaining the span, giving a practical algorithm for extracting bases from spanning sets. This lemma is the engine of the Basis Extension Theorem and Dimension Theorem: apply it to two bases to prove they have equal size. With this tool in hand, dimension became rigorous: it is not a “guess” about degrees of freedom, but a provably invariant property of finite-dimensional spaces.

Development of finite-dimensional theory (1910s–1930s) proceeded through the work of Steinitz, Schreier, B. L. van der Waerden, and others. By the 1940s, linear algebra had crystallized into a unified framework: vector spaces, subspaces, linear transformations, and dimension as the central organizing principle. David Hilbert and his students extended these ideas to infinite-dimensional spaces (Hilbert spaces, function spaces), revealing that many results (orthogonality, eigenvalues, spectral theory) generalize dramatically, while dimension becomes subtle (infinite in multiple senses—countable vs. uncountable—and requiring topological tools). The distinction between finite and infinite dimensions became a dividing line in 20th-century mathematics: finite-dimensional theory is algebraic and geometric; infinite-dimensional theory requires functional analysis.

Linear algebra’s role in numerical analysis became paramount with the advent of computing (1940s onward). John von Neumann and Herman Goldstine pioneered analysis of numerical stability in solving linear systems, recognizing that matrix condition number (roughly, the ratio of largest to smallest singular values—a geometric property related to how a matrix distorts volumes and distances) governs rounding error growth. Dimension enters crucially: ill-conditioned problems (large condition number) arise when the matrix has rank near the threshold of numerical precision (nearly singular), meaning the effective dimension is lower than the nominal rank, and small perturbations cause large solution changes. Gaussian elimination, LU decomposition, QR decomposition, and the SVD all emerged from numerical analysis demands, and all are founded on dimension and basis concepts.

Emergence in statistical learning (1960s–present) saw dimension theory become central to machine learning. Principal Component Analysis (Pearson, 1901; Hotelling, 1933) exploits the fact that data often concentrate near low-dimensional subspaces; computing the eigenbasis of the covariance matrix identifies these subspaces (directions sorted by variance). The curse of dimensionality (many algorithms become intractable in high dimensions, requiring exponentially more samples) motivated dimensionality reduction. Statistical learning theory (Vapnik, Chervonenkis, Blumer, Ehrenfeucht, and others, 1960s–1980s) formalizes the trade-off: for a hypothesis class of VC dimension \(d\) trained on \(n\) samples, the gap between training and generalization error is bounded by roughly \(O(\sqrt{d \log(n) / n})\), so the number of samples needed grows about linearly in \(d\) (for linear classifiers the VC dimension equals the number of parameters, counting the bias). High dimension therefore means high sample complexity, justifying regularization (reducing effective dimension) and dimensionality reduction (choosing a low-dimensional basis). Modern deep learning is often described as hidden layers progressively discovering low-dimensional representations (learned bases) that compress input complexity while preserving task-relevant information. Dimension theory, married with statistics and optimization, now underpins modern machine learning.


Why This Matters for ML

Dimensionality as Capacity

A vector space of dimension \(n\) has exactly \(n\) degrees of freedom: once a basis is fixed, specifying \(n\) coordinates uniquely determines an element of the space. In machine learning, this translates directly to model capacity: a parameter space of dimension \(n\) (e.g., weights \(\mathbf{w} \in \mathbb{R}^n\) in linear regression) has capacity \(n\). A linear model with \(n\) parameters can satisfy up to \(n\) linearly independent constraints, so it can interpolate arbitrary targets at up to \(n\) data points whose feature vectors are linearly independent. However, this capacity is a double-edged sword: with \(n\) parameters and \(m\) data points, zero training error is achievable for every choice of targets exactly when the \(m \times n\) design matrix has rank \(m\), which requires \(n \geq m\); in that regime the model can fit noise, risking overfitting. The bias-variance trade-off is fundamentally about dimension: low-capacity models (small \(n\)) have high bias (cannot fit complex functions) but low variance (estimates are stable); high-capacity models (large \(n\)) have low bias (can fit anything) but high variance (estimates are noisy). The “Goldilocks” dimension balances both, and finding it requires (in practice) cross-validation or theory-driven choices (e.g., regularization parameter \(\lambda\) in ridge regression, which implicitly reduces effective dimension by penalizing large norms).
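The link between parameter count and interpolation can be checked on random data. The sketch below assumes Gaussian random features (which have full rank with probability 1) and arbitrary random targets; `training_residual` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(3)

def training_residual(m, n):
    """Fit m random targets with an n-parameter linear model; return ||Xw - y||."""
    X = rng.normal(size=(m, n))           # random features, full rank with probability 1
    y = rng.normal(size=m)                # arbitrary targets
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.linalg.norm(X @ w - y)

res_under = training_residual(m=20, n=30)  # n > m: interpolation is possible
res_exact = training_residual(m=20, n=20)  # n = m: a unique interpolant
res_over = training_residual(m=20, n=10)   # n < m: generic targets cannot be fit

print(res_under, res_exact, res_over)
```

The first two residuals are zero up to rounding, while the third is not: with fewer parameters than points, the targets generally lie outside the column space of the design matrix.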

Dimensionality reduction (PCA, ICA, autoencoders) exploits low intrinsic dimension: if data lie near a \(k\)-dimensional subspace (intrinsic dimension \(k\)) embedded in \(\mathbb{R}^d\) (ambient dimension \(d\), with \(k \ll d\)), projecting onto a \(k\)-dimensional basis captures most information while discarding noise and compressing storage by factor \(d/k\). The key insight: working in a low-dimensional basis (learned from data) enables learning efficient models without overfitting (since the true model dimension is \(k\), a model with dimension \(\approx k\) suffices). This is why PCA as a preprocessing step for neural networks or classifiers often improves generalization: it removes noise-laden high-dimensional directions, focusing the model on essential structure.

Coordinate Choice and Optimization Geometry

The loss landscape (the function \(L(\mathbf{w})\) mapping parameters to error) depends critically on the coordinate system (basis choice) for the parameter space. In the standard basis (raw parameters), loss surfaces are often poorly conditioned: the Hessian has disparate eigenvalues, meaning the function is steep in some directions and flat in others (e.g., a long, narrow valley). Gradient descent with fixed step size struggles on such landscapes: small steps along flat directions waste computation, while large steps along steep directions overshoot. Preconditioning (an optimization technique) is a coordinate change: instead of updating \(\mathbf{w} \gets \mathbf{w} - \eta \nabla L(\mathbf{w})\), one updates \(\mathbf{w} \gets \mathbf{w} - \eta M^{-1} \nabla L(\mathbf{w})\) where \(M\) is a preconditioner (roughly, an approximation to the Hessian that makes the effective loss landscape more isotropic—equal curvature in all directions). Mathematically, this is a change of basis in parameter space: the new coordinates align with the Hessian’s eigenvectors (principal curvatures), making the landscape more symmetric and gradient descent more efficient.
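The effect of preconditioning can be seen on a two-parameter quadratic. This is a minimal sketch with assumed values: the Hessian `H`, the step sizes, and the choice \(M = \text{diag}(H)\) are illustrative, and for this diagonal example the preconditioner is exact, which real problems rarely allow.

```python
import numpy as np

# Hypothetical ill-conditioned quadratic loss L(w) = 0.5 * w^T H w, minimized at w = 0.
H = np.array([[100.0, 0.0],
              [0.0, 1.0]])                # curvature 100 in one direction, 1 in the other

def grad(w):
    """Gradient of the quadratic loss."""
    return H @ w

def descend(w0, step, M_inv, n_steps):
    """Run w <- w - step * M_inv @ grad(w) and return the final iterate."""
    w = w0.copy()
    for _ in range(n_steps):
        w = w - step * (M_inv @ grad(w))
    return w

w0 = np.array([1.0, 1.0])

# Plain gradient descent: the step must stay below 2/100 for stability,
# so progress along the flat direction (curvature 1) is slow.
w_plain = descend(w0, step=0.019, M_inv=np.eye(2), n_steps=50)

# Preconditioned descent with M = diag(H): in the rescaled coordinates the
# curvature is 1 in every direction, so a large step is safe.
M_inv = np.diag(1.0 / np.diag(H))
w_precond = descend(w0, step=0.9, M_inv=M_inv, n_steps=50)

print("plain:         ", np.linalg.norm(w_plain))
print("preconditioned:", np.linalg.norm(w_precond))
```

After the same number of steps the preconditioned iterate is far closer to the minimizer, because the step size is no longer dictated by the steepest direction alone.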

Feature normalization and whitening are coordinate changes on the input space: standardization centers each feature and rescales it to unit variance, while whitening additionally removes correlations so that the covariance matrix becomes the identity. Raw features often have different scales and correlations, making learning inefficient (with a single step size, gradient descent moves too slowly in some directions and overshoots in others). Whitening (via the eigendecomposition of the covariance matrix) rotates and rescales to a basis where features are uncorrelated with equal variance; for a linear least-squares model on centered features, the Hessian of the loss with respect to the weights is the feature covariance up to scaling, so whitening makes it well conditioned, and gradient descent converges faster with less tuning. Batch normalization in deep networks applies this idea layer-wise: standardizing activations at each layer to zero mean and unit variance (within mini-batches) effectively rescales the coordinates of hidden representations, stabilizing training and accelerating convergence. The fact that standardization helps so much illustrates the theme of this section: better-conditioned parameterizations (bases aligned with Hessian structure) enable efficient optimization.
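Whitening as a change of coordinates can be sketched as follows. The mixing matrix and sample size are hypothetical, and PCA whitening is only one of several whitening transforms (any rotation of the result is also white).

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical correlated features with very different scales.
mixing = np.array([[10.0, 0.0],
                   [3.0, 0.1]])
X = rng.normal(size=(5000, 2)) @ mixing.T

# Center, then eigendecompose the sample covariance: cov = U diag(lam) U^T.
mean = X.mean(axis=0)
Xc = X - mean
cov = Xc.T @ Xc / (len(X) - 1)
lam, U = np.linalg.eigh(cov)

# PCA whitening: rotate onto the eigenvectors, then rescale each axis to unit variance.
W = np.diag(1.0 / np.sqrt(lam)) @ U.T
Z = Xc @ W.T

cov_white = Z.T @ Z / (len(Z) - 1)
print("condition number before:", np.linalg.cond(cov))
print("covariance after whitening:\n", cov_white)
```

The whitened covariance is the identity up to rounding, so its condition number is 1, compared with a condition number in the thousands for the raw features.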

This chapter prepared the algebraic foundation that Chapter 3 will make computational. A linear map \(T: V \to W\) between finite-dimensional spaces is fully determined by its values on basis vectors: knowing \(T(\mathbf{b}_i)\) for all basis vectors \(\{\mathbf{b}_1, \ldots, \mathbf{b}_n\}\) of \(V\) determines \(T\) everywhere (since any \(\mathbf{v} = \sum_i c_i \mathbf{b}_i\) maps to \(T(\mathbf{v}) = \sum_i c_i T(\mathbf{b}_i)\)). Choosing bases \(\mathcal{B}\) for \(V\) and \(\mathcal{C}\) for \(W\), we represent \(T\) as a matrix \([T]_{\mathcal{B}, \mathcal{C}}\): the \(j\)-th column is \([T(\mathbf{b}_j)]_{\mathcal{C}}\) (the coordinates of the image of the \(j\)-th basis vector, expressed in the basis \(\mathcal{C}\) of the codomain). Changing bases multiplies this matrix by change-of-basis matrices on each side: \([T]_{\mathcal{B}', \mathcal{C}'} = P_{\mathcal{C} \to \mathcal{C}'} \, [T]_{\mathcal{B}, \mathcal{C}} \, P_{\mathcal{B}' \to \mathcal{B}}\), where \(P_{\mathcal{B}' \to \mathcal{B}} = (P_{\mathcal{B} \to \mathcal{B}'})^{-1}\) converts \(\mathcal{B}'\)-coordinates into \(\mathcal{B}\)-coordinates. When \(V = W\) and the same basis is used for domain and codomain, this reduces to a similarity transformation \(Q^{-1} [T]_{\mathcal{B}} Q\) with \(Q = P_{\mathcal{B}' \to \mathcal{B}}\). This will make clear why certain bases are “special” for a given transformation: an eigenbasis diagonalizes the matrix, revealing its structure plainly. The rank-nullity theorem, derived here from dimension counting, will be rephrased in Chapter 3 as: the rank (dimension of image) plus nullity (dimension of kernel) equals the dimension of the domain, controlling the dimension of the solution set to \(T(\mathbf{x}) = \mathbf{b}\) (if the rank is less than the domain dimension, there are infinitely many solutions or none; if the rank equals the domain dimension, there is a unique solution or none, depending on whether \(\mathbf{b}\) lies in the image).
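The column-by-column construction of a matrix from a linear map can be sketched with a concrete map. The example below assumes \(T = d/dx\) on \(\mathcal{P}_2\) with the monomial basis \(\{1, x, x^2\}\) used for both domain and codomain; the function `T` and the test polynomial are illustrative choices.

```python
import numpy as np

# Hypothetical example: T = d/dx on P_2, with polynomials stored as
# coefficient vectors (a0, a1, a2) in the monomial basis {1, x, x^2}.
def T(coeffs):
    """Differentiate a0 + a1 x + a2 x^2, returning the coefficients of a1 + 2 a2 x."""
    a0, a1, a2 = coeffs
    return np.array([a1, 2.0 * a2, 0.0])

# Column j of the matrix is the coordinate vector of T(b_j).
basis = np.eye(3)                         # coordinates of 1, x, x^2
M = np.column_stack([T(basis[:, j]) for j in range(3)])

# The matrix reproduces T on any vector, because a linear map is determined
# by its values on a basis.
p = np.array([5.0, -3.0, 4.0])            # 5 - 3x + 4x^2
assert np.allclose(M @ p, T(p))

rank = np.linalg.matrix_rank(M)           # dimension of the image
nullity = M.shape[1] - rank               # dimension of the kernel (the constants)
print(M)
print("rank:", rank, "nullity:", nullity)
```

Here the kernel is the one-dimensional space of constant polynomials and the image is the two-dimensional space of polynomials of degree at most 1, so rank plus nullity equals \(\dim \mathcal{P}_2 = 3\).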

Determinants (Chapter 4) measure how a transformation scales volumes. A linear map \(T: \mathbb{R}^n \to \mathbb{R}^n\) represented by matrix \(A\) scales an \(n\)-dimensional volume element by factor \(|\det(A)|\); if \(\det(A) = 0\), the transformation collapses dimension (maps from \(n\)-dimensional to lower-dimensional image), corresponding to rank\(<n\) (singular matrix). Eigenvalues and eigenvectors (Chapter 5) are special directions and scalings: if \(A\mathbf{v} = \lambda \mathbf{v}\) with \(\mathbf{v} \neq \mathbf{0}\), then \(\mathbf{v}\) spans a direction along which \(A\) acts as pure scaling (by \(\lambda\)). The geometric multiplicity of an eigenvalue \(\lambda\) is the dimension of its eigenspace, and diagonalizability requires the sum of geometric multiplicities to equal \(n\) (full dimension): the matrix is diagonalizable iff it has an eigenbasis. The Spectral Theorem (Chapter 6) guarantees orthonormal eigenbases for symmetric matrices, a remarkable property unifying geometry (orthogonal basis vectors) and algebra (diagonalization). This ensures that PCA (based on an eigenbasis of the covariance matrix) is well-defined and geometrically natural: the eigenvectors can be chosen orthonormal, and the eigenvalues are real and nonnegative because the covariance matrix is symmetric positive semidefinite. These subsequent chapters build systematically on dimension and basis, showing how coordinates control complexity, representation, and computation throughout linear algebra and its applications in machine learning.
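Two of these previews can be checked numerically: a zero determinant accompanies a rank below \(n\), and a symmetric covariance matrix has an orthonormal eigenbasis with real, nonnegative eigenvalues. The matrices below are hypothetical examples.

```python
import numpy as np

rng = np.random.default_rng(5)

# A singular matrix: the third column is the sum of the first two, so rank < 3.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [2.0, 3.0, 5.0]])
det_A = np.linalg.det(A)
rank_A = np.linalg.matrix_rank(A)
print("det:", det_A, "rank:", rank_A)

# A sample covariance matrix is symmetric, so eigh returns real eigenvalues
# and an orthonormal eigenbasis.
X = rng.normal(size=(200, 4))
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (len(X) - 1)
eigvals, Q = np.linalg.eigh(cov)

print("eigenvalues:", eigvals)
print("Q^T Q = I:", np.allclose(Q.T @ Q, np.eye(4)))
print("Q diag(eigvals) Q^T = cov:", np.allclose(Q @ np.diag(eigvals) @ Q.T, cov))
```

In floating point the determinant of a singular matrix is typically a tiny nonzero number, which is one reason rank is usually diagnosed from singular values rather than from the determinant.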


End-of-Chapter Advanced Exercises

A. True / False (20)

A.1. If a linear map \(T: \mathbb{R}^d \to \mathbb{R}^m\) has rank \(r < \min(d, m)\), then there exist two distinct vectors in \(\mathbb{R}^d\) that map to the same output under \(T\).

A.2. In a linear regression model with \(n\) samples and \(d\) features, if \(d > n\) and the feature matrix \(X\) has rank \(n\), then the least-squares solution \(\hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y}\) is unique.

A.3. The intrinsic dimension of data estimated via PCA is independent of the choice of basis in which data are initially represented.

A.4. If a convolutional neural network reduces the spatial dimension from \(h \times w\) to \(1 \times 1\) (global average pooling before the final layer), the effective dimension of the representational bottleneck equals the number of final feature channels.

A.5. An affine subspace (a translation of a vector subspace) has the same dimension as its underlying vector subspace.

A.6. If the change-of-basis matrix from basis \(\mathcal{B}\) to basis \(\mathcal{B}'\) is orthogonal, then the Euclidean norm of any vector is preserved under the change of coordinates.

A.7. In a word embedding space where similarities are computed via dot products, the embedding matrix is identifiable up to orthogonal transformations of the embedding dimension.

A.8. A matrix \(A \in \mathbb{R}^{m \times n}\) with rank \(r\) can always be factorized as \(A = UV^\top\) where \(U \in \mathbb{R}^{m \times r}\) and \(V \in \mathbb{R}^{n \times r}\) are full column rank.

A.9. The dimension of the solution space to a homogeneous linear system \(A\mathbf{x} = \mathbf{0}\) equals the number of free variables that appear when \(A\) is reduced to row echelon form.

A.10. If an autoencoder has an encoder bottleneck of dimension \(k\) and the true intrinsic dimension of training data is \(k' > k\), then the autoencoder will generalize better than one with bottleneck dimension \(k' + 1\).

A.11. The rank of a matrix \(A\) equals the rank of its transpose \(A^\top\).

A.12. In a neural network with weight matrix \(W \in \mathbb{R}^{m \times n}\), if \(\text{rank}(W) < \min(m, n)\), then the layer compresses information dimensionality and cannot be inverted exactly even with infinite precision.

A.13. Principal Component Analysis finds a basis that maximizes the total variance explained by all \(k\) components, whereas the top \(k\) eigenvectors of the data covariance matrix maximize variance sequentially (first component max variance, second max residual variance, etc.).

A.14. If two datasets in \(\mathbb{R}^d\) have the same intrinsic dimension but different noise levels, applying PCA with the same number of retained components will result in the noisier dataset having higher reconstruction error.

A.15. A parameter space of dimension \(n\) can fit at most \(n\) orthogonal constraints (linearly independent equations), making it universal for interpolating \(n\) arbitrary data points in its range.

A.16. In ridge regression with regularization parameter \(\lambda > 0\), increasing \(\lambda\) is equivalent to solving least squares in a lower-effective-dimensional subspace of the parameter space.

A.17. If a matrix \(A\) is square and full rank, then there exists a basis of \(\mathbb{R}^n\) consisting entirely of eigenvectors of \(A\).

A.18. The Gram matrix \(G = X^\top X\) (where \(X \in \mathbb{R}^{n \times d}\), \(n < d\)) is singular, and the dimension of its null space is \(d - n\).

A.19. The number of principal components needed to retain 95% of variance in PCA is a lower bound on the intrinsic dimension of the data.

A.20. In a feature space of dimension \(d\), a linear classifier defined by weight vector \(\mathbf{w}\) and bias \(b\) partitions the space into two half-spaces of dimension \(d\) separated by a hyperplane of dimension \(d-1\).

B. Proof Problems (20)

B.1. Steinitz Exchange Lemma (Constructive Version): Let \(\mathcal{I} = \{\mathbf{v}_1, \ldots, \mathbf{v}_m\}\) be a linearly independent set in a vector space \(V\), and let \(\mathcal{S} = \{\mathbf{w}_1, \ldots, \mathbf{w}_n\}\) be a spanning set for \(V\). Prove that \(m \leq n\) and that there exists a subset \(\mathcal{S}' \subseteq \mathcal{S}\) of size \(n - m\) such that \(\mathcal{I} \cup \mathcal{S}'\) spans \(V\).

B.2. Dimension Theorem: Prove that any two bases of a finite-dimensional vector space \(V\) have the same number of elements. (Hint: Use the Steinitz Exchange Lemma applied to two bases.)

B.3. Rank-Nullity for Linear Maps: Let \(T: V \to W\) be a linear map between finite-dimensional vector spaces. Prove that \(\dim(V) = \dim(\ker T) + \dim(\text{im} T)\).

B.4. Dimension of Subspaces: Let \(V\) be a finite-dimensional vector space and let \(U, W \subseteq V\) be subspaces. Prove that \(\dim(U + W) + \dim(U \cap W) = \dim(U) + \dim(W)\).

B.5. Coordinate Uniqueness: Let \(\mathcal{B} = \{\mathbf{b}_1, \ldots, \mathbf{b}_n\}\) be a basis for a vector space \(V\). Prove that every vector \(\mathbf{v} \in V\) has a unique representation \(\mathbf{v} = \sum_{i=1}^n c_i \mathbf{b}_i\) with coordinates \(c_1, \ldots, c_n\).

B.6. Change-of-Basis Invertibility: Let \(\mathcal{B}\) and \(\mathcal{B}'\) be two bases for \(\mathbb{R}^n\). Prove that the change-of-basis matrix \(P_{\mathcal{B} \to \mathcal{B}'} \in \mathbb{R}^{n \times n}\) is invertible, and its inverse is \(P_{\mathcal{B}' \to \mathcal{B}}\).

B.7. Rank Invariance Under Row Operations: Prove that elementary row operations (row swap, scalar multiplication of a row, add a scalar multiple of one row to another) preserve the rank of a matrix.

B.8. Column Rank Equals Row Rank: Let \(A \in \mathbb{R}^{m \times n}\). Prove that the dimension of the column space of \(A\) equals the dimension of the row space of \(A\) (both equal the rank).

B.9. Full Column Rank Implies Injective: Let \(A \in \mathbb{R}^{m \times n}\) with \(\text{rank}(A) = n\) (full column rank). Prove that if \(A\mathbf{x} = A\mathbf{y}\), then \(\mathbf{x} = \mathbf{y}\).

B.10. Low-Rank Approximation Optimality: Let \(X \in \mathbb{R}^{n \times d}\) be a data matrix with SVD \(X = U \Sigma V^\top\). Let \(X_k = U_k \Sigma_k V_k^\top\) be the rank-\(k\) approximation (truncating singular values). Prove that \(X_k\) minimizes \(\|X - Y\|_F\) over all rank-\(k\) matrices \(Y\), where \(\|\cdot\|_F\) is the Frobenius norm.

B.11. Dimension Bound on Solutions: Let \(A \in \mathbb{R}^{m \times n}\) and \(\mathbf{b} \in \mathbb{R}^m\). Prove that if \(A\mathbf{x} = \mathbf{b}\) has a solution, then the solution set is an affine subspace of dimension \(\text{nullity}(A) = n - \text{rank}(A)\).

B.12. Intrinsic Dimension via PCA: Let \(X \in \mathbb{R}^{d \times n}\) be a centered data matrix (columns are samples, rows are features), and let \(\Sigma = XX^\top / n \in \mathbb{R}^{d \times d}\) be the sample covariance (with \(1/n\) normalization). Let \(\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_d \geq 0\) be the eigenvalues of \(\Sigma\), and let \(k = \max \{j : \lambda_j > \epsilon\}\) for some threshold \(\epsilon > 0\) (assume \(\lambda_1 > \epsilon\)). Prove that retaining the top \(k\) principal components captures a fraction \(1 - \delta\) of the total variance, where \(\delta = (\sum_{j=k+1}^d \lambda_j) / (\sum_{j=1}^d \lambda_j) \leq (d - k)\,\epsilon / \sum_{j=1}^d \lambda_j\).

B.13. Autoencoder Dimension Constraint (Undercomplete): Let an autoencoder have encoder \(E: \mathbb{R}^d \to \mathbb{R}^k\) and decoder \(D: \mathbb{R}^k \to \mathbb{R}^d\) (both linear). If \(k < d\) and training data lie on a \(k\)-dimensional subspace of \(\mathbb{R}^d\), prove that there exists an autoencoder achieving zero reconstruction error on the training data.

B.14. Feature Redundancy and Column Dependence: Prove that if a design matrix \(X \in \mathbb{R}^{n \times d}\) has two identical columns, then \(\text{rank}(X) < d\), and the least-squares problem \(\min_\mathbf{w} \|X\mathbf{w} - \mathbf{y}\|^2\) has infinitely many optimal solutions.

B.15. Regularization Reduces Effective Dimension: Let \(X \in \mathbb{R}^{n \times d}\) with \(\text{rank}(X) = r < d\). Prove that for any \(\lambda > 0\), the ridge regression solution \(\hat{\mathbf{w}}_\lambda = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}\) is unique and lies in the column space of \(X^\top\).

B.16. Whitening Diagonalizes Covariance: Let \(X \in \mathbb{R}^{n \times d}\) be centered data with full-rank covariance \(\Sigma = X^\top X / (n-1)\). Let \(L\) be the Cholesky factor of \(\Sigma\) (i.e., \(\Sigma = LL^\top\)). Prove that the whitened data \(Z = XL^{-\top}\) has covariance equal to the identity matrix.

B.17. Embedding Non-Identifiability: Let \(W \in \mathbb{R}^{V \times d}\) be an embedding matrix (vocabulary size \(V\), embedding dimension \(d\)). Prove that for any orthogonal matrix \(Q \in \mathbb{R}^{d \times d}\), the embeddings \(W\) and \(WQ\) induce identical dot-product similarities \((WQ)(WQ)^\top = WW^\top\) and are observationally equivalent for any downstream model depending only on similarities.

B.18. Rank Constraints in Neural Networks: Let a neural network have layers with weight matrices \(W^{(1)} \in \mathbb{R}^{n_1 \times d}, W^{(2)} \in \mathbb{R}^{n_2 \times n_1}, \ldots, W^{(L)} \in \mathbb{R}^{n_L \times n_{L-1}}\). Prove that \(\text{rank}(W^{(L)} \cdots W^{(2)} W^{(1)}) \leq \min(\text{rank}(W^{(1)}), \text{rank}(W^{(2)}), \ldots, \text{rank}(W^{(L)}))\). Then give an example in which the inequality is strict even though every factor has the same rank.

B.19. Dimension of Null Space Under Composition: Let \(A \in \mathbb{R}^{m \times n}\) and \(B \in \mathbb{R}^{n \times p}\). Prove that \(\text{nullity}(AB) \geq \text{nullity}(B)\), with equality iff \(\ker(A) \cap \text{Col}(B) = \{\mathbf{0}\}\). (Hint: show that \(\text{nullity}(AB) = \text{nullity}(B) + \dim(\ker(A) \cap \text{Col}(B))\).)

B.20. Isomorphism and Dimension Uniqueness: Prove that two finite-dimensional vector spaces \(V\) and \(W\) are isomorphic (i.e., there exists a linear bijection between them) if and only if \(\dim(V) = \dim(W)\).

C. Python Exercises (20)

C.1. Verifying Linear Independence via Row Reduction: Write a function that takes a list of column vectors (as a matrix) and determines whether they are linearly independent by reducing the matrix to row echelon form and counting pivot columns. Your function should return a boolean and optionally the rank. Purpose: This exercise reinforces the algorithmic foundation of linear independence—recognizing that independent sets have full column rank. ML Link: Detecting multicollinearity in regression requires checking if feature columns are independent; a rank-deficient design matrix signals redundant features, necessitating dimensionality reduction or regularization. Hints: Use NumPy’s linear algebra routines (or implement Gaussian elimination manually) to reduce the matrix; count non-zero rows in the echelon form. Mastery: You should be able to correctly identify dependent columns, compute their ranks, and explain why multicollinearity causes non-unique solutions in least-squares regression.
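One possible starting point for this exercise is sketched below. It is a minimal illustration, not a reference solution: the function names `pivot_columns` and `is_independent` and the tolerance `tol` are assumptions introduced here, and the elimination uses partial pivoting for numerical stability.

```python
import numpy as np

def pivot_columns(A, tol=1e-10):
    """Return the pivot column indices of A, found by Gaussian elimination
    with partial pivoting. `tol` is an assumed threshold below which an
    entry is treated as zero."""
    R = np.array(A, dtype=float)      # work on a float copy
    m, n = R.shape
    pivots = []
    row = 0
    for col in range(n):
        if row == m:
            break
        # Pick the largest entry (in absolute value) at or below `row`.
        p = row + int(np.argmax(np.abs(R[row:, col])))
        if abs(R[p, col]) < tol:
            continue                  # no pivot in this column
        R[[row, p]] = R[[p, row]]     # swap rows
        R[row] /= R[row, col]         # scale the pivot to 1
        for r in range(m):            # clear the rest of the column
            if r != row:
                R[r] -= R[r, col] * R[row]
        pivots.append(col)
        row += 1
    return pivots

def is_independent(A, tol=1e-10):
    """Columns of A are independent iff every column is a pivot column.
    Returns (independent?, rank)."""
    pivots = pivot_columns(A, tol)
    return len(pivots) == np.shape(A)[1], len(pivots)

# Example: the third column is the sum of the first two.
A = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])
independent, rank = is_independent(A)
```

The same pivot indices select original columns: `A[:, pivot_columns(A)]` is a set of independent columns spanning the column space of `A`.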

C.2. Extracting a Basis from a Spanning Set: Write a function that takes a spanning set of vectors (potentially with redundancy) and extracts a basis by identifying pivot columns after row reduction. Return the basis vectors themselves, not just their indices. Purpose: Constructing a minimal spanning set (a basis) is fundamental to dimensionality reduction; this exercise makes the theory computational. ML Link: Feature selection in machine learning amounts to extracting independent features from a (possibly redundant) set of raw features, reducing ambient dimension to intrinsic dimension. Hints: Perform row reduction on the matrix of spanning vectors; identify pivot column indices; extract the corresponding original columns as basis vectors. Mastery: You should produce a basis with correct dimension, verify it spans the original space, and recognize that different extraction strategies (e.g., preferring early columns vs. others) yield different bases—all valid, all with the same size.

C.3. Computing a Basis for the Null Space: Write a function that computes a basis for the null space of a given matrix \(A\). Your function should reduce \(A\) to row echelon form, identify free variables, and construct vectors spanning the null space. Purpose: The null space characterizes directions annihilated by the linear map; its dimension (nullity) determines solution non-uniqueness. ML Link: In underdetermined regression (\(n < d\)), the null space has positive dimension, meaning infinitely many weight vectors fit the data equally well; understanding its structure is key to understanding regularization and parameter identifiability. Hints: Use row reduction to identify free variables; for each free variable, set it to 1 and solve for dependent variables to get one basis vector. Mastery: You should correctly identify all free variables, construct an independent set of null space vectors, and verify that each satisfies \(A\mathbf{v} = \mathbf{0}\) and that the set is maximal (any additional vector is dependent).

C.4. Computing a Basis for the Column Space: Write a function that computes a basis for the column space of a matrix \(A\). Your implementation should not use built-in subspace decompositions (like SVD directly for this purpose); instead, use row reduction on \(A\) to identify pivot columns and return the corresponding original columns. Purpose: The column space (range) is the set of all reachable outputs; its dimension (rank) quantifies how many independent directions can be “activated” by the map. ML Link: In classification, the output layer’s weight matrix defines a column space; its dimension bounds the complexity of decision boundaries the model can create. Hints: Perform row reduction on \(A\); note which columns are pivots; those original columns form a basis for the column space (not the reduced pivots, but the original matrix columns). Mastery: You should correctly identify pivot columns, extract them from the original matrix, verify they span the column space and are independent, and explain why the row-reduced form’s pivot columns do not directly give the basis (transformation by row operations changes coordinates).

C.5. Computing a Change-of-Basis Matrix: Write a function that, given two bases \(\mathcal{B}\) and \(\mathcal{B}'\) for \(\mathbb{R}^n\), computes the change-of-basis matrix \(P_{\mathcal{B} \to \mathcal{B}'}\). Your function should express each vector of \(\mathcal{B}\) in coordinates with respect to \(\mathcal{B}'\) and stack them as columns. Purpose: Change-of-basis matrices are the computational encoding of basis switching; mastering them is essential for understanding how the same vector has different numerical representations. ML Link: Feature normalization (whitening, PCA) applies change-of-basis transformations to raw features, rotating and scaling to align with principal directions; the resulting coordinates are decorrelated and isotropic, improving optimization. Hints: For each basis vector \(\mathbf{b}_i \in \mathcal{B}\), solve the system to express it as a linear combination of vectors in \(\mathcal{B}'\), yielding its coordinates; stack these as columns of \(P\). Mastery: You should correctly construct the change-of-basis matrix, verify that multiplying coordinates in \(\mathcal{B}\) by \(P\) yields correct coordinates in \(\mathcal{B}'\), and apply its inverse to reverse the transformation.
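The hint above amounts to solving one linear system per basis vector, which can be done in a single call. The sketch below assumes the basis vectors are stored as the columns of `B` and `B_prime`; both matrices and the function name `change_of_basis` are hypothetical illustrations.

```python
import numpy as np

def change_of_basis(B, B_prime):
    """Columns of B and B_prime are the basis vectors.
    Returns P with [v]_{B'} = P @ [v]_B, by solving B_prime @ P = B."""
    return np.linalg.solve(B_prime, B)

# Two made-up bases of R^2.
B = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B_prime = np.array([[2.0, 0.0],
                    [1.0, 1.0]])
P = change_of_basis(B, B_prime)

coords_B = np.array([3.0, -1.0])      # coordinates of v in basis B
v = B @ coords_B                      # the vector itself
coords_Bp = P @ coords_B              # coordinates of v in basis B'
round_trip = np.linalg.solve(P, coords_Bp)  # back to basis B
```

Solving the system rather than forming an explicit inverse is the usual numerically safer choice; the round trip at the end is the check that Exercise C.6 asks for.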

C.6. Verifying Basis Change (Round-Trip Transformation): Write a function that takes a vector \(\mathbf{v}\), its coordinates \([\mathbf{v}]_\mathcal{B}\) with respect to basis \(\mathcal{B}\), transforms to basis \(\mathcal{B}'\) using a change-of-basis matrix, and transforms back to \(\mathcal{B}\). Verify that the final coordinates match the original. Purpose: This exercise tests understanding of how change-of-basis matrices compose: the forward and inverse transformations should be true inverses. ML Link: When deploying ML models, feature transformations (e.g., PCA whitening fit on training data) must be consistently applied; this exercise models that pipeline—transforming inputs to a learned basis and back. Hints: Compute \(P_{\mathcal{B} \to \mathcal{B}'} \cdot [\mathbf{v}]_\mathcal{B}\) to get coordinates in \(\mathcal{B}'\); compute the inverse matrix \(P^{-1}\) and multiply to recover coordinates in \(\mathcal{B}\). Mastery: You should verify numerically (within floating-point tolerance) that the round-trip recovers original coordinates, and explain why inverses must cancel (by properties of matrix multiplication and invertibility).

C.7. Computing PCA and Extracting Principal Components as a Basis: Write a function that takes a data matrix \(X\) (samples as columns, features as rows), centers it, computes the covariance matrix, finds its eigendecomposition, and returns the top \(k\) eigenvectors as a basis for a \(k\)-dimensional representation. Your function should also return the explained variance for each component. Purpose: PCA is the workhorse dimensionality reduction technique in ML; implementing it from scratch builds intuition for how eigendecompositions yield principled bases. ML Link: PCA discovers a basis where data variance is captured sequentially (first component explains most variance, second explains residual variance, etc.), enabling compression and noise reduction. Hints: Center the data: \(X_c = X - \text{mean}(X)\); compute covariance \(\Sigma = X_c X_c^\top / n\); use NumPy’s eigendecomposition to find eigenvalues (sorted descending) and eigenvectors; take the top \(k\) eigenvectors as basis vectors. Mastery: You should correctly compute PCA, verify that eigenvectors are orthonormal, project data onto the top \(k\) components, reconstruct the approximate original data, and compute reconstruction error (sum of discarded eigenvalues / total variance).
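A compact sketch of the pipeline described in the hints follows. It assumes, as in the exercise, that samples are columns and features are rows, so centering subtracts each feature's mean across samples. The function name `pca_basis` and the random data are assumptions made for illustration.

```python
import numpy as np

def pca_basis(X, k):
    """X has features as rows and samples as columns.
    Returns (top-k eigenvectors as columns, all eigenvalues sorted
    in descending order, centered data)."""
    Xc = X - X.mean(axis=1, keepdims=True)    # subtract per-feature mean
    Sigma = Xc @ Xc.T / Xc.shape[1]           # covariance with 1/n scaling
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # ascending order
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvecs[:, :k], eigvals, Xc

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 200))                 # 4 features, 200 samples
k = 2
W, eigvals, Xc = pca_basis(X, k)

codes = W.T @ Xc                              # k-dimensional coordinates
X_hat = W @ codes                             # reconstruction in R^4
mse = np.sum((Xc - X_hat) ** 2) / Xc.shape[1] # mean squared error per sample
explained = eigvals[:k] / eigvals.sum()       # explained variance ratios
```

With the \(1/n\) scaling used here, the mean squared reconstruction error per sample equals the sum of the discarded eigenvalues, which gives a direct numerical check of the implementation.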

C.8. Whitening Data via Covariance Diagonalization: Write a function that whitens (standardizes) a dataset by computing its covariance matrix \(\Sigma\), performing Cholesky decomposition \(\Sigma = LL^\top\), and transforming the data by \(Z = X L^{-\top}\). Verify that the whitened data have zero mean and identity covariance. Purpose: Whitening decorrelates features and equalizes variance, improving optimization stability in ML algorithms. ML Link: Batch normalization in neural networks applies whitening-like transformations to activations; understanding the mathematical basis (covariance structure) clarifies why normalization helps. Hints: Center the data; compute the covariance; use NumPy’s Cholesky or use SVD as an alternative; apply the inverse-transpose of the Cholesky factor. Mastery: You should correctly whiten data, verify the resulting covariance is identity (within numerical precision), understand that whitening is a specific change-of-basis transformation, and recognize when it improves optimization (e.g., in gradient descent on whitened vs. raw features).
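The transformation \(Z = X L^{-\top}\) can be sketched as follows. The data here are synthetic (random samples mixed by a made-up invertible matrix `M`), samples are rows, and the covariance uses the \(1/(n-1)\) normalization; these are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated data: 500 samples (rows), 3 features (columns).
M = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, -1.0, 0.5]])
X = rng.normal(size=(500, 3)) @ M.T

Xc = X - X.mean(axis=0)                   # center each feature
Sigma = Xc.T @ Xc / (Xc.shape[0] - 1)     # sample covariance
L = np.linalg.cholesky(Sigma)             # lower triangular, Sigma = L @ L.T

# Z = Xc @ inv(L).T, computed by solving L @ Z.T = Xc.T
# rather than forming the inverse explicitly.
Z = np.linalg.solve(L, Xc.T).T
cov_Z = Z.T @ Z / (Z.shape[0] - 1)        # should be the identity
```

Cholesky whitening is one of several valid whitening transforms; any \(Z Q\) with \(Q\) orthogonal is also white, which is why PCA-based and Cholesky-based whitening give different but equally valid coordinates.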

C.9. Rank and Nullity of a Design Matrix in Regression: Write a function that takes a design matrix \(X \in \mathbb{R}^{n \times d}\) and response vector \(\mathbf{y} \in \mathbb{R}^n\), computes the rank and nullity of \(X\), and diagnoses two things: whether the least-squares minimizer is unique (\(\text{rank}(X) = d\)) or not (\(\text{rank}(X) < d\), infinitely many minimizers), and whether the linear system \(X\mathbf{w} = \mathbf{y}\) is consistent (zero residual) or inconsistent (the generic case when \(n > d\)). Purpose: Understanding rank/nullity clarifies why and when regression problems have unique vs. non-unique solutions, and what regularization does. ML Link: Real datasets often exhibit multicollinearity (rank-deficient design matrices); this exercise diagnoses the issue and motivates regularization. Hints: Compute rank using NumPy’s matrix_rank or SVD; compute nullity as \(d - \text{rank}(X)\); check if \(\mathbf{y} \in \text{Col}(X)\) by comparing the rank of \([X | \mathbf{y}]\) with the rank of \(X\). Mastery: You should correctly identify the rank, interpret nullity as the dimension of parameter non-uniqueness, explain how the augmented matrix rank determines whether an exact solution exists, and propose regularization (ridge regression) as a way to ensure uniqueness despite rank deficiency.

C.10. Ridge Regression and Effective Dimensionality: Write a function that solves ridge regression with various \(\lambda\) values, and for each \(\lambda\), computes the “effective dimension” \(\sum_i \sigma_i^2 / (\sigma_i^2 + \lambda)\), where \(\sigma_i\) are the singular values of \(X\) (equivalently, \(\sigma_i^2\) are the eigenvalues of \(X^\top X\)). Plot effective dimension vs. \(\lambda\) and observe how regularization reduces effective degrees of freedom. Purpose: This exercise illustrates that regularization is not just a statistical trick: it is a principled way to reduce the effective dimension of the parameter space, accepting some bias in exchange for lower variance. ML Link: Regularization (L2 penalty in ridge regression) trades off fit quality vs. parameter norm; understanding it as dimension reduction provides geometric intuition. Hints: Use NumPy’s SVD to decompose \(X = U\Sigma V^\top\); for each \(\lambda\), the ridge solution replaces the inverse singular values \(1/\sigma_i\) with \(\sigma_i / (\sigma_i^2 + \lambda)\); the effective dimension is the sum of the shrinkage factors \(\sigma_i^2 / (\sigma_i^2 + \lambda)\). Mastery: You should observe that larger \(\lambda\) reduces effective dimension (fewer parameters truly matter), observe how this trades training fit against generalization, and understand why regularization helps with ill-conditioned or rank-deficient design matrices.
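As a hedged sketch of the computation, the code below takes the effective dimension to be \(\sum_i \sigma_i^2 / (\sigma_i^2 + \lambda)\), with \(\sigma_i\) the singular values of \(X\) (so \(\sigma_i^2\) are the eigenvalues of \(X^\top X\)). The function names and the random data are assumptions; plotting is left out.

```python
import numpy as np

def effective_dimension(X, lam):
    """Sum of ridge shrinkage factors s_i^2 / (s_i^2 + lam),
    where s_i are the singular values of X."""
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(s**2 / (s**2 + lam)))

def ridge_solution(X, y, lam):
    """Solve (X^T X + lam I) w = X^T y for lam > 0."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))          # made-up design matrix
y = rng.normal(size=20)

lambdas = [0.1, 1.0, 10.0, 100.0]
dims = [effective_dimension(X, lam) for lam in lambdas]
weights = [ridge_solution(X, y, lam) for lam in lambdas]
```

For a full-column-rank \(X\) the effective dimension tends to \(d\) as \(\lambda \to 0\) and to \(0\) as \(\lambda \to \infty\), so `dims` should decrease monotonically along `lambdas`.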

C.11. Gram Matrix Rank and Feature Redundancy: Write a function that takes a data matrix \(X\) (samples as rows, features as columns) and computes the Gram matrix \(G = XX^\top\) (within-sample similarities). Compute the rank of \(G\) and relate it to the rank of \(X\). Verify that \(\text{rank}(G) = \text{rank}(X)\). Purpose: The Gram matrix is fundamental in kernel methods and understanding sample-wise similarities; its rank reveals whether samples are linearly independent. ML Link: In support vector machines and kernel methods, the Gram matrix encodes sample similarities; its rank determines the dimension of the implicit feature space and the degrees of freedom in the model. Hints: Compute the rank of \(X\) directly; compute the Gram matrix and its rank; verify the identity \(\text{rank}(XX^\top) = \text{rank}(X)\). Mastery: You should correctly compute the Gram matrix, understand why its rank equals \(X\)’s rank (by properties of matrix multiplication), and recognize that a rank-deficient Gram matrix signals redundant samples or collinear data.

C.12. Dimension Reduction via Truncated SVD: Write a function that computes the SVD of a data matrix \(X\), truncates to the top \(k\) singular vectors, and reconstructs a low-rank approximation \(\tilde{X}\). Compare the Frobenius norm error \(\|X - \tilde{X}\|_F\) with \(\sqrt{\sum_{i > k} \sigma_i^2}\), the square root of the sum of the squared discarded singular values (Eckart-Young theorem). Purpose: The SVD provides an optimal low-rank approximation; this exercise demonstrates that truncating singular values is optimal in terms of reconstruction error. ML Link: SVD-based dimensionality reduction is theoretically optimal; understanding why truncated SVD minimizes reconstruction error justifies its use in compression, denoising, and feature extraction. Hints: Use NumPy’s SVD to decompose \(X = U\Sigma V^\top\); truncate to the top \(k\) singular values; reconstruct as \(\tilde{X} = U_k \Sigma_k V_k^\top\); compute error and compare with theory. Mastery: You should correctly perform SVD, understand that truncation discards directions of low variance/importance, verify the Eckart-Young optimality property, and explain how truncated SVD relates to PCA (PCA is the truncated SVD of the centered data matrix, so the two differ only by centering).
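A minimal sketch of the comparison follows, under the assumption that the Frobenius error of the rank-\(k\) truncation is checked against \(\sqrt{\sum_{i>k} \sigma_i^2}\). The function name `truncated_svd` and the random matrix are illustrative only.

```python
import numpy as np

def truncated_svd(X, k):
    """Return the rank-k truncation of X and all singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Scale the first k left singular vectors by their singular values,
    # then multiply by the first k right singular vectors.
    X_k = (U[:, :k] * s[:k]) @ Vt[:k]
    return X_k, s

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))          # made-up data matrix
k = 3
X_k, s = truncated_svd(X, k)

err = np.linalg.norm(X - X_k, "fro")          # observed error
predicted = np.sqrt(np.sum(s[k:] ** 2))       # Eckart-Young value
```

The identity checked here gives the error of the truncation itself; showing that no other matrix of rank at most \(k\) does better is the content of Problem B.10.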

C.13. Intrinsic Dimension Estimation via Variance Threshold: Write a function that, given a dataset and a variance threshold (e.g., 95%), applies PCA and returns the minimum number of components needed to exceed that threshold. Visualize the cumulative explained variance curve. Purpose: Estimating intrinsic dimension is a practical step in feature engineering; this exercise operationalizes the concept. ML Link: Choosing the number of PCA components or autoencoder bottleneck dimension requires estimating intrinsic dimension; this exercise provides a concrete method. Hints: Compute eigenvalues of the covariance matrix (sorted descending); cumulative sum of normalized eigenvalues gives cumulative explained variance; find the minimum index where cumulative variance exceeds the threshold. Mastery: You should correctly estimate intrinsic dimension, understand that dimension estimation is data and threshold dependent (not absolute), and recognize that different thresholds trade off compression vs. information loss.

C.14. Autoencoder Bottleneck Dimension and Reconstruction Error: Write a function that trains a simple linear autoencoder (fully connected, no hidden layers beyond the bottleneck) on a dataset with various bottleneck dimensions \(k\). For each \(k\), compute training reconstruction error and test reconstruction error (on held-out data). Plot the error curve and identify the optimal \(k\) via cross-validation. Purpose: This exercise links dimensionality reduction to model selection; the optimal bottleneck dimension balances expressivity and generalization. ML Link: Autoencoders learn compressed representations; the bottleneck dimension must be chosen carefully—too small underfits (high bias), too large overfits (high variance). Hints: A linear autoencoder with bottleneck \(k\) is equivalent to PCA; you can use gradient descent to train (or solve the least-squares problem directly for linear autoencoders). Use train/test split; cross-validation to select \(k\). Mastery: You should correctly train autoencoders for various \(k\), observe the bias-variance trade-off graphically, and select an appropriate bottleneck dimension via cross-validation, understanding that the optimal \(k\) approximates the true intrinsic dimension.

C.15. Coordinate Transformation for Feature Interpretability: Write a function that takes a trained linear classifier \(\hat{\mathbf{w}}\) and a change-of-basis transformation (e.g., PCA), and reexpresses the classifier in the new basis. Compute the original coefficients and the transformed coefficients, comparing their interpretability (e.g., sparsity, magnitude distribution). Purpose: Basis changes affect interpretability; understanding how to reexpress models in different bases aids in model explanation. ML Link: Explainable AI requires understanding model decisions; expressing a classifier in a basis aligned with interpretable features (e.g., PCA components = pathways, LIME components = local perturbations) improves transparency. Hints: If features are re-expressed as \(\mathbf{x} = P\mathbf{z}\), then \(\hat{\mathbf{w}}^\top \mathbf{x} = (P^\top \hat{\mathbf{w}})^\top \mathbf{z}\), so the transformed classifier is \(\tilde{\mathbf{w}} = P^\top \hat{\mathbf{w}}\); this equals \(P^{-1} \hat{\mathbf{w}}\) only when \(P\) is orthogonal, as for a full PCA basis. Mastery: You should correctly transform a classifier to a new basis, verify that predictions remain identical (abstract classifier unchanged, just representation different), and discuss how different bases offer different interpretability levels.

C.16. Identifying and Removing Collinear Features: Write a function that detects collinear (or nearly collinear, using a correlation threshold) features in a dataset, suggests which features to drop to restore independence, and recomputes the rank after removal. Purpose: Multicollinearity is a practical problem in regression; detecting and removing redundant features is a common preprocessing step. ML Link: Collinear features cause ill-conditioned design matrices, unstable coefficient estimates, and overfitting; removing them is a dimensionality reduction strategy. Hints: Compute the correlation matrix of features; identify pairs whose absolute correlation exceeds the threshold; suggest dropping one from each collinear pair (or use more sophisticated methods like variance inflation factor). Mastery: You should identify collinear features, explain why they cause numerical and statistical problems, and demonstrate that removing them improves the condition number and stability of regression solutions.

C.17. Null Space Characterization of Solution Non-Uniqueness: Write a function that, for an underdetermined least-squares problem (fewer equations than unknowns, or rank-deficient), computes a particular solution \(\mathbf{w}_p\) and a basis for the null space. Verify that all solutions are of the form \(\mathbf{w} = \mathbf{w}_p + \sum_i c_i \mathbf{v}_i\) where \(\mathbf{v}_i\) span the null space. Purpose: Understanding the structure of the solution set (an affine subspace) clarifies why least-squares is non-unique and how regularization “breaks ties”. ML Link: In underdetermined regression, the solution set is high-dimensional; regularization (ridge, LASSO) selects a specific point in this set (e.g., smallest norm, sparsest). Hints: Use least-squares to compute a minimum-norm solution (which lies in the row space of \(X\)); compute a basis for the null space; verify that \(\mathbf{w}_p + \mathbf{v}_i\) satisfy the least-squares equations for any null space vector \(\mathbf{v}_i\). Mastery: You should correctly characterize the solution set as an affine subspace, compute its dimension, and explain how different regularizers select different solutions from this set.

C.18. Low-Rank Factorization for Implicit Regularization: Write a function that factorizes a matrix as \(A = UV^\top\) with \(U \in \mathbb{R}^{m \times k}\) and \(V \in \mathbb{R}^{n \times k}\) (rank \(k\)), solving the factorization problem via gradient descent or alternating least squares. Vary \(k\) and observe how factorization error and generalization change. Purpose: Low-rank factorization is an implicit regularization: it restricts the parameter space to rank-\(k\) matrices, reducing degrees of freedom. ML Link: Matrix factorization appears in collaborative filtering, word embeddings, and latent factor models; understanding the rank constraint as regularization clarifies why low-rank approximations generalize. Hints: Initialize \(U, V\) randomly; alternate minimizing loss with respect to \(U\) and \(V\) (or use SVD-based initialization); monitor training and test error. Mastery: You should correctly implement factorization, observe the dimension-capacity trade-off (too small \(k\) underfits, too large \(k\) overfits), and recognize that factorization can be viewed as constraining an alternative parameterization to low rank.

C.19. Isomorphism Between Polynomial Spaces and Coordinate Spaces: Write a function that represents polynomials (up to degree \(n\)) as coordinate vectors with respect to the standard basis \(\{1, x, x^2, \ldots, x^n\}\), and implements operations (addition, scalar multiplication, evaluation) in both polynomial and coordinate forms. Verify that the coordinate map is an isomorphism, i.e., that it is a bijection under which polynomial operations correspond to vector operations. Purpose: Recognizing isomorphisms between abstract spaces (polynomials of degree at most \(n\)) and concrete spaces (\(\mathbb{R}^{n+1}\), since the basis has \(n + 1\) elements) bridges theory and computation. ML Link: Neural networks learn maps between raw data space and learned representations; understanding isomorphism as a structure-preserving bijection clarifies what a representation can and cannot preserve. Hints: Represent a polynomial \(p(x) = a_0 + a_1 x + \ldots + a_n x^n\) as the vector \((a_0, a_1, \ldots, a_n)^\top \in \mathbb{R}^{n+1}\); operations on vectors should mirror operations on polynomials. Mastery: You should implement polynomial-to-coordinate and coordinate-to-polynomial transformations, verify that operations commute (adding polynomials and adding their coordinate vectors yield the same result), and explain why isomorphism means the spaces are “the same” algebraically.
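The correspondence between polynomial operations and vector operations can be sketched with plain coefficient arrays. The example below assumes degree at most 3 (so coordinates live in \(\mathbb{R}^4\)) and uses NumPy's `polyval`, which takes coefficients in ascending order of degree; the specific polynomials are made up.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Coordinates with respect to the basis {1, x, x^2, x^3}.
p = np.array([1.0, 2.0, 0.0, -1.0])   # p(x) = 1 + 2x - x^3
q = np.array([0.0, -1.0, 4.0, 0.0])   # q(x) = -x + 4x^2

x = 1.5
# Combine coordinates first, then evaluate ...
lhs = P.polyval(x, 2.0 * p + q)
# ... versus evaluate each polynomial, then combine the values.
rhs = 2.0 * P.polyval(x, p) + P.polyval(x, q)

# Evaluation at a fixed point is a linear functional on coordinates:
# p(x) is the dot product of (1, x, x^2, x^3) with the coordinate vector.
powers = np.array([x**j for j in range(4)])
value_via_dot = powers @ p
```

That `lhs` and `rhs` agree is the statement that evaluation respects linear combinations; a fuller solution would also wrap the arrays in a small polynomial type and test addition and scaling there.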

C.20. Dimension and Generalization in Kernel Methods: Write a function that applies the kernel trick (e.g., polynomial kernel) to a dataset, computes the Gram matrix in the implicit feature space, and observes how the effective dimension (rank of the Gram matrix) grows with kernel parameters. Relate this to generalization via cross-validation. Purpose: Kernel methods implicitly work in high- (or infinite-)dimensional feature spaces; understanding the effective dimension via the Gram matrix rank clarifies generalization behavior. ML Link: SVMs and kernel methods gain expressivity by implicitly mapping to high-dimensional spaces; the kernel’s ability to compute similarities without explicitly constructing features is powerful but risks overfitting if the implicit dimension is too high or matches the data too closely. Hints: Compute the Gram matrix for a polynomial kernel (e.g., \(K(x_i, x_j) = (x_i^\top x_j + 1)^d\)) and observe its rank; compute cross-validation errors for different kernel parameters; note the trade-off between model complexity and generalization. Mastery: You should correctly compute Gram matrices for various kernels, understand that rank of the Gram matrix relates to effective dimensionality, and observe empirically that higher-dimensional implicit spaces improve training fit but can hurt generalization if capacity exceeds data size.


Solutions

Solutions to A. True / False

A.1. If a linear map \(T: \mathbb{R}^d \to \mathbb{R}^m\) has rank \(r < \min(d, m)\), then there exist two distinct vectors in \(\mathbb{R}^d\) that map to the same output under \(T\).

Final Answer: True.

Full Mathematical Justification: By the rank-nullity theorem, \(\dim(\ker T) = d - \text{rank}(T) = d - r > 0\) (since \(r < \min(d, m) \leq d\)). Thus the kernel is non-trivial: there exists a vector \(\mathbf{v} \in \ker T\) with \(\mathbf{v} \neq \mathbf{0}\). Now, for any \(\mathbf{x} \in \mathbb{R}^d\), consider \(\mathbf{x}' = \mathbf{x} + \mathbf{v}\). Then \(T(\mathbf{x}') = T(\mathbf{x} + \mathbf{v}) = T(\mathbf{x}) + T(\mathbf{v}) = T(\mathbf{x}) + \mathbf{0} = T(\mathbf{x})\). Since \(\mathbf{v} \neq \mathbf{0}\), we have \(\mathbf{x} \neq \mathbf{x}'\), yet they map to the same output.

Counterexample if False: N/A (statement is true).

Comprehension: The kernel (null space) quantifies how much information is “lost” by the map. A non-trivial kernel means the map is not injective (one-to-one), so multiple inputs produce identical outputs. This is the geometric interpretation of rank deficiency: the map collapses some dimensions.

ML Applications: In regression with multicollinear features (rank-deficient design matrix), multiple weight vectors yield identical predictions on training data, causing non-uniqueness of solutions. In neural networks, if a layer’s weight matrix is rank-deficient, that layer cannot be inverted uniquely: multiple hidden states map to the same downstream activation.

Failure Mode Analysis: If one mistakenly assumes full rank when the matrix is actually singular (e.g., due to numerical errors or undetected multicollinearity), one might naively invert the matrix, obtaining incorrect or unstable results. Computing rank via SVD (not direct inversion) is more numerically stable.

Traps: Students sometimes conflate “rank deficiency” with “no solution” (in regression). Rank deficiency means non-uniqueness of solutions (assuming they exist), not that solutions don’t exist. A rank-deficient design matrix can still have solutions; it just has infinitely many.


A.2. In a linear regression model with \(n\) samples and \(d\) features, if \(d > n\) and the feature matrix \(X\) has rank \(n\), then the least-squares solution \(\hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y}\) is unique.

Final Answer: False.

Full Mathematical Justification: If \(X \in \mathbb{R}^{n \times d}\) with \(d > n\) and \(\text{rank}(X) = n\) (full row rank), then \(X^\top X \in \mathbb{R}^{d \times d}\) has rank \(n < d\), so it is singular (non-invertible). Thus \((X^\top X)^{-1}\) does not exist, and the formula given is undefined. More precisely, the least-squares problem has infinitely many solutions forming an affine subspace of dimension \(d - n > 0\) in parameter space. All solutions achieve the same minimum least-squares error \(\|X\mathbf{w} - \mathbf{y}\|^2\), but they are not unique.

Counterexample if False: Let \(X = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix} \in \mathbb{R}^{2 \times 3}\) (full row rank \(n = 2\), but \(d = 3 > n\)), then \(X^\top X\) is singular, and infinitely many \(\mathbf{w}\) satisfy the least-squares equations.
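The counterexample can be verified with a short sketch, assuming NumPy. The target vector `y` and the scaling of the null-space direction are hypothetical choices; the matrix `X` is the one from the counterexample.

```python
import numpy as np

X = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])      # n = 2 samples, d = 3 features
y = np.array([1.0, 2.0])             # hypothetical targets

G = X.T @ X                          # 3 x 3, but rank 2, hence singular
rank_G = int(np.linalg.matrix_rank(G))

# Minimum-norm least-squares solution via the pseudoinverse.
w_min = np.linalg.pinv(X) @ y

# (1, -1, -1) lies in the null space of X, so adding any multiple of it
# changes the weights without changing the predictions.
null_vec = np.array([1.0, -1.0, -1.0])
w_other = w_min + 3.0 * null_vec

res_min = np.linalg.norm(X @ w_min - y)
res_other = np.linalg.norm(X @ w_other - y)
```

Both weight vectors attain the same residual, which illustrates the \((d - n)\)-dimensional family of minimizers described in the justification.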

Comprehension: The condition \(d > n\) means more parameters than constraints. The least-squares problem is underdetermined: there are more degrees of freedom than data points. Full row rank means the rows of \(X\) are independent (no conflicting constraints), but columns (parameters) can still be dependent, giving a non-unique solution set.

ML Applications: High-dimensional regression (more features than samples) is common in genomics, text mining, and imaging. Without regularization, the solution set is non-unique. The pseudoinverse selects the minimum-norm point of the solution set (the limit of ridge regression as \(\lambda \to 0^+\)). Ridge with \(\lambda > 0\) yields a unique estimate that trades a small amount of training fit for a smaller norm, and LASSO similarly favors sparse estimates.

Failure Mode Analysis: Attempting to invert \(X^\top X\) directly will fail numerically (matrix is singular); one must use stable alternatives (pseudoinverse, regularization, or direct solution via SVD).

Traps: Confusing “full rank” (rows or columns?) and “invertible” (requires square + full rank). An \(m \times n\) matrix cannot be inverted unless \(m = n\). Full row rank means the rows are independent; full column rank means the columns are independent.


A.3. The intrinsic dimension of data estimated via PCA is independent of the choice of basis in which data are initially represented.

Final Answer: True.

Full Mathematical Justification: Intrinsic dimension is a property of the data manifold (or subspace), not of its representation. Formally, the dimension of the space spanned by the data is invariant under basis change: if \(X\) is a data matrix in one basis and \(\tilde{X} = XP\) is its representation in another basis (via an invertible change-of-basis matrix \(P\)), then \(\text{rank}(X) = \text{rank}(\tilde{X})\) (matrix rank is invariant under right multiplication by an invertible matrix). PCA estimates intrinsic dimension as the number of large eigenvalues (or, in the noiseless case, the rank of the covariance matrix). While the eigenvalues and eigenvectors themselves change under basis change, the number of non-zero eigenvalues does not.
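A small numerical sketch of this invariance, assuming NumPy. The synthetic data (50 points on a 2-dimensional subspace of \(\mathbb{R}^5\)) and the unit upper-triangular matrix `P` are hypothetical; `P` is built that way only because its determinant is 1, so it is guaranteed invertible while not being orthogonal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 50 samples in R^5 lying on a 2-dimensional subspace.
Z = rng.standard_normal((50, 2))
B = rng.standard_normal((2, 5))
X = Z @ B

# Non-orthogonal change of basis: unit upper-triangular, det(P) = 1.
P = np.eye(5) + np.triu(rng.standard_normal((5, 5)), k=1)
X_tilde = X @ P

rank_X = int(np.linalg.matrix_rank(X))
rank_X_tilde = int(np.linalg.matrix_rank(X_tilde))
```

The singular values of `X` and `X_tilde` differ, but the number of non-zero ones is the same.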

Counterexample if False: N/A (statement is true).

Comprehension: Intrinsic dimension is a geometric property of the data—how many degrees of freedom they truly have. This is independent of our choice of coordinate system (basis). Orthogonal basis changes (e.g., rotations) preserve inner products and distances, so geometric estimates of dimension remain unchanged.

ML Applications: Practitioners can confidently apply PCA without worrying that basis choice affects the intrinsic dimension estimate.

Failure Mode Analysis: Rank-deficient or ill-conditioned data can make eigenvalue estimation numerically unstable. Small numerical errors in covariance matrix computation can lead to spurious small eigenvalues, potentially overestimating intrinsic dimension. Regularization (adding \(\lambda I\) to the covariance) can help.

Traps: Confusing “intrinsic dimension” (a property of the data) with “explained variance by top-\(k\) components” (which depends on how many components you keep).


A.4. If a convolutional neural network reduces the spatial dimension from \(h \times w\) to \(1 \times 1\) (global average pooling before the final layer), the effective dimension of the representational bottleneck equals the number of final feature channels.

Final Answer: True.

Full Mathematical Justification: After global average pooling, the spatial dimensions \(h \times w\) are reduced to \(1 \times 1\), leaving only the channel dimension. If there are \(c\) channels, the output is a vector in \(\mathbb{R}^c\). The effective dimension of this representation is \(c\) (the number of independent degrees of freedom).

Counterexample if False: N/A (statement is true).

Comprehension: Dimension counting at each layer of a neural network bounds representational capacity. A global average pooled layer with \(c\) channels can represent at most \(c\) independent directions in the data space, regardless of the input spatial complexity.

ML Applications: Understanding bottleneck dimensions helps explain why too few channels severely limit model capacity, while over-parameterizing with very large \(c\) risks overfitting.

Failure Mode Analysis: If a network is forced to use very few channels (\(c\) small), the global average pooling creates a severe bottleneck that cannot be overcome.

Traps: Confusing spatial reduction (e.g., \(h \times w \to 1 \times 1\)) with channel number. Global average pooling along spatial dimensions does not change the channel count; it collapses spatial structure into per-channel scalars.


A.5. An affine subspace (a translation of a vector subspace) has the same dimension as its underlying vector subspace.

Final Answer: True.

Full Mathematical Justification: An affine subspace is a set of the form \(A = \mathbf{x}_0 + V\) where \(V\) is a vector subspace and \(\mathbf{x}_0\) is a fixed point. The dimension of \(A\) is defined as \(\dim(A) := \dim(V)\). Translation preserves dimension: the translation map \(\mathbf{x} \mapsto \mathbf{x} + \mathbf{x}_0\) is a bijection from \(V\) to \(A\), and it preserves linear structure (up to translation).

Counterexample if False: N/A (statement is true).

Comprehension: A line in \(\mathbb{R}^3\) (1-dimensional affine subspace) and the vector subspace defining its direction both have dimension 1. A 2D plane is a 2-dimensional affine subspace.

ML Applications: In regression, the solution set to an underdetermined least-squares problem is an affine subspace (a particular solution plus the null space). Its dimension is the nullity of the design matrix. The minimum-norm (pseudoinverse) solution is the point of this affine set closest to the origin; it is also the limit of the ridge solution as \(\lambda \to 0^+\).

Failure Mode Analysis: None—this is a fundamental definition. However, confusion can arise when describing affine subspaces: one must specify both a basepoint \(\mathbf{x}_0\) and the underlying vector subspace \(V\).

Traps: Conflating affine subspaces with vector subspaces. An affine subspace need not contain the origin; a vector subspace always does.


A.6. If the change-of-basis matrix from basis \(\mathcal{B}\) to basis \(\mathcal{B}'\) is orthogonal, then the Euclidean norm of any vector is preserved under the change of coordinates.

Final Answer: True.

Full Mathematical Justification: Let \(P = P_{\mathcal{B} \to \mathcal{B}'}\) be orthogonal, meaning \(P^\top P = I\). A vector \(\mathbf{v}\) has coordinates \([\mathbf{v}]_\mathcal{B}\) and \([\mathbf{v}]_{\mathcal{B}'} = P [\mathbf{v}]_\mathcal{B}\). Since \(P\) is orthogonal (preserves norms: \(\|P\mathbf{x}\| = \|\mathbf{x}\|\)), we have \(\|[\mathbf{v}]_{\mathcal{B}'}\| = \|[\mathbf{v}]_\mathcal{B}\|\).
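As an illustrative check, assuming NumPy, the sketch below builds an orthogonal matrix from a QR factorization of a random matrix and compares coordinate norms before and after the change of coordinates. The diagonal matrix `S` is a hypothetical non-orthogonal change of basis included only for contrast.

```python
import numpy as np

rng = np.random.default_rng(1)

# Orthogonal change-of-basis matrix from the QR factorization of a random matrix.
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))

v_B = rng.standard_normal(4)     # coordinates in basis B
v_Bp = Q @ v_B                   # coordinates in basis B'

norm_B = np.linalg.norm(v_B)
norm_Bp = np.linalg.norm(v_Bp)

# Contrast: an invertible but non-orthogonal change of basis rescales norms.
S = np.diag([1.0, 2.0, 3.0, 4.0])
norm_scaled = np.linalg.norm(S @ v_B)
```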

Counterexample if False: N/A (statement is true).

Comprehension: Orthogonal change-of-basis matrices correspond to rotations and reflections—transformations that preserve distances and angles. This is why orthonormal bases are special.

ML Applications: In optimization, orthogonal reparameterizations preserve step sizes and gradient norms. Whitening, by contrast, combines a rotation with a per-direction rescaling, so it is not orthogonal in general: it deliberately changes norms in order to make the loss landscape more isotropic.

Failure Mode Analysis: If the change-of-basis matrix is nearly orthogonal but not exactly (e.g., due to numerical errors in Gram-Schmidt), norms may be slightly distorted.

Traps: Assuming all useful change-of-basis matrices are orthogonal. Non-orthogonal bases can be useful and may better suit certain problems.


A.7. In a word embedding space where similarities are computed via dot products, the embedding matrix is identifiable up to orthogonal transformations of the embedding dimension.

Final Answer: True.

Full Mathematical Justification: Let \(W \in \mathbb{R}^{V \times d}\) be an embedding matrix. For any orthogonal matrix \(Q\), the embeddings \(W\) and \(WQ\) induce identical dot-product similarities \((WQ)(WQ)^\top = WW^\top\), making them observationally equivalent for any downstream model depending only on dot products.
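The identity \((WQ)(WQ)^\top = WW^\top\) is easy to confirm numerically. This is a minimal sketch assuming NumPy, with a hypothetical vocabulary of 6 words and embedding dimension 3.

```python
import numpy as np

rng = np.random.default_rng(2)

W = rng.standard_normal((6, 3))                     # hypothetical embeddings
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))    # random orthogonal matrix

W_rot = W @ Q                                       # rotated/reflected embeddings

same_similarities = np.allclose(W @ W.T, W_rot @ W_rot.T)
different_coordinates = not np.allclose(W, W_rot)
```

The individual coordinates change while every pairwise dot product is unchanged, which is why single embedding dimensions carry no canonical meaning.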

Counterexample if False: N/A (statement is true).

Comprehension: The embedding space is a learned latent representation. Only the similarities (angles and relative magnitudes) between word vectors matter; the specific choice of coordinate axes (basis) is arbitrary.

ML Applications: To compare embeddings across runs, one must align them (e.g., via Procrustes analysis), which finds the “best” orthogonal transformation matching one embedding to another.

Failure Mode Analysis: If one attempts to interpret individual dimensions of word embeddings without alignment to a canonical basis, interpretations are arbitrary (different runs will have different interpretations).

Traps: Assuming word embeddings are “interpretable” without alignment to a canonical basis.


A.8. A matrix \(A \in \mathbb{R}^{m \times n}\) with rank \(r\) can always be factorized as \(A = UV^\top\) where \(U \in \mathbb{R}^{m \times r}\) and \(V \in \mathbb{R}^{n \times r}\) are full column rank.

Final Answer: True.

Full Mathematical Justification: By the SVD, \(A = U_0 \Sigma V_0^\top\). Define \(U = U_0(:, 1:r) \Sigma(1:r, 1:r)^{1/2}\) and \(V = V_0(:, 1:r) \Sigma(1:r, 1:r)^{1/2}\). Then \(A = UV^\top\) and both \(U, V\) have full column rank \(r\).
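The construction above translates directly into code. The following sketch assumes NumPy; the rank-2 test matrix and the relative threshold used to decide the numerical rank are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical 6 x 4 matrix of rank 2.
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))

U0, s, V0t = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10 * s[0]))       # numerical rank

# Split the singular values evenly between the two factors.
sqrt_S = np.diag(np.sqrt(s[:r]))
U = U0[:, :r] @ sqrt_S                  # m x r
V = V0t[:r].T @ sqrt_S                  # n x r

reconstruction_ok = np.allclose(A, U @ V.T)
```

Replacing `U` by `U @ M` and `V` by `V @ np.linalg.inv(M).T` for any invertible \(r \times r\) matrix `M` gives another valid factorization, which is the non-uniqueness noted under Traps.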

Counterexample if False: N/A (statement is true).

Comprehension: Any rank-\(r\) matrix can be written as a product of two full-column-rank matrices. This factorization is not unique, but existence is guaranteed.

ML Applications: Low-rank matrix factorization is used in collaborative filtering, dimensionality reduction, and latent factor models. The rank-\(r\) constraint restricts the parameter space to rank-\(r\) matrices, which is implicit regularization.

Failure Mode Analysis: If the true rank is higher than the prescribed \(r\), truncating the factorization results in reconstruction error (approximation error / bias).

Traps: Assuming the factorization is unique. Multiple \((U, V)\) pairs yield the same \(A = UV^\top\).


A.9. The dimension of the solution space to a homogeneous linear system \(A\mathbf{x} = \mathbf{0}\) equals the number of free variables that appear when \(A\) is reduced to row echelon form.

Final Answer: True.

Full Mathematical Justification: By rank-nullity, \(\text{nullity}(A) = n - \text{rank}(A) = n - p\) where \(p\) is the number of pivot columns. The number of free variables is \(n - p\), which equals the dimension of the null space.
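To make the pivot/free-variable count concrete, here is a small sketch in pure Python that row-reduces with exact rational arithmetic. The helper `pivot_columns` and the example matrix are hypothetical, written only for illustration.

```python
from fractions import Fraction

def pivot_columns(rows):
    """Row-reduce a matrix (list of lists) exactly; return pivot column indices."""
    M = [[Fraction(x) for x in row] for row in rows]
    n_rows, n_cols = len(M), len(M[0])
    pivots, r = [], 0
    for c in range(n_cols):
        # Find a row at or below r with a non-zero entry in column c.
        p = next((i for i in range(r, n_rows) if M[i][c] != 0), None)
        if p is None:
            continue                      # no pivot here: column c is free
        M[r], M[p] = M[p], M[r]
        pivot = M[r][c]
        M[r] = [x / pivot for x in M[r]]
        for i in range(n_rows):
            factor = M[i][c]
            if i != r and factor != 0:
                M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
        if r == n_rows:
            break
    return pivots

# Example: the third row is the sum of the first two, and column 1 = 2 * column 0.
A_rows = [[1, 2, 1, 0],
          [2, 4, 0, 2],
          [3, 6, 1, 2]]

pivots = pivot_columns(A_rows)
n = len(A_rows[0])
free = [c for c in range(n) if c not in pivots]
nullity = len(free)                       # n - p
```

Here \(n = 4\) and \(p = 2\), so the solution space of \(A\mathbf{x} = \mathbf{0}\) is 2-dimensional.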

Counterexample if False: N/A (statement is true).

Comprehension: Free variables parameterize the solution set. Each free variable can be set to any real number independently, so with \(k\) free variables, the solution space is \(k\)-dimensional.

ML Applications: In regression or classification, understanding how many degrees of freedom remain in the solution set tells you how much non-uniqueness to expect.

Failure Mode Analysis: Numerical errors in row reduction can lead to misidentifying pivot vs. free variables. Using SVD is more stable for tall or ill-conditioned matrices.

Traps: Confusing “free variables” (independent parameters in the solution set) with “total variables” (dimension of the ambient space).


A.10. If an autoencoder has an encoder bottleneck of dimension \(k\) and the true intrinsic dimension of training data is \(k' > k\), then the autoencoder will generalize better than one with bottleneck dimension \(k' + 1\).

Final Answer: False.

Full Mathematical Justification: An autoencoder with bottleneck \(k < k'\) will have larger reconstruction error (underfitting). An autoencoder with bottleneck \(k' + 1\) has sufficient capacity to fit the intrinsic structure (and possibly some noise). The optimal bottleneck dimension balances bias (underfitting) and variance (overfitting), and is typically close to but not necessarily equal to \(k'\). An undercomplete autoencoder (\(k < k'\)) has lower variance but higher bias. On in-distribution test data, a \(k'\)- or \((k'+1)\)-dimensional bottleneck will generalize better than a \(k\)-dimensional one.

Counterexample if False: Training data lie on a 50-dimensional subspace. A \(k=30\)-bottleneck underfits (high reconstruction error, large bias). A \(k=51\)-bottleneck fits well and generalizes better to test data from the same distribution.

Comprehension: To generalize well, the model must first fit the structure (low bias). If the bottleneck is too small, it underfits.

ML Applications: Choosing autoencoder bottleneck dimension requires cross-validation, not just guessing to be “as small as possible.”

Failure Mode Analysis: Forcing bottleneck too small results in high bias and poor generalization if intrinsic dimension is high.

Traps: Confusing regularization (reducing overfitting risk) with underfitting avoidance. Small bottleneck dimension is implicit regularization only if you have enough capacity to represent the signal.


A.11. The rank of a matrix \(A\) equals the rank of its transpose \(A^\top\).

Final Answer: True.

Full Mathematical Justification: By the SVD, \(A = U\Sigma V^\top\) and \(A^\top = V\Sigma U^\top\), both with rank equal to the number of non-zero singular values. Thus \(\text{rank}(A) = \text{rank}(A^\top)\).

Counterexample if False: N/A (statement is true).

Comprehension: Column space and row space are dual concepts. A tall matrix (\(m > n\)) and its transpose (wide matrix \(n \times m\)) have the same rank.

ML Applications: In regression, the design matrix \(X\) and its transpose both have rank \(\leq \min(n, d)\). The rank determines solution uniqueness; whether you think of it as “the rows are independent” or “the columns are independent,” the constraint applies symmetrically.

Failure Mode Analysis: None—this is always true. Computing rank via row reduction is sometimes more efficient than column reduction, depending on the matrix shape.

Traps: Conflating “rank of \(A\)” with “rank of rows of \(A\)” (the latter is the row space dimension, which equals \(\text{rank}(A)\), but can be confusing phrasing).


A.12. In a neural network with weight matrix \(W \in \mathbb{R}^{m \times n}\), if \(\text{rank}(W) < \min(m, n)\), then the layer compresses information dimensionality and cannot be inverted exactly even with infinite precision.

Final Answer: True.

Full Mathematical Justification: If \(\text{rank}(W) = r < m\), the image is \(r\)-dimensional (a proper subspace of \(\mathbb{R}^m\)); not all outputs can be reached. If \(\text{rank}(W) = r < n\), the kernel is \((n-r)\)-dimensional, so multiple inputs map to the same output. In either case, \(W\) cannot be inverted exactly.

Counterexample if False: N/A (statement is true).

Comprehension: A neural network layer with a rank-deficient weight matrix has reduced representational capacity. Some information in its input is irrevocably lost.

ML Applications: Intentional bottleneck layers (autoencoders) use rank-deficient transformations to compress. Unintentional rank deficiency (due to poor initialization, dead neurons, or feature redundancy) limits model expressivity.

Failure Mode Analysis: If a network’s early layers become low-rank, all subsequent layers are constrained by this bottleneck, and no amount of additional layers can recover the lost information.

Traps: Assuming all weight matrices remain full-rank during training. In practice, regularization, dropout, and batch normalization can inadvertently reduce rank.


A.13. Principal Component Analysis finds a basis that maximizes the total variance explained by all \(k\) components, whereas the top \(k\) eigenvectors of the data covariance matrix maximize variance sequentially (first component max variance, second max residual variance, etc.).

Final Answer: False.

Full Mathematical Justification: Both statements describe the same process. PCA finds the top-\(k\) eigenvectors of the covariance matrix, which are ordered by explained variance. The first component maximizes variance; the second component (orthogonal to the first) maximizes residual variance; and so on. This greedy sequential process is not separate from maximizing total variance; they are identical.
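The equivalence can be probed empirically. The sketch below, assuming NumPy, compares the variance captured by the top-\(k\) eigenvectors with the variance captured by randomly drawn orthonormal \(k\)-frames; the synthetic data, the choice \(k = 2\), and the number of random trials are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data with unequal variance per coordinate.
X = rng.standard_normal((200, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / (len(Xc) - 1)            # sample covariance

eigvals, eigvecs = np.linalg.eigh(C)     # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

k = 2
top = eigvecs[:, :k]
captured_top = float(np.trace(top.T @ C @ top))   # sum of the top-k eigenvalues

# Variance captured by random orthonormal k-frames never exceeds captured_top.
best_random = 0.0
for _ in range(500):
    Q, _ = np.linalg.qr(rng.standard_normal((5, k)))
    best_random = max(best_random, float(np.trace(Q.T @ C @ Q)))
```

A random search is of course not a proof; it only illustrates that the sequentially chosen components also solve the joint maximization.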

Counterexample if False: Any dataset demonstrates this: PCA’s sequential property (each new component maximizes remaining variance) is exactly equivalent to maximizing total variance captured by the \(k\) components.

Comprehension: PCA is greedy and sequential, but this sequential property happens to be optimal (globally) for maximizing explained variance by \(k\) orthogonal components.

ML Applications: Practitioners can adaptively choose the number of components \(k\): compute all eigenvectors, plot cumulative explained variance, and choose \(k\) at a flexible threshold (e.g., 95%).

Failure Mode Analysis: If one mistakenly believes PCA’s sequential selection is heuristic (not optimal), one might try alternative methods expecting better results. In reality, PCA’s sequential procedure is provably optimal.

Traps: Conflating “sequential” (greedy) with “suboptimal.” PCA’s sequential procedure is both a practical algorithm and an optimal solution.


A.14. If two datasets in \(\mathbb{R}^d\) have the same intrinsic dimension but different noise levels, applying PCA with the same number of retained components will result in the noisier dataset having higher reconstruction error.

Final Answer: True.

Full Mathematical Justification: Noise inflates variance estimates. The top-\(k\) eigenvectors of the noisy covariance matrix are contaminated by noise, so projecting onto them captures less of the true signal than for clean data, resulting in higher reconstruction error.

Counterexample if False: N/A (statement is true).

Comprehension: Noise inflates variance estimates, especially at small scales. The top-\(k\) eigenvectors of the noisy covariance matrix are not the “true” signal basis but a noise-contaminated version.

ML Applications: In denoising, one must be careful with PCA on noisy data. Better approaches: estimate the noise level, subtract it from eigenvalues (Marchenko-Pastur corrections), or use robust PCA / matrix completion methods.

Failure Mode Analysis: Applying PCA naively to noisy data can result in poor denoising. The top components capture signal+noise, and low-rank approximation then reconstructs an estimate that is still noisy.

Traps: Assuming PCA automatically separates signal from noise. PCA is agnostic to the source of variance; it captures whatever variance is largest, whether signal or noise.


A.15. A parameter space of dimension \(n\) can fit at most \(n\) orthogonal constraints (linearly independent equations), making it universal for interpolating \(n\) arbitrary data points in its range.

Final Answer: True.

Full Mathematical Justification: An \(n\)-dimensional parameter space \(\mathbb{R}^n\) has exactly \(n\) degrees of freedom. Any \(n\) orthogonal (linearly independent) constraints define a system \(A\mathbf{w} = \mathbf{b}\) where \(A \in \mathbb{R}^{n \times n}\) is full rank, having a unique solution. Interpolating \(n\) arbitrary data points requires solving \(\Phi \mathbf{w} = \mathbf{y}\) where \(\Phi \in \mathbb{R}^{n \times n}\) is the feature matrix. If the features are linearly independent, \(\Phi\) is invertible, and a unique solution exists.

Counterexample if False: N/A (statement is true).

Comprehension: An \(n\)-dimensional model (e.g., \(n\) parameters in linear regression) can exactly fit \(n\) data points. With fewer parameters (\(d < n\)), perfect fit is generally impossible (overdetermined system). With more parameters (\(d > n\)), infinitely many hypotheses fit perfectly (underdetermined, unless regularized).

ML Applications: In polynomial interpolation, a degree-\(n\) polynomial has \(n+1\) coefficients and can fit \(n+1\) points with distinct inputs exactly. In neural networks, a model with \(n\) parameters can often be adjusted to satisfy \(n\) constraints, which is what happens when a model overfits a small dataset.

Failure Mode Analysis: While an \(n\)-parameter model can fit \(n\) points exactly, generalization to unseen data suffers (overfitting). Regularization or using fewer parameters is necessary for good generalization.

Traps: Confusing “fitting (interpolation)” with “generalization (prediction on new data).”


A.16. In ridge regression with regularization parameter \(\lambda > 0\), increasing \(\lambda\) is equivalent to solving least squares in a lower-effective-dimensional subspace of the parameter space.

Final Answer: True.

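Full Mathematical Justification: Using the SVD \(X = U\Sigma V^\top\), the ridge solution is \(\hat{\mathbf{w}}_\lambda = \sum_i \frac{\sigma_i}{\sigma_i^2 + \lambda} (\mathbf{u}_i^\top \mathbf{y}) \mathbf{v}_i\), so each singular direction is scaled by \(\sigma_i / (\sigma_i^2 + \lambda)\) instead of the \(1/\sigma_i\) of ordinary least squares, a relative shrinkage factor of \(\sigma_i^2 / (\sigma_i^2 + \lambda)\). As \(\lambda \to \infty\), all scaling factors \(\to 0\). Directions with small singular values are more aggressively shrunk. The “effective dimension” \(\sum_i \frac{\sigma_i^2}{\sigma_i^2 + \lambda}\) decreases as \(\lambda\) increases. Thus larger \(\lambda\) reduces effective dimensionality.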
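The effective-dimension formula can be evaluated directly. This is a minimal sketch assuming NumPy; the random design matrix and the grid of \(\lambda\) values are hypothetical. As a cross-check, the same quantity is computed as the trace of the ridge hat matrix \(X(X^\top X + \lambda I)^{-1} X^\top\).

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((40, 6))              # hypothetical design matrix
sigma = np.linalg.svd(X, compute_uv=False)    # singular values of X

def effective_dimension(sigma, lam):
    """Sum of the per-direction shrinkage factors sigma_i^2 / (sigma_i^2 + lam)."""
    return float(np.sum(sigma**2 / (sigma**2 + lam)))

lams = [0.0, 1.0, 10.0, 100.0, 1000.0]
eff = [effective_dimension(sigma, lam) for lam in lams]

# Cross-check at lam = 10: trace of the ridge hat matrix.
lam = 10.0
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
eff_from_trace = float(np.trace(H))
```

At \(\lambda = 0\) the effective dimension equals the rank of \(X\) (here 6), and it decreases monotonically toward 0 as \(\lambda\) grows.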

Counterexample if False: N/A (statement is true).

Comprehension: Ridge regression is dimension reduction via shrinkage. Rather than hard-truncating some dimensions (as in PCA), it soft-shrinks all dimensions, with higher shrinkage on lower-variance directions.

ML Applications: Ridge regression stabilizes regression when features are collinear or the design matrix is ill-conditioned. Higher \(\lambda\) reduces overfitting by reducing effective dimension. Cross-validation selects the optimal \(\lambda\).

Failure Mode Analysis: Too much regularization (\(\lambda\) very large) underfits by overly reducing effective dimension. Optimal \(\lambda\) is problem-dependent and determined by cross-validation.

Traps: Confusing ridge regression’s dimension reduction with PCA. Ridge shrinks all directions (soft-shrinkage); PCA truncates (hard-shrinkage).


A.17. If a matrix \(A\) is square and full rank, then there exists a basis of \(\mathbb{R}^n\) consisting entirely of eigenvectors of \(A\).

Final Answer: False.

Full Mathematical Justification: Full rank does not guarantee diagonalizability. Example: the Jordan block \(A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}\) is full rank but has eigenvalue \(\lambda = 1\) with algebraic multiplicity 2 and geometric multiplicity 1 (only one eigenvector). Thus no basis of eigenvectors exists.

Counterexample if False: \(A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}\). The eigenspace for \(\lambda=1\) is \(\ker(A - I) = \text{span}\{(1, 0)^\top\}\), which has dimension 1 (geometric multiplicity 1 \(<\) algebraic multiplicity 2). The matrix is not diagonalizable.
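The multiplicities in this counterexample can be confirmed with rank computations, sketched here under the assumption that NumPy is available.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])

rank_A = int(np.linalg.matrix_rank(A))          # full rank, so A is invertible

# Geometric multiplicity of lambda = 1 is dim ker(A - I) = n - rank(A - I).
geometric_multiplicity = 2 - int(np.linalg.matrix_rank(A - np.eye(2)))

# (1, 0)^T is an eigenvector; (0, 1)^T is not.
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
e1_is_eigenvector = np.allclose(A @ e1, e1)
e2_is_eigenvector = np.allclose(A @ e2, e2)
```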

Comprehension: Invertibility (full rank) is a weak condition on matrices. Diagonalizability is a stronger condition that requires a certain relationship between algebraic and geometric multiplicities (no defective eigenvalues).

ML Applications: In neural networks, weight matrices are typically full rank but not necessarily diagonalizable, so spectral decomposition methods may not apply cleanly. However, symmetric matrices (which arise in optimization, e.g., Hessians) are always diagonalizable.

Failure Mode Analysis: Assuming a full-rank matrix is diagonalizable can lead to incorrect eigendecomposition algorithms. If eigenvalues are defective, one must use the Jordan normal form.

Traps: Conflating “full rank” (invertible, no zero eigenvalues) with “diagonalizable” (complete eigenbasis).


A.18. A Gram matrix \(G = X^\top X\) (where \(X \in \mathbb{R}^{n \times d}\), \(n < d\)) is singular, and the dimension of its null space is \(d - n\).

Final Answer: True.

Full Mathematical Justification: \(\text{rank}(G) = \text{rank}(X) \leq \min(n, d) = n < d\). Thus \(G \in \mathbb{R}^{d \times d}\) is rank-deficient and singular. The nullity is \(\text{nullity}(G) = d - \text{rank}(G) \geq d - n\), with equality exactly when \(\text{rank}(X) = n\) (full row rank, the generic case).
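A numerical illustration of the generic case, assuming NumPy. The sizes \(n = 3\), \(d = 7\) and the Gaussian data are hypothetical, and the eigenvalue threshold is an arbitrary tolerance chosen for this sketch.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 3, 7
X = rng.standard_normal((n, d))          # generic X has full row rank n

G = X.T @ X                              # d x d, symmetric positive semidefinite
eigvals = np.linalg.eigvalsh(G)          # ascending, real

# Count eigenvalues that are numerically zero relative to the largest one.
nullity = int(np.sum(eigvals < 1e-10 * eigvals.max()))
rank_X = int(np.linalg.matrix_rank(X))

# For contrast, the n x n matrix X X^T is invertible in this generic case.
K = X @ X.T
rank_K = int(np.linalg.matrix_rank(K))
```

If the rows of `X` were linearly dependent, `rank_X` would drop below `n` and the nullity would exceed `d - n`.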

Counterexample if False: N/A (statement is true for generic \(X\) with full row rank).

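Comprehension: Here \(G = X^\top X\) is the \(d \times d\) Gram matrix of the columns of \(X\): its entries are inner products between feature columns, each of which is a vector in \(\mathbb{R}^n\). With fewer samples (\(n\)) than features (\(d\)), the \(d\) columns cannot be linearly independent, so \(G\) is necessarily singular. (The \(n \times n\) sample-sample matrix \(XX^\top\) is a different object and is invertible when \(X\) has full row rank.)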

ML Applications: Gram matrices appear in kernel methods, SVM, and Gaussian process regression. A singular Gram matrix indicates that the samples do not span the ambient space. Pseudoinverses must be used instead of direct inversion.

Failure Mode Analysis: Attempting to invert \(G\) directly will fail numerically. Using regularization or pseudoinverse is necessary.

Traps: Assuming \(G\) is always invertible if you have “enough” samples. Whether \(G\) is invertible depends on the relationship of sample count to ambient dimension.


A.19. The number of principal components needed to retain 95% of variance in PCA is a lower bound on the intrinsic dimension of the data.

Final Answer: True.

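Full Mathematical Justification: Take intrinsic dimension to mean the dimension of the linear subspace spanned by the centered data, i.e., the rank of the covariance matrix. If \(k^*\) components are needed to reach 95% cumulative variance, then fewer than \(k^*\) components do not suffice, so the intrinsic dimension is at least \(k^*\): if the data lay entirely in a subspace of dimension \(r < k^*\), the top \(r\) components would already explain 100% of the variance, contradicting the minimality of \(k^*\). (For data on a curved manifold, linear PCA can require more components than the dimension of the manifold, so the bound is specific to this linear notion.)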

Counterexample if False: N/A (statement is true).

Comprehension: The intrinsic dimension is the minimum number of coordinates needed to represent data without significant information loss. PCA’s threshold-based \(k^*\) (e.g., 95% variance) provides a conservative lower bound.

ML Applications: When estimating intrinsic dimension, using the 95%-threshold rule gives a principled lower bound.

Failure Mode Analysis: If the variance threshold (95%) is set too low, the estimated intrinsic dimension is conservative (possibly underestimated). If set too high (e.g., 99.9%), more components are retained, increasing computational cost.

Traps: Confusing the PCA-based intrinsic dimension estimate (a heuristic based on variance threshold) with the “true” intrinsic dimension (which may be defined differently by other methods).


A.20. In a feature space of dimension \(d\), a linear classifier defined by weight vector \(\mathbf{w}\) and bias \(b\) partitions the space into two half-spaces of dimension \(d\) separated by a hyperplane of dimension \(d-1\).

Final Answer: True.

Full Mathematical Justification: A linear classifier is \(f(\mathbf{x}) = \text{sign}(\mathbf{w}^\top \mathbf{x} + b)\) with \(\mathbf{w} \neq \mathbf{0}\). The decision boundary is \(\{\mathbf{x} : \mathbf{w}^\top \mathbf{x} + b = 0\}\), which is an affine hyperplane of dimension \(d - 1\) (the solution set to a single non-trivial linear equation in \(d\) variables has dimension \(d - 1\)). This hyperplane divides \(\mathbb{R}^d\) into two half-spaces: \(\{\mathbf{x} : \mathbf{w}^\top \mathbf{x} + b > 0\}\) and \(\{\mathbf{x} : \mathbf{w}^\top \mathbf{x} + b < 0\}\). Each half-space is open, \(d\)-dimensional, and unbounded.

Counterexample if False: N/A (statement is true).

Comprehension: A linear decision boundary in 2D is a line (1-dimensional), in 3D is a plane (2-dimensional), and in \(d\)-dimensions is a hyperplane (\((d-1)\)-dimensional). The partition into half-spaces is exhaustive and disjoint.

ML Applications: Linear classifiers (logistic regression, SVM with linear kernel, perceptron) are simple but powerful for linearly separable data.

Failure Mode Analysis: If data are not linearly separable, a linear classifier cannot achieve perfect separation, regardless of the ambient dimension.

Traps: Confusing the dimensionality of the feature space (\(d\)) with the dimensionality of the decision boundary (\(d-1\)).

Solutions to B. Proof Problems

B.1. Steinitz Exchange Lemma (Constructive Version)

Full Formal Proof:

We prove by induction on the size of the independent set.

Base case (\(m = 0\)): If \(\mathcal{I} = \emptyset\), then \(\mathcal{I} \cup \mathcal{S} = \mathcal{S}\) spans \(V\), and \(\mathcal{S}' = \mathcal{S}\) has size \(n\). Thus \(0 \leq n\) trivially.

Inductive step: Assume the lemma holds for linearly independent sets of size \(m-1\). Let \(\mathcal{I} = \{\mathbf{v}_1, \ldots, \mathbf{v}_m\}\) be linearly independent and \(\mathcal{S} = \{\mathbf{w}_1, \ldots, \mathbf{w}_n\}\) spanning. Consider \(\mathcal{I}_{m-1} = \{\mathbf{v}_1, \ldots, \mathbf{v}_{m-1}\}\) (linearly independent, size \(m-1\)). By induction, \(m-1 \leq n\) and there exists \(\mathcal{S}_1 \subseteq \mathcal{S}\) of size \(n-(m-1)\) such that \(\mathcal{I}_{m-1} \cup \mathcal{S}_1\) spans \(V\).

Since \(\mathcal{I}_{m-1} \cup \mathcal{S}_1\) spans \(V\), we can write \(\mathbf{v}_m = \sum_{i=1}^{m-1} \alpha_i \mathbf{v}_i + \sum_{\mathbf{w} \in \mathcal{S}_1} \beta_\mathbf{w} \mathbf{w}\).

Since \(\mathcal{I}\) is linearly independent, \(\mathbf{v}_m\) is not in \(\text{span}(\mathbf{v}_1, \ldots, \mathbf{v}_{m-1})\), so at least one \(\beta_\mathbf{w} \neq 0\). In particular \(\mathcal{S}_1\) is non-empty, so \(n - (m-1) \geq 1\), i.e., \(m \leq n\). Choose such a \(\mathbf{w}_* \in \mathcal{S}_1\) with \(\beta_{\mathbf{w}_*} \neq 0\). Solving the expression for \(\mathbf{v}_m\) for \(\mathbf{w}_*\) shows that \(\mathbf{w}_*\) is a linear combination of \(\mathbf{v}_1, \ldots, \mathbf{v}_m\) and the vectors of \(\mathcal{S}_1 \setminus \{\mathbf{w}_*\}\), so \((\mathcal{I}_{m-1} \cup \{\mathbf{v}_m\}) \cup (\mathcal{S}_1 \setminus \{\mathbf{w}_*\})\) spans \(V\). Setting \(\mathcal{S}' = \mathcal{S}_1 \setminus \{\mathbf{w}_*\}\) gives size \(n - m\), completing the induction. \(\square\)
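The inductive step is constructive, and can be sketched as an algorithm. The code below assumes NumPy; the function `steinitz_exchange`, its tolerance, and the example vectors in \(\mathbb{R}^3\) are hypothetical and meant only to mirror the proof, not to serve as a numerically robust routine.

```python
import numpy as np

def steinitz_exchange(I, S, tol=1e-10):
    """For each v in the independent list I, drop one vector of the spanning
    list S so that I together with the kept vectors still spans.
    Returns the indices of S that are kept."""
    kept = list(range(len(S)))
    for j, v in enumerate(I):
        # Current spanning set: v_1..v_{j} already inserted, plus surviving w's.
        current = list(I[:j]) + [S[i] for i in kept]
        M = np.column_stack(current)
        coeffs, *_ = np.linalg.lstsq(M, v, rcond=None)
        beta = coeffs[j:]                    # coefficients on the surviving w's
        k = int(np.argmax(np.abs(beta)))
        if abs(beta[k]) <= tol:
            raise ValueError("I is not linearly independent")
        kept.pop(k)                          # exchange: this w is now expendable
    return kept

I = [np.array([1.0, 1.0, 0.0]), np.array([0.0, 1.0, 1.0])]
S = [np.array([1.0, 0.0, 0.0]),
     np.array([0.0, 1.0, 0.0]),
     np.array([0.0, 0.0, 1.0])]

kept = steinitz_exchange(I, S)
new_spanning = I + [S[i] for i in kept]
still_spans = int(np.linalg.matrix_rank(np.column_stack(new_spanning))) == 3
```

Which vector of `S` gets removed at each step is not unique; any one with a non-zero coefficient works, and the sketch simply picks the largest in absolute value.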

Proof Strategy & Techniques: Induction on independent set size; exchange principle ensures one spanning vector is expendable.

Computational Validation: See vector examples in \(\mathbb{R}^n\) with explicit basis exchanges.

ML Interpretation: Underpins greedy feature selection: add important features, remove redundant raw features iteratively.

Generalization & Edge Cases: Requires spanning set and linear independence; fails in infinite dimensions without Axiom of Choice.

Failure Mode Analysis: Cannot apply to dependent sets; proof fails at induction step lacking linear independence.

Historical Context: Steinitz (1910); foundational result in linear algebra for proving Dimension Theorem.

Traps: Forgetting to use linear independence; missing that \(\mathcal{S}\) must span \(V\).


B.2. Dimension Theorem

Full Formal Proof: Apply Steinitz Exchange Lemma to two bases \(\mathcal{B} = \{\mathbf{b}_1, \ldots, \mathbf{b}_n\}\) and \(\mathcal{B}' = \{\mathbf{b}'_1, \ldots, \mathbf{b}'_m\}\). Since \(\mathcal{B}'\) is independent and \(\mathcal{B}\) spans: \(m \leq n\). Conversely, since \(\mathcal{B}\) is independent and \(\mathcal{B}'\) spans: \(n \leq m\). Thus \(n = m\). \(\square\)

Proof Strategy & Techniques: Bidirectional Steinitz application; symmetry argument.

Computational Validation: All bases of \(\mathbb{R}^n\) have exactly \(n\) vectors.

ML Interpretation: Guarantees intrinsic dimensionality is invariant across coordinate systems (feature spaces).

Generalization & Edge Cases: Extends to infinite-dimensional spaces via cardinality arguments.

Failure Mode Analysis: Breaks if sets are not both bases; confused definitions yield false claims.

Historical Context: Early 1900s formalization; central to modern linear algebra.

Traps: Confusing basis with spanning set or independent set alone.


B.3. Rank-Nullity for Linear Maps

Full Formal Proof: Choose basis \(\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}\) for \(\ker(T)\), extend to \(\{\mathbf{v}_1, \ldots, \mathbf{v}_k, \mathbf{u}_1, \ldots, \mathbf{u}_r\}\) for \(V\). Show \(\{T(\mathbf{u}_1), \ldots, T(\mathbf{u}_r)\}\) is a basis for \(\text{im}(T)\). Then \(\dim(V) = k + r = \dim(\ker T) + \dim(\text{im} T)\). \(\square\)

Proof Strategy & Techniques: Basis extension; decompose the domain into the kernel and a complement that \(T\) maps isomorphically onto the image.

Computational Validation: \(T: \mathbb{R}^3 \to \mathbb{R}^2\), \((x,y,z) \mapsto (x+y, y+z)\) has nullity 1, rank 2; sum is 3. ✓
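The count above can be checked numerically. This is a small sketch, assuming the standard bases so that \(T\) is represented by a \(2 \times 3\) matrix; the kernel is read off from the right singular vectors beyond the rank.

```python
import numpy as np

# Matrix of T(x, y, z) = (x + y, y + z) in the standard bases.
T = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])

rank = int(np.linalg.matrix_rank(T))

# Full SVD: rows of Vt beyond the rank span the kernel.
_, s, Vt = np.linalg.svd(T)
kernel = Vt[rank:].T
nullity = kernel.shape[1]

print(rank, nullity, rank + nullity == T.shape[1])
print(np.allclose(T @ kernel, 0))
```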

ML Interpretation: Quantifies information bottlenecks in neural networks: a linear layer of rank \(r\) acting on \(n\)-dimensional inputs discards an \((n - r)\)-dimensional subspace (its kernel).

Generalization & Edge Cases: Requires finite dimensionality; infinite-dimensional case needs topological closure.

Failure Mode Analysis: Numerical rank estimation can fail if near-singular values are misidentified.

Historical Context: Foundational 20th century result; bridges linear algebra and homology.

Traps: Forgetting kernel and image are in different spaces (\(V\) vs. \(W\)).


B.4. Dimension of Subspaces

Full Formal Proof: Let \(\{e_1, \ldots, e_k\}\) be a basis for \(U \cap W\). Extend to bases \(\{e_1, \ldots, e_k, f_1, \ldots, f_a\}\) for \(U\) and \(\{e_1, \ldots, e_k, g_1, \ldots, g_b\}\) for \(W\). Verify \(\{e_i, f_j, g_\ell\}\) is a basis for \(U + W\) by spanning and independence. Then \(\dim(U+W) = k + a + b = \dim(U) + \dim(W) - \dim(U \cap W)\). \(\square\)

Proof Strategy & Techniques: Basis completion; exploit intersection as common core.

Computational Validation: Planes in \(\mathbb{R}^3\) intersecting in a line; dimensions satisfy formula.
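A sketch of this check, assuming the two planes are the \(xy\)- and \(yz\)-coordinate planes (an illustrative choice). The dimension of \(U + W\) is the rank of the stacked basis matrix, and the dimension of \(U \cap W\) is computed independently as the nullity of \([U \mid -W]\), which is valid because each basis matrix has independent columns.

```python
import numpy as np

U = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # basis of the xy-plane
W = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # basis of the yz-plane

dim_U = int(np.linalg.matrix_rank(U))
dim_W = int(np.linalg.matrix_rank(W))

# dim(U + W): rank of all basis vectors side by side.
dim_sum = int(np.linalg.matrix_rank(np.hstack([U, W])))

# dim(U ∩ W): solutions (a, b) of U a = W b, i.e. the null space of [U | -W].
stacked = np.hstack([U, -W])
dim_cap = stacked.shape[1] - int(np.linalg.matrix_rank(stacked))

print(dim_U, dim_W, dim_sum, dim_cap)
print(dim_sum == dim_U + dim_W - dim_cap)
```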

ML Interpretation: Multi-task learning: quantifies feature overlap vs. unique task-specific features.

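Generalization & Edge Cases: Does not extend to three or more subspaces by naive inclusion-exclusion: three distinct lines through the origin in \(\mathbb{R}^2\) have pairwise trivial intersections, so the formula would predict dimension 3, yet their sum is \(\mathbb{R}^2\).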

Failure Mode Analysis: Numerical computation of subspace intersection is ill-conditioned.

Historical Context: Classical result, foundational for lattice theory and combinatorics.

Traps: Assuming \(U + W = U \times W\) (direct product); they are different concepts.


B.5. Coordinate Uniqueness

Full Formal Proof: Suppose \(\mathbf{v} = \sum_i c_i \mathbf{b}_i = \sum_i c'_i \mathbf{b}_i\). Then \(\sum_i (c_i - c'_i) \mathbf{b}_i = 0\). By linear independence, \(c_i = c'_i\). \(\square\)

Proof Strategy & Techniques: Direct application of basis linear independence.

Computational Validation: \((3, 5)\) in standard basis: unique coordinates \((3, 5)\).

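ML Interpretation: Once the features are fixed and linearly independent, the weight vector of a linear model that realizes a given prediction function is unique; regularization is only needed to pick a solution when features are dependent.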

Generalization & Edge Cases: Extends to any field; breaks for non-basis spanning sets.

Failure Mode Analysis: Over-complete representations (frames) have non-unique coordinates.

Historical Context: Trivial in modern algebra; historically profound for rigor.

Traps: Confusing coordinate uniqueness (fixed basis) with basis uniqueness (false; many bases).


B.6. Change-of-Basis Invertibility

Full Formal Proof: Change-of-basis matrix \(P\) has columns (coordinate vectors of \(\mathcal{B}'\) in \(\mathcal{B}\)) that are linearly independent. Thus \(P\) is invertible. Composing forward and back changes gives identity; hence \(P^{-1} = P_{\mathcal{B}' \to \mathcal{B}}\). \(\square\)

Proof Strategy & Techniques: Invertibility from full rank (basis vectors are independent); composition argument.

Computational Validation: \(\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}\) has \(\det = -2 \neq 0\) (invertible).
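To see the matrix act on coordinates, here is a short sketch that treats its columns as the new basis vectors written in the standard basis (an assumption about the example). New coordinates come from solving \(P\,[\mathbf{v}]_{\mathcal{B}'} = [\mathbf{v}]_{\mathcal{B}}\); the sample vector \((3, 5)\) is illustrative.

```python
import numpy as np

# Columns of P: the new basis vectors (1, 1) and (1, -1) in standard coordinates.
P = np.array([[1.0, 1.0],
              [1.0, -1.0]])
det_P = np.linalg.det(P)

v_std = np.array([3.0, 5.0])

# Coordinates in the new basis: solve P c = v rather than forming P^{-1}.
v_new = np.linalg.solve(P, v_std)

# Going back multiplies by P.
v_back = P @ v_new

print(det_P, v_new, v_back)
```

Here \(\mathbf{v} = 4\,(1,1) - 1\,(1,-1)\), so the new coordinates are \((4, -1)\).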

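ML Interpretation: A linear layer acts as a change of basis only when its weight matrix is square and invertible; otherwise it projects or embeds.

Generalization & Edge Cases: A nonzero determinant is the invertibility criterion; \(|\det P|\) is the factor by which \(P\) scales volume, so volume is preserved only when \(|\det P| = 1\).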

Failure Mode Analysis: Numerical instability if matrix is nearly singular (ill-conditioned).

Historical Context: Euler, Lagrange (classical mechanics); formalized 20th century.

Traps: Confusing vector transformation with coordinate transformation (coordinates transform by the inverse matrix, which reduces to the transpose only when \(P\) is orthogonal).


B.7. Rank Invariance Under Row Operations

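Full Formal Proof: Row swaps and multiplication of a row by a nonzero scalar preserve the row space. Row addition (\(R_i \to R_i + \mu R_k\)) also preserves it: each new row is a linear combination of old rows, and the operation is undone by \(R_i \to R_i - \mu R_k\). Each operation is left-multiplication by an invertible matrix \(E\), so \(\ker(EA) = \ker(A)\) and linear dependencies among columns are unchanged as well. Thus rank is preserved. \(\square\)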

Proof Strategy & Techniques: Case analysis; row space invariance implies rank invariance.

Computational Validation: Row reduction preserves rank at each step; RREF rank equals original rank.
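A minimal sketch of this check, assuming a small rank-2 example matrix and three hypothetical helper functions (one per elementary row operation); the rank is recomputed after every step.

```python
import numpy as np

def swap_rows(M, i, k):
    out = M.copy()
    out[[i, k]] = out[[k, i]]
    return out

def scale_row(M, i, c):
    out = M.copy()
    out[i] *= c            # c must be nonzero
    return out

def add_multiple(M, i, k, mu):
    out = M.copy()
    out[i] += mu * out[k]  # R_i -> R_i + mu * R_k
    return out

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],   # twice the first row, so rank(A) = 2
              [1.0, 0.0, 1.0]])

steps = [A]
steps.append(swap_rows(steps[-1], 0, 2))
steps.append(scale_row(steps[-1], 1, 5.0))
steps.append(add_multiple(steps[-1], 2, 0, -3.0))

ranks = [int(np.linalg.matrix_rank(M)) for M in steps]
print(ranks)
```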

ML Interpretation: Gaussian elimination computes exact rank (modulo numerical errors).

Generalization & Edge Cases: Column operations also preserve rank; combined flexibility.

Failure Mode Analysis: Numerical pivoting required to avoid instability; partial pivoting is standard.

Historical Context: Implicit in Gauss (1809); formalized 20th century.

Traps: Forgetting pivot selection for numerical stability.


B.8. Column Rank Equals Row Rank

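Full Formal Proof: Reduce \(A\) to RREF \(R\) via row operations. Row operations preserve the row space, hence the row rank; they also preserve the solution set of \(A\mathbf{x} = \mathbf{0}\), hence every linear dependence among columns and the column rank. After permuting columns if necessary (which changes neither rank), \(R = \begin{pmatrix} I_k & * \\ 0 & 0 \end{pmatrix}\). The \(k\) nonzero rows are independent and the \(k\) pivot columns span the column space, so \(\text{rank}_c(R) = k = \text{rank}_r(R)\), and therefore \(\text{rank}_c(A) = \text{rank}_r(A)\). \(\square\)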

Proof Strategy & Techniques: RREF structure makes rank visible; row/column rank equality follows directly.

Computational Validation: \(\begin{pmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \\ 3 & 6 & 9 \end{pmatrix}\) has rank 1 (row and column rank both 1).

ML Interpretation: Determines effective number of independent features or constraints.

Generalization & Edge Cases: Rank also equals rank from SVD singular values.

Failure Mode Analysis: Numerical rank differs from exact rank near singularities; SVD is more robust.

Historical Context: Late 1800s (Frobenius, Kronecker); pivotal unification.

Traps: Assuming rank equals the number of rows or columns; rank is at most \(\min(m, n)\) and can be strictly smaller.


B.9. Full Column Rank Implies Injective

Full Formal Proof: For \(A \in \mathbb{R}^{m \times n}\), \(\text{rank}(A) = n\) implies \(\dim(\ker A) = n - \text{rank}(A) = 0\) (by Rank-Nullity). Thus \(\ker A = \{0\}\), so \(A\mathbf{x} = A\mathbf{y}\) gives \(A(\mathbf{x} - \mathbf{y}) = 0\) and hence \(\mathbf{x} = \mathbf{y}\). \(\square\)

Proof Strategy & Techniques: Rank-Nullity connects rank to nullity; injectivity equivalent to trivial kernel.

Computational Validation: \(\begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix}\) (full column rank 2) is injective.
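A short sketch of what injectivity buys in practice, using the matrix above: the nullity is \(n - \text{rank}(A) = 0\), so any input can be recovered from its image. The test vector and the use of the pseudoinverse as a left inverse are illustrative choices.

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
m, n = A.shape

rank = int(np.linalg.matrix_rank(A))
nullity = n - rank            # Rank-Nullity on the domain R^n

# With full column rank, pinv(A) is a left inverse: pinv(A) @ A = I_n.
x = np.array([2.0, -3.0])
x_recovered = np.linalg.pinv(A) @ (A @ x)

print(rank, nullity, x_recovered)
```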

ML Interpretation: Unique optimal least-squares solution when design matrix has full column rank (no feature collinearity).

Generalization & Edge Cases: Surjectivity requires full row rank; bijectivity requires square + full rank.

Failure Mode Analysis: Collinear features destroy full rank; regularization restores invertibility.

Historical Context: Consequence of basis theory; ancient (implicit in Gaussian elimination).

Traps: Confusing full column rank with full row rank; different implications.


B.10. Low-Rank Approximation Optimality

Full Formal Proof: Use SVD \(X = U\Sigma V^\top\). For any rank-\(k\) matrix \(Y\), \(\|X - Y\|_F^2 = \|U^\top X V - U^\top Y V\|_F^2 = \|\Sigma - Z\|_F^2\) where \(Z = U^\top Y V\) has rank \(\leq k\). Minimize by setting \(Z = \Sigma_k\) (diagonal with top-\(k\) singular values, rest zero), yielding \(Y_* = U_k \Sigma_k V_k^\top = X_k\). \(\square\)

Proof Strategy & Techniques: Eckart-Young-Mirsky theorem; orthogonal invariance of Frobenius norm; greedy matching of singular values.

Computational Validation: Diagonal matrix \(\begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}\) has best rank-1 approximation \(\begin{pmatrix} 0 & 0 \\ 0 & 2 \end{pmatrix}\) (keep the largest singular value, 2), with Frobenius error 1.
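The truncation can also be checked on a generic matrix. The sketch below assumes a random \(6 \times 4\) matrix with a fixed seed and \(k = 2\); it compares the truncated-SVD error with the discarded singular values and with one arbitrary rank-\(k\) competitor (a spot check, not a proof).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
k = 2

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]      # truncated SVD

err_fro = np.linalg.norm(X - X_k, "fro")
err_spec = np.linalg.norm(X - X_k, 2)

# Predicted errors: tail of the singular values.
pred_fro = np.sqrt(np.sum(s[k:] ** 2))
pred_spec = s[k]

# An arbitrary rank-k competitor cannot beat the truncated SVD.
Y = rng.standard_normal((6, k)) @ rng.standard_normal((k, 4))
err_competitor = np.linalg.norm(X - Y, "fro")

print(err_fro, pred_fro, err_spec, pred_spec, err_competitor)
```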

ML Interpretation: PCA optimality; best low-rank reconstruction of data. Autoencoders with linear layers implicitly do this.

Generalization & Edge Cases: Optimality holds for Frobenius and spectral norms; other norms may have different optima.

Failure Mode Analysis: Slow singular value decay means rank-\(k\) is poor for any reasonable \(k\) (data is high-dimensional).

Historical Context: Eckart-Young (1936), Mirsky (1960); central to dimensionality reduction.

Traps: Assuming low-rank approximation useful for all data; only works for intrinsically low-rank data.


B.11. Dimension Bound on Solutions

Full Formal Proof: If \(A\mathbf{x} = \mathbf{b}\) has solution \(\mathbf{x}_0\), then solution set is \(\mathbf{x}_0 + \ker(A)\) (coset). Dimension of affine subspace equals dimension of direction space \(\ker(A)\). By Rank-Nullity, \(\dim(\ker A) = n - \text{rank}(A) = \text{nullity}(A)\). \(\square\)

Proof Strategy & Techniques: Coset structure; apply Rank-Nullity to characterize null space dimension.

Computational Validation: \(\begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \end{pmatrix} \mathbf{w} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}\) has solutions \((1 - s - t, s, t)\) (2-dimensional family, matching nullity 2).
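The coset structure \(\mathbf{x}_0 + \ker(A)\) can be sketched numerically for the system above. The particular solution is taken from a least-squares solve and the kernel basis from the SVD; the shift coefficients `c` are an arbitrary illustrative choice.

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0],
              [0.0, 0.0, 0.0]])
b = np.array([1.0, 0.0])

# One particular solution (lstsq returns the minimum-norm one).
x0, *_ = np.linalg.lstsq(A, b, rcond=None)

# Kernel basis: right singular vectors beyond the rank.
rank = int(np.linalg.matrix_rank(A))
_, _, Vt = np.linalg.svd(A)
N = Vt[rank:].T
nullity = N.shape[1]

# Any shift along the kernel is still a solution.
c = np.array([2.0, -1.5])
x = x0 + N @ c

print(nullity, np.allclose(A @ x0, b), np.allclose(A @ x, b))
```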

ML Interpretation: Underdetermined regression has solution manifold; regularization picks a point in this manifold.

Generalization & Edge Cases: If no solution exists, solution set is empty (dimension undefined).

Failure Mode Analysis: High nullity means many solutions; numerical methods must navigate solution space.

Historical Context: Implicit in Gaussian elimination; formalized as affine subspace geometry.

Traps: Confusing solution set (affine) with null space (vector subspace).


B.12. Intrinsic Dimension via PCA

Full Formal Proof: For centered data \(X \in \mathbb{R}^{n \times d}\) (rows are samples), the covariance \(\Sigma = X^\top X / n\) has eigenvalues \(\lambda_1 \geq \ldots \geq \lambda_d\). Total variance is \(\sum_j \lambda_j\). Retaining top-\(k\) captures \(\sum_{j=1}^k \lambda_j\). Fraction retained is \((1 - \delta)\) where \(\delta = \frac{\sum_{j=k+1}^d \lambda_j}{\sum_j \lambda_j}\). \(\square\)

Proof Strategy & Techniques: Spectral theorem; eigenvalue sum interpretation as variance.

Computational Validation: Diagonal covariance \(\text{diag}(2, 1)\) has total variance 3; retaining one component captures \(2/3 \approx 66.7\%\).
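The same calculation in code, as a sketch: eigenvalues of the covariance are sorted in decreasing order and accumulated. The 95% threshold is an illustrative assumption.

```python
import numpy as np

cov = np.diag([2.0, 1.0])

# eigvalsh returns ascending eigenvalues for symmetric matrices; reverse them.
eigvals = np.linalg.eigvalsh(cov)[::-1]
retained = np.cumsum(eigvals) / eigvals.sum()   # fraction kept by top-k components

# Smallest k whose retained fraction reaches the threshold.
threshold = 0.95
k = int(np.searchsorted(retained, threshold) + 1)

print(retained, k)
```

With these eigenvalues one component retains two thirds of the variance, so both components are needed to reach 95%.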

ML Interpretation: PCA variance threshold (e.g., 95%) operationalizes intrinsic dimension selection.

Generalization & Edge Cases: Slow eigenvalue decay indicates high intrinsic dimension.

Failure Mode Analysis: Eigenvalues perturbed by noise; robust PCA or median-based methods more stable.

Historical Context: PCA foundational (Pearson, Hotelling); variance interpretation formalizes intuition.

Traps: Confusing variance (captured by PCA) with information or discriminative power.


B.13. Autoencoder Dimension Constraint (Undercomplete)

Full Formal Proof: Data lie on \(k\)-dimensional subspace \(\mathcal{M}\). Choose orthonormal basis \(U = [\mathbf{u}_1 \ldots \mathbf{u}_k] \in \mathbb{R}^{d \times k}\). Define \(E(\mathbf{x}) = U^\top \mathbf{x}\) and \(D(\mathbf{z}) = U\mathbf{z}\). For \(\mathbf{x} \in \mathcal{M}\): \(D(E(\mathbf{x})) = UU^\top \mathbf{x} = \mathbf{x}\) (orthogonal projection is identity on \(\mathcal{M}\)). Reconstruction error = 0. \(\square\)

Proof Strategy & Techniques: Orthogonal projection; bottleneck enforces rank constraint.

Computational Validation: Line \(t(1, 1)^\top\) in \(\mathbb{R}^2\) reconstructs perfectly via \(k=1\) encoder.
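A minimal sketch of this encoder/decoder pair, assuming the unit vector \((1,1)^\top/\sqrt{2}\) as the single basis vector; the off-line test point \((1, 0)\) is included only to show that reconstruction is exact on the subspace and lossy off it.

```python
import numpy as np

# Orthonormal basis of the line t * (1, 1): a single column.
U = np.array([[1.0], [1.0]]) / np.sqrt(2.0)

def encode(x):
    return U.T @ x        # E(x) = U^T x, a 1-dimensional code

def decode(z):
    return U @ z          # D(z) = U z

x_on = 3.0 * np.array([1.0, 1.0])     # lies on the line
x_off = np.array([1.0, 0.0])          # does not lie on the line

rec_on = decode(encode(x_on))
rec_off = decode(encode(x_off))
err_on = np.linalg.norm(x_on - rec_on)
err_off = np.linalg.norm(x_off - rec_off)

print(rec_on, err_on, rec_off, err_off)
```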

ML Interpretation: Undercomplete linear autoencoder is equivalent to PCA; nonlinear layers yield manifold learning.

Generalization & Edge Cases: Overcomplete autoencoders can learn identity; need regularization.

Failure Mode Analysis: Noise in data prevents zero reconstruction error; noisy subspace recovery needed.

Historical Context: Autoencoders (Rumelhart et al., 1986); connection to PCA (Bourlard & Kamp, 1988).

Traps: Confusing undercomplete (bottleneck shrinks) with overcomplete (bottleneck expands).


B.14. Feature Redundancy and Column Dependence

Full Formal Proof: Identical columns \(\mathbf{c}_i = \mathbf{c}_j\) give \(1 \cdot \mathbf{c}_i + (-1) \cdot \mathbf{c}_j = 0\), so \(\text{rank}(X) < d\). Thus \(\text{rank}(X^\top X) = \text{rank}(X) < d\), making normal equations \(X^\top X \mathbf{w} = X^\top \mathbf{y}\) singular. Solution set is affine space (infinitely many solutions). \(\square\)

Proof Strategy & Techniques: Linear dependence propagates to Gram matrix; singular system has affine solution set.

Computational Validation: Two identical columns give singular Gram matrix; multiple least-squares solutions.
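A small sketch of this situation, assuming a \(3 \times 2\) design matrix whose two columns coincide and targets chosen to be exactly representable; two different weight vectors give identical predictions.

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])         # identical columns
y = np.array([2.0, 4.0, 6.0])

G = X.T @ X                        # Gram matrix of the normal equations
gram_rank = int(np.linalg.matrix_rank(G))

# Two distinct weight vectors; only w_1 + w_2 matters.
w_a = np.array([2.0, 0.0])
w_b = np.array([0.0, 2.0])

print(gram_rank, X @ w_a, X @ w_b)
```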

ML Interpretation: Multicollinearity (feature redundancy) breaks identifiability; regularization remedies.

Generalization & Edge Cases: Near-collinearity causes numerical instability (ill-conditioning) even if rank technically full.

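Failure Mode Analysis: The least-squares loss is flat along null-space directions, so the minimizer reached by an iterative method depends on initialization; closed-form solvers that invert \(X^\top X\) fail outright.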

Historical Context: Multicollinearity recognized early (20th century statistics); ridge regression solution (Hoerl & Kennard, 1970).

Traps: Assuming highly correlated (but distinct) features are always problematic; sometimes beneficial for regularization.


B.15. Regularization Reduces Effective Dimension

Full Formal Proof: Ridge solution \(\hat{\mathbf{w}}_\lambda = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}\) via SVD: each singular value direction is scaled by \(\frac{\sigma_i^2}{\sigma_i^2 + \lambda}\). Effective dimension \(\sum_i \frac{\sigma_i^2}{\sigma_i^2 + \lambda}\) decreases as \(\lambda\) increases. Solution is unique and lies in range of \(X^\top\) (no rank deficiency). \(\square\)

Proof Strategy & Techniques: SVD parametrizes ridge solution; shrinkage factors decrease with \(\lambda\).

Computational Validation: Rank-deficient \(X\) becomes invertible (in regularized sense) with \(\lambda > 0\).
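A sketch of both claims, assuming a rank-deficient design matrix with identical columns and a few illustrative values of \(\lambda\); the helper name `effective_dimension` is hypothetical.

```python
import numpy as np

def effective_dimension(X, lam):
    """Sum of ridge shrinkage factors sigma_i^2 / (sigma_i^2 + lam), for lam > 0."""
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(s**2 / (s**2 + lam)))

X = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])         # rank 1, so X^T X is singular
y = np.array([2.0, 4.0, 6.0])

lams = [0.1, 1.0, 10.0]
dims = [effective_dimension(X, lam) for lam in lams]

# With lam > 0 the regularized system has a unique solution.
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print(dims, w_ridge)
```

The ridge solution splits the weight equally between the two identical columns, which is the minimum-norm way to fit them.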

ML Interpretation: Ridge regression is implicit dimension reduction via continuous shrinkage (vs. hard truncation in PCA).

Generalization & Edge Cases: Optimal \(\lambda\) determined by cross-validation; no universal rule.

Failure Mode Analysis: Too much regularization (\(\lambda\) large) underfits; too little overfits.

Historical Context: Tikhonov regularization (inverse problems, 1963); ridge regression (statistics, 1970); unified in modern ML theory.

Traps: Confusing “effective dimension” (continuous) with “rank” (discrete).


B.16. Whitening Diagonalizes Covariance

Full Formal Proof: Cholesky: \(\Sigma = LL^\top\). Whiten: \(Z = XL^{-\top}\). Covariance of \(Z\): \(\text{Cov}(Z) = E[L^{-1}X^\top XL^{-\top}] = L^{-1}\Sigma L^{-\top} = L^{-1}(LL^\top)L^{-\top} = I\). \(\square\)

Proof Strategy & Techniques: Factorization; conjugation by inverse strips off correlation structure.

Computational Validation: Covariance \(\begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}\) whitens to identity via \(L^{-\top}\) multiplication.
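A sketch of the Cholesky whitening step for this covariance. The second half applies the same recipe to a synthetic sample (random data with a fixed seed, an illustrative assumption) using its empirical covariance, for which the whitened covariance is the identity up to rounding.

```python
import numpy as np

Sigma = np.array([[2.0, 1.0],
                  [1.0, 1.0]])
L = np.linalg.cholesky(Sigma)                  # lower triangular, Sigma = L L^T
L_inv = np.linalg.inv(L)
whitened_cov = L_inv @ Sigma @ L_inv.T         # should equal the identity

# Same recipe on data: rows of X are samples.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2)) @ L.T
X = X - X.mean(axis=0)                         # center first
n = X.shape[0]
S = X.T @ X / n                                # empirical covariance
L_s = np.linalg.cholesky(S)
Z = np.linalg.solve(L_s, X.T).T                # Z = X L_s^{-T} without forming the inverse

print(whitened_cov)
print(Z.T @ Z / n)
```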

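ML Interpretation: Preprocessing step; networks often converge faster on whitened data (better-conditioned loss surface). Batch normalization applies a cheaper per-feature standardization (zero mean, unit variance) and does not decorrelate features.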

Generalization & Edge Cases: Singular \(\Sigma\) requires pseudo-inverse or SVD-based whitening.

Failure Mode Analysis: Cholesky fails for singular matrices; ill-conditioning amplifies errors.

Historical Context: Classical standardization (Pearson); modern ML revival (batch norm, layer norm, 2010s).

Traps: Confusing whitening (decorrelate + standardize) with centering (subtract mean only).


B.17. Embedding Non-Identifiability

Full Formal Proof: Orthogonal \(Q\) preserves inner products: \((WQ)(WQ)^\top = WQQ^\top W^\top = WW^\top\). Any model \(f\) depending only on similarities satisfies \(f(W) = f(WQ)\). Thus \(W\) and \(WQ\) are observationally equivalent. \(\square\)

Proof Strategy & Techniques: Gram matrix invariance under orthogonal transformation; observational equivalence.

Computational Validation: Rotating embeddings preserves all dot-product similarities.
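A quick sketch of this invariance, assuming a random \(5 \times 3\) embedding matrix and an orthogonal \(Q\) obtained from the QR factorization of a random matrix (both illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))                    # 5 items, 3-dimensional embeddings

# The Q factor of a random square matrix is orthogonal.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))

W_rot = W @ Q
gram_before = W @ W.T
gram_after = W_rot @ W_rot.T

print(np.allclose(Q.T @ Q, np.eye(3)))
print(np.allclose(gram_before, gram_after))
print(np.allclose(W, W_rot))
```

The similarities agree even though the coordinates themselves differ.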

ML Interpretation: Word embeddings (Word2Vec, GloVe) identifiable only up to rotation; Procrustes aligns embeddings.

Generalization & Edge Cases: Includes reflections (determinant \(-1\)); only similarities matter, not individual coordinates.

Failure Mode Analysis: Attempting to interpret embedding dimensions individually is misleading.

Historical Context: Recognized in modern representation learning (2010s); Procrustes (Schönemann, 1966).

Traps: Over-interpreting individual embedding coordinates as having fixed semantic meanings.


B.18. Rank Constraints in Neural Networks

Full Formal Proof: Composition rank inequality: \(\text{rank}(AB) \leq \min(\text{rank}(A), \text{rank}(B))\), since \(\text{im}(AB) \subseteq \text{im}(A)\) and \(\ker(B) \subseteq \ker(AB)\). Apply inductively to \(W^{(L)} \cdots W^{(2)} W^{(1)}\): rank \(\leq \min\) of all layer ranks. Equality is not automatic, even when every layer has full rank: rank is lost whenever the image of one layer meets the kernel of the next nontrivially. \(\square\)

Proof Strategy & Techniques: Induction on layers; rank bound for matrix products.

Computational Validation: A chain of square full-rank (invertible) layers preserves rank; one low-rank layer bottlenecks all.
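A sketch of the bottleneck effect, assuming three \(4 \times 4\) linear layers with random weights (fixed seed) where the middle layer is a rank-1 outer product.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 4))
W2 = np.outer(rng.standard_normal(4), rng.standard_normal(4))   # rank-1 layer
W3 = rng.standard_normal((4, 4))

product = W3 @ W2 @ W1
layer_ranks = [int(np.linalg.matrix_rank(W)) for W in (W1, W2, W3)]
product_rank = int(np.linalg.matrix_rank(product))

print(layer_ranks, product_rank)
```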

ML Interpretation: Network effective rank limited by lowest-rank layer; explains expressivity bounds and training dynamics.

Generalization & Edge Cases: Nonlinearities can increase effective rank; skip connections bypass bottlenecks.

Failure Mode Analysis: Bad initialization collapses rank; spectral normalization helps preserve rank during training.

Historical Context: Recent formalization (2010s); critical for understanding deep learning expressivity.

Traps: Assuming many parameters guarantees high rank; weight sharing and structure reduce effective rank.


B.19. Dimension of Null Space Under Composition

Full Formal Proof: \(\ker(B) \subseteq \ker(AB)\) (if \(B\mathbf{x} = 0\) then \(AB\mathbf{x} = 0\)). Thus \(\dim(\ker AB) \geq \dim(\ker B)\). Equality holds iff \(\text{rank}(AB) = \text{rank}(B)\), i.e., iff \(\ker(A) \cap \text{im}(B) = \{0\}\); this requires \(\text{rank}(A) \geq \text{rank}(B)\), which is necessary but not sufficient. \(\square\)

Proof Strategy & Techniques: Inclusion of kernel spaces; Rank-Nullity relates rank to nullity.

Computational Validation: \(B = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}\), \(A = I\): nullity unchanged \((1 = 1)\). If \(A = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}\): nullity increases \((2 > 1)\).
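The two cases above in code, as a sketch; the helper `nullity` (number of columns minus numerical rank) is a hypothetical convenience function.

```python
import numpy as np

def nullity(M):
    """Dimension of the null space: columns minus numerical rank."""
    return M.shape[1] - int(np.linalg.matrix_rank(M))

B = np.array([[1.0, 0.0],
              [0.0, 0.0]])
A_id = np.eye(2)
A_drop = np.array([[0.0, 0.0],
                   [0.0, 1.0]])

n_B = nullity(B)
n_id = nullity(A_id @ B)       # composing with the identity changes nothing
n_drop = nullity(A_drop @ B)   # A_drop kills the image of B entirely

print(n_B, n_id, n_drop)
```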

ML Interpretation: Deep networks: information loss through composition (null space growth) if ranks drop.

Generalization & Edge Cases: Null space can only grow or stay same; cannot shrink through composition.

Failure Mode Analysis: Vanishing gradients related to null space growth in backpropagation.

Historical Context: Modern understanding (2010s–2020s); expressivity analysis of deep networks.

Traps: Assuming \(\ker(AB) = \ker(B)\); in general only \(\ker(B) \subseteq \ker(AB)\) holds, so composing with \(A\) can enlarge the null space but never shrink it.


B.20. Isomorphism and Dimension Uniqueness

Full Formal Proof: (\(\Rightarrow\)) If \(\phi: V \to W\) is linear bijection and \(\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}\) is basis for \(V\), then \(\{\phi(\mathbf{v}_1), \ldots, \phi(\mathbf{v}_n)\}\) is basis for \(W\) (injectivity preserves independence, surjectivity ensures spanning). Thus \(\dim V = n = \dim W\). (\(\Leftarrow\)) If \(\dim V = \dim W = n\), define \(\phi\) via coordinate mapping: \(\phi(\sum c_i \mathbf{v}_i) = \sum c_i \mathbf{w}_i\) (\(\mathcal{B}_V\) and \(\mathcal{B}_W\) chosen bases). Then \(\phi\) is linear, injective (coordinates unique), and surjective (spanning basis). \(\square\)

Proof Strategy & Techniques: Forward: push basis to image. Backward: explicit isomorphism via coordinates.

Computational Validation: All bases of \(\mathbb{R}^n\) furnish isomorphisms to standard \(\mathbb{R}^n\).

ML Interpretation: Networks with matching capacity (layer widths) can represent same information; dimension is intrinsic invariant.

Generalization & Edge Cases: For infinite-dimensional spaces, cardinality-based dimension needed; isomorphism classes more subtle.

Failure Mode Analysis: Dimension mismatch (e.g., 2D input, 10D output) implies non-bijective map.

Historical Context: Foundational 19th century (Grassmann, Cayley); formalized 20th century.

Traps: Confusing isomorphism (linear bijection) with equality (as sets).

Solutions to C. Python Exercises

C.1. Verifying Linear Independence via Row Reduction

Code:

import numpy as np

def check_linear_independence(A):
    """Check if columns of A are linearly independent."""
    m, n = A.shape
    rank = np.linalg.matrix_rank(A)
    return rank == n

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
is_indep, rank = check_linear_independence(A), np.linalg.matrix_rank(A)
print(f"Independent: {is_indep}, Rank: {rank}")

Expected Output: Independent: False, Rank: 2

Numerical / Shape Notes: \(A \in \mathbb{R}^{3 \times 3}\) with rank 2; third column is linear combination of first two.

Explanation: Matrix rank directly measures linear independence: columns are independent if and only if rank equals the number of columns. Rank can be computed via Gaussian elimination (row reduction) or SVD; np.linalg.matrix_rank uses the SVD. When rank < number of columns, the null space is non-trivial: each nonzero null-space vector lists the coefficients of a linear combination of the columns that equals zero (hence dependency). In our example, the third column equals twice the second minus the first, so \((1, -2, 1)^\top\) lies in the null space and the columns are dependent. Numerical independence tests come down to estimating rank, which is more robust via SVD (it handles rounding errors better than naive row reduction).

ML Interpretation: Linear dependence among features signals multicollinearity: two or more features convey the same information. In regression, this prevents unique least-squares solutions; in classification, it inflates weight uncertainties. Detecting and removing dependent features improves numerical stability and model interpretability. PCA naturally handles this by projecting onto the span of columns (which has dimension = rank) regardless of linear dependencies in the original features—PCA is robust to multicollinearity. In neural networks, a weight matrix with dependent rows means some rows are redundant and could be pruned. Modern architectures (ResNets, attention) mitigate this by encouraging rank preservation via residuals and capacity, but monitoring rank of learned matrices is still a diagnostic tool.

Failure Modes: (1) Numerical rank determination: np.linalg.matrix_rank uses singular values and applies a default threshold. Noisy or ill-conditioned matrices may return incorrect rank if the threshold is poorly chosen (e.g., nearly-dependent columns may be classified as independent). Use the tol parameter to adjust the threshold. (2) Floating-point precision: columns that are mathematically dependent may appear independent due to rounding errors. Always compare against a small tolerance rather than testing exact equality. (3) Large matrices: the SVD behind matrix_rank scales cubically with matrix size, so it becomes slow for very large matrices; for quick diagnostics, use faster rank estimates (power iteration, Nyström approximation).
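The threshold effect in point (1) can be sketched on a \(2 \times 2\) matrix whose columns differ by \(10^{-9}\) in one entry (an illustrative choice). The default cutoff reproduced below follows the formula given in the NumPy documentation for matrix_rank when tol is not supplied; the looser cutoff of 1e-6 is an assumption for the example.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-9]])    # columns are nearly, but not exactly, dependent

s = np.linalg.svd(A, compute_uv=False)

# Documented default: largest singular value * max(m, n) * machine epsilon.
default_tol = s.max() * max(A.shape) * np.finfo(A.dtype).eps

rank_default = int(np.linalg.matrix_rank(A))           # tiny singular value counted
rank_loose = int(np.linalg.matrix_rank(A, tol=1e-6))   # tiny singular value discarded

print(s, default_tol, rank_default, rank_loose)
```

Which answer is "right" depends on whether a difference of \(10^{-9}\) is signal or noise in the application.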

Common Mistakes: (1) Checking independence by manually examining columns (“Looks different, must be independent”), which fails on non-obvious dependencies. Always compute rank. (2) Computing rank by Gaussian elimination without a tolerance (treating a pivot such as \(0.0001\) that is pure rounding noise as nonzero), leading to incorrect independence claims; SVD-based rank is much more robust. (3) Treating numerical rank as absolute truth; even if vectors are mathematically independent, numerical errors in computation may suggest dependence. Validate results with multiple methods or datasets. (4) Assuming independence of all columns in randomly generated matrices; random tall matrices (\(m > n\)) have full column rank almost surely, but wide matrices (\(m < n\)) always have dependent columns.

Chapter Connections: Directly implements Definition 1.4 (Linear Independence) and Theorem 2.2 (Equivalence of Basis Properties). The rank equals \(\dim(\text{span}(\text{columns}))\) by Definition 2.1 (Dimension). Columns form a basis of their span iff rank equals the column count, per Theorem 2.3 (Basis Existence). Related to Example 4 (analyzing rank-deficiency in regression design matrix) and Example 7 (PCA identifying effective dimension via eigenvalue count, analogous to rank here).


C.2. Extracting a Basis from a Spanning Set

Code:

import numpy as np

def extract_basis(A):
    """Return an orthonormal basis for the column space of A, and its rank."""
    # Reduced QR: A = Q R with orthonormal columns in Q
    Q, R = np.linalg.qr(A, mode='reduced')
    rank = np.linalg.matrix_rank(A)
    # np.linalg.qr does not pivot, so Q[:, :rank] is guaranteed to span Col(A)
    # only when the first 'rank' columns of A are independent (true here);
    # otherwise use a column-pivoted QR or the SVD.
    basis_orthonormal = Q[:, :rank]
    return basis_orthonormal, rank

A = np.array([[1, 0, 1, 2], [0, 1, 1, 1]], dtype=float)
basis, rank = extract_basis(A)
print(f"Basis shape: {basis.shape}, Rank: {rank}")
# Projecting A onto span(basis) must reproduce A if the basis spans Col(A)
print(f"Basis spans same space: {np.allclose(basis @ basis.T @ A, A)}")

Expected Output: Basis shape: (2, 2), Rank: 2; Basis spans same space: True

Numerical / Shape Notes: Original matrix has 4 columns in \(\mathbb{R}^2\); extracted basis has 2 orthonormal vectors spanning the same subspace.

Explanation: Given a spanning set (columns may be dependent), we extract a linearly independent set that spans the same space (a basis). The QR decomposition \(A = QR\) factors \(A\) into a matrix \(Q\) with orthonormal columns and an upper-triangular \(R\). With \(r = \text{rank}(A)\), the leading block \(Q_{:,1:r}\) is an orthonormal basis for \(\text{Col}(A)\) provided the first \(r\) columns of \(A\) are independent, as in this example; np.linalg.qr does not pivot, so in general a column-pivoted QR or the SVD is needed. Alternatively, identifying pivot columns in \(A\) via row reduction finds a basis consisting of original columns (not orthonormalized). QR is numerically stable and efficient.

ML Interpretation: Feature selection: when given more features than necessary, extract a minimal basis (e.g., via QR or singular vectors) to identify the truly independent features. This is especially useful in high-dimensional data where many features are correlated or near-collinear. Models trained on a basis rather than the full redundant set are smaller, faster, and more interpretable, with no loss of expressivity. Autoencoders learn a basis automatically (encoder weights form an implicit basis for the data manifold). In recommendation systems, extracting a basis of the user-item matrix identifies the true dimensionality of latent factors, controlling model complexity.

Failure Modes: (1) Basis selection matters for conditioning: different basis choices (e.g., pivots from row-reduction vs. QR orthonormal basis) have different numerical properties; row-reduction pivots may lead to ill-conditioned systems if chosen poorly. QR bases are always well-conditioned. (2) Rank estimation errors: if rank is computed incorrectly (due to threshold issues), the extracted basis has wrong dimension. (3) Non-uniqueness: while the dimension is unique (rank), the basis itself is not; infinitely many bases span the same space. Choosing a specific basis (e.g., orthonormal) requires additional constraints.

Common Mistakes: (1) Assuming the first \(r\) columns form a basis; they may be dependent. Always use QR or identify pivots explicitly. (2) Extracting a basis and then treating it as “the” basis, ignoring that other equally-valid bases exist; this can lead to over-interpretation of which features are “important.” (3) Confusing orthonormal bases with optimal bases; orthonormal is numerically nice but doesn’t minimize any loss function. For some ML tasks, other bases (PCA, ICA) are more meaningful.

Chapter Connections: Theorem 2.3 (Basis Extension) states that every independent set extends to a basis; Algorithm: start with dependent set, identify and remove dependencies iteratively. QR implements this via column-by-column orthogonalization. Connects to Definition 2.1 (Dimension): the rank of \(A\) is the dimension of \(\text{Col}(A)\), and the extracted basis has rank elements. Uses Definition 1.4 (Span) implicitly: basis spans \(\text{Col}(A)\). Relates to Example 6 (basis choices and their implications for computation) and Example 11 (Gram-Schmidt orthogonalization, which underlies QR).


C.3. Computing a Basis for the Null Space

Code:

def null_space(A):
    """Compute orthonormal basis for null space of A."""
    m, n = A.shape
    U, s, Vt = np.linalg.svd(A, full_matrices=True)
    # Full SVD: U is m×m, s has length min(m, n), Vt is n×n
    # Null space = span of rows of Vt corresponding to zero (or small) singular values
    rank = np.sum(s > 1e-10)  # Numerical rank
    null_basis = Vt[rank:, :].T  # Last n-rank rows of Vt (transposed to column vectors)
    return null_basis, rank, n - rank

A = np.array([[1, 1, 1], [2, 2, 2]], dtype=float)
null_basis, rank, nullity = null_space(A)
print(f"Rank: {rank}, Nullity: {nullity}")
print(f"A @ null_basis = 0: {np.allclose(A @ null_basis, 0)}")

Expected Output: Rank: 1, Nullity: 2; A @ null_basis = 0: True

Numerical / Shape Notes: Full SVD decomposes \(A \in \mathbb{R}^{2 \times 3}\) as \(U \Sigma V^T\); null space has dimension \(3 - 1 = 2\), spanned by last 2 rows of \(V^T\).

Explanation: The null space \(\text{Null}(A) = \{\mathbf{x} : A\mathbf{x} = \mathbf{0}\}\) is the set of vectors annihilated by \(A\). Its dimension (nullity) is \(n - \text{rank}(A)\) by rank-nullity theorem. To find a basis, use SVD: \(A = U\Sigma V^\top\), where \(V\) is \(n \times n\) orthogonal. The null space is spanned by the right singular vectors corresponding to zero singular values. Since \(A V = U \Sigma\), we have \(A V_{\text{null}} = U \Sigma_{\text{null}} = 0\) (columns of \(V\) after rank \(r\) correspond to zero singular values). Thus, \(V_{\text{null}}\) is an orthonormal basis for \(\text{Null}(A)\).

ML Interpretation: The null space characterizes solution non-uniqueness in linear systems: if \(A\mathbf{x} = \mathbf{b}\) has one solution \(\mathbf{x}_p\), then all solutions are \(\mathbf{x}_p + \mathbf{n}\) where \(\mathbf{n} \in \text{Null}(A)\). In regression with \(d > n\) features (underdetermined), the null space of \(X\) is non-trivial, and ridge regression resolves the ambiguity by selecting a solution via regularization. In overparameterized linear models, gradient descent on the least-squares loss never moves along null-space directions (the gradient \(X^\top(X\mathbf{w} - \mathbf{y})\) lies in the row space, and null-space directions do not affect the loss), so starting from zero it converges to the minimum-norm solution; this is the simplest instance of the implicit bias of gradient methods. Understanding null space structure is crucial for identifiability and generalization in overparameterized models.
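The minimum-norm behaviour can be sketched on a tiny underdetermined system. The example assumes plain gradient descent on \(\tfrac{1}{2}\|A\mathbf{x} - \mathbf{b}\|^2\) with zero initialization, a fixed step size of 0.1, and a one-equation system in three unknowns (all illustrative choices).

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0]])    # one equation, three unknowns
b = np.array([3.0])

x = np.zeros(3)                     # zero initialization
lr = 0.1
for _ in range(200):
    x -= lr * A.T @ (A @ x - b)     # gradient of 0.5 * ||A x - b||^2

x_min_norm = np.linalg.pinv(A) @ b  # minimum-norm solution

# Two null-space directions of A (each row sums to zero).
null_dirs = np.array([[1.0, -1.0, 0.0],
                      [0.0, 1.0, -1.0]])

print(x, x_min_norm, null_dirs @ x)
```

The iterate ends with no component along either null-space direction, which is what distinguishes the minimum-norm solution from the other solutions of the system.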

Failure Modes: (1) Singular value threshold: determining which singular values are “zero” (should be in null space) vs. “small but nonzero” (numerical noise) requires a threshold; different thresholds give different null bases. The code above uses a fixed absolute cutoff of 1e-10; a relative cutoff (machine epsilon scaled by the largest singular value and the larger matrix dimension, as in np.linalg.matrix_rank) is usually reliable, but for ill-conditioned \(A\), adjust carefully. (2) Full SVD memory: computing the full \(V\) matrix (needed to extract the null basis) requires \(O(n^2)\) memory; for very large \(n\), this is expensive. Sparse SVD libraries may help, but finding a null basis of a matrix with many columns is inherently costly. (3) Non-robust algorithms: older row-reduction methods for the null space are numerically less stable than SVD; prefer SVD.

Common Mistakes: (1) Believing the null space is trivial (mistake: confusing “small kernel” with “no kernel”); even for “generic” matrices, if \(m < n\), then nullity \(\geq n - m > 0\) is certain. (The null space is never empty: it always contains \(\mathbf{0}\).) (2) Computing a null basis via RREF (row-reduced echelon form) without addressing numerical stability; in floating point, RREF is sensitive to pivoting and rounding errors. SVD is more reliable. (3) Ignoring the null space when solving \(A\mathbf{x} = \mathbf{b}\): all solutions fit the observed data equally well, but different null space components correspond to different parameter vectors, which affects interpretability, regularization, and predictions on new inputs. (4) Assuming the orthonormal basis from SVD is “the” null basis; it’s one choice, and for some applications (sparsity, interpretability) other bases are preferred.

Chapter Connections: Fundamental to Theorem 2.4 (Rank-Nullity): \(\dim(\text{Null}(A)) = \text{nullity} = n - \text{rank}(A)\). Directly computes a basis for the null space, verifying this theorem. Connects to Definition 1.6 (Kernel of Linear Map): \(\text{Null}(A) = \ker(A)\). Example-wise, relates to Example 9 (using null space to characterize solution sets in regression) and Example 10 (ridge regularization implicitly selecting along null space). The SVD approach ties to Theorem 3.2 (Singular Value Decomposition Structure) and Example 12 (SVD basis interpretation).


C.4. Computing a Basis for the Column Space

Code:

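from scipy.linalg import qr  # column-pivoted QR (np.linalg.qr does not pivot)

def column_space_basis(A):
    """Return orthonormal basis for column space of A (via column-pivoted QR)."""
    # Pivoting moves linearly independent columns to the front, so the
    # leading columns of Q span Col(A) even when A is rank-deficient.
    Q, R, perm = qr(A, mode='economic', pivoting=True)
    rank = np.linalg.matrix_rank(A)
    col_basis = Q[:, :rank]  # Orthonormal basis for Col(A)
    return col_basis, rank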

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
col_basis, rank = column_space_basis(A)
print(f"Column space basis shape: {col_basis.shape}")
print(f"Orthonormal: {np.allclose(col_basis.T @ col_basis, np.eye(rank))}")
print(f"Spans Col(A): {np.allclose(col_basis @ col_basis.T @ A, A)}")

Expected Output: Column space basis shape: (3, 2); Orthonormal: True; Spans Col(A): True

Numerical / Shape Notes: \(A \in \mathbb{R}^{3 \times 3}\) has rank 2; column space is a 2-dimensional subspace of \(\mathbb{R}^3\).

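Explanation: The column space \(\text{Col}(A) = \text{span}(\text{columns of } A) = \{\, A\mathbf{x} : \mathbf{x} \in \mathbb{R}^n \}\) is a subspace of \(\mathbb{R}^m\) with dimension \(\text{rank}(A)\). To find a basis, use QR decomposition: \(A = QR\) with \(Q\) having orthonormal columns and \(R\) upper-triangular. Each column of \(A\) is a combination of columns of \(Q\), so \(\text{Col}(A) \subseteq \text{Col}(Q)\), with equality when \(A\) has full column rank. For rank-deficient \(A\), the first \(r = \text{rank}(A)\) columns of \(Q\) form an orthonormal basis for \(\text{Col}(A)\) provided the first \(r\) columns of \(A\) are linearly independent (true in the example above); column-pivoted QR, \(A\Pi = QR\), guarantees this by moving independent columns to the front. With pivoting, QR is also rank-revealing: the number of non-negligible diagonal entries of \(R\) estimates the rank, so one factorization yields both the rank and an orthonormal basis.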

ML Interpretation: The column space is the “output space” of \(A\): applying \(A\) to any vector produces a result in \(\text{Col}(A)\). In regression, \(\text{Col}(X)\) is the space of all possible fitted values (predictions) \(\hat{\mathbf{y}} = X\mathbf{w}\); the least-squares solution projects the response \(\mathbf{y}\) onto \(\text{Col}(X)\) (best fit within reachable space). In neural networks, the output of a layer \(\mathbf{h} = W\mathbf{a}\) lies in \(\text{Col}(W)\), so the layer can only produce outputs in a \(\text{rank}(W)\)-dimensional subspace. Bottleneck layers (low-rank \(W\)) intentionally reduce the column space dimension, forcing information compression. Understanding column space dimension is essential for diagnosing layer redundancy or capacity bottlenecks.
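
The statement that least squares projects \(\mathbf{y}\) onto \(\text{Col}(X)\) can be verified numerically. The sketch below assumes a random full-column-rank design matrix (the sizes and seed are arbitrary choices, not from the chapter) and compares the QR projection \(QQ^\top \mathbf{y}\) with the fitted values from np.linalg.lstsq.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))     # assumed example: full column rank with probability 1
y = rng.standard_normal(50)

# Orthonormal basis for Col(X) from reduced QR (valid here because X has full column rank).
Q, R = np.linalg.qr(X, mode='reduced')   # Q is 50 x 4
y_hat_qr = Q @ (Q.T @ y)                 # orthogonal projection of y onto Col(X)

# Fitted values from an ordinary least-squares solve.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat_ls = X @ w_ls

# The residual is orthogonal to every column of X (normal equations).
residual = y - y_hat_qr
print(f"Projection equals LS fit: {np.allclose(y_hat_qr, y_hat_ls)}")
print(f"Residual orthogonal to Col(X): {np.allclose(X.T @ residual, 0, atol=1e-8)}")
```

No matrix inverse is formed at any point: the projection uses only products with \(Q\).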

Failure Modes: (1) Confusing column rank with row rank: column rank (dimension of \(\text{Col}(A)\)) equals row rank (dimension of \(\text{Row}(A)\)), both equal to rank(\(A\)). However, the actual vectors in \(\text{Col}(A)\) (combinations of columns of \(A\), in \(\mathbb{R}^m\)) are different from those in \(\text{Row}(A)\) (combinations of rows of \(A\), in \(\mathbb{R}^n\)). (2) Misusing reduced QR: reduced QR computes \(A = Q R\) with \(Q\) being \(m \times \min(m, n)\) (not square when \(m > n\)), which is efficient but drops the columns needed for the left null space. Use mode='complete' if you need the full \(Q\). (3) Assuming all columns of \(Q\) form a basis: only the first rank(\(A\)) columns do; if \(m > n\), there are \(m - \text{rank}(A)\) additional columns in full \(Q\) orthogonal to the column space (these span the left null space, not useful here).

Common Mistakes: (1) Not computing rank first, then assuming \(A\) has full column rank. Always verify. (2) Using the original matrix columns as a basis instead of orthonormalizing via QR; non-orthonormal bases are valid but numerically less stable. (3) Conflating “column space” with “feature space”; column space is where outputs live, feature space is where inputs live. For a regression model \(\mathbf{y} = X\mathbf{w}\), the feature space is \(\mathbb{R}^d\) (domain of \(X\)), while the column space is a subspace of \(\mathbb{R}^n\) (the codomain, where \(\mathbf{y}\) lives) of dimension \(\text{rank}(X)\). (4) Assuming projections onto column space require knowing a matrix inverse; QR allows stable projection without computing an inverse: \(\hat{\mathbf{y}} = QQ^\top \mathbf{y}\), where the columns of \(Q\) are the orthonormal column-space basis.

Chapter Connections: Computes a basis for Definition 2.5 (Column Space): the column space of \(A\) has dimension equal to \(\text{rank}(A)\) (Definition 2.8). This is central to Theorem 2.4 (Rank-Nullity), which partitions domain dimension into image (column space) and kernel (null space). Related to Example 8, which discusses rank-deficiency and its implications for solvability of \(A\mathbf{x} = \mathbf{b}\). Connects to regression theory (Example 9): the least-squares solution projects \(\mathbf{y}\) onto \(\text{Col}(X)\), using the orthogonal projection matrix \(P = Q_k Q_k^\top\) where \(Q_k\) is this column basis.


C.5. Computing a Change-of-Basis Matrix

Code:

def change_of_basis_matrix(B_old, B_new):
    """Compute P such that [v]_new = P @ [v]_old."""
    # Columns of B_old_mat and B_new_mat are the basis vectors
    B_old_mat = np.column_stack(B_old)  # m × n
    B_new_mat = np.column_stack(B_new)  # m × n
    # v = B_old_mat @ c_old = B_new_mat @ c_new, so B_new_mat @ P = B_old_mat:
    # column i of P holds the coordinates of the i-th old basis vector in the new basis.
    # Solve for P directly (no transpose). lstsq avoids forming B_new^T B_new and
    # also covers m > n, where both bases span the same n-dimensional subspace.
    P, *_ = np.linalg.lstsq(B_new_mat, B_old_mat, rcond=None)
    return P

# Example: rotate from standard basis to 45-degree basis
B_std = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
B_rot = [np.array([1.0, 1.0]) / np.sqrt(2), np.array([1.0, -1.0]) / np.sqrt(2)]
P = change_of_basis_matrix(B_std, B_rot)
print(f"Change-of-basis matrix P:\n{np.round(P, 3)}")

# Verify: coordinates in original basis vs. rotated basis
v = np.array([3.0, 4.0])
v_std = v  # Coordinates in standard basis
v_rot = P @ v_std  # Coordinates in rotated basis
print(f"v in standard: {v_std}, in rotated: {np.round(v_rot, 3)}")

Expected Output: P converts from standard to rotated basis; for \(v = [3, 4]\), rotated coordinates \(\approx [4.95, -0.71]\).

Numerical / Shape Notes: \(P \in \mathbb{R}^{2 \times 2}\) relates coordinates. If bases are orthonormal, \(P\) is orthogonal: \(P^T P = I\).

Explanation: Given two bases \(\mathcal{B} = \{\mathbf{b}_1, \dots, \mathbf{b}_n\}\) and \(\mathcal{B}' = \{\mathbf{b}'_1, \dots, \mathbf{b}'_n\}\) of the same space, the change-of-basis matrix \(P\) satisfies \([\mathbf{v}]_{\mathcal{B}'} = P [\mathbf{v}]_\mathcal{B}\). To compute \(P\): express each old basis vector in the new basis, \([\mathbf{b}_i]_{\mathcal{B}'}\), and stack these as columns of \(P\). If \(B = [\mathbf{b}_1 | \cdots | \mathbf{b}_n]\) and \(B' = [\mathbf{b}'_1 | \cdots | \mathbf{b}'_n]\), then \(\mathbf{v} = B[\mathbf{v}]_\mathcal{B} = B'[\mathbf{v}]_{\mathcal{B}'}\), so \(B = B' P\) (equivalently \(B' = B P^{-1}\)), yielding \(P = B'^{-1} B\). For numerical stability, especially when bases are not orthonormal, solve \(B' P = B\) with a linear solver (np.linalg.solve, or np.linalg.lstsq in the least-squares sense) rather than inverting \(B'\) directly.
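
A small worked case with two non-orthonormal bases of \(\mathbb{R}^2\) makes the relation \(B' P = B\) concrete. The bases and the coordinate vector below are hypothetical values chosen so the arithmetic can be followed by hand; this is a sketch, not the chapter's listing.

```python
import numpy as np

B_old = np.array([[1.0, 1.0],
                  [0.0, 1.0]])   # columns: b_1 = (1, 0), b_2 = (1, 1)
B_new = np.array([[2.0, 0.0],
                  [1.0, 1.0]])   # columns: b'_1 = (2, 1), b'_2 = (0, 1)

# Solve B_new @ P = B_old; column i of P is [b_i] in the new basis.
P = np.linalg.solve(B_new, B_old)

c_old = np.array([2.0, 3.0])     # coordinates of v in the old basis
v = B_old @ c_old                # the vector itself: (5, 3)
c_new = P @ c_old                # coordinates of the same v in the new basis

print(P)        # [[ 0.5  0.5]
                #  [-0.5  0.5]]
print(c_new)    # [2.5 0.5]
print(np.allclose(B_new @ c_new, v))   # True: both coordinate vectors describe v
```

Because neither basis is orthonormal, `P.T @ c_old` gives a different (incorrect) coordinate vector here, which is a quick way to catch a transpose slip.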

ML Interpretation: Basis changes represent feature transformations: changing from pixel basis (raw images) to eigenface basis (PCA) is a change of basis. The change-of-basis matrix \(P\) (the PCA projection matrix) transforms pixel coordinates into principal component coordinates, revealing latent structure. In transfer learning, representations learned on one dataset (one implicit basis) are transformed to match a different target dataset (different basis), using fine-tuning (which updates the change-of-basis matrix implicitly). In natural gradient descent, the optimization geometry is transformed from the parameter basis to a basis aligned with the Fisher information metric, improving convergence. Understanding basis changes is key to designing effective feature transformations and understanding why certain bases are better for specific tasks.

Failure Modes: (1) Ill-conditioned basis matrices: if the basis vectors are nearly linearly dependent (near-singular \(B\) or \(B'\)), computing \(P\) numerically becomes unstable. Use an SVD-based pseudoinverse for ill-conditioned cases. (2) Wrong order of basis vectors: the basis ordering matters; reordering the old basis vectors permutes the columns of \(P\), and reordering the new basis vectors permutes its rows. Ensure consistent ordering. (3) Non-orthogonal bases: if bases are not orthonormal, the change-of-basis matrix is not orthogonal, and mistakes in applying \(P\) or \(P^{-1}\) can cause significant errors. (4) Confusing \(P\) and \(P^{-1}\): is \(P\) the forward map (old \(\to\) new) or inverse? Convention varies; always clarify in code comments.

Common Mistakes: (1) Assuming \(P^{-1} = P^\top\) when bases are not orthonormal (only true for orthonormal bases). (2) Applying \(P\) in the wrong direction (new \(\to\) old instead of old \(\to\) new), getting nonsense results. (3) Confusing the coordinate transformation with a transformation of the vector itself; \(P\) acts on the coordinate vector \([\mathbf{v}]_\mathcal{B}\) and leaves \(\mathbf{v}\) unchanged, while the basis vectors transform with the inverse, \(B' = B P^{-1}\). (4) Not accounting for different storage conventions (basis vectors stored as rows vs. as columns), leading to transpose errors.

Chapter Connections: Implements Definition 2.9 (Change-of-Basis Matrix) and Theorem 2.7 (Change-of-Basis Formula): if \([\mathbf{v}]_\mathcal{B}\) and \([\mathbf{v}]_{\mathcal{B}'}\) are coordinate vectors, then \([\mathbf{v}]_{\mathcal{B}'} = P [\mathbf{v}]_\mathcal{B}\) where the columns of \(P\) are \([\mathbf{b}_i]_{\mathcal{B}'}\). Uses the coordinate map (Theorem 2.6: coordinate map is a linear isomorphism), ensuring uniqueness of coordinates. Relates to Example 6, which explores basis choices and their computational implications. Connected to Chapter 3 (next): matrices representing linear transformations change via similarity transformations when bases change. With \(P\) mapping old coordinates to new as above, \(A' = P A P^{-1}\); texts that define the change-of-basis matrix in the opposite direction (new \(\to\) old) write the same relation as \(A' = P^{-1} A P\). Either form is a consequence of this change-of-basis formula.
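
The similarity relation mentioned above can be illustrated with a 2 × 2 case. The sketch assumes the convention of this section (\(P\) maps old coordinates to new), under which the matrix of a linear map becomes \(P A P^{-1}\); the matrices `P` and `A_old` are hypothetical values picked for easy hand-checking.

```python
import numpy as np

P = np.array([[0.5, 0.5],
              [-0.5, 0.5]])          # old -> new coordinates (illustrative)
A_old = np.array([[2.0, 1.0],
                  [0.0, 3.0]])       # matrix of a linear map in the old basis

# Matrix of the same map in the new basis, with P mapping old -> new.
A_new = P @ A_old @ np.linalg.inv(P)

# Consistency: mapping then converting equals converting then mapping.
c_old = np.array([1.0, -2.0])
lhs = A_new @ (P @ c_old)
rhs = P @ (A_old @ c_old)

eigs_old = np.sort(np.linalg.eigvals(A_old).real)
eigs_new = np.sort(np.linalg.eigvals(A_new).real)
print(A_new)                         # [[3. 1.]
                                     #  [0. 2.]]
print(np.allclose(lhs, rhs))         # True
print(eigs_old, eigs_new)            # both [2. 3.]
```

The entries of the matrix change with the basis, but the eigenvalues (and hence trace and determinant) do not.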


C.6. Verifying Basis Change (Round-Trip Transformation)

Code:

def verify_basis_change(v, P, P_inv):
    """Verify round-trip: v_old -> P -> v_new -> P_inv -> v_old."""
    v_new = P @ v
    v_recovered = P_inv @ v_new
    roundtrip_error = np.linalg.norm(v_recovered - v)
    return v_new, v_recovered, roundtrip_error

# Standard basis to rotated 45-degree basis
P = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)  # Orthogonal matrix (P.T @ P = I)
P_inv = np.linalg.inv(P)  # For orthogonal P, P_inv ≈ P.T

v = np.array([3.0, 4.0])
v_new, v_recovered, error = verify_basis_change(v, P, P_inv)
print(f"Original: {v}")
print(f"Transformed: {np.round(v_new, 4)}")
print(f"Recovered: {np.round(v_recovered, 4)}")
print(f"Round-trip error: {error:.2e}")
print(f"Composition check (P @ P_inv ≈ I): {np.allclose(P @ P_inv, np.eye(2))}")

Expected Output: Round-trip error: ~0.00e+00 (within machine precision); Composition check: True

Numerical / Shape Notes: Error is on the order of machine epsilon (\(\sim 10^{-16}\)) times \(\|\mathbf{v}\|\), due to float64 precision limits. For orthogonal \(P\), recovery is nearly perfect; for ill-conditioned \(P\), error can be much larger.

Explanation: A change of basis is invertible (bijective): if \(P\) transforms old coordinates to new, then \(P^{-1}\) reverses the transformation. The round-trip error \(\| P^{-1} P [\mathbf{v}]_\mathcal{B} - [\mathbf{v}]_\mathcal{B} \|\) equals \(\| (P^{-1} P - I) [\mathbf{v}]_\mathcal{B} \| \leq \| P^{-1} P - I \| \| [\mathbf{v}]_\mathcal{B} \|\), where the error term \(\| P^{-1} P - I \|\) is bounded by numerical precision times the condition number of \(P\). For orthogonal \(P\) (e.g., from orthonormal bases), condition number is 1, so error is machine epsilon. For ill-conditioned \(P\), error amplifies. This test verifies computational correctness of the basis change.
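
To see the dependence on conditioning described above, one can compare the round-trip error for an orthogonal matrix with that for a badly conditioned one. The sketch below uses a 10 × 10 Hilbert matrix as the ill-conditioned case; that choice, the helper name `roundtrip_error`, and the test vector are assumptions made for illustration.

```python
import numpy as np

def roundtrip_error(P, v):
    """Relative error of the round trip v -> P v -> inv(P) (P v)."""
    v_back = np.linalg.inv(P) @ (P @ v)
    return np.linalg.norm(v_back - v) / np.linalg.norm(v)

n = 10
rng = np.random.default_rng(0)

# Well-conditioned case: an orthogonal matrix from the QR of a random matrix.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))

# Ill-conditioned case: the Hilbert matrix H[i, j] = 1 / (i + j + 1).
i, j = np.indices((n, n))
H = 1.0 / (i + j + 1)

v = np.ones(n)
err_orth = roundtrip_error(Q, v)
err_ill = roundtrip_error(H, v)

print(f"cond(Q) = {np.linalg.cond(Q):.2e}, round-trip error = {err_orth:.2e}")
print(f"cond(H) = {np.linalg.cond(H):.2e}, round-trip error = {err_ill:.2e}")
```

The orthogonal case should sit near machine precision, while the Hilbert case loses many digits, in line with the condition-number bound.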

ML Interpretation: Round-trip verification is a diagnostic tool for learned representations: after encoding data via a basis change (e.g., PCA encoder), decoding should recover the original (if basis is complete). Reconstruction error in autoencoders follows this pattern. Numerical errors in this round-trip reveal whether the basis change is stable and well-conditioned. In transfer learning, encoding data into a target space (basis change) and then decoding back to the source space measures how much information is preserved—high reconstruction error suggests the target space is inadequate (low dimension) or the transformation is ill-conditioned (numerical instability). This diagnostic is central to assessing quality of learned bases.

Failure Modes: (1) Ill-conditioned \(P\): matrices with large condition numbers amplify errors. A \(P\) with small singular values (near-singular) produces large \(P^{-1}\), and round-trip error explodes. Always check \(\text{cond}(P)\). (2) Accumulation of rounding errors: applying \(P\) and then \(P^{-1}\) sequentially accumulates rounding errors twice. For critical applications, store \(P\) and \(P^{-1}\) separately and recompute as needed. (3) Orthogonal \(P\) only in theory: if \(P\) is computed from non-orthonormal bases and you assume orthogonality (using \(P^\top\) instead of \(P^{-1}\)), round-trip error will reveal the mistake.

Common Mistakes: (1) Not testing round-trip error; assuming \(P\) is invertible without verification. (2) Using \(P^\top\) as the inverse when \(P\) is not orthogonal; orthogonality is a special property, not the default. (3) Ignoring the error threshold; comparing round-trip error to zero exactly (with ==) fails due to floating-point. Use np.allclose with appropriate tolerance. (4) Assuming small number of basis vectors means small round-trip error; ill-conditioning is independent of dimension.

Chapter Connections: Validates Theorem 2.6 (inverse of coordinate map): since the coordinate map is a linear isomorphism, \(P\) is invertible, and \(P^{-1} P = I\) mathematically. This test empirically verifies correctness of \(P\) computation. Relates to Definition 2.6 (Linear Isomorphism): the coordinate map and its inverse are both isomorphisms (preserve structure). Connects to Chapter 3 and Chapter 4: similarity transformations \(A' = P^{-1} A P\) rely on invertibility of \(P\), and the structure of \(A\) is preserved (same eigenvalues, determinant), illustrating why bases are a powerful abstraction.


C.7. Computing PCA and Extracting Principal Components

Code:

def pca_full(X, n_components=None):
    """
    PCA from scratch: center, compute covariance, eigendecompose.
    X: n_samples × n_features matrix
    Returns: principal components, explained variance, explained variance ratio, feature means
    """
    n_samples, n_features = X.shape
    # Center the data
    X_mean = X.mean(axis=0, keepdims=True)
    X_centered = X - X_mean
    # Sample covariance: Sigma = X_centered.T @ X_centered / (n - 1)
    # (dividing by n - 1 rather than n gives the unbiased estimate)
    covariance = X_centered.T @ X_centered / (n_samples - 1)
    # Eigendecomposition
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    # Sort in decreasing order
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]
    # Select components
    if n_components is None:
        n_components = n_features
    principal_components = eigenvectors[:, :n_components]
    explained_variance = eigenvalues[:n_components]
    explained_variance_ratio = explained_variance / eigenvalues.sum()
    return principal_components, explained_variance, explained_variance_ratio, X_mean

# Example
np.random.seed(42)
X = np.random.randn(100, 5)
X = X @ np.array([[2, 1, 0, 0, 0], [0, 1.5, 0, 0, 0], [0, 0, 1, 0, 0], 
                   [0, 0, 0, 0.5, 0], [0, 0, 0, 0, 0.1]]).T  # Inject variance
comps, evals, var_ratio, mean = pca_full(X, n_components=3)
print(f"Explained variance: {np.round(evals, 3)}")
print(f"Explained variance ratio: {np.round(var_ratio, 3)}")
print(f"Cumulative variance: {np.round(np.cumsum(var_ratio), 3)}")

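Expected Output: The first 3 eigenvalues capture roughly 97% of the variance (the population value for this mixing matrix; the sample figure varies slightly); components align with the injected variance structure.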

Numerical / Shape Notes: Covariance matrix \(\Sigma \in \mathbb{R}^{5 \times 5}\) is symmetric positive semidefinite; eigenvalues are non-negative, ordered decreasing; eigenvectors are orthonormal.

Explanation: PCA is a basis change that maximizes variance: it finds an orthonormal basis (the principal components, which are the eigenvectors of the covariance matrix) such that the first component aligns with the direction of maximum data variance, the second with the largest remaining variance orthogonal to the first, etc. The eigenvalues are the variances along each principal component. Explained variance ratio quantifies the fraction of total variance captured by each component. PCA is optimal in a least-squares sense: projecting onto the first \(k\) components minimizes reconstruction error among all \(k\)-dimensional linear projections (spectral theorem, Eckart-Young). This makes PCA the canonical linear dimensionality reduction, underlying many ML pipelines.
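
Two of the facts above lend themselves to a direct numerical check: the eigenvalues of the covariance equal the squared singular values of the centered data divided by \(n - 1\), and the rank-\(k\) reconstruction error equals the sum of the discarded eigenvalues. The sketch below uses synthetic data with assumed per-feature scales; it is an illustration, not the chapter's pca_full listing.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data with assumed scales 3, 2, 1, 0.5, 0.1 along the coordinate axes.
X = rng.standard_normal((200, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Route 1: eigendecomposition of the sample covariance, sorted decreasing.
evals, evecs = np.linalg.eigh(Xc.T @ Xc / (n - 1))
evals, evecs = evals[::-1], evecs[:, ::-1]

# Route 2: SVD of the centered data matrix.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
same_spectrum = np.allclose(s**2 / (n - 1), evals)

# Rank-k reconstruction: project onto the top-k components and map back.
k = 2
W = evecs[:, :k]
X_rec = Xc @ W @ W.T
mse = np.sum((Xc - X_rec) ** 2) / (n - 1)
matches = np.isclose(mse, evals[k:].sum())

# Top-k directions agree between the two routes up to sign.
same_directions = np.allclose(np.abs(Vt[:k] @ W), np.eye(k), atol=1e-8)
print(same_spectrum, matches, same_directions)
```

The SVD route avoids forming the covariance matrix explicitly, which tends to be the more accurate option when the data are poorly scaled.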

ML Interpretation: PCA reveals structure in high-dimensional data by identifying “directions of importance” (highest variance). In image analysis, principal components are linear combinations of pixel intensities (e.g., one component might capture facial features, another lighting variations). Projecting images onto the top \(k\) principal components yields a \(k\)-dimensional representation (lossy compression) that preserves salient structure. In feature engineering, PCA decorrelates features (eigenvectors are orthogonal), which improves numerical stability and interpretability of models trained on PC scores. PCA is also a visualization tool: projecting high-dimensional data onto the top 2 or 3 PCs enables 2D/3D scatter plots revealing clusters, outliers, or nonlinear structure. In recommender systems, PCA on the user-item matrix uncovers latent factors (user preferences, item genres) as low-rank approximations. Warning: PCA assumes variance = importance; if signal is low-variance but predictive (common in imbalanced classification), PCA may discard it. Supervised alternatives address this: LDA seeks directions that maximize between-class separation relative to within-class scatter, and partial least squares seeks directions with maximal covariance with the response.

Failure Modes: (1) Centering is essential: PCA assumes centered data (\(\text{mean} = \mathbf{0}\)). Forgetting to center yields components aligned with the data center, not with variance structure. (2) Scaling matters: if features have vastly different ranges (e.g., age in years vs. income in dollars), large-scale features dominate PCA, despite being less informative. Normalize/standardize features before PCA (divide by standard deviation). (3) Rank deficiency: if \(n_{\text{samples}} < n_{\text{features}}\), the covariance matrix is rank-deficient, and fewer than \(n_{\text{features}}\) eigenvalues are nonzero. PCA handles this correctly (zero eigenvalues \(\Rightarrow\) zero-variance directions, uninformative). (4) Interpretability trap: eigenvectors are linear combinations of original features, which are hard to interpret in high dimensions.

Common Mistakes: (1) Fitting PCA before the train/test split; fit PCA (the mean and the components) on training data only, then transform test data using the same mean and PCs. Fitting on the entire dataset causes data leakage (test information leaks into the PCA). (2) Not centering or scaling. (3) Using PCA on categorical or mixed-type data without appropriate preprocessing (one-hot encoding, normalization). (4) Over-interpreting the number of components as “true dimension”; explained variance ratio is heuristic, not definitive. Cross-validation on downstream tasks is more reliable.

Chapter Connections: Directly applies Theorem 3.5 (Spectral Theorem for Symmetric Matrices): the covariance matrix (symmetric) is diagonalized by orthonormal eigenvectors. Uses Definition 2.8 (Rank): the number of nonzero eigenvalues equals the rank of the covariance matrix, which equals the dimension of the span of the centered data. Connects to Example 7 (analyzing dimension via eigenvalue spectrum) and Example 12 (SVD as the generalization of eigendecomposition to non-square matrices; for centered \(X\), the top PC is the top right singular vector of \(X\)). Key example of basis choice: PCA is a learned basis optimizing variance.


C.8. Whitening Data via Covariance Diagonalization

Code:

def whiten_cholesky(X):
    """Whiten data via Cholesky decomposition."""
    X_centered = X - X.mean(axis=0, keepdims=True)
    cov = X_centered.T @ X_centered / (X.shape[0] - 1)
    # Cholesky: Sigma = L L^T
    L = np.linalg.cholesky(cov)
    # Whitened: Z = X_centered @ inv(L^T)
    Z = X_centered @ np.linalg.inv(L).T
    # Verify: cov(Z) ≈ I
    cov_Z = Z.T @ Z / (Z.shape[0] - 1)
    return Z, cov_Z, L

# Example: synthetic correlated data
np.random.seed(42)
X = np.random.randn(200, 3)
X = X @ np.array([[2, 0.8, 0.3], [0.8, 1.5, 0.5], [0.3, 0.5, 1.0]])  # Symmetric, positive definite
Z, cov_Z, L = whiten_cholesky(X)
print(f"Original covariance:\n{np.round(np.cov(X, rowvar=False), 3)}")
print(f"Whitened covariance (should be ≈ I):\n{np.round(cov_Z, 3)}")
print(f"Whitening preserves center: {np.linalg.norm(Z.mean(axis=0)):.2e}")

Expected Output: Whitened covariance ≈ identity matrix; data centered at zero; transformed variances = 1.

Numerical / Shape Notes: Cholesky \(L\) is lower-triangular; \(\Sigma = LL^T\) factors the covariance. Whitening \(Z = X_c L^{-T}\) (with \(X_c\) the centered data) decorrelates and normalizes.

Explanation: Whitening (also called sphering) transforms data to have covariance equal to the identity matrix: \(\text{Cov}(Z) = I\). This is achieved by applying the inverse of the Cholesky factor: if \(\Sigma = LL^\top\) (Cholesky decomposition, \(L\) lower-triangular), then \(Z = X_c L^{-\top}\) (where \(X_c\) is centered data) satisfies \(\text{Cov}(Z) = L^{-1} \Sigma L^{-\top} = I\). The transformation decorrelates variables (covariance matrix off-diagonals become zero) and normalizes variances (diagonal becomes all ones); per-feature standardization, by contrast, only rescales the diagonal and leaves correlations in place. Whitening is useful preprocessing for many algorithms sensitive to scale (gradient descent, distance-based clustering) and for numerical stability (some matrix algorithms work better on unit-scale data).

ML Interpretation: Whitening ensures all features contribute equally to distance metrics and model training. Before whitening, features with large variances dominate K-means clustering or gradient descent; whitening levels this. In neural networks, batch normalization (applied at each layer) standardizes each feature separately, a diagonal approximation to whitening that does not remove correlations, and it often improves training speed and generalization. Natural gradient descent (often used in probabilistic models) implicitly whitens the parameter space using the Fisher information matrix (analogous to covariance). In kernel methods, whitening in the feature space can improve SVM training. Warning: whitening is unsupervised (ignores class labels); supervised alternatives (like standardizing by class) may be better if class-specific scaling is important.

Failure Modes: (1) Non-positive definite covariance: Cholesky requires a positive definite covariance and fails if the covariance matrix is only positive semidefinite (rank-deficient or singular). Solution: use eigendecomposition- or SVD-based whitening instead, \(Z = X_c U_r \Lambda_r^{-1/2}\), where \(U_r\) and \(\Lambda_r\) hold the top \(r\) eigenvectors and eigenvalues of the covariance (equivalently, the top right singular vectors of \(X_c\) and their squared singular values divided by \(n - 1\)). (2) Numerical instability: if the covariance is ill-conditioned (some eigenvalues much smaller than others), inverting \(L\) amplifies small errors. Use a pseudoinverse (e.g., np.linalg.pinv) or a truncated SVD for robustness. (3) Whitening the test set correctly: fit the whitening transformation (the mean and \(L\)) on training data and apply the same mean and \(L\) to test data; fitting separately on test data uses test-set statistics and makes train and test features inconsistent.
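
The rank-deficient case in (1) can be sketched as follows. The data are synthetic (4 features generated from 2 latent variables), and the relative eigenvalue cutoff of 1e-10 is an assumed threshold, not a value prescribed by the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
latent = rng.standard_normal((300, 2))
mix = np.array([[1.0, 0.5, -1.0, 2.0],
                [0.0, 1.0,  1.0, 1.0]])
X = latent @ mix                      # 4 features, only 2 degrees of freedom
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (Xc.shape[0] - 1)   # positive semidefinite, rank 2

# Keep only eigen-directions with non-negligible variance.
evals, evecs = np.linalg.eigh(cov)
keep = evals > 1e-10 * evals.max()    # assumed relative cutoff
U_r, lam_r = evecs[:, keep], evals[keep]
r = int(keep.sum())

# Z = Xc U_r Lambda_r^{-1/2}: project, then rescale each direction to unit variance.
Z = Xc @ U_r / np.sqrt(lam_r)
cov_Z = Z.T @ Z / (Z.shape[0] - 1)

print(f"Retained dimensions: {r}")
print(f"Whitened shape: {Z.shape}")
print(f"cov(Z) ≈ I: {np.allclose(cov_Z, np.eye(r))}")
```

The output has \(r\) columns rather than the original number of features, so the transformation is a whitening of the data within the subspace it actually occupies.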

Common Mistakes: (1) Not centering before whitening; centering is a separate step but essential for whitening. (2) Forgetting the transpose, i.e., using \(Z = X_c L^{-1}\) instead of \(Z = X_c L^{-\top}\); with samples stored as rows, \(\text{Cov}(X_c L^{-\top}) = L^{-1} \Sigma L^{-\top} = I\), whereas \(X_c L^{-1}\) gives \(L^{-\top} \Sigma L^{-1}\), which is not the identity in general. (3) Fitting whitening on training data, then forgetting to transform test data (or transforming it with separately fitted statistics). (4) Whitening categorical or mixed-type data without preprocessing; whitening assumes continuous features and uses only second moments, so for non-Gaussian data uncorrelated outputs are not necessarily independent.

Chapter Connections: Implements basis change via spectral decomposition (Theorem 3.5): the covariance matrix (symmetric) is diagonalized by orthonormal eigenvectors, and whitening changes to this eigenbasis with variance normalization. Uses Definition 2.8 (Rank) and rank-nullity: if centered data has rank \(r < n_{\text{features}}\), whitening produces an \(r\)-dimensional subspace (with smaller \(r\), the transformation becomes degenerate; use pseudoinverse). Related to Example 7 (PCA) and Example 10 (preconditioning in regularization); whitening is an extreme form of preconditioning (scaling each direction equally).


C.9. Rank and Nullity of Design Matrix

Code:

def diagnose_design_matrix(X, y):
    """Diagnose rank, nullity, and solvability of least squares."""
    n_samples, n_features = X.shape
    rank_X = np.linalg.matrix_rank(X)
    nullity = n_features - rank_X
    
    print(f"Samples: {n_samples}, Features: {n_features}")
    print(f"Rank(X): {rank_X}, Nullity: {nullity}")
    
    # Check solvability
    if n_samples >= n_features:
        if rank_X == n_features:
            print("Full column rank: unique least-squares solution")
        else:
            print(f"Rank deficiency {n_features - rank_X}: infinitely many solutions")
    else:
        if rank_X == n_samples:
            print("Full row rank: every output reachable")
        else:
            print("Row rank deficiency: some outputs unreachable")
    
    # Dimension of solution set
    if nullity > 0:
        print(f"Solution set is affine subspace of dimension {nullity}")
    
    # Least-squares solution (exists but may be non-unique)
    w_ls = np.linalg.lstsq(X, y, rcond=None)[0]
    residual_norm = np.linalg.norm(X @ w_ls - y)
    print(f"LS residual norm: {residual_norm:.4f}")
    return rank_X, nullity, w_ls

# Example: underdetermined system
np.random.seed(42)
n_samples, n_features = 5, 8
X = np.random.randn(n_samples, n_features)
y = X @ np.random.randn(n_features) + 0.1 * np.random.randn(n_samples)
rank_X, nullity, w_ls = diagnose_design_matrix(X, y)

Expected Output: Rank < features; nullity = 3; solution set is 3-dimensional affine subspace.

Numerical / Shape Notes: For an underdetermined (\(m < n\)) system, rank is at most \(m\); nullity \(= n - \text{rank}\) determines solution non-uniqueness.

Explanation: The rank-nullity theorem (Theorem 2.4) partitions feature dimension into rank (information preserved) and nullity (information lost): \(n_{\text{features}} = \text{rank}(X) + \text{nullity}\). Rank is the dimension of the column space (\(\text{Col}(X)\)), the space of attainable outputs; nullity is the dimension of the null space (\(\text{Null}(X)\)), the space of “invisible” input differences. When nullity > 0, solving \(X\mathbf{w} = \mathbf{y}\) yields non-unique solutions: if \(\mathbf{w}_p\) is one solution, all others are \(\mathbf{w}_p + \mathbf{n}\) where \(\mathbf{n} \in \text{Null}(X)\). This is typical in underdetermined regression (\(m < n\), more features than samples). Regularization (ridge regression) selects a unique solution by adding a penalty term; as \(\lambda \to 0^+\) the ridge solution converges to the minimum-norm solution, the one with zero component in \(\text{Null}(X)\).
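
The relationship between ridge and the minimum-norm solution can be traced on a system small enough to solve by hand. In the sketch below, `X` and `y` are hypothetical values chosen so that \(XX^\top = 2I\), which gives \(\mathbf{w}_\text{ridge} = X^\top \mathbf{y} / (2 + \lambda)\) and \(\mathbf{w}_\text{min} = X^\top \mathbf{y} / 2\) in closed form.

```python
import numpy as np

X = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])    # 2 samples, 4 features: rank 2, nullity 2
y = np.array([1.0, 2.0])

w_min = np.linalg.pinv(X) @ y           # minimum-norm solution: [0.5, 1, 0.5, 1]

def ridge(X, y, lam):
    """Ridge solution (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# The distance to the minimum-norm solution shrinks as lambda decreases.
lams = [1.0, 1e-2, 1e-4, 1e-6]
dists = [np.linalg.norm(ridge(X, y, lam) - w_min) for lam in lams]
for lam, dist in zip(lams, dists):
    print(f"lambda = {lam:.0e}: ||w_ridge - w_min|| = {dist:.2e}")

# Another exact solution: shift along a null-space direction of X.
n_dir = np.array([1.0, 0.0, -1.0, 0.0])
w_alt = w_min + 3.0 * n_dir
print(f"Same fit: {np.allclose(X @ w_alt, y)}, "
      f"norms: {np.linalg.norm(w_min):.3f} vs {np.linalg.norm(w_alt):.3f}")
```

Both `w_min` and `w_alt` reproduce `y` exactly; they differ only in their null-space component, and the ridge solutions approach the one where that component is zero.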

ML Interpretation: In high-dimensional regression (e.g., gene expression prediction, \(n = 100\) samples, \(d = 20000\) genes), the design matrix is rank-deficient: \(\text{rank}(X) \leq n = 100\), while \(d = 20000\). Nullity \(\geq 19900\), meaning there are at least 19900 “invisible” directions in feature space: different weight vectors along these directions produce identical predictions on the training data but may differ dramatically on test data (overfitting). Regularization shrinks solutions toward the origin, biasing toward lower-norm (simpler) solutions, reducing overfitting. Cross-validation selects the regularization strength. In overparameterized neural networks (more parameters than training samples) the same non-uniqueness arises, and techniques such as dropout or batch normalization are often credited with an implicit regularizing effect on capacity and generalization.

Failure Modes: (1) Numerical rank determination: rank is determined via singular values; the threshold for “zero” vs. “small” SV affects the count. Use np.linalg.matrix_rank, which applies a reasonable default tolerance, but for borderline cases, inspect singular values manually. (2) Ignoring nullity: practitioners sometimes overlook solution non-uniqueness and confidently report a single learned weight vector, unaware that many equally-valid (on training data) alternatives exist. (3) Over-regularizing: adding too much regularization (high \(\lambda\) in ridge regression) pulls solutions far from the data, increasing bias and underfitting.

Common Mistakes: (1) Assuming full rank when features are correlated; always compute rank explicitly. (2) Not regularizing when nullity > 0, leading to unstable solutions sensitive to data noise. (3) Confusing exact rank with numerical rank; the exact rank is the number of nonzero singular values, while the computed rank counts singular values above a tolerance, and the two agree only when that tolerance separates true zeros from rounding noise. (4) Assuming regularized solutions are “wrong” because they don’t minimize training loss; they minimize training loss + penalty, a better quantity for generalization.

Chapter Connections: Central application of Theorem 2.4 (Rank-Nullity): directly computes and explains the relationship between rank, nullity, and solution structure. Uses Definition 2.5 (Column/Row/Null Space): rank = dim(Col(X)), nullity = dim(Null(X)). Directly relates to Example 8 (solvability of linear systems: solvable iff \(\text{rank}([X | \mathbf{y}]) = \text{rank}(X)\)), Example 9 (least-squares solution and non-uniqueness when nullity > 0), and Example 10 (ridge regression selecting a unique solution). Fundamental to understanding when linear models are identifiable and when regularization is necessary.


C.10. Ridge Regression and Effective Dimensionality

Code:

def ridge_effective_dimension(X, lambdas):
    """
    Compute effective dimension of ridge regression.
    Effective dimension = sum_i (sigma_i^2 / (sigma_i^2 + lambda))
    where sigma_i are singular values of X.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    eff_dims = []
    for lam in lambdas:
        # Ridge shrinks contributions of small singular values
        eff_dim = np.sum(s**2 / (s**2 + lam))
        eff_dims.append(eff_dim)
    return eff_dims

np.random.seed(42)
X = np.random.randn(100, 20)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(f"Singular values (top 5): {s[:5]}")
print(f"Condition number: {s[0] / s[-1]:.2f}")

lambdas = [0.001, 0.01, 0.1, 1, 10, 100]
eff_dims = ridge_effective_dimension(X, lambdas)
for lam, eff_dim in zip(lambdas, eff_dims):
    print(f"λ={lam:7.3f}: eff_dim = {eff_dim:6.2f}")

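Expected Output: As \(\lambda\) increases, effective dimension decreases monotonically from about 20 (at \(\lambda = 0.001\)); for this \(X\) it is still roughly half of 20 at \(\lambda = 100\), and it approaches 0 only as \(\lambda \to \infty\).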

Numerical / Shape Notes: Effective dimension is a continuous function of \(\lambda\), interpolating between full dimension (no regularization) and zero (infinite regularization). Singular value spectrum and condition number guide the trade-off.

Explanation: Ridge regression solves \(\min_{\mathbf{w}} \| X\mathbf{w} - \mathbf{y} \|^2 + \lambda \| \mathbf{w} \|^2\). The solution is \(\mathbf{w}_\text{ridge} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}\). In terms of the SVD (\(X = U \Sigma V^\top\)), this becomes \(\mathbf{w}_\text{ridge} = V \text{diag}(\frac{\sigma_i}{\sigma_i^2 + \lambda}) U^\top \mathbf{y}\). The effective dimension is \(\text{eff-dim}(\lambda) = \sum_i \frac{\sigma_i^2}{\sigma_i^2 + \lambda}\), measuring how many (fractional) dimensions are “active” (not shrunk to zero). Large \(\sigma_i\) (high-variance directions) are barely shrunk (\(\sigma_i^2 \gg \lambda\)), while small \(\sigma_i\) (noise-sensitive directions) are heavily shrunk (\(\sigma_i^2 \ll \lambda\)). This is shrinkage-based regularization: relative to least squares, the coordinate along the \(i\)-th singular direction is multiplied by \(\frac{\sigma_i^2}{\sigma_i^2 + \lambda}\), so directions with smaller singular values are shrunk more. The penalty \(\lambda\) controls the shrinkage strength, acting as a “dimension dial”: increase \(\lambda\) to reduce effective dimension (simpler model, lower variance but higher bias), decrease \(\lambda\) to increase effective dimension (fit training data better, higher variance but lower bias).
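
The SVD form of the ridge solution and the effective-dimension formula can both be checked against direct computation. The sketch below uses random data and an arbitrary \(\lambda = 10\) (both assumptions); it also uses the standard identity that the effective dimension equals the trace of the ridge hat matrix \(X (X^\top X + \lambda I)^{-1} X^\top\).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = rng.standard_normal(100)
lam = 10.0
d = X.shape[1]

# Ridge solution via the SVD: w = V diag(s / (s^2 + lam)) U^T y.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

# Ridge solution via the regularized normal equations.
w_normal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Effective dimension two ways: shrinkage factors vs. trace of the hat matrix.
eff_dim = np.sum(s**2 / (s**2 + lam))
hat = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)

print(f"SVD and normal-equation solutions agree: {np.allclose(w_svd, w_normal)}")
print(f"eff_dim = {eff_dim:.3f}, trace(hat) = {np.trace(hat):.3f}")
```

The trace form is the one usually quoted as “degrees of freedom” of the ridge fit; the SVD form is cheaper when many values of \(\lambda\) are scanned, since the decomposition is computed once.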

ML Interpretation: Ridge regression trades bias against variance through its effective dimension. In high-dimensional settings (many features, few samples), a large \(\lambda\) is needed to avoid overfitting, resulting in low effective dimension. As more data is collected, the optimal \(\lambda\) can decrease (more data supports higher dimension), so effective dimension increases. The balance is data-driven rather than fixed in advance: you tune \(\lambda\) via cross-validation, and the selected value determines how much regularization is applied. In neural networks, L2 weight regularization is analogous: it shrinks weights (especially along directions with small singular values), reducing effective capacity. In probabilistic models (e.g., Bayesian regression), a Gaussian prior on weights \(\mathcal{N}(\mathbf{0}, \frac{1}{\lambda} I)\) with unit noise variance makes the MAP estimate coincide with the ridge solution; the prior variance \(\frac{1}{\lambda}\) sets the “effective dimension” of plausible parameter spaces.

Failure Modes: (1) Choosing \(\lambda\) naively: selecting \(\lambda\) on training data drives it toward 0, because training error is minimized with no regularization, and the result overfits. Always use cross-validation or a held-out validation set. (2) Condition number sensitivity: if \(X\) is ill-conditioned (large \(\sigma_{\max} / \sigma_{\min}\)), a very small \(\lambda\) is needed to keep the effective dimension \(\approx d\), but then the solution remains numerically unstable. Ridge improves conditioning: \(X^\top X + \lambda I\) has condition number \(\frac{\sigma_{\max}^2 + \lambda}{\sigma_{\min}^2 + \lambda} \leq 1 + \sigma_{\max}^2 / \lambda\), which is much smaller than \(\sigma_{\max}^2 / \sigma_{\min}^2\) whenever \(\lambda \gg \sigma_{\min}^2\). (3) Forgetting to standardize features: if features have different scales, ridge treats them unequally (large-scale features are shrunk less). Standardize features before ridge regression.
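
To make the conditioning point concrete, the sketch below compares the numerically computed condition number of \(X^\top X + \lambda I\) with the spectral expression \((\sigma_{\max}^2 + \lambda) / (\sigma_{\min}^2 + \lambda)\). The badly scaled design matrix and the list of \(\lambda\) values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Ill-conditioned design (assumed): column scales span four orders of magnitude
X = rng.standard_normal((100, 5)) * np.array([1.0, 1e-1, 1e-2, 1e-3, 1e-4])
s = np.linalg.svd(X, compute_uv=False)   # singular values, largest first

rows = []
for lam in [0.0, 1e-6, 1e-3, 1.0]:
    A = X.T @ X + lam * np.eye(X.shape[1])
    cond_numeric = np.linalg.cond(A)                        # computed from A directly
    cond_formula = (s[0]**2 + lam) / (s[-1]**2 + lam)       # from the singular values of X
    rows.append((lam, cond_numeric, cond_formula))
    print(f"lam={lam:g}: cond={cond_numeric:.3e}, formula={cond_formula:.3e}")
```

The printed condition number should fall steadily as \(\lambda\) grows, since \(\lambda\) lifts the smallest eigenvalue away from zero while barely changing the largest.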

Common Mistakes: (1) Treating any reduction in effective dimension as “wasted” capacity; regularization reduces variance (good for generalization) at the cost of increased bias, and that trade-off is the point. (2) Interpreting effective dimension as the “true” dimension of the underlying data-generating process; it depends on the chosen regularization strength as well as on the singular value spectrum, so it is not a property of the data alone. (3) Not tuning \(\lambda\); a default value is almost always suboptimal, so use cross-validation. (4) Confusing ridge (L2 regularization) with LASSO (L1 regularization), which performs feature selection (sparse solutions) rather than shrinkage.

Chapter Connections: Applies Theorem 2.4 (Rank-Nullity) and the SVD (Theorem 3.2): ridge regression acts as a soft projection onto the leading singular vectors, down-weighting the trailing directions rather than discarding them, which reduces the effective dimension of the solution space. Uses Definition 2.8 (Rank): for \(\lambda > 0\), the rank of \(X^\top X + \lambda I\) is always full (due to the \(\lambda I\) term), preventing singularity even when \(X\) is rank-deficient. Relates to Example 9 (least-squares solvability and non-uniqueness) and Example 10 (addressing multicollinearity via regularization). Most directly captures the interplay between rank, dimension, and generalization in regularized linear models.


C.11. Gram Matrix Rank and Feature Redundancy

Code:

def gram_rank_analysis(X):
    """
    Analyze Gram matrix: G = X X^T.
    Theorem: rank(G) = rank(X) = rank(X^T X) (for any X).
    """
    n_samples, n_features = X.shape
    rank_X = np.linalg.matrix_rank(X)
    
    G = X @ X.T  # Gram matrix, n_samples × n_samples
    rank_G = np.linalg.matrix_rank(G)
    
    XtX = X.T @ X  # Design matrix cross-product
    rank_XtX = np.linalg.matrix_rank(XtX)
    
    print(f"Rank(X) = {rank_X}")
    print(f"Rank(X X^T) = {rank_G}")
    print(f"Rank(X^T X) = {rank_XtX}")
    print(f"Are all ranks equal? {rank_X == rank_G == rank_XtX}")
    
    # Eigenvalues of G and X^T X
    evals_G = np.linalg.eigvalsh(G)
    evals_XtX = np.linalg.eigvalsh(XtX)
    evals_G = np.sort(evals_G)[::-1]
    evals_XtX = np.sort(evals_XtX)[::-1]
    
    print(f"Nonzero eigenvalues of G: {np.sum(evals_G > 1e-10)}")
    print(f"Nonzero eigenvalues of X^T X: {np.sum(evals_XtX > 1e-10)}")
    
    return rank_X, rank_G, rank_XtX

# Example
np.random.seed(42)
X = np.random.randn(50, 10)  # 50 samples, 10 features
rank_X, rank_G, rank_XtX = gram_rank_analysis(X)

Expected Output: Rank(X) = 10, Rank(Gram) = 10, Rank(X^T X) = 10; nonzero eigenvalues = 10 for both G and X^T X.

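Numerical / Shape Notes: Gram matrix \(G \in \mathbb{R}^{n_{\text{samples}} \times n_{\text{samples}}}\) is symmetric positive semidefinite. Its nonzero eigenvalues equal the squared singular values of \(X\); the remaining \(n_{\text{samples}} - \text{rank}(X)\) eigenvalues are zero (up to rounding).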

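Explanation: The Gram matrix \(G = XX^\top\) encodes pairwise similarities between samples (row inner products). A fundamental theorem states \(\text{rank}(XX^\top) = \text{rank}(X) = \text{rank}(X^\top X)\). The product rule \(\text{rank}(AB) \leq \min(\text{rank}(A), \text{rank}(B))\) gives only the upper bound \(\text{rank}(X^\top X) \leq \text{rank}(X)\); equality comes from comparing kernels. If \(X^\top X \mathbf{v} = \mathbf{0}\), then \(\mathbf{v}^\top X^\top X \mathbf{v} = \| X\mathbf{v} \|^2 = 0\), so \(X\mathbf{v} = \mathbf{0}\); hence \(\ker(X^\top X) = \ker(X)\), and rank-nullity yields \(\text{rank}(X^\top X) = \text{rank}(X)\). Applying the same argument to \(X^\top\) gives \(\text{rank}(XX^\top) = \text{rank}(X^\top) = \text{rank}(X)\). Geometrically, \(G\) acts on \(\mathbb{R}^{n_{\text{samples}}}\) (sample space), and its image is the column space of \(X\). The nonzero eigenvalues of \(G\) equal the squared singular values of \(X\); when \(X\) is not square, the larger of \(XX^\top\) and \(X^\top X\) simply carries extra zero eigenvalues. This connection is crucial for kernel methods and distance-based algorithms.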
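
The eigenvalue relationship is easiest to see on a matrix that is rank-deficient by construction. The sketch below builds an assumed \(8 \times 5\) matrix of rank 3 (an illustrative example, not data from the text) and compares the spectra of \(XX^\top\) and \(X^\top X\) with the squared singular values of \(X\).

```python
import numpy as np

rng = np.random.default_rng(2)
# Rank-deficient example (assumed): 8 samples, 5 features, rank 3 by construction
X = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 5))

s = np.linalg.svd(X, compute_uv=False)                    # 5 singular values, 2 of them ~ 0
evals_G = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]      # 8 eigenvalues of X X^T
evals_XtX = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]    # 5 eigenvalues of X^T X

r = np.linalg.matrix_rank(X)
print("rank(X):", r)
# Nonzero eigenvalues of both products equal the squared singular values of X
print(np.allclose(evals_G[:r], s[:r]**2), np.allclose(evals_XtX[:r], s[:r]**2))
# All remaining eigenvalues vanish up to rounding error
print(np.allclose(evals_G[r:], 0.0, atol=1e-10), np.allclose(evals_XtX[r:], 0.0, atol=1e-10))
```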

ML Interpretation: In kernel methods (SVM, kernel ridge regression), the Gram matrix replaces explicit feature vectors: instead of computing \(X\mathbf{w}\), algorithms work with \(G = XX^\top\) directly. The rank of \(G\) determines the effective dimensionality of the problem in the feature space. If some features are exact linear combinations of others, \(\text{rank}(X) < \text{number of features}\), and \(G\) gains additional zero eigenvalues (in floating point, tiny eigenvalues at the level of rounding error); strongly correlated but not exactly dependent features produce small, nonzero eigenvalues instead. Eigenvalue thresholding removes these near-zero directions, effectively denoising the Gram matrix. In recommender systems, the user-user Gram matrix (inner products of user preference vectors) approximates user similarity; low rank indicates that users cluster into a few types (latent factors). Gram matrix rank is also diagnostic: rank \(< n_{\text{samples}}\) means some samples are linearly dependent in feature space (redundant data points, or dimensionality much lower than sample count).

Failure Modes: (1) Numerical rank estimation: Gram matrices are often ill-conditioned (large spread of eigenvalues), making rank determination sensitive to the threshold. Use a tolerance relative to the largest eigenvalue, or compute the SVD of \(X\) directly (more stable than forming the Gram matrix explicitly). (2) Gram matrix formation is expensive: explicitly computing \(G = XX^\top\) takes \(O(n_{\text{samples}}^2 d)\) time and \(O(n_{\text{samples}}^2)\) memory; for large \(n_{\text{samples}}\), use kernel tricks (implicit Gram matrix computation) or approximations (Nyström, random features). (3) Forming the Gram matrix squares the singular values: eigenvalues of \(XX^\top\) are \(\sigma_i^2\), where \(\sigma_i\) are the singular values of \(X\). Squaring makes large values larger and small values smaller, so the condition number on the nonzero spectrum is squared and small singular values can fall below machine precision.
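
A small deterministic case shows how squaring can hide a direction from a rank test. The matrix below is a hypothetical example with singular values 1 and \(10^{-9}\); the rank test is NumPy's `matrix_rank` with its default tolerance.

```python
import numpy as np

# Hypothetical 3x2 matrix with singular values 1 and 1e-9
X = np.array([[1.0, 0.0],
              [0.0, 1e-9],
              [0.0, 0.0]])

rank_X = np.linalg.matrix_rank(X)          # the SVD of X resolves 1e-9 without trouble
rank_G = np.linalg.matrix_rank(X @ X.T)    # eigenvalue 1e-18 falls below the default tolerance
cond_X = np.linalg.cond(X)                 # ratio of singular values of X
cond_XtX = np.linalg.cond(X.T @ X)         # roughly cond_X squared

print(f"rank(X) = {rank_X}, rank(X X^T) = {rank_G}")
print(f"cond(X) = {cond_X:.1e}, cond(X^T X) = {cond_XtX:.1e}")
```

In exact arithmetic both ranks are 2; the discrepancy is purely a floating-point effect, which is why working with the SVD of \(X\) is the safer default.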

Common Mistakes: (1) Computing the explicit Gram matrix and then its SVD, when it’s faster and more numerically stable to take the SVD of \(X\) directly. (2) Confusing the Gram matrix with the covariance matrix; \(G = XX^\top\) encodes sample similarities, while the covariance \(\frac{1}{n} X^\top X\) (for centered \(X\)) encodes feature correlations. (3) Assuming the Gram matrix is always positive definite; it’s positive semidefinite (eigenvalues \(\geq 0\)), with zero eigenvalues whenever \(\text{rank}(X)\) is less than the number of samples. (4) Confusing the Gram matrix with the kernel function (as in kernel methods); the Gram matrix is the kernel evaluated on all pairs of data points, not the kernel function itself.

Chapter Connections: Illustrates how rank behaves under multiplication: \(\text{rank}(AB) \leq \min(\text{rank}(A), \text{rank}(B))\) gives \(\text{rank}(XX^\top) \leq \text{rank}(X)\), and transposition preserves rank (\(\text{rank}(X^\top) = \text{rank}(X)\)). The reverse inequality, and hence \(\text{rank}(XX^\top) = \text{rank}(X)\), follows from \(\ker(X^\top X) = \ker(X)\) together with rank-nullity (Theorem 2.4). Uses Definition 2.8 (Rank) and Definition 2.5 (Image/Column Space): the rows of \(X\) span a \(\text{rank}(X)\)-dimensional subspace, so the matrix of their pairwise inner products (the Gram matrix) has the same rank. Relates to Example 7 (PCA and eigendecomposition) and Example 11 (Gram-Schmidt and inner product structure).


C.12. Dimension Reduction via Truncated SVD (Eckart-Young)

Code:

def truncated_svd_reconstruction(X, k):
    """
    Reconstruct X via rank-k SVD: X_k = U_k Sigma_k V_k^T
    Eckart-Young Theorem: X_k minimizes ||X - Z||_F over all rank-k Z.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Truncate to rank k
    U_k = U[:, :k]
    s_k = s[:k]
    Vt_k = Vt[:k, :]
    X_k = U_k @ np.diag(s_k) @ Vt_k
    
    # Error: Frobenius norm
    error_frobenius = np.linalg.norm(X - X_k, 'fro')
    tail_energy = np.linalg.norm(s[k:])  # 2-norm of the discarded singular values: sqrt(sum of squares)
    
    # Eckart-Young check: Frobenius error equals this tail norm (not the plain sum)
    return X_k, error_frobenius, tail_energy

np.random.seed(42)
X = np.random.randn(20, 15)
print(f"Original X shape: {X.shape}")

U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(f"Singular values (first 10): {s[:10]}")
print(f"Singular value decay: {s[0] / s[-1]:.2f}x")

for k in [2, 5, 10, 15]:
    X_k, error, tail = truncated_svd_reconstruction(X, k)
    print(f"k={k:2d}: Frobenius error = {error:.4f}, Tail norm = {tail:.4f}, Match: {np.isclose(error, tail)}")

Expected Output: For each \(k\), the Frobenius error matches the root-sum-of-squares of the discarded singular values (Eckart-Young); for \(k = 15\) both are zero up to rounding.

Numerical / Shape Notes: Truncated SVD provides the optimal rank-\(k\) approximation in Frobenius norm. Reconstruction error equals \(\sqrt{\sum_{i > k} \sigma_i^2}\), the 2-norm of the discarded singular values (here \(\min(20, 15) - k = 15 - k\) of them).

Explanation: SVD decomposes \(X = U \Sigma V^\top\), where \(U\) and \(V\) have orthonormal columns and \(\Sigma\) is diagonal (singular values \(\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_r\), with \(r = \min(m, d)\)). Truncating to rank \(k\) yields \(X_k = U_k \Sigma_k V_k^\top\), discarding the smallest \(r - k\) singular values. The Eckart-Young Theorem states that \(X_k\) minimizes the Frobenius norm error \(\| X - Z \|_F\) among all matrices \(Z\) of rank at most \(k\). The minimal error is exactly \(\| X - X_k \|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2}\) (the root-sum-of-squares of the discarded singular values, not their plain sum). This makes truncated SVD the canonical dimension reduction: it preserves the “most important” dimensions (largest singular values) and discards noise (smallest singular values). The singular value decay (how fast \(\sigma_i\) decreases) determines the quality of low-rank approximation: fast decay means low-rank approximation error is small (intrinsic dimension is low), slow decay means the matrix has structure across many scales (high intrinsic dimension, poor low-rank approximation).
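
The error formula can be checked in both the Frobenius and spectral norms. The sketch below uses an assumed random \(20 \times 15\) matrix and \(k = 5\); the competitor \(Z\) (a projection of \(X\) onto a random \(k\)-dimensional subspace) is a hypothetical comparison point, included only to illustrate that another rank-\(k\) matrix does no better.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 15))       # illustrative data (assumption)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 5
X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]    # rank-k truncation U_k Sigma_k V_k^T

err_fro = np.linalg.norm(X - X_k, 'fro')
err_spec = np.linalg.norm(X - X_k, 2)
tail_l2 = np.sqrt(np.sum(s[k:]**2))     # root-sum-of-squares of the discarded values
tail_l1 = np.sum(s[k:])                 # plain sum, shown for contrast

print("Frobenius error == tail 2-norm:", np.isclose(err_fro, tail_l2))
print("spectral error == sigma_{k+1}: ", np.isclose(err_spec, s[k]))
print("Frobenius error == plain sum:  ", np.isclose(err_fro, tail_l1))

# Competitor: project X onto a random k-dimensional subspace (rank at most k)
Q, _ = np.linalg.qr(rng.standard_normal((20, k)))
Z = Q @ (Q.T @ X)
err_competitor = np.linalg.norm(X - Z, 'fro')
print("competitor is no better:", err_competitor >= err_fro)
```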

ML Interpretation: Truncated SVD is the foundation of many dimensionality reduction and compression techniques. In image compression, truncating the SVD of image matrices (analogous to discarding small DCT coefficients in JPEG) yields efficient compression by retaining high-energy singular vectors and discarding small singular values, at the cost of minor visual artifacts. In latent factor models (collaborative filtering), factorizing the user-item matrix via truncated SVD identifies \(k\) latent factors; choosing \(k\) trades reconstruction error (small \(k\) = high error, lossy) vs. overfitting to noise (large \(k\)). In denoising, if data \(X = X_{\text{signal}} + X_{\text{noise}}\), and noise has small singular values (flat spectrum), truncating the SVD of noisy \(X\) approximately recovers \(X_{\text{signal}}\) (this is the “spectral denoising” heuristic). In neural networks, truncated SVD can compress weight matrices (replace a matrix with a low-rank factorization), reducing parameters and computation. The Eckart-Young optimality guarantees that truncated SVD is the best possible low-rank approximation in the Frobenius norm, though it may not be the best for specific downstream tasks (task-aware dimensionality reduction might yield different low-rank factors).

Failure Modes: (1) Choosing \(k\) heuristically: visual inspection (looking for an “elbow” in the singular value spectrum) often works but is subjective. Cross-validation on downstream tasks is more objective. (2) Assuming a low-rank linear fit captures all low-dimensional structure: if \(X = U D V^\top\) has rank \(k\), the data lie in a \(k\)-dimensional subspace, but this doesn’t account for nonlinear structure (manifolds); linear low-rank approximation may miss curvature. (3) Forgetting full SVD vs. reduced SVD: computing the full \(U\) (size \(m \times m\)) is expensive when \(m \gg d\); use the reduced SVD (\(U\) of size \(m \times r\), with \(r = \min(m, d)\)) for efficiency.
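
The cost difference between the full and reduced SVD shows up directly in the array shapes. The sketch below uses an assumed tall matrix (500 rows, 10 columns) and NumPy's `full_matrices` flag.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((500, 10))   # tall matrix (assumed sizes): m = 500, d = 10

U_full, s_full, Vt_full = np.linalg.svd(X, full_matrices=True)
U_red, s_red, Vt_red = np.linalg.svd(X, full_matrices=False)

print("full:   ", U_full.shape, Vt_full.shape)   # U is m x m
print("reduced:", U_red.shape, Vt_red.shape)     # U is m x min(m, d)

# The reduced factors already reconstruct X; the extra columns of U_full are unused
X_rebuilt = (U_red * s_red) @ Vt_red
print("reduced SVD reconstructs X:", np.allclose(X, X_rebuilt))
```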

Common Mistakes: (1) Assuming truncated SVD is optimal for supervised learning (e.g., predicting \(\mathbf{y}\) from \(X\)); the optimal \(k\) for prediction may differ from the optimal \(k\) for reconstruction (choose \(k\) via cross-validation on prediction loss, not reconstruction error). (2) Not centering \(X\) before truncated SVD when used for preprocessing; centering aligns with PCA and helps interpretability. (3) Confusing singular values with eigenvalues; eigenvalues of \(X^\top X\) are \(\sigma_i^2\), not \(\sigma_i\). (4) Applying truncated SVD to raw, unnormalized data with vastly different feature scales; the SVD will be dominated by large-scale features. Normalize first.

Chapter Connections: Directly applies the Singular Value Decomposition (Theorem 3.2) and Eckart-Young Theorem: the truncated SVD is the optimal rank-\(k\) approximation in Frobenius norm. Uses Definition 2.8 (Rank): the rank of \(X_k\) is exactly \(k\) (provided \(\sigma_k > 0\)), and its row and column spaces are spanned by the top-\(k\) right and left singular vectors. Relates to Example 7 (PCA, which is the SVD of the centered data matrix), Example 12 (SVD as the foundation for spectral methods), and Example 5 (understanding when low-rank approximations preserve important structure).


C.13. Intrinsic Dimension Estimation via Variance Threshold

Code:

def estimate_intrinsic_dimension_pca(X, variance_threshold=0.95):
    """
    Estimate intrinsic dimension via PCA: number of components needed to exceed threshold.
    """
    X_centered = X - X.mean(axis=0, keepdims=True)
    cov = X_centered.T @ X_centered / (X.shape[0] - 1)
    evals = np.linalg.eigvalsh(cov)
    evals = np.sort(evals)[::-1]  # Decreasing order
    total_variance = evals.sum()
    cumulative_variance = np.cumsum(evals) / total_variance
    
    # Find k such that cumsum(evals[:k]) / total >= threshold
    k = np.argmax(cumulative_variance >= variance_threshold) + 1
    
    print(f"Total variance: {total_variance:.4f}")
    print(f"Intrinsic dimension (th={variance_threshold:.0%}): {k}")
    print(f"Top 5 variance fractions: {evals[:5] / total_variance}")
    print(f"Cumulative variance: {cumulative_variance[:min(k+2, len(cumulative_variance))]}")
    
    return k, evals, cumulative_variance

np.random.seed(42)
# Generate data with low intrinsic dimension
n_samples, ambient_dim, intrinsic_dim = 500, 100, 10
W = np.random.randn(ambient_dim, intrinsic_dim)  # Embedding (mixing) matrix: maps latent coordinates into ambient space
Z = np.random.randn(n_samples, intrinsic_dim)   # Low-dim latent data
X = Z @ W.T + 0.01 * np.random.randn(n_samples, ambient_dim)  # Embed + noise

k, evals, cumvar = estimate_intrinsic_dimension_pca(X, variance_threshold=0.95)

Expected Output: Intrinsic dimension estimate \(\approx\) 10 (the true intrinsic dimension).

Numerical / Shape Notes: Eigenvalue spectrum reveals intrinsic structure: fast decay indicates low intrinsic dimension, slow decay indicates data complexity at many scales.

Explanation: If data lie near a \(k\)-dimensional manifold (or subspace) embedded in \(\mathbb{R}^d\), PCA reveals this via the eigenvalue spectrum of the covariance matrix: the top \(k\) eigenvalues (variance fractions) capture a large fraction of total variance (e.g., 95%), while the remaining eigenvalues are small (noise). The threshold-based estimate of intrinsic dimension selects \(k\) such that the top \(k\) PCs explain a target fraction (e.g., 95%) of variance. This is a heuristic: the true intrinsic dimension may differ (e.g., if the manifold has curvature or is nonlinear), but for linear subspaces or mildly curved manifolds, it often works well. The method is unsupervised and computationally cheap (just eigendecomposition), making it a practical first pass at estimating dimensionality.

ML Interpretation: Understanding intrinsic dimension is crucial for designing models: if data have intrinsic dimension \(k \ll d\), a dimensionality reduction to \(k\) or slightly above retains nearly all of the variance while discarding mostly noise, which often improves generalization. If the estimate suggests \(k \approx d\) (no reduction), the data are high-complexity, and more sophisticated methods (nonlinear manifold learning, deep autoencoders) may be needed. In active learning or experimental design, knowing intrinsic dimension guides how many samples are needed: as a rule of thumb, roughly \(c k / \epsilon^2\) samples suffice to learn a \(k\)-dimensional model to \(\epsilon\)-accuracy (ignoring logarithmic factors). In noisy settings (e.g., sensor data with measurement error), the intrinsic dimension of the underlying signal is lower than the dimension of noisy observations; separating signal from noise requires estimating where the spectral decay transitions from signal to noise.

Failure Modes: (1) Threshold choice is arbitrary: different thresholds (80%, 95%, 99%) yield different \(k\) estimates. There’s no universal “right” threshold; it depends on the application (what noise level is tolerable?). (2) PCA assumes linearity: if data lie on a nonlinear manifold, PCA overestimates intrinsic dimension (a 1D circle embedded in 2D appears 2-dimensional to PCA). Use manifold learning methods (Isomap, locally linear embedding) for nonlinear cases. (3) Outliers inflate variance: a few outliers can inflate eigenvalues, increasing the estimated intrinsic dimension. Robust methods (e.g., using robust covariance estimates or outlier removal) are needed for noisy data.
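
The circle case in (2) is easy to reproduce. The sketch below samples evenly spaced points on the unit circle (an assumed construction for illustration) and applies the same variance-threshold rule as the function above.

```python
import numpy as np

# Points on the unit circle: intrinsically 1-D (one angle), embedded in 2-D
theta = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
X = np.column_stack([np.cos(theta), np.sin(theta)])

X_centered = X - X.mean(axis=0, keepdims=True)
cov = X_centered.T @ X_centered / (X.shape[0] - 1)
evals = np.sort(np.linalg.eigvalsh(cov))[::-1]
fractions = evals / evals.sum()
cumulative = np.cumsum(fractions)
k = int(np.argmax(cumulative >= 0.95) + 1)

print("variance fractions:", fractions)   # the two directions share the variance equally
print("PCA estimate at 95%:", k, "(the intrinsic dimension is 1)")
```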

Common Mistakes: (1) Using the threshold-based \(k\) as the “truth”; it’s merely an estimate reflecting the chosen threshold. Sensitivity analysis (varying the threshold) reveals robustness. (2) Not standardizing features before PCA; features with large variances dominate, and the intrinsic dimension estimate reflects the scale of features, not underlying structure. (3) Confusing sample count or ambient dimension with intrinsic dimension (the dimension of the underlying structure); a million 1000-dimensional vectors sampled from a 10-dimensional manifold have low intrinsic dimension despite the large ambient dimension and sample count. (4) Assuming the threshold directly translates to model capacity; estimating \(k = 15\) components needed for 95% variance doesn’t mean a 15-parameter model is sufficient (nonlinearity, task-specific importance may require more).
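
A quick sensitivity analysis of the kind suggested in (1) can be sketched on a fixed spectrum. The eigenvalues and thresholds below are hypothetical numbers chosen to make the effect visible, not estimates from any dataset.

```python
import numpy as np

# Hypothetical eigenvalue spectrum, sorted in decreasing order (sums to 10)
evals = np.array([6.0, 2.0, 1.0, 0.5, 0.3, 0.2])
cumulative = np.cumsum(evals) / evals.sum()   # 0.60, 0.80, 0.90, 0.95, 0.98, 1.00

estimates = {}
for threshold in (0.75, 0.92, 0.99):
    # Smallest k whose cumulative variance fraction reaches the threshold
    k = int(np.argmax(cumulative >= threshold) + 1)
    estimates[threshold] = k
    print(f"threshold={threshold:.0%}: k = {k}")
```

The same spectrum yields three different answers, which is why the threshold-based estimate is best reported together with the threshold that produced it.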

Chapter Connections: Operationalizes Theorem 2.3 (Dimension): the intrinsic dimension of the data can be estimated via the rank of the covariance matrix (number of nonzero eigenvalues, or practically, eigenvalues above the noise level). Uses Definition 2.8 (Rank) and rank-nullity: if the top \(k\) components explain 95% of variance, the data lie approximately in a \(k\)-dimensional affine subspace, with the remaining 5% of variance outside it. Relates to Example 7 (PCA spectrum analysis) and Example 13 (choosing autoencoder bottleneck dimension via cross-validation, similar idea: find the dimension that balances compression and accuracy).


C.14. Autoencoder Bottleneck Dimension and Reconstruction Error

Code:

def linear_autoencoder_cross_validation(X, bottleneck_dims, test_frac=0.2):
    """
    Fit linear autoencoders (PCA) with various bottleneck dimensions.
    Evaluate reconstruction error on a single held-out split (a simple stand-in for k-fold cross-validation).
    """
    from sklearn.model_selection import train_test_split
    
    X_train_data, X_test = train_test_split(X, test_size=test_frac, random_state=42)
    
    results = {}
    for k in bottleneck_dims:
        # PCA on training data
        X_train_mean = X_train_data.mean(axis=0)
        X_train_centered = X_train_data - X_train_mean
        cov = X_train_centered.T @ X_train_centered / (len(X_train_centered) - 1)
        evals, evecs = np.linalg.eigh(cov)
        evecs = evecs[:, np.argsort(evals)[::-1]]  # Top eigenvectors
        W = evecs[:, :k]  # Bottleneck: k-dimensional basis
        
        # Encode and decode on training data
        X_train_code = (X_train_centered) @ W  # Bottleneck codes
        X_train_recon = X_train_code @ W.T  # Reconstruction
        train_mse = np.mean((X_train_centered - X_train_recon)**2)
        
        # Encode and decode on test data
        X_test_centered = X_test - X_train_mean
        X_test_code = X_test_centered @ W
        X_test_recon = X_test_code @ W.T
        test_mse = np.mean((X_test_centered - X_test_recon)**2)
        
        results[k] = {'train_mse': train_mse, 'test_mse': test_mse}
    
    return results

# Example: synthetic data with known intrinsic dimension
np.random.seed(42)
n_samples, ambient_dim, true_k = 300, 50, 8
Z = np.random.randn(n_samples, true_k)
W = np.random.randn(ambient_dim, true_k)
X = Z @ W.T + 0.01 * np.random.randn(n_samples, ambient_dim)

bottleneck_dims = list(range(2, 25, 2))
results = linear_autoencoder_cross_validation(X, bottleneck_dims)

print("Bottleneck\tTrain MSE\tTest MSE")
for k in bottleneck_dims[:8]:
    r = results[k]
    print(f"{k:d}\t\t{r['train_mse']:.6f}\t{r['test_mse']:.6f}")

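# Held-out reconstruction error of nested PCA subspaces never increases with k,
# so a raw argmin over test MSE would simply return the largest k tried.
# Pick the elbow instead: the smallest k whose test MSE is within a factor of 10
# of the best value (the factor 10 is a heuristic tolerance, not a tuned constant).
best_test_mse = min(r['test_mse'] for r in results.values())
optimal_k = min(k for k in bottleneck_dims if results[k]['test_mse'] <= 10 * best_test_mse)
print(f"\nElbow bottleneck dimension: {optimal_k} (true: {true_k})")

Expected Output: Train and test MSE drop by several orders of magnitude as \(k\) approaches 8, then flatten near the noise floor (on the order of \(10^{-4}\)); the elbow rule selects \(k = 8\), the true intrinsic dimension.

Numerical / Shape Notes: Both train and test MSE are non-increasing in \(k\) for a linear autoencoder, because the top-\(k\) PCA subspaces are nested and projecting onto a larger subspace cannot increase any residual. The informative feature of the test curve is therefore the elbow where it flattens, not a minimum; a U-shaped held-out curve should not be expected here.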

Explanation: An autoencoder with bottleneck dimension \(k\) is trained to reconstruct inputs \(\mathbf{x}\): \(\mathbf{x} \to \text{encode}(\mathbf{x}, W) \to \text{decode}(\text{code}, W) \to \hat{\mathbf{x}}\). For linear autoencoders (no activation functions) trained with squared error, the optimal solution spans the same subspace as PCA: the top-\(k\) principal components (eigenvectors of the covariance). The bottleneck dimension \(k\) acts as a capacity control: small \(k\) forces aggressive compression and discards signal, while large \(k\) also reconstructs noise directions. In the linear case the held-out reconstruction error is non-increasing in \(k\) (the PCA subspaces are nested), so the sweet spot is the elbow: the smallest \(k\) beyond which \(\text{MSE}_{\text{test}}(k)\) stops improving appreciably. That elbow typically equals (or is close to) the intrinsic dimension of the data, though noise and outliers can shift it. For nonlinear autoencoders (with nonlinear activations), the role of the bottleneck is similar, but the learned encoder/decoder differ from PCA and held-out error may rise with \(k\) when the model overfits.
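
The link between the linear autoencoder and PCA can be checked numerically: with the encoder and decoder tied to the top-\(k\) eigenvectors, the training reconstruction error equals the discarded eigenvalue mass. The sketch below uses assumed synthetic data (sizes, seed, and noise level are illustrative) and a hypothetical helper `recon_mse`. Because the PCA subspaces are nested, the residual of any fixed vector can only shrink as \(k\) grows, so the held-out error of this purely linear model flattens rather than turning upward.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, true_k = 300, 20, 4                      # illustrative sizes (assumption)
X = rng.standard_normal((n, true_k)) @ rng.standard_normal((true_k, d))
X += 0.01 * rng.standard_normal((n, d))
X_train, X_test = X[:240], X[240:]             # simple hold-out split

mean = X_train.mean(axis=0)
Xc_train, Xc_test = X_train - mean, X_test - mean
cov = Xc_train.T @ Xc_train / (len(Xc_train) - 1)
evals, evecs = np.linalg.eigh(cov)
order = np.argsort(evals)[::-1]                # sort eigenpairs by decreasing eigenvalue
evals, evecs = evals[order], evecs[:, order]

def recon_mse(Xc, W):
    """Per-entry squared error of encode (Xc @ W) followed by decode (@ W.T)."""
    return np.mean((Xc - Xc @ W @ W.T) ** 2)

train_mse, test_mse = [], []
for k in range(1, d + 1):
    W = evecs[:, :k]
    train_mse.append(recon_mse(Xc_train, W))
    test_mse.append(recon_mse(Xc_test, W))

# Training error = discarded eigenvalue mass, rescaled to a per-entry mean
n_train = len(Xc_train)
predicted = [(n_train - 1) / (n_train * d) * evals[k:].sum() for k in range(1, d + 1)]
print("train MSE matches discarded eigenvalues:", np.allclose(train_mse, predicted))
print("held-out MSE is non-increasing in k:    ", bool(np.all(np.diff(test_mse) <= 1e-12)))
```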

ML Interpretation: Autoencoders solve the representation learning problem: finding a good basis (learned basis) for encoding data. The bottleneck forces the model to discover concise representations. In unsupervised learning, reconstruction error is the objective; in semi-supervised or transfer learning, learned representations from autoencoders are transferred to downstream tasks (classification, clustering). The choice of bottleneck dimension directly affects this: too small and the representation loses information (hurts downstream tasks), too large and the model doesn’t compress (wastes computation and memory). Cross-validation guides this choice automatically. Variational autoencoders (VAEs) extend this to probabilistic settings (adding a stochastic encoder/decoder), and the bottleneck dimension in VAEs corresponds to the latent variable dimension, a key design parameter.

Failure Modes: (1) Overfitting autoencoder itself: even though the training objective is reconstruction (unsupervised), the autoencoder can overfit to training data if the bottleneck \(k\) is large and there’s no regularization. Evaluation on held-out data is essential. (2) Poor choice of train/test split: if the split is biased (e.g., near-duplicate or grouped samples land on both sides of the split), the autoencoder may learn dataset artifacts rather than generalizable structure. Use grouped, stratified, or temporal splits if applicable. (3) Nonlinear autoencoders are harder to interpret: unlike linear PCA, the learned representations of nonlinear autoencoders have no direct interpretation, complicating the choice of bottleneck dimension. Visualization or probing tasks help.

Common Mistakes: (1) Choosing \(k\) to minimize training reconstruction error; this always selects the largest \(k\) (\(k = d\), full capacity, no compression). Judge \(k\) on validation/test error, and for linear autoencoders look for the elbow rather than the minimum, since held-out error is also non-increasing in \(k\). (2) Not normalizing/centering data before training; autoencoders are sensitive to scale. (3) Confusing bottleneck dimension with the number of latent factors; they’re related but not identical (nonlinear autoencoders can have bottleneck dimension \(k\) yet learn \(> k\) factors via nonlinearity). (4) Assigning too much importance to a 5% test MSE reduction from \(k=10\) to \(k=15\); such marginal improvements may not justify the added complexity, and Occam’s razor favors smaller \(k\).

Chapter Connections: Applies rank-nullity theorem (Theorem 2.4) to autoencoders: the bottleneck constrains the rank of the encoder+decoder composition, limiting the model’s capacity to represent the input space. For linear autoencoders, the optimal solution is PCA (Example 7: basis selection maximizing variance), and the bottleneck dimension equals the rank of the learned weight matrix. Relates to Example 13 (choosing effective dimensionality via regularization) and Example 12 (SVD as optimal low-rank approximation, analogous to how bottleneck autoencoders learn optimal \(k\)-dimensional projections).


C.15. Coordinate Transformation for Feature Interpretability

Code:

def transform_classifier_coordinates(w, P, P_inv):
    """
    Transform linear classifier weights under a change of basis.
    Convention: the columns of P are the new basis vectors written in the old basis,
    so coordinates transform as x' = P^{-1} x and weights as w' = P^T w.
    Predictions are invariant: w'^T x' = w^T P P^{-1} x = w^T x.
    """
    # Forward: w (in old basis) -> w' (in new basis)
    w_new = P.T @ w
    # Backward: w = P^{-T} w' (round-trip check; equals P @ w' when P is orthogonal)
    w_recovered = P_inv.T @ w_new
    return w_new, w_recovered

# Example: classifier in 2D with weight vector w
np.random.seed(42)
w_original = np.array([0.5, -0.3])  # Classifier weights in standard basis

# Change of basis: rotate by 45 degrees
theta = np.pi / 4
P = np.array([[np.cos(theta), np.sin(theta)], 
              [-np.sin(theta), np.cos(theta)]])  # Rotation matrix (orthogonal)
P_inv = P.T  # For orthogonal, inverse = transpose

w_rotated, w_check = transform_classifier_coordinates(w_original, P, P_inv)

print(f"Weights in standard basis: {np.round(w_original, 3)}")
print(f"Weights in rotated basis: {np.round(w_rotated, 3)}")
print(f"Round-trip check: {np.allclose(w_check, w_original)}")

# Predictions are invariant
x_std = np.array([1.0, 2.0])  # Data point in standard coordinates
x_rot = P_inv @ x_std  # Same point in rotated coordinates: x' = P^{-1} x (= P^T x for orthogonal P)
pred_std = w_original @ x_std
pred_rot = w_rotated @ x_rot
print(f"Prediction (standard coords): {pred_std:.4f}")
print(f"Prediction (rotated coords): {pred_rot:.4f}")
print(f"Predictions match: {np.isclose(pred_std, pred_rot)}")

Expected Output: Weights transform into rotated basis; predictions remain identical in both coordinate systems.

Numerical / Shape Notes: Linear functional (dot product) is basis-invariant. Changing coordinates changes weight values but not the hyperplane defined by \(\mathbf{w}^T \mathbf{x} = \text{const}\).

Explanation: A linear classifier \(\mathbf{w}^\top \mathbf{x} = b\) defines a hyperplane. The weights \(\mathbf{w}\) are the normal vector to this hyperplane, i.e., the direction along which the score \(\mathbf{w}^\top \mathbf{x}\) increases fastest in the original coordinate system. Changing coordinates via \(\mathbf{x}' = P \mathbf{x}\) (or \([\mathbf{x}]_{\mathcal{B}'} = P [\mathbf{x}]_\mathcal{B}\)) re-expresses the decision boundary: in the new coordinates the hyperplane is \((\mathbf{w}')^\top \mathbf{x}' = b\), where \(\mathbf{w}' = P^{-\top} \mathbf{w} = (P^\top)^{-1} \mathbf{w}\). Weights therefore transform by the inverse transpose of the coordinate map; in tensor language, \(\mathbf{w}\) is a dual vector (covector), and its components transform oppositely to the contravariant components of \(\mathbf{x}\). The key insight is that predictions remain identical regardless of coordinates: \((\mathbf{w}')^\top \mathbf{x}' = \mathbf{w}^\top P^{-1} P \mathbf{x} = \mathbf{w}^\top \mathbf{x}\). For orthogonal \(P\) (rotation, reflection), \(P^{-\top} = P\), so weights and coordinates transform by the same matrix: \(\mathbf{w}' = P \mathbf{w}\). The code above maps coordinates with \(P^{-1} = P^\top\) (also orthogonal), which is why the weights are mapped with \(P^\top\) as well.
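
The inverse-transpose rule only becomes visible when \(P\) is not orthogonal. The sketch below uses the convention \(\mathbf{x}' = P\mathbf{x}\) with an assumed invertible, non-orthogonal \(2 \times 2\) matrix, and compares the rule \(\mathbf{w}' = P^{-\top}\mathbf{w}\) with the tempting shortcut of applying \(P\) to the weights as well.

```python
import numpy as np

w = np.array([0.5, -0.3])          # weights in the original coordinates
x = np.array([1.0, 2.0])           # a data point in the original coordinates
P = np.array([[2.0, 1.0],
              [0.0, 0.5]])         # invertible but not orthogonal (assumed example)

x_new = P @ x                              # coordinates: x' = P x
w_correct = np.linalg.solve(P.T, w)        # weights: w' = P^{-T} w, i.e. solve P^T w' = w
w_wrong = P @ w                            # shortcut: same matrix as the coordinates

pred_original = w @ x
pred_correct = w_correct @ x_new
pred_wrong = w_wrong @ x_new
print("original prediction:      ", pred_original)
print("inverse-transpose weights:", pred_correct)   # matches the original
print("same-matrix weights:      ", pred_wrong)     # differs for non-orthogonal P

# For an orthogonal matrix the two rules coincide: R^{-T} = R
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print("R^{-T} equals R:", np.allclose(np.linalg.inv(R).T, R))
```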

ML Interpretation: Understanding coordinate transformations clarifies why certain feature representations are better than others for interpretation. In raw pixel space (original coordinates), classifier weights are pixel-by-pixel sensitivities (which pixels matter); in PCA coordinates (transformed coordinates), weights are sensitivities to principal components (which modes of variation matter). The PCA-transformed weights are often more interpretable: a large weight on the first PC (capturing edges/contrast) is more meaningful than scattered weights across pixels. In transfer learning, transforming a classifier trained on one domain (one coordinate system) to another domain (different coordinate system) involves changing the basis; fine-tuning is a form of basis adaptation. In neural networks, each layer is an implicit change of coordinates (hidden units form a basis); improving interpretability sometimes requires examining the representation at each layer (basis at each level) and understanding how information transforms across layers.

Failure Modes: (1) Confusing the transformation rule: weights transform by the inverse transpose of the coordinate map. With \(\mathbf{x}' = P \mathbf{x}\), applying the same matrix to the weights (\(\mathbf{w}' = P \mathbf{w}\)) is correct only when \(P\) is orthogonal; in general, \(\mathbf{w}' = (P^{-1})^\top \mathbf{w}\). (2) Non-orthogonal transformations: if \(P\) is not orthogonal (e.g., PCA projection plus scaling), the inverse transformation is \(P^{-1}\), not \(P^\top\). (3) Interpreting transformed weights naively: even though \(\mathbf{w}'\) are in a new coordinate system, the relative importance of coordinates depends on the variance of those coordinates; a large weight on a low-variance PC may matter less than a small weight on a high-variance PC.

Common Mistakes: (1) Using a weight transformation that does not match the coordinate convention: if coordinates transform as \(\mathbf{x}' = P \mathbf{x}\), weights need \((P^{-1})^\top\); if instead the columns of \(P\) are the new basis vectors (so \(\mathbf{x} = P \mathbf{x}'\)), weights need \(P^\top\). (2) Assuming transformed weights are “more correct” or “more interpretable” without validating on held-out data; interpretability is subjective, so always validate that transformed models generalize. (3) Forgetting that basis changes affect all parameters: not only weights but also biases and other parameters must be transformed (or not, depending on the transformation). (4) Applying basis changes without documenting which basis is used; in production, this leads to confusion and bugs.

Chapter Connections: Illustrates Definition 2.9 (Change-of-Basis Matrix) and the transformation rule for dual spaces: if vectors transform via \([\mathbf{x}]' = P [\mathbf{x}]\), then linear functionals (dual vectors like \(\mathbf{w}\)) transform via \([\mathbf{w}]' = P^{-\top} [\mathbf{w}]\). Uses Theorem 2.6 (Coordinate Isomorphism): the coordinate map is a linear isomorphism, preserving the structure of linear functionals. Relates to Chapter 3 (matrices representing linear transformations change under basis changes via similarity transformations \(A' = P^{-1} A P\); a weight vector is the \(1 \times n\) matrix of a linear functional, for which only the domain basis changes).


C.16. Identifying and Removing Collinear Features

Code:

def detect_collinearity(X, corr_threshold=0.90):
    """
    Detect collinear (highly correlated) feature pairs.
    High correlation indicates redundancy; one feature may be removable.
    """
    corr_matrix = np.corrcoef(X.T)  # Feature × feature correlation
    collinear_pairs = []
    for i in range(corr_matrix.shape[0]):
        for j in range(i + 1, corr_matrix.shape[1]):
            corr_ij = np.abs(corr_matrix[i, j])
            if corr_ij > corr_threshold:
                collinear_pairs.append((i, j, corr_ij))
    
    return collinear_pairs, corr_matrix

def remove_collinear_features(X, corr_threshold=0.90):
    """Remove one feature from each highly-correlated pair."""
    collinear_pairs, corr_matrix = detect_collinearity(X, corr_threshold)
    
    # Greedy removal: mark features to remove
    marked_for_removal = set()
    for i, j, corr_ij in collinear_pairs:
        if i not in marked_for_removal and j not in marked_for_removal:
            # Keep i, remove j (arbitrary choice; could prioritize by variance)
            marked_for_removal.add(j)
    
    # Return features not marked
    remaining_features = [i for i in range(X.shape[1]) if i not in marked_for_removal]
    X_reduced = X[:, remaining_features]
    
    print(f"Detected {len(collinear_pairs)} collinear pairs")
    print(f"Removed {len(marked_for_removal)} features")
    print(f"Original shape: {X.shape}, Reduced shape: {X_reduced.shape}")
    
    return X_reduced, remaining_features

# Example: synthetic data with collinearity
np.random.seed(42)
n_samples = 100
# Create 5 features: first 3 independent, last 2 are linear combinations
X_indep = np.random.randn(n_samples, 3)
feature_4 = X_indep[:, 0] + 0.05 * np.random.randn(n_samples)  # Nearly identical to feature 1
feature_5 = X_indep[:, 1] - 0.5 * X_indep[:, 2] + 0.05 * np.random.randn(n_samples)  # Linear combo
X = np.column_stack([X_indep, feature_4, feature_5])

collinear_pairs, corr = detect_collinearity(X, corr_threshold=0.90)
print(f"Collinear pairs: {collinear_pairs}")

X_reduced, remaining = remove_collinear_features(X, corr_threshold=0.90)
print(f"Remaining features: {remaining}")

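Expected Output: Detects features 1 and 4 (index pair (0, 3), correlation ≈ 0.999) and removes feature 4; features 2 and 5 (index pair (1, 4), population correlation ≈ 0.89) are flagged only if their sample correlation happens to exceed the 0.90 threshold.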

Numerical / Shape Notes: Correlation matrix is symmetric, diagonal = 1. Feature pair \((i, j)\) with \(|\text{corr}(i, j)| > 0.9\) indicates strong dependence.

Explanation: Collinearity (or multicollinearity in regression) occurs when features are linearly dependent or nearly dependent: one feature is a linear combination of others. Mathematically, when the dependence is exact the columns of the design matrix \(X\) are linearly dependent, so \(\text{rank}(X) < \text{number of columns}\), and the Gram matrix \(X^\top X\) is singular (non-invertible); when the dependence is only approximate, \(X^\top X\) is invertible but ill-conditioned. In terms of correlation, \(|\text{corr}(x_i, x_j)| \approx 1\) indicates pairwise redundancy: \(x_i \approx a\, x_j + b\) for constants \(a \ne 0\) and \(b\). Pairwise correlation cannot see dependencies that involve three or more features at once, so a clean correlation matrix does not prove full rank. Detecting and removing collinear features restores full rank and improves numerical stability of fitting algorithms (e.g., the unregularized inverse \((X^\top X)^{-1}\) in the normal equations becomes well-conditioned). Greedy removal (pick one feature from each collinear pair) is heuristic; optimal feature selection is NP-hard, but greedy often works in practice.

ML Interpretation: Collinearity inflates parameter uncertainty and reduces interpretability: in regression \(\mathbf{y} = X\mathbf{w}\), if \(x_1 \approx x_2\), then the fitted weights \(w_1, w_2\) are unstable (small data perturbations cause large weight changes). Removing redundant features stabilizes estimates and simplifies models. In feature engineering, detecting collinearity guides feature extraction: if two engineered features are highly correlated, one is largely redundant and can be dropped, reducing model complexity with little loss of information. In domain-driven modeling, collinearity may reveal measurement redundancy (e.g., the same length recorded in inches and in centimeters yields two nearly identical features; drop one). Warning: removing features trades a simpler, lower-variance model against a possible loss of information (higher bias); always validate on held-out data that removal doesn’t hurt downstream performance.

Failure Modes: (1) Threshold is arbitrary: different thresholds (0.8, 0.9, 0.95) yield different sets of removed features. Optimal threshold depends on the task and tolerance for redundancy. (2) Correlation captures only linear dependence: nonlinearly dependent features (e.g., \(x_2 = x_1^2\) with \(x_1\) distributed symmetrically about zero) can have near-zero correlation but are still redundant. Nonlinear methods (mutual information, kernel methods) detect such dependence. (3) Greedy removal is suboptimal: the choice of which feature to remove from a collinear pair is arbitrary; a better strategy is to remove the feature that contributes least to downstream task performance (feature importance). (4) Ignoring interaction terms: even if two features are not individually collinear, their interaction term \(x_1 \cdot x_2\) may be collinear with other features.
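
Failure mode (2) is easy to reproduce. The sketch below assumes synthetic standard-normal data (an illustrative choice, symmetric about zero): \(x_2 = x_1^2\) is fully determined by \(x_1\), yet their Pearson correlation is close to zero, while the correlation between \(|x_1|\) and \(x_2\) is high.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=5000)   # symmetric about zero (assumption of this sketch)
x2 = x1 ** 2                 # deterministic, but nonlinear, function of x1

# Pearson correlation only measures linear association
corr_linear = np.corrcoef(x1, x2)[0, 1]

# Re-expressing x1 as |x1| makes the relationship monotone, so correlation sees it
corr_abs = np.corrcoef(np.abs(x1), x2)[0, 1]

print(f"corr(x1, x1^2)   = {corr_linear:+.3f}")
print(f"corr(|x1|, x1^2) = {corr_abs:+.3f}")
```

A correlation-threshold filter would keep both `x1` and `x2` here, even though `x2` carries no information beyond `x1`.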

Common Mistakes: (1) Removing features based only on correlation, ignoring their predictive power; a collinear feature may be more predictive (higher variance, stronger signal) and should be kept. (2) Removing collinearity for interpretability in models where it’s not critical (e.g., random forests are robust to multicollinearity); the cost of feature removal (information loss) may outweigh the benefit (simplicity). (3) Detecting and removing collinearity on the full dataset, then training, causing data leakage; always fit collinearity detection on training data only. (4) Assuming correlation-based collinearity detection finds all redundancy; it misses nonlinear and higher-order dependencies.

Chapter Connections: Directly addresses rank deficiency (Definition 2.8, Theorem 2.4): collinear features reduce rank(\(X\)) below the number of columns, violating the full-column-rank assumption for unique least-squares solutions. Relates to Example 4 (diagnosing rank-deficiency in regression) and Example 9 (least-squares solvability and regularization when Null(\(X\)) \(\ne \{\mathbf{0}\}\)). Connects to Example 16 (null space characterization) and Example 10 (ridge regression as a way to handle collinearity via regularization).
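
Rank deficiency can exist even when no pair of features is highly correlated. A minimal sketch, assuming synthetic data in which a third feature is the exact sum of two independent ones (an illustrative construction): every pairwise correlation stays well below 0.9, yet the design matrix has rank 2, which the singular values reveal.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 2))          # two independent features
x3 = A[:, 0] + A[:, 1]                 # exact linear combination of both
X = np.column_stack([A, x3])

# Largest absolute off-diagonal correlation (about 1/sqrt(2) for this construction)
corr = np.corrcoef(X.T)
max_offdiag = np.max(np.abs(corr - np.eye(X.shape[1])))

# Rank-based check: the smallest singular value is numerically zero
singular_values = np.linalg.svd(X, compute_uv=False)
rank = np.linalg.matrix_rank(X)

print(f"max |pairwise correlation| = {max_offdiag:.3f}")
print(f"singular values            = {singular_values}")
print(f"rank(X) = {rank} of {X.shape[1]} columns")
```

In practice a pairwise correlation screen and a singular-value (or condition-number) check complement each other.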


C.17. Null Space Characterization of Solution Non-Uniqueness

Code:

def characterize_solution_set(X, y):
    """
    For a potentially underdetermined linear system X x = y,
    characterize the solution set as a particular + null space sum.
    """
    m, n = X.shape
    U, s, Vt = np.linalg.svd(X, full_matrices=True)
    
    # Numerical rank
    tol = np.finfo(float).eps * max(m, n) * s[0]
    rank = np.sum(s > tol)
    
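    # Pseudoinverse solution (minimum-norm solution)
    # With full_matrices=True, U is m×m and Vt is n×n while s has min(m, n) entries,
    # so assemble the pseudoinverse from the leading `rank` singular triplets only.
    X_pinv = Vt[:rank, :].T @ np.diag(1.0 / s[:rank]) @ U[:, :rank].T
    x_particular = X_pinv @ y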
    
    # Null space basis (right singular vectors corresponding to zero singular values)
    null_basis = Vt[rank:, :].T
    
    # Verify
    residual = np.linalg.norm(X @ x_particular - y)
    null_check = np.allclose(X @ null_basis, 0)
    
    print(f"Rank: {rank}, Nullity: {n - rank}")
    print(f"Particular solution norm: {np.linalg.norm(x_particular):.4f}")
    print(f"Residual for particular solution: {residual:.2e}")
    print(f"Null space basis shape: {null_basis.shape}")
    print(f"X @ null_basis ≈ 0: {null_check}")
    
    # All solutions: x = x_particular + c_1 * null_basis[:, 0] + ... + c_k * null_basis[:, k-1]
    print(f"Solution set forms affine subspace of dimension {n - rank}")
    
    return x_particular, null_basis, rank

# Example: underdetermined system
np.random.seed(42)
m, n = 5, 8
X = np.random.randn(m, n)
true_x = np.random.randn(n)
y = X @ true_x  # Consistent system

x_part, null_base, rank_X = characterize_solution_set(X, y)

# Verify: adding null space components gives other solutions
if null_base.shape[1] > 0:
    x_alt = x_part + 0.5 * null_base[:, 0]
    residual_alt = np.linalg.norm(X @ x_alt - y)
    print(f"\nAlternative solution (x_particular + 0.5 * null_basis[:,0]):")
    print(f"Residual: {residual_alt:.2e}")
    print(f"Is alternative solution? {residual_alt < 1e-10}")

Expected Output: Solution set is 3-dimensional affine subspace (nullity = 3); all solutions achieve identical residual (zero, since system is consistent).

Numerical / Shape Notes: Null basis \(\in \mathbb{R}^{8 \times 3}\) spans 3-dimensional space; solution set = \(\{\mathbf{x}_p + \mathbf{n} : \mathbf{n} \in \text{Null}(X)\}\) is affine (shifted linear subspace).

Explanation: The solution set to \(X\mathbf{x} = \mathbf{y}\) is either empty (inconsistent system: \(\mathbf{y} \notin \text{Col}(X)\)) or an affine subspace (consistent: \(\mathbf{y} \in \text{Col}(X)\)). When consistent, all solutions are \(\mathbf{x} = \mathbf{x}_p + \mathbf{n}\), where \(\mathbf{x}_p\) is any particular solution and \(\mathbf{n} \in \text{Null}(X)\) is arbitrary (these are “invisible” to \(X\): \(X(\mathbf{x}_p + \mathbf{n}) = X\mathbf{x}_p + X\mathbf{n} = \mathbf{y} + \mathbf{0} = \mathbf{y}\)). The null space is a \((\text{nullity})\)-dimensional linear subspace, so the solution set is a parallel \((\text{nullity})\)-dimensional affine subspace. The particular solution can be chosen as the minimum-norm solution (pseudoinverse solution), \(\mathbf{x}_p = X^+ \mathbf{y}\), which is the shortest zero-loss solution, a natural choice for regularization-free inverse problems.
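
The decomposition \(\mathbf{x} = \mathbf{x}_p + \mathbf{n}\) and the minimum-norm property can be verified directly. The sketch below assumes a random consistent \(5 \times 8\) system (placeholder data) and uses `np.linalg.pinv` for \(X^+\): the pseudoinverse solution is orthogonal to the null space, so any other solution is strictly longer by the Pythagorean theorem.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
y = X @ rng.normal(size=8)              # consistent by construction

x_p = np.linalg.pinv(X) @ y             # minimum-norm particular solution

_, s, Vt = np.linalg.svd(X)             # full SVD: Vt is 8 x 8
rank = int(np.sum(s > 1e-12 * s[0]))
N = Vt[rank:].T                         # orthonormal basis of Null(X), shape (8, 3)

c = rng.normal(size=N.shape[1])         # arbitrary null-space coefficients
x_other = x_p + N @ c                   # another exact solution

print("X @ x_other == y:       ", np.allclose(X @ x_other, y))
print("x_p orthogonal to Null: ", np.allclose(N.T @ x_p, 0.0))
print("||x_p||^2 + ||c||^2     =", np.linalg.norm(x_p) ** 2 + np.linalg.norm(c) ** 2)
print("||x_other||^2           =", np.linalg.norm(x_other) ** 2)
```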

ML Interpretation: In underdetermined regression (more features than samples), the solution set is high-dimensional, and many weight vectors achieve identical training loss. The particular (minimum-norm) solution is one representative, but it’s not unique. Different solutions may generalize differently to test data, so the choice of solution (equivalently, which null space directions to move along) matters for generalization. Regularization (ridge, LASSO, elastic net) implicitly selects one solution from this set by fixing the null space component; ridge in the limit of a vanishing penalty, for example, recovers the minimum-norm solution. Understanding the solution set is crucial for interpreting models: if two learned weight vectors differ significantly but achieve identical zero training loss, they differ along a null space direction (orthogonal to every training sample, i.e., to the row space of \(X\)), and either is equally valid from a training perspective but may have different interpretability. In inverse problems (e.g., image deblurring, where the forward model is ill-posed and underdetermined), the null space characterization reveals which information is recoverable: the data constrain only the component in the row space (perpendicular to the null space); null space components are unobservable and must be fixed via prior/regularization.
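
One concrete instance of regularization selecting a solution: as the ridge penalty shrinks toward zero, the ridge estimate approaches the minimum-norm solution. The following is a sketch under assumed random data, with a hypothetical helper `ridge` that solves the regularized normal equations.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))             # underdetermined: 5 samples, 8 features
y = X @ rng.normal(size=8)

x_min_norm = np.linalg.pinv(X) @ y

def ridge(X, y, lam):
    """Hypothetical helper: solve (X^T X + lam I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

lams = [1.0, 1e-2, 1e-4, 1e-6, 1e-8]
gaps = []
for lam in lams:
    gap = np.linalg.norm(ridge(X, y, lam) - x_min_norm)
    gaps.append(gap)
    print(f"lambda = {lam:.0e}: ||w_ridge - x_min_norm|| = {gap:.2e}")
```

The ridge solution always lies in the row space of \(X\), which is why its limit has no null space component.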

Failure Modes: (1) Numerical rank determination: the threshold for deciding which singular values are “zero” (counted toward the nullity) vs. “small but nonzero” (noise) is delicate. Use a relative tolerance (common heuristic: \(\text{tol} = \epsilon \cdot \max(m, n) \cdot \sigma_{\max}\), where \(\epsilon\) is machine precision). (2) Large null space may be expensive: if nullity is large (many free parameters), representing null basis explicitly (as a dense matrix) is memory-intensive. Sparse or implicit representations may be needed. (3) Ill-conditioned \(X\): if \(X\) is ill-conditioned (large condition number), the singular values show no clear gap, so the computed rank and null space basis are sensitive to the tolerance and to small perturbations of the data. Use double precision and robust libraries.

Common Mistakes: (1) Assuming there’s a unique solution when nullity > 0. (2) Ignoring the null space when interpreting learned models; treating one solution as “the” answer, unaware that infinitely many alternatives exist. (3) Forgetting to check system consistency before characterizing solutions; if \(\mathbf{y} \notin \text{Col}(X)\), there’s no solution, only a least-squares approximation. Always check residual (\(\| X\mathbf{x} - \mathbf{y} \| \approx 0\) for existence). (4) Using different particular solutions (e.g., the pseudoinverse solution vs. a basic solution from Gaussian elimination) and not recognizing that they differ only by a null space vector.

Chapter Connections: Directly applies Theorem 2.4 (Rank-Nullity): the dimension of the solution set equals nullity = \(n - \text{rank}(X)\). Uses Definition 2.5 (Null Space) and Definition 1.6 (Kernel): \(\text{Null}(X) = \ker(X)\) is a subspace. Connects to Example 8 (solvability of linear systems) and Example 9 (least-squares uniqueness depending on rank). Key example of how null space directly governs non-uniqueness and requires regularization.


C.18. Low-Rank Factorization for Implicit Regularization

Code:

def low_rank_factorization_als(X, rank, max_iter=100, tol=1e-6):
    """
    Alternating Least Squares (ALS) for low-rank factorization.
    Minimize ||X - U V^T||_F^2 with U ∈ ℝ^{m×k}, V ∈ ℝ^{n×k}.
    """
    m, n = X.shape
    np.random.seed(42)
    
    # Initialize U, V randomly
    U = np.random.randn(m, rank) * 0.1
    V = np.random.randn(n, rank) * 0.1
    
    errors = []
    for iteration in range(max_iter):
        # Fix V, solve for U: U = X V (V^T V)^{-1}
        VtV = V.T @ V + 1e-8 * np.eye(rank)  # Add regularization for stability
        U = X @ V @ np.linalg.inv(VtV)
        
        # Fix U, solve for V: V = X^T U (U^T U)^{-1}
        UtU = U.T @ U + 1e-8 * np.eye(rank)
        V = X.T @ U @ np.linalg.inv(UtU)
        
        # Compute error
        X_recon = U @ V.T
        error = np.linalg.norm(X - X_recon, 'fro')
        errors.append(error)
        
        if iteration % 20 == 0:
            print(f"Iteration {iteration}: error = {error:.6f}")
        
        # Check convergence
        if iteration > 0 and np.abs(errors[-1] - errors[-2]) < tol:
            print(f"Converged at iteration {iteration}")
            break
    
    return U, V, errors

# Example: low-rank matrix
np.random.seed(42)
m, n = 50, 40
X = np.random.randn(m, 5) @ np.random.randn(5, n)  # Rank-5 matrix
X += 0.01 * np.random.randn(m, n)  # Add noise

print("Low-rank factorization via ALS:")
U, V, errors = low_rank_factorization_als(X, rank=5, max_iter=100)
X_recon = U @ V.T
final_error = np.linalg.norm(X - X_recon, 'fro')
print(f"Final reconstruction error: {final_error:.6f}")
print(f"Factorization parameter count: {U.size + V.size} (vs. original {X.size})")

Expected Output: ALS converges to low-rank approx; reconstruction error decreases; parameter reduction from \(50 \times 40 = 2000\) to \((50+40) \times 5 = 450\) parameters.

Numerical / Shape Notes: Factorization \(X \approx UV^T\) constrains rank to at most \(k\); the bound is enforced by the shapes of \(U\) and \(V\), so ALS needs no separate rank constraint.

Explanation: Low-rank factorization \(X \approx U V^\top\) decomposes \(X\) as a product of rectangular matrices (\(U \in \mathbb{R}^{m \times k}, V \in \mathbb{R}^{n \times k}\) where \(k \ll \min(m,n)\)). The product has rank at most \(k\), so if \(X\) is near rank-\(k\), factorization recovers the signal. Alternating Least Squares (ALS) solves the optimization \(\min_{U,V} \| X - UV^\top \|_F^2\) by alternating: fix \(V\) and solve for \(U\) (a linear least-squares problem), then fix \(U\) and solve for \(V\), repeat until convergence. This is simple, efficient, and guaranteed not to increase the objective at any step (the joint problem is non-convex, so convergence to a global minimum is not guaranteed). The benefit: factorization uses \(k(m + n)\) parameters vs. \(mn\) for \(X\), yielding massive compression (e.g., for \(m = n = 10000, k = 20\): \(400{,}000\) parameters vs. \(100\) million, a 250-fold reduction). This compression is a form of implicit regularization: by restricting to rank-\(k\) space, the model has limited capacity to fit noise, which tends to help generalization.

ML Interpretation: Low-rank factorization is used in collaborative filtering (user-item matrices), matrix completion (recommendation, imputation), and dimensionality reduction. The factors \(U, V\) are often interpretable: \(U\) encodes “user latent profiles,” \(V\) encodes “item latent profiles,” and \(U_i \cdot V_j\) predicts the interaction (rating) between user \(i\) and item \(j\). The rank \(k\) is a hyperparameter controlling model complexity: large \(k\) fits training data better (lower bias) but risks overfitting (higher variance), while small \(k\) underfits. Cross-validation selects \(k\). Implicit regularization from the rank constraint can be more effective than explicit regularization (L2, dropout) in low-rank factorization, partly because the rank constraint is structurally aligned with the problem.

Failure Modes: (1) Convergence to local minima: the objective is non-convex in \((U, V)\) jointly, so ALS can stall at suboptimal stationary points (especially with poor initialization). Multiple random initializations or initialization from SVD (top-\(k\) singular vectors) improves robustness. (2) Slowly decaying spectrum: if the singular values of \(X\) decay slowly (no clear rank drop), choosing \(k\) is hard, and low-rank approximation incurs large error. (3) Scaling issues: if columns of \(X\) have vastly different scales, factorization may be biased toward fitting large-scale columns. Normalize before factorization.

Common Mistakes: (1) Treating \(k\) (factorization rank) as fixed without cross-validation; the optimal \(k\) depends on the data and task. (2) Not handling missing data correctly; ALS can be extended to missing data (imputation-style), but naive ALS on a matrix with explicit zeros treats them as signal, not missing. Use specialized matrix completion algorithms. (3) Confusing low-rank factorization (unsupervised compression) with supervised dimensionality reduction (e.g., linear discriminant analysis, which uses class labels); they optimize different objectives. (4) Assuming rank-\(k\) factorization preserves all information; for \(k < \text{rank}(X)\), there’s always approximation error, so choose \(k\) based on tolerance for error.

Chapter Connections: Connects to Definition 2.8 (Rank): by factorizing as \(UV^\top\), the product has rank \(\leq k\), and the algorithm searches for a good rank-\(k\) approximation. Relates to Eckart-Young Theorem (Theorem 3.2): truncated SVD is the optimal rank-\(k\) approximation in Frobenius norm; ALS finds a local optimum (often close to truncated SVD for well-separated singular values). Uses rank-nullity implicitly: restricting to rank-\(k\) space forces \(k\)-dimensional representation. Connects to Example 12 (truncated SVD) and Example 13 (parameter reduction via dimension constraints).
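
The Eckart-Young baseline gives a reference value against which any ALS result can be judged. The sketch below assumes the same kind of synthetic matrix as the example above (rank 5 plus small noise, regenerated here with its own seed): the truncated SVD error equals the root-sum-of-squares of the discarded singular values, and an arbitrary rank-\(k\) factorization (random \(V\), least-squares \(U\)) cannot beat it.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 50, 40, 5
X = rng.normal(size=(m, k)) @ rng.normal(size=(k, n))   # rank-5 signal
X += 0.01 * rng.normal(size=(m, n))                     # small noise

# Truncated SVD: optimal rank-k approximation in Frobenius norm
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_k = (U[:, :k] * s[:k]) @ Vt[:k]
err_svd = np.linalg.norm(X - X_k, 'fro')
err_tail = np.sqrt(np.sum(s[k:] ** 2))                  # predicted by Eckart-Young

# Some other rank-k factorization: random V0, then the best U0 for that V0
V0 = rng.normal(size=(n, k))
B, *_ = np.linalg.lstsq(V0, X.T, rcond=None)            # solves V0 @ B ≈ X^T
U0 = B.T
err_other = np.linalg.norm(X - U0 @ V0.T, 'fro')

print(f"truncated SVD error:        {err_svd:.6f}")
print(f"sqrt(sum of tail sigma^2):  {err_tail:.6f}")
print(f"random-V rank-{k} error:     {err_other:.6f}")
print(f"parameters: {k * (m + n)} (factors) vs. {m * n} (full matrix)")
```

Comparing the final ALS error to `err_svd` is a quick way to tell whether ALS has reached the global optimum or stalled.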


C.19. Isomorphism Between Polynomial Spaces and Coordinate Spaces

Code:

class PolynomialBasis:
    """Represent polynomials via monomial basis {1, x, x^2, ...}."""
    def __init__(self, coeffs):
        """coeffs: array of coefficients in monomial basis [c0, c1, c2, ...]."""
        self.coeffs = np.array(coeffs, dtype=float)
    
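    def degree(self):
        """Degree of polynomial: index of the highest nonzero coefficient (-1 for the zero polynomial)."""
        nonzero = np.flatnonzero(np.abs(self.coeffs) > 1e-10)  # Ignore trailing zero coefficients
        return int(nonzero[-1]) if nonzero.size else -1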
    
    def eval(self, x):
        """Evaluate polynomial at x."""
        return np.polyval(self.coeffs[::-1], x)
    
    def __add__(self, other):
        """Addition of polynomials."""
        max_len = max(len(self.coeffs), len(other.coeffs))
        c1 = np.pad(self.coeffs, (0, max_len - len(self.coeffs)))
        c2 = np.pad(other.coeffs, (0, max_len - len(other.coeffs)))
        return PolynomialBasis(c1 + c2)
    
    def __mul__(self, scalar):
        """Scalar multiplication."""
        return PolynomialBasis(self.coeffs * scalar)
    
    def __repr__(self):
        terms = []
        for i, c in enumerate(self.coeffs):
            if np.abs(c) > 1e-10:
                if i == 0:
                    terms.append(f"{c:.3f}")
                elif i == 1:
                    terms.append(f"{c:.3f}x")
                else:
                    terms.append(f"{c:.3f}x^{i}")
        return " + ".join(terms) if terms else "0"

# Example: isomorphism between P_2 (polynomials of degree ≤ 2) and R^3
print("Polynomial Space P_2 isomorphic to ℝ^3")
print("Basis of P_2: {1, x, x^2}")
print()

# Polynomial in P_2
p1 = PolynomialBasis([1, 2, 0])  # p1(x) = 1 + 2x
p2 = PolynomialBasis([0, 1, 3])  # p2(x) = x + 3x^2

print(f"p1(x) = {p1}, coordinates: [1, 2, 0]")
print(f"p2(x) = {p2}, coordinates: [0, 1, 3]")

# Addition
p3 = p1 + p2
print(f"p1 + p2 = {p3}, coordinates: {p3.coeffs}")

# Scalar multiplication
p4 = p1 * 2
print(f"2*p1 = {p4}, coordinates: {p4.coeffs}")

# Evaluation (isomorphism is structure-preserving)
x_test = 2.0
print(f"\nEvaluation at x = {x_test}:")
print(f"p1({x_test}) = {p1.eval(x_test)}, via coordinates: [1,2,0]·[1,{x_test},{x_test**2}] = {np.dot(p1.coeffs, [1, x_test, x_test**2])}")
print(f"p2({x_test}) = {p2.eval(x_test)}")
print(f"(p1+p2)({x_test}) = {p3.eval(x_test)}, equals p1+p2: {np.isclose(p3.eval(x_test), p1.eval(x_test) + p2.eval(x_test))}")

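Expected Output: Polynomial addition and scalar multiplication match the corresponding coordinate-vector operations (p1 + p2 has coordinates [1, 3, 3]; 2*p1 has coordinates [2, 4, 0]); evaluation at \(x = 2\) agrees with the dot product of the coefficient vector with \([1, x, x^2]\) (p1(2) = 5, p2(2) = 14, (p1+p2)(2) = 19).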

Numerical / Shape Notes: \(P_2 = \{a_0 + a_1 x + a_2 x^2 : a_i \in \mathbb{R}\} \cong \mathbb{R}^3\) is realized via coordinate map \(p(x) \mapsto [a_0, a_1, a_2]^T\).

Explanation: Two vector spaces \(V\) and \(W\) are isomorphic if there exists a bijective linear map \(\phi: V \to W\) preserving addition and scaling. The polynomial space \(\mathcal{P}_k = \{ a_0 + a_1 x + \cdots + a_k x^k : a_i \in \mathbb{R} \}\) has dimension \(k+1\) (basis: \(\{1, x, x^2, \dots, x^k\}\)). The coordinate map \(\phi(p) = [a_0, a_1, \dots, a_k]^\top\) (extraction of coefficients) is a linear isomorphism from \(\mathcal{P}_k\) to \(\mathbb{R}^{k+1}\). This means polynomial operations (addition, scaling) correspond exactly to coordinate vector operations (componentwise). For example, \((p + q)(x) = \sum_i (a_i + b_i) x^i\) corresponds to \([a_i + b_i]_{i=0}^k\), the coordinate sum. Computationally, this justifies representing polynomials as coefficient vectors and using linear algebra routines (matrix operations) to manipulate them.

ML Interpretation: Isomorphisms show that the algebraic structure of polynomial spaces is the same as Euclidean spaces \(\mathbb{R}^n\). This implies that linear algebra tools (eigendecomposition, least-squares, basis change) can be applied to polynomial problems. For example, finding the best polynomial fit of degree \(k\) to data (minimizing \(\sum (y_i - p(x_i))^2\)) is a linear regression problem in the space \(\mathcal{P}_k\), solvable via normal equations or QR decomposition. Polynomial kernels in machine learning (used in SVMs, kernel ridge regression) leverage this isomorphism: a polynomial kernel \(K(x, z) = (x^\top z + 1)^d\) implicitly maps data into a high-dimensional polynomial feature space, then applies linear methods in that space. Understanding the isomorphism clarifies why polynomial methods are both interesting (they extend linear models) and tractable computationally (exploiting the polynomial-to-vector isomorphism).
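
The reduction of polynomial fitting to linear regression can be sketched as follows, assuming synthetic noisy samples of a known quadratic (the coefficients and noise level are illustrative choices). The design matrix is the Vandermonde matrix whose columns are the monomial basis evaluated at the data points, and the least-squares solution is the coordinate vector of the fitted polynomial.

```python
import numpy as np

rng = np.random.default_rng(0)
true_coeffs = np.array([1.0, -2.0, 0.5])          # p(x) = 1 - 2x + 0.5 x^2
x = np.linspace(-1.0, 1.0, 30)
y = np.polynomial.polynomial.polyval(x, true_coeffs) + 0.01 * rng.normal(size=x.size)

# Columns are [1, x, x^2]: the monomial basis of P_2 evaluated at the samples
Phi = np.vander(x, N=3, increasing=True)

# Ordinary least squares in coordinate space R^3
coeffs_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print("true coordinates:  ", true_coeffs)
print("fitted coordinates:", coeffs_hat.round(3))
```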

Failure Modes: (1) Overfitting in polynomial regression: although the isomorphism is clean mathematically, increasing polynomial degree \(k\) increases model complexity and sample complexity. Cross-validation is essential to avoid overfitting. (2) Numerical instability in monomial basis: the monomial basis \(\{1, x, x^2, \dots, x^k\}\) is poorly conditioned for large \(k\); monomials of different degrees have vastly different magnitudes and become nearly linearly dependent on the sample points, leading to numerical errors. Orthogonal bases (Chebyshev, Legendre) are more stable for high \(k\). (3) Extrapolation outside training data: polynomial models fit well on training data but grow rapidly outside the data range, and high-degree fits on equally spaced points can also oscillate wildly near the ends of the range (Runge phenomenon). Regularization or data normalization mitigates this.
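
Failure mode (2) can be quantified by comparing condition numbers of the design matrix in two bases. This sketch assumes 1000 equally spaced sample points on \([-1, 1]\) (an illustrative choice) and uses NumPy's `chebvander` for the Chebyshev basis.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

x = np.linspace(-1.0, 1.0, 1000)
cond_mono = {}
cond_cheb = {}
for k in (5, 10, 15, 20):
    V_mono = np.vander(x, N=k + 1, increasing=True)   # columns 1, x, ..., x^k
    V_cheb = C.chebvander(x, k)                       # columns T_0(x), ..., T_k(x)
    cond_mono[k] = np.linalg.cond(V_mono)
    cond_cheb[k] = np.linalg.cond(V_cheb)
    print(f"degree {k:2d}: cond(monomial) = {cond_mono[k]:.2e}, "
          f"cond(Chebyshev) = {cond_cheb[k]:.2e}")
```

Both matrices span the same space \(\mathcal{P}_k\); only the coordinate system differs, which is exactly why the choice of basis matters numerically.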

Common Mistakes: (1) Assuming all operations in coordinate space are valid in polynomial space; the isomorphism is structure-preserving, but not all coordinate operations have meaning in polynomial space (e.g., element-wise max of coefficient vectors doesn’t correspond to a polynomial operation). (2) Confusing the isomorphism with equality; \(\mathcal{P}_k\) and \(\mathbb{R}^{k+1}\) are different sets, but isomorphic (there’s a translation, the coordinate map). (3) Using naive monomial basis for polynomial regression; use orthogonal polynomial bases (Hermite, Legendre) for stability. (4) Not validating on test data when using high-degree polynomials; it’s easy to overfit, and the fit may look perfect on training data but fail on unseen data.

Chapter Connections: Illustrates Theorem 2.6 (Coordinate Isomorphism): the coordinate map \([\cdot]_\mathcal{B}: \mathcal{P}_k \to \mathbb{R}^{k+1}\) is a linear isomorphism, making \(\mathcal{P}_k\) and \(\mathbb{R}^{k+1}\) indistinguishable from an algebraic perspective. Uses Definition 2.1 (Dimension): \(\dim(\mathcal{P}_k) = k+1\), the size of the monomial basis. Shows that basis (Definition 2.2) applies to polynomial spaces, and choosing different bases (monomial, Chebyshev, Legendre) corresponds to different coordinate systems. Relates to Example 6 (basis choices changing computation convenience).
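
Switching between polynomial bases is an ordinary change of basis. As a sketch (the matrix name `M` and the test polynomial are illustrative), the change-of-basis matrix from Chebyshev to monomial coordinates on \(\mathcal{P}_3\) can be assembled column by column: column \(j\) holds the monomial coordinates of \(T_j\).

```python
import numpy as np
from numpy.polynomial import chebyshev as C
from numpy.polynomial import polynomial as Poly

k = 3
M = np.zeros((k + 1, k + 1))
for j in range(k + 1):
    e_j = np.zeros(k + 1)
    e_j[j] = 1.0                       # Chebyshev coordinates of T_j
    mono = C.cheb2poly(e_j)            # monomial coefficients of T_j (may be shorter than k+1)
    M[:len(mono), j] = mono

# A polynomial given in Chebyshev coordinates: 1*T_0 + 2*T_1 + 0*T_2 - 1*T_3
c_cheb = np.array([1.0, 2.0, 0.0, -1.0])
c_mono = M @ c_cheb                    # same polynomial in monomial coordinates

x0 = 0.3
val_cheb = C.chebval(x0, c_cheb)
val_mono = Poly.polyval(x0, c_mono)

print("Change-of-basis matrix (Chebyshev -> monomial):")
print(M)
print("monomial coordinates:", c_mono)
print("values agree:", np.isclose(val_cheb, val_mono))
```

The matrix is upper triangular because \(T_j\) has degree exactly \(j\), and it is invertible, as every change-of-basis matrix must be.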


C.20. Dimension and Generalization in Kernel Methods

Code:

def kernel_gram_matrix(X, kernel='rbf', param=0.1):
    """
    Compute Gram matrix for various kernels.
    Kernel methods implicitly map data into high-(or infinite-) dimensional space.
    """
    n = X.shape[0]
    if kernel == 'linear':
        K = X @ X.T
    elif kernel == 'poly':
        # Polynomial kernel: (x·z + 1)^d
        degree = param
        K = (X @ X.T + 1) ** degree
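    elif kernel == 'rbf':
        # RBF (Gaussian) kernel: exp(-gamma ||x-z||^2)
        gamma = param
        sq_norms = np.sum(X**2, axis=1)
        # ||x - z||^2 = ||x||^2 - 2 x·z + ||z||^2, computed for all pairs via broadcasting
        sq_dists = sq_norms[:, None] - 2 * (X @ X.T) + sq_norms[None, :]
        sq_dists = np.maximum(sq_dists, 0.0)  # Clip tiny negatives caused by round-off
        K = np.exp(-gamma * sq_dists)
    else:
        raise ValueError(f"Unknown kernel: {kernel}")
    return K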

# Example: compare Gram matrices for different kernels
np.random.seed(42)
X = np.random.randn(50, 10)

kernels = [('linear', None), ('poly', 2), ('poly', 3), ('rbf', 0.1), ('rbf', 1.0)]

for kernel_name, param in kernels:
    K = kernel_gram_matrix(X, kernel=kernel_name, param=param)
    rank_K = np.linalg.matrix_rank(K)
    evals = np.linalg.eigvalsh(K)
    evals = np.sort(evals)[::-1]
    fro_norm = np.linalg.norm(K, 'fro')
    
    print(f"Kernel: {kernel_name:10s} (param={param or 'N/A'})")
    print(f"  Gram rank: {rank_K}, Frobenius norm: {fro_norm:.2f}")
    print(f"  Top 5 eigenvalues: {evals[:5].round(3)}")
    print()

# Intuition: Gram matrix rank indicates effective dimensionality
# Linear kernel in original space: rank ≤ 10 (input dim)
# RBF kernel: implicit infinite-dimensional space, Gram rank = 50 (n_samples, generically full rank)

Expected Output: Linear kernel has rank ≤ 10 (input dimension); polynomial/RBF kernels have higher ranks (feature space dimension higher than input).

Numerical / Shape Notes: Gram matrix \(K \in \mathbb{R}^{50 \times 50}\) encodes pairwise similarities. Rank indicates effective dimensionality of implicit feature space.

Explanation: Kernel methods replace explicit feature vectors \(\phi(\mathbf{x})\) with a kernel function \(K(\mathbf{x}, \mathbf{z}) = \langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle\). The Gram matrix \(K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)\) encodes all pairwise inner products in the implicit feature space. The rank of the Gram matrix equals the dimension of the span of \(\{\phi(\mathbf{x}_i)\}_{i=1}^n\) in the feature space. Linear kernel \(K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^\top \mathbf{z}\) has \(\text{rank}(K) = \text{rank}(X) \le \min(n, d)\) (at most the input dimension), reflecting that the implicit feature space is just the input space. Polynomial kernel \((1 + \mathbf{x}^\top \mathbf{z})^p\) corresponds to polynomial features of degree at most \(p\), a higher-dimensional space (dimension \(\binom{d+p}{p} = O(d^p)\) for fixed \(p\)). RBF kernel \(\exp(-\gamma \| \mathbf{x} - \mathbf{z} \|^2)\) implicitly maps to an infinite-dimensional space, so the Gram matrix has full rank whenever the data points are distinct (all \(n\) eigenvalues are positive, although many may be numerically tiny). The effective dimensionality of the learned model is bounded by the number of support vectors (samples with nonzero weights), not the feature space dimension, enabling generalization despite high implicit dimensionality.
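
The rank statements above can be checked on a small case. The sketch assumes random data with \(n = 50\), \(d = 3\), and polynomial degree \(p = 2\) (chosen so that the feature dimension \(\binom{d+p}{p} = 10\) is below \(n\)); `numerical_rank` is a hypothetical helper that counts eigenvalues above a relative tolerance.

```python
import numpy as np
from math import comb

def numerical_rank(K, rel_tol=1e-9):
    """Hypothetical helper: count eigenvalues of a symmetric PSD matrix above rel_tol * largest."""
    evals = np.linalg.eigvalsh(K)
    return int(np.sum(evals > rel_tol * evals.max()))

rng = np.random.default_rng(0)
n, d, p = 50, 3, 2
X = rng.normal(size=(n, d))

K_lin = X @ X.T                      # linear kernel
K_poly = (X @ X.T + 1.0) ** p        # polynomial kernel (x·z + 1)^p, elementwise power

rank_X = np.linalg.matrix_rank(X)
rank_lin = numerical_rank(K_lin)
rank_poly = numerical_rank(K_poly)
feature_dim = comb(d + p, p)         # number of monomials of degree <= p in d variables

print(f"rank(X) = {rank_X}, rank(K_linear) = {rank_lin}")
print(f"polynomial feature dimension = {feature_dim}, rank(K_poly) = {rank_poly}")
```

With \(d = 10\) as in the example above, degree 2 already gives \(\binom{12}{2} = 66 > 50\) features, so the polynomial Gram matrix there is limited by \(n\) rather than by the feature dimension.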

ML Interpretation: Kernel methods decouple the input dimension from the feature space dimension: high-dimensional feature spaces (via kernels) can be tractable computationally because algorithms use only pairwise similarities \(K(\mathbf{x}_i, \mathbf{x}_j)\), not explicit features. This enables nonlinear classification/regression (via nonlinear kernels) without creating explicit nonlinear features. The Gram matrix rank indicates effective dimension: if the rank is low, the kernel has mapped data into a low-dimensional space, limiting model complexity (good for generalization). If the rank is high (e.g., full rank for RBF kernels), the implicit space is high-dimensional, increasing capacity and risk of overfitting. Regularization (SVM margin, kernel ridge regression penalty) controls generalization despite high implicit dimension. The choice of kernel is crucial: different kernels encode different notions of similarity, and the “right” kernel depends on the data (domain knowledge or cross-validation). Gram matrix analysis (rank, eigenvalue spectrum) is a diagnostic tool for understanding kernel suitability and model capacity.

Failure Modes: (1) Gram matrix is expensive: computing a full Gram matrix takes \(O(n^2 d)\) time and \(O(n^2)\) space. For large datasets, use approximate methods (Nyström approximation, random Fourier features) or online kernel methods. (2) Kernel parameter tuning is critical: different kernel parameters (polynomial degree, RBF bandwidth \(\gamma\)) yield vastly different Gram matrices and model behavior. Cross-validation is essential; default parameters rarely work well. (3) Gram matrix conditioning: kernel Gram matrices can be ill-conditioned (eigenvalues vary widely), causing numerical instability in solvers. Regularization (adding \(\lambda I\) to \(K\)) improves conditioning.
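
The effect of adding \(\lambda I\) in failure mode (3) can be seen directly. The sketch below assumes random 2-D data, an RBF kernel with \(\gamma = 0.1\), and \(\lambda = 10^{-3}\) (all illustrative values): the regularized matrix has condition number at most \((\lambda_{\max} + \lambda)/\lambda\), regardless of how small the smallest eigenvalue of \(K\) is.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 2))

# RBF Gram matrix with a wide kernel (small gamma), which tends to be ill-conditioned
sq_norms = np.sum(X ** 2, axis=1)
sq_dists = np.maximum(sq_norms[:, None] - 2.0 * (X @ X.T) + sq_norms[None, :], 0.0)
K = np.exp(-0.1 * sq_dists)

lam = 1e-3
cond_raw = np.linalg.cond(K)
cond_reg = np.linalg.cond(K + lam * np.eye(n))

print(f"cond(K)            = {cond_raw:.2e}")
print(f"cond(K + lam * I)  = {cond_reg:.2e}")

# Kernel ridge regression coefficients for some targets y: (K + lam I) alpha = y
y = np.sin(X[:, 0])
alpha = np.linalg.solve(K + lam * np.eye(n), y)
train_residual = np.linalg.norm(K @ alpha - y)
print(f"training residual  = {train_residual:.2e}")
```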

Common Mistakes: (1) Ignoring Gram matrix rank: assuming the feature space dimension is “large,” without checking actual rank of Gram matrix. For some kernels on specific data, the rank may be surprisingly low, limiting capacity. (2) Not validating kernel choice: using a standard kernel (e.g., RBF) without checking if it’s appropriate for the problem. Domain-specific kernels (e.g., string kernels for text, graph kernels for structured data) often outperform generic kernels. (3) Skipping kernel parameter tuning: kernel parameters (bandwidth, degree) control the effective dimensionality; there’s no universal “best” value, so tune per problem. (4) Assuming small Gram matrix rank means underfitting: low rank limits capacity, but a low-dimensional feature span can still capture the signal if the target varies mainly along those directions; the bias-variance trade-off must be managed via validation.

Chapter Connections: Uses Definition 2.8 (Rank) and Gram matrix rank = rank of data in implicit feature space: \(\text{rank}(K) = \dim(\text{span}\{\phi(\mathbf{x}_i)\}_{i=1}^n)\). Connects to Theorem 3.2 (SVD): spectral decomposition of Gram matrix (eigendecomposition) reveals the principal directions of variation in the implicit feature space (kernel PCA). Related to Example 12 (dimensionality reduction via truncated SVD, analogous to keeping top \(k\) eigenvalues of Gram matrix). Fundamental to Example 14 (autoencoders learning implicit geometry) and the tension between implicit dimension (feature space) and effective dimension (support vectors, regularization).


End of C Solutions


APPENDICES

Appendix A: Notation Summary

Vector and Matrix Conventions

Notation Meaning Example
\(\mathbf{v}, \mathbf{x}, \mathbf{w}\) Column vectors in \(\mathbb{R}^n\) \(\mathbf{v} = (v_1, v_2, \ldots, v_n)^\top\)
\(\mathbf{v}^\top, \mathbf{x}^\top\) Transpose (row vector) If \(\mathbf{v} \in \mathbb{R}^n\), then \(\mathbf{v}^\top \in \mathbb{R}^{1 \times n}\)
\(A, B, W\) Matrices \(A \in \mathbb{R}^{m \times n}\) (m rows, n columns)
\(A_{ij}, A(i,j)\) Element in row \(i\), column \(j\) \(A_{12}\) = entry in position (1,2)
\(A(:, j)\) \(j\)-th column of \(A\) Column vector \(\mathbf{a}_j \in \mathbb{R}^m\)
\(A(i, :)\) \(i\)-th row of \(A\) Row vector (or column, depending on transpose)
\(A_{1:k, 1:j}\) Submatrix from rows 1 to k, columns 1 to j Extracting upper-left block
\(I, I_n\) Identity matrix (n×n) \((I)_{ij} = \delta_{ij}\) (Kronecker delta)
\(\mathbf{0}\) Zero vector or matrix All entries are 0
\(\mathbf{1}\) All-ones vector \((1, 1, \ldots, 1)^\top\)
\(A^{-1}\) Matrix inverse \(AA^{-1} = I\) (only square, full-rank)
\(A^\dagger\) (or \(A^+\)) Pseudoinverse \(A = U\Sigma V^\top \Rightarrow A^\dagger = V\Sigma^\dagger U^\top\), where \(\Sigma^\dagger\) inverts only the nonzero singular values
\(A^\top\) Transpose \((A^\top)_{ij} = A_{ji}\)
\(A^{1/2}, A^{-1/2}\) Matrix square root/inverse root Via eigendecomposition: \(A = U\Lambda U^\top \Rightarrow A^{1/2} = U\Lambda^{1/2}U^\top\)

Spaces, Subspaces, and Subsets

Notation Meaning Context
\(\mathbb{R}^n\) \(n\)-dimensional Euclidean space Vector space over reals
\(\mathbb{C}^n\) Complex vectors Extensions of real linear algebra
\(V, U, W\) Vector spaces or subspaces Abstract or concrete
\(\text{span}(S)\) Span of set \(S\) Linear combinations of elements in \(S\)
\(\text{Col}(A)\) Column space of \(A\) \(\text{span}(A(:,1), \ldots, A(:,n))\)
\(\text{Row}(A)\) Row space of \(A\) Span of rows; equals \(\text{Col}(A^\top)\)
\(\text{Null}(A)\) or \(\ker(A)\) Null space (kernel) of \(A\) \(\{\mathbf{x} : A\mathbf{x} = \mathbf{0}\}\)
\(\mathcal{B} = \{\mathbf{b}_1, \ldots, \mathbf{b}_n\}\) Basis for space Minimal spanning set
\(\dim(V)\) Dimension of \(V\) Size of any basis
\(V \oplus W\) Direct sum of subspaces Requires \(V \cap W = \{\mathbf{0}\}\), so every vector in the sum decomposes uniquely; orthogonality is not required

Linear Transformations and Maps

Notation Meaning Formula
\(T: V \to W\) Linear map from \(V\) to \(W\) Preserves addition & scaling: \(T(\alpha\mathbf{v} + \beta\mathbf{w}) = \alpha T(\mathbf{v}) + \beta T(\mathbf{w})\)
\([T]_{\mathcal{B}, \mathcal{C}}\) Matrix of \(T\) w.r.t. bases \(\mathcal{B}, \mathcal{C}\) Column \(j\) is coordinates of \(T(\mathbf{b}_j)\) in \(\mathcal{C}\)
\(\text{rank}(T), \text{rank}(A)\) Rank (dimension of image) \(\text{rank}(A) = \dim(\text{Col}(A)) = \dim(\text{Row}(A))\)
\(\text{nullity}(T)\) Nullity (dimension of kernel) \(\text{nullity}(T) = \dim(\ker(T))\)
\(P_{\mathcal{B} \to \mathcal{B}'}\) Change-of-basis matrix \([\mathbf{v}]_{\mathcal{B}'} = P_{\mathcal{B} \to \mathcal{B}'} [\mathbf{v}]_\mathcal{B}\)
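
As a small illustration of the change-of-basis row, the sketch below assumes both bases are given in standard coordinates as the columns of matrices `B` and `B_prime` (hypothetical example values). Since \(\mathbf{v} = B[\mathbf{v}]_\mathcal{B} = B'[\mathbf{v}]_{\mathcal{B}'}\), the change-of-basis matrix is \(P_{\mathcal{B} \to \mathcal{B}'} = (B')^{-1}B\).

```python
import numpy as np

# Columns are the basis vectors in standard coordinates (hypothetical example bases).
B = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B_prime = np.array([[2.0, 0.0],
                    [1.0, 1.0]])

# v = B [v]_B = B' [v]_B'  =>  [v]_B' = (B')^{-1} B [v]_B
P = np.linalg.solve(B_prime, B)      # P_{B -> B'}, computed without forming an inverse

v_B = np.array([3.0, -1.0])          # coordinates of v in basis B
v_Bp = P @ v_B                       # coordinates of the same v in basis B'

# Both coordinate vectors must reconstruct the same underlying vector.
v_from_B = B @ v_B
v_from_Bp = B_prime @ v_Bp
print(v_from_B, v_from_Bp)
```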

Norms and Inner Products

Notation Meaning Formula
\(\langle \mathbf{u}, \mathbf{v} \rangle\) Inner product \(\mathbf{u}^\top \mathbf{v} = \sum_i u_i v_i\) (standard Euclidean)
\(\|\mathbf{v}\|, \|\mathbf{v}\|_2\) Euclidean norm \(\sqrt{\mathbf{v}^\top \mathbf{v}} = \sqrt{\sum_i v_i^2}\)
\(\|\mathbf{v}\|_1\) L1 norm (Manhattan) \(\sum_i |v_i|\)
\(\|\mathbf{v}\|_\infty\) L∞ norm (max) \(\max_i |v_i|\)
\(\|A\|_F\) Frobenius norm \(\sqrt{\sum_{ij} A_{ij}^2} = \sqrt{\text{trace}(A^\top A)}\)
\(\|A\|_2\) Spectral norm \(\max_{\|\mathbf{x}\|=1} \|A\mathbf{x}\|\) (largest singular value)
\(\|\mathbf{u} - \mathbf{v}\|\) Euclidean distance Distance between two vectors
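
The formulas in this table map directly onto `np.linalg.norm`. The following sketch uses made-up values for `v` and `A` and checks the two matrix-norm identities listed above.

```python
import numpy as np

v = np.array([3.0, -4.0, 0.0])
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

l1 = np.linalg.norm(v, 1)            # sum of |v_i|
l2 = np.linalg.norm(v)               # sqrt(sum of v_i^2)
linf = np.linalg.norm(v, np.inf)     # max of |v_i|

fro = np.linalg.norm(A, 'fro')       # Frobenius norm
spec = np.linalg.norm(A, 2)          # spectral norm
s = np.linalg.svd(A, compute_uv=False)

print(l1, l2, linf)
print(np.isclose(fro, np.sqrt(np.trace(A.T @ A))), np.isclose(spec, s[0]))
```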

Eigenvalues, Eigenvectors, and Decompositions

Notation Meaning Definition
\(\lambda\) Eigenvalue Scalar \(\lambda\) s.t. \(A\mathbf{v} = \lambda \mathbf{v}\) for non-zero \(\mathbf{v}\)
\(\mathbf{v}\) Eigenvector Non-zero \(\mathbf{v}\) s.t. \(A\mathbf{v} = \lambda \mathbf{v}\)
\(\Lambda, \Sigma\) Diagonal matrix of eigenvalues or singular values \(\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n)\)
\(U, V\) Orthonormal eigenvector/singular vector matrices Columns form orthonormal basis
\(A = U\Lambda U^\top\) Eigendecomposition (symmetric \(A\)) \(\Lambda\) diagonal, \(U\) orthogonal
\(A = U\Sigma V^\top\) SVD (any \(A\)) Compact form with \(r = \text{rank}(A)\): \(U \in \mathbb{R}^{m \times r}\), \(V \in \mathbb{R}^{n \times r}\), \(\Sigma \in \mathbb{R}^{r \times r}\) diagonal
\(A = LL^\top\) Cholesky decomposition (positive-definite \(A\)) \(L\) lower-triangular
\(A = QR\) QR decomposition \(Q\) has orthonormal columns, \(R\) upper-triangular

Multilinear Algebra

Notation Meaning Context
\(\mathbf{X} \in \mathbb{R}^{n \times d}\) Data matrix Rows = samples, columns = features (the convention used in Appendix C; some texts use the transpose)
\(\Sigma, \Gamma\) Covariance matrix \(\Sigma = \mathbb{E}[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^\top]\)
\(\nabla f(\mathbf{x})\) Gradient (vector of partial derivatives) \((\nabla f)_i = \frac{\partial f}{\partial x_i}\)
\(H_f(\mathbf{x})\) or \(\nabla^2 f\) Hessian matrix \((H)_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}\)
\(\text{trace}(A)\) Trace (sum of diagonal elements) \(\text{tr}(A) = \sum_i A_{ii}\)
\(\det(A)\) Determinant Product of eigenvalues; \(= 0\) iff singular
\(\text{vec}(A)\) Vectorization (stack columns) Convert matrix to vector

ML-Specific Notation

Notation Meaning Context
\(X, Y\) Feature matrix, label/target matrix \(X \in \mathbb{R}^{n \times d}\) (samples × features)
\(\mathbf{w}, \boldsymbol{\beta}\) Weight or coefficient vector Parameters of linear model
\(\mathbf{y}, \hat{\mathbf{y}}\) True targets, predicted targets \(\mathbf{y} \in \mathbb{R}^n\), \(\hat{\mathbf{y}} = X\mathbf{w}\)
\(L(\mathbf{w})\) or \(\ell(y, \hat{y})\) Loss function, per-sample loss Measures prediction error
\(\lambda\) Regularization parameter Penalty on parameter magnitude (ridge: \(\lambda \|\mathbf{w}\|^2\))
\(\mathcal{H}\) Hypothesis class Set of functions model can represent
\(K(\mathbf{x}, \mathbf{x}')\) Kernel function Defines inner product in implicit feature space

Constants and Abbreviations

Symbol Meaning
\(\epsilon\) Small positive number (threshold, tolerance)
\(\delta_{ij}\) Kronecker delta (\(= 1\) if \(i=j\), else \(0\))
\(\propto\) Proportional to
\(\sim\) Distributed as (probability)
\(\approx\) Approximately equal
SVD Singular Value Decomposition
PCA Principal Component Analysis
QR QR decomposition (orthonormal-triangular)
ML Machine Learning
RREF Reduced Row Echelon Form

Appendix B: Supplementary Proofs

B.1 Proof of Rank-Nullity Theorem (Detailed)

Theorem (Rank-Nullity): Let \(T: V \to W\) be a linear map between finite-dimensional vector spaces. Then: \[\dim(V) = \dim(\ker T) + \dim(\text{im} T) = \text{nullity}(T) + \text{rank}(T)\]

Proof:

Step 1: Basis for kernel.
Let \(k = \dim(\ker T) \geq 0\). If \(k > 0\), choose a basis \(\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}\) for \(\ker T\). If \(k = 0\) (trivial kernel), the set is empty.

Step 2: Extend to basis of \(V\).
By the Basis Extension Theorem, extend \(\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}\) to a basis \(\{\mathbf{v}_1, \ldots, \mathbf{v}_k, \mathbf{u}_1, \ldots, \mathbf{u}_r\}\) of \(V\), where \(r \geq 0\) and \(k + r = \dim(V)\).

Step 3: Show \(\{T(\mathbf{u}_1), \ldots, T(\mathbf{u}_r)\}\) spans \(\text{im}(T)\).
Any \(\mathbf{w} \in \text{im}(T)\) satisfies \(\mathbf{w} = T(\mathbf{v})\) for some \(\mathbf{v} \in V\). Write: \[\mathbf{v} = \sum_{i=1}^k \alpha_i \mathbf{v}_i + \sum_{j=1}^r \beta_j \mathbf{u}_j\] Then: \[\mathbf{w} = T(\mathbf{v}) = \sum_{i=1}^k \alpha_i T(\mathbf{v}_i) + \sum_{j=1}^r \beta_j T(\mathbf{u}_j) = \sum_{j=1}^r \beta_j T(\mathbf{u}_j)\] since \(\mathbf{v}_i \in \ker T\) implies \(T(\mathbf{v}_i) = \mathbf{0}\). Thus \(\{T(\mathbf{u}_1), \ldots, T(\mathbf{u}_r)\}\) spans \(\text{im}(T)\).

Step 4: Show \(\{T(\mathbf{u}_1), \ldots, T(\mathbf{u}_r)\}\) is linearly independent.
Suppose \(\sum_{j=1}^r \gamma_j T(\mathbf{u}_j) = \mathbf{0}\). Then: \[T\left(\sum_{j=1}^r \gamma_j \mathbf{u}_j\right) = \mathbf{0}\] So \(\sum_{j=1}^r \gamma_j \mathbf{u}_j \in \ker T\). Since \(\ker T\) has basis \(\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}\): \[\sum_{j=1}^r \gamma_j \mathbf{u}_j = \sum_{i=1}^k \alpha_i \mathbf{v}_i\] Rearranging: \[\sum_{i=1}^k \alpha_i \mathbf{v}_i + \sum_{j=1}^r (-\gamma_j) \mathbf{u}_j = \mathbf{0}\] By linear independence of the full basis \(\{\mathbf{v}_1, \ldots, \mathbf{v}_k, \mathbf{u}_1, \ldots, \mathbf{u}_r\}\), all coefficients are zero, including \(\gamma_j = 0\) for all \(j\). Thus \(\{T(\mathbf{u}_1), \ldots, T(\mathbf{u}_r)\}\) is linearly independent.

Step 5: Conclude.
We have \(\dim(\text{im} T) = r\) and \(\dim(\ker T) = k\), so: \[\dim(\ker T) + \dim(\text{im} T) = k + r = \dim(V)\] \(\square\)
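
The theorem can be sanity-checked numerically. The sketch below assumes a synthetic 4 × 6 matrix of rank 3 and reads both the rank and a kernel basis off the full SVD; the relative threshold `1e-10` is an arbitrary choice for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
# A 4 x 6 matrix of rank 3, built as a product of 4 x 3 and 3 x 6 factors.
A = rng.normal(size=(4, 3)) @ rng.normal(size=(3, 6))
n = A.shape[1]                       # dimension of the domain R^6

U, s, Vt = np.linalg.svd(A)          # full SVD: Vt is 6 x 6
rank = int(np.sum(s > 1e-10 * s[0])) # illustrative relative threshold

# Rows of Vt beyond the first `rank` form an orthonormal basis of ker(A).
null_basis = Vt[rank:].T             # shape (n, n - rank)
nullity = null_basis.shape[1]

print(rank, nullity, n)
```

Here the kernel basis plays the role of \(\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}\) in the proof, and the first `rank` rows of `Vt` play the role of \(\{\mathbf{u}_1, \ldots, \mathbf{u}_r\}\).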


B.2 Theorem: Dimension Equivalence for Finite-Dimensional Spaces

Theorem: Two finite-dimensional vector spaces \(V\) and \(W\) (over the same field) are isomorphic iff \(\dim(V) = \dim(W)\).

Proof:

(\(\Rightarrow\)) If \(V \cong W\), then \(\dim(V) = \dim(W)\).
Suppose \(\phi: V \to W\) is a linear isomorphism (bijective linear map). Let \(\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}\) be a basis for \(V\). We claim \(\{\phi(\mathbf{v}_1), \ldots, \phi(\mathbf{v}_n)\}\) is a basis for \(W\).

  1. Spanning: Any \(\mathbf{w} \in W\) satisfies \(\mathbf{w} = \phi(\mathbf{v})\) for some \(\mathbf{v} \in V\) (surjectivity of \(\phi\)). Writing \(\mathbf{v} = \sum_i c_i \mathbf{v}_i\), we get \(\mathbf{w} = \sum_i c_i \phi(\mathbf{v}_i)\) (linearity of \(\phi\)).

  2. Independence: If \(\sum_i \gamma_i \phi(\mathbf{v}_i) = \mathbf{0}\), then \(\phi(\sum_i \gamma_i \mathbf{v}_i) = \mathbf{0}\). Since \(\ker(\phi) = \{\mathbf{0}\}\) (injectivity), \(\sum_i \gamma_i \mathbf{v}_i = \mathbf{0}\). By independence of \(\{\mathbf{v}_i\}\), all \(\gamma_i = 0\).

Thus \(\{\phi(\mathbf{v}_1), \ldots, \phi(\mathbf{v}_n)\}\) is a basis for \(W\), so \(\dim(W) = n = \dim(V)\).

(\(\Leftarrow\)) If \(\dim(V) = \dim(W) = n\), then \(V \cong W\).
Choose bases \(\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}\) for \(V\) and \(\{\mathbf{w}_1, \ldots, \mathbf{w}_n\}\) for \(W\). Define \(\phi: V \to W\) by: \[\phi\left(\sum_{i=1}^n c_i \mathbf{v}_i\right) = \sum_{i=1}^n c_i \mathbf{w}_i\]

This is well-defined (coordinates unique w.r.t. basis), linear (respects coefficients), injective (distinct coordinate vectors map to distinct elements in \(W\)), and surjective (every element of \(W\) is the image of some coordinate vector). Thus \(\phi\) is an isomorphism. \(\square\)


B.3 Eckart-Young Theorem (Low-Rank Approximation Optimality)

Theorem: Let \(A \in \mathbb{R}^{m \times n}\) with SVD \(A = U\Sigma V^\top\) and singular values \(\sigma_1 \geq \sigma_2 \geq \cdots \geq 0\). Among all matrices \(B\) of rank at most \(k\), the rank-\(k\) truncated SVD \(A_k = U_k \Sigma_k V_k^\top\) minimizes the Frobenius-norm error: \[A_k = \arg\min_{\text{rank}(B) \leq k} \|A - B\|_F\] with error: \[\|A - A_k\|_F = \sqrt{\sum_{j=k+1}^{\min(m,n)} \sigma_j^2}\]

Proof:

Orthogonal invariance of Frobenius norm.
For orthogonal matrices \(U, V\) and any matrix \(M\) of compatible size, \(\|UMV^\top\|_F = \|M\|_F\). This follows from \(\|M\|_F^2 = \text{tr}(M^\top M)\), the identities \(U^\top U = I\) and \(V^\top V = I\), and the cyclic property of the trace: \(\|UMV^\top\|_F^2 = \text{tr}(V M^\top U^\top U M V^\top) = \text{tr}(V M^\top M V^\top) = \text{tr}(M^\top M V^\top V) = \text{tr}(M^\top M) = \|M\|_F^2\).

Reformulation.
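Take the full SVD, so that \(U \in \mathbb{R}^{m \times m}\) and \(V \in \mathbb{R}^{n \times n}\) are orthogonal and \(\Sigma \in \mathbb{R}^{m \times n}\). For any matrix \(B\) with \(\text{rank}(B) \leq k\), set \(\bar{B} = U^\top B V \in \mathbb{R}^{m \times n}\), so that \(B = U\bar{B}V^\top\) and \(\text{rank}(\bar{B}) = \text{rank}(B)\). By orthogonal invariance: \[\|A - B\|_F = \|U(\Sigma - \bar{B})V^\top\|_F = \|\Sigma - \bar{B}\|_F\]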

Minimization.
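We must minimize \(\|\Sigma - \bar{B}\|_F^2\) subject to \(\text{rank}(\bar{B}) \leq k\), where \(\Sigma\) is diagonal with entries \(\sigma_1 \geq \sigma_2 \geq \cdots \geq 0\). The candidate \(\bar{B} = \Sigma_k\) (keep \(\sigma_1, \ldots, \sigma_k\) and zero out the rest) has rank at most \(k\) and gives \(\|\Sigma - \Sigma_k\|_F^2 = \sum_{j=k+1}^{\min(m,n)} \sigma_j^2\). To see that no other matrix of rank at most \(k\) does better, including matrices with off-diagonal entries, use Weyl's inequality for singular values, \(\sigma_{i+j-1}(X + Y) \leq \sigma_i(X) + \sigma_j(Y)\), with \(X = \Sigma - \bar{B}\), \(Y = \bar{B}\), and \(j = k+1\). Since \(\sigma_{k+1}(\bar{B}) = 0\), this gives \(\sigma_{i+k} \leq \sigma_i(\Sigma - \bar{B})\) for every \(i\), and therefore: \[\|\Sigma - \bar{B}\|_F^2 = \sum_{i} \sigma_i(\Sigma - \bar{B})^2 \geq \sum_{i \geq 1} \sigma_{i+k}^2 = \sum_{j=k+1}^{\min(m,n)} \sigma_j^2\] Thus \(\bar{B} = \Sigma_k\) is a minimizer (unique when \(\sigma_k > \sigma_{k+1}\)). Therefore: \[B_* = U \Sigma_k V^\top = A_k\]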

Error formula.
\[\|A - A_k\|_F^2 = \|\Sigma - \Sigma_k\|_F^2 = \sum_{j=k+1}^{\min(m,n)} \sigma_j^2\] \(\square\)
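
A numerical check of the error formula, as a sketch on a random 8 × 5 matrix. The comparison against randomly drawn rank-\(k\) matrices is only a spot check, not an exhaustive search; the sizes, seed, and sample count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(8, 5))
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]     # rank-k truncated SVD

err_svd = np.linalg.norm(A - A_k, 'fro')
err_formula = np.sqrt(np.sum(s[k:] ** 2))    # sqrt of the discarded sigma_j^2

# Random rank-k competitors B = C D; none should beat the truncated SVD.
err_random = min(
    np.linalg.norm(A - rng.normal(size=(8, k)) @ rng.normal(size=(k, 5)), 'fro')
    for _ in range(200)
)

print(err_svd, err_formula, err_random)
```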


B.4 Rank-Nullity Bound for Matrix Products

Theorem: Let \(A \in \mathbb{R}^{m \times n}\) and \(B \in \mathbb{R}^{n \times p}\). Then: \[\text{rank}(A) + \text{rank}(B) - n \leq \text{rank}(AB) \leq \min(\text{rank}(A), \text{rank}(B))\] The lower bound is Sylvester's rank inequality. In particular, \(\text{rank}(AB) = \text{rank}(B)\) when \(A\) has full column rank, and \(\text{rank}(AB) = \text{rank}(A)\) when \(B\) has full row rank.

Proof:

Upper bound.
The image of \(AB\) is \(\text{im}(AB) = \{AB\mathbf{x} : \mathbf{x} \in \mathbb{R}^p\} = \{A\mathbf{y} : \mathbf{y} \in \text{im}(B)\} \subseteq \{A\mathbf{z} : \mathbf{z} \in \mathbb{R}^n\} = \text{im}(A)\). Thus: \[\text{rank}(AB) = \dim(\text{im}(AB)) \leq \dim(\text{im}(A)) = \text{rank}(A)\] Similarly, \(\text{im}(AB) = A(\text{im}(B))\) is the image under \(A\) of a subspace of dimension \(\text{rank}(B)\), and a linear map cannot increase dimension, so \(\text{rank}(AB) \leq \text{rank}(B)\). Thus: \[\text{rank}(AB) \leq \min(\text{rank}(A), \text{rank}(B))\]

Lower bound.
Let \(T\) be the restriction of \(A\) to the subspace \(\text{im}(B)\), so that \(\text{im}(T) = \text{im}(AB)\). Rank-nullity applied to \(T\) gives \(\text{rank}(B) = \dim(\text{im}(B)) = \dim(\ker T) + \text{rank}(AB)\). Since \(\ker T = \ker(A) \cap \text{im}(B) \subseteq \ker(A)\), we have \(\dim(\ker T) \leq n - \text{rank}(A)\), hence \(\text{rank}(AB) \geq \text{rank}(B) - (n - \text{rank}(A))\). Note that the upper bound need not be attained even when \(\text{rank}(A) + \text{rank}(B) \geq n\): for \(A = \text{diag}(1, 0)\) and \(B = \text{diag}(0, 1)\), \(AB = 0\) although both factors have rank 1. \(\square\)
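
A quick numerical look at the rank of a product, as a sketch. It reports \(\text{rank}(AB)\) between the upper bound \(\min(\text{rank}(A), \text{rank}(B))\) and the standard lower bound \(\text{rank}(A) + \text{rank}(B) - n\) (Sylvester's rank inequality). The helper name `rank_bounds` and the example matrices are made up for illustration.

```python
import numpy as np

def rank_bounds(A, B):
    """Return (lower bound, rank(AB), upper bound) for the product AB."""
    n = A.shape[1]
    rA = int(np.linalg.matrix_rank(A))
    rB = int(np.linalg.matrix_rank(B))
    lower = max(0, rA + rB - n)      # Sylvester's rank inequality
    upper = min(rA, rB)
    return lower, int(np.linalg.matrix_rank(A @ B)), upper

# im(B) lies inside ker(A): the product vanishes and the upper bound is strict.
A1 = np.diag([1.0, 0.0])
B1 = np.diag([0.0, 1.0])
bounds_strict = rank_bounds(A1, B1)

# Random full-rank factors: lower and upper bounds coincide.
rng = np.random.default_rng(3)
A2 = rng.normal(size=(4, 3))
B2 = rng.normal(size=(3, 5))
bounds_generic = rank_bounds(A2, B2)

print(bounds_strict, bounds_generic)
```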


Appendix C: ML Implementation Notes

C.1 Computing Rank Numerically: Algorithms and Pitfalls

Goal: Given a matrix \(A \in \mathbb{R}^{m \times n}\), compute its rank (dimension of column space).

Method 1: SVD-Based (Recommended)

import numpy as np

def rank_svd(A, tol=None):
    """
    Compute rank via SVD.
    
    Args:
        A: m x n matrix
        tol: absolute threshold for singular values
             (default: max(m, n) * machine epsilon * largest singular value)
    
    Returns:
        rank: number of singular values > tol
    """
    s = np.linalg.svd(A, compute_uv=False)  # singular values only, in descending order
    if s.size == 0:
        return 0
    # Automatic tolerance (following MATLAB convention)
    if tol is None:
        tol = np.max(A.shape) * np.finfo(s.dtype).eps * s[0]
    return int(np.sum(s > tol))

Advantages: Numerically stable; gives singular values (condition number); works for rectangular matrices.

Pitfall: Tolerance selection is critical. Too large → underestimates rank. Too small → numerical noise treated as signal.
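
To make the tolerance pitfall concrete, the sketch below builds a 5 × 5 matrix with prescribed singular values (an illustrative construction using orthogonal factors from QR of random matrices) and reports the rank that `np.linalg.matrix_rank` returns for several absolute tolerances.

```python
import numpy as np

rng = np.random.default_rng(4)
# Orthogonal factors from QR of random matrices (illustrative construction).
Q1, _ = np.linalg.qr(rng.normal(size=(5, 5)))
Q2, _ = np.linalg.qr(rng.normal(size=(5, 5)))
sigma = np.array([1.0, 1e-2, 1e-6, 1e-12, 0.0])   # prescribed singular values
A = Q1 @ np.diag(sigma) @ Q2.T

# Numerical rank as a function of the absolute tolerance.
ranks = {tol: int(np.linalg.matrix_rank(A, tol=tol))
         for tol in (1e-1, 1e-4, 1e-9, 1e-14)}
print(ranks)
```

The "right" answer depends on whether singular values such as \(10^{-6}\) or \(10^{-12}\) are signal or noise for the problem at hand, which is a modelling decision rather than a purely numerical one.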


Method 2: QR Decomposition (Fast, rank-revealing QR)

import scipy.linalg

def rank_qr(A, tol=1e-10):
    """
    Compute rank via QR decomposition with column pivoting.
    
    Args:
        A: m x n matrix
        tol: tolerance for diagonal elements of R, relative to |R[0, 0]|
    
    Returns:
        rank: number of diagonal elements of R above the threshold
    """
    Q, R, P = scipy.linalg.qr(A, mode='economic', pivoting=True)
    # With pivoting, |diag(R)| is non-increasing, so |R[0, 0]| sets the scale
    d = np.abs(np.diag(R))
    if d.size == 0 or d[0] == 0:
        return 0
    return int(np.sum(d > tol * d[0]))

Advantages: Faster than full SVD for tall matrices (\(m \gg n\)); reveals rank by inspection of \(R\)’s diagonal.

Pitfall: Assumes column pivoting is enabled; without it, may misidentify rank on nearly-singular matrices.
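
The effect of pivoting can be seen on a tiny example. This sketch assumes SciPy's `scipy.linalg.qr` and uses a 2 × 2 matrix whose first column is zero; the helper `diag_count` and the threshold are made up for illustration.

```python
import numpy as np
import scipy.linalg

A = np.array([[0.0, 1.0],
              [0.0, 0.0]])           # true rank 1; the first column is zero

def diag_count(R, tol=1e-12):
    """Count diagonal entries of R whose magnitude exceeds tol."""
    return int(np.sum(np.abs(np.diag(R)) > tol))

_, R_plain = scipy.linalg.qr(A, mode='economic')                     # no pivoting
_, R_piv, perm = scipy.linalg.qr(A, mode='economic', pivoting=True)  # column pivoting

rank_plain = diag_count(R_plain)
rank_pivoted = diag_count(R_piv)
true_rank = int(np.linalg.matrix_rank(A))
print(rank_plain, rank_pivoted, true_rank)
```

Without pivoting, the zero first column forces a zero in the leading diagonal position of \(R\), so counting diagonal entries undercounts the rank; with pivoting, the nonzero column is moved to the front.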


Method 3: Direct Gaussian Elimination (Educational, less stable)

def rank_row_reduce(A, tol=1e-10):
    """
    Compute rank by row reduction to echelon form with partial pivoting.
    Minimal stability, but illustrative.
    """
    R = np.array(A, dtype=float)  # float copy, so integer inputs are handled
    m, n = R.shape
    rank = 0
    for col in range(n):
        if rank == m:
            break  # every row already holds a pivot
        # Find pivot: largest entry in this column at or below row `rank`
        pivot_row = np.argmax(np.abs(R[rank:, col])) + rank
        if np.abs(R[pivot_row, col]) < tol:
            continue
        # Swap rows
        R[[rank, pivot_row]] = R[[pivot_row, rank]]
        # Eliminate below
        for row in range(rank + 1, m):
            factor = R[row, col] / R[rank, col]
            R[row, col:] -= factor * R[rank, col:]
        rank += 1
    return rank

Pitfall: Prone to rounding errors; partial pivoting essential but not foolproof.


C.2 Whitening and Preconditioning for Optimization

Overview: Whitening transforms data to zero mean and identity covariance. For quadratic losses such as least squares, this makes the loss landscape isotropic (equal curvature in all directions), which accelerates gradient descent.

Implementation:

def whiten_data(X):
    """
    Whiten data X (n_samples x n_features).
    Returns whitened data and whitening transform matrix.
    """
    X_centered = X - np.mean(X, axis=0)
    cov = X_centered.T @ X_centered / (X.shape[0] - 1)
    
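    # Method 1: Cholesky decomposition (cov = L L^T; requires positive-definite cov)
    try:
        L = np.linalg.cholesky(cov)
        whitening_matrix = np.linalg.inv(L.T)
        Z = X_centered @ whitening_matrix  # Z.T @ Z / (n - 1) ≈ I
        return Z, whitening_matrix
    except np.linalg.LinAlgError:
        # cov is not positive definite (e.g. singular); fall back to SVD (ZCA whitening)
        U, s, Vt = np.linalg.svd(cov, full_matrices=False)
        # The 1e-10 floor keeps the inverse square root finite for zero singular values
        whitening_matrix = (U @ np.diag(1/np.sqrt(s + 1e-10))) @ U.T
        return X_centered @ whitening_matrix, whitening_matrix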

Using whitening in gradient descent:

def gradient_descent_whitened(X, y, learning_rate=0.01, max_iter=1000):
    """
    Gradient descent on whitened features.
    """
    X_white, whiten_matrix = whiten_data(X)
    
    # Initialize weights
    w = np.zeros(X_white.shape[1])
    
    for t in range(max_iter):
        # Compute gradient on whitened data
        residuals = X_white @ w - y
        grad = X_white.T @ residuals / len(y)
        w -= learning_rate * grad
    
    return w, whiten_matrix

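Effect: For least squares, whitening makes the Hessian \(X^\top X / n\) approximately the identity, so its condition number is close to 1. Gradient descent then converges in far fewer iterations, and a much larger step size (up to roughly 1) can be used without divergence. Note that the returned weights live in whitened coordinates, and because the whitened features are centered, \(\mathbf{y}\) should be centered as well (or an intercept added).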
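
The conditioning improvement can be measured directly. The sketch below uses synthetic, badly scaled features and PCA whitening via `np.linalg.eigh` (a swapped-in variant, not the Cholesky route above); all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
# Correlated, badly scaled features (synthetic data for illustration).
X = rng.normal(size=(n, 3)) @ np.array([[10.0, 0.0, 0.0],
                                        [9.0, 1.0, 0.0],
                                        [0.0, 0.0, 0.01]])
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (n - 1)

# PCA whitening: cov = V diag(lam) V^T  =>  W = V diag(lam^{-1/2})
lam, V = np.linalg.eigh(cov)
W = V / np.sqrt(lam)                 # scales column j of V by lam_j^{-1/2}
Z = Xc @ W

cov_white = Z.T @ Z / (n - 1)        # should be close to the identity
cond_before = np.linalg.cond(Xc.T @ Xc)
cond_after = np.linalg.cond(Z.T @ Z)
print(cond_before, cond_after)
```

The condition number of \(X^\top X\) governs the convergence rate of gradient descent on least squares, which is why a drop from a very large value to about 1 matters in practice.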


C.3 PCA Implementation: From First Principles to Scikit-Learn

From Scratch:

def pca_from_scratch(X, n_components=None):
    """
    PCA implementation using eigendecomposition.
    
    Args:
        X: n_samples x n_features data matrix
        n_components: number of components to retain
    
    Returns:
        X_reduced: n_samples x n_components projection
        components: n_components x n_features principal vectors
        explained_variance: variance along each component
    """
    # Center data
    X_centered = X - np.mean(X, axis=0)
    
    # Compute covariance
    cov = X_centered.T @ X_centered / (len(X) - 1)
    
    # Eigendecomposition
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    
    # Sort descending
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]
    
    # Truncate to n_components (default: keep every feature direction)
    if n_components is None:
        n_components = X.shape[1]
    eigenvectors = eigenvectors[:, :n_components]
    eigenvalues = eigenvalues[:n_components]
    
    # Project
    X_reduced = X_centered @ eigenvectors
    
    return X_reduced, eigenvectors.T, eigenvalues
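
An equivalent route avoids forming the covariance matrix: the right singular vectors of the centered data are the principal directions, and the eigenvalues equal \(s_i^2 / (n - 1)\). The sketch below checks this on synthetic data; the names and sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))   # synthetic data
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Route 1: eigendecomposition of the covariance matrix.
evals, evecs = np.linalg.eigh(Xc.T @ Xc / (n - 1))
evals, evecs = evals[::-1], evecs[:, ::-1]               # descending order

# Route 2: SVD of the centered data; no covariance matrix is formed.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
var_from_svd = s ** 2 / (n - 1)

# Principal directions agree up to sign: |<evec_j, v_j>| should be 1.
alignment = np.abs(np.sum(evecs * Vt.T, axis=0))
print(evals, var_from_svd, alignment)
```

The SVD route is usually preferred numerically because squaring the data in \(X^\top X\) also squares its condition number.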

Using Scikit-Learn (Recommended):

from sklearn.decomposition import PCA

# Fit PCA
pca = PCA(n_components=0.95)  # Retain 95% of variance
X_reduced = pca.fit_transform(X)

# Inspect
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Cumulative variance: {np.cumsum(pca.explained_variance_ratio_)}")
print(f"Components shape: {pca.components_.shape}")  # (n_components, n_features)

# Reconstruct
X_approx = pca.inverse_transform(X_reduced)
reconstruction_error = np.mean((X - X_approx)**2)

C.4 Ridge Regression and Regularization Trade-offs

Implementation:

def ridge_regression(X, y, lambda_param):
    """
    Ridge regression: min ||X w - y||^2 + lambda ||w||^2
    
    Closed form: w = (X^T X + lambda I)^-1 X^T y
    """
    m, n = X.shape
    w = np.linalg.solve(X.T @ X + lambda_param * np.eye(n), X.T @ y)
    return w

def ridge_regression_cv(X, y, lambdas=None, cv=5):
    """
    Ridge regression with cross-validation to select lambda.
    """
    from sklearn.linear_model import RidgeCV
    
    if lambdas is None:
        lambdas = np.logspace(-4, 4, 50)
    
    ridge_cv = RidgeCV(alphas=lambdas, cv=cv)
    ridge_cv.fit(X, y)
    
    print(f"Optimal lambda: {ridge_cv.alpha_}")
    return ridge_cv

Trade-off Visualization:

import matplotlib.pyplot as plt

def plot_regularization_path(X_train, y_train, X_test, y_test):
    """
    Plot train/test error vs. regularization parameter.
    """
    lambdas = np.logspace(-4, 4, 50)
    train_errors, test_errors = [], []
    
    for lam in lambdas:
        w = ridge_regression(X_train, y_train, lam)
        train_errors.append(np.mean((X_train @ w - y_train)**2))
        test_errors.append(np.mean((X_test @ w - y_test)**2))
    
    plt.figure(figsize=(10, 6))
    plt.plot(lambdas, train_errors, label='Train Error')
    plt.plot(lambdas, test_errors, label='Test Error')
    # For illustration only: in practice select λ on a validation set, not the test set
    plt.axvline(lambdas[np.argmin(test_errors)], linestyle='--', label='Optimal λ')
    plt.xscale('log')
    plt.xlabel('Regularization Parameter λ')
    plt.ylabel('MSE')
    plt.legend()
    plt.title('Regularization Trade-off: Bias vs. Variance')
    plt.show()
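
The trade-off has a clean description in SVD coordinates: with \(X = U\,\text{diag}(s)\,V^\top\), the ridge solution is \(\mathbf{w} = V\,\text{diag}\!\left(s_i / (s_i^2 + \lambda)\right) U^\top \mathbf{y}\), so each singular direction is shrunk by the factor \(s_i^2 / (s_i^2 + \lambda)\) relative to ordinary least squares. A minimal sketch on synthetic data (names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)
lam = 0.5

# Closed form from the regularized normal equations.
w_normal = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# Same solution through the SVD X = U diag(s) V^T:
#   w = V diag(s / (s^2 + lam)) U^T y
U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ ((s / (s ** 2 + lam)) * (U.T @ y))

# Shrinkage of each singular direction relative to least squares.
filter_factors = s ** 2 / (s ** 2 + lam)
print(np.allclose(w_normal, w_svd), filter_factors)
```

Directions with \(s_i^2 \gg \lambda\) are left almost untouched, while directions with \(s_i^2 \ll \lambda\) are suppressed, which is how ridge reduces the effective dimension of the model.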

C.5 Gram Matrix Computation and Kernel Methods

Standard Gram Matrix:

def compute_gram_matrix(X, metric='euclidean'):
    """
    Compute Gram matrix G = X X^T
    
    Args:
        X: n_samples x n_features
        metric: inner product to use: 'euclidean' (plain dot product) or
                'cosine' (dot product of unit-normalized rows)
    
    Returns:
        G: n_samples x n_samples Gram matrix (similarity)
    """
    if metric == 'euclidean':
        G = X @ X.T
    elif metric == 'cosine':
        # Normalize rows (every row must be nonzero)
        X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
        G = X_norm @ X_norm.T
    else:
        raise ValueError(f"Unknown metric: {metric}")
    return G

Kernel Methods:

def kernel_matrix(X1, X2, kernel='rbf', gamma=1.0, degree=2):
    """
    Compute kernel matrix K(X1, X2) for various kernels.
    """
    if kernel == 'rbf':
        # RBF: exp(-gamma * ||x - y||^2), using ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2
        sq_distances = np.sum(X1**2, axis=1, keepdims=True) - 2*X1@X2.T + np.sum(X2**2, axis=1)
        # Round-off can make tiny distances slightly negative; clip at zero
        sq_distances = np.maximum(sq_distances, 0.0)
        K = np.exp(-gamma * sq_distances)
    elif kernel == 'polynomial':
        # Polynomial: (X1 @ X2.T + 1)^degree
        K = (X1 @ X2.T + 1)**degree
    elif kernel == 'linear':
        K = X1 @ X2.T
    else:
        raise ValueError(f"Unknown kernel: {kernel}")
    return K

Kernel Methods in SVM:

from sklearn.svm import SVC

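# Train SVM with RBF kernel
svm = SVC(kernel='rbf', gamma='scale', C=1.0)
svm.fit(X_train, y_train)

# Recompute the Gram matrix restricted to the support vectors.
# gamma='scale' means gamma = 1 / (n_features * X_train.var()), so pass that same value.
gamma_scale = 1.0 / (X_train.shape[1] * X_train.var())
K_sv = kernel_matrix(X_train[svm.support_], X_train[svm.support_], kernel='rbf', gamma=gamma_scale)
print(f"Gram matrix rank (support vectors): {np.linalg.matrix_rank(K_sv)}")
print(f"Number of support vectors: {len(svm.support_)}")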

C.6 Handling Rank-Deficient and Ill-Conditioned Matrices

Condition Number:

def matrix_condition(A):
    """
    Compute condition number (ratio of largest to smallest singular values).
    Large cond => ill-conditioned => sensitive to perturbations.
    """
    s = np.linalg.svd(A, compute_uv=False)  # singular values only, in descending order
    if s[-1] == 0:
        return np.inf  # exactly singular
    return s[0] / s[-1]

def is_ill_conditioned(A, threshold=1e10):
    """Check if matrix is ill-conditioned."""
    return matrix_condition(A) > threshold

Stable Pseudoinverse:

def stable_pinv(A, tol=1e-10):
    """
    Compute pseudoinverse via SVD, handling small singular values.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Invert only singular values above tol; the rest map to 0 (avoids division by zero)
    s_inv = np.zeros_like(s)
    mask = s > tol
    s_inv[mask] = 1.0 / s[mask]
    A_pinv = Vt.T @ np.diag(s_inv) @ U.T
    return A_pinv

Robust Least Squares (Underdetermined):

def least_squares_robust(X, y, regularization='ridge', lambda_param=0.1):
    """
    Solve least squares robustly when X is rank-deficient.
    """
    if regularization == 'ridge':
        # Ridge: add lambda * I to the matrix X^T X of the normal equations
        w = np.linalg.solve(X.T @ X + lambda_param * np.eye(X.shape[1]), X.T @ y)
    elif regularization == 'lasso':
        # LASSO: sparse solution (coordinate descent or proximal methods).
        # Note: scikit-learn scales the squared error by 1/(2 * n_samples) and fits an
        # intercept by default, so alpha is not directly comparable to the ridge lambda.
        from sklearn.linear_model import Lasso
        lasso = Lasso(alpha=lambda_param)
        lasso.fit(X, y)
        w = lasso.coef_
    elif regularization == 'svd':
        # Use pseudoinverse (minimum-norm solution)
        w = stable_pinv(X) @ y
    else:
        raise ValueError(f"Unknown regularization: {regularization}")
    return w
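
For an underdetermined system, the pseudoinverse picks the minimum-norm solution among infinitely many exact fits. The sketch below illustrates this on a random 3 × 6 system, using NumPy's built-in `np.linalg.pinv` and `np.linalg.lstsq` instead of the helper above; `w_other` is a second exact solution obtained by adding a kernel vector.

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(3, 6))          # 3 equations, 6 unknowns
y = rng.normal(size=3)

w_pinv = np.linalg.pinv(X) @ y
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]   # also returns the minimum-norm solution

# Any vector in ker(X) can be added without changing the fit.
_, _, Vt = np.linalg.svd(X)          # full SVD: the last rows of Vt span ker(X)
w_other = w_pinv + 2.0 * Vt[-1]

print(np.linalg.norm(w_pinv), np.linalg.norm(w_other))
```

Because `w_pinv` is orthogonal to \(\ker(X)\), adding any kernel component can only increase the norm; this is the rank-nullity decomposition of the solution set at work.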

C.7 Autoencoders: Dimension Reduction via Neural Networks

Linear Autoencoder (Recovers the PCA Subspace):

import tensorflow as tf

def linear_autoencoder(input_dim, bottleneck_dim):
    """
    Linear autoencoder: input -> dense(bottleneck_dim) -> dense(input_dim)
    With MSE loss, the optimal reconstruction projects onto the span of the top
    bottleneck_dim principal components. The learned weights span that subspace
    but need not be orthonormal or ordered like PCA components.
    """
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(bottleneck_dim, activation='linear'),
        tf.keras.layers.Dense(input_dim, activation='linear')
    ])
    model.compile(optimizer='adam', loss='mse')
    return model

# Usage
bottleneck_dim = 32
autoencoder = linear_autoencoder(input_dim=784, bottleneck_dim=bottleneck_dim)
autoencoder.fit(X_train, X_train, epochs=50, validation_data=(X_test, X_test))

# Extract encoder (latent codes)
encoder = tf.keras.Sequential(autoencoder.layers[:-1])
X_latent = encoder.predict(X_train)
print(f"Latent dimension: {X_latent.shape[1]} (should be {bottleneck_dim})")

Nonlinear Autoencoder (Manifold Learning):

def nonlinear_autoencoder(input_dim, bottleneck_dim):
    """
    Nonlinear autoencoder with hidden layers.
    Can learn curved manifolds, not just linear subspaces.
    Assumes inputs are scaled to [0, 1] (sigmoid output with binary cross-entropy loss).
    """
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation='relu', input_shape=(input_dim,)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(bottleneck_dim, activation='relu'),  # Bottleneck
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(input_dim, activation='sigmoid')  # Reconstruct
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model

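Selecting Bottleneck Dimension via Hold-Out Validation:

def autoencoder_dimension_selection(X_train, X_test, dims=range(1, 129, 10)):
    """
    Train linear autoencoders with various bottleneck dimensions.
    Select via error on a single hold-out set (X_test), not k-fold cross-validation.
    Reconstruction error usually keeps falling as the dimension grows, so inspect
    the returned curve for an elbow rather than relying only on the argmin.
    """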
    results = {'dim': [], 'train_loss': [], 'test_loss': []}
    
    for dim in dims:
        ae = linear_autoencoder(input_dim=X_train.shape[1], bottleneck_dim=dim)
        history = ae.fit(X_train, X_train, epochs=30, validation_data=(X_test, X_test), verbose=0)
        results['dim'].append(dim)
        results['train_loss'].append(history.history['loss'][-1])
        results['test_loss'].append(history.history['val_loss'][-1])
    
    optimal_dim = results['dim'][np.argmin(results['test_loss'])]
    print(f"Optimal bottleneck dimension: {optimal_dim}")
    return results

C.8 Numerical Stability Checklist

Before applying linear algebra algorithms, check:

Check Command Remediation
Rank np.linalg.matrix_rank(A) If rank \(< \min(m,n)\), use pseudoinverse or regularization
Condition Number np.linalg.cond(A) If cond \(> 10^{10}\), use SVD, regularize, or precondition
Singular Values U, s, Vt = np.linalg.svd(A) Inspect \(s\) for near-zero values; set threshold for “zero”
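Eigenvalues eigs, V = np.linalg.eigh(A) For symmetric \(A\), all real; a large ratio \(|\lambda_{\max}| / |\lambda_{\min}|\) signals ill-conditioning
Determinant det = np.linalg.det(A) Exactly 0 iff singular, but a poor numerical indicator because it scales with matrix size and entry magnitude; prefer the condition number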
Orthogonality Q.T @ Q ≈ I After QR or SVD, verify orthonormality if numerical precision matters
Convergence Plot error vs. iteration In iterative solvers (gradient descent, power method), check convergence rate
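
Several of these checks can be bundled into one helper. The function below is a hypothetical sketch (the name `diagnose` and the default threshold of \(10^{10}\) are choices made for this example, mirroring the table above), not part of NumPy.

```python
import numpy as np

def diagnose(A, cond_threshold=1e10):
    """Collect basic rank and conditioning diagnostics for a 2-D array A."""
    A = np.asarray(A, dtype=float)
    s = np.linalg.svd(A, compute_uv=False)       # singular values, descending
    rank = int(np.linalg.matrix_rank(A))
    cond = np.inf if s[-1] == 0 else s[0] / s[-1]
    return {
        "shape": A.shape,
        "rank": rank,
        "full_rank": rank == min(A.shape),
        "sigma_max": float(s[0]),
        "sigma_min": float(s[-1]),
        "cond": float(cond),
        "ill_conditioned": bool(cond > cond_threshold),
    }

report_ok = diagnose(np.eye(3))
report_bad = diagnose(np.array([[1.0, 2.0], [2.0, 4.0]]))   # second row = 2 * first row
print(report_ok)
print(report_bad)
```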

End of Appendices