Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Optional

This section is optional reading. It leans more heavily on linear algebra than the rest of the chapter and is not required by anything that follows.

Big Idea

The least squares problem minAxb\min \|A\mathbf{x} - \mathbf{b}\| arises whenever a linear system is not exactly solvable. There are two cases: overdetermined (m>nm > n, too many equations) and underdetermined (m<nm < n, too few equations). Both are understood through the fundamental subspaces of AA.

The Fundamental Subspaces

Given ARm×nA \in \mathbb{R}^{m \times n} with rank rr, the matrix AA maps between two spaces that each split into orthogonal pairs:

Rn=R(AT)r-dimN(A)(nr)-dim,Rm=R(A)r-dimN(AT)(mr)-dim\mathbb{R}^n = \underbrace{\text{R}(A^T)}_{r\text{-dim}} \oplus \underbrace{\text{N}(A)}_{(n-r)\text{-dim}}, \qquad \mathbb{R}^m = \underbrace{\text{R}(A)}_{r\text{-dim}} \oplus \underbrace{\text{N}(A^T)}_{(m-r)\text{-dim}}

The matrix AA maps the row space R(AT)\text{R}(A^T) onto the range R(A)\text{R}(A) (an isomorphism when restricted to these subspaces), and sends the kernel N(A)\text{N}(A) to zero.

Source
<Figure size 1000x550 with 1 Axes>

The Two Types of Least Squares

Overdetermined: m>nm > n (too many equations)

The system Ax=bA\mathbf{x} = \mathbf{b} has no exact solution because b\mathbf{b} generally does not lie in R(A)\text{R}(A). We minimize the residual:

x^=argminxRnAxb2\hat{\mathbf{x}} = \arg\min_{\mathbf{x} \in \mathbb{R}^n} \|A\mathbf{x} - \mathbf{b}\|_2

The solution projects b\mathbf{b} onto R(A)\text{R}(A): decompose b=Ax^+r\mathbf{b} = A\hat{\mathbf{x}} + \mathbf{r} where the residual rN(AT)\mathbf{r} \in \text{N}(A^T) is orthogonal to the range. When AA has full column rank (r=nr = n), the kernel is trivial and x^\hat{\mathbf{x}} is unique.

Underdetermined: m<nm < n (too few equations)

The system Ax=bA\mathbf{x} = \mathbf{b} has infinitely many solutions (when bR(A)\mathbf{b} \in \text{R}(A)). We pick the minimum norm solution:

x^=argminxx2subject toAx=b\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \|\mathbf{x}\|_2 \quad \text{subject to} \quad A\mathbf{x} = \mathbf{b}

The solution projects onto R(AT)\text{R}(A^T): any solution can be written as x=x^+z\mathbf{x} = \hat{\mathbf{x}} + \mathbf{z} with zN(A)\mathbf{z} \in \text{N}(A). The minimum norm solution is the unique one with no kernel component, i.e., x^R(AT)\hat{\mathbf{x}} \in \text{R}(A^T).

In this course we focus on the overdetermined case (m>nm > n, full column rank).

Geometric Interpretation

The least squares solution projects b\mathbf{b} onto R(A)\text{R}(A). The decomposition Rm=R(A)N(AT)\mathbb{R}^m = \text{R}(A) \oplus \text{N}(A^T) splits b\mathbf{b} into:

b=Ax^R(A)+rN(AT)\mathbf{b} = \underbrace{A\hat{\mathbf{x}}}_{\in \text{R}(A)} + \underbrace{\mathbf{r}}_{\in \text{N}(A^T)}

The residual r=bAx^\mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}} lies in the cokernel N(AT)\text{N}(A^T), which is the orthogonal complement of R(A)\text{R}(A) (right panel above). This orthogonality condition rR(A)\mathbf{r} \perp \text{R}(A) characterizes the least squares solution.

The Two Types of Least Squares

Depending on the rank of AA, there are two distinct situations:

Case 1: Full column rank (r=nr = n)

The kernel N(A)={0}\text{N}(A) = \{\mathbf{0}\} is trivial, so the least squares solution x^\hat{\mathbf{x}} is unique. The problem reduces to projecting b\mathbf{b} onto R(A)\text{R}(A) (the cokernel projection). This is the standard overdetermined case.

Case 2: Rank-deficient (r<nr < n)

The kernel N(A)\text{N}(A) is nontrivial: if x^\hat{\mathbf{x}} is a least squares solution, then so is x^+z\hat{\mathbf{x}} + \mathbf{z} for any zN(A)\mathbf{z} \in \text{N}(A). Among all least squares solutions, we pick the minimum norm solution: the unique x^\hat{\mathbf{x}} that lies in R(AT)\text{R}(A^T) (i.e., has no component in N(A)\text{N}(A)). This requires the SVD or pseudoinverse to compute.

In this course we focus on Case 1 (full column rank).

Solving Least Squares via QR

Let A=QRA = QR be the full QR decomposition, where QRm×mQ \in \mathbb{R}^{m \times m} is orthogonal and RRm×nR \in \mathbb{R}^{m \times n}. Partition:

Q=[Q1Q2],R=[R10]Q = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix}, \qquad R = \begin{bmatrix} R_1 \\ \mathbf{0} \end{bmatrix}

where Q1Rm×nQ_1 \in \mathbb{R}^{m \times n} (the first nn columns), Q2Rm×(mn)Q_2 \in \mathbb{R}^{m \times (m-n)}, and R1Rn×nR_1 \in \mathbb{R}^{n \times n} is upper triangular. The reduced (thin) QR factorization is A=Q1R1A = Q_1 R_1.

Theorem 1 (QR Solution to Least Squares)

Let AA be m×nm \times n with m>nm > n and rank(A)=n\text{rank}(A) = n. The least squares solution

x^=argminxRnAxb2\hat{\mathbf{x}} = \arg\min_{\mathbf{x} \in \mathbb{R}^n} \|A\mathbf{x} - \mathbf{b}\|_2

is the unique solution of the upper triangular system

R1x^=Q1Tb.R_1 \hat{\mathbf{x}} = Q_1^T \mathbf{b}.

The residual norm is Ax^b=Q2Tb\|A\hat{\mathbf{x}} - \mathbf{b}\| = \|Q_2^T\mathbf{b}\|.

Proof 1

Since QQ is orthogonal, it preserves norms:

Axb2=Q(RxQTb)2=RxQTb2\|A\mathbf{x} - \mathbf{b}\|^2 = \|Q(R\mathbf{x} - Q^T\mathbf{b})\|^2 = \|R\mathbf{x} - Q^T\mathbf{b}\|^2

Expanding the block structure:

=[R10]x[Q1TbQ2Tb]2=[R1xQ1TbQ2Tb]2= \left\| \begin{bmatrix} R_1 \\ \mathbf{0} \end{bmatrix} \mathbf{x} - \begin{bmatrix} Q_1^T \mathbf{b} \\ Q_2^T \mathbf{b} \end{bmatrix} \right\|^2 = \left\| \begin{bmatrix} R_1\mathbf{x} - Q_1^T\mathbf{b} \\ -Q_2^T\mathbf{b} \end{bmatrix} \right\|^2

By the Pythagorean theorem:

=R1xQ1Tb2+Q2Tb2= \|R_1\mathbf{x} - Q_1^T\mathbf{b}\|^2 + \|Q_2^T\mathbf{b}\|^2

The term Q2Tb2\|Q_2^T\mathbf{b}\|^2 does not depend on x\mathbf{x}, so the minimum occurs when R1x=Q1TbR_1\mathbf{x} = Q_1^T\mathbf{b}, and the minimum residual is Q2Tb\|Q_2^T\mathbf{b}\|.

Perturbation Theory for Least Squares

Before choosing an algorithm, we ask how sensitive the solution x^\hat{\mathbf{x}} is to perturbations in the data AA and b\mathbf{b}. Unlike square systems, the answer depends not only on κ2(A)\kappa_2(A) but also on the size of the residual.

Recall the 2-norm condition number of a tall full-rank matrix:

κ2(A)=σmax(A)σmin(A)=A2A2,\kappa_2(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)} = \|A\|_2 \, \|A^\dagger\|_2,

where A=(ATA)1ATA^\dagger = (A^T A)^{-1} A^T is the pseudoinverse.

Theorem 2 (Least Squares Perturbation Bound)

Let ARm×nA \in \mathbb{R}^{m \times n} have full column rank, let x^\hat{\mathbf{x}} solve minAxb2\min \|A\mathbf{x} - \mathbf{b}\|_2 with residual r=bAx^\mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}}, and let θ\theta be the angle between b\mathbf{b} and R(A)\text{R}(A), so

sinθ=r2b2.\sin\theta = \frac{\|\mathbf{r}\|_2}{\|\mathbf{b}\|_2}.

Let x^+δx\hat{\mathbf{x}} + \delta\mathbf{x} solve the perturbed problem with data A+δAA + \delta A, b+δb\mathbf{b} + \delta\mathbf{b}. If

ε=max ⁣(δA2A2,δb2b2)\varepsilon = \max\!\left( \frac{\|\delta A\|_2}{\|A\|_2}, \frac{\|\delta\mathbf{b}\|_2}{\|\mathbf{b}\|_2} \right)

is small enough that εκ2(A)<1\varepsilon\, \kappa_2(A) < 1, then to first order

δx2x^2    ε(2κ2(A)cosθ+κ2(A)2tanθ).\frac{\|\delta\mathbf{x}\|_2}{\|\hat{\mathbf{x}}\|_2} \;\lesssim\; \varepsilon \left( \frac{2\,\kappa_2(A)}{\cos\theta} + \kappa_2(A)^2 \tan\theta \right).

Remark 1 (Reading the Bound)

Two regimes deserve attention.

  1. Small residual (sinθ0\sin\theta \approx 0, b\mathbf{b} nearly in R(A)\text{R}(A)). Then tanθ0\tan\theta \approx 0 and the bound collapses to δx/x^2εκ2(A)\|\delta\mathbf{x}\|/\|\hat{\mathbf{x}}\| \lesssim 2\varepsilon\,\kappa_2(A). The least squares problem is as well conditioned as a square linear system.

  2. Large residual (sinθ\sin\theta not small). The κ2(A)2tanθ\kappa_2(A)^2 \tan\theta term dominates. Sensitivity scales with the square of the condition number, even when AA itself is only mildly ill conditioned.

This κ2(A)2\kappa_2(A)^2 term is the central obstruction in least squares.

Remark 2 (Why This Rules Out the Normal Equations)

Forming ATAA^T A produces a square system with condition number

κ2(ATA)=κ2(A)2.\kappa_2(A^T A) = \kappa_2(A)^2.

Solving it by Cholesky in finite precision incurs a relative error of order εmachκ2(A)2\varepsilon_{\text{mach}}\, \kappa_2(A)^2 regardless of the residual. QR instead works directly with AA and inherits only the intrinsic conditioning from Theorem 2. Comparing the two:

  • Small residual (tanθ0\tan\theta \approx 0): the problem is κ2(A)\kappa_2(A) conditioned. QR delivers that. Normal equations still pay κ2(A)2\kappa_2(A)^2, so they are strictly worse, sometimes catastrophically.

  • Large residual: the problem itself is already κ2(A)2\kappa_2(A)^2 conditioned. Normal equations are no worse than the inherent sensitivity; QR offers no accuracy advantage here.

The asymmetry is the point: QR is never worse and is much better whenever the residual is small, which is the typical regime for a well-posed fitting problem.

Algorithm

The QR-based algorithm is:

  1. Factor: compute A=Q1R1A = Q_1 R_1 via Householder

  2. Transform: compute b~=Q1Tb\tilde{\mathbf{b}} = Q_1^T\mathbf{b}

  3. Solve: back substitution on R1x^=b~R_1\hat{\mathbf{x}} = \tilde{\mathbf{b}}

Since we never form ATAA^T A, the effective condition number is κ(A)\kappa(A) rather than κ(A)2\kappa(A)^2 in the small-residual regime.

Remark 3 (Normal Equations)

An alternative derivation uses the orthogonality condition rR(A)\mathbf{r} \perp \text{R}(A) directly to obtain ATAx^=ATbA^TA\hat{\mathbf{x}} = A^T\mathbf{b} (the normal equations). While mathematically equivalent, solving the normal equations is numerically dangerous because κ(ATA)=κ(A)2\kappa(A^TA) = \kappa(A)^2 (exercise).

Linear Regression

The most common application of least squares is fitting a model to data.

Example 1 (Polynomial Fitting)

Given NN data points (ti,yi)(t_i, y_i), fit a polynomial p(t)=c0+c1t+c2t2p(t) = c_0 + c_1 t + c_2 t^2:

(1t1t121t2t221tNtN2)X(c0c1c2)β=(y1y2yN)y\underbrace{\begin{pmatrix} 1 & t_1 & t_1^2 \\ 1 & t_2 & t_2^2 \\ \vdots & \vdots & \vdots \\ 1 & t_N & t_N^2 \end{pmatrix}}_{X} \underbrace{\begin{pmatrix} c_0 \\ c_1 \\ c_2 \end{pmatrix}}_{\boldsymbol{\beta}} = \underbrace{\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}}_{\mathbf{y}}

With N>3N > 3 data points, this system is overdetermined. Solve via QR: compute X=Q^R^X = \hat{Q}\hat{R}, then R^β^=Q^Ty\hat{R}\hat{\boldsymbol{\beta}} = \hat{Q}^T\mathbf{y}.