The Problem

Given $A \in \mathbb{R}^{m \times n}$ with $m > n$ and $b \in \mathbb{R}^m$, the system $Ax = b$ typically has no solution.

Why? The vector $b$ usually doesn't lie in the column space of $A$.

Goal: Find $\hat{x}$ that minimizes the residual:

$$\hat{x} = \arg\min_{x \in \mathbb{R}^n} \|Ax - b\|_2$$
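
As a minimal sketch (the numbers below are made up purely for illustration), NumPy's `np.linalg.lstsq` computes this minimizer directly:

```python
import numpy as np

# Overdetermined system: 4 equations, 2 unknowns (values chosen only for illustration)
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([0.1, 0.9, 2.1, 2.9])       # not exactly in the column space of A

x_hat, res_ss, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(x_hat)                              # the minimizer of ||Ax - b||_2
print(np.linalg.norm(A @ x_hat - b))      # smallest achievable residual norm
```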

Linear Regression Example

The most common application: fitting a model to data.

Example: Given $N$ data points $(t_i, y_i)$, fit a polynomial $p(t) = c_0 + c_1 t + c_2 t^2$:

$$\begin{pmatrix} 1 & t_1 & t_1^2 \\ 1 & t_2 & t_2^2 \\ \vdots & \vdots & \vdots \\ 1 & t_N & t_N^2 \end{pmatrix} \begin{pmatrix} c_0 \\ c_1 \\ c_2 \end{pmatrix} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$$

With $N > 3$ data points, this system is overdetermined.
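
A small sketch of building this design matrix and solving the fit (again with made-up data), using `np.vander` for the powers of $t$:

```python
import numpy as np

t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])    # illustrative sample points, N = 5 > 3
y = np.array([1.0, 1.3, 2.1, 3.2, 4.9])    # illustrative observations

# Columns [1, t, t^2] of the design matrix above
V = np.vander(t, 3, increasing=True)        # shape (N, 3)

c, *_ = np.linalg.lstsq(V, y, rcond=None)   # least squares coefficients c0, c1, c2
print(c)
```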

Geometric Interpretation

The least squares solution finds the point in $\text{R}(A)$ closest to $b$:

        b
       /|
      / |  residual r = b - Ax̂
     /  |  is perpendicular to R(A)
    Ax̂--+---- R(A) (column space)

Key insight: The residual $r = b - A\hat{x}$ is orthogonal to the column space of $A$.
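
A quick numerical check of this orthogonality, using a random matrix purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))            # random full-column-rank A, illustrative only
b = rng.standard_normal(6)

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
r = b - A @ x_hat

# The residual is (numerically) orthogonal to every column of A
print(A.T @ r)                             # ~ zero vector

# A @ x_hat is the closest point in R(A): any other x gives a larger residual
x_other = x_hat + rng.standard_normal(3)
print(np.linalg.norm(b - A @ x_other) >= np.linalg.norm(r))   # True
```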

The Normal Equations

From the orthogonality condition $r \perp \text{R}(A)$, we can derive:

Derivation: We need $(b - A\hat{x}) \perp Av$ for all $v \in \mathbb{R}^n$.

$$(Av)^T (b - A\hat{x}) = 0 \quad \forall v$$

$$v^T A^T (b - A\hat{x}) = 0 \quad \forall v$$

This requires $A^T (b - A\hat{x}) = 0$, giving $A^T A \hat{x} = A^T b$.
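
As a sanity check (random data, illustrative only), solving the normal equations directly reproduces the solution from `np.linalg.lstsq` when $A$ is well conditioned:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 3))             # illustrative full-column-rank matrix
b = rng.standard_normal(8)

# Solve the normal equations A^T A x = A^T b directly
# (fine here, but see the conditioning warning below)
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Reference solution from an SVD-based solver
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_normal, x_lstsq))       # True for well-conditioned A
```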

Why Are They Called “Normal” Equations?

The name comes from the fact that the residual is normal (perpendicular) to the column space—not because they’re “standard” equations.

Properties of $A^T A$

When $A$ has full column rank:

| Property | Statement |
| --- | --- |
| Symmetric | $(A^T A)^T = A^T A$ |
| Positive definite | $x^T A^T A x = \lVert Ax \rVert_2^2 > 0$ for $x \neq 0$ |
| Invertible | Follows from positive definiteness |
| Condition number | $\kappa(A^T A) = \kappa(A)^2$ ⚠️ |
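
These properties are easy to verify numerically; the sketch below uses a random full-column-rank matrix purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 4))                   # illustrative full-column-rank A
G = A.T @ A                                        # the Gram matrix A^T A

print(np.allclose(G, G.T))                         # symmetric
print(np.all(np.linalg.eigvalsh(G) > 0))           # positive definite: all eigenvalues > 0
print(np.isclose(np.linalg.cond(G),
                 np.linalg.cond(A) ** 2))          # condition number is squared
```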

Minimization Viewpoint

The least squares problem is equivalent to minimizing:

$$f(x) = \|Ax - b\|_2^2 = x^T A^T A x - 2 b^T A x + b^T b$$

Taking the gradient and setting to zero:

$$\nabla f(x) = 2 A^T A x - 2 A^T b = 0$$

yields the normal equations.

Observation: This is a quadratic function with positive definite Hessian $2A^T A$, so there's a unique global minimum.
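
A short sketch (random data, illustrative only) that checks the gradient formula against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((7, 3))
b = rng.standard_normal(7)
x = rng.standard_normal(3)

f = lambda x: np.sum((A @ x - b) ** 2)             # f(x) = ||Ax - b||_2^2
grad_analytic = 2 * A.T @ A @ x - 2 * A.T @ b      # gradient from the formula above

# Central finite differences as an independent check
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(grad_analytic, grad_fd, atol=1e-4))   # True
```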

Multiple Linear Regression

In statistics notation, the least squares problem for regression:

$$\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2$$

where:

- $y \in \mathbb{R}^N$ is the vector of observed responses,
- $X \in \mathbb{R}^{N \times p}$ is the design matrix of predictors,
- $\beta \in \mathbb{R}^p$ is the vector of regression coefficients.

The columns of $X$ typically include a column of ones (for the intercept).
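
A minimal regression sketch with two made-up predictors and an intercept column:

```python
import numpy as np

# Illustrative data: two predictors and a noisy linear response
rng = np.random.default_rng(4)
n = 50
x1 = rng.uniform(0, 1, n)
x2 = rng.uniform(0, 1, n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + 0.1 * rng.standard_normal(n)

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), x1, x2])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)    # approximately [1, 2, -3]
```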

The Pseudoinverse

The matrix $(A^T A)^{-1} A^T$ (for $A$ with full column rank) is called the Moore-Penrose pseudoinverse $A^+$:

$$A^+ = (A^T A)^{-1} A^T$$

So $\hat{x} = A^+ b$.

Properties:

- $A^+ A = I_n$ when $A$ has full column rank,
- $A A^+$ is the orthogonal projector onto $\text{R}(A)$,
- $A^+$ reduces to $A^{-1}$ when $A$ is square and invertible.
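
A brief check (random data, illustrative only) that `np.linalg.pinv`, the explicit formula, and `lstsq` agree for a full-column-rank matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 3))                   # illustrative full-column-rank A
b = rng.standard_normal(6)

A_pinv = np.linalg.pinv(A)                        # Moore-Penrose pseudoinverse (via SVD)
print(np.allclose(A_pinv, np.linalg.inv(A.T @ A) @ A.T))    # matches the formula above
print(np.allclose(A_pinv @ A, np.eye(3)))                   # A^+ A = I for full column rank
print(np.allclose(A_pinv @ b,
                  np.linalg.lstsq(A, b, rcond=None)[0]))    # same least squares solution
```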

Numerical Considerations

Forming $A^T A$ explicitly squares the condition number ($\kappa(A^T A) = \kappa(A)^2$), so solving the normal equations can be noticeably less accurate than a QR- or SVD-based solver when $A$ is ill conditioned. In practice, prefer `np.linalg.lstsq` (or an explicit QR factorization) over forming $A^T A$.

Example: Polynomial Fitting

```python
import numpy as np

# Data points
t = np.array([0, 1, 2, 3, 4])
y = np.array([1.0, 2.1, 3.9, 8.2, 15.8])

# Design matrix for quadratic fit
X = np.column_stack([np.ones_like(t), t, t**2])

# Normal equations (don't do this!)
# beta_bad = np.linalg.solve(X.T @ X, X.T @ y)

# Better: use lstsq, which uses an SVD-based solver internally
beta, residuals, rank, s = np.linalg.lstsq(X, y, rcond=None)
print(beta)   # fitted coefficients [c0, c1, c2]
```

Residual Analysis

After finding $\hat{x}$, the residual is:

$$r = b - A\hat{x}$$

Properties:

- $A^T r = 0$: the residual is orthogonal to every column of $A$,
- $\|r\|_2$ is the smallest achievable value of $\|Ax - b\|_2$,
- $r = 0$ exactly when $b \in \text{R}(A)$.

The residual measures how well the model fits the data.
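
Continuing the polynomial fitting example above, a short sketch that computes the residual and checks these properties:

```python
import numpy as np

# Same illustrative data as the quadratic fit example
t = np.array([0, 1, 2, 3, 4])
y = np.array([1.0, 2.1, 3.9, 8.2, 15.8])
X = np.column_stack([np.ones_like(t), t, t**2])

beta, residual_ss, rank, s = np.linalg.lstsq(X, y, rcond=None)

r = y - X @ beta
print(X.T @ r)                  # ~ zero: residual orthogonal to the columns of X
print(np.linalg.norm(r) ** 2)   # sum of squared residuals
print(residual_ss)              # same value, as returned by lstsq for full-rank X
```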

Summary

| Concept | Formula |
| --- | --- |
| Least squares problem | $\min_x \lVert Ax - b \rVert_2$ |
| Normal equations | $A^T A \hat{x} = A^T b$ |
| Solution (if full rank) | $\hat{x} = (A^T A)^{-1} A^T b$ |
| Residual | $r = b - A\hat{x} \perp \text{R}(A)$ |
| Condition number issue | $\kappa(A^T A) = \kappa(A)^2$ |