Least Squares Approximation#
Find the least squares approximation of the system \(A \boldsymbol{x} \approx \boldsymbol{b}\) by minimizing the distance \(\| A \boldsymbol{x} - \boldsymbol{b}\|\). There are several methods for computing the approximation, including the normal equations and the QR equations.

Definition#
Let \(A\) be an \(m \times n\) matrix with \(m > n\) and \(\mathrm{rank}(A) = n\). The least squares approximation of the system \(A \boldsymbol{x} \approx \boldsymbol{b}\) is the vector \(\boldsymbol{x}\) which minimizes the distance \(\| A\boldsymbol{x} - \boldsymbol{b} \|\).
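As a concrete illustration of the definition, here is a minimal NumPy sketch using made-up example data: `numpy.linalg.lstsq` computes the minimizer of \(\| A\boldsymbol{x} - \boldsymbol{b} \|\), and perturbing that solution can only increase the residual.

```python
import numpy as np

# Made-up example data: a 4x2 overdetermined system with m > n and rank(A) = n
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0, 4.0])

# numpy.linalg.lstsq returns the vector x minimizing || A x - b ||
x, *_ = np.linalg.lstsq(A, b, rcond=None)

# Any other vector gives a residual at least as large
x_other = x + np.array([0.1, -0.2])
print(np.linalg.norm(A @ x - b) <= np.linalg.norm(A @ x_other - b))  # True
```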
Normal Equations#
Let \(U \subseteq \mathbb{R}^n\) be a subspace and let \(\{ \boldsymbol{u}_1, \dots, \boldsymbol{u}_m \}\) be a basis (not necessarily orthogonal). Assemble the matrix \(A\) by setting its columns equal to the basis vectors of \(U\). Then projecting a vector \(\boldsymbol{x}\) onto \(U\) is equivalent to determining the linear combination \(A\boldsymbol{y}\) closest to \(\boldsymbol{x}\). From the geometry we expect the error \(\boldsymbol{e} = \boldsymbol{x} - A\boldsymbol{y}\) to be perpendicular to the subspace \(U\), or equivalently, perpendicular to all the basis vectors of \(U\).
Assembling these conditions into matrix form, the requirement that \(\boldsymbol{u}_i^T (\boldsymbol{x} - A\boldsymbol{y}) = 0\) for each basis vector becomes

$$
A^T ( \boldsymbol{x} - A \boldsymbol{y} ) = \boldsymbol{0}
\qquad \Longleftrightarrow \qquad
A^T A \, \boldsymbol{y} = A^T \boldsymbol{x}
$$
Note that \(U = R(A)\) and so \(\boldsymbol{e} \in R(A)^\perp = N(A^T)\).
Is \(A^T A\) invertible? Yes: the columns of \(A\) are linearly independent since they form a basis of \(U\), therefore \(A^T A\) has full rank and is invertible.
This means that we can determine \(\boldsymbol{y}\) uniquely by solving the above linear system. In particular,

$$
\boldsymbol{y} = \left( A^T A \right)^{-1} A^T \boldsymbol{x}
\qquad \text{and so} \qquad
\mathrm{proj}_U(\boldsymbol{x}) = A \boldsymbol{y} = A \left( A^T A \right)^{-1} A^T \boldsymbol{x}
$$
Thus the projection matrix onto the subspace \(U\) is given by \(P := A\left(A^T A \right)^{-1} A^T\).
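A short NumPy sketch of this construction, with made-up basis vectors: form \(P = A(A^TA)^{-1}A^T\), project an example vector, and check that \(P\) is idempotent and that the error is orthogonal to the columns of \(A\).

```python
import numpy as np

# Example subspace U of R^3 spanned by two (non-orthogonal) basis vectors
A = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])

# Projection matrix P = A (A^T A)^{-1} A^T
P = A @ np.linalg.inv(A.T @ A) @ A.T

x = np.array([1.0, 2.0, 3.0])
proj = P @ x            # projection of x onto U
e = x - proj            # error vector

print(np.allclose(P @ P, P))        # P is idempotent
print(np.allclose(A.T @ e, 0.0))    # error is orthogonal to U
```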
Let \(A\) be an \(m \times n\) matrix with \(m > n\) and \(\mathrm{rank}(A) = n\). The least squares approximation of the system \(A \boldsymbol{x} \approx \boldsymbol{b}\) is the unique solution of the system

$$
A^T A \boldsymbol{x} = A^T \boldsymbol{b}
$$
The system is called the normal equations.
Proof
If \(\boldsymbol{x} \in \mathbb{R}^n\), then \(A \boldsymbol{x} \in R(A)\). The projection theorem states that the point in \(R(A)\) nearest to \(\boldsymbol{b} \in \mathbb{R}^m\) is the orthogonal projection of \(\boldsymbol{b}\) onto \(R(A)\). If \(\boldsymbol{x}\) is the vector such that \(A\boldsymbol{x} = \mathrm{proj}_{R(A)}(\boldsymbol{b})\), then \(A\boldsymbol{x} - \boldsymbol{b}\) is in \(R(A)^{\perp} = N(A^T)\) and therefore

$$
A^T ( A \boldsymbol{x} - \boldsymbol{b} ) = \boldsymbol{0}
\qquad \Longleftrightarrow \qquad
A^T A \boldsymbol{x} = A^T \boldsymbol{b}
$$
We assume \(\mathrm{rank}(A) = n\), therefore \(A^TA\) is nonsingular and the solution exists and is unique.
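A minimal NumPy sketch of solving the normal equations for an arbitrary made-up system (note that \(A^TA\) can be ill-conditioned in practice, which is one motivation for the QR approach below):

```python
import numpy as np

# Made-up overdetermined example system with rank(A) = n
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([1.0, 3.0, 2.0, 5.0])

# Solve the normal equations A^T A x = A^T b
x = np.linalg.solve(A.T @ A, A.T @ b)

# Agrees with NumPy's built-in least squares solver
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x, x_lstsq))  # True
```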
QR Equations#
Let \(A\) be an \(m \times n\) matrix with \(m > n\) and \(\mathrm{rank}(A) = n\). The least squares approximation of the system \(A \boldsymbol{x} \approx \boldsymbol{b}\) is the solution of the system of equations

$$
R_1 \boldsymbol{x} = Q_1^T \boldsymbol{b}
$$

where \(A = Q_1 R_1\) is the thin QR decomposition. The system is called the QR equations. Furthermore, the residual is given by

$$
\| A \boldsymbol{x} - \boldsymbol{b} \| = \| Q_2^T \boldsymbol{b} \|
$$

where \(A = QR\) is the QR decomposition with \(Q = [ Q_1 \ \ Q_2 ]\).
Proof
The matrix \(Q\) is orthogonal, therefore

$$
\| A \boldsymbol{x} - \boldsymbol{b} \|^2
= \| Q^T ( A \boldsymbol{x} - \boldsymbol{b} ) \|^2
= \left\| \begin{bmatrix} R_1 \boldsymbol{x} - Q_1^T \boldsymbol{b} \\ - Q_2^T \boldsymbol{b} \end{bmatrix} \right\|^2
= \| R_1 \boldsymbol{x} - Q_1^T \boldsymbol{b} \|^2 + \| Q_2^T \boldsymbol{b} \|^2
$$

where we use the Pythagorean theorem in the last equality. The vector \(Q_2^T \boldsymbol{b}\) does not depend on \(\boldsymbol{x}\), therefore the minimum value of \(\| A \boldsymbol{x} - \boldsymbol{b} \|\) occurs when \(R_1 \boldsymbol{x} = Q_1^T \boldsymbol{b}\), and the residual is \(\| A \boldsymbol{x} - \boldsymbol{b} \| = \| Q_2^T \boldsymbol{b} \|\).
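A sketch of the QR equations in NumPy/SciPy, reusing the made-up example system from the normal equations sketch above: the thin QR decomposition gives \(Q_1\) and \(R_1\), and back substitution solves \(R_1\boldsymbol{x} = Q_1^T\boldsymbol{b}\).

```python
import numpy as np
from scipy.linalg import solve_triangular

# Made-up overdetermined example system
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([1.0, 3.0, 2.0, 5.0])

# Thin QR decomposition: Q1 is m x n, R1 is n x n upper triangular
Q1, R1 = np.linalg.qr(A)                     # default mode='reduced'
x = solve_triangular(R1, Q1.T @ b)           # back substitution for R1 x = Q1^T b

# Check the residual formula || A x - b || = || Q2^T b ||
Q, R = np.linalg.qr(A, mode='complete')
Q2 = Q[:, A.shape[1]:]
print(np.isclose(np.linalg.norm(A @ x - b), np.linalg.norm(Q2.T @ b)))  # True
```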
Fitting Models to Data#
Suppose we have \(m\) points

$$
(t_1,y_1), (t_2,y_2), \dots, (t_m,y_m)
$$

and we want to find a line

$$
y = c_0 + c_1 t
$$

that “best fits” the data. There are different ways to quantify what “best fits” means but the most common method is called least squares linear regression. In least squares linear regression, we want to minimize the sum of squared errors

$$
SSE = \sum_{i=1}^m \left( y_i - (c_0 + c_1 t_i) \right)^2
$$

In matrix notation, the sum of squared errors is

$$
SSE = \| A \boldsymbol{c} - \boldsymbol{y} \|^2
$$

where

$$
A = \begin{bmatrix} 1 & t_1 \\ 1 & t_2 \\ \vdots & \vdots \\ 1 & t_m \end{bmatrix}
\qquad
\boldsymbol{c} = \begin{bmatrix} c_0 \\ c_1 \end{bmatrix}
\qquad
\boldsymbol{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}
$$
We assume that \(m \geq 2\) and \(t_i \not= t_j\) for all \(i \not= j\), which implies \(\mathrm{rank}(A) = 2\). Therefore the vector of coefficients \(\boldsymbol{c}\) is the unique least squares approximation of the system \(A \boldsymbol{c} \approx \boldsymbol{y}\). See Wikipedia:Simple linear regression.
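A sketch of least squares linear regression in NumPy for some made-up data points \((t_i, y_i)\): build the \(m \times 2\) matrix \(A\), solve the normal equations, and read off the intercept and slope.

```python
import numpy as np

# Made-up data points (t_i, y_i)
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# Design matrix: a column of ones for the intercept and a column of t values for the slope
A = np.column_stack([np.ones(t.size), t])

# Least squares coefficients c = [c0, c1] from the normal equations
c = np.linalg.solve(A.T @ A, A.T @ y)
print(c)   # intercept c0 and slope c1 of the best fit line y = c0 + c1 t
```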
More generally, given \(m\) data points

$$
(t_1,y_1), (t_2,y_2), \dots, (t_m,y_m)
$$

and a model function \(f(t,\boldsymbol{c})\) which depends on parameters \(c_1,\dots,c_n\), the least squares data fitting problem consists of computing parameters \(c_1,\dots,c_n\) which minimize the sum of squared errors

$$
SSE = \sum_{i=1}^m \left( y_i - f(t_i,\boldsymbol{c}) \right)^2
$$

If the model function is of the form

$$
f(t,\boldsymbol{c}) = c_1 f_1(t) + c_2 f_2(t) + \cdots + c_n f_n(t)
$$

for some functions \(f_1(t),\dots,f_n(t)\), then we say the data fitting problem is linear (but note the functions \(f_1,\dots,f_n\) are not necessarily linear). In the linear case, we use matrix notation to write the sum of squared errors as

$$
SSE = \| A \boldsymbol{c} - \boldsymbol{y} \|^2
$$

where

$$
A = \begin{bmatrix} f_1(t_1) & f_2(t_1) & \cdots & f_n(t_1) \\ f_1(t_2) & f_2(t_2) & \cdots & f_n(t_2) \\ \vdots & \vdots & & \vdots \\ f_1(t_m) & f_2(t_m) & \cdots & f_n(t_m) \end{bmatrix}
\qquad
\boldsymbol{c} = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix}
\qquad
\boldsymbol{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}
$$

We assume that \(m \geq n\) and the functions \(f_1,\dots,f_n\) are linearly independent (which implies \(\mathrm{rank}(A) = n\)). Therefore the vector of coefficients \(\boldsymbol{c}\) is the least squares approximation of the system \(A \boldsymbol{c} \approx \boldsymbol{y}\).
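For instance, a quadratic model \(f(t,\boldsymbol{c}) = c_1 + c_2 t + c_3 t^2\) is a linear data fitting problem with basis functions \(1, t, t^2\). A short NumPy sketch with made-up data:

```python
import numpy as np

# Made-up data points (t_i, y_i)
t = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([1.0, 0.4, 0.3, 0.8, 2.1, 4.0])

# Basis functions f_1(t) = 1, f_2(t) = t, f_3(t) = t^2 (not linear in t,
# but the model is linear in the coefficients c)
A = np.column_stack([np.ones(t.size), t, t**2])

# Least squares coefficients via NumPy's built-in solver
c, *_ = np.linalg.lstsq(A, y, rcond=None)
print(c)   # coefficients c1, c2, c3 of the best fit parabola
```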
Exercises#
Let \(A = QR\) where
Find the least squares approximation \(A\boldsymbol{x} \approx \boldsymbol{b}\) where
Solution
Set up (but do not solve) a linear system \(A \boldsymbol{c} = \boldsymbol{y}\) where the solution is the coefficient vector
such that the function
best fits the data \((0,1),(1/4,3),(1/2,2),(3/4,-1),(1,0)\).