
Big Idea

Every computation on a computer introduces small errors because real numbers must be stored in finite precision. The key guarantee: the relative error of representing any real number is bounded by machine epsilon. This simple fact explains the mysterious finite difference behavior from Chapter 1 — and understanding it is the first step toward writing robust numerical code.

Absolute and Relative Error

Before we can analyze floating-point arithmetic, we need precise definitions of error.

Definition 1 (Absolute and Relative Error)

Let $x$ be the true value and $x^*$ be a numerical approximation.

Absolute error:

$$\text{Abs}(x) = |x - x^*|$$

Relative error:

$$\text{Rel}(x) = \frac{|x - x^*|}{|x|}, \quad x \neq 0$$

Why Relative Error Matters

Example 1 (Same Absolute Error, Different Quality)

|                | $x = 1$, $x^* = 2$ | $x = 10^6$, $x^* = 10^6 + 1$ |
| -------------- | ------------------ | ---------------------------- |
| Absolute error | $1$                | $1$                          |
| Relative error | $100\%$            | $0.0001\%$                   |
| Quality        | Terrible           | Excellent                    |

Both have the same absolute error, but the first is off by 100% while the second is nearly perfect. Relative error captures what matters.
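
To make this concrete, here is a minimal sketch in Python (our illustration; the helper names are ours, not from the text) that reproduces the table:

```python
def abs_err(x, x_star):
    """Absolute error |x - x*|."""
    return abs(x - x_star)

def rel_err(x, x_star):
    """Relative error |x - x*| / |x|, defined for x != 0."""
    return abs(x - x_star) / abs(x)

for x, x_star in [(1, 2), (10**6, 10**6 + 1)]:
    print(abs_err(x, x_star), rel_err(x, x_star))
# 1 1.0    -> relative error 100%
# 1 1e-06  -> relative error 0.0001%
```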

Floating-Point Numbers as Approximations

We need to represent all real numbers, from Avogadro's number ($6.02 \times 10^{23}$) to Planck's constant ($6.63 \times 10^{-34}$), using finite resources. The solution is floating-point arithmetic: a finite set of numbers $\mathbb{F}$ that approximates the real line $\mathbb{R}$, with a guaranteed bound on the relative error of the approximation.

The key fact is this: for any real number $x$, its floating-point representation $\text{fl}(x)$ satisfies

$$\text{fl}(x) = x(1 + \varepsilon), \quad |\varepsilon| \leq \mu$$

where $\mu$ is machine epsilon (or unit roundoff). This says that floating-point representation introduces a small relative error: the same relative accuracy whether $x$ is tiny or huge.
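
We can observe this bound directly. The following sketch (our addition, using Python's exact rational arithmetic from `fractions`) measures the exact relative error of storing $0.1$ as a double:

```python
from fractions import Fraction

x = Fraction(1, 10)    # the real number 1/10, held exactly
fl_x = Fraction(0.1)   # the double closest to 0.1, converted exactly
rel = abs(fl_x - x) / abs(x)

print(float(rel))              # ~5.55e-17
print(float(rel) <= 2.0**-53)  # True: within mu = 2^-53 ~ 1.11e-16
```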

Definition 2 (Machine Epsilon)

The machine epsilon $\mu$ is the smallest number such that

$$\frac{|\text{fl}(x) - x|}{|x|} \leq \mu \quad \text{for all } x \in \mathbb{R}$$

For a floating-point system with $t$ mantissa bits:

$$\mu = \frac{1}{2} \times 2^{-t}$$

| Precision        | Mantissa bits | Machine epsilon               | Decimal digits |
| ---------------- | ------------- | ----------------------------- | -------------- |
| Single (float32) | 23            | $\approx 5.96 \times 10^{-8}$  | ~7             |
| Double (float64) | 52            | $\approx 1.11 \times 10^{-16}$ | ~16            |

Remark 1 (Practical Implication)

Single precision gives about 7 decimal digits of accuracy; double precision gives about 16. In particular, two real numbers that differ only beyond the 16th significant digit will typically round to the same double-precision value. We will see exactly how these bits encode numbers when we study IEEE 754 representation.
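
For reference, NumPy exposes these constants (a sketch we added; note that `np.finfo(...).eps` reports the gap between 1 and the next float, which is $2^{-t}$, i.e. twice the $\mu$ of this chapter):

```python
import numpy as np

for dtype in (np.float32, np.float64):
    eps = np.finfo(dtype).eps
    # eps = 2^-t is the spacing from 1.0 to the next representable number;
    # the unit roundoff mu used in this chapter is half of that.
    print(dtype.__name__, eps, eps / 2)
# float32 1.1920929e-07 5.9604645e-08
# float64 2.220446049250313e-16 1.1102230246251565e-16
```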

Solving the Mystery: The Finite Difference Trade-off

In the previous chapter, we observed that finite difference errors increase for very small step sizes. Now we have the tools to explain why.

The Computation on a Machine

On a computer, function values carry round-off error. By the floating-point guarantee (3), the computed values satisfy

$$\widetilde{f}(x+h) = f(x+h)(1 + \varepsilon_1), \quad \widetilde{f}(x) = f(x)(1 + \varepsilon_2), \qquad |\varepsilon_i| \leq \mu.$$

So the computed forward difference is

$$\widetilde{D}_h f(x) := \frac{\widetilde{f}(x+h) - \widetilde{f}(x)}{h}.$$
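
In code, this computed difference is a single line. A minimal sketch (ours, with $f = \sin$ as a stand-in test function):

```python
import math

def forward_diff(f, x, h):
    """Computed forward difference (f(x+h) - f(x)) / h.

    Both evaluations and the subtraction happen in floating point,
    so the result carries round-off as well as truncation error.
    """
    return (f(x + h) - f(x)) / h

print(forward_diff(math.sin, 1.0, 1e-5))  # ~0.54030, true f'(1) = cos 1
```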

Separating the Two Errors

Expanding:

$$\widetilde{D}_h f(x) = \underbrace{\frac{f(x+h) - f(x)}{h}}_{\text{exact finite difference}} \;+\; \underbrace{\frac{f(x+h)\,\varepsilon_1 - f(x)\,\varepsilon_2}{h}}_{\text{round-off perturbation}}.$$

Applying Taylor’s theorem to the first term:

$$\frac{f(x+h) - f(x)}{h} = f'(x) + \frac{h}{2}f''(\xi),$$

the total error is

$$\widetilde{D}_h f(x) - f'(x) = \underbrace{\frac{f(x+h)\,\varepsilon_1 - f(x)\,\varepsilon_2}{h}}_{\text{round-off error}} \;+\; \underbrace{\frac{h}{2}f''(\xi)}_{\text{truncation error}}.$$

Since $|\varepsilon_i| \leq \mu$ and $f(x+h) \approx f(x)$ for small $h$:

$$\left|\widetilde{D}_h f(x) - f'(x)\right| \;\leq\; \underbrace{\frac{2\mu\,|f(x)|}{h}}_{\text{round-off}} \;+\; \underbrace{\frac{h}{2}\,|f''(\xi)|}_{\text{truncation}}$$
Note

The two terms compete: truncation error decreases as $h \to 0$, while round-off error increases as $h \to 0$.
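
The competition is easy to see numerically. In the sketch below (our illustration, using $f = \sin$ at $x = 1$, where $f'(x) = \cos 1$), the error first shrinks as $h$ decreases and then grows again:

```python
import math

f, x, exact = math.sin, 1.0, math.cos(1.0)

for h in [1e-1, 1e-4, 1e-8, 1e-12]:
    approx = (f(x + h) - f(x)) / h
    print(f"h = {h:.0e}   error = {abs(approx - exact):.1e}")
# Large h: truncation error ~ h/2 |f''| dominates.
# Tiny h:  round-off ~ 2 mu |f| / h dominates; the sweet spot is near h ~ 1e-8.
```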

Remark 2 (Catastrophic Cancellation)

The round-off term $2\mu|f(x)|/h$ grows as $h \to 0$ because we are subtracting nearly equal numbers. The absolute error of $\widetilde{f}(x+h) - \widetilde{f}(x)$ stays roughly constant at $(|f(x+h)| + |f(x)|)\mu$, but dividing by $h$ amplifies it. More generally, subtracting nearly equal floating-point numbers destroys relative accuracy; this is called catastrophic cancellation.
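
A one-line demonstration (ours): in double precision, $1 + h$ for $h = 10^{-15}$ rounds to the nearest representable number, and subtracting 1 exposes that rounding as a large relative error.

```python
h = 1e-15
print((1.0 + h) - 1.0)  # 1.1102230246251565e-15, not 1e-15: ~11% relative error
# 1 + h is stored with relative error up to mu; the subtraction then cancels
# the leading digits exactly, so only the rounding noise of 1 + h survives.
```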

The Optimal Step Size

Write $E(h)$ for the right-hand side of the bound (11). Minimizing by setting $dE/dh = 0$:

$$-\frac{2\mu|f(x)|}{h^2} + \frac{|f''(\xi)|}{2} = 0 \quad \implies \quad h_{\text{opt}} = 2\sqrt{\frac{\mu|f(x)|}{|f''(\xi)|}} \approx \sqrt{\mu}$$

where the last approximation assumes $|f(x)| \approx |f''(\xi)|$ and keeps only the order of magnitude.

Remark 3 (Optimal Step Sizes)

| Precision | Machine epsilon $\mu$ | Optimal $h$ (forward diff) |
| --------- | --------------------- | -------------------------- |
| Single    | $\sim 10^{-8}$        | $\sim 10^{-4}$             |
| Double    | $\sim 10^{-16}$       | $\sim 10^{-8}$             |
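
The double-precision row is easy to confirm empirically. This closing sketch (ours; it scans step sizes for $f = \sin$ at $x = 1$) locates the step with the smallest error:

```python
import math

f, x, exact = math.sin, 1.0, math.cos(1.0)

# Error of the forward difference at h = 1e-1, ..., 1e-14.
errs = {10.0**-k: abs((f(x + 10.0**-k) - f(x)) / 10.0**-k - exact)
        for k in range(1, 15)}
h_best = min(errs, key=errs.get)

print(h_best)        # typically 1e-08, matching h_opt ~ sqrt(mu) for doubles
print(errs[h_best])  # error ~1e-8, orders of magnitude better than at the ends
```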