# Errors and Floating-Point Arithmetic

University of Massachusetts Amherst
## Absolute and Relative Error

Before we can analyze floating-point arithmetic, we need precise definitions of error. Given an approximation $x^*$ to a true value $x$, the **absolute error** is $|x^* - x|$, and the **relative error** is $|x^* - x| / |x|$, defined for $x \neq 0$.
## Why Relative Error Matters

**Example 1 (Same Absolute Error, Different Quality)**
| | $x = 1$, $x^* = 2$ | $x = 10^6$, $x^* = 10^6 + 1$ |
|---|---|---|
| Absolute error | $1$ | $1$ |
| Relative error | $100\%$ | $0.0001\%$ |
| Quality | Terrible | Excellent |
Both have the same absolute error, but the first is off by 100% while the second is nearly perfect. Relative error captures what matters.
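The two columns of the table can be checked directly. A minimal Python sketch (the helper names `abs_err` and `rel_err` are ours, not from any standard library):

```python
def abs_err(x, x_approx):
    """Absolute error |x* - x|."""
    return abs(x_approx - x)

def rel_err(x, x_approx):
    """Relative error |x* - x| / |x| (assumes x != 0)."""
    return abs(x_approx - x) / abs(x)

# Same absolute error, wildly different relative error:
print(abs_err(1.0, 2.0), rel_err(1.0, 2.0))          # 1.0 1.0  (100%)
print(abs_err(1e6, 1e6 + 1), rel_err(1e6, 1e6 + 1))  # 1.0 1e-06  (0.0001%)
```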
## Floating-Point Numbers as Approximations

We need to represent real numbers across vast scales, from Avogadro's number ($6.02 \times 10^{23}$) to Planck's constant ($6.63 \times 10^{-34}$), using finite resources. The solution is **floating-point arithmetic**: a finite set of numbers $\mathbb{F}$ that approximates the real line $\mathbb{R}$, with a guaranteed bound on the relative error of the approximation.
The key fact is this: for any real number $x$ within the representable range, its floating-point representation $\text{fl}(x)$ satisfies

$$\text{fl}(x) = x(1 + \varepsilon), \qquad |\varepsilon| \leq \mu,$$

where $\mu$ is **machine epsilon** (or **unit roundoff**). This says that floating-point representation introduces a small relative error, with the same relative accuracy whether $x$ is tiny or huge.
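We can verify this guarantee exactly using Python's `fractions` module, which recovers the exact rational value a float stores (the choice of $x = 1/10$ is just an illustrative example, since $0.1$ has no finite binary expansion):

```python
from fractions import Fraction

mu = 2.0 ** -53  # unit roundoff for IEEE double precision, ~1.1e-16

# Fraction(0.1) gives the exact value the hardware actually stores for 0.1.
exact = Fraction(1, 10)
stored = Fraction(0.1)
rel_error = abs(stored - exact) / exact

print(float(rel_error))        # about 5.55e-17
print(float(rel_error) <= mu)  # True: |eps| <= mu, as the guarantee promises
```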
## Solving the Mystery: The Finite Difference Trade-off

In the previous chapter, we observed that finite difference errors increase for very small step sizes. Now we have the tools to explain why.
### The Computation on a Machine

On a computer, function values carry round-off error. By the floating-point guarantee (3), the computed values satisfy
$$\widetilde{f}(x+h) = f(x+h)(1 + \varepsilon_1), \qquad \widetilde{f}(x) = f(x)(1 + \varepsilon_2), \qquad |\varepsilon_i| \leq \mu.$$

So the computed forward difference is
D ~ h f ( x ) : = f ~ ( x + h ) − f ~ ( x ) h . \widetilde{D}_h f(x) \;:=\; \frac{\widetilde{f}(x+h) - \widetilde{f}(x)}{h}. D h f ( x ) := h f ( x + h ) − f ( x ) . Separating the Two Errors ¶ Expanding:
$$\widetilde{D}_h f(x) = \underbrace{\frac{f(x+h) - f(x)}{h}}_{\text{exact finite difference}} \;+\; \underbrace{\frac{f(x+h)\,\varepsilon_1 - f(x)\,\varepsilon_2}{h}}_{\text{round-off perturbation}}.$$

Applying Taylor's theorem to the first term,
$$\frac{f(x+h) - f(x)}{h} = f'(x) + \frac{h}{2}f''(\xi)$$

for some $\xi \in (x, x+h)$, so the total error is
$$\widetilde{D}_h f(x) - f'(x) = \underbrace{\frac{f(x+h)\,\varepsilon_1 - f(x)\,\varepsilon_2}{h}}_{\text{round-off error}} \;+\; \underbrace{\frac{h}{2}f''(\xi)}_{\text{truncation error}}.$$

Since $|\varepsilon_i| \leq \mu$ and $f(x+h) \approx f(x)$ for small $h$:
$$\left|\widetilde{D}_h f(x) - f'(x)\right| \;\leq\; \underbrace{\frac{2\mu\,|f(x)|}{h}}_{\text{round-off}} \;+\; \underbrace{\frac{h}{2}\,|f''(\xi)|}_{\text{truncation}}.$$

The round-off term $2\mu|f(x)|/h$ grows as $h \to 0$ because we are subtracting nearly equal numbers. The absolute error of $\widetilde{f}(x+h) - \widetilde{f}(x)$ stays roughly constant at $(|f(x+h)| + |f(x)|)\mu$, but dividing by $h$ amplifies it. More generally, subtracting nearly equal floating-point numbers destroys relative accuracy; this is called **catastrophic cancellation**.
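The trade-off is easy to observe numerically. A minimal sketch using $f(x) = e^x$ at $x = 1$ as a test function (our choice; the exact derivative is $e$, so the error is computable):

```python
import math

# Forward difference for f(x) = exp(x) at x = 1. Shrinking h first helps
# (smaller truncation error, ~h/2 * |f''|) and then hurts: the subtraction
# exp(1+h) - exp(1) cancels catastrophically, and dividing by h amplifies
# the surviving round-off, giving an error that grows like 2*mu*|f(x)|/h.
f = math.exp
x = 1.0
exact = math.exp(1.0)  # f'(x) = exp(x)

for h in (1e-4, 1e-8, 1e-12):
    approx = (f(x + h) - f(x)) / h
    print(f"h = {h:.0e}   error = {abs(approx - exact):.2e}")
```

The error at $h = 10^{-8}$ is far smaller than at $h = 10^{-4}$, but pushing on to $h = 10^{-12}$ makes things worse again.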
### The Optimal Step Size

Writing $E(h)$ for the bound in (11) and minimizing by setting $dE/dh = 0$:
$$-\frac{2\mu|f(x)|}{h^2} + \frac{|f''(\xi)|}{2} = 0 \quad \implies \quad h_{\text{opt}} = 2\sqrt{\frac{\mu|f(x)|}{|f''(\xi)|}} \approx \sqrt{\mu},$$

where the last approximation assumes $|f(x)| \approx |f''(\xi)|$.
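The prediction $h_{\text{opt}} \approx \sqrt{\mu}$ can be checked empirically. A sketch, again using $f(x) = e^x$ at $x = 1$ as a convenient test function where $|f(x)| \approx |f''(\xi)|$ so the constant factors drop out:

```python
import math

# Sweep step sizes h = 10^-1, ..., 10^-14 and record the forward-difference
# error. With double-precision mu ~ 1.1e-16, the error should bottom out
# near h ~ sqrt(mu) ~ 1e-8.
f = math.exp
x = 1.0
exact = math.exp(1.0)

errors = {}
for k in range(1, 15):
    h = 10.0 ** (-k)
    errors[h] = abs((f(x + h) - f(x)) / h - exact)

h_best = min(errors, key=errors.get)
print(f"best h = {h_best:.0e}, error = {errors[h_best]:.2e}")
```

The minimizing $h$ typically lands near $10^{-8}$, matching the $\sqrt{\mu}$ prediction for double precision.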
| Precision | Machine epsilon $\mu$ | Optimal $h$ (forward diff) |
|---|---|---|
| Single | $\sim 10^{-8}$ | $\sim 10^{-4}$ |
| Double | $\sim 10^{-16}$ | $\sim 10^{-8}$ |