
Number Systems

Integers in Base 10

We typically write integers in base 10. A number $d_N d_{N-1} \dots d_1 d_0$ (where $d_i \in \{0, 1, \dots, 9\}$) represents:

$$\# = d_0 \cdot 10^0 + d_1 \cdot 10^1 + \cdots + d_N \cdot 10^N$$

Binary Representation

Computers use base 2 (binary), where digits $d_i \in \{0, 1\}$. A binary number $d_N d_{N-1} \dots d_1 d_0$ represents:

$$\# = \sum_{i=0}^{N} d_i \cdot 2^i$$
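As a quick sanity check, here is a minimal sketch (standard library only; the helper name `binary_to_int` is ours) that evaluates this sum directly and compares it against Python's built-in base-2 parser:

```python
def binary_to_int(bits: str) -> int:
    """Evaluate sum_{i=0}^{N} d_i * 2^i for a bit string d_N ... d_1 d_0."""
    return sum(int(d) * 2**i for i, d in enumerate(reversed(bits)))

print(binary_to_int("11010"))  # 26
print(int("11010", 2))         # 26 -- the built-in parser agrees
```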

Signed Integers: Two’s Complement

Negative integers use two’s complement representation. For an $N$-bit signed integer $a = d_{N-1} d_{N-2} \dots d_1 d_0$:

$$a = -d_{N-1} \cdot 2^{N-1} + \sum_{i=0}^{N-2} d_i \cdot 2^i$$
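As a sketch of this formula (the helper `twos_complement_value` is our own naming): the sign bit contributes $-2^{N-1}$ instead of $+2^{N-1}$, which is the same as parsing the bits as unsigned and subtracting $2^N$ when the top bit is set:

```python
def twos_complement_value(bits: str) -> int:
    """Apply a = -d_{N-1} * 2^{N-1} + sum_{i=0}^{N-2} d_i * 2^i to an N-bit string."""
    N = len(bits)
    unsigned = int(bits, 2)            # treats all bits as positive
    sign_bit = int(bits[0])
    return unsigned - sign_bit * 2**N  # flips the top bit's weight from +2^{N-1} to -2^{N-1}

print(twos_complement_value("11111111"))  # -1
print(twos_complement_value("10000000"))  # -128
print(twos_complement_value("01111111"))  # 127
```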

Fixed Point Notation

To represent fractions, we allow digits after a radix point:

$$d_N \dots d_1 d_0 \,.\, d_{-1} d_{-2} \dots d_{-M}$$

represents:

$$\# = \sum_{i=-M}^{N} d_i \cdot b^i$$

where $b$ is the base.

With finitely many digits, some numbers cannot be represented exactly (e.g., $\frac{1}{3}$ or $\pi$).
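The same issue appears in binary for numbers that look innocuous in base 10: $0.1$ has an infinite repeating base-2 expansion, so the stored double is only the nearest representable value. A quick check using the standard library:

```python
from decimal import Decimal

# Decimal(0.1) prints the exact value of the double nearest to 0.1
print(Decimal(0.1))      # 0.1000000000000000055511151231257827021181583404541015625
print(0.1 + 0.2 == 0.3)  # False: both operands carry representation error
```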

Floating Point Numbers

For scientific computing, we need to represent numbers of vastly different magnitudes, from Avogadro’s number ($6.02 \times 10^{23}$) to Planck’s constant ($6.63 \times 10^{-34}$).

Scientific notation allows the radix point to “float”:

$$245000 = 2.45 \times 10^5$$

In binary:

$$11000_2 = 1.1_2 \times 2^4, \qquad 0.0101_2 = 1.01_2 \times 2^{-2}$$

Note: In normalized binary scientific notation, the digit before the radix point is always 1, so we don’t need to store it!

IEEE 754 Standard

A floating point number consists of three parts: a sign, an exponent, and a mantissa.

Single Precision (32-bit)

| Component | Bits |
| --------- | ---- |
| Sign      | 1    |
| Exponent  | 8    |
| Mantissa  | 23   |

The value represented is:

$$\text{fl}(x) = \pm \left(1 + \frac{d_0}{2^1} + \frac{d_1}{2^2} + \cdots + \frac{d_{22}}{2^{23}}\right) \times 2^{E - 127}$$

where $d_i$ are the mantissa bits and $E$ is the stored exponent.
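To make the three fields concrete, here is a sketch (using the standard `struct` module; the function name is ours) that unpacks a float’s 32 bits and re-evaluates the formula above:

```python
import struct

def decode_float32(x: float):
    """Split a float32 bit pattern into sign, stored exponent, and mantissa."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign     = bits >> 31            # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits
    mantissa = bits & 0x7FFFFF       # 23 bits
    # fl(x) = ±(1 + M/2^23) * 2^(E-127) for normalized numbers
    value = (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)
    return sign, exponent, mantissa, value

print(decode_float32(6.5))  # (0, 129, 5242880, 6.5): 6.5 = +1.625 * 2^2
```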

Double Precision (64-bit)

| Component | Bits |
| --------- | ---- |
| Sign      | 1    |
| Exponent  | 11   |
| Mantissa  | 52   |

Integer Interpretation of Floating Point

The same 32 bits can be interpreted as either a float or an integer. Given the floating point representation $(S_x, E_x, m_x)$, the integer value is:

$$\text{Int}(x) = 2^{31} S_x + 2^{23} E_x + M_x$$

where $M_x = 2^{23} m_x$ is the integer value of the mantissa bits.

This dual interpretation is exploited in fast numerical algorithms like the fast inverse square root.
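For illustration, here is a hedged Python transcription of that trick (the magic constant 0x5F3759DF comes from the well-known Quake III source; the rest is our sketch). It works precisely because shifting and subtracting $\text{Int}(x)$ approximates arithmetic on the exponent:

```python
import struct

def fast_inv_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) by manipulating the float's integer interpretation."""
    i = struct.unpack(">I", struct.pack(">f", x))[0]  # Int(x): reinterpret bits
    i = 0x5F3759DF - (i >> 1)                         # integer shift ~ halving the exponent
    y = struct.unpack(">f", struct.pack(">I", i))[0]  # reinterpret back as a float
    return y * (1.5 - 0.5 * x * y * y)                # one Newton iteration to refine

print(fast_inv_sqrt(4.0))  # ~0.499 (exact: 0.5)
print(fast_inv_sqrt(2.0))  # ~0.707 (exact: 0.7071...)
```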

Rounding Error (Machine Epsilon)

Given a real number $x$, its floating point representation $\text{fl}(x)$ satisfies:

$$\frac{|\text{fl}(x) - x|}{|x|} \leq \mu$$

where $\mu$ is the machine epsilon (or unit roundoff):

$$\mu = \frac{1}{2} \times 2^{-t}$$

with $t$ being the number of mantissa bits.

| Precision      | Mantissa bits | Machine epsilon                |
| -------------- | ------------- | ------------------------------ |
| Single (float) | 23            | $\approx 5.96 \times 10^{-8}$  |
| Double         | 52            | $\approx 1.11 \times 10^{-16}$ |
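These values are easy to verify; a sketch assuming NumPy is available (note that `np.finfo(...).eps` reports the spacing $2^{-t}$ between 1 and the next representable float, i.e. twice the unit roundoff $\mu$):

```python
import numpy as np

for dtype, t in [(np.float32, 23), (np.float64, 52)]:
    mu = 0.5 * 2.0 ** (-t)  # mu = (1/2) * 2^{-t}
    print(f"{dtype.__name__}: mu = {mu:.3e}, eps/2 matches: {np.finfo(dtype).eps / 2 == mu}")
```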

Application: The Finite Difference Trade-off

In the approximation theory chapter, we observed that finite difference errors increase for very small step sizes. Now we can explain why.

The Total Error

When computing the forward difference approximation:

$$f'(x) \approx \frac{f(x+h) - f(x)}{h}$$

we make two types of errors:

  1. Truncation error from Taylor’s theorem: $\frac{h}{2}|f''(\xi)|$

  2. Round-off error from floating-point arithmetic

For the round-off error: when $h$ is small, $f(x+h) \approx f(x)$, so we’re subtracting two nearly equal numbers. If both values have relative error $\mu$, the subtraction has absolute error roughly $2\mu|f(x)|$. Dividing by $h$ amplifies this to:

$$\text{Round-off error} \approx \frac{2\mu|f(x)|}{h}$$

The total error is:

$$E(h) = \underbrace{\frac{2\mu|f(x)|}{h}}_{\text{round-off}} + \underbrace{\frac{h}{2}|f''(\xi)|}_{\text{truncation}}$$

The Optimal Step Size

To minimize $E(h)$, differentiate and set to zero:

$$\frac{dE}{dh} = -\frac{2\mu|f(x)|}{h^2} + \frac{|f''(\xi)|}{2} = 0$$

Solving (and assuming $|f(x)| \approx |f''(\xi)|$ for simplicity):

$$h_{\text{opt}} = 2\sqrt{\mu} \approx \sqrt{\mu}$$
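A short sweep (assuming NumPy) for $f(x) = \sin x$ at $x = 1$ reproduces the V-shaped error curve, with the minimum near $\sqrt{\mu} \approx 10^{-8}$ in double precision:

```python
import numpy as np

f, df, x = np.sin, np.cos, 1.0
for h in 10.0 ** -np.arange(1, 16, 2):
    err = abs((f(x + h) - f(x)) / h - df(x))
    print(f"h = {h:.0e}   error = {err:.2e}")
# Error falls like (h/2)|f''| until h ~ 1e-8, then grows like 2*mu*|f|/h:
# below h_opt ~ sqrt(mu), round-off dominates truncation.
```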

Connection to the Error Framework

This is a perfect illustration of our error analysis framework:

The condition number of the operation “subtract then divide by $h$” grows like $1/h$, which is why round-off errors get amplified for small $h$.