# Number Systems

## Integers in Base 10

We typically write integers in base 10. A number $d_{n-1} d_{n-2} \cdots d_1 d_0$ (where $d_i \in \{0, 1, \ldots, 9\}$) represents:

$$
\sum_{i=0}^{n-1} d_i \cdot 10^i
$$
## Binary Representation

Computers use base 2 (binary), where digits $b_i \in \{0, 1\}$. A binary number $b_{n-1} b_{n-2} \cdots b_1 b_0$ represents:

$$
\sum_{i=0}^{n-1} b_i \cdot 2^i
$$
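As a quick sanity check, the sum above can be evaluated directly in Python and compared against the built-in base-2 parser:

```python
# Evaluate a bit string by summing b_i * 2**i, least significant bit first.
bits = "1101"
value = sum(int(b) * 2**i for i, b in enumerate(reversed(bits)))
print(value)         # 13  (1*8 + 1*4 + 0*2 + 1*1)
print(int(bits, 2))  # 13, using Python's built-in base-2 conversion
print(bin(13))       # '0b1101', converting back to a binary string
```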
## Signed Integers: Two’s Complement

Negative integers use two’s complement representation. For an $n$-bit signed integer with bits $b_{n-1} b_{n-2} \cdots b_1 b_0$:

$$
x = -b_{n-1} \cdot 2^{n-1} + \sum_{i=0}^{n-2} b_i \cdot 2^i
$$
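A minimal decoder in Python makes the sign convention concrete (the helper name is just for illustration):

```python
def twos_complement_value(bits: str) -> int:
    """Interpret a bit string as an n-bit two's complement integer."""
    n = len(bits)
    unsigned = int(bits, 2)
    # The leading bit contributes -2**(n-1) instead of +2**(n-1).
    return unsigned - 2**n if bits[0] == "1" else unsigned

print(twos_complement_value("0101"))  #  5
print(twos_complement_value("1011"))  # -5  (-8 + 2 + 1)
```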
## Fixed Point Notation

To represent fractions, we allow digits after a radix point:

$$
d_{n-1} \cdots d_1 d_0 \,.\, d_{-1} d_{-2} \cdots d_{-m}
$$

represents:

$$
\sum_{i=-m}^{n-1} d_i \cdot \beta^i
$$

where $\beta$ is the base.

With finitely many digits, some numbers cannot be represented exactly (e.g., $1/3$ in base 10, or $1/10$ in base 2).
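The binary example is worth seeing in action: the decimal fraction 0.1 has no finite binary expansion, so the stored double-precision value is only an approximation:

```python
from decimal import Decimal

# Decimal(0.1) shows the exact value of the double nearest to 1/10.
print(Decimal(0.1))      # 0.1000000000000000055511151231257827021181583404541015625
print(0.1 + 0.2 == 0.3)  # False: both operands carry small representation errors
```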
## Floating Point Numbers

For scientific computing, we need to represent numbers of vastly different magnitudes: from Avogadro’s number ($6.022 \times 10^{23}$) to Planck’s constant ($6.626 \times 10^{-34}$).

Scientific notation allows the radix point to “float”:

$$
x = \pm\, d_0.d_1 d_2 d_3 \cdots \times 10^{e}
$$

In binary:

$$
x = \pm\, 1.b_1 b_2 b_3 \cdots \times 2^{e}
$$
Note: In normalized binary scientific notation, the digit before the radix point is always 1, so we don’t need to store it!
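For example, $13 = 1101_2 = 1.101_2 \times 2^3$: in normalized form only the fraction bits $101$ and the exponent $3$ need to be stored, since the leading $1$ is implicit.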
## IEEE 754 Standard

A floating point number consists of three parts:

- Sign bit $s$: 0 for positive, 1 for negative
- Exponent $e$: stored with a shift (bias) to allow negative exponents
- Mantissa/Fraction $m$: the significant digits
### Single Precision (32-bit)
| Component | Bits |
|---|---|
| Sign | 1 |
| Exponent | 8 |
| Mantissa | 23 |
The value represented is:

$$
x = (-1)^s \times 1.b_1 b_2 \cdots b_{23} \times 2^{e - 127}
$$

where $b_1 b_2 \cdots b_{23}$ are the mantissa bits and $e$ is the stored exponent (subtracting the bias of 127 recovers negative exponents).
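A small sketch using Python’s `struct` module shows how the three fields combine (normalized numbers only; subnormals, infinities, and NaN are ignored here):

```python
import struct

def decode_float32(x: float):
    """Split a float32 bit pattern into sign, stored exponent, and mantissa fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # raw 32-bit pattern
    s = bits >> 31              # 1 sign bit
    e = (bits >> 23) & 0xFF     # 8 exponent bits, stored with bias 127
    m = bits & 0x7FFFFF         # 23 mantissa bits
    value = (-1) ** s * (1 + m / 2**23) * 2.0 ** (e - 127)  # normalized case only
    return s, e, m, value

print(decode_float32(-6.25))  # (1, 129, 4718592, -6.25)
```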
### Double Precision (64-bit)
| Component | Bits |
|---|---|
| Sign | 1 |
| Exponent | 11 |
| Mantissa | 52 |
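The value formula has the same shape, with an 11-bit exponent stored with bias 1023: $x = (-1)^s \times 1.b_1 b_2 \cdots b_{52} \times 2^{e - 1023}$.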
## Integer Interpretation of Floating Point

The same 32 bits can be interpreted as either a float or an integer. Given the floating point representation with sign bit $s$, stored exponent $e$, and mantissa bits $M$, the value of the same bit pattern read as an unsigned integer is:

$$
I = s \cdot 2^{31} + e \cdot 2^{23} + M
$$

where $M$ is the integer value of the mantissa bits.
This dual interpretation is exploited in fast numerical algorithms like the fast inverse square root.
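As an illustration, here is a Python transcription of the classic trick (the magic constant `0x5F3759DF` is the one popularized by the Quake III source; Python’s `struct` stands in for the original C pointer cast):

```python
import struct

def float_to_bits(x: float) -> int:
    """Reinterpret a float32 bit pattern as an unsigned 32-bit integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_to_float(i: int) -> float:
    """Reinterpret an unsigned 32-bit integer as a float32."""
    return struct.unpack(">f", struct.pack(">I", i))[0]

def fast_inverse_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) via integer manipulation plus one Newton step."""
    i = 0x5F3759DF - (float_to_bits(x) >> 1)  # shift/subtract acts on exponent and mantissa at once
    y = bits_to_float(i)
    return y * (1.5 - 0.5 * x * y * y)        # one Newton-Raphson refinement

print(fast_inverse_sqrt(4.0))  # ~0.499 (exact value is 0.5)
```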
## Rounding Error (Machine Epsilon)

Given a real number $x$, its floating point representation $\mathrm{fl}(x)$ satisfies:

$$
\mathrm{fl}(x) = x(1 + \delta), \qquad |\delta| \le \epsilon
$$

where $\epsilon$ is the machine epsilon (or unit roundoff):

$$
\epsilon = 2^{-p}
$$

with $p$ being the number of mantissa bits.
| Precision | Mantissa bits | Machine epsilon |
|---|---|---|
| Single (float) | 23 | $2^{-23} \approx 1.2 \times 10^{-7}$ |
| Double | 52 | $2^{-52} \approx 2.2 \times 10^{-16}$ |
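These values can be found empirically by halving a candidate epsilon until adding it to 1.0 no longer changes the result; the helper below is a sketch, with NumPy’s `finfo` as the reference:

```python
import numpy as np

def empirical_epsilon(dtype):
    """Find the smallest power of two that still changes 1.0 when added to it."""
    one = dtype(1.0)
    eps = dtype(1.0)
    while one + eps / dtype(2.0) != one:
        eps = eps / dtype(2.0)
    return eps

print(empirical_epsilon(np.float32), np.finfo(np.float32).eps)  # both ~1.19e-07
print(empirical_epsilon(np.float64), np.finfo(np.float64).eps)  # both ~2.22e-16
```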
## Application: The Finite Difference Trade-off
In the approximation theory chapter, we observed that finite difference errors increase for very small step sizes. Now we can explain why.
### The Total Error

When computing the forward difference approximation:

$$
f'(x) \approx \frac{f(x+h) - f(x)}{h}
$$

we make two types of errors:

1. Truncation error from Taylor’s theorem: $E_{\text{trunc}} \approx \frac{h}{2} |f''(x)|$
2. Round-off error from floating-point arithmetic

For the round-off error: when $h$ is small, $f(x+h) \approx f(x)$, so we’re subtracting two nearly equal numbers. If both values have relative error $\epsilon$, the subtraction has absolute error roughly $2\epsilon |f(x)|$. Dividing by $h$ amplifies this to:

$$
E_{\text{round}} \approx \frac{2\epsilon |f(x)|}{h}
$$

The total error is:

$$
E(h) \approx \frac{h}{2} |f''(x)| + \frac{2\epsilon |f(x)|}{h}
$$
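A short numerical experiment makes the trade-off visible; the test function $f(x) = \sin x$ is just a convenient choice with a known derivative:

```python
import numpy as np

# Forward-difference error for f(x) = sin(x) at x = 1.0 over a wide range of h.
f, df = np.sin, np.cos
x = 1.0
for h in 10.0 ** -np.arange(1, 16):
    approx = (f(x + h) - f(x)) / h
    print(f"h = {h:.0e}   error = {abs(approx - df(x)):.2e}")
# The error shrinks roughly like h until h ~ 1e-8, then grows again as the
# eps/h round-off term takes over, matching the two terms in E(h).
```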
### The Optimal Step Size

To minimize $E(h)$, differentiate and set to zero:

$$
\frac{dE}{dh} = \frac{|f''(x)|}{2} - \frac{2\epsilon |f(x)|}{h^2} = 0
$$

Solving (and assuming $|f(x)| \approx |f''(x)| \approx 1$ for simplicity):

$$
h_{\text{opt}} = 2\sqrt{\frac{\epsilon |f(x)|}{|f''(x)|}} \approx 2\sqrt{\epsilon}
$$
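Plugging in double precision, $\epsilon \approx 2.2 \times 10^{-16}$ gives $h_{\text{opt}} \approx 2\sqrt{2.2 \times 10^{-16}} \approx 3 \times 10^{-8}$, which matches the location of the minimum in the experiment above and the common rule of thumb $h \approx \sqrt{\epsilon} \approx 10^{-8}$.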
### Connection to the Error Framework
This is a perfect illustration of our error analysis framework:
- Backward error (truncation): How much did we perturb the mathematical problem? We approximated $f'(x)$ by a secant slope, an error of $O(h)$.
- Forward error (round-off amplification): How much did floating-point errors affect our answer? An error of $O(\epsilon/h)$.

The condition number of the operation “subtract then divide by $h$” grows like $1/h$, which is why round-off errors get amplified for small $h$.