This topic will not be presented explicitly in lectures; instead, you will study the material yourself in Lab Session 1 (?sec-lab1).
Computers store numbers with finite precision, i.e. using a finite set of bits (binary digits), typically 32 or 64 of them. You learned how to store numbers as floating-point numbers last year in the module COMP1860 Building our Digital World.
You will recall that many numbers cannot be stored exactly.
The inaccuracies inherent in finite precision arithmetic must be modelled in order to understand how they affect the results of our computations.
The examples shown here will be in decimal, but the issues apply to any base, e.g. binary.
This is important when trying to solve problems with floating-point numbers, so that we learn how to avoid the key pitfalls.
To understand how this works in practice, we introduce an abstract model of floating-point number systems; our examples will use fewer digits than a real computer does.
Any finite precision number can be written using the floating point representation:
\[ x = \pm 0.b_1 b_2 b_3 \ldots b_{t-1} b_t \times \beta^e. \]
Here \(\beta\) is the base, the digits satisfy \(0 \le b_i \le \beta - 1\), \(t\) is the number of digits stored, and the exponent \(e\) is an integer with \(L \le e \le U\). The tuple \((\beta, t, L, U)\) fully defines a finite precision number system.
Normalised finite precision systems will be considered here for which
\[ b_1 \neq 0 \quad (0 < b_1 \le \beta -1). \]
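For example, in base \(\beta = 10\) with \(t = 5\) digits, the number \(273.15\) is stored as
\[ +.27315 \times 10^{3}. \]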
Example 2.1
Our familiar floating point numbers can be represented using this format too:
The IEEE single-precision standard is \((\beta, t, L, U) = (2, 23, -127, 128)\). This is available via numpy.single.
The IEEE double-precision standard is \((\beta, t, L, U) = (2, 52, -1023, 1024)\). This is available via numpy.double.
import numpy as np

a = np.double(1.1)
print(type(a))
b = np.single(1.2)
print(type(b))
c = np.half(1.3)
print(type(c))
<class 'numpy.float64'>
<class 'numpy.float32'>
<class 'numpy.float16'>
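As a quick illustrative check, we can print the value actually stored for \(0.1\) at each precision; with fewer digits available, the representation error is larger:
import numpy as np

# print 0.1 as stored at half, single and double precision to 20 decimal
# places; none of the stored values is exactly 0.1
for dtype in [np.half, np.single, np.double]:
    print(f"{dtype.__name__:>7}: {float(dtype(0.1)):.20f}")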
You can create double-precision floating point numbers without using numpy (i.e. 64 bit with the same \(\beta, t, L, U\) as above) by calling the float function, or just giving the number in decimal format:
(np.double(5.1) == 5.1, np.double(5.1) == 5.1)
(np.True_, np.True_)
However, in some extreme edge cases, which you are unlikely to meet, Python floats do not follow the IEEE double-precision standard.
Example 2.2 Consider the number system given by \((\beta, t, L, U) = (10, 2, -1, 2)\) which gives
\[ x = \pm .b_1 b_2 \times 10^e \text{ where } -1 \le e \le 2. \]
- How many numbers can be represented by this normalised system?
There are 2 choices of sign, 9 choices of \(b_1\) (since \(b_1 \neq 0\)), 10 choices of \(b_2\) and 4 choices of exponent \(e\). Overall this gives us: \[ 2 \times 9 \times 10 \times 4 \text{ options } = 720 \text{ options}. \]
- What are the two largest positive numbers in this system?
The largest value uses \(+\) as a sign, \(b_1 = 9\), \(b_2 = 9\) and \(e = 2\) which gives \[+ 0.99 \times 10^{2} = 99. \]
The second largest value uses \(+\) as a sign, \(b_1 = 9\), \(b_2 = 8\) and \(e = 2\) which gives \[+ 0.98 \times 10^{2} = 98. \]
- What are the two smallest positive numbers?
The smallest positive number has \(+\) sign, \(b_1 = 1\), \(b_2 =0\) and \(e=-1\) which gives \[+ 0.10 \times 10^{-1} = 0.01. \]
The second smallest positive number has \(+\) sign, \(b_1 = 1\), \(b_2 = 1\) and \(e = -1\) which gives \[+ 0.11 \times 10^{-1} = 0.011. \]
- What is the smallest possible difference between two numbers in this system?
The smallest difference will be between numbers of the form \(+0.10 \times 10^{-1}\) and \(+0.11 \times 10^{-1}\) which gives \[ 0.11 \times 10^{-1} - 0.10 \times 10^{-1} = 0.011 - 0.010 = 0.001. \]
Alternatively, we can brute-force search for this:
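One possible implementation of such a search (a sketch only; the enumeration strategy and output format are assumptions) is:
# enumerate every positive number in the system (beta, t, L, U) = (10, 2, -1, 2)
# as (value, b1, b2, e), sort, and find the smallest gap between neighbours
values = sorted(
    ((b1 / 10 + b2 / 100) * 10.0**e, b1, b2, e)
    for b1 in range(1, 10)       # b1 != 0 in a normalised system
    for b2 in range(10)
    for e in range(-1, 3)        # -1 <= e <= 2
)
min_diff, x, y = min(
    (b[0] - a[0], a, b) for a, b in zip(values, values[1:])
)
print(f"The minimum difference {min_diff=:.4f}")
print(f"at x=+0.{x[1]}{x[2]} x 10^{{{x[3]}}} y=+0.{y[1]}{y[2]} x 10^{{{y[3]}}}.")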
The minimum difference min_diff=0.0010
at x=+0.57 x 10^{-1} y=+0.58 x 10^{-1}.
Exercise 2.1 Consider the number system given by \((\beta, t, L, U) = (10, 3, -3, 3)\) which gives
\[ x = \pm .b_1 b_2 b_3 \times 10^e \text{ where } -3 \le e \le 3. \]
How many numbers can be represented by this normalised system?
What are the two largest positive numbers in this system?
What are the two smallest positive numbers?
What is the smallest possible difference between two numbers in this system?
What is the smallest possible difference between two numbers \(x\) and \(y\) in this system for which \(x < 100 < y\)?
Example 2.3 (What about in Python?) We find that, even with double-precision floating point numbers, we see some surprising behaviour when working with decimals:
a = np.double(0.0)

# add 0.1 ten times to get 1?
for _ in range(10):
    a = a + np.double(0.1)
    print(a)
print("Is a = 1?", a == 1.0)
0.1
0.2
0.30000000000000004
0.4
0.5
0.6
0.7
0.7999999999999999
0.8999999999999999
0.9999999999999999
Is a = 1? False
Why is this output not a surprise?
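One way to see why, as an illustrative check: the value actually stored for \(0.1\) is not exactly \(0.1\), so each addition introduces a small representation error which accumulates:
import numpy as np

# the double nearest to 0.1 is slightly larger than 0.1 itself
print(f"{np.double(0.1):.20f}")
# the accumulated error after ten additions is only a few times eps
total = sum(np.double(0.1) for _ in range(10))
print(abs(total - 1.0) <= 10 * np.finfo(np.double).eps)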
We also see that even adding up numbers can give different results depending on the order in which we add them:
x = np.double(1e30)
y = np.double(-1e30)
z = np.double(1.0)
print(f"{(x + y) + z=:.16f}")
(x + y) + z=1.0000000000000000
print(f"{x + (y + z)=:.16f}")
x + (y + z)=0.0000000000000000
From now on \(fl(x)\) will be used to represent the (approximate) stored value of \(x\). The error in this representation can be expressed in two ways.
\[ \begin{aligned} \text{Absolute error} &= | fl(x) - x | \\ \text{Relative error} &= \frac{| fl(x) - x |}{|x|}. \end{aligned} \]
The number \(fl(x)\) is said to approximate \(x\) to \(t\) significant digits (or figures) if \(t\) is the largest non-negative integer for which
\[ \text{Relative error} < 0.5 \times \beta^{1-t}. \]
It can be proved that if the relative error is equal to \(\beta^{-d}\), then \(fl(x)\) has \(d\) correct significant digits.
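As an illustrative check, with \(\beta = 10\) and example values that are assumptions rather than part of the notes, we can ask to how many significant digits \(2.718\) approximates \(e\):
import numpy as np

# relative error of the approximation fl_x ~ x
x, fl_x = np.e, 2.718
rel_err = abs(fl_x - x) / abs(x)
# largest non-negative t with relative error < 0.5 * 10**(1 - t)
t = max(t for t in range(16) if rel_err < 0.5 * 10.0 ** (1 - t))
print(f"{rel_err=:.2e}, significant digits t={t}")   # expect roughly 1e-4 and t = 4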
In the number system given by \((\beta, t, L, U)\), the nearest (larger) representable number to \(x = 0.b_1 b_2 b_3 \ldots b_{t-1} b_t \times \beta^e\) is
\[ \tilde{x} = x + .\underbrace{000\ldots01}_{t \text{ digits}} \times \beta^e = x + \beta^{e-t} \]
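For instance, with \(\beta = 10\) and \(t = 2\), the number \(x = .10 \times 10^{0}\) has nearest larger representable number
\[ \tilde{x} = .11 \times 10^{0} = x + 10^{0-2} = x + 0.01. \]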
Any number \(y \in (x, \tilde{x})\) is stored as either \(x\) or \(\tilde{x}\) by rounding to the nearest representable number, so
\[ |y - fl(y)| \le \frac{1}{2} |\tilde{x} - x| = \frac{1}{2} \beta^{e-t}. \]
It follows from \(y > x \ge .100 \ldots 00 \times \beta^e = \beta^{e-1}\) that
\[ \frac{|y - fl(y)|}{|y|} < \frac{1}{2} \frac{\beta^{e-t}}{\beta^{e-1}} = \frac{1}{2} \beta^{1-t}, \]
and this provides a bound on the relative error: for any \(y\)
\[ \frac{|y - fl(y)|}{|y|} < \frac{1}{2} \beta^{1-t}. \]
This bound, \(\frac{1}{2} \beta^{1-t}\), is known as machine precision or unit roundoff and is often called \(eps\). This is obtained in Python with
eps = np.finfo(np.double).eps
print(eps)
2.220446049250313e-16
Example 2.4
The number system \((\beta, t, L, U) = (10, 2, -1, 2)\) gives
\[ eps = \frac{1}{2} \beta^{1-t} = \frac{1}{2} 10^{1-2} = 0.05. \]
The number system \((\beta, t, L, U) = (10, 3, -3, 3)\) gives
\[ eps = \frac{1}{2} \beta^{1-t} = \frac{1}{2} 10^{1-3} = 0.005. \]
The number system \((\beta, t, L, U) = (10, 7, 2, 10)\) gives
\[ eps = \frac{1}{2} \beta^{1-t} = \frac{1}{2} 10^{1-7} = 0.0000005. \]
For some common types in Python, we see the following values:
for dtype in [np.half, np.single, np.double]:
    print(dtype.__name__, np.finfo(dtype).eps)
float16 0.000977
float32 1.1920929e-07
float64 2.220446049250313e-16
Machine precision epsilon (\(eps\)) gives us an upper bound for the relative error in the representation of a floating-point number in a particular system. We note that this is different from the smallest possible number that we can store!
eps = np.finfo(np.double).eps
print(f"{eps=}")

smaller = eps
for _ in range(5):
    smaller = smaller / 10.0
    print(f"{smaller=}")
eps=np.float64(2.220446049250313e-16)
smaller=np.float64(2.2204460492503132e-17)
smaller=np.float64(2.220446049250313e-18)
smaller=np.float64(2.220446049250313e-19)
smaller=np.float64(2.2204460492503132e-20)
smaller=np.float64(2.2204460492503133e-21)
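Note, for instance, that the smallest positive normalised double (exposed by np.finfo as tiny) is around \(10^{-308}\), far smaller than \(eps\); a quick check:
import numpy as np

# eps measures the relative spacing of floating-point numbers near 1,
# not the smallest magnitude that can be stored
print(np.finfo(np.double).tiny)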
Arithmetic operations are usually carried out as though infinite precision is available, after which the result is rounded to the nearest representable number.
This approach means that arithmetic cannot be completely trusted and the usual rules do not necessarily apply!
Example 2.5 Consider the number system \((\beta, t, L, U) = (10, 2, -1, 2)\) and take
\[ x = .10 \times 10^2, \quad y = .49 \times 10^0, \quad z = .51 \times 10^0. \]
In exact arithmetic \(x + y = 10 + 0.49 = 10.49\) and \(x + z = 10 + 0.51 = 10.51\).
In this number system rounding gives
\[ fl(x+y) = .10 \times 10^2 = x, \qquad fl(x+z) = .11 \times 10^2 \neq x. \]
(Note that \(\frac{y}{x} < eps\) but \(\frac{z}{x} > eps\).)
Evaluate the following expressions in this number system.
\[ x+(y+y), \quad (x+y)+y, \quad x+(z+z), \quad (x+z) +z. \]
(Also note the benefits of adding the smallest terms first!)
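To check these, the toy system can be simulated with a small sketch: a hypothetical helper fl2 rounds every intermediate result to two significant decimal digits (the helper, its name and the use of Python's round are assumptions, not part of the notes).
import math

def fl2(v):
    # round v to two significant decimal digits, mimicking fl() in the
    # system (beta, t, L, U) = (10, 2, -1, 2); a sketch that ignores over/underflow
    if v == 0.0:
        return 0.0
    e = math.floor(math.log10(abs(v))) + 1   # exponent so that v = .d1 d2 x 10^e
    return round(v / 10.0**e, 2) * 10.0**e

x, y, z = 10.0, 0.49, 0.51
print(f"x+(y+y) = {fl2(x + fl2(y + y)):g},  (x+y)+y = {fl2(fl2(x + y) + y):g}")
print(f"x+(z+z) = {fl2(x + fl2(z + z)):g},  (x+z)+z = {fl2(fl2(x + z) + z):g}")
Adding the two small terms first keeps their sum large enough not to be lost when it is added to \(x\), which is why \(x+(y+y)\) and \(x+(z+z)\) give the better answers.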
Exercise 2.2 (More examples)
Verify that a similar problem arises for the numbers
\[ x = .85 \times 10^0, \quad y = .3 \times 10^{-2}, \quad z = .6 \times 10^{-2}, \]
in the system \((\beta, t, L, U) = (10, 2, -3, 3)\).
It is sometimes helpful to think of machine precision epsilon in another way: machine precision epsilon is the smallest positive number \(eps\) such that \(1 + eps > 1\), i.e. it is half the difference between \(1\) and the next largest representable number.
Examples:
For the number system \((\beta, t, L, U) = (10, 2, -1, 2)\),
\[ \begin{array}{ccl} & .11 \times 10^1 & \qquad \leftarrow {\text{next number}} \\ - & .10 \times 10^1 & \qquad \leftarrow 1 \\ \hline & .01 \times 10^1 & \qquad \leftarrow 0.1 \end{array} \]
so \(eps = \frac{1}{2}(0.1) = 0.05\).
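The same characterisation can be checked in double precision; an illustrative check with numpy:
import numpy as np

eps = np.finfo(np.double).eps
print(1.0 + eps > 1.0)        # True: adding eps moves 1 to the next representable number
print(1.0 + eps / 2 == 1.0)   # True: half of eps is rounded away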
When working with floating-point numbers there are other things we need to worry about, too!
- Numbers that are too large to store overflow: the result is inf (infinity) with numpy.doubles and is usually “fatal”.
- Dividing a non-zero number by zero gives inf, but \(\frac{0}{0}\) gives nan (not a number).
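For instance, a minimal sketch with numpy (these operations emit RuntimeWarnings but still return the special values):
import numpy as np

print(np.double(1e308) * 10.0)             # overflow -> inf
print(np.double(1.0) / np.double(0.0))     # division by zero -> inf
print(np.double(0.0) / np.double(0.0))     # 0/0 -> nan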
Can these issues simply be ignored? No! There are many examples of major software errors that have occurred due to programmers not understanding the issues associated with computer arithmetic…
In February 1991, a basic rounding error within software for the US Patriot missile system caused it to fail, contributing to the loss of 28 lives.
In June 1996, the European Space Agency’s Ariane Rocket exploded shortly after take-off: the error was due to failing to handle overflow correctly.
In October 2020, a driverless car drove straight into a wall due to faulty handling of a floating-point error.
- There is inaccuracy in almost all computer arithmetic.
- Care must be taken to minimise its effects, for example by adding the smallest terms of a sum first.
- The usual mathematical rules no longer apply.
- There is no point in trying to compute a solution to a problem to a greater accuracy than can be stored by the computer.