This topic will not be presented explicitly in lectures; instead, you will study the material yourself in Lab Session 1 (?sec-lab1).
Computers store numbers with finite precision, i.e. using a finite set of bits (binary digits), typically 32 or 64 of them. You learned how to store numbers as floating-point numbers last year in the module COMP1860 Building our Digital World.
You will recall that many numbers cannot be stored exactly.
The inaccuracies inherent in finite precision arithmetic must be modelled in order to understand how they affect the results of our computations.
The examples shown here will be in decimal, but the issues apply to any base, e.g. binary.
This is important when trying to solve problems with floating-point numbers, so that we learn how to avoid the key pitfalls.
To understand how this works in practice, we introduce an abstract model of floating-point number systems; our examples will use fewer digits than a real computer does.
Any finite precision number can be written using the floating point representation:
\[ x = \pm 0.b_1 b_2 b_3 \ldots b_{t-1} b_t \times \beta^e. \]
Here \(\beta\) is the base, the digits satisfy \(0 \le b_i \le \beta - 1\), \(t\) is the number of digits stored, and the exponent \(e\) is an integer with \(L \le e \le U\). The tuple \((\beta, t, L, U)\) fully defines a finite precision number system.
Normalised finite precision systems will be considered here for which
\[ b_1 \neq 0 \quad (0 < b_1 \le \beta -1). \]
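For example, in base \(\beta = 10\) with \(t = 5\) digits, the number \(273.15\) is stored as
\[ +.27315 \times 10^{3}. \]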
Example 2.1
Our familiar floating point numbers can be represented using this format too:
The IEEE single-precision standard is \((\beta, t, L, U) = (2, 23, -127, 128)\). This is available via numpy.single.
The IEEE double-precision standard is \((\beta, t, L, U) = (2, 52, -1023, 1024)\). This is available via numpy.double.
import numpy as np

a = np.double(1.1)
print(type(a))
b = np.single(1.2)
print(type(b))
c = np.half(1.3)
print(type(c))
<class 'numpy.float64'>
<class 'numpy.float32'>
<class 'numpy.float16'>
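As a quick illustrative check, we can print the value actually stored for \(0.1\) at each precision; with fewer digits available, the representation error is larger:
import numpy as np

# print 0.1 as stored at half, single and double precision to 20 decimal
# places; none of the stored values is exactly 0.1
for dtype in [np.half, np.single, np.double]:
    print(f"{dtype.__name__:>7}: {float(dtype(0.1)):.20f}")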
You can create double-precision floating point numbers without using numpy (i.e. 64 bit with the same \(\beta, t, L, U\) as above) by calling the float function, or just giving the number in decimal format:
(np.double(5.1) == 5.1, np.double(5.1) == 5.1)
(np.True_, np.True_)
However, in some extreme edge cases, which you are unlikely to meet, Python floats do not follow the IEEE double-precision standard.
Example 2.2 Consider the number system given by \((\beta, t, L, U) = (10, 2, -1, 2)\) which gives
\[ x = \pm .b_1 b_2 \times 10^e \text{ where } -1 \le e \le 2. \]
- How many numbers can be represented by this normalised system?
There are 2 choices of sign, 9 choices of \(b_1\) (since \(b_1 \neq 0\)), 10 choices of \(b_2\) and 4 choices of exponent \(e\). Overall this gives us: \[ 2 \times 9 \times 10 \times 4 \text{ options } = 720 \text{ options}. \]
- What are the two largest positive numbers in this system?
The largest value uses \(+\) as a sign, \(b_1 = 9\), \(b_2 = 9\) and \(e = 2\) which gives \[+ 0.99 \times 10^{2} = 99. \]
The second largest value uses \(+\) as a sign, \(b_1 = 9\), \(b_2 = 8\) and \(e = 2\) which gives \[+ 0.98 \times 10^{2} = 98. \]
- What are the two smallest positive numbers?
The smallest positive number has \(+\) sign, \(b_1 = 1\), \(b_2 =0\) and \(e=-1\) which gives \[+ 0.10 \times 10^{-1} = 0.01. \]
The second smallest positive number has \(+\) sign, \(b_1 = 1\), \(b_2 = 1\) and \(e = -1\) which gives \[+ 0.11 \times 10^{-1} = 0.011. \]
- What is the smallest possible difference between two numbers in this system?
The smallest difference will be between numbers of the form \(+0.10 \times 10^{-1}\) and \(+0.11 \times 10^{-1}\) which gives \[ 0.11 \times 10^{-1} - 0.10 \times 10^{-1} = 0.011 - 0.010 = 0.001. \]
Alternatively, we can brute-force search for this:
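One possible implementation of such a search (a sketch only; the enumeration strategy and output format are assumptions) is:
# enumerate every positive number in the system (beta, t, L, U) = (10, 2, -1, 2)
# as (value, b1, b2, e), sort, and find the smallest gap between neighbours
values = sorted(
    ((b1 / 10 + b2 / 100) * 10.0**e, b1, b2, e)
    for b1 in range(1, 10)       # b1 != 0 in a normalised system
    for b2 in range(10)
    for e in range(-1, 3)        # -1 <= e <= 2
)
min_diff, x, y = min(
    (b[0] - a[0], a, b) for a, b in zip(values, values[1:])
)
print(f"The minimum difference {min_diff=:.4f}")
print(f"at x=+0.{x[1]}{x[2]} x 10^{{{x[3]}}} y=+0.{y[1]}{y[2]} x 10^{{{y[3]}}}.")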
The minimum difference min_diff=0.0010
at x=+0.57 x 10^{-1} y=+0.58 x 10^{-1}.
Exercise 2.1 Consider the number system given by \((\beta, t, L, U) = (10, 3, -3, 3)\) which gives
\[ x = \pm .b_1 b_2 b_3 \times 10^e \text{ where } -3 \le e \le 3. \]
How many numbers can be represented by this normalised system?
What are the two largest positive numbers in this system?
What are the two smallest positive numbers?
What is the smallest possible difference between two numbers in this system?
What is the smallest possible difference between two numbers \(x\) and \(y\) in this system for which \(x < 100 < y\)?
Example 2.3 (What about in Python?) We find that, even with double-precision floating point numbers, we see some surprising behaviour when working with decimals:
a = np.double(0.0)

# add 0.1 ten times to get 1?
for _ in range(10):
    a = a + np.double(0.1)
    print(a)
print("Is a = 1?", a == 1.0)
0.1
0.2
0.30000000000000004
0.4
0.5
0.6
0.7
0.7999999999999999
0.8999999999999999
0.9999999999999999
Is a = 1? False
Why is this output not a surprise?
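One way to see why, as an illustrative check: the value actually stored for \(0.1\) is not exactly \(0.1\), so each addition introduces a small representation error which accumulates:
import numpy as np

# the double nearest to 0.1 is slightly larger than 0.1 itself
print(f"{np.double(0.1):.20f}")
# the accumulated error after ten additions is only a few times eps
total = sum(np.double(0.1) for _ in range(10))
print(abs(total - 1.0) <= 10 * np.finfo(np.double).eps)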
We also see that even adding up numbers can give different results depending on the order in which we add them:
x = np.double(1e30)
y = np.double(-1e30)
z = np.double(1.0)
print(f"{(x + y) + z=:.16f}")
(x + y) + z=1.0000000000000000
print(f"{x + (y + z)=:.16f}")
x + (y + z)=0.0000000000000000
From now on \(fl(x)\) will be used to represent the (approximate) stored value of \(x\). The error in this representation can be expressed in two ways.
\[ \begin{aligned} \text{Absolute error} &= | fl(x) - x | \\ \text{Relative error} &= \frac{| fl(x) - x |}{|x|}. \end{aligned} \]
The number \(fl(x)\) is said to approximate \(x\) to \(t\) significant digits (or figures) if \(t\) is the largest non-negative integer for which
\[ \text{Relative error} < 0.5 \times \beta^{1-t}. \]
It can be proved that if the relative error is equal to \(\beta^{-d}\), then \(fl(x)\) has \(d\) correct significant digits.
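As an illustrative check, with \(\beta = 10\) and example values that are assumptions rather than part of the notes, we can ask to how many significant digits \(2.718\) approximates \(e\):
import numpy as np

# relative error of the approximation fl_x ~ x
x, fl_x = np.e, 2.718
rel_err = abs(fl_x - x) / abs(x)
# largest non-negative t with relative error < 0.5 * 10**(1 - t)
t = max(t for t in range(16) if rel_err < 0.5 * 10.0 ** (1 - t))
print(f"{rel_err=:.2e}, significant digits t={t}")   # expect roughly 1e-4 and t = 4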
In the number system given by \((\beta, t, L, U)\), the nearest (larger) representable number to \(x = 0.b_1 b_2 b_3 \ldots b_{t-1} b_t \times \beta^e\) is
\[ \tilde{x} = x + .\underbrace{000\ldots01}_{t \text{ digits}} \times \beta^e = x + \beta^{e-t} \]
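For instance, with \(\beta = 10\) and \(t = 2\), the number \(x = .10 \times 10^{0}\) has nearest larger representable number
\[ \tilde{x} = .11 \times 10^{0} = x + 10^{0-2} = x + 0.01. \]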
Any number \(y \in (x, \tilde{x})\) is stored as either \(x\) or \(\tilde{x}\) by rounding to the nearest representable number, so
\[ |y - fl(y)| \le \frac{1}{2} |\tilde{x} - x| = \frac{1}{2} \beta^{e-t}. \]
It follows from \(y > x \ge .100 \ldots 00 \times \beta^e = \beta^{e-1}\) that
\[ \frac{|y - fl(y)|}{|y|} < \frac{1}{2} \frac{\beta^{e-t}}{\beta^{e-1}} = \frac{1}{2} \beta^{1-t}, \]
and this provides a bound on the relative error: for any \(y\)
\[ \frac{|y - fl(y)|}{|y|} < \frac{1}{2} \beta^{1-t}. \]
This bound, \(\frac{1}{2} \beta^{1-t}\), is known as machine precision or unit roundoff and is often called \(eps\). This is obtained in Python with
eps = np.finfo(np.double).eps
print(eps)
2.220446049250313e-16
Example 2.4
The number system \((\beta, t, L, U) = (10, 2, -1, 2)\) gives
\[ eps = \frac{1}{2} \beta^{1-t} = \frac{1}{2} 10^{1-2} = 0.05. \]
The number system \((\beta, t, L, U) = (10, 3, -3, 3)\) gives
\[ eps = \frac{1}{2} \beta^{1-t} = \frac{1}{2} 10^{1-3} = 0.005. \]
The number system \((\beta, t, L, U) = (10, 7, 2, 10)\) gives
\[ eps = \frac{1}{2} \beta^{1-t} = \frac{1}{2} 10^{1-7} = 0.0000005. \]
For some common types in Python, we see the following values:
for dtype in [np.half, np.single, np.double]:
    print(dtype.__name__, np.finfo(dtype).eps)
float16 0.000977
float32 1.1920929e-07
float64 2.220446049250313e-16
Machine precision epsilon (\(eps\)) gives us an upper bound for the relative error in the representation of a floating-point number in a particular system. We note that this is different from the smallest possible number that we can store!
eps = np.finfo(np.double).eps
print(f"{eps=}")

smaller = eps
for _ in range(5):
    smaller = smaller / 10.0
    print(f"{smaller=}")
eps=np.float64(2.220446049250313e-16)
smaller=np.float64(2.2204460492503132e-17)
smaller=np.float64(2.220446049250313e-18)
smaller=np.float64(2.220446049250313e-19)
smaller=np.float64(2.2204460492503132e-20)
smaller=np.float64(2.2204460492503133e-21)
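Note, for instance, that the smallest positive normalised double (exposed by np.finfo as tiny) is around \(10^{-308}\), far smaller than \(eps\); a quick check:
import numpy as np

# eps measures the relative spacing of floating-point numbers near 1,
# not the smallest magnitude that can be stored
print(np.finfo(np.double).tiny)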
Arithmetic operations are usually carried out as though infinite precision is available, after which the result is rounded to the nearest representable number.
This approach means that arithmetic cannot be completely trusted and the usual rules do not necessarily apply!
Example 2.5 Consider the number system \((\beta, t, L, U) = (10, 2, -1, 2)\) and take
\[ x = .10 \times 10^2, \quad y = .49 \times 10^0, \quad z = .51 \times 10^0. \]
In exact arithmetic \(x + y = 10 + 0.49 = 10.49\) and \(x + z = 10 + 0.51 = 10.51\).
In this number system rounding gives
\[ fl(x+y) = .10 \times 10^2 = x, \qquad fl(x+z) = .11 \times 10^2 \neq x. \]
(Note that \(\frac{y}{x} < eps\) but \(\frac{z}{x} > eps\).)
Evaluate the following expressions in this number system.
\[ x+(y+y), \quad (x+y)+y, \quad x+(z+z), \quad (x+z) +z. \]
(Also note the benefits of adding the smallest terms first!)
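To check these, the toy system can be simulated with a small sketch: a hypothetical helper fl2 rounds every intermediate result to two significant decimal digits (the helper, its name and the use of Python's round are assumptions, not part of the notes).
import math

def fl2(v):
    # round v to two significant decimal digits, mimicking fl() in the
    # system (beta, t, L, U) = (10, 2, -1, 2); a sketch that ignores over/underflow
    if v == 0.0:
        return 0.0
    e = math.floor(math.log10(abs(v))) + 1   # exponent so that v = .d1 d2 x 10^e
    return round(v / 10.0**e, 2) * 10.0**e

x, y, z = 10.0, 0.49, 0.51
print(f"x+(y+y) = {fl2(x + fl2(y + y)):g},  (x+y)+y = {fl2(fl2(x + y) + y):g}")
print(f"x+(z+z) = {fl2(x + fl2(z + z)):g},  (x+z)+z = {fl2(fl2(x + z) + z):g}")
Adding the two small terms first keeps their sum large enough not to be lost when it is added to \(x\), which is why \(x+(y+y)\) and \(x+(z+z)\) give the better answers.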
Exercise 2.2 (More examples)
Verify that a similar problem arises for the numbers
\[ x = .85 \times 10^0, \quad y = .3 \times 10^{-2}, \quad z = .6 \times 10^{-2}, \]
in the system \((\beta, t, L, U) = (10, 2, -3, 3)\).
It is sometimes helpful to think of machine precision epsilon in another way: machine precision epsilon is the smallest positive number \(eps\) such that \(1 + eps > 1\), i.e. it is half the difference between \(1\) and the next largest representable number.
Examples:
For the number system \((\beta, t, L, U) = (10, 2, -1, 2)\),
\[ \begin{array}{ccl} & .11 \times 10^1 & \qquad \leftarrow {\text{next number}} \\ - & .10 \times 10^1 & \qquad \leftarrow 1 \\ \hline & .01 \times 10^1 & \qquad \leftarrow 0.1 \end{array} \]
so \(eps = \frac{1}{2}(0.1) = 0.05\).
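The same characterisation can be checked in double precision; an illustrative check with numpy:
import numpy as np

eps = np.finfo(np.double).eps
print(1.0 + eps > 1.0)        # True: adding eps moves 1 to the next representable number
print(1.0 + eps / 2 == 1.0)   # True: half of eps is rounded away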
When working with floating-point numbers there are other things we need to worry about, too!
- Numbers that are too large to store overflow: the result is inf (infinity) with numpy.doubles and is usually “fatal”.
- Dividing a non-zero number by zero gives inf, but \(\frac{0}{0}\) gives nan (not a number).
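For instance, a minimal sketch with numpy (these operations emit RuntimeWarnings but still return the special values):
import numpy as np

print(np.double(1e308) * 10.0)             # overflow -> inf
print(np.double(1.0) / np.double(0.0))     # division by zero -> inf
print(np.double(0.0) / np.double(0.0))     # 0/0 -> nan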
Can these issues simply be ignored? No! There are many examples of major software errors that have occurred due to programmers not understanding the issues associated with computer arithmetic…
In February 1991, a basic rounding error within software for the US Patriot missile system caused it to fail, contributing to the loss of 28 lives.
In June 1996, the European Space Agency’s Ariane Rocket exploded shortly after take-off: the error was due to failing to handle overflow correctly.
In October 2020, a driverless car drove straight into a wall due to faulty handling of a floating-point error.
- There is inaccuracy in almost all computer arithmetic.
- Care must be taken to minimise its effects, for example by adding the smallest terms of a sum first.
- The usual mathematical rules no longer apply.
- There is no point in trying to compute a solution to a problem to a greater accuracy than can be stored by the computer.