Preface

If you haven’t experienced the so-called floating point error in your code, you’re lucky.

For example, 0.1 + 0.2 does not equal 0.3, and 8.7 / 10 does not equal 0.87 but 0.869999…, which seems rather strange 🤔

But this is definitely not some mysterious bug, nor is it a problem with Python’s design. It is simply a consequence of how floating point arithmetic works, so you will see the same thing in JavaScript or any other language:
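A quick check in the Python interpreter shows exactly the behaviour described above (JavaScript’s console prints the same kind of values):

```python
# Python's REPL shows the error directly
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False
print(8.7 / 10)          # 0.8699999999999999
```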

How does a computer store an integer?

Before we talk about why floating point errors occur, let’s talk about how computers use 0s and 1s to represent an integer. We all know binary: 101 represents $2^2 + 2^0$ (5), and 1010 represents $2^3 + 2^1$ (10).

For an unsigned 32 bit integer, there are 32 places to put a 0 or 1, so the minimum value 0000…0000 is 0, and the maximum value 1111…1111 represents $2^{31} + 2^{30} + \dots + 2^1 + 2^0 = 4294967295$.

From a combinatorial point of view, since each bit can be 0 or 1, the whole variable has $2^{32}$ possible values, so every integer between 0 and $2^{32} - 1$ can be expressed exactly, without any error.
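As a small illustration, Python’s built-in bin() displays the bit patterns from the examples above:

```python
# Binary representations of the integers mentioned above
print(bin(5))      # 0b101   -> 2**2 + 2**0
print(bin(10))     # 0b1010  -> 2**3 + 2**1
print(2**32 - 1)   # 4294967295, the largest unsigned 32 bit integer
```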

Floating Point Number

Although there are many integers between 0 and $2^{32} - 1$, there are at most $2^{32}$ of them. Floating-point numbers are different. Think of it this way: there are only ten integers in the range 1 to 10, but there are infinitely many floating-point numbers, such as 5.1, 5.11, 5.111, and so on, and you could never list them all.

But a 32 bit space still has only $2^{32}$ possibilities, so CPU manufacturers invented various floating point representations to squeeze floating point numbers into 32 bits. Having a different format on every CPU would be troublesome, though, so IEEE 754, published by the IEEE, became the common floating-point arithmetic standard, and CPUs are designed according to it.

IEEE 754

Many things are defined in IEEE 754, including the representation of single precision (32 bit) and double precision (64 bit) floating point numbers, as well as special values such as infinity and NaN.

Normalization

Take the floating point number 8.5 as an example. To convert it to IEEE 754 format, you must first normalize it: 8.5 is split into 8 + 0.5, which is $2^3 + (\cfrac{1}{2})^1$, then written in binary as 1000.1, and finally as $1.0001 \times 2^3$, very much like scientific notation.
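You can peek at this normalized form in Python with float.hex(), which prints the mantissa in hexadecimal; the hex digit 1 after the point is binary 0001, so the output matches $1.0001 \times 2^3$:

```python
# float.hex() shows the normalized mantissa (in hex) and the binary exponent
print((8.5).hex())   # 0x1.1000000000000p+3  ->  1.0001 (binary) x 2^3
```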

Single precision floating point number

In IEEE 754, a 32-bit floating-point number is divided into three parts, namely the sign, the exponent, and the fraction, which together make up the 32 bits:

  • Sign: The leftmost bit is the sign bit; it is 0 for a positive number and 1 for a negative number
  • Exponent: The middle 8 bits store the normalized exponent plus a bias of 127; for 8.5 the exponent is 3, so 3 + 127 = 130
  • Fraction: The rightmost 23 bits store the fractional part; for 1.0001 that means dropping the leading 1. and keeping the 0001

So 8.5 in 32 bit format would look like this:
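As a rough sketch, the three fields can also be inspected in Python with the struct module, by packing 8.5 as a 32 bit float and slicing out the bits:

```python
import struct

# Pack 8.5 as a single precision (32 bit) float and read back the raw bits
bits = int.from_bytes(struct.pack('>f', 8.5), 'big')

sign     = bits >> 31            # leftmost bit
exponent = (bits >> 23) & 0xFF   # middle 8 bits
fraction = bits & 0x7FFFFF       # rightmost 23 bits

print(f'{bits:032b}')            # 01000001000010000000000000000000
print(sign)                      # 0        -> positive
print(exponent)                  # 130      -> 3 + 127
print(f'{fraction:023b}')        # 00010000000000000000000
```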

When will the error occur?

The previous example, 8.5, can be expressed as $2^3 + (\cfrac{1}{2})^1$ because 8 and 0.5 both happen to be powers of 2, so there is no precision problem at all.

But if it were 8.9, there is no way to write it exactly as a sum of powers of 2, so it is forced into $1.0001110011… \times 2^3$, with an error of about 0.0000003. If you are curious about the result, you can play with it on the IEEE-754 Floating Point Converter website.
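You can reproduce this without leaving Python by round-tripping 8.9 through a 32 bit float with the struct module:

```python
import struct

# Convert 8.9 to single precision and back, then look at the difference
single = struct.unpack('>f', struct.pack('>f', 8.9))[0]

print(single)        # 8.899999618530273
print(8.9 - single)  # a tiny error in the 7th decimal place, as described above
```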

Double precision floating-point numbers

To reduce the error, IEEE 754 also defines how to use 64 bits to represent a floating point number: the double precision format. Compared with 32 bits, the fraction more than doubles, from 23 bits to 52 bits, so the accuracy naturally improves a lot.

Taking 8.9 as an example, the 64 bit representation is much more accurate, but since 8.9 still cannot be written exactly as a sum of powers of 2, an error remains around the 16th decimal place; it is just far smaller than the 0.0000003 error of single precision.
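To see where the double precision error shows up, you can print 8.9 (a double in Python) with more digits than Python normally displays:

```python
# Show 8.9 with 20 decimal places to expose the stored value
print(f'{8.9:.20f}')   # 8.90000000000000035527 -> error around the 16th decimal place
```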

This is also why, in Python, 1.0 equals 0.999…999 and 123 equals 122.999…999: the gap between them is too small to fit in the fraction, so from the binary point of view every one of their bits is identical.
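For example, in Python (the literals below have more 9s than a 52 bit fraction can distinguish from 1.0 and 123):

```python
# Both comparisons are True: the difference is too small to survive the
# conversion to a 64 bit float, so the bit patterns end up identical
print(1.0 == 0.99999999999999999)      # True
print(123 == 122.99999999999999999)    # True
```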

The solution

Since floating-point errors are unavoidable, you have to live with them. Here are two common ways to deal with them:

Set the maximum permissible error ε (epsilon)

Some languages provide a so-called epsilon that lets you check whether a difference is within the allowable range of floating point error. In Python, epsilon is about $2.2 \times 10^{-16}$.

So you can rewrite 0.1 + 0.2 == 0.3 as abs(0.1 + 0.2 - 0.3) <= epsilon, which tolerates the floating point error produced during the calculation and correctly decides whether 0.1 + 0.2 is equal to 0.3.

Of course, if the language doesn’t provide one, you can define your own epsilon and set it to something like $2^{-15}$.
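A minimal sketch in Python, using sys.float_info.epsilon as the tolerance (math.isclose is a ready-made alternative that does a similar comparison with a relative tolerance):

```python
import math
import sys

epsilon = sys.float_info.epsilon   # about 2.220446049250313e-16

print(0.1 + 0.2 == 0.3)                 # False, because of the floating point error
print(abs(0.1 + 0.2 - 0.3) <= epsilon)  # True, the difference is within tolerance
print(math.isclose(0.1 + 0.2, 0.3))     # True, using a relative tolerance instead
```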

Calculate entirely in decimal

Floating point error arises because converting from decimal to binary cannot always fit the whole fractional part into the mantissa. Since the conversion itself may introduce an error, another approach is to skip the conversion entirely and do the arithmetic in decimal.

Python has a built-in module called decimal, and JavaScript has similar packages. They let you calculate in decimal, just as you would compute 0.1 + 0.2 with pen and paper, without any error at all.
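A minimal sketch with Python’s decimal module; the values are passed in as strings so they are never converted through binary floats first:

```python
from decimal import Decimal

# Exact decimal arithmetic: no binary conversion, no error
print(Decimal('0.1') + Decimal('0.2'))                     # 0.3
print(Decimal('0.1') + Decimal('0.2') == Decimal('0.3'))   # True
print(Decimal('8.7') / Decimal('10'))                      # 0.87
```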

Although decimal arithmetic completely avoids floating-point error, it is simulated in software; at the lowest level the CPU circuits still compute in binary, so it is much slower than native floating-point arithmetic. Therefore, it is not recommended to use decimal for every floating-point calculation.