Many people know that floating-point numbers in Java are not exact and that BigDecimal is needed for precise calculations, but few know why floating-point numbers are inexact, or why they are used at all if they are inexact. This article works through the reasons.

As we know, computers store and operate on numbers in binary. A decimal integer is converted to a binary integer using the “divide by 2, take the remainders in reverse order” method.

The specific approach is:

  • Divide the decimal integer by 2 to get a quotient and a remainder;
  • Divide the quotient by 2 to get another quotient and remainder, and so on, until the quotient is 0;
  • The remainder obtained first becomes the least significant bit of the binary number, and the remainder obtained last becomes the most significant bit.

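For example, converting 127 to binary: 127 ÷ 2 = 63 remainder 1, 63 ÷ 2 = 31 remainder 1, 31 ÷ 2 = 15 remainder 1, 15 ÷ 2 = 7 remainder 1, 7 ÷ 2 = 3 remainder 1, 3 ÷ 2 = 1 remainder 1, 1 ÷ 2 = 0 remainder 1. Reading the remainders in reverse order gives (127)₁₀ = (1111111)₂.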
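The repeated division can be sketched in Java roughly as follows. The class and method names are hypothetical, chosen only for illustration, and the sketch assumes a non-negative input.

```java
class IntegerToBinary {
    // Repeatedly divide by 2 and collect the remainders. The first remainder
    // is the lowest bit, so each new remainder is inserted at the front.
    static String toBinary(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("this sketch handles non-negative integers only");
        }
        if (n == 0) {
            return "0";
        }
        StringBuilder bits = new StringBuilder();
        while (n > 0) {
            bits.insert(0, n % 2); // the remainder becomes the next higher bit
            n /= 2;                // the quotient feeds the next round
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinary(127));               // 1111111
        System.out.println(Integer.toBinaryString(127)); // the JDK's built-in conversion, for comparison
    }
}
```

In real code `Integer.toBinaryString` already does this; the loop is only there to make the steps visible.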

So, what about converting a decimal fraction to a binary fraction?

Decimal fractions are converted to binary fractions using the “multiply by 2, take the integer part” method.

The specific approach is:

  • Multiply the decimal fraction by 2 to get a product;
  • Take out the integer part of the product, then multiply the remaining fractional part by 2 to get another product;
  • Take out the integer part of that product, and so on, until the fractional part of the product is zero or the required precision is reached. The integer parts (each 0 or 1), read in the order they were taken, are the binary digits after the point.

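For example, converting 0.625 to binary: 0.625 × 2 = 1.25 (take 1), 0.25 × 2 = 0.5 (take 0), 0.5 × 2 = 1.0 (take 1, and the fractional part is now 0). So (0.625)₁₀ = (0.101)₂.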

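But 0.625 is a special case. Using the same algorithm to calculate the binary value of 0.1: 0.1 × 2 = 0.2 (take 0), 0.2 × 2 = 0.4 (take 0), 0.4 × 2 = 0.8 (take 0), 0.8 × 2 = 1.6 (take 1), 0.6 × 2 = 1.2 (take 1), 0.2 × 2 = 0.4 (take 0), and from here the same steps repeat.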

We find that the binary representation of 0.1 repeats forever, i.e. (0.1)₁₀ = (0.000110011001100…)₂.
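A minimal Java sketch of the multiply-by-2 procedure shows both behaviors. It assumes the input is a decimal string with 0 ≤ value < 1 and uses BigDecimal so that the decimal digits themselves are exact; the class name, method name and the `maxBits` cutoff are illustrative choices, not part of any library.

```java
import java.math.BigDecimal;

class FractionToBinary {
    private static final BigDecimal TWO = BigDecimal.valueOf(2);

    // Multiply the fraction by 2, emit the integer part as the next bit,
    // keep the fractional part, and stop at zero or after maxBits bits.
    static String toBinaryFraction(String decimal, int maxBits) {
        BigDecimal fraction = new BigDecimal(decimal); // exact decimal value, e.g. "0.1"
        StringBuilder bits = new StringBuilder("0.");
        for (int i = 0; i < maxBits && fraction.signum() != 0; i++) {
            fraction = fraction.multiply(TWO);
            if (fraction.compareTo(BigDecimal.ONE) >= 0) {
                bits.append('1');
                fraction = fraction.subtract(BigDecimal.ONE);
            } else {
                bits.append('0');
            }
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinaryFraction("0.625", 20)); // 0.101
        System.out.println(toBinaryFraction("0.1", 20));   // 0.00011001100110011001
    }
}
```

For 0.625 the loop ends after three bits because the fraction reaches zero. For 0.1 it only ends because of the cutoff, and raising `maxBits` simply produces more of the same 0011 pattern.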

Because this expansion never terminates, the computer cannot represent 0.1 exactly in a finite number of binary digits.
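One way to see this from Java is to ask for the exact value that a `double` holds. The sketch below relies on the `BigDecimal(double)` constructor, which converts the stored binary value exactly; the class and method names are hypothetical.

```java
import java.math.BigDecimal;

class StoredValueOfOneTenth {
    // new BigDecimal(double) converts the exact binary value held by the double,
    // without the shortening that Double.toString applies for display.
    static BigDecimal exactValue(double d) {
        return new BigDecimal(d);
    }

    public static void main(String[] args) {
        System.out.println(exactValue(0.1));
        // 0.1000000000000000055511151231257827021181583404541015625
        System.out.println(exactValue(0.625)); // 0.625
        System.out.println(0.1 + 0.2);         // 0.30000000000000004
        System.out.println(0.1 + 0.2 == 0.3);  // false
    }
}
```

The literal `0.1` is silently replaced by the nearest value a `double` can hold, and the small errors become visible once such values are added and compared.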

So, to deal with the problem that some decimals cannot be represented exactly in binary, the IEEE 754 specification was developed.

The IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754) has been the most widely used floating-point arithmetic standard since the 1980s. It is adopted by many CPUs and floating-point units.

Floating-point numbers and decimals are not quite the same thing. Computers can in fact represent fractional numbers in two ways: fixed point and floating point. Because fixed-point numbers cover a smaller range than floating-point numbers with the same number of bits, floating-point numbers are the ones used in computer science to represent approximations of real numbers.

IEEE 754 specifies four ways to represent floating-point values: single precision (32 bits), double precision (64 bits), extended single precision (43 bits or more, rarely used), and extended double precision (79 bits or more, usually implemented as 80 bits).

The most common are 32-bit single-precision floating-point numbers and 64-bit double-precision floating-point numbers.

A single-precision floating-point number occupies 4 bytes (32 bits) of memory and uses a “floating” radix point to represent a wide range of values.

In contrast to single-precision floating-point numbers, double-precision floating-point numbers use 64 bits (8 bytes) to store a value.
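In Java these two formats correspond to `float` and `double`. The following sketch prints their sizes and shows that the same literal is rounded more coarsely in 32 bits than in 64; the class and helper names are only illustrative.

```java
class FloatVersusDouble {
    // Widening a float to double keeps the float's value unchanged, so any
    // difference from the double literal comes from the float's coarser rounding.
    static double widen(float f) {
        return f;
    }

    public static void main(String[] args) {
        System.out.println(Float.SIZE + " bits, " + Float.BYTES + " bytes");   // 32 bits, 4 bytes
        System.out.println(Double.SIZE + " bits, " + Double.BYTES + " bytes"); // 64 bits, 8 bytes
        System.out.println(widen(0.1f));        // 0.10000000149011612
        System.out.println(widen(0.1f) == 0.1); // false
    }
}
```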

IEEE 754 does not solve the problem that some decimals cannot be represented exactly; it only defines a way to represent them with approximate values and introduces the concept of precision.

A floating-point number a is represented by two numbers m and e: a = m × b^e.

In any such system, we choose a base b (the radix of the number system) and a precision p (how many digits are stored). The mantissa m is a p-digit number of the form ±d.ddd…ddd (each digit is an integer between 0 and b−1, inclusive).

If the first digit of m is non-zero, m is said to be normalized. Some descriptions use a separate sign bit (s, indicating + or −) for the sign, in which case m must be positive. The exponent is e.
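For a Java `double`, the sign, m and e can be pulled out of the raw bits. The sketch below assumes the IEEE 754 64-bit layout (1 sign bit, 11 exponent bits stored with a bias of 1023, 52 fraction bits), which the text above does not spell out, and it handles normalized values only. The class and method names are hypothetical.

```java
class DoubleParts {
    private static final long FRACTION_MASK = (1L << 52) - 1; // low 52 bits

    // 0 for positive values, 1 for negative values.
    static int sign(double d) {
        return (int) (Double.doubleToLongBits(d) >>> 63);
    }

    // The stored exponent carries a bias of 1023; subtracting it yields e.
    static int exponent(double d) {
        return (int) ((Double.doubleToLongBits(d) >>> 52) & 0x7FF) - 1023;
    }

    // Normalized mantissa m in [1, 2): an implicit leading 1 plus the stored fraction.
    static double mantissa(double d) {
        long fraction = Double.doubleToLongBits(d) & FRACTION_MASK;
        return 1.0 + fraction / (double) (1L << 52);
    }

    // Recombines the parts as (+/-) m × 2^e.
    static double rebuild(double d) {
        double magnitude = Math.scalb(mantissa(d), exponent(d));
        return sign(d) == 0 ? magnitude : -magnitude;
    }

    public static void main(String[] args) {
        System.out.println(sign(0.625));        // 0
        System.out.println(exponent(0.625));    // -1
        System.out.println(mantissa(0.625));    // 1.25, since 0.625 = 1.25 × 2^-1
        System.out.println(rebuild(0.1) == 0.1); // true
    }
}
```

Here b = 2, so normalization means the leading digit is always 1, which is why the format does not need to store it.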

Finally, because the fractional values stored in the computer are approximations of decimal fractions rather than exact values, never use floating-point numbers in code to represent important figures such as monetary amounts.

It is recommended to use BigDecimal or long (counting in the smallest currency unit, such as cents) to represent amounts.
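As a rough illustration of both options, the sketch below adds two amounts with BigDecimal built from strings, and again with `long` values counted in the smallest currency unit (cents here, which is an assumption for the example). The class and method names are hypothetical.

```java
import java.math.BigDecimal;

class MoneyAmounts {
    // Adds two amounts given as decimal strings; the String constructor keeps
    // the decimal digits exactly as written.
    static BigDecimal addExact(String a, String b) {
        return new BigDecimal(a).add(new BigDecimal(b));
    }

    // Adds two amounts already expressed in the smallest currency unit.
    static long addMinorUnits(long a, long b) {
        return Math.addExact(a, b); // throws ArithmeticException on overflow
    }

    public static void main(String[] args) {
        System.out.println(0.1 + 0.2);                // 0.30000000000000004
        System.out.println(addExact("0.10", "0.20")); // 0.30
        System.out.println(addMinorUnits(10, 20));    // 30
    }
}
```

Note that the BigDecimal values are created from strings. Creating them from `double` literals with `new BigDecimal(0.1)` would carry the binary approximation along.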