Although the examples are in Java, this article is not just about Java.

The problem

After working through these first two pieces of code, you should have a good understanding of floating-point numbers, and you will probably not want to use float or double casually anymore.

Question 1

        float a = 0.1f;
        float b = 0.2f;
        float c = a + b;
        System.out.println(Float.compare(0.3f, a + b) == 0);
        System.out.println(c - 0.3f);
        System.out.println("===================");
        double a1 = 0.1;
        double b1 = 0.2;
        double c1 = a1 + b1;
        System.out.println(Double.compare(0.3d, c1) == 0);
        System.out.println(c1 - 0.3);
        System.out.println(Math.pow(2, -54));

The output

true
0.0
===================
false
5.551115123125783E-17
5.551115123125783E-17

Note that c1 - 0.3 is exactly equal to Math.pow(2, -54).

Question 2

        float x = 0.0f / 0.0f;
        System.out.println(x);

The output

NaN

Why is the output NaN rather than a java.lang.ArithmeticException: / by zero exception?

IEEE 754

We know that any data in memory is stored in binary. IEEE 754 specifies how floating-point numbers are stored in binary in computers, and the Java language complies with this standard (what mainstream language would dare not to?).

Floating-point types

Single-precision and double-precision floating-point numbers correspond to float and double, respectively, in the Java language.

Single-precision floating-point number: 1 sign bit, 8 exponent bits (b = 8), 23 mantissa bits (n = 23)

Double-precision floating-point number: 1 sign bit, 11 exponent bits (b = 11), 52 mantissa bits (n = 52)

As you can see above, under IEEE 754 the binary format of a floating-point number is divided into three parts, similar to scientific notation.

S is the sign bit (0 for positive, 1 for negative); E is the exponent field, b bits wide; M is the mantissa, n bits wide.

Converting a binary representation to a floating-point value

The formula for converting the stored binary fields into a floating-point value is as follows:

value = (-1)^S * (1 + M / 2^n) * 2^(E - (2^(b-1) - 1))

where the bias 2^(b-1) - 1 is 127 for single precision and 1023 for double precision.

There are two points in this formula that may be confusing at first, but both are very cleverly designed.

First, why does the formula subtract a bias from E? Because a bias was added to the exponent when it was stored.

So why add a bias in the first place? Exponents need to be both positive and negative, so the obvious approach would be to make the highest bit of E a sign bit. Taking an 8-bit E as an example, an ordinary signed representation can express +0 to +127 and -0 to -127, and for an exponent we have no use for two zeros. Storing the exponent with a bias added (the technical term is biased, or shifted, storage) solves this: the representable exponent range becomes -127 to +128.

You can also think of biased storage as a way of mapping unsigned numbers onto signed numbers.

Second, why add a 1 in front of M? Because the leading bit of the mantissa is dropped: after normalization the first digit is always 1, so it is not stored, and an n-bit mantissa field actually carries n+1 significant bits. But if the first digit is always forced to be 1, how do we represent zero? Floating-point zero is handled as a special value, which I'll get to below.
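To make these two points concrete, here is a small sketch (the class and method names are my own) that pulls the sign, exponent, and mantissa fields out of a float with Float.floatToIntBits, subtracts the bias of 127, restores the implicit leading 1, and rebuilds the value; it assumes the input is a normalized float:

    public class FloatDecoder {
        // Decode a normalized float by hand: value = (-1)^S * (1 + M/2^23) * 2^(E - 127)
        static void decode(float f) {
            int bits = Float.floatToIntBits(f);
            int sign = (bits >>> 31) & 0x1;        // 1 sign bit
            int exponent = (bits >>> 23) & 0xFF;   // 8 exponent bits (stored with bias)
            int mantissa = bits & 0x7FFFFF;        // 23 mantissa bits

            int e = exponent - 127;                                     // subtract the bias
            double significand = 1.0 + mantissa / (double) (1 << 23);   // restore the implicit 1
            double value = (sign == 0 ? 1 : -1) * significand * Math.pow(2, e);

            System.out.printf("S=%d E=%d (stored %d) M=%d -> %s%n", sign, e, exponent, mantissa, value);
        }

        public static void main(String[] args) {
            decode(10.5f);  // S=0 E=3  -> 10.5
            decode(0.2f);   // S=0 E=-3 -> 0.20000000298023224 (the value the float actually stores)
            decode(-0.5f);  // S=1 E=-1 -> -0.5
        }
    }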

Converting a decimal number to floating-point binary

For the integer part, keep dividing by 2 and take the remainders (1 or 0); reading the remainders in reverse order gives the binary integer part. For the fractional part, keep multiplying by 2 and take the integer part (1 or 0); reading those digits in order gives the binary fraction.

Take 10.5 and 10.2 as examples; their fractional parts are 0.5 and 0.2. For the integer part 10, the binary is 1010. For 0.5:

0.5 * 2 = 1.0   take 1

So the binary of 0.5 is.1

For 0.2

0.2 * 2 = 0.4   take 0
0.4 * 2 = 0.8   take 0
0.8 * 2 = 1.6   take 1
0.6 * 2 = 1.2   take 1
0.2 * 2 = 0.4   take 0
... the cycle repeats

So the binary of 0.2 is .0011 0011 0011 0011…

To normalize the result, the binary value is then scaled by a power of two so that the integer part is exactly 1.
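As a rough sketch of the multiply-by-2 procedure just described (the class and method names are my own), the following prints the first bits of a decimal fraction's binary expansion; note that the argument is itself a double, so the printed bits are those of the double approximation, which match the exact expansion for the first few dozen bits:

    public class FractionToBinary {
        // Convert a fraction in [0, 1) to binary by repeatedly multiplying by 2
        // and taking the integer part as the next bit.
        static String fractionBits(double fraction, int maxBits) {
            StringBuilder sb = new StringBuilder(".");
            for (int i = 0; i < maxBits && fraction != 0.0; i++) {
                fraction *= 2;
                if (fraction >= 1.0) {
                    sb.append('1');
                    fraction -= 1.0;
                } else {
                    sb.append('0');
                }
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            System.out.println(fractionBits(0.5, 24));  // .1
            System.out.println(fractionBits(0.2, 24));  // .001100110011001100110011
        }
    }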

Take a 32-bit single-precision floating-point number as an example

The binary representation of 10.5 is 0 10000010 0101 0000 0000 0000 0000 000
The binary representation of 10.2 is 0 10000010 0100 0110 0110 0110 0110 011
The binary representation of 0.5 is 0 01111110 0000 0000 0000 0000 0000 000
The binary representation of 0.2 is 0 01111100 1001 1001 1001 1001 1001 101

The conversion code

Converting by hand is too painful, so let the code do it:

    public static void main(String[] args) {
        getFloatBin(0.1f);
        getFloatBin(0.2f);
        getFloatBin(0.3f);
        getDoubleBin(0.1d);
        getDoubleBin(0.2d);
        getDoubleBin(0.3d);
    }

    public static void getFloatBin(float f) {
        int b = Float.floatToIntBits(f);
        String s = Integer.toBinaryString(b);
        // pad with leading zeros to a full 32 bits
        int size = s.length();
        for (int i = 0; i < 32 - size; i++) {
            s = "0" + s;
        }
        for (int i = 0; i < s.length(); i++) {
            // comma after the sign bit and after the 8-bit exponent
            if (i == 1 || i == 9) {
                System.out.print(",");
            }
            // space every 4 mantissa bits
            if (i > 9 && (i - 9) % 4 == 0) {
                System.out.print(" ");
            }
            System.out.print(s.charAt(i));
        }
        System.out.println();
    }

    public static void getDoubleBin(double d) {
        long l = Double.doubleToLongBits(d);
        String s = Long.toBinaryString(l);
        // pad with leading zeros to a full 64 bits
        int size = s.length();
        for (int i = 0; i < 64 - size; i++) {
            s = "0" + s;
        }
        for (int i = 0; i < s.length(); i++) {
            // comma after the sign bit and after the 11-bit exponent
            if (i == 1 || i == 12) {
                System.out.print(",");
            }
            // space every 4 mantissa bits
            if (i > 12 && (i - 12) % 4 == 0) {
                System.out.print(" ");
            }
            System.out.print(s.charAt(i));
        }
        System.out.println();
    }

You can see a pattern: unless the fractional part is an exact sum of negative powers of two (like 0.5), its binary expansion repeats forever and has to be rounded. Binary rounding is simple: if the next bit is 1, round up; if it is 0, just drop it.

Special values

As mentioned above, the exponent value ranges from -127 to +128, but the two extremes, +128 (exponent field 11111111) and -127 (exponent field 00000000), are reserved for floating-point values with special meanings.

+0: sign bit 0, exponent bits all 0, mantissa bits all 0

-0: sign bit 1, exponent bits all 0, mantissa bits all 0

+∞ (Infinity): sign bit 0, exponent bits all 1, mantissa bits all 0

-∞ (-Infinity): sign bit 1, exponent bits all 1, mantissa bits all 0

NaN: exponent bits all 1, mantissa bits non-zero

NaN means "not a number". It is what you get, for example, from dividing 0.0 by 0.0 (a non-zero floating-point number divided by 0.0 gives ±Infinity instead).

Denormalized (subnormal) numbers

Besides ±0, an all-zero exponent field (E = 00000000) is also used to represent numbers that are very small and very close to 0. Such a number is denormalized (subnormal) because it omits a leading 0 rather than a leading 1; numbers with the implicit leading 1 are the normalized ones. It is enough to know these exist; you rarely need them directly.
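If you are curious, a quick way to see a denormalized number in Java (Float.MIN_VALUE and Float.MIN_NORMAL are real java.lang.Float constants; the class name is my own): the smallest positive subnormal has an all-zero exponent field, while the smallest normalized float has an exponent field of 00000001.

    public class SubnormalDemo {
        public static void main(String[] args) {
            float subnormal = Float.MIN_VALUE;   // 1.4E-45, smallest positive subnormal
            float normal = Float.MIN_NORMAL;     // 1.17549435E-38, smallest positive normalized float

            // Pad the bit strings to 32 characters so the exponent field is easy to see
            System.out.println(String.format("%32s",
                    Integer.toBinaryString(Float.floatToIntBits(subnormal))).replace(' ', '0'));
            // 00000000000000000000000000000001  (exponent 00000000, mantissa 1)
            System.out.println(String.format("%32s",
                    Integer.toBinaryString(Float.floatToIntBits(normal))).replace(' ', '0'));
            // 00000000100000000000000000000000  (exponent 00000001, mantissa 0)
        }
    }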

The range of floating point numbers

Knowing the binary format of a floating-point number, its range can be easily deduced

Normalized: minimum absolute value

00 80 00 00 = 2^-126 * (1 + 0/2^23) = 2^-126 ≈ 1.17549435e-38
80 80 00 00 = -2^-126 * (1 + 0/2^23) = -2^-126 ≈ -1.17549435e-38

Normalized: maximum absolute value

7F 7F FF FF = 2^127 * (2 - 2^-23) ≈ 2^128 ≈ 3.40282347e+38
FF 7F FF FF = -2^127 * (2 - 2^-23) ≈ -2^128 ≈ -3.40282347e+38

The ranges for denormalized floating-point numbers are not listed here; you can work them out yourself.
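You can also cross-check the normalized limits derived above against the constants the JDK exposes (these are real java.lang.Float constants; the class name is my own):

    public class FloatRangeDemo {
        public static void main(String[] args) {
            System.out.println(Float.MAX_VALUE);    // 3.4028235E38
            System.out.println(Float.MIN_NORMAL);   // 1.17549435E-38

            // Their bit patterns match the hex values derived above
            System.out.println(Integer.toHexString(Float.floatToIntBits(Float.MAX_VALUE)));   // 7f7fffff
            System.out.println(Integer.toHexString(Float.floatToIntBits(Float.MIN_NORMAL)));  // 800000
        }
    }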

Answering the opening questions

Question 1

For single-precision floating-point numbers, known as floats, 0.1+0.2 computes as follows

    0.1 = 0,01111011,1001 1001 1001 1001 1001 101 = 1.1001 1001 1001 1001 1001 101 * 2^-4
    0.2 = 0,01111100,1001 1001 1001 1001 1001 101 = 1.1001 1001 1001 1001 1001 101 * 2^-3

    0.1 + 0.2 = 1.1001 1001 1001 1001 1001 101 * 2^-4 + 11.0011 0011 0011 0011 0011 010 * 2^-4
              = 100.1100 1100 1100 1100 1100 111 * 2^-4
              = 1.0011 0011 0011 0011 0011 001(11) * 2^-2
              = 1.0011 0011 0011 0011 0011 010 * 2^-2        (rounded up)

    0.3 = 0,01111101,0011 0011 0011 0011 0011 010 = 1.0011 0011 0011 0011 0011 010 * 2^-2

For a double-precision floating-point number, that is, a double, the logic is as follows

    0.1 = 0,01111111011,1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010
        = 1.1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010 * 2^-4
    0.2 = 0,01111111100,1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010
        = 1.1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010 * 2^-3

    0.1 + 0.2 = 1.1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010 * 2^-4
              + 11.0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0100 * 2^-4
              = 100.1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1110 * 2^-4
              = 1.0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011(10) * 2^-2
              = 1.0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0100 * 2^-2   (rounded up)

    0.3 = 0,01111111101,0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011
        = 1.0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 * 2^-2

When a decimal fraction is converted to binary, the expansion may repeat forever, so the significand has to be rounded; the binary rule is that a dropped 1 rounds up (carry) and a dropped 0 rounds down (no carry).

With 32-bit floats, 0.1, 0.2, the intermediate sum 0.1 + 0.2, and 0.3 all happen to round to the same final mantissa bits, so 0.1f + 0.2f is exactly equal to 0.3f. With 64-bit doubles, the 52-bit significand of 0.3 ends on a full 0011 group and the bits that follow start with 0, so 0.3 rounds down; but in 0.1 + 0.2 the exponent alignment introduces extra low-order bits that make the sum round up, leaving 0.1 + 0.2 one ulp greater than 0.3.
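To confirm the one-ulp difference in the double case, a small sketch (the class name is my own) prints the raw bit patterns of 0.1 + 0.2 and 0.3, and their difference:

    public class DoubleSumBits {
        public static void main(String[] args) {
            double sum = 0.1 + 0.2;
            double three = 0.3;

            System.out.println(sum);    // 0.30000000000000004
            // The two bit patterns differ only in the last mantissa bits: ...0100 vs ...0011
            System.out.println(Long.toBinaryString(Double.doubleToLongBits(sum)));
            System.out.println(Long.toBinaryString(Double.doubleToLongBits(three)));
            // The gap is exactly one ulp of 0.3, i.e. 2^-54
            System.out.println(sum - three);    // 5.551115123125783E-17
        }
    }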

Question 2

The floating-point implementation of the Java language follows the IEEE 754 specification, and NaN is one of the values that specification requires. So 0.0 divided by 0.0 returns NaN instead of throwing an exception (integer division by zero, by contrast, throws java.lang.ArithmeticException).
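As a quick illustration (the class name is my own), floating-point division by zero quietly produces Infinity or NaN as IEEE 754 requires, while integer division by zero throws:

    public class DivisionByZeroDemo {
        public static void main(String[] args) {
            // Floating-point division follows IEEE 754: no exception is thrown
            System.out.println(1.0f / 0.0f);               // Infinity
            System.out.println(0.0f / 0.0f);               // NaN
            System.out.println(Float.isNaN(0.0f / 0.0f));  // true

            // Integer arithmetic has no NaN or Infinity to fall back on, so it throws
            try {
                int zero = 0;
                int i = 1 / zero;
            } catch (ArithmeticException e) {
                System.out.println(e);  // java.lang.ArithmeticException: / by zero
            }
        }
    }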

How to handle floating-point numbers in a Java program

After the above analysis, it should be clear that doing arithmetic with float or double in a Java application can run into precision problems.

So be sure to use BigDecimal or an integer type for the money fields in your application!

If no arithmetic is performed on a value, float and double can be used as usual.

BigDecimal avoids the problem because it stores the value as an unscaled integer together with a decimal scale, and converting integers to binary involves no precision loss.
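A minimal sketch of the safe pattern (the class name is my own): construct BigDecimal from a String, or via BigDecimal.valueOf, rather than from a double literal that is already imprecise:

    import java.math.BigDecimal;

    public class MoneyDemo {
        public static void main(String[] args) {
            // Wrong: the double 0.1 is already imprecise before BigDecimal ever sees it
            System.out.println(new BigDecimal(0.1));
            // 0.1000000000000000055511151231257827021181583404541015625

            // Right: build from the decimal string
            BigDecimal a = new BigDecimal("0.1");
            BigDecimal b = new BigDecimal("0.2");
            System.out.println(a.add(b));   // 0.3

            // Also fine: BigDecimal.valueOf goes through the double's canonical string form
            System.out.println(BigDecimal.valueOf(0.1).add(BigDecimal.valueOf(0.2)));  // 0.3
        }
    }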

References

IEEE 754 floating-point standard
Why are floating-point numbers inaccurate?
Floating-point arithmetic problems in Python
How floating-point numbers are represented in memory
Understanding biased (shifted) exponent storage

If you've read this far, would you mind following our official account?