This article is not just for Java
The problem
After these first two pieces of code, you should have a good understanding of floating point numbers and probably don’t want to use “float “(float/double) anymore
Question 1
floatA = 0.1 f;floatB = 0.2 f;floatc = a +b; System. The out. Println (Float.com pare said (0.3 f, a + b)); System. The out. Println (c - 0.3 f); System.out.println("= = = = = = = = = = = = = = = = = = ="); Double a1 = 0.1; Double b1 = 0.2; double c1 =a1 +b1; System. The out. Println (Double.com pare said (0.3 d, c1)); System. The out. Println (c1-0.3); System.out.println(Math.pow(2,-54));Copy the code
The output
true0.0 = = = = = = = = = = = = = = = = = = =falseE-17 e-17 5.551115123125783 5.551115123125783Copy the code
Math.pow(2,-54) math.pow (2,-54)
Question 2
floatX = 0.0 f / 0.0 f; System.out.println(x);Copy the code
The output
NaN
Copy the code
Why is the output of the NaN, rather than Java. Lang. ArithmeticException: / by zero the exception
IEEE 754
We know that any data in memory is stored in binary, so IEEE 754 specifies how floating-point numbers are stored in binary in computers, and the Java language complies with this standard (which popular language dare not).
Floating point type
Single and double floating-point numbers correspond to float and double, respectively, in the Java language
Single-precision floating point number
A double – precision floating – point number
As you can see from the figure above, under IEEE 754, the binary format of floating point numbers is divided into three parts, similar to scientific notation.
S is the sign bit, 0 is the positive bit, 1 is the negative bit, E is the exponent, b is how many bits of the exponent, M is the mantissa, n is how many mantissa bits
Binary number is converted to floating point number
The formula for converting binary numbers to floating point numbers is as follows
The formula above has two points, which you may not understand, but it’s very cleverly designed and very cleverly designed
First, why does E need to subtract the size of a mask? Because E is generated with the size of a mask.
So why add a mask size? First of all, our exponents are positive and negative, so the highest bit of E is the sign bit. Taking 8-bit E as an example, ordinary sign numbers are used, which can represent numbers ranging from +0 to 127, -0 to -127. For a number to be an exponent, we do not need 2 zeros. So we solve this problem by adding a mask size (the technical term is shift storage). Through shift storage, the range of numbers we can express is 0 to 128,0 to -127.
Shift storage you can also think of it as mapping unsigned numbers to signed numbers
Second, why do we need to add 1 in front of M? Because the mantissa M drops one bit. After the shift, the first digit is forced to be 1, so it’s omitted. The mantissa of the n digit is actually n+1 digit. The first digit is forced to be 1, right? It’s not impossible to represent a zero, but what about a floating point zero? I’ll talk about that next.
Conversion of floating point numbers to binary numbers
The whole number, divided by 2, keeps taking 1’s and 0’s of the remainder, and you get the decimal part in reverse order, times 2, keeps taking 1’s and 0’s of the whole number, and you get positive order
Taking 10.5, 10.2, 0.2 and 0.5 for the integer part of 10, we can get that its binary is 1010 for 0.5
0.5 * 2 = 1.0 take 1Copy the code
So the binary of 0.5 is.1
For 0.2
0.2 * 2 = 0.4 take 0 0.4 * 2 = 0.8 take 0 0.8 * 2 = 1.6 take 1 0.6 * 2 = 1.2 take 1 0.2 * 2 = 0.4 take 0 cycleCopy the code
So 0.2 binary is.001 1001 1001 1001…
For the sake of specification, the floating-point binary we use requires a power carry to ensure that the integer part is 1
Take a 32-bit single-precision floating-point number as an example
The binary representation of 10.5 is 0 10000010 0101 0000 0000 0000 0000 0000 000 the binary representation of 10.2 is 0 10000010 0100 0110 0110 0110 0110 011 011 the binary representation of 0.5 is 0 01111110 0000 0000 0000 0000 0000 0000 0000 0000 0.2 binary representation is 0 01111100 1001 1001 1001 1001 1001 101
The transformation code
The human conversion is too painful. Go code
public static void main(String[] args) {
getFloatBin(0.1 f); getFloatBin(0.2 f); getFloatBin(0.3 f); getDoubleBin(0.1 d); getDoubleBin(0.2 d); getDoubleBin(0.3 d); } public static void getFloatBin(float f){
int b=Float.floatToIntBits(f);
String s = Integer.toBinaryString(b);
int size = s.length();
for(int i =0 ; i < 32-size; i++){ s ="0" +s;
}
for(int i =0 ; i< s.length(); i++){if(i ==1 || i ==9){
System.out.print(",");
}
if(i >8 && (i-9)%4 ==0){
System.out.print("");
}
System.out.print(s.charAt(i));
}
System.out.println();
}
public static void getDoubleBin(double d){
long l = Double.doubleToLongBits(d);
String s = Long.toBinaryString(l);
int size = s.length();
for(int i =0 ; i < 64-size; i++){ s ="0" +s;
}
for(int i =0 ; i< s.length(); i++){if(i ==1 || i ==12){
System.out.print(",");
}
if(i >12 && (i-12)%4 ==0){
System.out.print("");
}
System.out.print(s.charAt(i));
}
System.out.println();
}
Copy the code
You can see a rule, as long as the floating point number does not end in 5, the corresponding binary is infinite loop, must be rounded, binary rounding is very simple, as long as the next 1, the round
Special numerical processing
As mentioned above, the value of E of the exponent ranges from 0 to 128, 0 to -127, but the corresponding floating-point numbers of 128(11111111) and -127(00000000) have special meanings.
+ 0
0
+ up
– up
NAN
NAN = not a number, which is returned by dividing a floating-point number by 0
Nonnormalized number
In addition to +/-0, floating point E=00000000 is used to represent numbers that are very small and close to 0. This number is de-normalized because it omits the beginning of 0. The corresponding number that omits the beginning of 1 is normalized. You know it’s there, you don’t really need it.
The range of floating point numbers
Knowing the binary format of a floating-point number, its range can be easily deduced
Normalization-minimum-absolute value
00 80 00 00 = 2-126 * (1+0/223)= 2-126≈ 1,17549435∙ E-38 80 80 00 00 =-2-126 * (1+0/223)= -1,17549435∙e-38
Normalization-maximum-absolute value
7F 7F FF FF = 2127(2-2-23) = 2128≈ 3,40282347∙ E +38 FF 7F FF FF = -2127(2-2-23) = -2128≈ 3,40282347∙e+38
Unnormalized floating-point numbers are not listed, so you can explore them yourself
Answer the opening question
Question 1
For single-precision floating-point numbers, known as floats, 0.1+0.2 computes as follows
0.1 = 0,01111100,1001 1001 1001 1001 101 = 1,1001 1001 1001 1001 1001 101 * 2^-4 0.2 = 0,01111100,1001 1001 1001 1001 1001 101 = 1,1001 1001 1001 1001 101 * 2^-3 0.1 + 0.2 = 1,1001 1001 1001 1001 1001 1001 101 * 2^-4 + 11,0011 0011 0011 0011 0011 01 * 2^-4 = 100,1100 1100 1100 1100 1100 111 * 2^-4 = 1,0011 0011 0011 0011 0011 0011 001(1) * 2^-2 = 1,0011 0011 0011 0011 0011 010 * 2^-2 0.3 = 0,01111101,0011 0011 0011 0011 0011 0011 010 = 1,0011 0011 0011 0011 0011 010 * 2^-2Copy the code
For a double-precision floating-point number, that is, a double, the logic is as follows
0.1 = 0,01111111011,1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010 = 1,1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010 * 2 ^ -4 0.2 = 0,01111111100,1001 1001 1001 1001 1001 1001 1001 1001 1001 1010 = 1 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010 * 2 ^ -3 0.1 + 0.2 = 1 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010 * 2 ^ 4 + 1 0011 0011 0011 11001 0011 0011 0011 0011 0011 0011 0011 0011 010 * 2 ^ -4 = 100,1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1110 * 2 ^ -4 = 1,0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 (10) * 2 ^ -2 = 1,0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0100 * 2 ^ -2 0.3 = 0,01111111101,0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 = 1,0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011Copy the code
Floating point number is converted to binary, there is an infinite loop problem, so the significant bits will be rounded, for the binary number is 1 carry 0 do not carry the rule.
In 32 bit floating point 0.2 +0.1, 0.1, 0.2, 0.1+0.2, 0.3, all have rounding and carry the same number of bits. In the case of 0.2+0.1, the significant number is 52 bits, which is a multiple of 4, and the binary 0011 that follows does not carry, but 0.1+0.2 is greater than 0.3 because of exponent misalignment in the process of 0.1+0.2.
Question 2
The Floating-point implementation of the Java language follows the IEEE 754 specification, and NAN is one of the requirements of this specification. So a floating-point number divided by 0 will return NAN.
How do I handle floating point numbers in A Java program
After the above analysis, it is clear that if you use float or double in your Java application, there will be precision problems.
So be sure to use BigDecimal or integers for the amount fields in your application!!
If no operation is performed, it can be used as usual
The principle of BigDecimal is that there are no precision issues when converting integers to binary.
reference
IEEE 754 Floating point Standard Why are floating point numbers inaccurate? Floating point number in Python operation problems in memory representation shift storage difficulties understand