preface
Let’s look at an example of double distortion
public class DoubleTest {
public static void main(String[] args) {
for (double i = 0; i < 1; i = (i * 10 + 1) / 10) {
for (double k = 0; k < i; k = (k * 10 + 1) / 10) {
System.out.println(i + "-" + k + "="+ (i - k)); }}}}Copy the code
Output:
0.1-0.0 = 0.1 0.2 0.0 = 0.2 0.2 0.1 = 0.1 0.3-0.0 = 0.3 0.3 0.1 = 0.19999999999999998 0.3 0.2 = 0.09999999999999998 0.4 0.0 = 0.4 0.4-0.1 = 0.30000000000000004 0.4 0.2 = 0.2 0.4 0.3 = 0.10000000000000003 0.5-0.0 = 0.5 0.5 0.1 = 0.4 0.5 0.2 = 0.3 0.5 0.3 = 0.2 0.6-0.4= 0.099999999999998 0.6-0.0=0.6 0.6-0.1=0.5 0.6-0.2= 0.3999999999997 0.6-0.3=0.3 0.6-0.4=0.19999999999999996 0.6-0.5= 0.099999999999998 0.7-0.0=0.7 0.7-0.1=0.6 0.7-0.2= 0.4999999999994 0.7-0.3= 0.399999999999997 0.7-0.4 = 0.29999999999999993 0.7 0.5 = 0.19999999999999996 0.7 0.6 = 0.09999999999999998 0.8 0.0 = 0.8 0.8-0.1=0.7000000000000001 0.8-0.2=0.6000000000000001 0.8-0.3=0.5 0.8-0.4=0.4 0.8-0.5=0.30000000000000004 0.8-0.6=0.20000000000000007 0.8-0.7= 0.100000000000000000009 0.9-0.0=0.9 0.9-0.1=0.8 0.9-0.2=0.7 0.9-0.3=0.6000000000000001 0.9-0.4=0.5 0.9-0.5=0.4 0.9-0.6=0.30000000000000004 0.9-0.7=0.20000000000000007 0.9-0.8= 0.0999999999999999998
What is a floating point number?
1, decimal
The composition of a decimal: in our country, the decimal represents by three parts, respectively is the integer + decimal point (separator) + decimal.
2. Why are decimals called floating point numbers
Floating point is a numerical representation of numbers belonging to a particular subset of rational numbers, used in computers to approximate any real number. Specifically, the real number is obtained by multiplying an integer or fixed-point number (the mantissa) by an integer power of some base (usually 2 in computers), similar to the scientific notation for base 10.
A floating-point number is a number where the decimal point can float anywhere.
In the machine language of a computer, there is only binary, and machine language can only recognize 0 and 1. Therefore, the computer is also impossible to store decimals, so there needs to be another flexible storage scheme. Such a scheme is called an exponential scheme:
3. Representation of floating point numbers in Java
For float, there are four bytes, 32 bits, 0-22 bits for mantissa, 23-30(8 bits) for exponent, and 31 bits for sign bits.
For double, it is 8 bytes, 64 bits, 0-51 mantissa, 52-62(11 bits) exponent, and 63 highest bits sign bits.
How are floating point numbers stored in memory?
We know that any data in computer memory is stored as’ 0\1 ‘, and floating-point numbers are also stored as’ 0\1 ‘. So decimal floating-point numbers must be converted to binary floating-point numbers when stored.
In memory using binary scientific notation to store, so divided into order code (that is, index) and base, because there are also positive and negative points, so there is a sign bit. For example, float is stored in memory as follows:
Double (1bit) Exponent (11 bit) Mantissa (52 bit)
Float takes up 8 bits in memory, and since the order code actually stores the shift of the exponent, assuming the truth value of the exponent is e and the order is e, we have e =e+(2^ n-1-1). Where 2^ n-1-1 is the exponential offset specified in IEEE754 standard. According to this formula, we can get 2^ 8-1 =127. Thus, float has an exponential range of -128 +127, while double has an exponential range of -1024 +1023. The negative exponential determines the minimum non-zero absolute value that floating point number can express. The positive exponent determines the maximum absolute value that can be expressed by a floating point number, which determines the range of floating point values.
Float +2^ 128 to +2^127 (+3.40E+38 to +3.40E+38)
The range of double is from -2^1024 to +2^1023, i.e. -1.79e +308 to +1.79E+308
We’re using shift storage, so for float, we add 127 to the exponent and 1023 to the double (minus 127 and 1023, respectively)
Shift storage is essentially a guarantee of consistency between +0 and -0.
In terms of the eight digits in the float index,
So this new byte of eight bits is represented by a string of numbers: 0000, 0000
First, let’s assume that instead of using the shift storage technique, we can simply look at how many numbers this new 8-bit byte can represent: 0000 0000-1111 1111 is 0-255, 256 numbers.
But we know that these eight digits have to be both positive and negative.
So take out the first sign on the left for plus and minus:
The first interval:
0000 0000 -0 111 1111 = +0 to 127
The second interval:
1 000 0000 -1 111 1111 is -0 to -127
That’s the question: how can there be two zeros, a positive zero and a negative zero?
Use shift storage: float uses 127(0111 1111)
0000 0000 +0111 1111=0111 1111= 1:1 +127=128 0000 0001 +0111 1111=1000 0000 =128: 128+127=255 1000 0000+0111 1111=1111 1111
The largest positive number, any larger than that, will overflow.
-1+127=126=127-1 0111 1111-0000 0001=0111 1110 -1: -2+127=125=127-2 0111 1111-0000 0010=0111 1101 -127: -127+127=0 0111 1111-0111 1111=0000 0000
The smallest negative number, overflows at school.
Base conversion of floating – point numbers
1, decimal to binary
Mainly look at decimal to binary, integer part and decimal part separately
-
Integer part: divide the integer by 2 to get a quotient and remainder, which continues to divide by 2 to get a quotient and a remainder, which continues to divide by 2 until the quotient is 0. The above operation gives a series of remainder, starting with the last remainder and ending with the first remainder. This series of 0\1 is the converted binary number.
-
Decimal part: Multiply by 2, then take the whole part, multiply the rest of the decimal part by 2, then take the whole part again, until the decimal part is zero. If it is never zero, reserve enough decimal places as required, rounding the last digit with 0. Put the extracted integers in order.
As you can see from the above conversion, not every decimal fraction can be expressed exactly in binary. A decimal P between 0 and 1 can be expressed as follows:
public static void main(String[] args) {
float f1=20f;
floatF2 = 20.3 f;floatF3 = 20.5 f; double d1=20; Double d2 = 20.3; Double d3 = 20.5; System.out.println(f1==d1); System.out.println(f2==d2); System.out.println(f3==d3); }Copy the code
true false true
To convert binary, multiply by 2, take the whole number, subtract the whole number from the result, and then multiply by 2 until there are no decimals or the decimal repeats.
0.3 * 2 = 0.6 (0) 0.6 * 2 = 1.2 (1) 0.2 * 2 = 0.4 (0) 0.4 * 2 = 0.8 (0) 0.8 * 2 = 1.6 (1)
At this point, 0.6 will appear again, entering the cycle, so 0.3 = 0.010011001… 1001 so 20.3 = 10100.010011001… 1001 (binary)
2, binary scientific notation
20.3 = 10100.010011001… 1001 (binary)=1.01000100110011E10… . (Decimal science) =1.01000100110011E100… . (Binary scientific counting)
We’re using shift storage, so for float, we add 127 to the exponent and 1023 to the double (minus 127 and 1023, respectively)
To note at the same time, float, for example, the highest level represents the number of the sign bit, index, a total of eight, the highest said is index of the positive and negative, because it is possible that E – 100 situation, so although there are eight, highest level is the sign bit, the remaining seven is said true value, that is why use shift storage.
A number can be equal to the corresponding double as long as it is within the range of and float and the decimal part is not infinite.
Floating point rounding rules
For example, when rounding a double-precision floating point number with the mantissa of 52 digits, you need to refer to the 53rd digit.
If the 53rd digit is 1 and all subsequent digits are 0, the 52nd digit is 0. If the 52nd bit is 0, no further operations are required. If the 52nd bit is 1, the 53rd bit must be carried one bit further to the 52nd bit.
If the 53rd digit is 1, but not all of the following digits are 0, then the 53rd digit is the next digit to the 52nd digit.
If it is not the above two cases, that is, 53 bits are 0, then it is directly discarded without carrying, called rounding down.
The floating-point rounding rule proves why the relative rounding error in the floating-point rounding mentioned above cannot be greater than half of the machine’s ε.
For Java, float is normally reserved for 7 decimal places and double for 15 decimal places.
This is also due to the data width limitation of the mantissa
For float, because 2^23 = 8388608
At the same time, the left-most bit is omitted by default, so it can actually represent the number of 2^24 = 16777216. It can represent 8 bits at most, but it can only represent 7 bits with absolute accuracy.
For double, 2^52 = 4503599627370496. Plus an omitted bit, 2^53 = 9007199254740992. So double can represent at most 16 bits, and absolute precision only 15 bits.
4, machine ε
Machine ε represents the difference between 1 and the smallest floating point number greater than 1. Machine ε is defined differently with different precision. In the case of double precision,
Double precision means 1 is
1.000… 0000 (52 0’s) times 2^0
And the smallest double greater than 1 is (actually can represent a smaller range, as mentioned later, but does not affect the machine ε here).
1.000……0001 × 2^0
That is
^ 2-2.220446049250313 e-16 52 material. So it’s the double precision machine ε.
In rounding, the relative rounding error cannot be greater than half of machine ε.
For double-precision floating-point numbers, this value is 0.00000005960464477539.
So when you multiply eight consecutive 0.1’s in a Double in Java, you get an imprecise representation.
Personal blog
Tencent Cloud Community
CSDN
Jane’s book
The public no. :
Reference: baijiahao.baidu.com/s?id=161817… www.cnblogs.com/Vicebery/p/… Blog.csdn.net/Return_head… Blog.csdn.net/u011277123/… Blog.csdn.net/endlessseao…