1. In a computer, what does single-precision 1 - 0.9 equal?
In ordinary arithmetic, 1 - 0.9 is of course 0.1; the answer is indisputable. But what does a computer get for 1 - 0.9 in single precision? If you have tried it, you know the result is 0.100000024.
Let's run a piece of code.
```java
public static void main(String[] args) {
    System.out.println(1.0f - 0.9f);
}
// The result is 0.100000024
```
Is single-precision 1 - 0.9 at least equal to 0.9 - 0.8?
```java
public static void main(String[] args) {
    System.out.println((1.0f - 0.9f) == (0.9f - 0.8f));
}
// The result is false
```
Let's print 0.9 - 0.8 by itself.
```java
public static void main(String[] args) {
    System.out.println(0.9f - 0.8f);
}
// The result is 0.099999964
```
Clearly there is a noticeable error between the computed results and our expectations. Where does this error come from? To answer that, we need to look at how floating-point numbers are stored and computed.
2. 0 and 1
Before introducing floating-point numbers, let’s review binary, because floating-point numbers are also stored and computed in binary on computers.
Simply put, a computer is an electronic device, and whatever technology is built on top of it ultimately comes down to processing 0 and 1 signals. The elementary units of information storage and logical computation can only be 0 and 1. Binary's carry rule is "carry one on every two", and its borrow rule is "borrow one as two". For example, decimal 1 is also 1 in binary, and decimal 2 is 10 in binary. Similarly, 4 ==> 100, 24 ==> 11000 ...
Take decimal 24 as an example. To convert it to binary, repeatedly divide by 2 and record the remainders: 24 ÷ 2 = 12 remainder 0; 12 ÷ 2 = 6 remainder 0; 6 ÷ 2 = 3 remainder 0; 3 ÷ 2 = 1 remainder 1; 1 ÷ 2 = 0 remainder 1. Reading the remainders in reverse order gives the binary form of 24, namely 11000.
To convert binary back to decimal, take 11000 as an example: $2^4 + 2^3 = 24$.
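As a quick check of the procedure above, here is a minimal Java sketch (the class name `BinaryConversion` is just an illustrative choice) that converts 24 to binary both with the standard library and by repeated division, and then converts it back:

```java
public class BinaryConversion {
    public static void main(String[] args) {
        // Decimal to binary using the built-in helper
        System.out.println(Integer.toBinaryString(24)); // 11000

        // The same conversion done by repeated division by 2,
        // collecting remainders in reverse order
        int n = 24;
        StringBuilder bits = new StringBuilder();
        while (n > 0) {
            bits.insert(0, n % 2); // prepend the remainder
            n /= 2;
        }
        System.out.println(bits); // 11000

        // Binary back to decimal
        System.out.println(Integer.parseInt("11000", 2)); // 24
    }
}
```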
The basic unit of storage in a computer is the bit (b for short). Eight bits make up one byte (B for short). 1024 bytes make 1 KB; 1024 KB make 1 MB; 1024 MB make 1 GB ...
Taking one byte as the basic unit, the binary representation of decimal 1 is 0000 0001. The leftmost bit indicates the sign: 0 means positive and 1 means negative, so -1 can be written as 1000 0001. This is the sign-magnitude form (the "original code").
Binary arithmetic involves three encodings: sign-magnitude (original code), ones' complement (inverse code), and two's complement.
- Sign-magnitude (original code): a positive number is the value itself, with sign bit 0; a negative number is the value itself, with sign bit 1. For 8-bit binary numbers the representable range is [-127, 127].
- Ones' complement (inverse code): a positive number is the value itself, with sign bit 0; a negative number inverts every value bit of the positive representation, with sign bit 1. For 8-bit binary numbers the representable range is [-127, 127].
- Two's complement: a positive number is the value itself, with sign bit 0; a negative number adds 1 to its ones' complement, with sign bit 1. For 8-bit binary numbers the representable range is [-128, 127] (note that two's complement covers one more value than the other two; this is explained later).
Example:

Decimal value | Sign-magnitude | Ones' complement | Two's complement |
---|---|---|---|
1 | 0000 0001 | 0000 0001 | 0000 0001 |
-1 | 1000 0001 | 1111 1110 | 1111 1111 |
2 | 0000 0010 | 0000 0010 | 0000 0010 |
-2 | 1000 0010 | 1111 1101 | 1111 1110 |
So the question arises: since sign-magnitude matches our intuition best, what are ones' complement and two's complement for? A computer does not operate the way a human thinks. We can tell at a glance whether a value is positive or negative, but if the machine had to treat the sign bit and the value bits separately, it would need extra logic for every operation; stacked up across complex applications, that is a huge computational overhead, which is clearly unreasonable. To keep arithmetic fast, the sign bit should take part in the calculation like any other bit, and if sign-magnitude is used directly, this goes wrong in some cases.

Take subtraction as an example: subtracting a value is the same as adding its negation, so 1 - 2 = 1 + (-2) = -1. In sign-magnitude, 0000 0001 + 1000 0010 = 1000 0011, i.e. -3, which is incorrect. Using ones' complement, 0000 0001 + 1111 1101 = 1111 1110, i.e. -1, which is correct. But consider 2 + (-2) in ones' complement: 0000 0010 + 1111 1101 = 1111 1111, i.e. -0. By our ordinary understanding 0 is just 0, with no positive or negative, so this encoding still has a problem. As encodings evolved, two's complement was born; the same calculation becomes 0000 0010 + 1111 1110 = 0000 0000 (the carry out of the highest bit is discarded), i.e. 0, so two's complement resolves the +0/-0 problem. In addition, two's complement enlarges the representable range: in 8-bit binary it can represent -128, whose encoding is 1000 0000 (recall that taking the complement of a complement gives back the sign-magnitude form).
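Java itself stores integers in two's complement, so all of this can be observed directly. A small sketch (the class name is illustrative; every printed value is easy to verify by hand):

```java
public class TwosComplementDemo {
    public static void main(String[] args) {
        // Java ints are stored in two's complement; -2 in 32 bits:
        System.out.println(Integer.toBinaryString(-2));
        // 11111111111111111111111111111110

        // Because the sign bit participates in the addition like any
        // other bit, 2 + (-2) is just a plain binary add:
        System.out.println(2 + (-2)); // 0

        // The extra value on the negative side: an 8-bit byte ranges
        // from -128 to 127, and -128 is encoded as 1000 0000
        byte min = Byte.MIN_VALUE;
        System.out.println(min); // -128
        System.out.println(Integer.toBinaryString(min & 0xFF)); // 10000000
    }
}
```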
3. Floating-point numbers
Unlike integers, floating-point numbers cannot be stored and computed as described above; they are represented in separate parts: a sign, an exponent, and significant digits. The current floating-point standard is IEEE 754, which defines four formats: single precision, double precision, extended single precision, and extended double precision. The first two are by far the most common, and single and double precision differ only in the number of bits, so the rest of this section reviews single precision.
Single precision is allocated 4 bytes, i.e. 32 bits, laid out as 1 sign bit, 8 exponent bits, and 23 mantissa bits.
By convention the low end of the layout is written on the right; the rightmost bit is the "least significant bit", the bit with the smallest weight and the least influence on the whole. The representation is based on scientific notation: the sign bit gives the sign of the value, the 8 exponent bits store the exponent of the normalized value (this field is called the exponent field, or order code), and the remaining significant digits are called the mantissa.
- The sign bit
The highest binary bit is the sign of the floating-point number: 0 means positive, 1 means negative.
- Exponent bits
Eight bits to the right of the sign bit store the exponent. According to IEEE 754, the exponent field stores a biased (shifted) value rather than a sign-magnitude or two's-complement value. The geometric meaning of the bias is to map the true exponents onto a non-negative range, so that comparing two exponents only requires aligning the high bits and comparing them bit by bit.
Let the true exponent be $e$ and the stored exponent be $E$. IEEE 754 specifies the bias $2^{n-1}-1$, where $n$ is the number of exponent bits; here $n = 8$, so $E = e + (2^{n-1}-1) = e + 127$. Conceptually, 8 signed bits can represent [-128, 127]; shifting into the non-negative range gives [0, 255]. The standard then reserves the two extreme encodings as special values: all zeros denotes machine zero (values too small to represent precisely are treated as zero; unlike the exact value 0, which is a single point, machine zero covers a region), and all ones denotes infinity. Removing these two special values leaves a stored range of [1, 254], and after subtracting the bias $2^{n-1}-1 = 127$, the range of true exponents is [-126, 127].
- Mantissa bits
The 23 bits on the far right store the significant digits. In scientific notation, the significant digits combined with the exponent determine the final value; the part before the decimal point lies in [1, 10) in decimal, and correspondingly in [1, 2) in binary. To save storage, the leading 1 of the normalized form 1.xyz is omitted, so these 23 bits actually represent 24 bits of significance. This field is called the mantissa.
Each of the three fields has its own responsibility, and together they determine the value. The calculation formula is:

$$V = (-1)^s \times 1.M \times 2^{E-127}$$

where $s$ is the sign bit, $M$ is the 23-bit mantissa (with the hidden leading 1 restored), and $E$ is the stored exponent.
Take the value 16 as an example. Its 8-bit integer binary form is 0001 0000, and its single-precision form is 0100-0001-1000-0000-0000-0000-0000-0000. Reading it off: the highest bit is 0, so the number is positive; the exponent field 1000-0011 is 131 in decimal, and 131 - 127 = 4; the mantissa bits are all 0, so the significand is 1.0. The value is $1 \times 2^4 = 16$.
If the value is 1, the corresponding single-precision form is 0011-1111-1000-0000-0000-0000-0000-0000. The highest bit is 0, so the number is positive; the exponent field 0111-1111 is 127 in decimal, giving $2^{127-127} = 1$; the mantissa bits are all 0, so the value is $1 \times 1 = 1$.
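The two examples above can be checked programmatically. Here is a minimal sketch (class and method names are illustrative) that pulls the three fields out of a float using `Float.floatToIntBits`:

```java
public class FloatFields {
    static void decode(float f) {
        int bits = Float.floatToIntBits(f);
        int sign = bits >>> 31;               // 1 sign bit
        int exponent = (bits >>> 23) & 0xFF;  // 8 exponent bits (biased)
        int mantissa = bits & 0x7FFFFF;       // 23 mantissa bits
        System.out.println(f + ": sign=" + sign
                + ", exponent=" + exponent + " (true " + (exponent - 127) + ")"
                + ", mantissa=" + Integer.toBinaryString(mantissa));
    }

    public static void main(String[] args) {
        decode(16.0f); // sign=0, exponent=131 (true 4), mantissa=0
        decode(1.0f);  // sign=0, exponent=127 (true 0), mantissa=0
    }
}
```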
These two values can be represented exactly as floating-point numbers, but most values cannot be represented exactly in a finite number of bits.
Take 0.9 as an example. The exponent field can contribute $2^{-1}$ (i.e. 0.5); neither $2^0$ nor $2^{-2}$ can be multiplied by a significand of the form 1.x to reach 0.9. If the mantissa could represent 0.8 exactly, the whole could represent $1.8 \times 0.5 = 0.9$ exactly, but a finite number of binary digits cannot represent 0.8 exactly. The single-precision form of 0.9 is 0011-1111-0110-0110-0110-0110-0110-0110, which is not exactly 0.9. So, returning to the opening question: 1 - 0.9 is not exactly 0.1; the precise result is computed later. (A note on converting binary fractions to decimal: the first digit after the point contributes $2^{-1}$, the second $2^{-2}$, and so on. For example, $1.00000101 = 1 + 2^{-6} + 2^{-8}$.)
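To see the error concretely, we can print 0.9f's full bit pattern and its exact decimal value. A small sketch (a float widens to double exactly, so `new BigDecimal((double) 0.9f)` shows precisely what is stored):

```java
import java.math.BigDecimal;

public class ExactValue {
    public static void main(String[] args) {
        int bits = Float.floatToIntBits(0.9f);
        // Pad to the full 32 bits so the sign and exponent stay visible
        String s = String.format("%32s", Integer.toBinaryString(bits)).replace(' ', '0');
        System.out.println(s);
        // 00111111011001100110011001100110

        // The exact value actually stored for 0.9f
        System.out.println(new BigDecimal((double) 0.9f));
        // 0.89999997615814208984375
    }
}
```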
Think about it: how far apart are two adjacent exactly-representable values? The true exponent ranges over [-126, 127], so the smallest exponent factor is $2^{-126}$; two adjacent mantissas differ by $2^{-23}$ (adding 1 to the last mantissa bit adds $2^{-23}$ in decimal). So the smallest gap between two adjacent values is $2^{-126} \times 2^{-23} = 2^{-149}$. What is the largest value single precision can represent? The exponent contributes at most $2^{127} \approx 1.7 \times 10^{38}$; with every mantissa bit set to 1, the significand 1.11···11 is a value infinitely close to 2, so the maximum representable value is approximately $2 \times 1.7 \times 10^{38} = 3.4 \times 10^{38}$. And the smallest positive number? By the same reasoning, the smallest positive normalized value is $1.0 \times 2^{-126}$. Here a concept called gradual underflow comes in: the spacing between neighboring values should stay uniform. If $1.0 \times 2^{-126}$ were the smallest positive value, the gap between it and 0 would be $2^{-126}$, far larger than the $2^{-149}$ gap between neighboring representable values; that jump is called abrupt underflow. The IEEE 754 standard stipulates gradual underflow (via denormalized values), so the gap down to 0 is also $2^{-149}$, i.e. the smallest positive value is $2^{-149} \approx 1.4 \times 10^{-45}$.
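Java exposes all of these limits as constants, so the arithmetic above is easy to confirm:

```java
public class FloatLimits {
    public static void main(String[] args) {
        System.out.println(Float.MAX_VALUE);  // 3.4028235E38, about 3.4 x 10^38
        System.out.println(Float.MIN_NORMAL); // 1.17549435E-38, i.e. 1.0 x 2^-126
        System.out.println(Float.MIN_VALUE);  // 1.4E-45, i.e. 2^-149 (denormalized)
        // The gap between 1.0f and the next representable float is 2^-23
        System.out.println(Math.ulp(1.0f));   // 1.1920929E-7
    }
}
```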
4. Addition and subtraction operations
To add or subtract two values written in scientific notation, you must first make their exponents equal, and then add or subtract the significands like ordinary numbers.
- Zero detection
A value whose exponent field and mantissa are all 0 is defined to be 0. If either operand is 0, the result can be produced directly.
- Exponent alignment
If the two exponents are equal, the binary points are already aligned. If not, one mantissa must be shifted so that its exponent changes to match: shifting the mantissa right by one bit increases the exponent by 1, and vice versa. Think about which direction to shift: either way some bits may be squeezed out, causing error, but a left shift loses high-order bits worth about $2^{-1}$, while a right shift loses low-order bits worth about $2^{-23}$. The error of the right shift is obviously smaller, so the standard specifies that alignment may only shift right (raising the smaller exponent to the larger one).
- Mantissa sum
Once the exponents are aligned, the mantissas can be summed bit by bit (negative operands are first converted to two's complement).
- Normalization
Converting the result back into the canonical form described earlier is called normalization. Shifting the mantissa right is called right-normalization; the opposite is left-normalization.
- Rounding the result
Both exponent alignment and normalization can lose precision. To reduce the loss, the bits shifted out are first saved in extra digits called guard bits, and after normalization the result is rounded according to the guard bits.
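Exponent alignment is exactly where small operands lose bits. A one-line illustration: $2^{24}$ is representable, but the 1 added to it is shifted completely out of the 24-bit significand during alignment, so the sum rounds back to $2^{24}$:

```java
public class AlignmentLoss {
    public static void main(String[] args) {
        float big = 16777216f; // 2^24, exactly representable
        // 1.0f must be shifted right 24 bits to match big's exponent,
        // which pushes it past the 23 stored mantissa bits entirely
        System.out.println(big + 1f == big); // true
    }
}
```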
Now that we know how binary floating-point addition and subtraction work, let's go back to the opening question: what is 1 - 0.9?
- The binary of 1.0 is 0011-1111-1000-0000-0000-0000-0000-0000
- The binary of -0.9 is 1011-1111-0110-0110-0110-0110-0110-0110
For convenience, the values of the three fields are separated out:
Floating-point number | Sign | Exponent | Significand (hidden 1 restored) | Significand two's complement |
---|---|---|---|---|
1.0 | 0 | 127 | 1000 0000 0000 0000 0000 0000 | 1000 0000 0000 0000 0000 0000 |
-0.9 | 1 | 126 | 1110 0110 0110 0110 0110 0110 | 0001 1001 1001 1001 1001 1010 |
- Exponent alignment
The exponent of 1.0 is 127 and the exponent of -0.9 is 126. Per the standard, alignment may only shift right, so the exponent of -0.9 is raised to 127 and its significand is shifted right one bit. Because the two's-complement form represents a negative value, the vacated highest bit is filled with 1, giving 1000 1100 1100 1100 1100 1101 (the bit shifted out at the low end is dropped).
- Mantissa summation
Adding the two two's-complement significands and discarding the carry out of the top bit: 1000 0000 0000 0000 0000 0000 + 1000 1100 1100 1100 1100 1101 = 0000 1100 1100 1100 1100 1101, a positive result.
- Normalization
The value from the previous step is not yet in canonical form: the highest significand bit must be 1, so the significand is shifted left 4 bits and the exponent is reduced by 4. After normalization the sign is 0, the exponent is 123 (binary 0111 1011), and the mantissa after hiding the leading 1 is 100 1100 1100 1100 1101 0000. Assembling the three fields gives the result of 1 - 0.9: 0011-1101-1100-1100-1100-1100-1101-0000, which in decimal is 0.100000024.
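We can hand the assembled bit pattern back to Java and confirm it matches the output from the beginning of the article (the binary literal below is just the 32 bits assembled above):

```java
public class VerifyResult {
    public static void main(String[] args) {
        // sign 0, exponent 0111 1011 (123), mantissa 100 1100 ... 1101 0000
        int bits = 0b0_01111011_10011001100110011010000;
        System.out.println(Float.intBitsToFloat(bits)); // 0.100000024
        System.out.println(Float.intBitsToFloat(bits) == 1.0f - 0.9f); // true
    }
}
```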
What if we need absolute accuracy? In finance, for example, a tiny loss of precision can lead to huge losses of property. In such cases a decimal type is recommended, which in Java is BigDecimal.
```java
import java.math.BigDecimal;

public static void main(String[] args) {
    BigDecimal a = new BigDecimal("1.0");
    BigDecimal b = new BigDecimal("0.9");
    BigDecimal c = a.subtract(b);
    System.out.println(c);
}
// The result is 0.1
```
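One caveat worth knowing (this is the standard advice for BigDecimal, e.g. in the Java Development Manual): construct BigDecimal from a String, as above, not from a double, because the double constructor faithfully captures the binary rounding error:

```java
import java.math.BigDecimal;

public class BigDecimalPitfall {
    public static void main(String[] args) {
        // The double literal 0.9 is already inexact, and BigDecimal
        // preserves that inexact binary value digit for digit
        System.out.println(new BigDecimal(0.9));
        // 0.90000000000000002220446049250313080847263336181640625

        // The String constructor parses the decimal digits directly
        System.out.println(new BigDecimal("0.9")); // 0.9
    }
}
```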
If you think it will help you, please give this article a thumbs up! Thanks a million!
5. References
- Java Development Manual
- Baidu Encyclopedia