Table of contents
- One, introduction
- Two, floating-point representation
  - 1. IEEE 754
  - 2. Single precision and double precision
  - 3. Single-precision floating-point representation
  - 4. Worked example
  - 5. Code testing
- Three, floating-point number comparison
  - 1. Precision definition
  - 2. Determination of equality
  - 3. Determination of inequality
  - 4. Greater than or equal to
  - 5. Less than or equal to
  - 6. Less than
  - 7. Greater than
One, introduction
- Consider the following code. What does it output?
double x = 0;
for (int i = 0; i < 10; ++i) {
    x += 0.1;
}
printf("%d\n", x == 1);
- The output is as follows:
0
- The reason for this result is floating-point error. A floating-point number is stored with limited precision, so ‘==’ should not be used to test whether two floating-point numbers are equal. To see why, we next look at how floating-point numbers are represented.
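To make the error visible, the sum can be printed with more digits than the default. The following is a minimal sketch; the names sum_tenths and show_sum are hypothetical helpers introduced only for illustration, and double is assumed to be an IEEE 754 64-bit type.

```cpp
#include <cstdio>

// Adds 0.1 to an accumulator n times, like the loop in the opening example.
double sum_tenths(int n) {
    double x = 0;
    for (int i = 0; i < n; ++i) {
        x += 0.1;
    }
    return x;
}

// Prints the sum with 17 significant digits, which is enough to tell
// any two distinct double values apart, followed by the '==' result.
void show_sum() {
    double x = sum_tenths(10);
    std::printf("%.17g\n", x);      // a value slightly below 1
    std::printf("%d\n", x == 1.0);  // 0, because the sum is not exactly 1
}
```

Calling show_sum() shows that the accumulated value differs from 1 only far to the right of the decimal point, which is why the exact comparison fails.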
Two, floating-point representation
1. IEEE 754
- The IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754) has been the most widely used floating-point standard since the 1980s and is adopted by many CPUs and floating-point units. It defines the formats for representing floating-point numbers (including negative zero, −0, and denormal numbers), special values (Inf and NaN), and the floating-point operations on those values.
- Based on this standard, any floating-point number can be expressed as: $(Value)_2 = (-1)^{Sign} \times Fraction \times 2^{Exponent}$
Sign is the sign bit and indicates whether the number is positive or negative; Fraction is the mantissa, which in binary scientific notation always starts with 1, for example 1.010100111; Exponent is the exponent, for example $(1001)_2$ stands for $2^9$.
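The decomposition into sign, fraction, and exponent can be evaluated directly. This is a small sketch using std::ldexp from the standard library; compose is a hypothetical helper name, and the sign argument is assumed to follow the sign-bit convention (0 for positive, 1 for negative).

```cpp
#include <cmath>

// Evaluates fraction * 2^exponent and applies the sign bit.
// `compose` is an illustrative name, not part of any standard.
double compose(int sign, double fraction, int exponent) {
    double magnitude = std::ldexp(fraction, exponent);  // fraction * 2^exponent
    return sign ? -magnitude : magnitude;
}
```

For example, compose(0, 1.25, 3) evaluates $(1.01)_2 \times 2^3$ and yields 10, and compose(1, 1.0, 9) yields $-2^9 = -512$.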
2. Single precision and double precision
- In C/C++, float is the single-precision type and double is the double-precision type.
- Single precision has a 23-bit mantissa and an 8-bit exponent; double precision has a 52-bit mantissa and an 11-bit exponent.
- Since the structure is the same, only single-precision floating-point numbers are covered here.
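These widths can be checked against the compiler's own constants. The sketch below assumes a platform where float and double follow IEEE 754 (true of most current systems, but not guaranteed by the language); the four function names are hypothetical.

```cpp
#include <cfloat>
#include <climits>

// Total storage width of each type in bits.
int float_width()  { return static_cast<int>(sizeof(float)  * CHAR_BIT); }
int double_width() { return static_cast<int>(sizeof(double) * CHAR_BIT); }

// Significand precision in bits as reported by <cfloat>. These macros
// count the implicit leading 1, so they are one larger than the number
// of stored mantissa bits (23 and 52).
int float_precision()  { return FLT_MANT_DIG; }
int double_precision() { return DBL_MANT_DIG; }
```

On such a platform the widths are 32 and 64 bits: 1 sign bit plus 8 exponent bits plus 23 mantissa bits, and 1 plus 11 plus 52, respectively.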
3. Single-precision floating-point representation
- A 32-bit floating-point number is stored as three fields, S, E, and F, for a total of 32 bits (each 0 or 1).
- 1) S is the highest bit (bit 31) and represents the sign: 1 for a negative number, 0 for a positive number.
- 2) E occupies bits 30 down to 23, 8 bits in total. It holds the exponent with an offset: the stored value is the actual exponent plus 127, because the actual exponent may be negative.
- 3) F occupies the remaining 23 bits (bits 22 down to 0) and holds the mantissa of the binary scientific notation. Because that notation always starts with “1.”, the leading 1 is not stored, so the mantissa effectively has 24 significant bits.
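The three fields can be pulled out of a float with shifts and masks. This is a minimal sketch assuming a 32-bit IEEE 754 float; FloatFields and split_float are hypothetical names introduced for illustration.

```cpp
#include <cstdint>
#include <cstring>

struct FloatFields {
    std::uint32_t sign;      // S: 1 bit
    std::uint32_t exponent;  // E: 8 bits, stored with the +127 offset
    std::uint32_t fraction;  // F: 23 bits, without the implicit leading 1
};

FloatFields split_float(float value) {
    std::uint32_t bits;
    static_assert(sizeof(bits) == sizeof(value), "float is assumed to be 32 bits");
    std::memcpy(&bits, &value, sizeof(bits));  // copy the raw bytes

    FloatFields f;
    f.sign = bits >> 31;                // bit 31
    f.exponent = (bits >> 23) & 0xFFu;  // bits 30..23
    f.fraction = bits & 0x7FFFFFu;      // bits 22..0
    return f;
}
```

For 1.0f the stored exponent is 127 (actual exponent 0 plus the offset) and the fraction is 0, since only the implicit leading 1 is set.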
4. Worked example
[Example] Find the binary storage format of the floating-point number $-18.375$.
- 1) Set the sign aside first; it only determines the highest bit, which is set to 1 at the end. What remains is to find the binary representation of 18.375.
- 2) The binary representation of the integer part 18 is: $18 = 16 + 2 = 2^4 + 2^1 = (10010)_2$
- 3) The binary representation of the fractional part 0.375 is: $0.375 = 0.125 + 0.25 = 2^{-3} + 2^{-2} = (.011)_2$
- 4) Adding the two parts gives: $(18.375)_{10} = (10010.011)_2$
- 5) In binary scientific notation this is: $(18.375)_{10} = (1.0010011)_2 \times 2^{(100)_2}$
- 6) The mantissa is therefore $F = 0010011$ (the “1.” of the scientific notation is dropped), padded with zeros on the right to 23 bits. The exponent has 127 added: $E = (100 + 1111111)_2 = (10000011)_2$. The sign bit is $S = 1$.
S: 1
E: 10000011
F: 00100110000000000000000
- 7) Concatenate S, E, and F to get the binary representation:
SEF = 11000001 10010011 00000000 00000000
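The steps of the example can be retraced in code. The sketch below converts the integer and fractional parts to binary digits, then packs S, E, and F into one 32-bit pattern and reads it back as a float. All four function names are hypothetical, and a 32-bit IEEE 754 float is assumed.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Binary digits of a non-negative integer, e.g. 18 -> "10010".
std::string integer_to_binary(unsigned n) {
    if (n == 0) return "0";
    std::string digits;
    while (n > 0) {
        digits.insert(digits.begin(), static_cast<char>('0' + n % 2));
        n /= 2;
    }
    return digits;
}

// Binary digits of a fraction in [0, 1), e.g. 0.375 -> "011".
// Stops after max_digits because most fractions never terminate in binary.
std::string fraction_to_binary(double frac, int max_digits) {
    std::string digits;
    for (int i = 0; i < max_digits && frac > 0; ++i) {
        frac *= 2;
        if (frac >= 1) {
            digits += '1';
            frac -= 1;
        } else {
            digits += '0';
        }
    }
    return digits;
}

// Packs the three fields into one 32-bit pattern: S | E | F.
std::uint32_t pack_fields(std::uint32_t s, std::uint32_t e, std::uint32_t f) {
    return (s << 31) | (e << 23) | f;
}

// Reinterprets a 32-bit pattern as a float by copying its bytes.
float bits_to_float(std::uint32_t bits) {
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}
```

With S = 1, E = 4 + 127 = 131 (0x83), and F = 0010011 followed by 16 zeros (0x130000), pack_fields(1, 0x83, 0x130000) gives 0xC1930000, which is the pattern 11000001 10010011 00000000 00000000, and bits_to_float turns it back into -18.375.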
5. Code testing
- In C/C++, the following code reads the memory of a floating-point number as an unsigned integer; printing that integer in binary then shows the stored representation:
float a = 18.375;
unsigned int v = *((unsigned int *)&a);
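A complete way to produce the binary text might look like the sketch below. Note that it swaps the pointer cast for std::memcpy, because reading a float through an unsigned int pointer is formally undefined behavior under the language's aliasing rules, even though it works on many compilers. The name to_bit_string is hypothetical, and a 32-bit float is assumed.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Returns the 32 bits of a float as text, most significant bit first.
std::string to_bit_string(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));  // copy bytes instead of casting pointers
    std::string out;
    for (int i = 31; i >= 0; --i) {
        out += ((bits >> i) & 1u) ? '1' : '0';
    }
    return out;
}
```

For a = 18.375 this gives 01000001100100110000000000000000; it differs from the worked example for -18.375 only in the leading sign bit.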
Three, floating-point number comparison
1. Precision definition
- In C++, 1e-6 denotes $10^{-6}$, that is, 0.000001. It is used as the precision for the comparisons that follow.
#define eps 1e-6
2. Determination of equality
- Because of representation error, floating-point numbers should not be compared with ‘==’. Instead, subtract the two numbers, take the absolute value, and treat them as equal if the result is smaller than the chosen precision.
bool EQ(double a, double b) { // EQual; fabs comes from <cmath>
    return fabs(a - b) < eps;
}
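Applied to the loop from the introduction, the tolerance-based test gives the expected answer where ‘==’ does not. This is a minimal sketch that repeats eps and EQ so it compiles on its own; sum_tenths, exact_match, and tolerant_match are hypothetical names.

```cpp
#include <cmath>

#define eps 1e-6

bool EQ(double a, double b) {
    return std::fabs(a - b) < eps;
}

// Repeats the loop from the introduction: 0.1 added ten times.
double sum_tenths() {
    double x = 0;
    for (int i = 0; i < 10; ++i) {
        x += 0.1;
    }
    return x;
}

// Exact comparison: fails because of the accumulated rounding error.
bool exact_match() { return sum_tenths() == 1.0; }

// Comparison within eps: succeeds, since the error is far smaller than 1e-6.
bool tolerant_match() { return EQ(sum_tenths(), 1.0); }
```

An absolute tolerance such as 1e-6 suits values of moderate magnitude like these. For very large or very small values, a fixed eps can be too strict or too loose, so the tolerance should be chosen to match the scale of the data.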
3. Determination of inequality
- ‘Not equal’ is the negation of ‘equal’:
bool NEQ(double a, double b) { // NotEQual
    return !EQ(a, b);
}
4. Greater than or equal to
- ‘Greater than or equal to’ means ‘greater than’ or ‘equal to’, so it is broken down as follows:
bool GET(double a, double b) { // GreaterEqualThan
    return a > b || EQ(a, b);
}
5. Less than or equal to
- ‘Less than or equal to’ means ‘less than’ or ‘equal to’, so it is broken down as follows:
bool SET(double a, double b) { // SmallerEqualThan
    return a < b || EQ(a, b);
}
6. Less than
- ‘Less than’ is the negation of ‘greater than or equal to’, so it is broken down as follows:
- $a < b$
bool ST(double a, double b) { // SmallerThan
    return a < b && NEQ(a, b);
}
7. Greater than
- ‘Greater than’ is the negation of ‘less than or equal to’, so it is broken down as follows:
- $a > b$
bool GT(double a, double b) { // GreaterThan
    return a > b && NEQ(a, b);
}
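Put together, the six helpers behave like an ordering with a tolerance band around equality. The sketch below repeats them so it compiles on its own and adds one hypothetical function, relations_holding, which counts how many of ‘less than’, ‘equal’, and ‘greater than’ hold for a pair of ordinary (non-NaN) values.

```cpp
#include <cmath>

#define eps 1e-6

bool EQ(double a, double b)  { return std::fabs(a - b) < eps; }
bool NEQ(double a, double b) { return !EQ(a, b); }
bool GET(double a, double b) { return a > b || EQ(a, b); }
bool SET(double a, double b) { return a < b || EQ(a, b); }
bool ST(double a, double b)  { return a < b && NEQ(a, b); }
bool GT(double a, double b)  { return a > b && NEQ(a, b); }

// Counts how many of the three strict relations hold for (a, b).
// For non-NaN inputs exactly one of them is true, so the result is 1.
int relations_holding(double a, double b) {
    return ST(a, b) + EQ(a, b) + GT(a, b);
}
```

For instance, 0.1 + 0.2 is slightly larger than 0.3 in double arithmetic, so the built-in ‘>’ reports true, while GT(0.1 + 0.2, 0.3) is false and EQ(0.1 + 0.2, 0.3) is true.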