Article contents

    • One, introduction
    • Two, floating-point representation
      • 1. IEEE 754
      • 2. Single precision and double precision
      • 3. Single-precision floating-point representation
      • 4. Example
      • 5. Code testing
    • Three, floating-point comparison
      • 1. Precision definition
      • 2. Determination of equality
      • 3. Determination of inequality
      • 4. Greater than or equal to
      • 5. Less than or equal to
      • 6. Less than
      • 7. Greater than


One, introduction

  • Consider the following code. What does it print?
	double x = 0;
	for (int i = 0; i < 10; ++i) {
		x += 0.1;
	}
	printf("%d\n", x == 1);
  • The output is as follows:

0

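  • The cause is floating-point error. 0.1 has no exact binary representation, so the stored value is slightly off and every addition can round again; after ten additions x is very close to 1 but not exactly 1. For this reason ‘==’ should not be used to test whether two floating-point numbers are equal. To see where the error comes from, we first look at how floating-point numbers are represented.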
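
To make the error visible, the values can be printed with 17 significant digits. The following is a minimal sketch, assuming double is an IEEE 754 double-precision number; the helper names sum_tenths and format17 are made up for this illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Adds 0.1 ten times, exactly as the loop above does.
double sum_tenths() {
    double x = 0;
    for (int i = 0; i < 10; ++i) {
        x += 0.1;
    }
    return x;
}

// Formats a double with 17 significant digits, which is enough to tell
// any two distinct IEEE 754 double values apart.
std::string format17(double value) {
    char buf[64];
    int n = std::snprintf(buf, sizeof buf, "%.17g", value);
    assert(n > 0 && n < static_cast<int>(sizeof buf));
    return std::string(buf);
}
```

Under that assumption, format17(0.1) gives 0.10000000000000001 and format17(sum_tenths()) gives 0.99999999999999989, which explains why the comparison with 1 fails.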

Two, floating-point representation

1. IEEE 754

  • The IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754) has been the most widely used floating-point standard since the 1980s and is adopted by many CPUs and floating-point units. It defines the formats for representing floating-point numbers (including negative zero, -0, and denormal numbers), the special values (Inf and NaN), and the floating-point operations on those values.
  • Based on this specification, a floating-point number can be expressed as: $(Value)_2 = Sign \times Fraction \times 2^{Exponent}$

Sign is the sign, which makes the number positive or negative; Fraction is the mantissa, which in binary scientific notation always starts with 1, for example 1.010100111; Exponent is the exponent: an Exponent of $(1001)_2$ stands for the factor $2^9$.

2. Single precision and double precision

  • Single precision corresponds to the type float, and double precision to the type double.
  • Single precision has a 23-bit mantissa and an 8-bit exponent; double precision has a 52-bit mantissa and an 11-bit exponent. Each also has one sign bit, giving 32 and 64 bits in total.
  • Since the two formats have the same structure, only single precision is described here.
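
The field widths above can be cross-checked against what the compiler reports. This is a small sketch, assuming float and double are the IEEE 754 single and double formats (true on most current platforms, but not guaranteed by the language); the function names are invented for the illustration.

```cpp
#include <climits>
#include <limits>

// Total storage width of T in bits.
template <typename T>
constexpr int storage_bits() {
    return static_cast<int>(sizeof(T) * CHAR_BIT);
}

// Mantissa bits actually stored. numeric_limits<T>::digits also counts the
// implicit leading 1, so one is subtracted. Valid for float and double only.
template <typename T>
constexpr int stored_mantissa_bits() {
    return std::numeric_limits<T>::digits - 1;
}

// Exponent bits: whatever remains after the sign bit and the stored mantissa.
template <typename T>
constexpr int exponent_bits() {
    return storage_bits<T>() - 1 - stored_mantissa_bits<T>();
}
```

With the assumed formats, stored_mantissa_bits<float>() is 23 and exponent_bits<float>() is 8, while the double versions give 52 and 11.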

3. Single-precision floating-point representation

  • A single-precision floating-point number is stored in 32 bits (each 0 or 1), laid out from the highest bit to the lowest as S (1 bit), E (8 bits) and F (23 bits).
  • 1) S is the highest bit and represents the sign: 1 for a negative number, 0 for a positive number.
  • 2) E occupies bits 30 down to 23, 8 bits in total. It is the biased exponent: the stored value is the actual exponent plus 127 (the bias is needed because the exponent may be negative);
  • 3) F is the mantissa of the binary scientific notation, 23 bits in total. Because the notation always starts with “1.”, that leading 1 is not stored, so the effective mantissa has 24 bits.
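
The layout can be sketched in code by copying the bits of a float into a 32-bit integer and masking out the three fields. This is an illustrative sketch, assuming float is a 32-bit IEEE 754 single; FloatFields and split_float are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

struct FloatFields {
    std::uint32_t sign;      // S: 1 bit
    std::uint32_t exponent;  // E: 8 bits, stored with the +127 bias
    std::uint32_t fraction;  // F: 23 bits, without the implicit leading 1
};

FloatFields split_float(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // copy the raw bit pattern
    FloatFields f;
    f.sign = bits >> 31;                // highest bit
    f.exponent = (bits >> 23) & 0xFFu;  // bits 30..23
    f.fraction = bits & 0x7FFFFFu;      // bits 22..0
    // The three fields must recombine into the original pattern.
    assert(((f.sign << 31) | (f.exponent << 23) | f.fraction) == bits);
    return f;
}
```

For example, split_float(1.0f) yields S = 0, E = 127 and F = 0, because 1.0 is 1.0 × 2^0 and the exponent 0 is stored as 0 + 127.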

4. Example

[Example] Find the binary storage format of the floating-point number $-18.375$.

  • 1) First set the sign aside: the number is negative, so the highest bit of the result will be 1. What remains is to find the binary representation of 18.375;
  • 2) The binary representation of the integer part 18 is: $18 = 16 + 2 = 2^4 + 2^1 = (10010)_2$
  • 3) The binary representation of the fractional part 0.375 is: $0.375 = 0.25 + 0.125 = 2^{-2} + 2^{-3} = (0.011)_2$
  • 4) Putting the two parts together: $(18.375)_{10} = (10010.011)_2$
  • 5) Written in binary scientific notation: $(18.375)_{10} = (1.0010011)_2 \times 2^{(100)_2}$
  • 6) This gives the mantissa $F = 0010011$ (the “1.” of the scientific notation is dropped), padded with zeros on the right to 23 bits; adding 127 to the exponent gives $E = (100 + 1111111)_2 = (10000011)_2$; the sign bit is $S = 1$;

S: 1

E: 10000011

F: 00100110000000000000000

  • 7) Concatenating S, E and F gives the 32-bit pattern:

SEF = 11000001 10010011 00000000 00000000
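
The result can be checked in the other direction by rebuilding the value from the three fields. The sketch below assumes a normalized single-precision number and uses the formula $(-1)^S \times (1 + F / 2^{23}) \times 2^{E - 127}$; fields_to_value is a hypothetical helper name.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Rebuilds the value of a normalized single-precision number from its fields:
// (-1)^S * (1 + F / 2^23) * 2^(E - 127).
double fields_to_value(std::uint32_t s, std::uint32_t e, std::uint32_t f) {
    // Normalized numbers only: E = 0 and E = 255 are reserved for
    // zero/denormals and Inf/NaN.
    assert(s <= 1 && e >= 1 && e <= 254 && f < (1u << 23));
    double significand = 1.0 + static_cast<double>(f) / 8388608.0;  // 8388608 = 2^23
    double magnitude = std::ldexp(significand, static_cast<int>(e) - 127);
    return s ? -magnitude : magnitude;
}
```

With S = 1, E = 0x83 (binary 10000011) and F = 0x130000 (binary 00100110000000000000000), fields_to_value(1, 0x83, 0x130000) returns -18.375, matching the example.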

5. Code testing

  • In C/C++, the following code reads the memory at the address of a floating-point number as an integer; printing that integer in binary shows how the float is stored:
	float a = 18.375;
	unsigned int v = *((unsigned int *)&a);
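
The pointer cast above works on common compilers, but it formally breaks the aliasing rules of C and C++. A variant that copies the bytes with memcpy avoids this. The following sketch also turns the integer into a string of 32 binary digits; it assumes a 32-bit IEEE 754 float, and float_bits is a name chosen for the illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Returns the 32 bits of a float as text, highest bit first.
std::string float_bits(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t v;
    std::memcpy(&v, &value, sizeof v);  // copy the bytes instead of casting the pointer
    std::string out;
    for (int i = 31; i >= 0; --i) {
        out += ((v >> i) & 1u) ? '1' : '0';
    }
    assert(out.size() == 32);
    return out;
}
```

For -18.375f this returns 11000001100100110000000000000000, the same pattern as in the worked example; for 18.375f only the first bit changes to 0.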

Three, floating-point comparison

1. Precision definition

  • In C++, the literal 1e-6 denotes $10^{-6}$, that is, 0.000001. It is used here as the comparison precision:
#define eps 1e-6

2. Determination of equality

  • Because of representation error, floating-point numbers should not be compared with ‘==’. Instead, subtract the two numbers, take the absolute value, and treat them as equal if the result is smaller than the chosen precision.
bool EQ(double a, double b) {   // EQual
	return fabs(a - b) < eps;
}
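
Applied to the loop from the introduction, this test gives the expected answer. The sketch below repeats EQ so that it is self-contained and uses a constant instead of the macro; repeated_sum is a hypothetical helper for the illustration.

```cpp
#include <cassert>
#include <cmath>

const double eps = 1e-6;  // same precision as the #define above

bool EQ(double a, double b) {
    return std::fabs(a - b) < eps;
}

// Adds `step` to zero `count` times.
double repeated_sum(double step, int count) {
    assert(count >= 0);
    double x = 0;
    for (int i = 0; i < count; ++i) {
        x += step;
    }
    return x;
}
```

Here repeated_sum(0.1, 10) == 1.0 is false, while EQ(repeated_sum(0.1, 10), 1.0) is true. Values that differ by more than eps, such as 1.0 and 1.00001, are still reported as unequal.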

3. Determination of inequality

  • ‘Not equal’ is the negation of ‘equal’:
bool NEQ(double a, double b) {  // NotEQual
	return !EQ(a, b);
}

4. Greater than or equal to

  • ‘Greater than or equal to’ means ‘greater than’ or ‘equal to’, so it is broken down into the following form:
bool GET(double a, double b) {    // GreaterEqualThan
	return a > b || EQ(a, b);
}

5. Less than or equal to

  • ‘Less than or equal to’ means ‘less than’ or ‘equal to’, so it is broken down into the following form:
bool SET(double a, double b) {   // SmallerEqualThan
	return a < b || EQ(a, b);
}

6. Less than

  • ‘Less than’ is the negation of ‘greater than or equal to’: a must be below b, and the two must not be equal within eps. It is broken down into the following form:
  • $a < b$
bool ST(double a, double b) {   // SmallerThan
	return a < b && NEQ(a, b);
}

7. Greater than

  • ‘Greater than’ is the negation of ‘less than or equal to’: a must be above b, and the two must not be equal within eps. It is broken down into the following form:
  • $a > b$
bool GT(double a, double b) {   // GreaterThan
	return a > b && NEQ(a, b);
}
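
Taken together, the six functions behave like the ordinary comparison operators, except that values closer than eps count as equal. The sketch below collects them in one place and adds a three-way comparison; CMP is a hypothetical helper that is not part of the functions above, and eps is written as a constant instead of the macro.

```cpp
#include <cassert>
#include <cmath>

const double eps = 1e-6;  // same precision as in section 1

bool EQ(double a, double b)  { return std::fabs(a - b) < eps; }
bool NEQ(double a, double b) { return !EQ(a, b); }
bool GET(double a, double b) { return a > b || EQ(a, b); }
bool SET(double a, double b) { return a < b || EQ(a, b); }
bool ST(double a, double b)  { return a < b && NEQ(a, b); }
bool GT(double a, double b)  { return a > b && NEQ(a, b); }

// Hypothetical helper: three-way comparison built on EQ.
// Returns -1 if a is below b, 0 if they are equal within eps, 1 if a is above b.
int CMP(double a, double b) {
    assert(!std::isnan(a) && !std::isnan(b));  // NaN has no ordering
    if (EQ(a, b)) return 0;
    return a < b ? -1 : 1;
}
```

With IEEE 754 doubles, 0.1 + 0.2 is typically stored as a value slightly above 0.3, so a raw ‘>’ would report it as greater. GT(0.1 + 0.2, 0.3) is false and CMP(0.1 + 0.2, 0.3) is 0, because the difference is far below eps.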