IEEE 754 floating point number

Floating-point data represents real numbers, that is, numbers with a fractional part. Why are real numbers called floating-point numbers in C? In C, a real number is stored in exponential form, and the same real number can be written in exponential form in more than one way. For example, 3.14159 can be written as $3.14159 \times 10^{0}$, $0.314159 \times 10^{1}$, $0.0314159 \times 10^{2}$, $31.4159 \times 10^{-1}$, $314.159 \times 10^{-2}$, and so on. All of these denote the same value. The decimal point can float to any position among, before, or after the digits 314159; as long as the exponent is adjusted whenever the point moves, the value does not change. Because the position of the decimal point can float, real numbers stored in this exponential form are called floating-point numbers.
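As a quick illustration in JavaScript (the language the conclusion of this article refers to), the following sketch checks that several spellings of the same number, each with the decimal point in a different place, end up as the same stored value. The names `forms` and `allEqual` are made up for this example.

```javascript
// Different placements of the decimal point, each compensated by the exponent.
const forms = [3.14159e0, 0.314159e1, 0.0314159e2, 31.4159e-1, 314.159e-2];

// Every spelling denotes the same real number, so each literal is parsed
// to the same double-precision value.
const allEqual = forms.every((x) => x === 3.14159);
console.log(allEqual); // true
```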

Single and double precision storage structure:

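Floating-point numbers are stored in either a single-precision or a double-precision layout. Single precision occupies 4 bytes (32 bits): 1 sign bit, 8 exponent bits, and 23 mantissa bits. Double precision occupies 8 bytes (64 bits): 1 sign bit, 11 exponent bits, and 52 mantissa bits.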
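The two storage sizes can be observed from JavaScript through typed arrays. This is only a small sketch; the variable names are illustrative.

```javascript
// Typed arrays expose the storage size of each IEEE 754 format.
const singleBytes = Float32Array.BYTES_PER_ELEMENT; // 4 bytes = 32 bits
const doubleBytes = Float64Array.BYTES_PER_ELEMENT; // 8 bytes = 64 bits

// Math.fround rounds a double to the nearest single-precision value,
// so most doubles change slightly when squeezed into 32 bits.
const asSingle = Math.fround(123.456);

console.log(singleBytes, doubleBytes); // 4 8
console.log(asSingle === 123.456);     // false
```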

Taking single precision as an example, the conversion from decimal to the stored binary form works as follows.

A decimal fraction must be converted to a binary fraction before it is stored.

  • First convert the decimal number to binary
  • Normalize it to the form $1.M \times 2^{E-127}$ (M is the binary fraction after the point, E is the stored exponent field, 127 is the offset, and E - offset is the actual binary exponent)
  • Store the "sign", "exponent field", and "mantissa" separately
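The three stored fields can be inspected directly. The sketch below assumes a `DataView` over a 4-byte buffer; `float32Fields` is a hypothetical helper written for this illustration, not a built-in. It uses 0.15625 because that value is exactly representable in binary.

```javascript
// Split the 32-bit pattern of a single-precision value into its three fields.
function float32Fields(x) {
  const view = new DataView(new ArrayBuffer(4));
  view.setFloat32(0, x); // rounds x to single precision, big-endian by default
  const bits = view.getUint32(0).toString(2).padStart(32, "0");
  return {
    sign: bits.slice(0, 1),      // 1 bit
    exponent: bits.slice(1, 9),  // 8 bits: stored field E = actual exponent + 127
    mantissa: bits.slice(9),     // 23 bits of M; the leading "1." is not stored
  };
}

// 0.15625 = 0.00101 in binary = 1.01 x 2^-3, so E = -3 + 127 = 124.
const fields = float32Fields(0.15625);
console.log(fields.sign);                        // 0
console.log(fields.exponent);                    // 01111100
console.log(fields.mantissa);                    // 01 followed by 21 zeros
console.log(parseInt(fields.exponent, 2) - 127); // -3
```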

Take 123.456 as an example:

/**
 * Decimal to binary
 *   Integer part: keep dividing by 2 and recording the remainder until the quotient is 0
 *   Fractional part: keep multiplying by 2 and recording the integer part until the fractional part is 0
 *
 * Calculation
 *   Integer part
 *     123 / 2 = 61 (1);
 *     61 / 2 = 30 (1);
 *     30 / 2 = 15 (0);
 *     15 / 2 = 7 (1);
 *     7 / 2 = 3 (1);
 *     3 / 2 = 1 (1);
 *     1 / 2 = 0 (1);
 *   Reading the remainders in reverse order gives the integer part in binary: 1111011
 *
 *   Fractional part
 *     0.456 * 2 = 0.912 (0);
 *     0.912 * 2 = 1.824 (1);
 *     0.824 * 2 = 1.648 (1);
 *     0.648 * 2 = 1.296 (1);
 *
 *     0.296 * 2 = 0.592 (0);
 *     0.592 * 2 = 1.184 (1);
 *     0.184 * 2 = 0.368 (0);
 *     0.368 * 2 = 0.736 (0);
 *     ...
 *   Fractional part in binary (first 48 bits):
 *     0111 0100 1011 1100 0110 1010 0111 1110 1111 1001 1101 1011
 *
 * So the binary number corresponding to 123.456 is:
 *   1111011.011101001011110001101010011111101111100111011011
 * In "mantissa * 2^exponent" form:
 *   1.111011011101001011110001101010011111101111100111011011 * 2^6
 * Only 23 bits fit after the point, so precision is lost. The first 23 bits are
 * 11101101110100101111000 and the dropped bits start with 11, so rounding to
 * nearest rounds the last kept bit up:
 *   1.11101101110100101111001 * 2^6
 *
 * Sign: the number is positive, so the sign bit is 0
 * Exponent: the actual exponent is 6 and the 32-bit offset is 127 (1023 for 64-bit),
 *   so the stored value is 127 + 6 = 133, which is 10000101 in binary
 * Mantissa: the bits after the point are stored directly: 11101101110100101111001
 * Single-precision result: 0 10000101 11101101110100101111001
 */
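The hand calculation for 123.456 can be reproduced with a short sketch. The helpers `intToBinary`, `fracToBinary`, and `float32Bits` are hypothetical names introduced here; the first two follow the divide-by-2 and multiply-by-2 procedures, and the third reads back the bit pattern the engine actually stores. Note that the stored mantissa ends in 1: the value is rounded to the nearest representable number rather than simply cut off after 23 bits.

```javascript
// Integer part: divide by 2 repeatedly; the remainders, read in reverse, are the bits.
function intToBinary(n) {
  if (n === 0) return "0";
  let bits = "";
  while (n > 0) {
    bits = (n % 2) + bits; // prepending reverses the remainder order
    n = Math.floor(n / 2);
  }
  return bits;
}

// Fractional part: multiply by 2 repeatedly; the integer parts are the bits.
// Many fractions never reach 0, so stop after maxBits digits.
function fracToBinary(f, maxBits) {
  let bits = "";
  while (f > 0 && bits.length < maxBits) {
    f *= 2;
    if (f >= 1) {
      bits += "1";
      f -= 1;
    } else {
      bits += "0";
    }
  }
  return bits;
}

// Bit pattern actually stored for a single-precision value: sign, exponent, mantissa.
function float32Bits(x) {
  const view = new DataView(new ArrayBuffer(4));
  view.setFloat32(0, x);
  const b = view.getUint32(0).toString(2).padStart(32, "0");
  return `${b.slice(0, 1)} ${b.slice(1, 9)} ${b.slice(9)}`;
}

console.log(intToBinary(123));        // 1111011
console.log(fracToBinary(0.456, 24)); // 011101001011110001101010
console.log(float32Bits(123.456));    // 0 10000101 11101101110100101111001
```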

This is the process of converting from decimal to single-precision binary. Floating point numbers are discussed in more detail below.

Classification of floating point numbers

Taking double precision as an example, the exponent field is 11 bits wide, so the stored exponent E ranges over [0, 2047].

Normalized numbers

If the exponent field is neither 0 (every bit is 0) nor 2047 (every bit is 1), the value is a normalized floating-point number, whose binary form is $1.M \times 2^{E-1023}$.

Denormalized numbers

When the exponent field is 0 (every bit is 0), the value is a denormalized (subnormal) floating-point number, used to represent 0 and values very close to 0. Unlike a normalized number, its mantissa has no implicit leading 1. To make the transition from denormalized to normalized numbers smooth, its exponent is fixed at 1 - 1023 = -1022, so the form of a denormalized number is $0.M \times 2^{-1022}$.

When M is all 0, the value is +0 or -0, depending on the sign bit.

Infinity

When the exponent field is 2047 (every bit is 1) and all mantissa bits are 0, the value is infinity: positive or negative infinity, depending on the sign bit.

NaN

When the exponent field is 2047 (every bit is 1) and the mantissa bits are not all 0, the value is NaN.
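The four categories can be told apart by reading the exponent field and mantissa of a double. The following is a minimal sketch, assuming an engine with BigInt support; `classifyDouble` is a hypothetical helper, not part of any library.

```javascript
// Classify a double by its 11-bit exponent field and 52-bit mantissa.
function classifyDouble(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);
  const bits = view.getBigUint64(0);
  const exponent = Number((bits >> 52n) & 0x7ffn); // stored exponent E, 0..2047
  const mantissa = bits & 0xfffffffffffffn;        // low 52 bits

  if (exponent === 2047) return mantissa === 0n ? "infinity" : "NaN";
  if (exponent === 0) return mantissa === 0n ? "zero" : "denormalized";
  return "normalized";
}

console.log(classifyDouble(1));         // normalized
console.log(classifyDouble(5e-324));    // denormalized
console.log(classifyDouble(-0));        // zero
console.log(classifyDouble(-Infinity)); // infinity
console.log(classifyDouble(NaN));       // NaN
```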

Value range, taking double precision as an example

Real number range

Normalized numbers

  • Exponent range: the exponent field ranges over [1, 2046], so the actual exponent ranges over [-1022, 1023].

  • Fraction range: in double precision the mantissa has 52 bits, so the fractional value of the mantissa lies in $[0, \sum_{i=1}^{52} 2^{-i}]$, which by the sum of a geometric sequence is $[0, 1-2^{-52}]$. Because the stored mantissa omits the implicit leading 1, whose decimal value is 1, the actual range of the significand is $[1, 2-2^{-52}]$. Note: the sum of a geometric sequence is $\frac{a_1 - a_n q}{1-q}$.
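For reference, the geometric sum can be worked out step by step. Here $a_1$, $a_n$, and $q$ are the first term, last term, and common ratio from the summation formula quoted above.

```latex
\begin{aligned}
\sum_{i=1}^{52} 2^{-i}
  &= \frac{a_1 - a_n q}{1 - q},
  \qquad a_1 = 2^{-1},\; a_n = 2^{-52},\; q = 2^{-1} \\
  &= \frac{2^{-1} - 2^{-52} \cdot 2^{-1}}{1 - 2^{-1}}
   = 1 - 2^{-52} \\
1 + \left(1 - 2^{-52}\right) &= 2 - 2^{-52}
\end{aligned}
```

The last line adds the implicit leading 1 to obtain the upper bound of the significand.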

Therefore, the largest double-precision value is the largest significand multiplied by the largest power of two: binary fraction × $2^{1023}$ = decimal fraction × $2^{1023}$ = $(2-2^{-52}) \times 2^{1023} \approx 1.797693135 \times 10^{308}$. The most negative double-precision value is $-1.797693135 \times 10^{308}$, corresponding to -Number.MAX_VALUE in JS. The maximum positive value is $1.797693135 \times 10^{308}$, corresponding to Number.MAX_VALUE in JS.

In general, value = binary fraction × $2^{\text{binary exponent}}$.
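The maximum can be checked in JS in two ways: by writing the bit pattern directly, and by evaluating the formula. This is a sketch that assumes BigInt support; `maxView`, `maxFromBits`, and `maxFromFormula` are illustrative names.

```javascript
// Largest normalized double: exponent field 2046 (actual exponent 1023),
// mantissa all ones, i.e. the bit pattern 0 11111111110 111...1.
const maxView = new DataView(new ArrayBuffer(8));
maxView.setBigUint64(0, 0x7fefffffffffffffn);
const maxFromBits = maxView.getFloat64(0);

// The same value from the formula (2 - 2^-52) x 2^1023.
const maxFromFormula = (2 - 2 ** -52) * 2 ** 1023;

console.log(maxFromBits === Number.MAX_VALUE);    // true
console.log(maxFromFormula === Number.MAX_VALUE); // true
console.log(Number.MAX_VALUE);                    // 1.7976931348623157e+308
console.log(Number.MAX_VALUE * 2);                // Infinity
```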

Denormalized numbers

Among the denormalized numbers, the smallest positive value has the bit pattern 0 00000000000 0…1. Its decimal value is $2^{-1022} \times 2^{-52} = 2^{-1074} \approx 5 \times 10^{-324}$, which equals Number.MIN_VALUE in JS.
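The same check works for the smallest positive value. Again a sketch assuming BigInt support, with `minView` and `minFromBits` as illustrative names.

```javascript
// Smallest positive denormalized double: exponent field 0 and only the last
// mantissa bit set, i.e. the bit pattern 0 00000000000 0...01.
const minView = new DataView(new ArrayBuffer(8));
minView.setBigUint64(0, 1n);
const minFromBits = minView.getFloat64(0);

console.log(minFromBits === Number.MIN_VALUE);           // true
console.log(2 ** -1022 * 2 ** -52 === Number.MIN_VALUE); // true
console.log(Number.MIN_VALUE);                           // 5e-324
console.log(Number.MIN_VALUE / 2);                       // 0
```

Halving the smallest positive value gives 0 because no smaller positive double exists.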

Summary of real number range

Combining the ranges of denormalized and normalized numbers, the real-number range is $[-1.797693135 \times 10^{308}, -5 \times 10^{-324}] \cup \{0\} \cup [5 \times 10^{-324}, 1.797693135 \times 10^{308}]$.

Real numbers within this range may still lose precision when stored. The following discusses the range of integers that can be stored without loss.

Lossless integer range

When integers are stored in double precision, not every integer can be stored exactly; only integers within a certain range can. To store an integer, it is first converted to binary, then the binary point is shifted to just after the leading 1, and finally the bits after the point are stored in the mantissa. For example, decimal 10 is 1010 in binary, which becomes $1.010 \times 2^{3}$ after the shift, and 010 is stored in the mantissa. Since the mantissa of a double is only 52 bits long, precision is lost when the shifted fractional part is longer than 52 bits. With a full 52-bit mantissa, the largest lossless integer has every bit set to 1: 1.111… (52 ones after the point) × $2^{52}$, whose decimal value is $2^{53}-1$, namely 9007199254740991, corresponding to Number.MAX_SAFE_INTEGER in JS.

In summary, the lossless integer range is [-9007199254740991, 9007199254740991] = [Number.MIN_SAFE_INTEGER, Number.MAX_SAFE_INTEGER].
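A short sketch of what happens at the edge of this range; `maxSafe` is an illustrative name.

```javascript
const maxSafe = 2 ** 53 - 1;

console.log(maxSafe === Number.MAX_SAFE_INTEGER);  // true
console.log(Number.MIN_SAFE_INTEGER === -maxSafe); // true

// Beyond the limit, neighbouring integers collapse onto the same double:
// 2^53 + 1 needs 53 bits after the leading 1, one more than the mantissa holds.
console.log(2 ** 53 === 2 ** 53 + 1);              // true
console.log(Number.isSafeInteger(maxSafe));        // true
console.log(Number.isSafeInteger(maxSafe + 1));    // false
```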

Conclusion

Numeric values in JS are stored in double precision according to the IEEE 754 standard, so the numeric range of JS is the same as that of double-precision floating-point numbers, that is:

  • Real-number range: $[-1.797693135 \times 10^{308}, -5 \times 10^{-324}] \cup \{0\} \cup [5 \times 10^{-324}, 1.797693135 \times 10^{308}]$
  • Lossless Integer range: [-9007199254740991, 9007199254740991]

These bounds are available as constants on the JS Number object, that is:

  • Real-number range: [-Number.MAX_VALUE, -Number.MIN_VALUE] ∪ {0} ∪ [Number.MIN_VALUE, Number.MAX_VALUE]
  • Lossless integer range: [Number.MIN_SAFE_INTEGER, Number.MAX_SAFE_INTEGER]