Hello, I’m Liang Tang.
Today is EasyC++ topic 7, floating point types.
Click through to the GitHub repository; stars and PRs are welcome.
Floating-point numbers
Floating-point numbers are C++'s second group of fundamental types, after integers. They can represent numbers with a fractional part, and they also cover a much wider range of values than int.
We all know that inside a computer all data is ultimately stored in binary. Integers are simple: the value is converted to binary and stored as a string of 0s and 1s. What about floating-point numbers?
It is easy to guess that floating-point numbers are also stored in binary, but the scheme is a little more complicated than converting an integer directly to binary. A floating-point number is expressed in the following form:
$n = (-1)^s \times m \times 2^e$
Here n is the floating-point number we want to store, s is the sign bit, m is the mantissa, and e is the exponent.
The sign bit is easy to understand. Like the sign bit of an integer, 0 represents a positive number and 1 represents a negative number. m is the mantissa, with $1 \le m < 2$. As a concrete example, 3.0 in binary is $(11.0)_2$, which equals $(1.1)_2 \times 2^1$. So s = 0 (the number is positive), m = $(1.1)_2$, and e = 1.
Now that we know how floating-point numbers are represented, how do they get stored on a computer? This requires us to dissect the details further.
About m
First is m, which is defined as a number greater than or equal to 1 and less than 2. We can write it as 1.xx, where xx is the fractional part.
Since m is always at least 1 and less than 2, the digit before the point is always 1, so we can leave it out and store only the fractional part. The fractional part is also written in binary. For example, 0.625 can be expressed as 0.5 + 0.125, which is $2^{-1} + 2^{-3}$. In binary that is $(0.101)_2$: the stored bits are 101, where the highest bit has weight $2^{-1}$ rather than $2^0$.
In a 32-bit floating-point number, for example, after setting aside 1 bit for the sign and 8 bits for the exponent, 23 bits are left for m. Because the leading 1 before the point is implied rather than stored, those 23 bits effectively give 24 bits of binary precision.
About e
In floating-point storage, the exponent field is an unsigned integer. Take a 32-bit floating-point number as an example: e has 8 bits and can hold 0 to 255.
However, the true exponent can be negative, so IEEE 754 stores the true value of e plus a bias, which is the midpoint of the field's range. For an 8-bit e, the bias is 127. For example, if the true value of e is 10, it is stored as 127 + 10 = 137.
In addition, the stored value of e falls into three cases:
- If e is neither all 0s nor all 1s, the rule above applies: the true exponent is the stored value minus 127, and m has an implicit leading 1.
- If e is all 0s, the true exponent is 1 - 127 = -126, and m no longer has the implicit leading 1. This makes it possible to represent 0 and numbers of the form 0.xxx that are very close to 0.
- If e is all 1s, then m being all 0s means infinity, and m not being all 0s means NaN (not a number).
The rules for e look a little complicated and are hard to grasp at first glance. Why subtract a bias instead of giving the exponent its own sign bit? After thinking it over, my understanding is that with a sign bit it would be hard to tell the 0.xxx case apart from a true exponent of 0. That could be handled with a special case, but it would be less elegant than the current design.
If you did not follow this part, feel free to skip it. It covers how floating-point numbers are implemented, and C++ Primer does not say much about it.
The use of floating point numbers
There are two ways to write floating point numbers in C++. The first way is to use regular decimal notation:
double a = 1.23;
float b = 3.43;
The second way is scientific notation:
double a = 2.45e8;
double b = 1e-7;
2.45e8 means $2.45 \times 10^8$. The e can be followed by either a positive or a negative exponent, but the literal must not contain spaces.
Floating point type
Like C, C++ has three floating-point types: float, double, and long double. As with the integer types, all three represent the same kind of number and differ in how much they can represent.
What a floating-point type can represent is determined by two things. The first is the number of significant digits. For example, 14179 has 5 significant digits, while 14000 has only 2, because the three trailing zeros are just placeholders. The number of significant digits does not depend on the position of the decimal point. In C++, float usually has about 6 to 7 significant decimal digits and double about 15 to 16, while long double has at least as many as double.
The second is the exponent range: all three types can represent decimal exponents of at least -37 to 37. In general, float is 4 bytes (32 bits) and double is 8 bytes (64 bits), but this depends on your platform.
Things to watch out for
There are a few things to consider when using floating-point numbers.
- By default, cout drops trailing zeros when printing a floating-point number.
- Floating-point literals have type double by default. If you need a float, add the suffix F or f, e.g. 2.34f.
- Because of limited precision, comparing two floating-point numbers directly with == may not give the expected result. The correct way is to check whether they agree within a tolerance, such as:
double epsilon = 1e-8;
// Check whether a is equal to b (std::fabs is declared in <cmath>)
if (std::fabs(a - b) < epsilon) {
    // todo
}
In other words, deciding whether two floating-point numbers a and b are equal comes down to checking that the absolute value of their difference is less than a chosen tolerance.
- Precision problems at large magnitudes. For example, running the following code gives a surprising result:
float a = 2.3e22f;
float b = a + 1.0f;
cout << b - a << endl;
The output will be 0. 2.3e22 has 23 digits to the left of the decimal point, so adding 1 changes only the 23rd digit. But a float can represent only about the first six or seven significant digits, so the + 1 has no effect at all.
This is a big pitfall that is easy to fall into if you are not careful, so watch out for it.