Here are the answers to the first six questions (#130):
Question 1
```c
int c = 0.58 * 100;
printf("%d\n", c);   // 57

int x = 0.58f * 100.0f;
printf("%d\n", x);   // 58
```
When both numbers are of type float, the result is 58. Unraveling this mystery involves how floating-point numbers are stored, the multiplication rules for floating-point numbers, and rounding.
For the binary representation of floating-point numbers and some examples of how they are stored, see the earlier article; that material is not repeated here.
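Since those bit-level representations come up in every answer below, here is a minimal sketch for dumping a double's sign, exponent, and mantissa fields, so the values quoted below can be checked (it prints the fields in hex rather than binary, for brevity):

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Print the three IEEE-754 fields of a double: 1 sign bit,
   11 exponent bits (biased by 1023), 52 mantissa bits. */
static void print_double_bits(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);  /* safe bit-level reinterpretation */
    printf("%llu %03llx %013llx\n",
           (unsigned long long)(bits >> 63),
           (unsigned long long)((bits >> 52) & 0x7FF),
           (unsigned long long)(bits & 0xFFFFFFFFFFFFFULL));
}

int main(void) {
    print_double_bits(0.58);   /* exponent field 0x3fe = 1022, i.e. 2^-1 */
    print_double_bits(100.0);  /* exponent field 0x405 = 1029, i.e. 2^6  */
    return 0;
}
```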
0.58 * 100
The double representation of 0.58 is: 0 01111111110 0010100011110101110000101000111101011100001010001111
The double representation of 100 is: 0 10000000101 1001000000000000000000000000000000000000000000000000
Let’s look at the calculation of 0.58 * 100:
Convert to scientific notation
0.58 is 1.0010100011110101110000101000111101011100001010001111 * 2^-1
100 is 1.1001000000000000000000000000000000000000000000000000 * 2^6
Add the two exponents
-1 + 6 = 5
Multiply the two mantissas
1.0010100011110101110000101000111101011100001010001111 * 1.1001000000000000000000000000000000000000000000000000 = 1.11001111111111111111111111111111111111111111111111110000000000000000000000000000000000000000000000000000
Normalize
The result above is already normalized (there is exactly one 1 to the left of the binary point), so nothing changes.
Add the sign bits mod 2
0 + 0 = 0
Round
The key point: since a double's mantissa has only 52 bits, some of the product's bits must be discarded. By default, IEEE 754 rounds to nearest, ties to even. After rounding, the result is: 1.1100111111111111111111111111111111111111111111111111
Check for overflow
Check whether the exponent addition overflows. It does not.
So the final result is 0 10000000100 1100111111111111111111111111111111111111111111111111
In decimal: 57.999999999999992894572642399
Because C truncates toward zero when converting a float/double to an integer, c = (int)(0.58 * 100) = 57.
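If rounding rather than truncation is what you want, the standard fix is lround() from <math.h>; a minimal sketch of the contrast:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double p = 0.58 * 100;        /* 57.99999999999999289...              */
    printf("%d\n", (int)p);       /* 57: conversion truncates toward zero */
    printf("%ld\n", lround(p));   /* 58: rounds to the nearest integer    */
    return 0;
}
```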
Let’s take a look at the calculation process of 0.58f * 100.0f:
0.58f * 100.0f
The single-precision representation of 0.58 is: 0 01111110 00101000111101011100001
The single-precision representation of 100 is: 0 10000101 10010000000000000000000
The process, step by step:
Convert to scientific notation
0.58 is 1.00101000111101011100001 * 2^-1
100 is 1.10010000000000000000000 * 2^6
Add the two exponents
-1 + 6 = 5
Multiply the two mantissas
1.00101000111101011100001 * 1.10010000000000000000000 = 1.1100111111111111111111110010000000000000000000
Normalize
The result above is already normalized (exactly one 1 to the left of the binary point), so nothing changes.
Add the sign bits mod 2
0 + 0 = 0
Round
Since single precision has only 23 mantissa bits, some bits must be discarded. In the double calculation above, the discarded bits were all zeros, so rounding changed nothing; here, round-to-even rounds the result up by one place. This is why the two multiplications are not equal. After rounding: 1.11010000000000000000000
Check for overflow
Check whether the exponent addition overflows. It does not.
So the final result is 0 10000100 11010000000000000000000
The result is exactly 58.0. So 0.58f * 100.0f = 58
Finally, note that round-to-nearest-even is not quite the rounding we learned in school; see the binary representation of floating-point numbers and the floating-point rounding examples mentioned earlier.
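To see round-to-nearest-even in isolation, you can lean on rint(), which follows the current rounding mode (to-nearest-even by default); a quick sketch:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* Halfway cases go to the nearest EVEN integer, not always up: */
    printf("%.1f\n", rint(0.5));  /* 0.0           */
    printf("%.1f\n", rint(1.5));  /* 2.0           */
    printf("%.1f\n", rint(2.5));  /* 2.0, not 3.0  */
    return 0;
}
```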
Question 2
Printf ("%d",0.1 + 0.2 == 0.3); //0,not 1Copy the code
This problem involves the addition of floating-point numbers.
Let's walk through the floating-point addition:
The double representation of 0.1 is: 0 01111111011 1001100110011001100110011001100110011001100110011010
The double representation of 0.2 is: 0 01111111100 1001100110011001100110011001100110011001100110011010
Convert to scientific notation
First, as with multiplication, we convert 0.1 and 0.2 to scientific notation:
0.1 is 1.1001100110011001100110011001100110011001100110011010 * 2^-4
0.2 is 1.1001100110011001100110011001100110011001100110011010 * 2^-3
Exponent alignment
Imagine adding two numbers in decimal scientific notation: 1.234567 × 10^5 + 1.017654 × 10^2. We must rewrite the latter as 0.001017654 × 10^5 so that both have the same exponent before the mantissas can be added. The same applies to binary floating-point numbers.
The exponent of 0.1 is -4 and the exponent of 0.2 is -3. The rule is: the smaller exponent is adjusted up to match the larger one.
So 0.1's exponent becomes -3, and its mantissa becomes 0.11001100110011001100110011001100110011001100110011010
Mantissa addition
0.11001100110011001100110011001100110011001100110011010 +
1.1001100110011001100110011001100110011001100110011010 =
10.01100110011001100110011001100110011001100110011001110
Normalize
1.001100110011001100110011001100110011001100110011001110 * 2^-2
Round
The double mantissa has only 52 bits; rounding to even gives 1.0011001100110011001100110011001100110011001100110100 * 2^-2
So the final result is 0 01111111101 0011001100110011001100110011001100110011001100110100
In decimal: 0.300000000000000044408920985006
So 0.1 + 0.2 = 0.300000000000000044408920985006 != 0.3
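The practical takeaway is the usual one: don't compare floating-point values with ==. A minimal sketch of a tolerance-based comparison (the tolerance 1e-9 is an arbitrary illustrative choice, not a universal constant):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double sum = 0.1 + 0.2;
    printf("%d\n", sum == 0.3);              /* 0: exact comparison fails  */
    printf("%d\n", fabs(sum - 0.3) < 1e-9);  /* 1: tolerance comparison ok */
    return 0;
}
```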
Question 3
Printf ("%f\n",3.14f + 1e10f-1e10f); printf("%f\n",3.14f + 1e10f-1e10f); / / 0.0, not 3.14 printf (" % f \ n ", 3.14 f + e10f e10f - 1 (1)); / / 3.14Copy the code
This question is subtly different from the one above; I hope readers will look at it closely.
Let's do it step by step. The single-precision representation of 3.14f is: 0 10000000 10010001111010111000011
The single-precision representation of 1e10f is: 0 10100000 00101010000001011111001
Switch to scientific notation
3.14f is 1.10010001111010111000011 * 2^1
1e10f is 1.00101010000001011111001 * 2^33
Exponent alignment
The smaller exponent aligns to the larger: 3.14f becomes 0.0000000000000000000000000000000110010001111010111000011 * 2^33 (31 consecutive 0s to the right of the binary point)
Mantissa addition
0.0000000000000000000000000000000110010001111010111000011 + 1.00101010000001011111001 = 1.0010101000000101111100100000000110010001111010111000011
Normalize
The number is already normalized; no change is needed: 1.0010101000000101111100100000000110010001111010111000011 * 2^33
Round
Keeping 23 mantissa bits and rounding gives 1.00101010000001011111001 * 2^33
The result is 0 10100000 00101010000001011111001
This is still 1e10! Adding 3.14 made no difference at all: when the exponents were aligned, 3.14's significant bits were pushed past the end of the mantissa, and rounding discarded them all.
This problem doesn’t exist when floating-point numbers are double, because the mantissa of double has 52 bits, enough to ensure that significant digits of 3.14 are not discarded when rounded.
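A quick sketch of both claims side by side (note that double still rounds away the very lowest bits of 3.14, so the recovered value is close to, not exactly, 3.14):

```c
#include <stdio.h>

int main(void) {
    printf("%f\n", 3.14f + 1e10f - 1e10f);  /* 0.000000: 3.14f fully absorbed */
    printf("%f\n", 3.14 + 1e10 - 1e10);     /* 3.139999: double keeps almost  */
                                            /* all of 3.14's significant bits */
    return 0;
}
```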
Question 4
```c
float d1, d2, d3, d4;
d1 = 194268.02f;
d2 = 194268.0f;
d4 = 0.02f;
d3 = d1 - d2;
if (d3 > d4)
    printf(">0.02\n");
else if (d3 < d4)
    printf("<0.02\n");   // true
else
    printf("=0.02\n");   // false
printf("%f - %f = %f\n", d1, d2, d3);
// 194268.015625 - 194268.000000 = 0.015625, not 194268.02 - 194268 = 0.02
```
The steps are similar to addition, except that the mantissa operation is subtraction.
The single-precision representation of 194268.02f is: 0 10010000 01111011011011100000001
The single-precision representation of 194268.0f is: 0 10010000 01111011011011100000000
Switch to scientific notation
194268.02f = 1.01111011011011100000001 * 2^17
194268.0f = 1.01111011011011100000000 * 2^17
Align the exponents
The exponents are already the same: both are 17.
Subtract the mantissas
The result is easy to see: 0.00000000000000000000001
Normalize
This step is the interesting one: after normalization, 0.00000000000000000000001 * 2^17 becomes 1.00000000000000000000000 * 2^-6
Round
No rounding is needed.
The final result is 0 01111001 00000000000000000000000, which is 0.015625 < 0.02.
So why does printf show 194268.02f as 194268.015625? Because 194268.02 cannot be represented exactly in single precision; the closest representable float is 0 10010000 01111011011011100000001, which is exactly 194268.015625.
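To verify this yourself, here is a minimal sketch that dumps a float's fields the same way as the double version in Question 1; the two constants should differ only in the last mantissa bit:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Print a float's fields: 1 sign bit, 8 exponent bits (bias 127),
   23 mantissa bits, plus the value printf actually sees. */
static void print_float_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    printf("%u %02x %06x  %f\n",
           bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF, f);
}

int main(void) {
    print_float_bits(194268.02f);  /* mantissa ...01 -> 194268.015625 */
    print_float_bits(194268.0f);   /* mantissa ...00 -> 194268.000000 */
    return 0;
}
```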
Question 5
```c
// gcc compare.c -o compare.o
float p3x = 80838.0f;
float p2y = -2499.0f;
double v321 = p3x * p2y;
printf("%f", v321);  // -202014160, not -202014162
```
The same program compiled with GCC's -mfpmath option:
```c
// gcc -mfpmath=387 compare.c -o compare.o
float p3x = 80838.0f;
float p2y = -2499.0f;
double v321 = p3x * p2y;
printf("%f", v321);  // -202014162, not -202014160
```
Why do the two runs differ? Note that, ignoring precision limits, 80838.0 * -2499 is exactly -202014162, not -202014160.
Let's first look at what GCC's -mfpmath option means:
```
Generate floating point arithmetics for selected unit unit. The choices for unit are:

387
    Use the standard 387 floating point coprocessor present majority of chips
    and emulated otherwise. Code compiled with this option will run almost
    everywhere. The temporary results are computed in 80bit precision instead
    of precision specified by the type resulting in slightly different results
    compared to most of other chips. See -ffloat-store for more detailed
    description. This is the default choice for i386 compiler.

sse
    Use scalar floating point instructions present in the SSE instruction set.
    This instruction set is supported by Pentium3 and newer chips, in the AMD
    line by Athlon-4, Athlon-xp and Athlon-mp chips. The earlier version of
    SSE instruction set supports only single precision arithmetics, thus the
    double and extended precision arithmetics is still done using 387. Later
    version, present only in Pentium4 and the future AMD x86-64 chips supports
    double precision arithmetics too. For the i386 compiler, you need to use
    -march=cpu-type, -msse or -msse2 switches to enable SSE extensions and
    make this option effective. For the x86-64 compiler, these extensions are
    enabled by default. The resulting code should be considerably faster in
    the majority of cases and avoid the numerical instability problems of 387
    code, but may break some existing code that expects temporaries to be
    80bit. This is the default choice for the x86-64 compiler.
```
The 387 option keeps temporary results in 80-bit precision, rounding them to 32 or 64 bits only when they are stored to the target type (float/double). The FPU does this to reduce rounding error in intermediate results. (In this case, FPU mode gives the "correct" result.)
The SSE option, by contrast, rounds the result directly to single precision (32 bits), and only then converts it to the 64-bit target type.
The calculation itself follows the same steps shown earlier, so we won't repeat them; the normalized product of the mantissas is 1.10000001010011111011101001. Since the target type is double, the FPU does not round this away: it stores the 80-bit temporary straight into the double, giving -202014162. In SSE mode, however, the trailing 001 is rounded off to keep the mantissa at 23 bits, so the final result is -202014160.
So if we change the target type to float, the two modes no longer differ. Alternatively, simply declare both operands as double.
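A sketch of the first remedy, assuming an x87 build: storing the product into a float variable first forces it down to 23 bits (GCC only guarantees the store with -ffloat-store, as its documentation above hints), after which both modes agree:

```c
#include <stdio.h>

int main(void) {
    float p3x = 80838.0f;
    float p2y = -2499.0f;

    float  prod = p3x * p2y;  /* rounded to single precision here       */
    double v321 = prod;       /* float -> double widening is then exact */

    printf("%f\n", v321);     /* -202014160.000000 under 387 and SSE    */
    return 0;
}
```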
At the end of the analysis, let’s look at the assembly code generated by FPU and SSE modes:
- FPU
```asm
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl $0x479de300, %eax
movl %eax, -16(%rbp)
movl $0xc51c3000, %eax
movl %eax, -12(%rbp)
flds -16(%rbp)
fmuls -12(%rbp)
fstpl -8(%rbp)
movl $.LC2, %eax
movsd -8(%rbp), %xmm0
movq %rax, %rdi
movl $1, %eax
call printf
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
```
- SSE
```asm
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl $0x479de300, %eax
movl %eax, -16(%rbp)
movl $0xc51c3000, %eax
movl %eax, -12(%rbp)
movss -16(%rbp), %xmm0
mulss -12(%rbp), %xmm0
unpcklps %xmm0, %xmm0
cvtps2pd %xmm0, %xmm0
movsd %xmm0, -8(%rbp)
movl $.LC2, %eax
movsd -8(%rbp), %xmm0
movq %rax, %rdi
movl $1, %eax
call printf
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
```
The difference is obvious: one uses the x87 flds/fmuls/fstpl sequence, the other the SSE movss/mulss/movsd sequence.
Question 6
Compare the running times of the following two programs:
test.c
```c
float x = 1.1;
float z = 1.123;
float y = x;
for (int j = 0; j < 90000000; j++) {
    y *= x;
    y /= z;
    y += 0.1f;
    y -= 0.1f;
}
```
test1.c
```c
float x = 1.1;
float z = 1.123;
float y = x;
for (int j = 0; j < 90000000; j++) {
    y *= x;
    y /= z;
    y += 0;   // different
    y -= 0;   // different
}
```
The second program takes about 10 times as long to run as the first. Why the gap?
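To reproduce the gap, here is a minimal timing-harness sketch (compile without optimization, e.g. gcc -O0, so the loop is not simplified away; clock() resolution is coarse but adequate at this scale):

```c
#include <stdio.h>
#include <time.h>

int main(void) {
    float x = 1.1f, z = 1.123f, y = x;

    clock_t start = clock();
    for (int j = 0; j < 90000000; j++) {
        y *= x;
        y /= z;
        y += 0;  /* as in test1.c; change to 0.1f to time test.c */
        y -= 0;
    }
    clock_t end = clock();

    printf("y = %.64f\n", y);
    printf("elapsed: %.2f s\n", (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}
```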
We print y at j = 50000000 and again at the end of the loop. Results of the first program:
```
0.0000001751516833792265970259904861450195312500000000000000000000
0.0000001788139343261718750000000000000000000000000000000000000000
```
Results of the second program:
```
0.0000000000000000000000000000000000000000000630584308946167681916
0.0000000000000000000000000000000000000000000630584308946167681916
```
Why does this happen? (Hint: the binary representation of 0.1f is 1.10011001100110011001101 * 2^-4; think back to the earlier analysis of 3.14f + 1e10f - 1e10f = 0.0. Take a moment to puzzle it out before reading on.)

Since the exponent of 0.1 is -4, and floating-point addition and subtraction align the smaller exponent to the larger one, once the program has run for a while the addition starts to look like this:
```
  0.00000000...(many 0s)...1yyy * 2^-4
+ 1.xxxxxxxxx * 2^-4
```
For precision and rounding reasons, the final sum is roughly 1.xxxx…1yy * 2^-4: the small addend loses some of its significant bits, but since the exponent stays at -4, the result is still a normalized number. Subtracting 0.1f again likewise leaves a normalized number.
The second program, however, has no normalized 0.1 to pin the exponent, so the result keeps shrinking and finally degrades into a denormal number: 0 00000000 xxxxxxxxxxxx.
Computers can be roughly 10-100 times slower at arithmetic on denormal numbers than on normalized ones. Why? The author's knowledge of hardware and circuits is too unsystematic to explain it fully. (Hardware and circuits have always been a sore point for me, often stopping me from digging deeper into a problem.)
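You can at least confirm the diagnosis in code: fpclassify() from <math.h> reports whether a value has degraded to a subnormal (denormal). A minimal sketch, using the final y value quoted above:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    float y = 6.30584308946167681916e-44f;  /* final y of test1.c, well */
                                            /* below FLT_MIN (~1.18e-38) */
    if (fpclassify(y) == FP_SUBNORMAL)
        printf("y has degraded to a denormal\n");  /* this branch is taken */
    return 0;
}
```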
Check out the following two articles on denormal numbers and how to avoid them:
- Denormal number
- x87 and SSE Floating Point Assists in IA-32: Flush-To-Zero (FTZ) and Denormals-Are-Zero (DAZ)
On the author’s computer, using
```c
#include <xmmintrin.h>

_mm_setcsr(_mm_getcsr() | 0x8040);
```
You can make the second program perform as well as the first. But a great deal of precision is sacrificed: with those two lines added, the program's result is 0.0, whereas the result with denormal arithmetic allowed is 0.0000000000000000000000000000000000000000000630584308946167681916.
A quick note on why those two lines avoid denormals: the SSE and SSE2 instruction sets provide two modes, FTZ (Flush-To-Zero) and DAZ (Denormals-Are-Zero), that speed up operations involving denormal numbers. FTZ means that when an operation would produce a denormal result, the result is set to 0 instead. DAZ means that when an operand is a denormal, it is treated as 0 before the operation. In other words, FTZ affects outputs and DAZ affects inputs.
For FTZ mode to take effect, two conditions must be met:
- Bit 15 of the MXCSR register (the FTZ bit) is 1.
- Bit 11 (the underflow exception mask) is 1.
For DAZ mode to take effect, two conditions must be met:
- Bit 6 of the MXCSR register (the DAZ bit) is 1.
- The processor must support DAZ mode.
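For completeness, the SSE headers also expose named macros for these same two MXCSR bits, so the raw 0x8040 mask can be avoided; a sketch (note that _MM_SET_DENORMALS_ZERO_MODE lives in pmmintrin.h, so SSE3-era headers are assumed):

```c
#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE: FTZ, MXCSR bit 15   */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE: DAZ, MXCSR bit 6 */

static void enable_ftz_daz(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}
```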
_mm_getcsr() gets the value of the MXCSR register. Let’s print the result:
printf("%d",_mm_getcsr()); //8064 binary 1111110000000Copy the code
Bit 11 (the underflow mask) is already 1, but the FTZ bit and the DAZ bit are both 0, so ORing in 0x8040 (binary 1000000001000000) switches on both FTZ mode and DAZ mode.
(End)
References:
- Multiplying Floating Point Numbers
- Floating-point arithmetic
- An answer to a common question about PHP floating point numbers
- A floating-point number cross-platform problem
- x87 and SSE Floating Point Assists in IA-32: Flush-To-Zero (FTZ) and Denormals-Are-Zero (DAZ)
- Denormal number