Jake Wharton
Which is Better on Android: Divide by 2 or Shift by 1?
The original address: https://jakewharton.com/which-is-better-on-android-divide-by-two-or-shift-by-one/
Translator: Bingxin said
I have been trying to port the AndroidX Collection Library to Kotlin Multiplatform to test binary compatibility, performance, ease of use and different memory models. Some data structures in the library use a binary tree based on an array-based implementation to store elements. There are many places in Java code where a shift operation is used instead of multiplying and dividing quadratic powers. When ported to Kotlin, this code is converted to a slightly twisted infix operator, which confuses the code’s intent.
I’ve done some research on whether shift or multiplication/division performs better, and most people have heard “shift performs better,” but are skeptical about whether it’s true. Some people think the compiler might make some optimizations before the code runs to the CPU.
To satisfy my curiosity and avoid using Kotlin’s infix shift operator, I’ll answer who’s better and some related questions. Let’s go !
I SHR 1, I SHL 1
Who optimized the code?
Before our code is executed by the CPU, there are several important compilers: Javac/Kotlinc, D8, R8, and ART.
Each of these steps had an opportunity to optimize, but did they?
javac
class Example {
static int multiply(int value) {
return value * 2;
}
static int divide(int value) {
return value / 2;
}
static int shiftLeft(int value) {
return value << 1;
}
static int shiftRight(int value) {
return value >> 1; }}Copy the code
Compile the above code under JDK 14 and present the bytecode through JavAP.
$ javac Example.java
$ javap -c Example
Compiled from "Example.java"
class Example {
static int multiply(int);
Code:
0: iload_0
1: iconst_2
2: imul
3: ireturn
static int divide(int);
Code:
0: iload_0
1: iconst_2
2: idiv
3: ireturn
static int shiftLeft(int);
Code:
0: iload_0
1: iconst_1
2: ishl
3: ireturn
static int shiftRight(int);
Code:
0: iload_0
1: iconst_1
2: ishr
3: ireturn
}
Copy the code
Each method begins with the iloAD_0 directive, indicating that the first parameter is loaded. Both multiplication and division use the iconst_2 instruction to load the literal 2. The imul and IDIV instructions are then executed to multiply and divide int. The shift operation also loads the literal 1 and then uses the ishl and ishr instructions to perform the shift operation.
No optimizations have been made, but if you know anything about Java, it’s no surprise. Javac is not an optimized compiler, leaving most of the work to the runtime compiler or AOT on the JVM.
kotlinc
fun multiply(value: Int) = value * 2
fun divide(value: Int) = value / 2
fun shiftLeft(value: Int) = value shl 1
fun shiftRight(value: Int) = value shr 1
Copy the code
Kotlin is compiled into Java bytecode through Kotlinc under Kotlin version 1.4-M1 and viewed using Javap.
$ kotlinc Example.kt
$ javap -c ExampleKt
Compiled from "Example.kt"
public final class ExampleKt {
public static final int multiply(int);
Code:
0: iload_0
1: iconst_2
2: imul
3: ireturn
public static final int divide(int);
Code:
0: iload_0
1: iconst_2
2: idiv
3: ireturn
public static final int shiftLeft(int);
Code:
0: iload_0
1: iconst_1
2: ishl
3: ireturn
public static final int shiftRight(int);
Code:
0: iload_0
1: iconst_1
2: ishr
3: ireturn
}
Copy the code
The output is exactly the same as Java.
This is using the original JVM backend of Kotlin, but using the forthcoming IR-based backend (via -Xuse-ir) also produces the same output.
I framed the one above because I didn’t understand it
D8
The bytecode converted from the above example Kotlin code is used to generate a DEX file using the latest D8 compiler.
$ java -jar $R8_HOME/build/libs/d8.jar \ --release \ --output . \ ExampleKt.class $ dexdump -d classes.dex Opened 'classes.dex', DEX version '035' Class #0 - Class descriptor : 'LExampleKt; ' Access flags : 0x0011 (PUBLIC FINAL) Superclass : 'Ljava/lang/Object; ' Direct methods - #0 : (in LExampleKt;) name : 'divide' type : '(I)I' access : 0x0019 (PUBLIC STATIC FINAL) code - 000118: |[000118] ExampleKt.divide:(I)I 000128: db00 0102 |0000: div-int/lit8 v0, v1, #int 2 // #02 00012c: 0f00 |0002: return v0
#1 : (in LExampleKt;) name : 'multiply' type : '(I)I' access : 0x0019 (PUBLIC STATIC FINAL) code - Copy the code
000130: |[000130] ExampleKt.multiply:(I)I
000140: da00 0102 |0000: mul-int/lit8 v0, v1, #int 2 // #02
000144: 0f00 |0002: return v0#2 : (in LExampleKt;) name : 'shiftLeft' type : '(I)I' access : 0x0019 (PUBLIC STATIC FINAL) code - Copy the code
000148: |[000148] ExampleKt.shiftLeft:(I)I
000158: e000 0101 |0000: shl-int/lit8 v0, v1, #int 1 // #01
00015c: 0f00 |0002: return v0#3 : (in LExampleKt;) name : 'shiftRight' type : '(I)I' access : 0x0019 (PUBLIC STATIC FINAL) code - Copy the code
Copy the code
000160: |[000160] ExampleKt.shiftRight:(I)I
000170: e100 0101 |0000: shr-int/lit8 v0, v1, #int 1 // #01
000174: 0f00 |0002: return v0
(Slightly optimized output)
Dalvik bytecode is register-based, while Java bytecode is stack based. In the end, each method actually uses just one bytecode to manipulate the associated integer operation. They both use the v1 register to hold the first method parameter, plus a literal 1 or 2.
So nothing changes, and D8 is not an optimization compiler (although it can do method-local optimization).
R8
To run R8, we need to configure obfuscation rules to prevent our code from being removed.
-keep,allowoptimization class ExampleKt { <methods>; } Copy the code
The above rules are passed with the –pg-conf argument.
$ java -jar $R8_HOME/build/libs/r8.jar \ --lib $ANDROID_HOME/platforms/android-29/android.jar \ --release \ --pg-conf rules.txt \ --output . \ ExampleKt.class $ dexdump -d classes.dex Opened 'classes.dex', DEX version '035' Class #0 - Class descriptor : 'LExampleKt; ' Access flags : 0x0011 (PUBLIC FINAL) Superclass : 'Ljava/lang/Object; ' Direct methods - #0 : (in LExampleKt;) name : 'divide' type : '(I)I' access : 0x0019 (PUBLIC STATIC FINAL) code - 000118: |000118] ExampleKt.divide:(I)I 000128: db00 0102 |0000: div-int/lit8 v0, v1, #int 2 / / # 02 00012c: 0f00 |0002: return v0 #1 : (in LExampleKt;) name : 'multiply' type : '(I)I' access : 0x0019 (PUBLIC STATIC FINAL) code - 000130: |000130] ExampleKt.multiply:(I)I 000140: da00 0102 |0000: mul-int/lit8 v0, v1, #int 2 / / # 02 000144: 0f00 |0002: return v0 #2 : (in LExampleKt;) name : 'shiftLeft' type : '(I)I' access : 0x0019 (PUBLIC STATIC FINAL) code - 000148: |000148] ExampleKt.shiftLeft:(I)I 000158: e000 0101 |0000: shl-int/lit8 v0, v1, #int 1 / / # 01 00015c: 0f00 |0002: return v0 #3 : (in LExampleKt;) name : 'shiftRight' type : '(I)I' access : 0x0019 (PUBLIC STATIC FINAL) code - 000160: |000160] ExampleKt.shiftRight:(I)I 000170: e100 0101 |0000: shr-int/lit8 v0, v1, #int 1 / / # 01 000174: 0f00 |0002: return v0 Copy the code
It’s exactly the same as the output of D8.
ART
The Dalvik bytecode output from R8 above is used as input to the ART, which is run on an Android 10 x86 VIRTUAL machine.
$ adb push classes.dex /sdcard/classes.dex $ adb shell generic_x86:/ $ su generic_x86:/ # dex2oat --dex-file=/sdcard/classes.dex --oat-file=/sdcard/classes.oat generic_x86:/ # oatdump --oat-file=/sdcard/classes.oat OatDexFile: 0: LExampleKt; (offset=0x000003c0) (type_idx=1) (Initialized) (OatClassAllCompiled) 0: int ExampleKt.divide(int) (dex_method_idx=0) CODE: (code_offset=0x00001010 size_offset=0x0000100c size=15)... 0x00001010: 89C8 mov eax, ecx 0x00001012: 8D5001 lea edx, [eax + 1] 0x00001015: 85C0 test eax, eax 0x00001017: 0F4DD0 cmovnl/ge edx, eax 0x0000101a: D1FA sar edx 0x0000101c: 89D0 mov eax, edx 0x0000101e: C3 ret 1: int ExampleKt.multiply(int) (dex_method_idx=1) CODE: (code_offset=0x00001030 size_offset=0x0000102c size=5)... 0x00001030: D1E1 shl ecx 0x00001032: 89C8 mov eax, ecx 0x00001034: C3 ret 2: int ExampleKt.shiftLeft(int) (dex_method_idx=2) CODE: (code_offset=0x00001030 size_offset=0x0000102c size=5)... 0x00001030: D1E1 shl ecx 0x00001032: 89C8 mov eax, ecx 0x00001034: C3 ret 3: int ExampleKt.shiftRight(int) (dex_method_idx=3) CODE: (code_offset=0x00001040 size_offset=0x0000103c size=5)... 0x00001040: D1F9 sar ecx 0x00001042: 89C8 mov eax, ecx 0x00001044: C3 ret Copy the code
(Slightly optimized output)
The x86 assembly code shows that ART intervenes in the math and replaces some of it with a shift operation.
First, multiply and shiftLeft now have the same implementation, and both use SHL to move one bit to the left. Otherwise, if you look at the file offset (the leftmost column), you’ll see exactly the same. ART recognized that the two methods had the same method body and de-duplicated them when they were translated into x86 assembly code.
Divide and shiftRight, however, are implemented differently in that they do not work together with SAR to move one bit to the right. The Divide method uses four additional instructions before calling SAR to handle negative input.
Execute the same instructions on an Android 10 Pixel4 device to see how ART compiles code into ARM assembly code.
OatDexFile: 0: LExampleKt; (offset=0x000005a4) (type_idx=1) (Verified) (OatClassAllCompiled) 0: int ExampleKt.divide(int) (dex_mmultiply and shiftLeft ethod_idx=0) CODE: (code_offset=0x00001009 size_offset=0x00001004 size=10)... 0x00001008: 0fc8 lsrs r0, r1, #31 0x0000100a: 1841 adds r1, r0, r1 0x0000100c: 1049 asrs r1, #1 0x0000100e: 4608 mov r0, r1 0x00001010: 4770 bx lr 1: int ExampleKt.multiply(int) (dex_method_idx=1) CODE: (code_offset=0x00001021 size_offset=0x0000101c size=4)... 0x00001020: 0048 lsls r0, r1, #1 0x00001022: 4770 bx lr 2: int ExampleKt.shiftLeft(int) (dex_method_idx=2) CODE: (code_offset=0x00001021 size_offset=0x0000101c size=4)... 0x00001020: 0048 lsls r0, r1, #1 0x00001022: 4770 bx lr 3: int ExampleKt.shiftRight(int) (dex_method_idx=3) CODE: (code_offset=0x00001031 size_offset=0x0000102c size=4)... 0x00001030: 1048 asrs r0, r1, #1 0x00001032: 4770 bx lr Copy the code
Similarly, multiply and shiftLeft use LSLS to move the left one bit and remove the duplicate method body. ShiftRight does the right shift with the ASRS instruction, while division uses another right shift instruction, LSRS, to handle negative input.
At this point, we can safely say that using value << 1 instead of value * 2 does no good. Stop doing this in arithmetic operations and keep it only when bitwise operations are strictly required.
However, value / 2 and value >> 1 still produce different assembly instructions and therefore have different performance. Thankfully, value / 2 does not perform a generic division operation and is still based on a shift operation, so the performance differences may not be significant.
Is shift faster than division?
To determine which shift or division is faster, I tested it using Jetpack Benchmark.
class DivideOrShiftTest { @JvmField @Rule val benchmark = BenchmarkRule() @Test fun divide(a) { val value = "4".toInt() // Ensure not a constant. var result = 0 benchmark.measureRepeated { result = value / 2 } println(result) // Ensure D8 keeps computation. } @Test fun shift(a) { val value = "4".toInt() // Ensure not a constant. var result = 0 benchmark.measureRepeated { result = value shr 1 } println(result) // Ensure D8 keeps computation.}}Copy the code
I don’t own an x86 device, so I tested it on Android 10 Pixel3, and here’s what I found:
android.studio.display.benchmark=4 ns DivideOrShiftTest.divide count=4006 mean=4 median=4 min=4 standardDeviation=0 Copy the code
android.studio.display.benchmark=3 ns DivideOrShiftTest.shift count=3943 mean=3 median=3 min=3 standardDeviation=0
There is really no difference between using division and shifting, the difference is nanosecond. Using negative numbers doesn’t make any difference.
At this point, we can safely say that using value >> 1 instead of value / 2 does no good. Stop doing this in arithmetic operations and keep it only when bitwise operations are strictly required.
Can D8/R8 reduce Apk volume?
If there are two ways to express the same operation, the one with better performance should be selected. If the performance is the same, you should choose the one that can reduce the Apk volume.
We now know that value * 2 and value << 1 produce the same assembly code on ART. Therefore, if one is more space-efficient in Dalvik, we should definitely use it instead of the other. Let’s look at D8’s output, which also produces bytecode of the same size:
#1 : (in LExampleKt;) Name: "multiply" ⋮ 000140: da00 0102 | 0000: the mul - int/lit8 where v0, v1, 2 / / # 02 # int
#2 : (in LExampleKt;) Name: 'shiftLeft ⋮Copy the code
Copy the code
000158: e000 0101 |0000: shl-int/lit8 v0, v1, #int 1 // #01
Multiplication may take up more space to store literals. Compare value * 32_768 with value << 15.
#1 : (in LExampleKt;) Name: "multiply" ⋮ 000128:1400 0080 0000 | 0000: const where v0, 00012 e # 0.000000 / / float # 00008000:9201 0100 | 0003: mul-int v1, v1, v0
#2 : (in LExampleKt;) Name: 'shiftLeft ⋮Copy the code
Copy the code
00015c: e000 000f |0000: shl-int/lit8 v0, v0, #int 15 // #0f
I mentioned this issue on D8, but I strongly suspect that the probability of this happening is 0, so it’s not worth it.
The output from D8 and R8 also shows that value / 2 and value >> 1 have the same cost for Dalvik.
#0 : (in LExampleKt;) Name: 'divide ⋮ 000128: db00 0102 | 0000: div - int/lit8 where v0, v1, 2 / / # 02 # int
#2 : (in LExampleKt;) Name: 'shiftLeft ⋮Copy the code
Copy the code
000158: e000 0101 |0000: shl-int/lit8 v0, v1, #int 1 // #01
The bytecode size above also changes when the literal size reaches 32768. Because of negative numbers, it is not absolutely safe to use a right shift unconditionally instead of division to the second power. We can make this substitution without making it negative.
Does the second power division of unsigned numbers also use shifts?
Java bytecode does not have unsigned numbers, but you can simulate them using signed numbers. Java provides static methods to convert signed numbers to unsigned numbers. Kotlin provides the unsigned type UInt, which provides the same functionality but, unlike Java, abstracts itself as a data type. As you can imagine, the division of a second power can definitely be rewritten with a right shift operation.
Use Kotlin to illustrate the following two cases.
fun javaLike(value: Int) = Integer.divideUnsigned(value, 2) fun kotlinLike(value: UInt) = value / 2U Copy the code
Compiled by Kotlinc (Kotlin 1.4-M1) :
$ kotlinc Example.kt $ javap -c ExampleKt Compiled from "Example.kt" public final class ExampleKt { public static final int javaLike(int); Code: 0: iload_0 1: iconst_2 2: invokestatic #12 // Method java/lang/Integer.divideUnsigned:(II)I 5: ireturn Copy the code
public static final int kotlinLike-WZ4Q5Ns(int); Code: 0: iload_0 1: istore_1 2: iconst_2 3: istore_2 4: iconst_0 5: istore_3 6: iload_1 7: iload_2 8: invokestatic #20 // Method kotlin/UnsignedKt."uintDivide-J1ME1BU":(II)I 11: ireturn }
Kotlin failed to recognize that this was a quadratic division, which could have been replaced by the IUSHR shift operation. I also submitted this issue to Jetbrain.
Using -xuse-i also doesn’t change anything (except to remove some loads/stores). However, working with Java8 is different.
$kotlinc - JVMS -target 1.8 example. kt $javap -c ExampleKt Compiled from "example. kt" Public Final class ExampleKt { public static final int javaLike(int); Code: 0: iload_0 1: iconst_2 2: invokestatic #12 // Method java/lang/Integer.divideUnsigned:(II)I 5: ireturnCopy the code
public static final int kotlinLike-WZ4Q5Ns(int); Code: 0: iload_0 1: iconst_2 2: invokestatic #12 // Method java/lang/Integer.divideUnsigned:(II)I 5: ireturn }
Integer.divideUnsigned methods are available as of Java 8. Since this makes the bodies of the two functions exactly the same, let’s go back to the old version for comparison.
And then R8. Significantly different from the above, we use the Kotlin library as input and specify the minimum API, — min-API 24. Because integer. divideUnsigned is only available in API 24 and beyond.
$ java -jar $R8_HOME/build/libs/r8.jar \ --lib $ANDROID_HOME/platforms/android-29/android.jar \ --min-api 24 \ --release \ --pg-conf rules.txt \ --output . \ ExampleKt.class kotlin-stdlib.jar $ dexdump -d classes.dex Opened 'classes.dex', DEX version '039' Class #0 - Class descriptor : 'LExampleKt; ' Access flags : 0x0011 (PUBLIC FINAL) Superclass : 'Ljava/lang/Object; ' Direct methods - #0 : (in LExampleKt;) name : 'javaLike' type : '(I)I' access : 0x0019 (PUBLIC STATIC FINAL) code - 0000f8: |[0000f8] ExampleKt.javaLike:(I)I 000108: 1220 |0000: const/4 v0, #int 2 // #2 00010a: 7120 0200 0100 |0001: invoke-static {v1, v0}, Ljava/lang/Integer; .divideUnsigned:(II)I // method@0002 000110: 0a01 |0004: move-result v1 000112: 0f01 |0005: return v1
#1 : (in LExampleKt;) name : 'kotlinLike-WZ4Q5Ns' type : '(I)I' access : 0x0019 (PUBLIC STATIC FINAL) code - Copy the code
Copy the code
000114: |[000114] ExampleKt.kotlinLike-WZ4Q5Ns:(I)I 000124: 8160 |0000: int-to-long v0, v6 000126: 0000 0000 | 0001:1802 FFFF FFFF const - wide v2, # 0.000000 / double / # 000130:00000000 FFFFFFFF c020 | 0006: and-long/2addr v0, v2 000132: 1226 |0007: const/4 v6, #int 2 // #2 000134: 8164 |0008: int-to-long v4, v6 000136: c042 |0009: and-long/2addr v2, v4 000138: be20 |000a: div-long/2addr v0, v2 00013a: 8406 |000b: long-to-int v6, v0 00013c: 0f06 |000c: return v6
Kotlin has his own implementation of unsigned integers that are directly inlined inside functions. It does this by converting arguments and literals to long, dividing by long, and finally converting to int. When we eventually run them through ART they’re just translated to equivalent x86 so we’re going to leave this function Behind. This is a missed optimization opportunity.
For the Java version, R8 also does not use a shift operation to replace divideUnsigned. I have submitted the issue for continuous tracking.
The final optimization opportunity is ART.
$ adb push classes.dex /sdcard/classes.dex $ adb shell generic_x86:/ $ sugenzong generic_x86:/ # dex2oat --dex-file=/sdcard/classes.dex --oat-file=/sdcard/classes.oat generic_x86:/ # oatdump --oat-file=/sdcard/classes.oat OatDexFile: 0: LExampleKt; (offset=0x000003c0) (type_idx=1) (Initialized) (OatClassAllCompiled) 0: int ExampleKt.javaLike(int) (dex_method_idx=0) CODE: (code_offset=0x00001010 size_offset=0x0000100c size=63)... 0x00001010: 85842400E0FFFF test eax, [esp + -8192] StackMap[0] (native_pc=0x1017, dex_pc=0x0, register_mask=0x0, stack_mask=0b) 0x00001017: 55 push ebp 0x00001018: 83EC18 sub esp, 24 0x0000101b: 890424 mov [esp], eax 0x0000101e: 6466833D0000000000 cmpw fs:[0x0], 0 ; state_and_flags 0x00001027: 0F8519000000 jnz/ne +25 (0x00001046) 0x0000102d: E800000000 call +0 (0x00001032) 0x00001032: 5D pop ebp 0x00001033: BA02000000 mov edx, 2 0x00001038: 8B85CE0F0000 mov eax, [ebp + 4046] 0x0000103e: FF5018 call [eax + 24] StackMap[1] (native_pc=0x1041, dex_pc=0x1, register_mask=0x0, stack_mask=0b) 0x00001041: 83C418 add esp, 24 0x00001044: 5D pop ebp 0x00001045: C3 ret 0x00001046: 64FF15E0020000 call fs:[0x2e0] ; pTestSuspend StackMap[2] (native_pc=0x104d, dex_pc=0x0, register_mask=0x0, stack_mask=0b) 0x0000104d: EBDE jmp -34 (0x0000102d) 1: int ExampleKt.kotlinLike-WZ4Q5Ns(int) (dex_method_idx=1) CODE: (code_offset=0x00001060 size_offset=0x0000105c size=67)... ⋮ Copy the code
ART does not make an inline call to divideUnsigned, instead making a regular method call. I submitted this issue for tracking.
The last
It’s been a long journey, congratulations on completing it (or just flipping to the bottom of the article). Let’s sum up.
- ART rewrites multiplication/division for quadratic powers using left/right shift (there are additional instructions for negative numbers).
- There is no significant performance difference between right shift and quadratic division.
- The Dalvik bytecode size of shift and multiplication and division is the same.
- No one has optimized unsigned division (at least not yet), but you probably haven’t either.
With these facts, you can answer the question at the beginning of the article.
On Android, divide by 2 or move right by 1?
All is not! Shift operations are used only when bitwise operations are actually required, and multiplication and division are used for other mathematical operations. I’m going to start switching bits from AndroidX Collection to multiplication and division. See you next time!
May translate more recently, encounter some good articles always can’t help but share with everyone.
In fact, the translation is not easier than the original text, I will write the translation at least on the basis of a thorough reading, intensive reading twice. If you think the article is good, feel free to read, forward, share!
This article is formatted using MDNICE