Jake Wharton

Which is Better on Android: Divide by 2 or Shift by 1?

The original address: https://jakewharton.com/which-is-better-on-android-divide-by-two-or-shift-by-one/

Translator: Bingxin said

I have been trying to port the AndroidX Collection Library to Kotlin Multiplatform to test binary compatibility, performance, ease of use and different memory models. Some data structures in the library use a binary tree based on an array-based implementation to store elements. There are many places in Java code where a shift operation is used instead of multiplying and dividing quadratic powers. When ported to Kotlin, this code is converted to a slightly twisted infix operator, which confuses the code’s intent.

I’ve done some research on whether shift or multiplication/division performs better, and most people have heard “shift performs better,” but are skeptical about whether it’s true. Some people think the compiler might make some optimizations before the code runs to the CPU.

To satisfy my curiosity and avoid using Kotlin’s infix shift operator, I’ll answer who’s better and some related questions. Let’s go !

I SHR 1, I SHL 1

Who optimized the code?

Before our code is executed by the CPU, there are several important compilers: Javac/Kotlinc, D8, R8, and ART.


Each of these steps had an opportunity to optimize, but did they?

javac

class Example {
  static int multiply(int value) {
    return value * 2;
  }
  static int divide(int value) {
    return value / 2;
  }
  static int shiftLeft(int value) {
    return value << 1;
  }
  static int shiftRight(int value) {
    return value >> 1; }}Copy the code

Compile the above code under JDK 14 and present the bytecode through JavAP.

$ javac Example.java
$ javap -c Example
Compiled from "Example.java"
class Example {
  static int multiply(int);
    Code:
       0: iload_0
       1: iconst_2
       2: imul
       3: ireturn

  static int divide(int);
    Code:
       0: iload_0
       1: iconst_2
       2: idiv
       3: ireturn

  static int shiftLeft(int);
    Code:
       0: iload_0
       1: iconst_1
       2: ishl
       3: ireturn

  static int shiftRight(int);
    Code:
       0: iload_0
       1: iconst_1
       2: ishr
       3: ireturn
}
Copy the code

Each method begins with the iloAD_0 directive, indicating that the first parameter is loaded. Both multiplication and division use the iconst_2 instruction to load the literal 2. The imul and IDIV instructions are then executed to multiply and divide int. The shift operation also loads the literal 1 and then uses the ishl and ishr instructions to perform the shift operation.

No optimizations have been made, but if you know anything about Java, it’s no surprise. Javac is not an optimized compiler, leaving most of the work to the runtime compiler or AOT on the JVM.

kotlinc

fun multiply(value: Int) = value * 2
fun divide(value: Int) = value / 2
fun shiftLeft(value: Int) = value shl 1
fun shiftRight(value: Int) = value shr 1
Copy the code

Kotlin is compiled into Java bytecode through Kotlinc under Kotlin version 1.4-M1 and viewed using Javap.

$ kotlinc Example.kt
$ javap -c ExampleKt
Compiled from "Example.kt"
public final class ExampleKt {
  public static final int multiply(int);
    Code:
       0: iload_0
       1: iconst_2
       2: imul
       3: ireturn

  public static final int divide(int);
    Code:
       0: iload_0
       1: iconst_2
       2: idiv
       3: ireturn

  public static final int shiftLeft(int);
    Code:
       0: iload_0
       1: iconst_1
       2: ishl
       3: ireturn

  public static final int shiftRight(int);
    Code:
       0: iload_0
       1: iconst_1
       2: ishr
       3: ireturn
}
Copy the code

The output is exactly the same as Java.

This is using the original JVM backend of Kotlin, but using the forthcoming IR-based backend (via -Xuse-ir) also produces the same output.

I framed the one above because I didn’t understand it

D8

The bytecode converted from the above example Kotlin code is used to generate a DEX file using the latest D8 compiler.

$ java -jar $R8_HOME/build/libs/d8.jar \ --release \ --output . \ ExampleKt.class $ dexdump -d classes.dex Opened 'classes.dex', DEX version '035' Class #0 - Class descriptor : 'LExampleKt; ' Access flags : 0x0011 (PUBLIC FINAL) Superclass : 'Ljava/lang/Object; ' Direct methods - #0 : (in LExampleKt;) name : 'divide' type : '(I)I' access : 0x0019 (PUBLIC STATIC FINAL) code - 000118: |[000118] ExampleKt.divide:(I)I 000128: db00 0102 |0000: div-int/lit8 v0, v1, #int 2 // #02 00012c: 0f00 |0002: return v0
#1              : (in LExampleKt;)
  name          : 'multiply'
  type          : '(I)I'
  access        : 0x0019 (PUBLIC STATIC FINAL)
  code          -
Copy the code

000130: |[000130] ExampleKt.multiply:(I)I
000140: da00 0102 |0000: mul-int/lit8 v0, v1, #int 2 // #02
000144: 0f00 |0002: return v0

#2              : (in LExampleKt;)
  name          : 'shiftLeft'
  type          : '(I)I'
  access        : 0x0019 (PUBLIC STATIC FINAL)
  code          -
Copy the code

000148: |[000148] ExampleKt.shiftLeft:(I)I
000158: e000 0101 |0000: shl-int/lit8 v0, v1, #int 1 // #01
00015c: 0f00 |0002: return v0

#3              : (in LExampleKt;)
  name          : 'shiftRight'
  type          : '(I)I'
  access        : 0x0019 (PUBLIC STATIC FINAL)
  code          -
Copy the code

Copy the code

000160: |[000160] ExampleKt.shiftRight:(I)I
000170: e100 0101 |0000: shr-int/lit8 v0, v1, #int 1 // #01
000174: 0f00 |0002: return v0

(Slightly optimized output)

Dalvik bytecode is register-based, while Java bytecode is stack based. In the end, each method actually uses just one bytecode to manipulate the associated integer operation. They both use the v1 register to hold the first method parameter, plus a literal 1 or 2.

So nothing changes, and D8 is not an optimization compiler (although it can do method-local optimization).

R8

To run R8, we need to configure obfuscation rules to prevent our code from being removed.

-keep,allowoptimization class ExampleKt {
  <methods>;
}
Copy the code

The above rules are passed with the –pg-conf argument.

$ java -jar $R8_HOME/build/libs/r8.jar \
      --lib $ANDROID_HOME/platforms/android-29/android.jar \
      --release \
      --pg-conf rules.txt \
      --output . \
      ExampleKt.class
$ dexdump -d classes.dex
Opened 'classes.dex', DEX version '035'
Class #0            -
  Class descriptor  : 'LExampleKt; '
  Access flags      : 0x0011 (PUBLIC FINAL)
  Superclass        : 'Ljava/lang/Object; '
  Direct methods    -
    #0              : (in LExampleKt;)
      name          : 'divide'
      type          : '(I)I'
      access        : 0x0019 (PUBLIC STATIC FINAL)
      code          -
000118: |000118] ExampleKt.divide:(I)I
000128: db00 0102                    |0000: div-int/lit8 v0, v1, #int 2 / / # 02
00012c: 0f00                         |0002: return v0

    #1              : (in LExampleKt;)
      name          : 'multiply'
      type          : '(I)I'
      access        : 0x0019 (PUBLIC STATIC FINAL)
      code          -
000130: |000130] ExampleKt.multiply:(I)I
000140: da00 0102                    |0000: mul-int/lit8 v0, v1, #int 2 / / # 02
000144: 0f00                         |0002: return v0

    #2              : (in LExampleKt;)
      name          : 'shiftLeft'
      type          : '(I)I'
      access        : 0x0019 (PUBLIC STATIC FINAL)
      code          -
000148: |000148] ExampleKt.shiftLeft:(I)I
000158: e000 0101                    |0000: shl-int/lit8 v0, v1, #int 1 / / # 01
00015c: 0f00                         |0002: return v0

    #3              : (in LExampleKt;)
      name          : 'shiftRight'
      type          : '(I)I'
      access        : 0x0019 (PUBLIC STATIC FINAL)
      code          -
000160: |000160] ExampleKt.shiftRight:(I)I
000170: e100 0101                    |0000: shr-int/lit8 v0, v1, #int 1 / / # 01
000174: 0f00                         |0002: return v0
Copy the code

It’s exactly the same as the output of D8.

ART

The Dalvik bytecode output from R8 above is used as input to the ART, which is run on an Android 10 x86 VIRTUAL machine.

$ adb push classes.dex /sdcard/classes.dex
$ adb shell
generic_x86:/ $ su
generic_x86:/ # dex2oat --dex-file=/sdcard/classes.dex --oat-file=/sdcard/classes.oat
generic_x86:/ # oatdump --oat-file=/sdcard/classes.oat
OatDexFile:
0: LExampleKt; (offset=0x000003c0) (type_idx=1) (Initialized) (OatClassAllCompiled)
  0: int ExampleKt.divide(int) (dex_method_idx=0)
    CODE: (code_offset=0x00001010 size_offset=0x0000100c size=15)...
      0x00001010:     89C8      mov eax, ecx
      0x00001012:   8D5001      lea edx, [eax + 1]
      0x00001015:     85C0      test eax, eax
      0x00001017:   0F4DD0      cmovnl/ge edx, eax
      0x0000101a:     D1FA      sar edx
      0x0000101c:     89D0      mov eax, edx
      0x0000101e:       C3      ret
  1: int ExampleKt.multiply(int) (dex_method_idx=1)
    CODE: (code_offset=0x00001030 size_offset=0x0000102c size=5)...
      0x00001030:     D1E1      shl ecx
      0x00001032:     89C8      mov eax, ecx
      0x00001034:       C3      ret
  2: int ExampleKt.shiftLeft(int) (dex_method_idx=2)
    CODE: (code_offset=0x00001030 size_offset=0x0000102c size=5)...
      0x00001030:     D1E1      shl ecx
      0x00001032:     89C8      mov eax, ecx
      0x00001034:       C3      ret
  3: int ExampleKt.shiftRight(int) (dex_method_idx=3)
    CODE: (code_offset=0x00001040 size_offset=0x0000103c size=5)...
      0x00001040:     D1F9      sar ecx
      0x00001042:     89C8      mov eax, ecx
      0x00001044:       C3      ret
Copy the code

(Slightly optimized output)

The x86 assembly code shows that ART intervenes in the math and replaces some of it with a shift operation.

First, multiply and shiftLeft now have the same implementation, and both use SHL to move one bit to the left. Otherwise, if you look at the file offset (the leftmost column), you’ll see exactly the same. ART recognized that the two methods had the same method body and de-duplicated them when they were translated into x86 assembly code.

Divide and shiftRight, however, are implemented differently in that they do not work together with SAR to move one bit to the right. The Divide method uses four additional instructions before calling SAR to handle negative input.

Execute the same instructions on an Android 10 Pixel4 device to see how ART compiles code into ARM assembly code.

OatDexFile:
0: LExampleKt; (offset=0x000005a4) (type_idx=1) (Verified) (OatClassAllCompiled)
  0: int ExampleKt.divide(int) (dex_mmultiply and shiftLeft ethod_idx=0)
    CODE: (code_offset=0x00001009 size_offset=0x00001004 size=10)...
      0x00001008: 0fc8      lsrs r0, r1, #31
      0x0000100a: 1841      adds r1, r0, r1
      0x0000100c: 1049      asrs r1, #1
      0x0000100e: 4608      mov r0, r1
      0x00001010: 4770      bx lr
  1: int ExampleKt.multiply(int) (dex_method_idx=1)
    CODE: (code_offset=0x00001021 size_offset=0x0000101c size=4)...
      0x00001020: 0048      lsls r0, r1, #1
      0x00001022: 4770      bx lr
  2: int ExampleKt.shiftLeft(int) (dex_method_idx=2)
    CODE: (code_offset=0x00001021 size_offset=0x0000101c size=4)...
      0x00001020: 0048      lsls r0, r1, #1
      0x00001022: 4770      bx lr
  3: int ExampleKt.shiftRight(int) (dex_method_idx=3)
    CODE: (code_offset=0x00001031 size_offset=0x0000102c size=4)...
      0x00001030: 1048      asrs r0, r1, #1
      0x00001032: 4770      bx lr
Copy the code

Similarly, multiply and shiftLeft use LSLS to move the left one bit and remove the duplicate method body. ShiftRight does the right shift with the ASRS instruction, while division uses another right shift instruction, LSRS, to handle negative input.

At this point, we can safely say that using value << 1 instead of value * 2 does no good. Stop doing this in arithmetic operations and keep it only when bitwise operations are strictly required.

However, value / 2 and value >> 1 still produce different assembly instructions and therefore have different performance. Thankfully, value / 2 does not perform a generic division operation and is still based on a shift operation, so the performance differences may not be significant.

Is shift faster than division?

To determine which shift or division is faster, I tested it using Jetpack Benchmark.

class DivideOrShiftTest {
  @JvmField @Rule val benchmark = BenchmarkRule()

  @Test fun divide(a) {
    val value = "4".toInt() // Ensure not a constant.
    var result = 0
    benchmark.measureRepeated {
      result = value / 2
    }
    println(result) // Ensure D8 keeps computation.
  }

  @Test fun shift(a) {
    val value = "4".toInt() // Ensure not a constant.
    var result = 0
    benchmark.measureRepeated {
      result = value shr 1
    }
    println(result) // Ensure D8 keeps computation.}}Copy the code

I don’t own an x86 device, so I tested it on Android 10 Pixel3, and here’s what I found:

android.studio.display.benchmark=4 ns DivideOrShiftTest.divide
count=4006
mean=4
median=4
min=4
standardDeviation=0
Copy the code

android.studio.display.benchmark=3 ns DivideOrShiftTest.shift count=3943 mean=3 median=3 min=3 standardDeviation=0

There is really no difference between using division and shifting, the difference is nanosecond. Using negative numbers doesn’t make any difference.

At this point, we can safely say that using value >> 1 instead of value / 2 does no good. Stop doing this in arithmetic operations and keep it only when bitwise operations are strictly required.

Can D8/R8 reduce Apk volume?

If there are two ways to express the same operation, the one with better performance should be selected. If the performance is the same, you should choose the one that can reduce the Apk volume.

We now know that value * 2 and value << 1 produce the same assembly code on ART. Therefore, if one is more space-efficient in Dalvik, we should definitely use it instead of the other. Let’s look at D8’s output, which also produces bytecode of the same size:

#1 : (in LExampleKt;) Name: "multiply" ⋮ 000140: da00 0102 | 0000: the mul - int/lit8 where v0, v1, 2 / / # 02 # int
#2 : (in LExampleKt;) Name: 'shiftLeft ⋮Copy the code

Copy the code

000158: e000 0101 |0000: shl-int/lit8 v0, v1, #int 1 // #01

Multiplication may take up more space to store literals. Compare value * 32_768 with value << 15.

#1 : (in LExampleKt;) Name: "multiply" ⋮ 000128:1400 0080 0000 | 0000: const where v0, 00012 e # 0.000000 / / float # 00008000:9201 0100 | 0003: mul-int v1, v1, v0
#2 : (in LExampleKt;) Name: 'shiftLeft ⋮Copy the code

Copy the code

00015c: e000 000f |0000: shl-int/lit8 v0, v0, #int 15 // #0f

I mentioned this issue on D8, but I strongly suspect that the probability of this happening is 0, so it’s not worth it.

The output from D8 and R8 also shows that value / 2 and value >> 1 have the same cost for Dalvik.

#0 : (in LExampleKt;) Name: 'divide ⋮ 000128: db00 0102 | 0000: div - int/lit8 where v0, v1, 2 / / # 02 # int
#2 : (in LExampleKt;) Name: 'shiftLeft ⋮Copy the code

Copy the code

000158: e000 0101 |0000: shl-int/lit8 v0, v1, #int 1 // #01

The bytecode size above also changes when the literal size reaches 32768. Because of negative numbers, it is not absolutely safe to use a right shift unconditionally instead of division to the second power. We can make this substitution without making it negative.

Does the second power division of unsigned numbers also use shifts?

Java bytecode does not have unsigned numbers, but you can simulate them using signed numbers. Java provides static methods to convert signed numbers to unsigned numbers. Kotlin provides the unsigned type UInt, which provides the same functionality but, unlike Java, abstracts itself as a data type. As you can imagine, the division of a second power can definitely be rewritten with a right shift operation.

Use Kotlin to illustrate the following two cases.

fun javaLike(value: Int) = Integer.divideUnsigned(value, 2)
fun kotlinLike(value: UInt) = value / 2U
Copy the code

Compiled by Kotlinc (Kotlin 1.4-M1) :

$ kotlinc Example.kt
$ javap -c ExampleKt
Compiled from "Example.kt"
public final class ExampleKt {
  public static final int javaLike(int);
    Code:
       0: iload_0
       1: iconst_2
       2: invokestatic  #12       // Method java/lang/Integer.divideUnsigned:(II)I
       5: ireturn
Copy the code

public static final int kotlinLike-WZ4Q5Ns(int); Code: 0: iload_0 1: istore_1 2: iconst_2 3: istore_2 4: iconst_0 5: istore_3 6: iload_1 7: iload_2 8: invokestatic #20 // Method kotlin/UnsignedKt."uintDivide-J1ME1BU":(II)I 11: ireturn }

Kotlin failed to recognize that this was a quadratic division, which could have been replaced by the IUSHR shift operation. I also submitted this issue to Jetbrain.

Using -xuse-i also doesn’t change anything (except to remove some loads/stores). However, working with Java8 is different.

$kotlinc - JVMS -target 1.8 example. kt $javap -c ExampleKt Compiled from "example. kt" Public Final class ExampleKt { public static final int javaLike(int); Code: 0: iload_0 1: iconst_2 2: invokestatic #12 // Method java/lang/Integer.divideUnsigned:(II)I 5: ireturnCopy the code

public static final int kotlinLike-WZ4Q5Ns(int); Code: 0: iload_0 1: iconst_2 2: invokestatic #12 // Method java/lang/Integer.divideUnsigned:(II)I 5: ireturn }

Integer.divideUnsigned methods are available as of Java 8. Since this makes the bodies of the two functions exactly the same, let’s go back to the old version for comparison.

And then R8. Significantly different from the above, we use the Kotlin library as input and specify the minimum API, — min-API 24. Because integer. divideUnsigned is only available in API 24 and beyond.

$ java -jar $R8_HOME/build/libs/r8.jar \ --lib $ANDROID_HOME/platforms/android-29/android.jar \ --min-api 24 \ --release  \ --pg-conf rules.txt \ --output . \ ExampleKt.class kotlin-stdlib.jar $ dexdump -d classes.dex Opened 'classes.dex', DEX version '039' Class #0 - Class descriptor : 'LExampleKt; ' Access flags : 0x0011 (PUBLIC FINAL) Superclass : 'Ljava/lang/Object; ' Direct methods - #0 : (in LExampleKt;) name : 'javaLike' type : '(I)I' access : 0x0019 (PUBLIC STATIC FINAL) code - 0000f8: |[0000f8] ExampleKt.javaLike:(I)I 000108: 1220 |0000: const/4 v0, #int 2 // #2 00010a: 7120 0200 0100 |0001: invoke-static {v1, v0}, Ljava/lang/Integer; .divideUnsigned:(II)I // method@0002 000110: 0a01 |0004: move-result v1 000112: 0f01 |0005: return v1
#1              : (in LExampleKt;)
  name          : 'kotlinLike-WZ4Q5Ns'
  type          : '(I)I'
  access        : 0x0019 (PUBLIC STATIC FINAL)
  code          -
Copy the code

Copy the code

000114: |[000114] ExampleKt.kotlinLike-WZ4Q5Ns:(I)I 000124: 8160 |0000: int-to-long v0, v6 000126: 0000 0000 | 0001:1802 FFFF FFFF const - wide v2, # 0.000000 / double / # 000130:00000000 FFFFFFFF c020 | 0006: and-long/2addr v0, v2 000132: 1226 |0007: const/4 v6, #int 2 // #2 000134: 8164 |0008: int-to-long v4, v6 000136: c042 |0009: and-long/2addr v2, v4 000138: be20 |000a: div-long/2addr v0, v2 00013a: 8406 |000b: long-to-int v6, v0 00013c: 0f06 |000c: return v6

Kotlin has his own implementation of unsigned integers that are directly inlined inside functions. It does this by converting arguments and literals to long, dividing by long, and finally converting to int. When we eventually run them through ART they’re just translated to equivalent x86 so we’re going to leave this function Behind. This is a missed optimization opportunity.

For the Java version, R8 also does not use a shift operation to replace divideUnsigned. I have submitted the issue for continuous tracking.

The final optimization opportunity is ART.

$ adb push classes.dex /sdcard/classes.dex
$ adb shell
generic_x86:/ $ sugenzong
generic_x86:/ # dex2oat --dex-file=/sdcard/classes.dex --oat-file=/sdcard/classes.oat
generic_x86:/ # oatdump --oat-file=/sdcard/classes.oat
OatDexFile:
0: LExampleKt; (offset=0x000003c0) (type_idx=1) (Initialized) (OatClassAllCompiled)
  0: int ExampleKt.javaLike(int) (dex_method_idx=0)
    CODE: (code_offset=0x00001010 size_offset=0x0000100c size=63)...
      0x00001010:         85842400E0FFFF             test eax, [esp + -8192]
        StackMap[0] (native_pc=0x1017, dex_pc=0x0, register_mask=0x0, stack_mask=0b)
      0x00001017:                     55             push ebp
      0x00001018:                 83EC18             sub esp, 24
      0x0000101b:                 890424             mov [esp], eax
      0x0000101e:     6466833D0000000000             cmpw fs:[0x0], 0  ; state_and_flags
      0x00001027:           0F8519000000             jnz/ne +25 (0x00001046)
      0x0000102d:             E800000000             call +0 (0x00001032)
      0x00001032:                     5D             pop ebp
      0x00001033:             BA02000000             mov edx, 2
      0x00001038:           8B85CE0F0000             mov eax, [ebp + 4046]
      0x0000103e:                 FF5018             call [eax + 24]
        StackMap[1] (native_pc=0x1041, dex_pc=0x1, register_mask=0x0, stack_mask=0b)
      0x00001041:                 83C418             add esp, 24
      0x00001044:                     5D             pop ebp
      0x00001045:                     C3             ret
      0x00001046:         64FF15E0020000             call fs:[0x2e0]  ; pTestSuspend
        StackMap[2] (native_pc=0x104d, dex_pc=0x0, register_mask=0x0, stack_mask=0b)
      0x0000104d:                   EBDE             jmp -34 (0x0000102d)
  1: int ExampleKt.kotlinLike-WZ4Q5Ns(int) (dex_method_idx=1)
    CODE: (code_offset=0x00001060 size_offset=0x0000105c size=67)...
      ⋮
Copy the code

ART does not make an inline call to divideUnsigned, instead making a regular method call. I submitted this issue for tracking.

The last

It’s been a long journey, congratulations on completing it (or just flipping to the bottom of the article). Let’s sum up.

  1. ART rewrites multiplication/division for quadratic powers using left/right shift (there are additional instructions for negative numbers).
  2. There is no significant performance difference between right shift and quadratic division.
  3. The Dalvik bytecode size of shift and multiplication and division is the same.
  4. No one has optimized unsigned division (at least not yet), but you probably haven’t either.

With these facts, you can answer the question at the beginning of the article.

On Android, divide by 2 or move right by 1?

All is not! Shift operations are used only when bitwise operations are actually required, and multiplication and division are used for other mathematical operations. I’m going to start switching bits from AndroidX Collection to multiplication and division. See you next time!


May translate more recently, encounter some good articles always can’t help but share with everyone.

In fact, the translation is not easier than the original text, I will write the translation at least on the basis of a thorough reading, intensive reading twice. If you think the article is good, feel free to read, forward, share!


This article is formatted using MDNICE