Leonnewton · 2016/03/24 10:45


0x00 Preface


DEXLabs published a blog post called "Detecting Android Sandboxes", which proposes a method for detecting Android sandboxes, along with a PoC. This article introduces the content of the original post and adds my own experimental results and discussion.

0x01 Binary translation mechanism for Qemu


Take a look at Baidu Baike’s explanation of binary translation:

Binary Translation is a technique for directly translating executable binaries from one processor to another.

Qemu uses binary translation to convert guest code, such as ARM code, into the host system's instruction set, such as x86. Binary translation is faster than instruction-by-instruction emulation because translated blocks of code can be cached and then executed directly. The following is a flow chart of Qemu's binary translation process:

As the figure shows, translation proceeds until a branch instruction is encountered; the branch gives the address of the code that follows. Qemu first looks that address up in the code cache. On a hit, the block has already been translated, so it is simply executed. On a miss, the translator translates code starting at that address until it reaches the next branch instruction; the translated block is then cached and executed.
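The lookup-translate-execute loop can be illustrated with a toy model (my own sketch with made-up names, not Qemu's actual code or API):

```cpp
#include <cassert>
#include <map>
#include <vector>

// A "guest program" is a list of basic blocks; executing a block yields the
// address of the next block (the branch target), or -1 to halt.
struct Block { int next; };

static std::map<int, int> code_cache;  // guest address -> "translated" handle
static int translations = 0;           // how many times the translator ran

// Translate the block at `addr` (here: just hand back a handle and count the work).
static int translate(int addr) { ++translations; return addr; }

// Execute translated code: in this toy it only yields the branch target.
static int exec_translated(const std::vector<Block>& prog, int handle) {
    return prog[handle].next;
}

// The main loop: consult the code cache first, translate only on a miss.
static void run(const std::vector<Block>& prog, int pc) {
    while (pc != -1) {
        int handle;
        auto it = code_cache.find(pc);
        if (it != code_cache.end()) {
            handle = it->second;         // cache hit: reuse the translation
        } else {
            handle = translate(pc);      // cache miss: translate, then cache
            code_cache[pc] = handle;
        }
        pc = exec_translated(prog, handle);  // run until the next branch
    }
}
```

Running the same guest program twice translates each block only once; the second pass is served entirely from the cache.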

0x02 An optimization of Qemu


A physical CPU advances its program counter after every instruction, so the counter always holds the current value. A naive virtual CPU would do the same, updating the emulated program counter after every instruction to keep it current. However, because the translated code runs natively on the host, the emulated program counter only has to be correct when the guest code actually reads it (a stale value does not affect normal execution on the host). As an optimization, Qemu therefore updates the program counter only when the correct value is needed, and it is always synchronized at branches. In between, the emulated program counter still points to the beginning of the current code block.
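As a toy illustration of this (again my own sketch, not Qemu internals): while a translated block runs, the emulated program counter still holds the block's start address and is only written back at the block-ending branch.

```cpp
#include <cassert>

struct Cpu { int pc; };  // the emulated (architectural) program counter

static int observed_pc_mid_block = -1;

// "Run" a block of n instructions; the emulated PC is synced once, at the
// branch that ends the block, not after every instruction.
static void exec_block(Cpu& cpu, int n, int branch_target) {
    for (int i = 0; i < n; ++i) {
        // the host executes instruction i here, but cpu.pc is NOT updated
        if (i == n / 2)
            observed_pc_mid_block = cpu.pc;  // a mid-block observer sees the block start
    }
    cpu.pc = branch_target;  // written back only at the block-ending branch
}
```

Peeking at `cpu.pc` in the middle of the block returns the block's start address, exactly the stale-but-harmless value the optimization relies on.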

0x03 Side effects of the optimization


Consider what this optimization means for interrupts in a multitasking operating system. Since the program counter is not updated after every instruction, the virtual CPU cannot be interrupted in the middle of a code block: after the interrupt, it could not recover the correct program counter value. So while a block of code is running, task scheduling normally does not happen. The whole process is shown below:

0x04 Experimental verification


Since, as argued above, a block cannot be interrupted and tasks cannot be scheduled within it, the experiment deliberately provokes task scheduling and then examines where it occurs: record the addresses at which tasks are scheduled and look at their distribution. On a real machine, these addresses should occur in roughly equal numbers, evenly distributed, with each address equally likely to be a scheduling point. In Qemu's virtual environment, task scheduling happens only at the end of a code block, so the addresses are far from uniformly distributed.

The specific experiment uses two threads. One thread increments a global variable inside a single code block, writing each intermediate value to the global variable in turn on every pass through the loop. The other thread also loops, reading the global variable each time; we then count how many times each value is read. The process is as follows:

The code for the incrementing thread is as follows:

```cpp
void* thread1(void* data) {
    for (;;) {
        __asm__ __volatile__ (
            "mov r0, %[global];"
            "mov r1, #1;"
            // 31 identical add/store pairs: the block writes 2, 3, ..., 32
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            "add r1, r1, #1;" "str r1, [r0];"
            :
            : [global] "r" (&global_value)
            : "r0", "r1", "memory"  // clobbered scratch registers and the stored-to memory
        );
    }
}
```

The block writes the values from 2 all the way up to 32.

The other thread reads the global variable directly and counts the values it reads:

```cpp
int i;
for (i = 0; i < 50000; i++)
    count[global_value]++;
```
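For completeness, the surrounding harness can be sketched like this (a portable stand-in of my own: the writer uses a plain C loop over the values 2..32 instead of the unrolled ARM assembly, so it compiles on any host; the measured effect of course requires the real assembly block running under Qemu):

```cpp
#include <cassert>
#include <pthread.h>

static volatile int global_value = 2;  // shared between the two threads
static volatile int stop = 0;

// Writer: repeatedly store 2, 3, ..., 32 (stand-in for the assembly block).
static void* writer(void*) {
    while (!stop)
        for (int v = 2; v <= 32; ++v)
            global_value = v;
    return nullptr;
}

// Reader: sample the shared variable 50000 times and histogram the values.
static long run_reader(long count[33]) {
    long total = 0;
    for (int i = 0; i < 50000; ++i) {
        ++count[global_value];
        ++total;
    }
    return total;
}
```

In the full experiment you would start `writer` with `pthread_create`, call `run_reader` from the main thread, then set `stop = 1` and `pthread_join` the writer before inspecting `count`.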

We then compute the variance of the count array to measure how dispersed the distribution is. I tried three different ways:

  1. Compute the variance and standard deviation of the raw data directly;
  2. Divide all the data by the maximum, then compute the variance and standard deviation;
  3. Apply min-max (deviation) standardization, then compute the variance and standard deviation.
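The three measures above could be computed like this (a sketch; the function names are mine, and I use the population variance):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// 1. Population variance of the raw data (standard deviation is its sqrt).
static double variance(const std::vector<double>& x) {
    double mean = 0;
    for (double v : x) mean += v;
    mean /= x.size();
    double var = 0;
    for (double v : x) var += (v - mean) * (v - mean);
    return var / x.size();
}

// 2. Divide every value by the maximum first, then take the variance.
static double variance_max_scaled(std::vector<double> x) {
    double mx = *std::max_element(x.begin(), x.end());
    for (double& v : x) v /= mx;
    return variance(x);
}

// 3. Min-max (deviation) standardization: (v - min) / (max - min), then variance.
static double variance_minmax(std::vector<double> x) {
    double mn = *std::min_element(x.begin(), x.end());
    double mx = *std::max_element(x.begin(), x.end());
    for (double& v : x) v = (v - mn) / (mx - mn);
    return variance(x);
}
```

Scaling by a constant `c` divides the variance by `c²`, so methods 2 and 3 only change the scale of the result, which makes thresholds comparable across runs with different totals.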

0x05 Results and discussion


The reading loop runs 50,000 times in total.

In the simulator, the variance and standard deviation calculated in three ways are as follows:

Each number was counted as follows:

Everything except count[32] is 0.

In the real machine, the variance and standard deviation calculated by the three methods are as follows:

Each number was counted as follows:

Although count[32] is larger than any other entry, the values in between are distributed almost evenly.

So what does the experiment show? While one thread is incrementing, the other thread is scheduled in at some point, and the value it reads reveals where in the block the interruption happened. The results do match the analysis above: the emulator schedules tasks only at the end of the code block, where the branch instruction sits. On the real machine, the end of the block is also the most frequent scheduling point, but the remaining values are distributed essentially uniformly. You can therefore design your own dispersion index to decide whether you are running in an emulator.
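As one example of such an index (my own choice, not from the original post), you could measure how concentrated the histogram is, e.g. the fraction of all reads that landed in the single most frequent bin: on Qemu nearly every read sees the block's final value, so the ratio approaches 1.0, while on hardware it is far lower.

```cpp
#include <cassert>

// Decide "emulator" when one bin holds more than `threshold` of all reads.
// The 0.9 cutoff is an illustrative guess, not a calibrated value.
static bool looks_like_emulator(const long count[33], double threshold = 0.9) {
    long total = 0, mx = 0;
    for (int i = 2; i <= 32; ++i) {
        total += count[i];
        if (count[i] > mx) mx = count[i];
    }
    return total > 0 && (double)mx / (double)total > threshold;
}
```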