This article introduces the Dalvik virtual machine. We start with Dalvik bytecode, the part we care about most, and then go deeper into Android reverse engineering. I am writing it because someone I know wants to learn the subject, because much of the information online is scattered and one-sided, and, most importantly, as a summary of what I have learned so far.

Dalvik registers

Before we get started, let’s learn something about registers. All registers in Dalvik are 32 bits wide, and a 64-bit value is held in two adjacent 32-bit registers. In other words, a 64-bit type such as long or double needs two registers.

Virtual registers

Dalvik supports at most 65536 registers (numbered 0 to 65535), but a CPU of the ARM architecture has only 37 registers, so how is this mismatch resolved? Registers in Dalvik are virtual registers, implemented by mapping onto real registers. Each Dalvik thread maintains a call stack, and that call stack supports the mapping between virtual and real registers. When a function is executed, Dalvik determines how many registers it uses from the .registers directive. For the details, refer to Dalvik’s implementation.

All registers mentioned below are virtual registers.

Rules for using registers

For a method that uses m registers (m = l local-variable registers + n parameter registers), the local variables use the first l registers starting from v0, and the parameters use the last n registers. For example, if the instance method test(String a, String b) uses five registers, v0 through v4, then the parameters occupy v2, v3 and v4: the implicit this reference, followed by a and b.

Register naming

Registers can be named in two different ways: v naming and p naming. The choice only affects the readability of the bytecode.

v naming

Every register, whether it holds a local variable or a parameter, is named with a lowercase v followed by its number.

For the example method test(String a, String b) above, v0 and v1 are the registers available to local variables, and v2, v3 and v4 are the registers that hold the parameters.

p naming

Parameter registers are named with a lowercase p, counting up from p0. Registers available to local variables still start with v.

For test(String a, String b), v0 and v1 are the registers available to local variables, and p0, p1 and p2 are the registers that hold the parameters (p0 is v2, p1 is v3 and p2 is v4).
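The relationship between the two naming schemes can be sketched in a few lines of Java. This is only an illustration of the rule that parameters occupy the last n registers; the class and method names (RegisterNames, pToV) are made up for this example, and the sketch counts registers rather than parameters, so a long or double parameter would count as two.

```java
// Sketch: translate a p-name into the equivalent v-name for one method frame.
// RegisterNames and pToV are hypothetical names used only for illustration.
final class RegisterNames {

    /**
     * @param totalRegisters total registers of the method (locals + parameters)
     * @param paramRegisters parameter registers, including the implicit
     *                       "this" register of an instance method
     * @param pIndex         index of the parameter register (0 for p0)
     */
    static String pToV(int totalRegisters, int paramRegisters, int pIndex) {
        if (paramRegisters > totalRegisters || pIndex < 0 || pIndex >= paramRegisters) {
            throw new IllegalArgumentException("invalid register layout");
        }
        // Parameters occupy the last n registers, so p0 starts right after the locals.
        int firstParamRegister = totalRegisters - paramRegisters;
        return "v" + (firstParamRegister + pIndex);
    }

    public static void main(String[] args) {
        // Instance method test(String a, String b): 5 registers, 3 of them parameters.
        for (int p = 0; p < 3; p++) {
            System.out.println("p" + p + " -> " + pToV(5, 3, p));
        }
    }
}
```

For the test(String a, String b) example this maps p0, p1 and p2 to v2, v3 and v4.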

Dalvik descriptors

Like the JVM, Dalvik bytecode has its own notation for describing types, methods and fields; combined with Dalvik’s instructions, these descriptors make up complete assembly code.

Bytecode and data types

Dalvik bytecode has only two kinds of types: primitive types and reference types. Objects and arrays are reference types. Dalvik describes types the same way JVM descriptors do: each primitive type, and void for methods with no return value, is represented by a single capital letter; an object type is represented by the letter L followed by the fully qualified name of the class; and an array is represented by [. The specific rules are as follows.

What is a fully qualified name? For example, the full name of String is java.lang.String; in a descriptor it is written java/lang/String;, that is, every "." in java.lang.String is replaced with "/" and a semicolon is added at the end as the terminator.

Java type Type descriptor
boolean Z
byte B
short S
char C
int I
long J
float F
double D
void V
Object type L
Array type [

Here we focus on object types and array types:

Object type

L can represent any Java class. In Java code a class is referred to as package.name.ObjectName, while in Dalvik it is described as Lpackage/name/ObjectName;. The L marks a class type, as defined above, and is followed by the fully qualified name and a closing semicolon. For example, java.lang.String in Java corresponds to Ljava/lang/String;.

Array type

The [ prefix is used for arrays of all primitive types and is followed by the descriptor of the element type. Each dimension adds one [. For example, int[] in Java is encoded as [I, the two-dimensional array int[][] as [[I, and a three-dimensional array as [[[I. Primitive array descriptors have no trailing semicolon.

For arrays of objects, [ is followed by the descriptor of the corresponding class. For example, String[] in Java corresponds to [Ljava/lang/String;
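The type rules above are mechanical enough to write down as code. The following Java sketch converts a Java type name into its descriptor; TypeDescriptors and toDescriptor are hypothetical names, and the input is assumed to be a primitive name or a fully qualified class name (with $ for inner classes), optionally followed by [] pairs.

```java
import java.util.Map;

// Sketch: convert a Java type name into a Dalvik type descriptor.
// TypeDescriptors and toDescriptor are illustrative names, not an existing API.
final class TypeDescriptors {

    // Primitive types and void map to a single capital letter.
    private static final Map<String, String> PRIMITIVES = Map.of(
            "boolean", "Z", "byte", "B", "short", "S", "char", "C",
            "int", "I", "long", "J", "float", "F", "double", "D",
            "void", "V");

    static String toDescriptor(String javaType) {
        String type = javaType.trim();
        StringBuilder descriptor = new StringBuilder();

        // Each array dimension contributes one leading '['.
        while (type.endsWith("[]")) {
            descriptor.append('[');
            type = type.substring(0, type.length() - 2).trim();
        }

        String primitive = PRIMITIVES.get(type);
        if (primitive != null) {
            descriptor.append(primitive);
        } else {
            // Object type: 'L', the name with '.' replaced by '/', then ';'.
            descriptor.append('L').append(type.replace('.', '/')).append(';');
        }
        return descriptor.toString();
    }

    public static void main(String[] args) {
        System.out.println(toDescriptor("int[][]"));
        System.out.println(toDescriptor("java.lang.String[]"));
    }
}
```

Note that only object types get the trailing semicolon; a primitive array such as int[][] ends with the primitive letter.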

Field description

Field descriptions in Dalvik come in two kinds, primitive-type fields and reference-type fields, but both use the same format: object type descriptor->field name:type descriptor. For example, com.sbbic.Test contains a String field name and an int field age. Their descriptions are:

Lcom/sbbic/Test;->name:Ljava/lang/String;
Lcom/sbbic/Test;->age:I

Description of method

A Java method signature consists of the method name, the parameters and the return value. In Dalvik, the corresponding description rule is: object type descriptor->method name(parameter type descriptors)return type descriptor

Here are a few examples, using java.lang.String:

Java: public char charAt(int index) {...}
Dalvik: Ljava/lang/String;->charAt(I)C

Java: public void getChars(int srcBegin, int srcEnd, char dst[], int dstBegin) {...}
Dalvik: Ljava/lang/String;->getChars(II[CI)V

Java: public boolean equals(Object anObject) {...}
Dalvik: Ljava/lang/String;->equals(Ljava/lang/Object;)Z
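As a rough cross-check of the method description rule, the following Java sketch builds the Dalvik-style description of a method through reflection. MethodDescriptors, typeDescriptor and describe are names invented for this example; the sketch runs on a desktop JVM and only mirrors the notation, it does not read any dex file.

```java
import java.lang.reflect.Method;

// Sketch: build "owner->name(params)return" from a reflected method.
// All names in this class are illustrative, not part of any Android API.
final class MethodDescriptors {

    static String typeDescriptor(Class<?> c) {
        if (c.isArray()) return "[" + typeDescriptor(c.getComponentType());
        if (c == void.class) return "V";
        if (c == boolean.class) return "Z";
        if (c == byte.class) return "B";
        if (c == short.class) return "S";
        if (c == char.class) return "C";
        if (c == int.class) return "I";
        if (c == long.class) return "J";
        if (c == float.class) return "F";
        if (c == double.class) return "D";
        // Object type: L + fully qualified name with '/' separators + ';'
        return "L" + c.getName().replace('.', '/') + ";";
    }

    static String describe(Method m) {
        StringBuilder sb = new StringBuilder();
        sb.append(typeDescriptor(m.getDeclaringClass())).append("->");
        sb.append(m.getName()).append('(');
        for (Class<?> param : m.getParameterTypes()) {
            sb.append(typeDescriptor(param)); // parameters are concatenated, no separator
        }
        return sb.append(')').append(typeDescriptor(m.getReturnType())).toString();
    }

    // Convenience wrapper that looks up a public method by name and parameter types.
    static String describe(Class<?> owner, String name, Class<?>... params) {
        try {
            return describe(owner.getMethod(name, params));
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(String.class, "charAt", int.class));
        System.out.println(describe(String.class, "equals", Object.class));
    }
}
```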

Dalvik instruction set

The field and method descriptions above only tell us how a field or a method is named. To follow the logic inside a method, we need to understand Dalvik’s instruction set. Because Dalvik is register-based, its instruction set is very different from the JVM’s and is closer to x86 assembly instructions.

Data definition instructions

Data definition instructions define the constants, classes and other data used in the code. The basic instruction is const.

Instruction Description
const/4 vA,#+B Sign-extends the 4-bit literal to 32 bits and assigns it to register vA
const-wide/16 vAA,#+BBBB Sign-extends the 16-bit literal to 64 bits and assigns it to the register pair starting at vAA
const-string vAA,string@BBBB Gets a string reference by string index and assigns it to register vAA
const-class vAA,type@BBBB Gets a class reference by type index and assigns it to register vAA
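Sign extension is the part of the const instructions that most often trips people up, so here is a small Java sketch of what it means for const/4 and const-wide/16. The class and method names are hypothetical; the sketch models only the arithmetic, not the instruction decoding.

```java
// Sketch: the sign extension performed on const literals.
// ConstLiterals and its methods are illustrative names only.
final class ConstLiterals {

    // const/4: treat the low 4 bits as a signed value and widen to 32 bits.
    static int signExtend4(int nibble) {
        return (nibble << 28) >> 28; // arithmetic shift copies the sign bit back down
    }

    // const-wide/16: treat the low 16 bits as a signed value and widen to 64 bits.
    static long signExtend16To64(int bits) {
        return (long) (short) bits;
    }

    public static void main(String[] args) {
        System.out.println(signExtend4(0x7));
        System.out.println(signExtend4(0xF));
        System.out.println(signExtend16To64(0xFFFF));
    }
}
```

This is why const/4 can only load values from -8 to 7: a 4-bit literal such as 0xF becomes -1, not 15.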

Data manipulation instructions

The move instruction handles data movement. Its form is move destination,source: data is moved from the source register to the destination register, which corresponds to an assignment between variables in Java. Move instructions take different suffixes depending on the register width and the data type.

Instruction Description
move vA,vB Assigns the value of register vB to register vA; both register indices are 4 bits
move/from16 vAA,vBBBB Assigns the value of register vBBBB (16-bit index) to register vAA (8-bit index); from16 indicates that the source register index is 16 bits
move/16 vAAAA,vBBBB Assigns the value of register vBBBB to register vAAAA; 16 means that both the source and the destination register indices are 16 bits
move-object vA,vB Assigns the object reference in register vB to register vA; both register indices are 4 bits
move-result vAA Assigns the single-word (32-bit) non-object result of the previous invoke instruction (method call) to register vAA
move-result-wide vAA Assigns the double-word (64-bit) non-object result of the previous invoke instruction to register vAA
move-result-object vAA Assigns the object result of the previous invoke instruction to register vAA
move-exception vAA Saves the exception that was just caught to register vAA

Object operation instructions

These instructions operate on object instances: object creation, type checks and so on.

Instruction Description
new-instance vAA,type@BBBB Constructs an object of the specified type and assigns its reference to register vAA. Array objects are not covered by this instruction
instance-of vA,vB,type@CCCC Checks whether the object reference in register vB is of the specified type; if so, vA is set to 1, otherwise to 0
check-cast vAA,type@BBBB Casts the object reference in register vAA to the specified type. On success the result stays in vAA; otherwise a ClassCastException is thrown

Array manipulation instructions

The object instructions above cannot create arrays. Dalvik has dedicated instructions for array manipulation.

Instruction Description
new-array vA,vB,type@CCCC Creates an array of the specified type with the size given in register vB and assigns it to register vA
fill-array-data vAA,+BBBBBBBB Fills the array with the specified data. vAA holds the reference to the array

Data operation instructions

Data operations are mainly arithmetic operations and logical operations, along with shifts.

1. Arithmetic operation instructions

Instruction Description
add-type Addition
sub-type Subtraction
mul-type Multiplication
div-type Division
rem-type Remainder

2. Logical operation instructions

Instruction Description
and-type AND operation
or-type OR operation
xor-type XOR operation

3. Shift instructions

Instruction Description
shl-type Left shift
shr-type Signed (arithmetic) right shift
ushr-type Unsigned (logical) right shift

The type suffix above indicates the type of the data in the registers being operated on. It can be -int, -long, -float or -double; the logical and shift instructions exist only for -int and -long.
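The difference between the signed and unsigned right shift only shows up on negative values. The Java sketch below uses Java’s own shift operators as stand-ins for shl-int, shr-int and ushr-int; the wrapper names are hypothetical and chosen only to line up with the instruction names.

```java
// Sketch: Java operators standing in for the Dalvik int shift instructions.
// The method names are illustrative; they are not Dalvik or Android APIs.
final class Shifts {

    static int shlInt(int value, int distance) {
        return value << distance;  // left shift
    }

    static int shrInt(int value, int distance) {
        return value >> distance;  // signed right shift, the sign bit is copied in
    }

    static int ushrInt(int value, int distance) {
        return value >>> distance; // unsigned right shift, zeros are shifted in
    }

    public static void main(String[] args) {
        System.out.println(shrInt(-8, 1));
        System.out.println(Integer.toHexString(ushrInt(-8, 1)));
    }
}
```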

Comparison instructions

Comparison instructions compare the values in two registers. The basic format is cmp[kind]-type vAA,vBB,vCC. Type is the type of the data being compared, such as -long or -float. Kind selects the variant, which gives the cmpl, cmpg and cmp instructions. All of them store 0 in vAA if the value in vBB equals the value in vCC, 1 if vBB is greater, and -1 if vBB is less. cmpl (compare less) and cmpg (compare greater) exist for the floating-point types and differ only when an operand is NaN: cmpl then stores -1, and cmpg stores 1. cmp is used for long values, which have no NaN.

Instruction Description
cmpl-float vAA,vBB,vCC Compares two single-precision floating-point numbers. Stores 1 in vAA if the value in vBB is greater than that in vCC, 0 if they are equal, and -1 if it is less or if either value is NaN
cmpg-float vAA,vBB,vCC Compares two single-precision floating-point numbers. Stores 1 in vAA if the value in vBB is greater than that in vCC or if either value is NaN, 0 if they are equal, and -1 if it is less
cmpl-double vAA,vBB,vCC Compares two double-precision floating-point numbers, with the same results as cmpl-float
cmpg-double vAA,vBB,vCC Compares two double-precision floating-point numbers, with the same results as cmpg-float
cmp-long vAA,vBB,vCC Compares two long integers. Stores 1 if vBB is greater, 0 if equal, and -1 if less
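The following Java sketch models the result that cmpl-float and cmpg-float store in vAA, following the Dalvik bytecode reference: 0 for equal, 1 for greater, -1 for less, with the two variants differing only in how a NaN operand is treated. FloatCompare and its method names are hypothetical.

```java
// Sketch: result values of cmpl-float and cmpg-float for operands b (vBB) and c (vCC).
// FloatCompare is an illustrative name, not part of any real API.
final class FloatCompare {

    // cmpl-float: "less-than bias", a NaN operand yields -1.
    static int cmplFloat(float b, float c) {
        if (b > c) return 1;
        if (b == c) return 0;
        return -1; // b < c, or at least one operand is NaN
    }

    // cmpg-float: "greater-than bias", a NaN operand yields 1.
    static int cmpgFloat(float b, float c) {
        if (b < c) return -1;
        if (b == c) return 0;
        return 1; // b > c, or at least one operand is NaN
    }

    public static void main(String[] args) {
        System.out.println(cmplFloat(Float.NaN, 0f));
        System.out.println(cmpgFloat(Float.NaN, 0f));
    }
}
```

For ordinary numbers the two methods agree; a compiler picks one or the other so that a comparison involving NaN falls on the "false" side of the branch that follows.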

Field operation instructions

Field operation instructions set and get the fields of objects, much like the setter and getter methods you write in your own code. The basic instructions are iput-type, iget-type, sput-type and sget-type, where type indicates the data type.

Instance field read and write operations

The iput-type and iget-type instructions, prefixed with i, write and read instance fields.

Instruction Description
iget-byte vX,vY,field_id Reads the field_id field of the object in register vY and assigns its value to register vX
iput-byte vX,vY,field_id Sets the field_id field of the object in register vY to the value of register vX
iget-boolean vX,vY,field_id
iput-boolean vX,vY,field_id
iget-long vX,vY,field_id
iput-long vX,vY,field_id

Static field read and write operations

The sput-type and sget-type instructions, prefixed with s, write and read static fields. A static field belongs to the class, so these instructions take no object register.

Instruction Description
sget-byte vX,field_id Reads the static field field_id and assigns its value to register vX
sput-byte vX,field_id Sets the static field field_id to the value of register vX
sget-boolean vX,field_id
sput-boolean vX,field_id
sget-long vX,field_id
sput-long vX,field_id

Method call instructions

The method call instructions in Dalvik are largely similar to those of the JVM. There are currently five of them:

Instruction Description
invoke-direct {parameters},methodtocall Calls a direct method of an instance, that is, a private method or a constructor. Note that the first element in {} is the current instance object (this), followed by the actual arguments. The other instance calls use the same layout: in invoke-virtual {v3,v1,v4}, Test2.method5:(II)V, v3 is the current Test2 instance, while v1 and v4 are the method arguments
invoke-static {parameters},methodtocall Calls a static method; everything in {} is a method argument
invoke-super {parameters},methodtocall Calls a method of the parent class
invoke-virtual {parameters},methodtocall Calls a virtual method of an instance, typically a method declared public or protected
invoke-interface {parameters},methodtocall Calls an interface method

These five are the basic instructions. You will also encounter invoke-direct/range, invoke-static/range, invoke-super/range, invoke-virtual/range and invoke-interface/range. The only difference is that the /range form names a contiguous range of registers for the arguments, which is needed when a call takes more than five argument registers (for an instance method, this plus more than four parameters).

Again, the contents of {} for a non-static method are {current instance object, parameter 1, parameter 2, … parameter n}, and for a static method {parameter 1, parameter 2, … parameter n}.

Note that to use the return value of a method call, you need to fetch it with the move-result instructions described above.
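To connect the five instructions back to source code, here is a small Java sketch in which each call is annotated with the invoke instruction one would typically expect to see for it in the resulting smali. The class names are made up for the example, and the annotations are expectations rather than guaranteed output, since the exact instruction chosen depends on the compiler toolchain.

```java
// Sketch: Java calls and the invoke instruction each one typically maps to.
// Greeter, Base and InvokeKinds are hypothetical names used for illustration.
interface Greeter {
    String greet();
}

class Base {
    String name() {
        return "base";
    }
}

final class InvokeKinds extends Base implements Greeter {

    private String secret() {          // private method: a direct method
        return "secret";
    }

    static String util() {             // static method: no instance in {}
        return "util";
    }

    @Override
    String name() {
        return "derived";
    }

    @Override
    public String greet() {
        return "hello";
    }

    String demo() {
        Greeter g = this;
        return secret()                 // typically invoke-direct {p0}, ...
                + "," + util()          // typically invoke-static {}, ...
                + "," + super.name()    // typically invoke-super {p0}, ...
                + "," + name()          // typically invoke-virtual {p0}, ...
                + "," + g.greet();      // typically invoke-interface {vX}, ...
    }

    public static void main(String[] args) {
        System.out.println(new InvokeKinds().demo());
    }
}
```

Each call that returns a value would be followed by a move-result-object instruction in the smali, since the results here are all String references.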

Method return instructions

Dalvik also provides return instructions to return the result of a method:

Instruction Description
return-void Returns nothing
return vAA Returns a 32-bit value of a non-object type
return-wide vAA Returns a 64-bit value of a non-object type
return-object vAA Returns a reference to an object

Synchronization instructions

Synchronizing a sequence of instructions is usually expressed in Java with a synchronized block. The JVM supports the semantics of the synchronized keyword with the monitorenter and monitorexit instructions, and Dalvik provides two similar instructions to support synchronized semantics:

Instruction Description
monitor-enter vAA Acquires the lock of the specified object
monitor-exit vAA Releases the lock of the specified object

Exception instructions

Long ago, the JVM used the jsr and ret instructions to implement exception handling, but today’s JVMs have dropped that approach in favor of exception tables. Dalvik still provides an instruction for throwing an exception:

Instruction Description
throw vAA Throws the exception object referenced by register vAA

Jump instructions

Jump instructions move execution from the current address to a specified offset, mostly in if and switch branches. Dalvik provides the goto, packed-switch, sparse-switch and if-test instructions to implement jumps.

Instruction Description
goto +AA Unconditionally jumps to the specified offset (AA is the offset)
packed-switch vAA,+BBBBBBBB Switch branch instruction. The value in register vAA is the value being switched on, and BBBBBBBB is the offset of the jump table (packed-switch-payload)
sparse-switch vAA,+BBBBBBBB Switch branch instruction similar to packed-switch, except that BBBBBBBB is the offset of a sparse-switch-payload table
if-test vA,vB,+CCCC Conditional jump instruction. Compares the values in registers vA and vB and jumps to the specified offset (CCCC) if the condition is met. Test is the comparison rule, such as eq or lt

In conditional comparisons, the test in if-test is the comparison rule. These instructions are used a lot, so here is a brief rundown:

Instruction Description
if-eq vA,vB,target Jumps if the values in vA and vB are equal, equivalent to if (a == b) in Java. For example, if-eq v3,v10,002c jumps to current position + 002c if the condition holds. The rest are similar
if-ne vA,vB,target Equivalent to if (a != b) in Java
if-lt vA,vB,target Jumps if the value in vA is less than that in vB, equivalent to if (a < b) in Java
if-gt vA,vB,target Equivalent to if (a > b) in Java
if-ge vA,vB,target Equivalent to if (a >= b) in Java
if-le vA,vB,target Equivalent to if (a <= b) in Java

In addition to the instructions above, Dalvik provides zero-test conditional instructions that compare against 0. They can be understood as the instructions above with the value of vB fixed at 0.

Instruction Description
if-eqz vAA,target Equivalent to if (a == 0) or if (!a) in Java
if-nez vAA,target Equivalent to if (a != 0) or if (a) in Java
if-ltz vAA,target Equivalent to if (a < 0) in Java
if-gtz vAA,target Equivalent to if (a > 0) in Java
if-lez vAA,target Equivalent to if (a <= 0) in Java
if-gez vAA,target Equivalent to if (a >= 0) in Java

Returning to the two switch instructions: the only difference between the two offset tables (packed-switch-payload and sparse-switch-payload) is whether the case values are consecutive. The packed table covers one run of consecutive values, while the sparse table lists arbitrary values in sorted order. We will explain this in detail in a following section.
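The two switch tables can be modeled in Java as follows. The sketch assumes the payload layouts described in the Dalvik bytecode reference: a packed-switch-payload stores a first key plus one target per consecutive case value, and a sparse-switch-payload stores a sorted key list with a parallel target list. The class name, method names and the NO_MATCH marker are hypothetical.

```java
import java.util.Arrays;

// Sketch: how a branch target is found in each kind of switch table.
// SwitchTables, its methods and NO_MATCH are illustrative names only.
final class SwitchTables {

    // Marker meaning "no case matched, continue with the next instruction".
    static final int NO_MATCH = Integer.MIN_VALUE;

    // Packed table: case values are firstKey, firstKey + 1, ... so the
    // target is found by direct indexing.
    static int packedLookup(int firstKey, int[] targets, int value) {
        long index = (long) value - firstKey; // long avoids int overflow
        if (index < 0 || index >= targets.length) {
            return NO_MATCH;
        }
        return targets[(int) index];
    }

    // Sparse table: case values are arbitrary but sorted, so the target
    // is found by searching the key list.
    static int sparseLookup(int[] sortedKeys, int[] targets, int value) {
        int i = Arrays.binarySearch(sortedKeys, value);
        return i >= 0 ? targets[i] : NO_MATCH;
    }

    public static void main(String[] args) {
        System.out.println(packedLookup(10, new int[] {100, 200, 300}, 11));
        System.out.println(sparseLookup(new int[] {1, 50, 1000}, new int[] {7, 8, 9}, 50));
    }
}
```

A switch over 10, 11, 12 fits the packed form, while a switch over 1, 50, 1000 would waste most of a packed table and is better served by the sparse form.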

Data conversion instructions

Data type conversions are familiar to any Java developer: they convert a value from one data type to another. The basic instruction format is unop vA,vB, which means: operate on the value in register vB and save the result in register vA.

Instruction Description
int-to-long Converts an int to a long
float-to-int Converts a single-precision float to an int
int-to-byte Converts an int to a byte
neg-int Negation: computes the two’s complement of an int
not-int Bitwise NOT: inverts every bit of an int
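Each of these unary instructions has a direct counterpart in Java source, which the following sketch spells out. The wrapper names are hypothetical, and the sketch assumes the conversions behave like the corresponding Java casts.

```java
// Sketch: Java expressions corresponding to the unary instructions above.
// UnaryOps and its method names are illustrative only.
final class UnaryOps {

    static long intToLong(int v) {
        return v;          // widening, the sign is preserved
    }

    static int floatToInt(float v) {
        return (int) v;    // truncates toward zero
    }

    static int intToByte(int v) {
        return (byte) v;   // keeps the low 8 bits, then sign-extends
    }

    static int negInt(int v) {
        return -v;         // two's complement negation
    }

    static int notInt(int v) {
        return ~v;         // flips every bit
    }

    public static void main(String[] args) {
        System.out.println(intToByte(200));
        System.out.println(negInt(5));
        System.out.println(notInt(5));
    }
}
```

The last two are easy to confuse: for any int v, notInt(v) equals negInt(v) - 1.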

So far, we have given a brief explanation of the instructions in Dalvik. In structure and semantics, Dalvik’s instructions are very much a combination of x86 instructions and JVM instructions, so they are generally easy to learn. For more detail, refer to the Dalvik instruction set documentation.

The smali file in detail

We introduced Dalvik’s instructions above; now let’s take a look at the smali file. Although we write Android applications in Java, Dalvik does not load .class files directly. Instead, the class files are first converted and optimized into a dex file, which is what Dalvik loads. Therefore we cannot analyze an APK by looking at .class files. Instead, we decompile the dex file with the tool baksmali.jar to obtain the corresponding smali files. Smali files can be thought of as Dalvik bytecode files, but the two are not exactly the same.

Each .smali file is made up of Dalvik instructions and follows a certain structure. Smali has many directives that describe the corresponding Java file, all of which begin with ".". The common directives are:

Keyword Description
.field Defines a field
.method … .end method Defines a method
.annotation … .end annotation Defines an annotation
.implements Declares an implemented interface
.locals Specifies the number of local-variable registers in a method
.registers Specifies the total number of registers used in a method
.prologue Marks the beginning of the code in a method
.line Marks the corresponding line in the Java source file
.parameter Specifies a parameter of the method
.param Same meaning as .parameter, but in a different format

Many people are confused about .locals and .registers. If you are one of them, go back to the section on registers above.

Here is a brief description of the structure of the smali file:

1. File header description

The first three lines of the smali file describe the information for the current class:

.class <access modifier> [non-access modifier] <class name>
.super <parent class name>
.source <source file name>

Content in <> is required and content in [] is optional. The access modifiers are public, protected, private and default. The non-access modifiers are final and abstract. For example:

.class public final Lcom/sbbic/demo/Device;
.super Ljava/lang/Object;
.source "Device.java"

2. File body

After the file header comes the body of the file, that is, the body of the class, including its interface description, annotation description, field description and method description. Let’s look at each in turn (remember the field and method representations we covered for Dalvik).

Interface description

If the class implements an interface, the interface is declared with .implements, in the following format:

# interfaces
.implements <interface name>

For example:

.implements Landroid/view/View$OnClickListener;

Smali marks this section with the comment # interfaces.

Annotation description

If a class uses annotations, they are declared with .annotation, in the following format:

.annotation [annotation attributes] <annotation class name>
    [annotation field = value]
    ...
.end annotation

Field description

Smali uses .field to describe fields. As we know, Java has static fields (class attributes) and ordinary fields (instance attributes).

1. Ordinary fields:

# instance fields
.field <access modifier> [non-access modifier] <field name>:<field type>

You already know the access modifiers well; the non-access modifiers here are final, volatile and transient. For example:

.field private TAG:Ljava/lang/String;

2. Static fields add static to the definition of an ordinary field, as follows:

# static fields
.field <access modifier> static [non-access modifier] <field name>:<field type>

For example:

# static fields
.field private static final PI:F = 3.14f

Note: in a smali file, static fields and ordinary fields are marked with the comments # static fields and # instance fields respectively.

Method description

Smali uses .method to describe methods. Methods come in two kinds, direct methods and virtual methods.

Direct methods are what smali calls direct methods. Remember Dalvik’s invoke-direct instruction? If you have forgotten, turn back to that section; it is not explained again here. The definition format is as follows:

# direct methods
.method <access modifier> [non-access modifier] <method prototype>
    <.locals>
    [.parameter]
    [.prologue]
    [.line]
    <code logic>
.end method

A note on .parameter: there is one .parameter directive per method parameter, so a method with n parameters has n of them. Parameters are numbered from 1 by default: p1, p2, p3 and so on. Those familiar with Java will remember that an instance method takes an implicit argument referring to the current object; in smali, this implicit object argument is p0.

For example:

# direct methods
.method public constructor <init>()V
    .registers 2

    .prologue
    .line 8
    invoke-direct {p0}, Landroid/app/Activity;-><init>()V

    .line 10
    const-string v0, "MainActivity"
    iput-object v0, p0, Lcom/social_touch/demo/MainActivity;->TAG:Ljava/lang/String;

    .line 13
    const/4 v0, 0x0
    iput-boolean v0, p0, Lcom/social_touch/demo/MainActivity;->running:Z

    return-void
.end method

Note that smali marks this section with the comment # direct methods.

The only difference between virtual methods and direct methods is the comment, which is # virtual methods:

# virtual methods
.method <access modifier> [non-access modifier] <method prototype>
    <.locals>
    [.parameter1]
    [.parameter2]
    [.prologue]
    [.line]
    <code logic>
.end method

3. Smali file structure for inner classes

The smali file of an inner class is slightly different. An inner class gets its own smali file, named [outer class name]$[inner class name].smali, which is explained in more detail below.

4. Example demonstration

The structure of a smali file is very clear and easy to read once you are familiar with it. Let’s look at a simple smali file. To make it easier to understand, here is the Java code first:

public class MainActivity extends Activity implements View.OnClickListener {

    private String TAG = "MainActivity";
    private static final float PI = (float) 3.14;

    public volatile boolean running = false;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
    }

    @Override
    public void onClick(View view) {
        int result = add(4, 5);
        System.out.println(result);

        result = sub(9, 3);

        if (result > 4) {
            log(result);
        }
    }

    public int add(int x, int y) {
        return x + y;
    }

    public synchronized int sub(int x, int y) {
        return x + y;
    }

    public static void log(int result) {
        Log.d("MainActivity", "the result:" + result);
    }
}

Now let’s take a look at the decompiled smali file. Note that different decompiler tools may produce slightly different files, for example .param instead of .parameter, or no .registers directive, but the meaning is generally the same.

.class public Lcom/social_touch/demo/MainActivity;
.super Landroid/app/Activity;

# interfaces
# implements the View.OnClickListener interface
.implements Landroid/view/View$OnClickListener;

# static fields
# defines the float constant PI
.field private static final PI:F = 3.14f

# instance fields
# defines the String field TAG
.field private TAG:Ljava/lang/String;

# defines the volatile boolean field running
.field public volatile running:Z

# direct methods
# constructor
.method public constructor <init>()V
    .locals 1

    # call the <init>() method of Activity
    invoke-direct {p0}, Landroid/app/Activity;-><init>()V

    .line 10
    const-string v0, "MainActivity"
    iput-object v0, p0, Lcom/social_touch/demo/MainActivity;->TAG:Ljava/lang/String;

    .line 13
    const/4 v0, 0x0
    iput-boolean v0, p0, Lcom/social_touch/demo/MainActivity;->running:Z

    return-void
.end method

# static method log()
.method public static log(I)V
    .locals 3

    .prologue
    .line 42
    # v0 is set to "MainActivity"
    const-string v0, "MainActivity"

    # create a StringBuilder object and assign its reference to register v1
    new-instance v1, Ljava/lang/StringBuilder;
    invoke-direct {v1}, Ljava/lang/StringBuilder;-><init>()V

    # v2 is set to "the result:"
    const-string v2, "the result:"

    # in {v1, v2}, v1 holds the reference to the StringBuilder object and v2 is the argument
    invoke-virtual {v1, v2}, Ljava/lang/StringBuilder;->append(Ljava/lang/String;)Ljava/lang/StringBuilder;

    # store the result of append() in v1
    move-result-object v1

    # p0 is the int parameter of log()
    invoke-virtual {v1, p0}, Ljava/lang/StringBuilder;->append(I)Ljava/lang/StringBuilder;
    move-result-object v1

    # call toString() and store the resulting String in v1
    invoke-virtual {v1}, Ljava/lang/StringBuilder;->toString()Ljava/lang/String;
    move-result-object v1

    # call the static method d() of the Log class; because d() is static, {v0, v1} are all arguments
    invoke-static {v0, v1}, Landroid/util/Log;->d(Ljava/lang/String;Ljava/lang/String;)I

    return-void
.end method

# virtual methods
.method public add(II)I
    .locals 1
    .parameter "x"
    .parameter "y"

    .prologue
    .line 34
    # v0 = p1 + p2
    add-int v0, p1, p2

    return v0
.end method

.method public onClick(Landroid/view/View;)V
    .locals 4
    .parameter "view"

    .prologue
    # v3 = 4
    const/4 v3, 0x4

    # v1 = 5
    const/4 v1, 0x5

    # call add(); p0 is this, v3 and v1 are the arguments
    invoke-virtual {p0, v3, v1}, Lcom/social_touch/demo/MainActivity;->add(II)I

    # store the result of add() in register v0
    move-result v0

    # line 24 of the Java source file
    .line 24
    .local v0, result:I
    # v1 = System.out
    sget-object v1, Ljava/lang/System;->out:Ljava/io/PrintStream;

    invoke-virtual {v1, v0}, Ljava/io/PrintStream;->println(I)V

    .line 26
    # v1 = 9
    const/16 v1, 0x9

    # v2 = 3
    const/4 v2, 0x3

    # call sub(); p0 is this, v1 and v2 are the arguments
    invoke-virtual {p0, v1, v2}, Lcom/social_touch/demo/MainActivity;->sub(II)I

    # store the result of sub() in v0
    move-result v0

    .line 28
    # if v0 <= v3 (that is, result <= 4), skip the call to log()
    if-le v0, v3, :cond_0

    # call the static method log()
    invoke-static {v0}, Lcom/social_touch/demo/MainActivity;->log(I)V

    .line 31
    :cond_0
    return-void
.end method

.method protected onCreate(Landroid/os/Bundle;)V
    .locals 1
    .parameter "savedInstanceState"

    # call onCreate() of the parent class
    invoke-super {p0, p1}, Landroid/app/Activity;->onCreate(Landroid/os/Bundle;)V

    .line 18
    # v0 = the resource id R.layout.activity_main
    const v0, 0x7f04001a

    invoke-virtual {p0, v0}, Lcom/social_touch/demo/MainActivity;->setContentView(I)V

    .line 19
    return-void
.end method

# declared-synchronized marks the method as synchronized
.method public declared-synchronized sub(II)I
    .locals 1
    .parameter "x"
    .parameter "y"

    .prologue
    .line 38
    # lock the object p0 (this) for this method
    monitor-enter p0

    add-int v0, p1, p2

    # release the lock on p0
    monitor-exit p0

    return v0
.end method

Conclusion

I feel there are still many points I have not covered clearly; I will add them later.