A mind map of the DEX file structure and the parser source code can be found at the end of the article.

Previous articles in this series:

Class file format details

Smali: Hello World

Smali: Mathematical operations, conditionals, loops

Smali syntax parsing: classes

Android Reverse Engineering Notes: AndroidManifest.xml file format parsing

As covered in the first article of this series, on the Class file format, .java source files are compiled into .class files that the JVM can load. On Android, both Dalvik and ART differ considerably from the JVM. Instead of using Class files directly, Android merges all Class files into a DEX file, which is more compact than the separate Class files combined and can be executed directly by the Android runtime.

Understanding the DEX file structure is essential for learning about hotfix frameworks, app hardening, and reverse engineering. After parsing the Class file and AndroidManifest.xml structures, I have found that reading binary files is addictive. Later articles will continue with other files inside an APK, such as .so files and resources.arsc.

Generating a DEX file

Before parsing the DEX file structure, let's look at how to generate one. To keep the analysis simple, this article does not take a DEX file from a published app. Instead we build the simplest possible DEX file by hand, reusing the code from the Class file parsing article:

public class Hello {

    private static String HELLO_WORLD = "Hello World!";

    public static void main(String[] args) {
        System.out.println(HELLO_WORLD);
    }
}

First compile it to get the Hello.class file, then use the dx tool bundled with the SDK to generate the DEX file:

dx --dex --output=Hello.dex  Hello.class

The dx tool is located in the build-tools directory of the SDK and can be added to the PATH environment variable for convenience. dx can also generate a DEX file from multiple Class files.

DEX file structure

Overview

For learning the DEX file structure, I recommend two resources.

The first is the well-known DEX structure diagram from the Kanxue forum, drawn by Feichong.

The second is the definition of the DEX file format in the Android source code, dalvik/libdex/DexFile.h, which defines each part of a DEX file in detail.

For tooling, I recommend 010 Editor, which also appeared in the AndroidManifest.xml file format parsing article. It ships with a rich set of file templates for common file formats, so you can easily see each part of the file structure alongside its hexadecimal bytes. It is what I normally use when parsing file structures. Here is a screenshot of the Hello.dex file generated earlier, opened in 010 Editor:

The structure of the DEX file is visible at a glance, which makes this a very handy tool. Before going into detail, let's first divide the DEX file into rough layers, as shown below:

A detailed mind map is included at the end of the article, and you can read along with it.

Each part in turn:

  • header: the DEX file header, which records information about the file itself and the offsets of the other data structures in the file
  • string_ids: offsets of the string data
  • type_ids: type information
  • proto_ids: method prototypes (declarations)
  • field_ids: field information
  • method_ids: method information (class, prototype, and method name)
  • class_def: class definitions
  • data: the data area
  • link_data: data for statically linked files

The sections from string_ids to class_def are index tables made up of offsets and indices. They do not hold the actual data. The actual data lives in the data area and is located through those offsets. With this overview in mind, let's take a closer look at each part.

header

For the exact format of the DEX file header, see DexFile.h:

struct DexHeader {
    u1  magic[8];           // magic number
    u4  checksum;           // Adler-32 checksum
    u1  signature[kSHA1DigestLen]; // SHA-1 hash
    u4  fileSize;           // size of the DEX file
    u4  headerSize;         // size of the DEX file header
    u4  endianTag;          // byte order
    u4  linkSize;           // size of the link section
    u4  linkOff;            // offset of the link section
    u4  mapOff;             // offset of the DexMapList
    u4  stringIdsSize;      // number of DexStringId entries
    u4  stringIdsOff;       // offset of the DexStringId array
    u4  typeIdsSize;        // number of DexTypeId entries
    u4  typeIdsOff;         // offset of the DexTypeId array
    u4  protoIdsSize;       // number of DexProtoId entries
    u4  protoIdsOff;        // offset of the DexProtoId array
    u4  fieldIdsSize;       // number of DexFieldId entries
    u4  fieldIdsOff;        // offset of the DexFieldId array
    u4  methodIdsSize;      // number of DexMethodId entries
    u4  methodIdsOff;       // offset of the DexMethodId array
    u4  classDefsSize;      // number of DexClassDef entries
    u4  classDefsOff;       // offset of the DexClassDef array
    u4  dataSize;           // size of the data section
    u4  dataOff;            // offset of the data section
};

Where u represents an unsigned number, u1 is an 8-bit unsigned number, and u4 is a 32-bit unsigned number.

magic is a constant that identifies a DEX file. It breaks down as:

file identifier "dex" + newline + DEX version + 0

As a string this is dex\n035\0, or 64 65 78 0A 30 33 35 00 in hexadecimal.

checksum is the Adler-32 checksum of the whole file excluding magic and checksum itself. It is used to detect whether the DEX file has been corrupted or tampered with.

signature is the SHA-1 hash of the whole file excluding magic, checksum, and signature itself.

endianTag indicates whether the DEX file is stored big-endian or little-endian. DEX files running on Android are normally little-endian, in which case this field holds the constant 0x12345678.

The remaining fields record the number of entries in each of the other data structures and their offsets within the file. With these offsets we can easily locate the contents of each structure. Following the layout above, the first one is string_ids.
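The header rules above can be checked mechanically. The following is a minimal sketch, not the author's DexParser code: the class name DexHeaderSketch and its methods are made up for illustration, and the field offsets (checksum at 8, signature at 12, fileSize at 32, endianTag at 40, stringIdsSize at 56) are derived from the DexHeader layout shown earlier. It assumes a little-endian DEX file.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.zip.Adler32;

// Hypothetical helper for illustration; not part of the DexParser project.
final class DexHeaderSketch {
    static final int ENDIAN_CONSTANT = 0x12345678;

    /** Reads a little-endian u4 at the given file offset. */
    static int readU4(byte[] dex, int offset) {
        return ByteBuffer.wrap(dex).order(ByteOrder.LITTLE_ENDIAN).getInt(offset);
    }

    /** True if the file starts with "dex\n", three version bytes and a 0 byte. */
    static boolean hasDexMagic(byte[] dex) {
        return dex.length >= 8 && dex[0] == 'd' && dex[1] == 'e' && dex[2] == 'x'
                && dex[3] == '\n' && dex[7] == 0;
    }

    /** Adler-32 over everything after magic (8 bytes) and checksum (4 bytes). */
    static int computeChecksum(byte[] dex) {
        Adler32 adler = new Adler32();
        adler.update(dex, 12, dex.length - 12);
        return (int) adler.getValue();
    }

    /** SHA-1 over everything after magic, checksum and signature (8 + 4 + 20 bytes). */
    static byte[] computeSignature(byte[] dex) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            sha1.update(dex, 32, dex.length - 32);
            return sha1.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is unavailable", e);
        }
    }

    /** True if the stored checksum and signature match the recomputed ones. */
    static boolean isIntact(byte[] dex) {
        byte[] storedSignature = Arrays.copyOfRange(dex, 12, 32);
        return readU4(dex, 8) == computeChecksum(dex)
                && Arrays.equals(storedSignature, computeSignature(dex));
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java DexHeaderSketch.java <file.dex>");
            return;
        }
        byte[] dex = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("magic ok:      " + hasDexMagic(dex));
        System.out.println("little-endian: " + (readU4(dex, 40) == ENDIAN_CONSTANT));
        System.out.println("fileSize:      " + readU4(dex, 32));
        System.out.println("stringIdsSize: " + readU4(dex, 56));
        System.out.println("stringIdsOff:  " + readU4(dex, 60));
        System.out.println("intact:        " + isIntact(dex));
    }
}
```

Note the order implied by the two ranges: the signature covers only the bytes after itself, while the checksum also covers the signature, so a tool that rewrites a DEX file has to recompute the signature first and the checksum second.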

string_ids

struct DexStringId {
    u4 stringDataOff;
};

string_ids is an array of offsets. stringDataOff is the file offset of each string within the data section. There, each string starts with its length encoded as ULEB128 (a single byte for strings shorter than 128 characters, which is what the code below assumes), followed by the string contents in MUTF-8 and a terminating 0 byte. The logic is simple, so let's go straight to the code:

private void parseDexString() {
    log("\nparse DexString");
    try {
        int stringIdsSize = dex.getDexHeader().string_ids__size;
        for (int i = 0; i < stringIdsSize; i++) {
            int string_data_off = reader.readInt();
            // The first byte is the length of the string, followed by its contents
            byte size = dexData[string_data_off];
            String string_data = new String(Utils.copy(dexData, string_data_off + 1, size));
            DexString string = new DexString(string_data_off, string_data);
            dexStrings.add(string);
            log("string[%d] data: %s", i, string.string_data);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

The print result is as follows:

parse DexString
string[0] data: <clinit>
string[1] data: <init>
string[2] data: HELLO_WORLD
string[3] data: Hello World!
string[4] data: Hello.java
string[5] data: LHello;
string[6] data: Ljava/io/PrintStream;
string[7] data: Ljava/lang/Object;
string[8] data: Ljava/lang/String;
string[9] data: Ljava/lang/System;
string[10] data: V
string[11] data: VL
string[12] data: [Ljava/lang/String;
string[13] data: main
string[14] data: out
string[15] data: println

It contains variable names, method names, file names, and so on, and this string pool is often encountered later in parsing other structures.
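The parsing code above treats the length as a single byte, which only works for strings shorter than 128 characters. As a hedged alternative, here is a small sketch that skips a multi-byte ULEB128 length and relies on the terminating 0 byte instead. DexStringSketch and readStringData are illustrative names, and the sketch assumes the string has no embedded NUL or supplementary characters, so that its MUTF-8 bytes decode correctly as ordinary UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; the names are not from the original parser.
final class DexStringSketch {
    /**
     * Reads the string stored at the given offset of the data section.
     * Assumption: no embedded NUL or supplementary characters, so the
     * MUTF-8 bytes can be decoded as plain UTF-8.
     */
    static String readStringData(byte[] dex, int offset) {
        int pos = offset;
        // Skip the ULEB128 length: every byte except the last has its top bit set.
        while ((dex[pos++] & 0x80) != 0) {
            // keep skipping continuation bytes
        }
        int start = pos;
        // The character data is terminated by a single 0 byte.
        while (dex[pos] != 0) {
            pos++;
        }
        return new String(dex, start, pos - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] item = {0x0c, 'H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd', '!', 0};
        System.out.println(readStringData(item, 0)); // prints "Hello World!"
    }
}
```

A full parser would also decode the length and handle the MUTF-8 special cases. This sketch only shows how the item is delimited.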

type_ids

struct DexTypeId {
    u4  descriptorIdx;
};

type_ids holds type information. descriptorIdx is an index into string_ids, so each type can be resolved directly from the string pool read in the previous step:

private void parseDexType() {
    log("\nparse DexTypeId");
    try {
        int typeIdsSize = dex.getDexHeader().type_ids__size;
        for (int i = 0; i < typeIdsSize; i++) {
            int descriptor_idx = reader.readInt();
            DexTypeId dexTypeId = new DexTypeId(descriptor_idx, dexStringIds.get(descriptor_idx).string_data);
            dexTypeIds.add(dexTypeId);
            log("type[%d] data: %s", i, dexTypeId.string_data);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Analysis results:

parse DexType
type[0] data: LHello;
type[1] data: Ljava/io/PrintStream;
type[2] data: Ljava/lang/Object;
type[3] data: Ljava/lang/String;
type[4] data: Ljava/lang/System;
type[5] data: V
type[6] data: [Ljava/lang/String;
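The type strings above are descriptors: L...; is a class, a leading [ is an array dimension, and single letters such as V are primitives or void. To make them easier to read, here is a minimal sketch that converts a descriptor into a Java-style type name. DescriptorSketch and toJavaName are illustrative names, not part of the original parser.

```java
// Hypothetical helper for illustration only.
final class DescriptorSketch {
    /** Converts a type descriptor such as "[Ljava/lang/String;" to "java.lang.String[]". */
    static String toJavaName(String descriptor) {
        int dims = 0;
        while (descriptor.charAt(dims) == '[') {
            dims++; // each leading '[' is one array dimension
        }
        String element = descriptor.substring(dims);
        String name;
        switch (element.charAt(0)) {
            case 'V': name = "void"; break;
            case 'Z': name = "boolean"; break;
            case 'B': name = "byte"; break;
            case 'S': name = "short"; break;
            case 'C': name = "char"; break;
            case 'I': name = "int"; break;
            case 'J': name = "long"; break;
            case 'F': name = "float"; break;
            case 'D': name = "double"; break;
            case 'L': // Lpackage/Name; -> package.Name
                name = element.substring(1, element.length() - 1).replace('/', '.');
                break;
            default:
                throw new IllegalArgumentException("Unknown descriptor: " + descriptor);
        }
        StringBuilder sb = new StringBuilder(name);
        for (int i = 0; i < dims; i++) {
            sb.append("[]");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (String d : new String[] {"LHello;", "V", "[Ljava/lang/String;"}) {
            System.out.println(d + " -> " + toJavaName(d));
        }
    }
}
```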

proto_ids

struct DexProtoId {
    u4  shortyIdx;          /* index into stringIds for shorty descriptor */
    u4  returnTypeIdx;      /* index into typeIds list for return type */
    u4  parametersOff;      /* file offset to type_list for parameter types */
};

proto_ids holds method prototype (declaration) information. Each entry has three members:

  • shortyIdx: index into string_ids; the "shorty" descriptor, a compact string giving the return type followed by the parameter types
  • returnTypeIdx: index into type_ids; the return type of the method
  • parametersOff: file offset of the method's parameter list, or 0 if the method has no parameters

The parameter list is represented by DexTypeList in DexFile.h:

struct DexTypeList {
    u4  size;               /* #of entries in list */
    DexTypeItem list[1];    /* entries */
};

struct DexTypeItem {
    u2  typeIdx;            /* index into typeIds */
};

size is the number of parameters. Each parameter is a DexTypeItem, which has a single member, typeIdx, an index into type_ids. The parsing code is as follows:

private void parseDexProto() {
    log("\nparse DexProto");
    try {
        int protoIdsSize = dex.getDexHeader().proto_ids__size;
        for (int i = 0; i < protoIdsSize; i++) {
            int shorty_idx = reader.readInt();
            int return_type_idx = reader.readInt();
            int parameters_off = reader.readInt();

            DexProtoId dexProtoId = new DexProtoId(shorty_idx, return_type_idx, parameters_off);
            log("proto[%d]: %s %s %d", i, dexStringIds.get(shorty_idx).string_data,
                    dexTypeIds.get(return_type_idx).string_data, parameters_off);

            // An offset of 0 means the method has no parameters
            if (parameters_off > 0) {
                parseDexProtoParameters(parameters_off);
            }
            dexProtos.add(dexProtoId);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Analysis results:

parse DexProto
proto[0]: V V 0
proto[1]: VL V 412
parameters[0]: Ljava/lang/String;
proto[2]: VL V 420
parameters[0]: [Ljava/lang/String;
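The body of parseDexProtoParameters is not shown above. As a rough sketch of what reading a DexTypeList involves (a u4 size followed by size u2 type indices), consider the following. TypeListSketch and readTypeList are hypothetical names, and the sketch returns raw indices into type_ids rather than resolved type strings.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper; not the author's parseDexProtoParameters implementation.
final class TypeListSketch {
    /** Returns the type_ids indices stored in the type list at parametersOff. */
    static int[] readTypeList(byte[] dex, int parametersOff) {
        if (parametersOff == 0) {
            return new int[0]; // offset 0 means "no parameters"
        }
        ByteBuffer buf = ByteBuffer.wrap(dex).order(ByteOrder.LITTLE_ENDIAN);
        int size = buf.getInt(parametersOff);            // u4 size
        int[] typeIdx = new int[size];
        for (int i = 0; i < size; i++) {
            // each DexTypeItem is a single u2 typeIdx
            typeIdx[i] = buf.getShort(parametersOff + 4 + i * 2) & 0xffff;
        }
        return typeIdx;
    }

    public static void main(String[] args) {
        // Made-up bytes: 4 bytes of padding, then a list with one entry, type index 6.
        byte[] data = {0, 0, 0, 0, 1, 0, 0, 0, 6, 0};
        System.out.println(readTypeList(data, 4)[0]); // prints 6
    }
}
```

In the results above, index 6 in type_ids is [Ljava/lang/String;, which is the single parameter of main.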

field_ids

struct DexFieldId {
    u2  classIdx;           /* index into typeIds list for defining class */
    u2  typeIdx;            /* index into typeIds for field type */
    u4  nameIdx;            /* index into stringIds for field name */
};

field_ids holds field information: the class a field belongs to, its type, and its name. In DexFile.h each entry is defined as DexFieldId, whose members mean:

  • classIdx: index into type_ids; the class that declares the field
  • typeIdx: index into type_ids; the type of the field
  • nameIdx: index into string_ids; the name of the field

The parsing code is straightforward, so I will not post it here. The results are:

parse DexField
field[0]: LHello;->HELLO_WORLD;Ljava/lang/String;
field[1]: Ljava/lang/System;->out;Ljava/io/PrintStream;
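Since the original field parsing code is omitted, here is a minimal sketch of what it could look like, based only on the DexFieldId layout (u2 classIdx, u2 typeIdx, u4 nameIdx, 8 bytes per entry). FieldIdSketch is a hypothetical name, the two pools are copied from the parsing results earlier in the article, and the output uses the Smali-style class->name:type form.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper; not the author's code.
final class FieldIdSketch {
    static final int FIELD_ID_ITEM_SIZE = 8;

    // Pools copied from the string_ids and type_ids results shown earlier.
    static final String[] STRINGS = {"<clinit>", "<init>", "HELLO_WORLD", "Hello World!",
        "Hello.java", "LHello;", "Ljava/io/PrintStream;", "Ljava/lang/Object;",
        "Ljava/lang/String;", "Ljava/lang/System;", "V", "VL", "[Ljava/lang/String;",
        "main", "out", "println"};
    static final String[] TYPES = {"LHello;", "Ljava/io/PrintStream;", "Ljava/lang/Object;",
        "Ljava/lang/String;", "Ljava/lang/System;", "V", "[Ljava/lang/String;"};

    /** Formats the field_ids entry at the given index as class->name:type. */
    static String describeField(byte[] dex, int fieldIdsOff, int index) {
        ByteBuffer buf = ByteBuffer.wrap(dex).order(ByteOrder.LITTLE_ENDIAN);
        int base = fieldIdsOff + index * FIELD_ID_ITEM_SIZE;
        int classIdx = buf.getShort(base) & 0xffff;     // u2, index into type_ids
        int typeIdx = buf.getShort(base + 2) & 0xffff;  // u2, index into type_ids
        int nameIdx = buf.getInt(base + 4);             // u4, index into string_ids
        return TYPES[classIdx] + "->" + STRINGS[nameIdx] + ":" + TYPES[typeIdx];
    }

    public static void main(String[] args) {
        // Bytes reconstructed from the indices implied by the results, not dumped from a file.
        byte[] fieldIds = {0, 0, 3, 0, 2, 0, 0, 0,   4, 0, 1, 0, 14, 0, 0, 0};
        System.out.println(describeField(fieldIds, 0, 0));
        System.out.println(describeField(fieldIds, 0, 1));
    }
}
```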

method_ids

struct DexMethodId {
    u2  classIdx;           /* index into typeIds list for defining class */
    u2  protoIdx;           /* index into protoIds for method prototype */
    u4  nameIdx;            /* index into stringIds for method name */
};

method_ids holds method information: the class a method belongs to, its prototype, and its name. In DexFile.h each entry is defined as DexMethodId, whose members mean:

  • classIdx: index into type_ids; the class that declares the method
  • protoIdx: index into proto_ids; the method prototype
  • nameIdx: index into string_ids; the method name

Analysis results:

parse DexMethod
method[0]: LHello; proto[0] <clinit>
method[1]: LHello; proto[0] <init>
method[2]: LHello; proto[2] main
method[3]: Ljava/io/PrintStream; proto[1] println
method[4]: Ljava/lang/Object; proto[0] <init>
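A method entry only becomes readable once its class, name, and prototype are combined. The sketch below reads one DexMethodId (u2 classIdx, u2 protoIdx, u4 nameIdx) and assembles a Smali-style signature from already resolved strings. MethodIdSketch and both method names are illustrative assumptions, not the author's code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration only.
final class MethodIdSketch {
    static final int METHOD_ID_ITEM_SIZE = 8;

    /** Returns {classIdx, protoIdx, nameIdx} for the method_ids entry at the given index. */
    static int[] readMethodId(byte[] dex, int methodIdsOff, int index) {
        ByteBuffer buf = ByteBuffer.wrap(dex).order(ByteOrder.LITTLE_ENDIAN);
        int base = methodIdsOff + index * METHOD_ID_ITEM_SIZE;
        return new int[] {
            buf.getShort(base) & 0xffff,      // u2 classIdx -> type_ids
            buf.getShort(base + 2) & 0xffff,  // u2 protoIdx -> proto_ids
            buf.getInt(base + 4)              // u4 nameIdx  -> string_ids
        };
    }

    /** Builds class->name(params)returnType from resolved descriptor strings. */
    static String signature(String definingClass, String name, String returnType,
                            String... parameterTypes) {
        StringBuilder sb = new StringBuilder(definingClass).append("->").append(name).append('(');
        for (String p : parameterTypes) {
            sb.append(p);
        }
        return sb.append(')').append(returnType).toString();
    }

    public static void main(String[] args) {
        // method[3] from the results: class type[1], proto[1], name string[15].
        byte[] item = {0x01, 0x00, 0x01, 0x00, 0x0f, 0x00, 0x00, 0x00};
        int[] ids = readMethodId(item, 0, 0);
        System.out.println(ids[0] + " " + ids[1] + " " + ids[2]);
        System.out.println(signature("Ljava/io/PrintStream;", "println", "V", "Ljava/lang/String;"));
    }
}
```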

class_def

struct DexClassDef {
    u4  classIdx;           /* index into typeIds for this class */
    u4  accessFlags;
    u4  superclassIdx;      /* index into typeIds for superclass */
    u4  interfacesOff;      /* file offset to DexTypeList */
    u4  sourceFileIdx;      /* index into stringIds for source file name */
    u4  annotationsOff;     /* file offset to annotations_directory_item */
    u4  classDataOff;       /* file offset to class_data_item */
    u4  staticValuesOff;    /* file offset to DexEncodedArray */
};

class_def is the most complex and most important part of the DEX file structure. It describes everything about a class and corresponds to DexClassDef in DexFile.h:

  • classIdx: index into type_ids; the class itself
  • accessFlags: access flags
  • superclassIdx: index into type_ids; the superclass
  • interfacesOff: file offset of a DexTypeList holding the implemented interfaces
  • sourceFileIdx: index into string_ids; the source file name
  • annotationsOff: file offset of the annotation information
  • classDataOff: file offset of a DexClassData, the data portion of the class
  • staticValuesOff: file offset of a DexEncodedArray holding the initial values of static fields
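accessFlags is a bit mask. As a small illustration, the sketch below decodes the flags that can apply to a class, using the standard flag values (public 0x1, private 0x2, protected 0x4, static 0x8, final 0x10, interface 0x200, abstract 0x400, synthetic 0x1000, annotation 0x2000, enum 0x4000). AccessFlagsSketch is a hypothetical name, and the sketch deliberately ignores the flags that only apply to fields and methods.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
final class AccessFlagsSketch {
    private static final int[] MASKS = {
        0x1, 0x2, 0x4, 0x8, 0x10, 0x200, 0x400, 0x1000, 0x2000, 0x4000};
    private static final String[] NAMES = {
        "public", "private", "protected", "static", "final",
        "interface", "abstract", "synthetic", "annotation", "enum"};

    /** Returns the names of all class-level flags set in accessFlags, space separated. */
    static String describeClassFlags(int accessFlags) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < MASKS.length; i++) {
            if ((accessFlags & MASKS[i]) != 0) {
                parts.add(NAMES[i]);
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        System.out.println(describeClassFlags(0x1));   // prints "public"
        System.out.println(describeClassFlags(0x601)); // prints "public interface abstract"
    }
}
```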

DexClassData

classDataOff points to the core data of a class, which the Android source code defines as DexClassData. It is not in DexFile.h but in DexClass.h:

struct DexClassData {
    DexClassDataHeader header;
    DexField*          staticFields;
    DexField*          instanceFields;
    DexMethod*         directMethods;
    DexMethod*         virtualMethods;
};

DexClassDataHeader records the number of fields and methods in the class. It is also defined in DexClass.h:

struct DexClassDataHeader {
    u4 staticFieldsSize;
    u4 instanceFieldsSize;
    u4 directMethodsSize;
    u4 virtualMethodsSize;
};

  • staticFieldsSize: number of static fields
  • instanceFieldsSize: number of instance fields
  • directMethodsSize: number of direct methods
  • virtualMethodsSize: number of virtual methods

Note that in the file these values are not fixed-width u4 fields: they are stored as ULEB128, the unsigned form of LEB128 (there is also a signed form). LEB128 is a variable-length encoding in which a value takes 1 to 5 bytes and each byte carries only 7 bits of payload. If the top bit of a byte is 1, the value continues in the next byte; the last byte has its top bit set to 0. A 32-bit value therefore needs at most 5 bytes.

Why use such an encoding? An int in Java always occupies 4 bytes (32 bits), but most values are small and do not need all 4 bytes. A variable-length encoding saves space, which is valuable on a resource-constrained platform like Android. The Java code for reading a ULEB128 value is:

public static int readUnsignedLeb128(byte[] src, int offset) {
    int result = 0;
    int count = 0;
    int cur;
    do {
        cur = copy(src, offset, 1)[0];
        cur &= 0xff;
        // each byte contributes its low 7 bits, least significant group first
        result |= (cur & 0x7f) << (count * 7);
        count++;
        offset++;
        DexParser.POSITION++;
    } while ((cur & 0x80) == 128 && count < 5);
    return result;
}
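To see the encoding at work on concrete bytes, here is a self-contained variant with a worked example. It is a sketch under its own assumptions: Uleb128Sketch is an illustrative class that does not depend on the parser's copy helper or DexParser.POSITION. The bytes E5 8E 26 carry the 7-bit groups 0x65, 0x0E and 0x26, so the value is 0x65 + (0x0E << 7) + (0x26 << 14) = 101 + 1792 + 622592 = 624485.

```java
// Hypothetical standalone helper for illustration only.
final class Uleb128Sketch {
    /** Decodes the ULEB128 value that starts at offset. */
    static int decode(byte[] src, int offset) {
        int result = 0;
        int shift = 0;
        int cur;
        do {
            cur = src[offset++] & 0xff;
            result |= (cur & 0x7f) << shift; // low 7 bits are payload
            shift += 7;
        } while ((cur & 0x80) != 0 && shift < 35); // at most 5 bytes
        return result;
    }

    /** Returns how many bytes the ULEB128 value at offset occupies (1 to 5). */
    static int encodedLength(byte[] src, int offset) {
        int length = 1;
        while ((src[offset + length - 1] & 0x80) != 0 && length < 5) {
            length++;
        }
        return length;
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xE5, (byte) 0x8E, 0x26};
        System.out.println(decode(bytes, 0));        // prints 624485
        System.out.println(encodedLength(bytes, 0)); // prints 3
    }
}
```

Returning the length separately lets a caller advance its own cursor instead of relying on a shared position counter.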

So let’s go back to DexClassData. The header section defines the number of fields and methods, followed by the specific data for static fields, instance fields, direct methods, and virtual methods. Fields are represented by DexField and methods by DexMethod.

DexField

struct DexField {
    u4 fieldIdx;    /* index to a field_id_item */
    u4 accessFlags;
};

  • fieldIdx: index into field_ids, identifying the field
  • accessFlags: access flags

DexMethod

struct DexMethod {
    u4 methodIdx;    /* index to a method_id_item */
    u4 accessFlags;
    u4 codeOff;      /* file offset to a code_item */
};
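One detail is easy to miss when reading these entries from the file. DexField and DexMethod are the in-memory forms. In the file, every member is stored as ULEB128, and the index of each entry is stored as the difference from the previous entry in the same list (the first entry stores its index directly). The following is a sketch of reading a list of encoded methods with that rule. EncodedMethodSketch is a hypothetical name, and the byte sequence in main is made up for illustration rather than dumped from Hello.dex.

```java
// Hypothetical helper for illustration only.
final class EncodedMethodSketch {
    private final byte[] data;
    private int pos;

    EncodedMethodSketch(byte[] data, int offset) {
        this.data = data;
        this.pos = offset;
    }

    /** Reads one ULEB128 value and advances the cursor. */
    int readUleb128() {
        int result = 0;
        int shift = 0;
        int cur;
        do {
            cur = data[pos++] & 0xff;
            result |= (cur & 0x7f) << shift;
            shift += 7;
        } while ((cur & 0x80) != 0 && shift < 35);
        return result;
    }

    /** Returns one row of {methodIdx, accessFlags, codeOff} per encoded method. */
    int[][] readEncodedMethods(int count) {
        int[][] rows = new int[count][];
        int methodIdx = 0;
        for (int i = 0; i < count; i++) {
            methodIdx += readUleb128();      // stored as a difference from the previous index
            int accessFlags = readUleb128();
            int codeOff = readUleb128();
            rows[i] = new int[] {methodIdx, accessFlags, codeOff};
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up data: three methods with index differences 0, 1, 1.
        byte[] bytes = {0x00, (byte) 0x88, (byte) 0x80, 0x04, 0x20,
                        0x01, (byte) 0x81, (byte) 0x80, 0x04, 0x40,
                        0x01, 0x09, 0x60};
        for (int[] row : new EncodedMethodSketch(bytes, 0).readEncodedMethods(3)) {
            System.out.printf("method[%d] flags=0x%x codeOff=0x%x%n", row[0], row[1], row[2]);
        }
    }
}
```

Forgetting to accumulate the differences is a common bug: every method after the first would resolve to the wrong method_ids entry.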

methodIdx is an index into method_ids and identifies the method. accessFlags holds the method's access flags. codeOff is the file offset of a DexCode structure. If you have read this far, you may have realized that the DEX file must contain the code, that is, the instructions, of the main method in Hello.java. Indeed, DexCode stores the details of a method together with its instructions.

struct DexCode {
    u2  registersSize;  // number of registers used by the method
    u2  insSize;        // number of words of incoming arguments
    u2  outsSize;       // number of words of argument space needed to call other methods
    u2  triesSize;      // number of try_items
    u4  debugInfoOff;   // offset of the debug info
    u4  insnsSize;      // size of the instruction array, in 2-byte code units
    u2  insns[1];       // the instructions
    /* followed by optional u2 padding */  // 2 bytes for alignment
    /* followed by try_item[triesSize] */
    /* followed by uleb128 handlersSize */
    /* followed by catch_handler_item[handlersSize] */
};

Open 010 Editor, locate the DexCode of the main() method, and analyze it side by side with the source:

public class Hello {

    private static String HELLO_WORLD = "Hello World!";

    public static void main(String[] args) {
        System.out.println(HELLO_WORLD);
    }
}

The hexadecimal representation of the DexCode corresponding to the main() method is:

03 00 01 00 02 00 00 00 79 02 00 00 08 00 00 00
62 00 01 00 62 01 00 00 6E 20 03 00 10 00 0E 00

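Reading the first line field by field: registersSize is 3, so the method uses three registers. insSize is 1, for the single argument String[] args of main(). outsSize is 2, the number of registers needed to pass arguments when calling other methods. triesSize is 0, since there is no try/catch. debugInfoOff is 0x279. insnsSize is 8, meaning the instruction array is 8 two-byte code units (16 bytes) long.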
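The field values read from the hex dump can be double-checked with a few lines of Java. This is only a sketch: CodeItemHeaderSketch and its hex helper are illustrative names, and the input is the first line of the dump above rather than a real file.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration only.
final class CodeItemHeaderSketch {
    final int registersSize;
    final int insSize;
    final int outsSize;
    final int triesSize;
    final int debugInfoOff;
    final int insnsSize;

    /** Reads the fixed 16-byte part of a DexCode structure at codeOff. */
    CodeItemHeaderSketch(byte[] dex, int codeOff) {
        ByteBuffer buf = ByteBuffer.wrap(dex).order(ByteOrder.LITTLE_ENDIAN);
        registersSize = buf.getShort(codeOff) & 0xffff;
        insSize = buf.getShort(codeOff + 2) & 0xffff;
        outsSize = buf.getShort(codeOff + 4) & 0xffff;
        triesSize = buf.getShort(codeOff + 6) & 0xffff;
        debugInfoOff = buf.getInt(codeOff + 8);
        insnsSize = buf.getInt(codeOff + 12);
    }

    /** Turns a string such as "03 00 01" into the bytes it spells out. */
    static byte[] hex(String text) {
        String[] tokens = text.trim().split("\\s+");
        byte[] out = new byte[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            out[i] = (byte) Integer.parseInt(tokens[i], 16);
        }
        return out;
    }

    public static void main(String[] args) {
        CodeItemHeaderSketch h = new CodeItemHeaderSketch(
                hex("03 00 01 00 02 00 00 00 79 02 00 00 08 00 00 00"), 0);
        // prints: registers=3 ins=1 outs=2 tries=0 debugInfoOff=0x279 insnsSize=8
        System.out.printf("registers=%d ins=%d outs=%d tries=%d debugInfoOff=0x%x insnsSize=%d%n",
                h.registersSize, h.insSize, h.outsSize, h.triesSize, h.debugInfoOff, h.insnsSize);
    }
}
```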

Finally, the instructions. The instructions of main() occupy 8 code units, which is the second line of the hex dump above. Let's try to decode them. The Android developer site documents the Dalvik bytecode and its instruction formats.

The first instruction is 62 00 01 00. According to the documentation, opcode 62 is sget-object vAA, field@BBBB. AA is 00, meaning register v0. BBBB is 01 00, meaning the field at index 1 in field_ids, which according to the earlier results is Ljava/lang/System;->out;Ljava/io/PrintStream;. So 62 00 01 00 decodes to:

sget-object v0, Ljava/lang/System;->out:Ljava/io/PrintStream;

Next comes 62 01 00 00. This is again sget-object vAA, field@BBBB, with AA being 01 and BBBB being 00 00: it uses register v1 and the field at index 0 in field_ids, namely LHello;->HELLO_WORLD;Ljava/lang/String;. The complete instruction is:

sget-object v1, LHello;->HELLO_WORLD:Ljava/lang/String;

Next comes 6E 20 03 00 10 00. Opcode 6E is invoke-virtual {vC, vD, vE, vF, vG}, meth@BBBB. In the byte after 6E, the high hexadecimal digit 2 means the call passes two registers. BBBB is 03 00, the method at index 3 in method_ids, which according to the earlier results is Ljava/io/PrintStream;->println(Ljava/lang/String;)V. The last two bytes, 10 00, hold the register numbers, one hexadecimal digit each: vC is 0 and vD is 1. The complete instruction is:

invoke-virtual {v0, v1}, Ljava/io/PrintStream;->println(Ljava/lang/String;)V

Finally there is 0E 00. According to the documentation, opcode 0E is return-void, which ends the main() method.

Putting the instructions together:

62 00 01 00       : sget-object v0, Ljava/lang/System;->out:Ljava/io/PrintStream;
62 01 00 00       : sget-object v1, LHello;->HELLO_WORLD:Ljava/lang/String;
6E 20 03 00 10 00 : invoke-virtual {v0, v1}, Ljava/io/PrintStream;->println(Ljava/lang/String;)V
0E 00             : return-void

These are the complete instructions of the main() method. Recall my earlier article, Smali: Hello World: the result matches the Smali code for Hello.java:

.method public static main([Ljava/lang/String;)V
    .registers 3

    .prologue
    .line 6
    sget-object v0, Ljava/lang/System;->out:Ljava/io/PrintStream;

    sget-object v1, LHello;->HELLO_WORLD:Ljava/lang/String;

    invoke-virtual {v0, v1}, Ljava/io/PrintStream;->println(Ljava/lang/String;)V

    .line 7
    return-void
.end method
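The manual decoding of main() can also be reproduced in code. The sketch below is a toy disassembler that understands only the three opcodes that appear in this method (0x0e return-void, 0x62 sget-object, 0x6e invoke-virtual) and prints raw field and method indices instead of resolving them. MiniDisassemblerSketch is a hypothetical name, and the input is the eight little-endian code units from the hex dump.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy disassembler; it covers only the opcodes used by main().
final class MiniDisassemblerSketch {
    /** Decodes 16-bit code units (already converted from little-endian bytes). */
    static List<String> disassemble(int[] units) {
        List<String> out = new ArrayList<>();
        int pc = 0;
        while (pc < units.length) {
            int opcode = units[pc] & 0xff; // low byte of the first unit
            int high = units[pc] >> 8;     // AA, or A|G for invoke-virtual
            switch (opcode) {
                case 0x0e: // return-void, 1 unit
                    out.add("return-void");
                    pc += 1;
                    break;
                case 0x62: // sget-object vAA, field@BBBB, 2 units
                    out.add("sget-object v" + high + ", field@" + units[pc + 1]);
                    pc += 2;
                    break;
                case 0x6e: { // invoke-virtual {vC..vG}, meth@BBBB, 3 units
                    int argCount = high >> 4;
                    int regs = units[pc + 2];
                    int[] all = {regs & 0xf, (regs >> 4) & 0xf, (regs >> 8) & 0xf,
                                 (regs >> 12) & 0xf, high & 0xf};
                    StringBuilder sb = new StringBuilder("invoke-virtual {");
                    for (int i = 0; i < argCount; i++) {
                        if (i > 0) {
                            sb.append(", ");
                        }
                        sb.append('v').append(all[i]);
                    }
                    out.add(sb.append("}, meth@").append(units[pc + 1]).toString());
                    pc += 3;
                    break;
                }
                default:
                    throw new IllegalArgumentException("Unsupported opcode 0x"
                            + Integer.toHexString(opcode));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 62 00 01 00 62 01 00 00 6E 20 03 00 10 00 0E 00 as little-endian code units
        int[] units = {0x0062, 0x0001, 0x0162, 0x0000, 0x206e, 0x0003, 0x0010, 0x000e};
        disassemble(units).forEach(System.out::println);
    }
}
```

Looking up field@1, field@0 and meth@3 in the field_ids and method_ids results gives exactly the Smali listing above.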

Conclusion

Articles like this are long and dry, but if you push through, there is a lot to gain. To wrap up, here is the mind map:

The Java source code of this DEX file format parser is available here: DexParser

This article was first published on the WeChat official account "Bingxin said", which focuses on original Java and Android content and LeetCode solutions.
