For the DEX file structure mind map and the parser source code, see the end of the article.
Previous articles in this series:
Class file format details
Smali: Hello World
Smali — Mathematical operations, conditional judgment, loops
Smali syntax parsing — classes
Android Reverse Engineering Notes: AndroidManifest.xml file format parsing
As we saw in the first article of this series, which examined the Class file format, .java source files are compiled into .class files that the JVM can run. On Android, however, both Dalvik and ART differ considerably from the JVM. Instead of using Class files directly, Android merges all Class files into a DEX file, which is more compact than the separate Class files and can be executed directly by the Android runtime.
Understanding the DEX file structure is necessary for learning about hotfix frameworks, app hardening, and reverse engineering. After parsing the Class file and AndroidManifest.xml file structures, I found that reading binary files is addictive. Later articles will continue with other files inside an APK, such as .so files and resources.arsc.
Generating a DEX file
Before parsing the DEX file structure, let's look at how to generate a DEX file. To keep the analysis simple, this article does not take a DEX file from a published app; instead we generate the simplest possible DEX file by hand. The source is the same Hello.java used in the Class file parsing article:
public class Hello {
    private static String HELLO_WORLD = "Hello World!";

    public static void main(String[] args) {
        System.out.println(HELLO_WORLD);
    }
}
First compile it with javac to get Hello.class, and then use the dx tool bundled with the SDK to generate the DEX file:
dx --dex --output=Hello.dex Hello.class
The dx tool is located in the build-tools directory of the SDK; you can add it to your PATH environment variable for convenience. dx can also combine multiple Class files into a single DEX file.
DEX file structure
Overview
For learning the DEX file structure, I recommend two resources.
The first is the well-known DEX format diagram from the Kanxue forum, drawn by Feichong.
The second is the definition of the DEX file format in the Android source code, dalvik/libdex/DexFile.h, which defines every part of a DEX file in detail.
For tooling, I recommend 010 Editor, which also appeared in the AndroidManifest.xml file format parsing article. It ships with a rich set of file templates that support parsing common file formats, so you can easily view each part of a file structure next to its hexadecimal bytes. I usually use 010 Editor when parsing file structures. Here is a screenshot of the Hello.dex file generated earlier, opened in 010 Editor:
The file structure of the DEX is visible at a glance, which makes this a really powerful tool. Before parsing in detail, let's first roughly divide the DEX file into layers, as shown below:
I have put a detailed mind map at the end of the article; you can read the article alongside it.
In order:
- header: the DEX file header, which records information about the file and the offsets of the other data structures
- string_ids: offsets of the string data
- type_ids: type information
- proto_ids: method prototypes (declarations)
- field_ids: field information
- method_ids: method information (class, prototype, and method name)
- class_def: class information
- data: the data area
- link_data: the statically linked data area
The sections from string_ids to class_def are index areas: they hold offsets and indices rather than the real data. The real data lives in the data area and is looked up through those offsets. With this general overview of the DEX file in mind, let's take a closer look at each part.
header
For the exact format of the DEX file header, see DexFile.h:
struct DexHeader {
    u1 magic[8];                    // magic number
    u4 checksum;                    // Adler-32 checksum
    u1 signature[kSHA1DigestLen];   // SHA-1 hash
    u4 fileSize;                    // DEX file size
    u4 headerSize;                  // DEX file header size
    u4 endianTag;                   // byte order
    u4 linkSize;                    // link segment size
    u4 linkOff;                     // link segment offset
    u4 mapOff;                      // DexMapList offset
    u4 stringIdsSize;               // number of DexStringId entries
    u4 stringIdsOff;                // DexStringId offset
    u4 typeIdsSize;                 // number of DexTypeId entries
    u4 typeIdsOff;                  // DexTypeId offset
    u4 protoIdsSize;                // number of DexProtoId entries
    u4 protoIdsOff;                 // DexProtoId offset
    u4 fieldIdsSize;                // number of DexFieldId entries
    u4 fieldIdsOff;                 // DexFieldId offset
    u4 methodIdsSize;               // number of DexMethodId entries
    u4 methodIdsOff;                // DexMethodId offset
    u4 classDefsSize;               // number of DexClassDef entries
    u4 classDefsOff;                // DexClassDef offset
    u4 dataSize;                    // data segment size
    u4 dataOff;                     // data segment offset
};
Here u means unsigned: u1 is an 8-bit unsigned integer and u4 is a 32-bit unsigned integer.
magic is a constant that marks the file as a DEX file. It breaks down as:
file identifier "dex" + newline + DEX version + 0
As a string it is dex\n035\0, which is 0x6465780A30333500 in hexadecimal.
checksum is the Adler-32 checksum of the whole file except the magic and checksum fields. It is used to detect whether the DEX file has been corrupted or tampered with.
signature is the SHA-1 hash of the whole file except the magic, checksum, and signature fields.
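The two integrity fields can be recomputed with the JDK alone. The following is a minimal sketch, not the code from my DexParser project: the class name DexIntegritySketch and its method names are made up for illustration, and it assumes the whole DEX file has already been read into a byte array.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.zip.Adler32;

// Hypothetical helper: recomputes the checksum and signature header fields.
final class DexIntegritySketch {
    // Header layout: magic [0, 8), checksum [8, 12), signature [12, 32).
    static final int CHECKSUM_OFF = 8;
    static final int SIGNATURE_OFF = 12;
    static final int SIGNATURE_LEN = 20;

    /** Adler-32 over everything after the magic and checksum fields. */
    static long computeChecksum(byte[] dex) {
        Adler32 adler = new Adler32();
        adler.update(dex, SIGNATURE_OFF, dex.length - SIGNATURE_OFF);
        return adler.getValue();
    }

    /** SHA-1 over everything after the magic, checksum and signature fields. */
    static byte[] computeSignature(byte[] dex) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            int start = SIGNATURE_OFF + SIGNATURE_LEN;
            sha1.update(dex, start, dex.length - start);
            return sha1.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Compares the stored header fields with freshly computed values. */
    static boolean verify(byte[] dex) {
        long storedChecksum = ByteBuffer.wrap(dex).order(ByteOrder.LITTLE_ENDIAN)
                .getInt(CHECKSUM_OFF) & 0xFFFFFFFFL;
        byte[] storedSignature =
                Arrays.copyOfRange(dex, SIGNATURE_OFF, SIGNATURE_OFF + SIGNATURE_LEN);
        return storedChecksum == computeChecksum(dex)
                && Arrays.equals(storedSignature, computeSignature(dex));
    }
}
```

Because the checksum covers the signature bytes, anyone rewriting a DEX file has to update the signature first and the checksum second.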
endianTag indicates whether the DEX file is stored big-endian or little-endian. Since DEX files run on Android, they are normally little-endian, and the value is the constant 0x12345678.
The remaining fields record the number of entries in each of the other data structures and their offsets within the file. With these offsets we can easily read the contents of each structure. Following the layout above, the first one is string_ids.
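To make the header layout concrete, here is a small sketch that reads a few header fields with a little-endian ByteBuffer. The class DexHeaderSketch is hypothetical and only covers some fields; the byte positions follow from the field order and sizes in DexHeader (8 bytes of magic, 4 of checksum, 20 of signature, then u4 fields).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical reader for a subset of the DexHeader fields.
final class DexHeaderSketch {
    static final int ENDIAN_CONSTANT = 0x12345678;

    final String version;       // three ASCII digits, e.g. "035"
    final int fileSize;
    final int headerSize;
    final int stringIdsSize;
    final int stringIdsOff;

    DexHeaderSketch(byte[] dex) {
        // magic = "dex\n" + version + '\0'
        if (dex[0] != 'd' || dex[1] != 'e' || dex[2] != 'x' || dex[3] != '\n' || dex[7] != 0) {
            throw new IllegalArgumentException("not a DEX file");
        }
        version = new String(dex, 4, 3, StandardCharsets.US_ASCII);

        ByteBuffer buf = ByteBuffer.wrap(dex).order(ByteOrder.LITTLE_ENDIAN);
        fileSize = buf.getInt(32);      // after magic(8) + checksum(4) + signature(20)
        headerSize = buf.getInt(36);
        if (buf.getInt(40) != ENDIAN_CONSTANT) {
            throw new IllegalArgumentException("unexpected endianTag");
        }
        // linkSize(44), linkOff(48) and mapOff(52) are skipped in this sketch.
        stringIdsSize = buf.getInt(56);
        stringIdsOff = buf.getInt(60);
    }
}
```

The remaining size/offset pairs follow the same pattern, four bytes each, up to dataOff at byte 108, which gives a header of 112 (0x70) bytes.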
string_ids
struct DexStringId {
u4 stringDataOff;
};
string_ids is an array of offsets. stringDataOff is the offset of each string within the data section. There, each string starts with its length encoded as ULEB128 (a single byte for strings shorter than 128 characters, which is what the code below relies on), followed by the string data. The logic is relatively simple, so let's look at the code:
private void parseDexString() {
    log("\nparse DexString");
    try {
        int stringIdsSize = dex.getDexHeader().string_ids__size;
        for (int i = 0; i < stringIdsSize; i++) {
            int string_data_off = reader.readInt();
            // The first byte is the length of the string, followed by its contents
            byte size = dexData[string_data_off];
            String string_data = new String(Utils.copy(dexData, string_data_off + 1, size));
            DexString string = new DexString(string_data_off, string_data);
            dexStrings.add(string);
            log("string[%d] data: %s", i, string.string_data);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
The print result is as follows:
parse DexString
string[0] data: <clinit>
string[1] data: <init>
string[2] data: HELLO_WORLD
string[3] data: Hello World!
string[4] data: Hello.java
string[5] data: LHello;
string[6] data: Ljava/io/PrintStream;
string[7] data: Ljava/lang/Object;
string[8] data: Ljava/lang/String;
string[9] data: Ljava/lang/System;
string[10] data: V
string[11] data: VL
string[12] data: [Ljava/lang/String;
string[13] data: main
string[14] data: out
string[15] data: println
It contains variable names, method names, file names, and so on. This string pool comes up again and again when parsing the other structures.
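For strings of 128 characters or more, the length prefix no longer fits in one byte. A slightly more defensive lookup could skip the ULEB128 prefix and read up to the terminating 0 byte instead. This is a sketch under stated assumptions: DexStringSketch is a hypothetical name, and the content is assumed to be plain ASCII so that the DEX string encoding (MUTF-8) and UTF-8 agree.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: resolves string number `index` from the string_ids table.
final class DexStringSketch {
    static String readString(byte[] dex, int stringIdsOff, int index) {
        ByteBuffer buf = ByteBuffer.wrap(dex).order(ByteOrder.LITTLE_ENDIAN);
        // Each DexStringId is a single u4: stringDataOff.
        int pos = buf.getInt(stringIdsOff + 4 * index);
        // Skip the ULEB128 length prefix, which may span several bytes.
        while ((dex[pos++] & 0x80) != 0) {
            // keep skipping continuation bytes
        }
        // The string data is terminated by a 0 byte.
        int end = pos;
        while (dex[end] != 0) {
            end++;
        }
        // Assumption: ASCII-only content, so decoding as UTF-8 is safe here.
        return new String(dex, pos, end - pos, StandardCharsets.UTF_8);
    }
}
```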
type_ids
struct DexTypeId {
u4 descriptorIdx;
};
type_ids holds type information. descriptorIdx is an index into string_ids, so each type can be resolved directly from the string pool read in the previous step:
private void parseDexType() {
    log("\nparse DexTypeId");
    try {
        int typeIdsSize = dex.getDexHeader().type_ids__size;
        for (int i = 0; i < typeIdsSize; i++) {
            int descriptor_idx = reader.readInt();
            DexTypeId dexTypeId = new DexTypeId(descriptor_idx, dexStringIds.get(descriptor_idx).string_data);
            dexTypeIds.add(dexTypeId);
            log("type[%d] data: %s", i, dexTypeId.string_data);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Analysis results:
parse DexType
type[0] data: LHello;
type[1] data: Ljava/io/PrintStream;
type[2] data: Ljava/lang/Object;
type[3] data: Ljava/lang/String;
type[4] data: Ljava/lang/System;
type[5] data: V
type[6] data: [Ljava/lang/String;
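The type strings above are descriptors: L...; wraps a class name, a leading [ marks an array dimension, and single letters such as V stand for primitives or void. As an illustration only (DescriptorSketch is a hypothetical helper, not part of the parser), they can be turned into Java-style names like this:

```java
// Hypothetical helper: converts a type descriptor into a readable Java type name.
final class DescriptorSketch {
    static String toJavaName(String descriptor) {
        // Count leading '[' characters: one per array dimension.
        int dims = 0;
        while (descriptor.charAt(dims) == '[') {
            dims++;
        }
        String base;
        switch (descriptor.charAt(dims)) {
            case 'V': base = "void"; break;
            case 'Z': base = "boolean"; break;
            case 'B': base = "byte"; break;
            case 'S': base = "short"; break;
            case 'C': base = "char"; break;
            case 'I': base = "int"; break;
            case 'J': base = "long"; break;
            case 'F': base = "float"; break;
            case 'D': base = "double"; break;
            case 'L':
                // Strip the leading 'L' and trailing ';', then use dots as separators.
                base = descriptor.substring(dims + 1, descriptor.length() - 1).replace('/', '.');
                break;
            default:
                throw new IllegalArgumentException("bad descriptor: " + descriptor);
        }
        StringBuilder sb = new StringBuilder(base);
        for (int i = 0; i < dims; i++) {
            sb.append("[]");
        }
        return sb.toString();
    }
}
```

With this, type[6] from the output above, [Ljava/lang/String;, reads as java.lang.String[].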
proto_ids
struct DexProtoId {
u4 shortyIdx; /* index into stringIds for shorty descriptor */
u4 returnTypeIdx; /* index into typeIds list for return type */
u4 parametersOff; /* file offset to type_list for parameter types */
};
proto_ids holds method prototype information. Each entry has three fields:
- shortyIdx: index into string_ids; the shorty descriptor string of the method prototype
- returnTypeIdx: index into type_ids; the return type of the method
- parametersOff: file offset of the method's parameter list
The parameter list is represented by DexTypeList in DexFile.h:
struct DexTypeList {
u4 size; /* #of entries in list */
DexTypeItem list[1]; /* entries */
};
struct DexTypeItem {
u2 typeIdx; /* index into typeIds */
};
size is the number of method parameters. Each parameter is a DexTypeItem, which has a single member, typeIdx, an index into type_ids. The parsing code is as follows:
private void parseDexProto() {
    log("\nparse DexProto");
    try {
        int protoIdsSize = dex.getDexHeader().proto_ids__size;
        for (int i = 0; i < protoIdsSize; i++) {
            int shorty_idx = reader.readInt();
            int return_type_idx = reader.readInt();
            int parameters_off = reader.readInt();
            DexProtoId dexProtoId = new DexProtoId(shorty_idx, return_type_idx, parameters_off);
            log("proto[%d]: %s %s %d", i, dexStringIds.get(shorty_idx).string_data,
                    dexTypeIds.get(return_type_idx).string_data, parameters_off);
            // parameters_off == 0 means the method takes no parameters
            if (parameters_off > 0) {
                parseDexProtoParameters(parameters_off);
            }
            dexProtos.add(dexProtoId);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Analysis results:
parse DexProto
proto[0]: V V 0
proto[1]: VL V 412
parameters[0]: Ljava/lang/String;
proto[2]: VL V 420
parameters[0]: [Ljava/lang/String;
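The code above calls parseDexProtoParameters without showing it. Reading a DexTypeList could look roughly like the sketch below; TypeListSketch and readTypeList are hypothetical names, and the method returns raw type_ids indices rather than resolved strings.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper: reads the DexTypeList found at a given file offset.
final class TypeListSketch {
    static int[] readTypeList(byte[] dex, int off) {
        if (off == 0) {
            return new int[0];              // offset 0 means "no parameters"
        }
        ByteBuffer buf = ByteBuffer.wrap(dex).order(ByteOrder.LITTLE_ENDIAN);
        int size = buf.getInt(off);         // u4 size
        int[] typeIdx = new int[size];
        for (int i = 0; i < size; i++) {
            // Each DexTypeItem is a single u2 index into type_ids.
            typeIdx[i] = buf.getShort(off + 4 + 2 * i) & 0xFFFF;
        }
        return typeIdx;
    }
}
```

For proto[1] above, the list at offset 412 would hold one entry, the index of Ljava/lang/String; in type_ids, which is 3.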
field_ids
struct DexFieldId {
u2 classIdx; /* index into typeIds list for defining class */
u2 typeIdx; /* index into typeIds for field type */
u4 nameIdx; /* index into stringIds for field name */
};
field_ids holds field information: the class a field belongs to, its type, and its name. In DexFile.h it is defined as DexFieldId, whose members mean the following:
- classIdx: index into type_ids; the class that defines the field
- typeIdx: index into type_ids; the type of the field
- nameIdx: index into string_ids; the name of the field
The parsing code is very simple, so I won't post it. Here are the results:
parse DexField
field[0]: LHello;->HELLO_WORLD;Ljava/lang/String;
field[1]: Ljava/lang/System;->out;Ljava/io/PrintStream;
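For readers who want to see it anyway, a DexFieldId is just two u2 values followed by a u4, eight bytes in total. The sketch below is one way to read it; FieldIdSketch is a hypothetical class, not the one in my project, and describe() expects the type and string tables to have been resolved already.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.List;

// Hypothetical model of one field_ids entry.
final class FieldIdSketch {
    final int classIdx;   // index into type_ids
    final int typeIdx;    // index into type_ids
    final int nameIdx;    // index into string_ids

    FieldIdSketch(int classIdx, int typeIdx, int nameIdx) {
        this.classIdx = classIdx;
        this.typeIdx = typeIdx;
        this.nameIdx = nameIdx;
    }

    /** Reads entry number `index` from the table starting at fieldIdsOff. */
    static FieldIdSketch read(byte[] dex, int fieldIdsOff, int index) {
        ByteBuffer buf = ByteBuffer.wrap(dex).order(ByteOrder.LITTLE_ENDIAN);
        int base = fieldIdsOff + 8 * index;          // u2 + u2 + u4 = 8 bytes
        return new FieldIdSketch(
                buf.getShort(base) & 0xFFFF,
                buf.getShort(base + 2) & 0xFFFF,
                buf.getInt(base + 4));
    }

    /** Smali-style rendering: class->name:type. */
    String describe(List<String> types, List<String> strings) {
        return types.get(classIdx) + "->" + strings.get(nameIdx) + ":" + types.get(typeIdx);
    }
}
```

Going by the indices printed earlier, field[1] would be stored as 04 00 01 00 0E 00 00 00 (class type[4], type type[1], name string[14]); these bytes are reconstructed from the tables, not copied from the file.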
method_ids
struct DexMethodId {
u2 classIdx; /* index into typeIds list for defining class */
u2 protoIdx; /* index into protoIds for method prototype */
u4 nameIdx; /* index into stringIds for method name */
};
method_ids holds method information: the class a method belongs to, its prototype, and its name. DexFile.h represents each entry with DexMethodId, whose members mean the following:
- classIdx: index into type_ids; the class that defines the method
- protoIdx: index into proto_ids; the method prototype
- nameIdx: index into string_ids; the method name
Analysis results:
parse DexMethod
method[0]: LHello; proto[0] <clinit>
method[1]: LHello; proto[0] <init>
method[2]: LHello; proto[2] main
method[3]: Ljava/io/PrintStream; proto[1] println
method[4]: Ljava/lang/Object; proto[0] <init>
class_def
struct DexClassDef {
u4 classIdx; /* index into typeIds for this class */
u4 accessFlags;
u4 superclassIdx; /* index into typeIds for superclass */
u4 interfacesOff; /* file offset to DexTypeList */
u4 sourceFileIdx; /* index into stringIds for source file name */
u4 annotationsOff; /* file offset to annotations_directory_item */
u4 classDataOff; /* file offset to class_data_item */
u4 staticValuesOff; /* file offset to DexEncodedArray */
};
class_def is the most complex and most central part of the DEX file structure. It holds all the information about a class and corresponds to DexClassDef in DexFile.h:
- classIdx: index into type_ids; the class itself
- accessFlags: access flags
- superclassIdx: index into type_ids; the superclass
- interfacesOff: offset of a DexTypeList listing the implemented interfaces
- sourceFileIdx: index into string_ids; the source file name
- annotationsOff: offset of the annotation information
- classDataOff: offset of the DexClassData, the data portion of the class
- staticValuesOff: offset of a DexEncodedArray holding the initial values of the class's static fields
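accessFlags is a bit mask. As a rough illustration, the sketch below decodes a handful of the flag bits; AccessFlagsSketch is a hypothetical helper and deliberately covers only a subset of the flags defined by the DEX format.

```java
// Hypothetical helper: decodes a subset of the DEX access flag bits.
final class AccessFlagsSketch {
    private static final int[] BITS = {
        0x1, 0x2, 0x4, 0x8, 0x10, 0x200, 0x400, 0x10000
    };
    private static final String[] NAMES = {
        "public", "private", "protected", "static", "final",
        "interface", "abstract", "constructor"
    };

    /** Returns the names of all known flags set in `flags`, separated by spaces. */
    static String decode(int flags) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < BITS.length; i++) {
            if ((flags & BITS[i]) != 0) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(NAMES[i]);
            }
        }
        return sb.toString();
    }
}
```

For example, a public static method such as main() carries 0x9, and the private static field HELLO_WORLD carries 0xA.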
DexClassData
classDataOff points to the core data of a class, which the Android source code defines as DexClassData. It lives in DexClass.h rather than DexFile.h:
struct DexClassData {
DexClassDataHeader header;
DexField* staticFields;
DexField* instanceFields;
DexMethod* directMethods;
DexMethod* virtualMethods;
};
DexClassDataHeader records the number of fields and methods in a class. It is also defined in DexClass.h:
struct DexClassDataHeader {
u4 staticFieldsSize;
u4 instanceFieldsSize;
u4 directMethodsSize;
u4 virtualMethodsSize;
};
- staticFieldsSize: number of static fields
- instanceFieldsSize: number of instance fields
- directMethodsSize: number of direct methods
- virtualMethodsSize: number of virtual methods
Note that although the struct declares these as u4, in the file they are stored as ULEB128 values. LEB128 is a variable-length encoding: each value takes 1 to 5 bytes, and only 7 bits of each byte carry data. If the highest bit of the first byte is 1, the value continues in the second byte; if the highest bit of the second byte is 1, it continues in the third, and so on until a byte whose highest bit is 0, up to a maximum of 5 bytes. LEB128 comes in a signed variant (SLEB128) and an unsigned variant (ULEB128).
Why use this encoding? An int in Java always takes 4 bytes (32 bits), but most values are small and don't need all 4 bytes. A variable-length encoding saves space, and on Android every bit of saved space helps. The Java code for reading a ULEB128 value is as follows:
public static int readUnsignedLeb128(byte[] src, int offset) {
    int result = 0;
    int count = 0;
    int cur;
    do {
        cur = copy(src, offset, 1)[0];
        cur &= 0xff;
        // The low 7 bits carry data; later bytes hold higher-order bits
        result |= (cur & 0x7f) << count * 7;
        count++;
        offset++;
        DexParser.POSITION++;
    } while ((cur & 0x80) == 128 && count < 5); // high bit set: another byte follows
    return result;
}
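To see the encoding from the other direction, here is a self-contained round trip: an encoder plus a decoder that does not depend on the parser's copy() helper or POSITION counter. Uleb128Sketch is a hypothetical name and the code is only a sketch of the scheme described above.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper: ULEB128 encode/decode round trip.
final class Uleb128Sketch {
    /** Encodes a 32-bit value (treated as unsigned) into 1 to 5 bytes. */
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int remaining = value;
        do {
            int b = remaining & 0x7f;       // take the low 7 bits
            remaining >>>= 7;
            if (remaining != 0) {
                b |= 0x80;                  // set the high bit: more bytes follow
            }
            out.write(b);
        } while (remaining != 0);
        return out.toByteArray();
    }

    /** Decodes the ULEB128 value that starts at src[offset]. */
    static int decode(byte[] src, int offset) {
        int result = 0;
        int shift = 0;
        int cur;
        do {
            cur = src[offset++] & 0xff;
            result |= (cur & 0x7f) << shift;
            shift += 7;
        } while ((cur & 0x80) != 0 && shift < 35);   // at most 5 bytes
        return result;
    }
}
```

For example, 300 is encoded as the two bytes AC 02: the low seven bits 0x2C with the continuation bit set, followed by the remaining bits 0x02.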
So let’s go back to DexClassData. The header section defines the number of fields and methods, followed by the specific data for static fields, instance fields, direct methods, and virtual methods. Fields are represented by DexField and methods by DexMethod.
DexField
struct DexField {
u4 fieldIdx; /* index to a field_id_item */
u4 accessFlags;
};
- fieldIdx: index into field_ids (stored in the file as a ULEB128 difference from the previous entry's index)
- accessFlags: access flags
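One detail is easy to miss: inside the class data, every value is ULEB128, and field indices are stored as differences from the previous entry in the same list (the first entry is stored directly). The sketch below reads the header counts and then the static fields; ClassDataSketch is a hypothetical class, and the bytes in the usage note are made up for illustration rather than taken from Hello.dex.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical cursor-based reader for the start of a class data block.
final class ClassDataSketch {
    private final byte[] data;
    private int pos;

    ClassDataSketch(byte[] data, int classDataOff) {
        this.data = data;
        this.pos = classDataOff;
    }

    /** Reads one ULEB128 value and advances the cursor past it. */
    int readUleb128() {
        int result = 0;
        int shift = 0;
        int cur;
        do {
            cur = data[pos++] & 0xff;
            result |= (cur & 0x7f) << shift;
            shift += 7;
        } while ((cur & 0x80) != 0 && shift < 35);
        return result;
    }

    /** Returns {fieldIdx, accessFlags} pairs for the static fields. */
    List<int[]> readStaticFields() {
        int staticFieldsSize = readUleb128();
        readUleb128();                      // instanceFieldsSize (unused here)
        readUleb128();                      // directMethodsSize (unused here)
        readUleb128();                      // virtualMethodsSize (unused here)
        List<int[]> fields = new ArrayList<>();
        int fieldIdx = 0;
        for (int i = 0; i < staticFieldsSize; i++) {
            fieldIdx += readUleb128();      // stored as a diff from the previous index
            int accessFlags = readUleb128();
            fields.add(new int[] {fieldIdx, accessFlags});
        }
        return fields;
    }
}
```

Given the bytes 02 00 00 00 01 0A 02 19, this yields two static fields with indices 1 and 3 (1, then 1 + 2) and access flags 0x0A and 0x19.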
DexMethod
struct DexMethod {
u4 methodIdx; /* index to a method_id_item */
u4 accessFlags;
u4 codeOff; /* file offset to a code_item */
};
methodIdx is an index into method_ids and identifies the method. accessFlags holds the method's access flags. codeOff is the offset of a DexCode structure. If you have read this far, you may have realized that the DEX file must contain the code, that is the instructions, for the main() method in Hello.java. Indeed it does: DexCode stores the details of a method together with its instructions.
struct DexCode {
    u2 registersSize;   // number of registers used
    u2 insSize;         // number of incoming arguments
    u2 outsSize;        // number of registers needed for arguments when calling other methods
    u2 triesSize;       // number of try/catch entries
    u4 debugInfoOff;    // offset of the debug information
    u4 insnsSize;       // size of the instruction list, in 16-bit code units
    u2 insns[1];        // instructions
    /* followed by optional u2 padding */   // 2 bytes for alignment
    /* followed by try_item[triesSize] */
    /* followed by uleb128 handlersSize */
    /* followed by catch_handler_item[handlersSize] */
};
Open 010 Editor, locate the DexCode for the main() method, and analyze it against the source:
public class Hello {
    private static String HELLO_WORLD = "Hello World!";

    public static void main(String[] args) {
        System.out.println(HELLO_WORLD);
    }
}
The hexadecimal representation of the DexCode corresponding to the main() method is:
03 00 01 00 02 00 00 00 79 02 00 00 08 00 00 00
62 00 01 00 62 01 00 00 6E 20 03 00 10 00 0E 00
Reading the first 16 bytes: the method uses three registers. The number of incoming arguments is 1, namely String[] args of main(). Two registers are needed for arguments when calling other methods. triesSize is 0 and debugInfoOff is 0x279. The instruction list is 8 code units long, that is 16 bytes.
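The same reading can be done in code. The sketch below parses the fixed 16-byte DexCode header from the first line of the hex dump above; DexCodeHeaderSketch is a hypothetical class name, while the bytes in MAIN_CODE are copied from the dump.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical reader for the fixed-size part of a DexCode structure.
final class DexCodeHeaderSketch {
    /** First line of the hex dump of main()'s DexCode. */
    static final byte[] MAIN_CODE = {
        0x03, 0x00, 0x01, 0x00, 0x02, 0x00, 0x00, 0x00,
        0x79, 0x02, 0x00, 0x00, 0x08, 0x00, 0x00, 0x00
    };

    final int registersSize;
    final int insSize;
    final int outsSize;
    final int triesSize;
    final int debugInfoOff;
    final int insnsSize;    // counted in 16-bit code units, not bytes

    DexCodeHeaderSketch(byte[] bytes, int off) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        registersSize = buf.getShort(off) & 0xFFFF;
        insSize = buf.getShort(off + 2) & 0xFFFF;
        outsSize = buf.getShort(off + 4) & 0xFFFF;
        triesSize = buf.getShort(off + 6) & 0xFFFF;
        debugInfoOff = buf.getInt(off + 8);
        insnsSize = buf.getInt(off + 12);
    }

    public static void main(String[] args) {
        DexCodeHeaderSketch h = new DexCodeHeaderSketch(MAIN_CODE, 0);
        System.out.printf("registers=%d ins=%d outs=%d tries=%d debugInfoOff=0x%x insns=%d%n",
                h.registersSize, h.insSize, h.outsSize, h.triesSize, h.debugInfoOff, h.insnsSize);
    }
}
```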
Finally, the instructions. main() occupies eight 16-bit code units, which is the second line of the hex dump above. Let's try to decode them. The Dalvik bytecode reference on the official Android website documents every opcode.
The first instruction is 62 00 01 00. According to the documentation, opcode 62 is sget-object vAA, field@BBBB. AA is 00, meaning register v0. BBBB is 01 00 (little-endian 0x0001), the field at index 1 in field_ids, which the earlier parse result shows is Ljava/lang/System;->out;Ljava/io/PrintStream;. So 62 00 01 00 decodes to:
sget-object v0, Ljava/lang/System;->out:Ljava/io/PrintStream;
Next comes 62 01 00 00, again sget-object vAA, field@BBBB. AA is 01 and BBBB is 00 00, so it uses register v1 and the field at index 0 in field_ids, LHello;->HELLO_WORLD;Ljava/lang/String;. The complete instruction is:
sget-object v1, LHello;->HELLO_WORLD:Ljava/lang/String;
Then comes 6E 20 03 00 10 00. Opcode 6E is invoke-virtual {vC, vD, vE, vF, vG}, meth@BBBB. In the byte 20 that follows the opcode, the high nibble 2 is the argument count, so the call passes two registers. BBBB is 03 00, the method at index 3 in method_ids, which the earlier parse result shows is Ljava/io/PrintStream;->println(Ljava/lang/String;)V. The last code unit, 10 00, packs the register numbers four bits each: vC is 0 and vD is 1. The complete instruction is:
invoke-virtual {v0, v1}, Ljava/io/PrintStream;->println(Ljava/lang/String;)V
The last code unit is 0E 00. According to the documentation, opcode 0E is return-void, and with that the main() method is fully decoded.
Put the above instructions together:
62 00 01 00       : sget-object v0, Ljava/lang/System;->out:Ljava/io/PrintStream;
62 01 00 00       : sget-object v1, LHello;->HELLO_WORLD:Ljava/lang/String;
6E 20 03 00 10 00 : invoke-virtual {v0, v1}, Ljava/io/PrintStream;->println(Ljava/lang/String;)V
0E 00             : return-void
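The manual decoding above can be automated for these three opcodes. The following is a toy sketch, nowhere near a full disassembler: MiniDisassemblerSketch is a hypothetical class, it knows only sget-object, invoke-virtual and return-void, and it prints raw field@/method@ indices instead of resolving them through field_ids and method_ids.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy disassembler covering only the opcodes used by main().
final class MiniDisassemblerSketch {
    /** Second line of the hex dump: the instructions of main(). */
    static final byte[] MAIN_INSNS = {
        0x62, 0x00, 0x01, 0x00, 0x62, 0x01, 0x00, 0x00,
        0x6E, 0x20, 0x03, 0x00, 0x10, 0x00, 0x0E, 0x00
    };

    /** Reads a little-endian 16-bit code unit. */
    private static int u2(byte[] b, int off) {
        return (b[off] & 0xFF) | ((b[off + 1] & 0xFF) << 8);
    }

    static List<String> disassemble(byte[] insns) {
        List<String> out = new ArrayList<>();
        int pc = 0;
        while (pc < insns.length) {
            int op = insns[pc] & 0xFF;
            int hi = insns[pc + 1] & 0xFF;      // high byte of the first code unit
            switch (op) {
                case 0x0E:                      // return-void, one code unit
                    out.add("return-void");
                    pc += 2;
                    break;
                case 0x62:                      // sget-object vAA, field@BBBB
                    out.add("sget-object v" + hi + ", field@" + u2(insns, pc + 2));
                    pc += 4;
                    break;
                case 0x6E: {                    // invoke-virtual {vC..vG}, meth@BBBB
                    int argCount = hi >> 4;     // high nibble: number of registers
                    int regs = u2(insns, pc + 4);
                    StringBuilder sb = new StringBuilder("invoke-virtual {");
                    for (int i = 0; i < argCount; i++) {
                        if (i > 0) {
                            sb.append(", ");
                        }
                        // vC..vF are packed 4 bits each; a fifth register sits in the low nibble of hi.
                        int reg = (i < 4) ? (regs >> (4 * i)) & 0xF : hi & 0xF;
                        sb.append('v').append(reg);
                    }
                    sb.append("}, method@").append(u2(insns, pc + 2));
                    out.add(sb.toString());
                    pc += 6;
                    break;
                }
                default:
                    throw new IllegalArgumentException("opcode not covered by this sketch: " + op);
            }
        }
        return out;
    }
}
```

Running disassemble(MAIN_INSNS) gives the same four instructions as the table above, with field@1, field@0 and method@3 in place of the resolved names.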
These are the complete instructions for the main() method. Recall my earlier article Smali: Hello World: the result matches the Smali code for Hello.java:
.method public static main([Ljava/lang/String;)V
    .registers 3
    .prologue
    .line 6
    sget-object v0, Ljava/lang/System;->out:Ljava/io/PrintStream;
    sget-object v1, LHello;->HELLO_WORLD:Ljava/lang/String;
    invoke-virtual {v0, v1}, Ljava/io/PrintStream;->println(Ljava/lang/String;)V
    .line 7
    return-void
.end method
Conclusion
Articles like this are long and dry, but if you push through there is a lot to gain. To wrap up, here is the mind map:
The Java source code of the DEX file parser is available here: DexParser
This article was first published on the WeChat official account "Bingxin said", which focuses on original Java and Android knowledge sharing and LeetCode solutions.