Class file format details

Write once, run anywhere! We all know this famous Java slogan. Different operating systems and different CPUs have different instruction sets, so platform independence has to come from the Java virtual machine. Computers ultimately only understand binary zeros and ones, and the virtual machine is the bridge between the code we write and the computer. The compiler (javac) turns the .java source files we write into .class files in bytecode format, which is the program storage format shared by every virtual machine on every platform. This is the essence of platform independence: the virtual machine implements it at the application layer, on top of the operating system. In fact, the JVM is not only platform independent but also language independent. Common JVM languages such as Scala, Groovy, and more recently Kotlin, the official Android development language, all compile to .class files through their own compilers. A proper understanding of the class file format is of great benefit for both development and reverse engineering.

Class file structure

The structure of the class file is clear, as follows:

ClassFile {
  u4              magic;
  u2              minor_version;
  u2              major_version;
  u2              constant_pool_count;
  cp_info         constant_pool[constant_pool_count-1];
  u2              access_flags;
  u2              this_class;
  u2              super_class;
  u2              interfaces_count;
  u2              interfaces[interfaces_count];
  u2              fields_count;
  field_info      fields[fields_count];
  u2              methods_count;
  method_info     methods[methods_count];
  u2              attributes_count;
  attribute_info  attributes[attributes_count];
}

Here u2 and u4 denote unsigned values of 2 and 4 bytes respectively. Note also that multi-byte data in class files is stored in big-endian order (most significant byte first), which matters when parsing.
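As a minimal sketch of what big-endian order means for a parser, the following Java snippet assembles u2 and u4 values by hand from a byte array. The class and method names (BigEndian, u2, u4) are made up for this illustration, and the sample bytes are simply the header of a class file with major version 52.

```java
public class BigEndian {
    // Reads an unsigned 2-byte value; the most significant byte comes first.
    static int u2(byte[] data, int offset) {
        return ((data[offset] & 0xFF) << 8) | (data[offset + 1] & 0xFF);
    }

    // Reads an unsigned 4-byte value into a long so that it cannot turn negative.
    static long u4(byte[] data, int offset) {
        return ((long) u2(data, offset) << 16) | u2(data, offset + 2);
    }

    public static void main(String[] args) {
        // magic (4 bytes), minor_version (2 bytes), major_version (2 bytes)
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0x00, 0x00, 0x00, 0x34};
        System.out.printf("magic: 0x%08X%n", u4(header, 0));
        System.out.println("minor_version: " + u2(header, 4));
        System.out.println("major_version: " + u2(header, 6));
    }
}
```

In practice java.io.DataInputStream and java.nio.ByteBuffer already read multi-byte values in big-endian order by default, so hand-written shifting like this is rarely needed in Java.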

The best way to understand a file structure is to parse it. The same goes for AndroidManifest.xml, dex, and the other formats covered later: we will learn their structure by parsing them in code. Here we parse the simplest possible Hello.java program:

public class Hello {

    private static String HELLO_WORLD = "Hello World!";

    public static void main(String[] args) {
        System.out.println(HELLO_WORLD);
    }
}

Compiling it with the javac command produces the Hello.class file. 010 Editor is a handy tool for viewing and analyzing binary file structures, and is much smarter than WinHex or GHex. Here is a screenshot of Hello.class opened in 010 Editor:

The file structure is clearly laid out. Clicking a structure also highlights the corresponding hexadecimal data in the upper half of the window, which is quite convenient. The sections below walk through the structure item by item.

magic

The magic number of a class file is a fun one: 0xCAFEBABE. Perhaps Java's founders really were fond of coffee, and the Java logo is a cup of coffee too.

minor_version && major_version

minor_version is the minor version number and major_version is the major version number. Each JDK release has its own version number. Newer JDK versions are backward compatible with older class files, but an older JDK cannot run class files produced for a newer version: the virtual machine refuses to execute them even if the file format itself has not changed. In the figure above, the major version number is 52, which corresponds to JDK 1.8, so virtual machines older than JDK 1.8 cannot execute this file.
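The version rule can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions: the class and method names are hypothetical, the check ignores details such as preview features, and the mapping relies on major version 45 belonging to JDK 1.1 with each later release adding one (so 52 maps to Java 8).

```java
import java.nio.ByteBuffer;

public class VersionCheck {
    // Major version 45 belongs to JDK 1.1 and each later release adds one.
    static int javaRelease(int majorVersion) {
        return majorVersion - 44;
    }

    // Simplified rule: a JVM accepts class files up to its own major version.
    static boolean isSupported(int classMajor, int jvmMajor) {
        return classMajor <= jvmMajor;
    }

    // Reads major_version from the first 8 bytes of a class file.
    static int readMajorVersion(byte[] classBytes) {
        ByteBuffer buf = ByteBuffer.wrap(classBytes); // big-endian by default
        if (buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("Not a class file: bad magic number");
        }
        buf.getShort();                               // skip minor_version
        return buf.getShort() & 0xFFFF;               // major_version as unsigned
    }

    public static void main(String[] args) {
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0x00, 0x00, 0x00, 0x34};
        int major = readMajorVersion(header);
        System.out.println("major_version " + major + " = Java " + javaRelease(major));
        System.out.println("accepted by a Java 7 JVM (major 51)? " + isSupported(major, 51));
    }
}
```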

constant_pool

The constant pool is the most important part of the class file: it holds many kinds of data and is referenced by many other items. When parsing, we can think of the constant pool as an array or a collection, so the first thing to determine is its length. Here is a screenshot of the constant pool section of Hello.class:

The constant pool section starts with a u2 value, constant_pool_count, which gives the capacity of the constant pool, 34 in this example. Note that constant pool indices start at 1, which means this class file has 33 constants. Why does the index start at 1? So that index 0 can be used, in particular cases, to express that no constant pool entry is referenced.

The following table shows some common data types for constant pools:

| Type | Tag | Description |
| --- | --- | --- |
| CONSTANT_Utf8_info | 1 | UTF-8 encoded string |
| CONSTANT_Integer_info | 3 | Integer literal |
| CONSTANT_Float_info | 4 | Floating-point literal |
| CONSTANT_Long_info | 5 | Long integer literal |
| CONSTANT_Double_info | 6 | Double-precision floating-point literal |
| CONSTANT_Class_info | 7 | Symbolic reference to a class or interface |
| CONSTANT_String_info | 8 | String literal |
| CONSTANT_Fieldref_info | 9 | Symbolic reference to a field |
| CONSTANT_Methodref_info | 10 | Symbolic reference to a method of a class |
| CONSTANT_InterfaceMethodref_info | 11 | Symbolic reference to a method of an interface |
| CONSTANT_NameAndType_info | 12 | Partial symbolic reference to a field or method |
| CONSTANT_MethodHandle_info | 15 | Method handle |
| CONSTANT_MethodType_info | 16 | Method type |
| CONSTANT_InvokeDynamic_info | 18 | Dynamic method call site |

The constant pool has more than a dozen entry types, each with its own data structure, but they all share a first field: tag. The tag is a one-byte value that identifies which data structure follows. Instead of going through every data structure, we'll take a rough look at the constant pool of Hello.class.
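Because each tag fixes the size of the entry that follows, a parser can walk the whole constant pool with a single switch. The sketch below is an assumption-laden illustration rather than the parser used in this article: the class name is hypothetical, and it relies on one rule from the Java Virtual Machine Specification that is not covered above, namely that Long and Double entries occupy two index slots.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class ConstantPoolWalker {
    // Returns one tag per index slot: list position i holds the tag of entry i + 1.
    // The unusable second slot of a Long or Double is recorded as 0.
    static List<Integer> readTags(ByteBuffer buf) {
        int count = buf.getShort() & 0xFFFF;           // constant_pool_count
        List<Integer> tags = new ArrayList<>();
        for (int i = 1; i < count; i++) {              // valid indices are 1 .. count - 1
            int tag = buf.get() & 0xFF;
            tags.add(tag);
            switch (tag) {
                case 1:                                // Utf8: u2 length + that many bytes
                    skip(buf, buf.getShort() & 0xFFFF);
                    break;
                case 3: case 4:                        // Integer, Float: 4 bytes
                    skip(buf, 4);
                    break;
                case 5: case 6:                        // Long, Double: 8 bytes, two slots
                    skip(buf, 8);
                    i++;
                    tags.add(0);
                    break;
                case 7: case 8: case 16:               // Class, String, MethodType: one u2
                    skip(buf, 2);
                    break;
                case 15:                               // MethodHandle: u1 kind + u2 index
                    skip(buf, 3);
                    break;
                case 9: case 10: case 11: case 12: case 18: // two u2 indices
                    skip(buf, 4);
                    break;
                default:
                    throw new IllegalArgumentException("Unknown constant pool tag: " + tag);
            }
        }
        return tags;
    }

    private static void skip(ByteBuffer buf, int n) {
        buf.position(buf.position() + n);
    }

    public static void main(String[] args) {
        // count = 4: #1 Class(name_index 2), #2 Utf8 "Hello", #3 Integer 42
        byte[] pool = {0, 4, 7, 0, 2, 1, 0, 5, 'H', 'e', 'l', 'l', 'o', 3, 0, 0, 0, 42};
        System.out.println(readTags(ByteBuffer.wrap(pool)));
    }
}
```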

Let's start with the first entry in the constant pool of Hello.class:

This is a CONSTANT_Methodref_info, which describes a method of a class. Its data structure is tag, class_index, name_and_type_index. The tag is 10. The value of class_index is 7, a constant pool index pointing to another entry in the pool. Note that constant pool indices start at 1 while 010 Editor numbers the array elements from 0, so in the screenshot this is the item labeled 6:

The class_index of a CONSTANT_Methodref_info always points to a CONSTANT_Class_info, whose tag is 7 and which represents a class or interface. Its name_index is again a constant pool index, and following it leads to item 26:

This is a CONSTANT_Utf8_info, which as the name suggests is a string: a length field gives the length in bytes, followed by a byte array holding the content of the string. From 010 Editor's parsing you can see that this string is java/lang/Object, the fully qualified name of the class.

That takes care of class_index in the first constant pool entry. The other field of CONSTANT_Methodref_info is name_and_type_index, which always points to a CONSTANT_NameAndType_info describing a field or method. Its value here is 19, so let's look at item 18 of the constant pool:

CONSTANT_NameAndType_info has a tag of 12 and two fields, name_index and descriptor_index, both pointing to a CONSTANT_Utf8_info. name_index gives the unqualified name of the field or method, here <init>. descriptor_index gives the field or method descriptor, here ()V.

At this point the first entry of the constant pool is fully parsed, and every following entry can be parsed the same way. This also shows why the constant pool is so important: it contains most of the information in the class file.
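The chain of lookups we just followed by hand (Methodref to Class to Utf8, and Methodref to NameAndType to Utf8) can be sketched as follows. The pool is modeled very loosely as a map from index to either a String (for Utf8 entries) or an int[] of indices (for everything else); this representation and the class name are assumptions made for the example. Only the indices 7 and 19 are quoted from the text, while 10, 11 and 27 are illustrative placeholders.

```java
import java.util.HashMap;
import java.util.Map;

public class MethodrefResolver {
    private static String utf8(Map<Integer, Object> pool, int index) {
        return (String) pool.get(index);
    }

    private static int[] refs(Map<Integer, Object> pool, int index) {
        return (int[]) pool.get(index);
    }

    // Follows Methodref -> Class -> Utf8 and Methodref -> NameAndType -> Utf8.
    static String resolveMethodref(Map<Integer, Object> pool, int index) {
        int[] methodref = refs(pool, index);           // {class_index, name_and_type_index}
        int[] classInfo = refs(pool, methodref[0]);    // {name_index}
        int[] nameAndType = refs(pool, methodref[1]);  // {name_index, descriptor_index}
        return utf8(pool, classInfo[0]) + "."
                + utf8(pool, nameAndType[0]) + ":"
                + utf8(pool, nameAndType[1]);
    }

    public static void main(String[] args) {
        Map<Integer, Object> pool = new HashMap<>();
        pool.put(1, new int[] {7, 19});    // CONSTANT_Methodref_info
        pool.put(7, new int[] {27});       // CONSTANT_Class_info (27 is a placeholder)
        pool.put(19, new int[] {10, 11});  // CONSTANT_NameAndType_info (10, 11 are placeholders)
        pool.put(10, "<init>");
        pool.put(11, "()V");
        pool.put(27, "java/lang/Object");
        System.out.println(resolveMethodref(pool, 1));
    }
}
```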

Now let's look at the structures that follow the constant pool:

access_flags

The access flags describe the access permissions and properties of a class or interface. The following table lists some of the access flags and their meanings:

| Flag name | Value | Meaning |
| --- | --- | --- |
| ACC_PUBLIC | 0x0001 | Whether the type is public |
| ACC_FINAL | 0x0010 | Whether it is declared final |
| ACC_SUPER | 0x0020 | Must be set for all classes compiled after JDK 1.0.2 |
| ACC_INTERFACE | 0x0200 | Whether it is an interface |
| ACC_ABSTRACT | 0x0400 | Whether the type is abstract |
| ACC_SYNTHETIC | 0x1000 | Marks a class that was not generated from user code |
| ACC_ANNOTATION | 0x2000 | Whether it is an annotation |
| ACC_ENUM | 0x4000 | Whether it is an enumeration type |

In Hello.class the access flags have the decimal value 33. Hello is an ordinary class declared public, so it should carry the ACC_PUBLIC and ACC_SUPER flags, and 0x0001 + 0x0020 = 0x0021 is exactly decimal 33.
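Since each flag is a single bit, decoding access_flags is a matter of testing the value against each mask. The following sketch uses the class-level flag values from the table above; the class name and the decode helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class ClassAccessFlags {
    private static final String[] NAMES = {
        "ACC_PUBLIC", "ACC_FINAL", "ACC_SUPER", "ACC_INTERFACE",
        "ACC_ABSTRACT", "ACC_SYNTHETIC", "ACC_ANNOTATION", "ACC_ENUM"
    };
    private static final int[] VALUES = {
        0x0001, 0x0010, 0x0020, 0x0200, 0x0400, 0x1000, 0x2000, 0x4000
    };

    // Returns the names of all flags whose bit is set in accessFlags.
    static List<String> decode(int accessFlags) {
        List<String> set = new ArrayList<>();
        for (int i = 0; i < VALUES.length; i++) {
            if ((accessFlags & VALUES[i]) != 0) {
                set.add(NAMES[i]);
            }
        }
        return set;
    }

    public static void main(String[] args) {
        System.out.println(decode(33)); // prints [ACC_PUBLIC, ACC_SUPER]
    }
}
```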

this_class && super_class && interfaces_count && interfaces[]

Why are these items explained together? Because together they determine the inheritance relationships of the class. this_class is the class index, used to determine the fully qualified name of the class. As the figure shows, its value is 6, which is the item labeled 5 in 010 Editor's zero-based listing and must be a CONSTANT_Class_info. Looking it up in the constant pool gives the class name Hello, the name of the current class. super_class is the superclass index, also pointing to a CONSTANT_Class_info, here with the value java/lang/Object. As we all know, Object is the only class in Java that has no superclass, so its own superclass index is 0.

The two bytes immediately following super_class are interfaces_count, the number of interfaces the class implements. Since Hello.java does not implement any interface, this value is 0. If the class implemented interfaces, their constant pool indices would be stored in the interfaces[] array that follows.
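Reading these four items is straightforward, as the sketch below suggests. The class name and the flat int[] result layout are choices made purely for this example, and the sample value 7 for super_class is an assumption (the text only states that it resolves to java/lang/Object).

```java
import java.nio.ByteBuffer;

public class ClassHierarchyReader {
    // Reads this_class, super_class and interfaces[] from the current position.
    // Result layout: {this_class, super_class, interfaces[0], interfaces[1], ...}
    static int[] read(ByteBuffer buf) {
        int thisClass = buf.getShort() & 0xFFFF;
        int superClass = buf.getShort() & 0xFFFF;       // 0 for java/lang/Object
        int interfacesCount = buf.getShort() & 0xFFFF;
        int[] result = new int[2 + interfacesCount];
        result[0] = thisClass;
        result[1] = superClass;
        for (int i = 0; i < interfacesCount; i++) {
            // Each element is the constant pool index of a CONSTANT_Class_info
            result[2 + i] = buf.getShort() & 0xFFFF;
        }
        return result;
    }

    public static void main(String[] args) {
        // this_class = 6, super_class = 7 (assumed), interfaces_count = 0
        byte[] bytes = {0x00, 0x06, 0x00, 0x07, 0x00, 0x00};
        int[] r = read(ByteBuffer.wrap(bytes));
        System.out.println("this_class=" + r[0] + " super_class=" + r[1]
                + " interfaces=" + (r.length - 2));
    }
}
```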

fields_count && field_info

The field table collection describes the variables declared in the class. fields_count gives the number of variables, and fields[] stores the information about each one. Note that variables here means member variables and does not include local variables inside methods. Recall that Hello.java has only one variable:

private static String HELLO_WORLD = "Hello World!";

The declaration above tells us there is a String variable named HELLO_WORLD that is private and static. That is exactly the information fields[] needs to store. Let's look at the structure of field_info:

access_flags is the access flag field, describing the access permissions and basic properties of the field, much like the class access flags examined earlier. The following table lists some common flags and their meanings:

| Flag name | Value | Meaning |
| --- | --- | --- |
| ACC_PUBLIC | 0x0001 | Whether it is public |
| ACC_PRIVATE | 0x0002 | Whether it is private |
| ACC_PROTECTED | 0x0004 | Whether it is protected |
| ACC_STATIC | 0x0008 | Whether it is static |
| ACC_FINAL | 0x0010 | Whether it is final |
| ACC_VOLATILE | 0x0040 | Whether it is volatile |
| ACC_TRANSIENT | 0x0080 | Whether it is transient |
| ACC_SYNTHETIC | 0x1000 | Whether it is generated automatically by the compiler |
| ACC_ENUM | 0x4000 | Whether it is an enum |

private static is 0x0002 + 0x0008 = 0x000A, which is 10 in decimal.
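For field flags, the bit values happen to coincide with the constants in java.lang.reflect.Modifier, so the JDK can do the decoding for us. A small sketch (the wrapper class FieldFlags is made up for the example):

```java
import java.lang.reflect.Modifier;

public class FieldFlags {
    // Turns a field's access_flags into Java source keywords.
    static String describe(int accessFlags) {
        return Modifier.toString(accessFlags);
    }

    public static void main(String[] args) {
        int flags = 0x0002 | 0x0008; // ACC_PRIVATE | ACC_STATIC
        System.out.println(flags + " -> " + describe(flags)); // prints 10 -> private static
    }
}
```

This shortcut should be used with care for methods, because bits 0x0040 and 0x0080 mean bridge and varargs there, and Modifier.toString would report them as volatile and transient.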

name_index is a constant pool index giving the name of the field. See constant pool item 7, a CONSTANT_Utf8_info with the value HELLO_WORLD.

descriptor_index is a constant pool index giving the field descriptor. See constant pool item 8, a CONSTANT_Utf8_info with the value Ljava/lang/String;.

That gives us the complete information about the field. The figure also shows that descriptor_index is followed by attributes_count, which is 0 here; otherwise attributes[] would follow. Attribute tables are discussed later, so we skip them for now.
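The descriptor Ljava/lang/String; uses the compact notation defined by the Java Virtual Machine Specification: L...; wraps a class name, a single letter stands for a primitive type, and each leading [ adds one array dimension. As a rough sketch, a field descriptor can be turned back into a Java type name like this (the class and method names are hypothetical):

```java
public class FieldDescriptor {
    // Converts a field descriptor such as "Ljava/lang/String;" to "java.lang.String".
    static String toJavaType(String descriptor) {
        int dims = 0;
        while (descriptor.charAt(dims) == '[') {
            dims++;                                   // count array dimensions
        }
        String base;
        switch (descriptor.charAt(dims)) {
            case 'B': base = "byte"; break;
            case 'C': base = "char"; break;
            case 'D': base = "double"; break;
            case 'F': base = "float"; break;
            case 'I': base = "int"; break;
            case 'J': base = "long"; break;
            case 'S': base = "short"; break;
            case 'Z': base = "boolean"; break;
            case 'L':                                 // strip the leading L and trailing ;
                base = descriptor.substring(dims + 1, descriptor.length() - 1).replace('/', '.');
                break;
            default:
                throw new IllegalArgumentException("Bad descriptor: " + descriptor);
        }
        StringBuilder sb = new StringBuilder(base);
        for (int i = 0; i < dims; i++) {
            sb.append("[]");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toJavaType("Ljava/lang/String;"));
        System.out.println(toJavaType("[Ljava/lang/String;"));
        System.out.println(toJavaType("[[I"));
    }
}
```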

methods_count && method_info

Following the collection of field tables is the collection of method tables, representing the methods in the class. The structure of the method table collection is similar to that of the field table collection, as shown below:

access_flags holds the access flags, whose values differ slightly from those of the field table, as shown below:

| Flag name | Value | Meaning |
| --- | --- | --- |
| ACC_PUBLIC | 0x0001 | Whether it is public |
| ACC_PRIVATE | 0x0002 | Whether it is private |
| ACC_PROTECTED | 0x0004 | Whether it is protected |
| ACC_STATIC | 0x0008 | Whether it is static |
| ACC_FINAL | 0x0010 | Whether it is final |
| ACC_SYNCHRONIZED | 0x0020 | Whether it is synchronized |
| ACC_BRIDGE | 0x0040 | Whether it is a bridge method generated by the compiler |
| ACC_VARARGS | 0x0080 | Whether it accepts a variable number of arguments |
| ACC_NATIVE | 0x0100 | Whether it is native |
| ACC_ABSTRACT | 0x0400 | Whether it is abstract |
| ACC_STRICTFP | 0x0800 | Whether it is strictfp |
| ACC_SYNTHETIC | 0x1000 | Whether it is generated automatically by the compiler |

name_index and descriptor_index, as in the field table, give the name and the descriptor of the method, each pointing to a CONSTANT_Utf8_info entry in the constant pool. The actual code of the method is compiled into bytecode and stored in the attribute table that follows. Attribute tables are examined in detail in the next section.

attributes_count && attribute_info

Attribute tables have already appeared several times: both the field table and the method table contain one. There are many kinds of attributes, recording things such as the source file name, the bytecode instructions generated by compilation, constant values defined with final, and the exceptions a method may throw. The Java Virtual Machine Specification (Java SE 7) predefines 21 attributes. Only the attributes that appear in Hello.class are analyzed here.

Take a look at the last two entries in the Hello.class file immediately after the method table.

attributes_count gives the number of entries in the attribute table that follows, 1 in this case, and then comes the attribute itself. As its structure above shows, this is a fixed-length attribute, but most attribute types are variable-length.

attribute_name_index is the index of the attribute name, pointing to a CONSTANT_Utf8_info in the constant pool; the name identifies the type of the attribute. The value here is 17, which is item 16 in 010 Editor's zero-based listing, and the string is SourceFile. So this is a SourceFile attribute, whose value is the name of the source file.

attribute_length is the length of the attribute, not counting attribute_name_index and attribute_length itself, so the whole attribute occupies attribute_length + 6 bytes.

sourcefile_index is the index of the source file name, pointing to a CONSTANT_Utf8_info in the constant pool. Its value is 18, item 17 in the zero-based listing, which unsurprisingly is the source file name Hello.java.
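Because every attribute begins with the same six bytes, a parser can step over attributes it does not understand without knowing their internal layout. The sketch below shows the idea on a SourceFile attribute built from the values quoted above (name index 17, sourcefile_index 18); the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

public class AttributeSkipper {
    // Reads the 6-byte attribute header, skips the body, and returns attribute_name_index.
    static int skipAttribute(ByteBuffer buf) {
        int nameIndex = buf.getShort() & 0xFFFF;   // attribute_name_index (2 bytes)
        int length = buf.getInt();                 // attribute_length (4 bytes)
        buf.position(buf.position() + length);     // body: attribute_length bytes
        return nameIndex;
    }

    public static void main(String[] args) {
        // SourceFile attribute: name index 17, length 2, sourcefile_index 18
        byte[] attr = {0x00, 0x11, 0x00, 0x00, 0x00, 0x02, 0x00, 0x12};
        ByteBuffer buf = ByteBuffer.wrap(attr);
        int nameIndex = skipAttribute(buf);
        // Total size consumed is attribute_length + 6 = 8 bytes
        System.out.println("name index " + nameIndex + ", consumed " + buf.position() + " bytes");
    }
}
```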

That attribute is relatively simple. Now let's look at the attribute table of the main method, which holds the bytecode generated from the code in main:

As you can see, the method table entry for main contains one attribute. Its structure is relatively complex, so let's analyze it item by item.

attribute_name_index points to item 11 in the constant pool, the string Code, so this is a Code attribute. The Code attribute is the most important attribute in the class file: it stores the bytecode generated by compiling Java code.

attribute_length is 38, which means that the next 38 bytes are the content of this attribute.

max_stack is the maximum depth of the operand stack. At no point during the method's execution does the operand stack exceed this depth, and the virtual machine uses this value to size the operand stack in the stack frame.

max_locals is the storage space required for local variables, measured in slots. A slot is the smallest unit the virtual machine uses when allocating memory for local variables.

code_length is the length of the bytecode generated by compilation, followed by the code array that stores the bytecode itself. As the figure above shows, the bytecode here is 10 bytes long. Let's look at those 10 bytes:

B2 00 02 B2 00 03 B6 00 04 B1

A bytecode instruction consists of an opcode followed by the operands it requires. The opcode is a single byte identifying a particular operation, and an instruction may have zero or more operands. The bytecode instruction set is described in detail in the Java Virtual Machine Specification.

Let's continue with the bytecode above. The first opcode is 0xB2, which the instruction table lists as getstatic: fetch a static field of a class. It is followed by a two-byte index pointing to the second entry of the constant pool, a CONSTANT_Fieldref_info, that is, a symbolic reference to a field. Resolving it the same way as the first constant pool entry gives the class name java/lang/System, the field name out, and the descriptor Ljava/io/PrintStream;. So the bytes B2 00 02 fetch the static field out of class System, whose type is java.io.PrintStream.

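The second instruction, B2 00 03, is again getstatic. By the same analysis it fetches the static field HELLO_WORLD of class Hello, whose descriptor is Ljava/lang/String;.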

The third opcode, 0xB6, is invokevirtual, which invokes an instance method. The two bytes that follow point to a CONSTANT_Methodref_info in the constant pool: the class name is java/io/PrintStream, the method name is println, and the method descriptor is (Ljava/lang/String;)V. These three bytes carry out our print statement.

The last opcode, 0xB1, is return, which returns from a method whose return type is void. At this point the method has finished executing.
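The manual decoding above can be mirrored by a toy disassembler. The sketch below knows only the three opcodes that occur in main (getstatic, invokevirtual, return) and rejects everything else, so it illustrates the opcode-plus-operands layout and is in no way a complete instruction decoder. The class name is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class MiniDisassembler {
    // Decodes a code array that uses only getstatic, invokevirtual and return.
    static List<String> disassemble(byte[] code) {
        List<String> out = new ArrayList<>();
        int pc = 0;
        while (pc < code.length) {
            int opcode = code[pc] & 0xFF;
            switch (opcode) {
                case 0xB2: // getstatic: followed by a u2 constant pool index
                    out.add("getstatic #" + u2(code, pc + 1));
                    pc += 3;
                    break;
                case 0xB6: // invokevirtual: followed by a u2 constant pool index
                    out.add("invokevirtual #" + u2(code, pc + 1));
                    pc += 3;
                    break;
                case 0xB1: // return: no operands
                    out.add("return");
                    pc += 1;
                    break;
                default:
                    throw new IllegalArgumentException(
                            String.format("Unsupported opcode 0x%02X at offset %d", opcode, pc));
            }
        }
        return out;
    }

    private static int u2(byte[] data, int offset) {
        return ((data[offset] & 0xFF) << 8) | (data[offset + 1] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] code = {(byte) 0xB2, 0x00, 0x02, (byte) 0xB2, 0x00, 0x03,
                       (byte) 0xB6, 0x00, 0x04, (byte) 0xB1};
        for (String line : disassemble(code)) {
            System.out.println(line);
        }
    }
}
```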

That essentially completes the analysis of Hello.class. Let's review the basic structure of a class file:

magic | minor version | major version | constant pool count | constant pool | access flags | class index | superclass index | interface count | interface table | field count | field table | method count | method table | attribute count | attribute table

These items are packed tightly into the class file in this strict order, without any delimiters.

You can also use the javap command to quickly view the contents of the Class file:

javap -verbose Hello.class

The results are shown below:

Code parsing

Parsing the class file format in code is fairly simple: read the file as a stream and parse it item by item.

Parsing the magic number and the minor/major version:

private void parseHeader() {
    try {
        // magic: the first 4 bytes, expected to be CAFEBABE
        String magic = reader.readHexString(4);
        log("magic: %s", magic);

        // minor_version comes before major_version in the file
        int minor_version = reader.readUnsignedShort();
        log("minor_version: %d", minor_version);

        int major_version = reader.readUnsignedShort();
        log("major_version: %d", major_version);
    } catch (IOException e) {
        log("Parser header error:%s", e.getMessage());
    }
}

Constant pool parsing:

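private void parseConstantPool() {
    try {
        // constant_pool_count is one greater than the number of entries
        int constant_pool_count = reader.readUnsignedShort();
        log("constant_pool_count: %d", constant_pool_count);

        for (int i = 0; i < constant_pool_count - 1; i++) {

            // Every entry starts with a one-byte tag that selects its layout
            int tag = reader.readUnsignedByte();
            switch (tag) {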
                case ConstantTag.METHOD_REF:
                    ConstantMethodref methodRef = new ConstantMethodref();
                    methodRef.read(reader);
                    log("%s", methodRef.toString());
                    break;

                case ConstantTag.FIELD_REF:
                    ConstantFieldRef fieldRef = new ConstantFieldRef();
                    fieldRef.read(reader);
                    log("%s", fieldRef.toString());
                    break;

                case ConstantTag.STRING:
                    ConstantString string = new ConstantString();
                    string.read(reader);
                    log("%s", string.toString());
                    break;

                case ConstantTag.CLASS:
                    ConstantClass clazz = new ConstantClass();
                    clazz.read(reader);
                    log("%s", clazz.toString());
                    break;

                case ConstantTag.UTF8:
                    ConstantUtf8 utf8 = new ConstantUtf8();
                    utf8.read(reader);
                    log("%s", utf8.toString());
                    break;

                case ConstantTag.NAME_AND_TYPE:
                    ConstantNameAndType nameAndType = new ConstantNameAndType();
                    nameAndType.read(reader);
                    log("%s", nameAndType.toString());
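                    break;

                default:
                    // Only the tags that occur in Hello.class are handled here
                    throw new IOException("Unsupported constant pool tag: " + tag);
            }
        }
    } catch (IOException e) {
        log("Parser constant pool error:%s", e.getMessage());
    }
}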

Parsing the remaining information:

private void parseOther() {
    try {
        int access_flags = reader.readUnsignedShort();
        log("access_flags: %d", access_flags);

        int this_class = reader.readUnsignedShort();
        log("this_class: %d", this_class);

        int super_class = reader.readUnsignedShort();
        log("super_class: %d", super_class);

        int interfaces_count = reader.readUnsignedShort();
        log("interfaces_count: %d", interfaces_count);

        // TODO parse interfaces[]

        int fields_count = reader.readUnsignedShort();
        log("fields_count: %d", fields_count);

        List<Field> fieldList=new ArrayList<>();
        for (int i = 0; i < fields_count; i++) {
            Field field=new Field();
            field.read(reader);
            fieldList.add(field);
            log(field.toString());
        }

        int method_count=reader.readUnsignedShort();
        log("method_count: %d", method_count);

        List<Method> methodList=new ArrayList<>();
        for (int i = 0; i < method_count; i++) {
            Method method = new Method();
            method.read(reader);
            methodList.add(method);
            log(method.toString());
        }

        int attribute_count=reader.readUnsignedShort();
        log("attribute_count: %d", attribute_count);

        List<Attribute> attributeList = new ArrayList<>();
        for (int i = 0; i < attribute_count; i++) {
            Attribute attribute = new Attribute();
            attribute.read(reader);
            attributeList.add(attribute);
            log(attribute.toString());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Because attributes come in so many varieties, they are not parsed in detail here. The goal is only to deepen our understanding of the class file structure, and the result amounts to a bare-bones version of javap.

That covers the basics of the class file structure. The files used in this article and the source code of the class file parser are available in the android-reverse project.

The next article begins our study of the smali language with "Smali syntax parsing: Hello World".

Articles are published first on the WeChat official account "Bingxin said", which focuses on original Java and Android content and LeetCode solutions. Feel free to follow!