Java's cross-platform support is based on the Java virtual machine (JVM). Compiling Java source code produces a .class file, called a bytecode file. The JVM is responsible for translating bytecode files into machine code for a particular platform and then running it. To keep Class files portable across platforms, the Java specification defines a strict Class file format. Understanding the Class file structure helps when decompiling .class files or modifying bytecode for code injection at compile time.
Class file structure overview
Start by creating a Java class:
public class HelloWorld {
    private static int num = 0;
    public String name = "HelloWorld";

    public static void main(String[] args) {
        String[] strs = {"bigkai1", "bigkai2"};
        for (int i = 0; i < 10; i++) {
            num++;
            if (i == 5) continue;
            System.out.println("HelloWorld!");
        }
    }
}
Then go to the current class directory and run the javac command to generate the class file:
$ javac HelloWorld.java
We can see a HelloWorld.class file generated next to the Java file. Open this file using the class file parser classpy and see the overall structure of the file:
The overall structure of the Class file is:
ClassFile {
u4 magic;
u2 minor_version;
u2 major_version;
u2 constant_pool_count;
cp_info constant_pool[constant_pool_count-1];
u2 access_flags;
u2 this_class;
u2 super_class;
u2 interfaces_count;
u2 interfaces[interfaces_count];
u2 fields_count;
field_info fields[fields_count];
u2 methods_count;
method_info methods[methods_count];
u2 attributes_count;
attribute_info attributes[attributes_count];
}
I have drawn a simple diagram of the structure of the Class file:
In the JVM specification, Class files are described in a C-like notation that uses unsigned integers as basic data types: single-byte u1, 2-byte u2, 4-byte u4, and 8-byte u8.
The following is an analysis of each part of the file.
The magic number
The magic number is the identifier of a Class file. It is a 4-byte integer, and the virtual machine treats a file as a Class file only if its first four bytes are 0xCAFEBABE. Fixing an identifier at the start of a file is a common technique that is also used by other formats, such as zip files.
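The check described above can be sketched in a few lines. This is only an illustration, not how the virtual machine implements it; the class name MagicCheck and the method hasClassMagic are made up for this example.

```java
import java.nio.ByteBuffer;

class MagicCheck {
    static final int CLASS_MAGIC = 0xCAFEBABE;

    /** Returns true if the first four bytes, read as a big-endian u4, equal 0xCAFEBABE. */
    static boolean hasClassMagic(byte[] data) {
        if (data.length < 4) {
            return false;
        }
        // ByteBuffer is big-endian by default, which matches the Class file byte order.
        return ByteBuffer.wrap(data).getInt() == CLASS_MAGIC;
    }

    public static void main(String[] args) {
        byte[] good = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};
        byte[] bad = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBA};
        System.out.println(hasClassMagic(good)); // true
        System.out.println(hasClassMagic(bad));  // false
    }
}
```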
Check our Class file to see if it has this identifier:
When I manually change CA FE BA BE to CA FE BA BA, the VM will throw the following error when verifying the file:
The version number
After the magic number come two version fields: minor_version and major_version. Together, they indicate which version of the JDK the current Class file was compiled with. Here is an excerpt from the official Java documentation:
We can check the corresponding JDK version by version number:
In my Class file, the major version is 0x0037, which is 55 in decimal and corresponds to JDK 11.
For class files whose major_version is 56 or higher, minor_version must be 0 or 65535.
For class files with major_version between 45 and 55, minor_version can be any value.
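As a rough sketch of how the two version fields can be read and interpreted: the helper names below are hypothetical, and the rule "release = major_version - 44" is stated here only for Java 5 (major version 49) and later.

```java
import java.nio.ByteBuffer;

class ClassVersion {
    /** Reads {minor_version, major_version}, the two u2 fields that follow the 4-byte magic number. */
    static int[] readVersion(byte[] header) {
        ByteBuffer buf = ByteBuffer.wrap(header); // big-endian by default
        buf.getInt();                             // skip the magic number
        int minor = buf.getShort() & 0xFFFF;      // u2 minor_version
        int major = buf.getShort() & 0xFFFF;      // u2 major_version
        return new int[] {minor, major};
    }

    /** Maps a major version to a Java release number; valid for major >= 49 (Java 5). */
    static int javaRelease(int major) {
        return major - 44;
    }

    /** The first eight bytes of a Class file compiled for JDK 11. */
    static byte[] sampleHeader() {
        return new byte[] {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0x00, 0x00, 0x00, 0x37};
    }

    public static void main(String[] args) {
        int[] version = readVersion(sampleHeader());
        System.out.println("major=" + version[1] + " -> Java " + javaRelease(version[1]));
    }
}
```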
When I manually change the major version to 0x0039 (57), which corresponds to JDK 13, and then load the class file, an error is reported: my JDK is version 11, and a virtual machine only accepts class files up to its own version:
Constant pool
The constant pool is one of the most important parts of a Class file. It can be roughly divided into the static constant pool and the runtime constant pool. The static constant pool is stored in the Class file, while the runtime constant pool is the constant pool in the method area after the Class file has been loaded into memory. Here we are parsing the static constant pool.
The format of each entry in the static constant pool is:
cp_info {
u1 tag;
u1 info[];
}
The tag identifies the constant type of the entry. There are 17 constant types:
I parse the first item of the generated Class file constant pool:
It can be seen that its tag is 0A. According to the above table, it is a CONSTANT_Methodref. The structure is:
CONSTANT_Methodref_info {
u1 tag;
u2 class_index;
u2 name_and_type_index;
}
Then, based on the 0x000C after it, class_index is the 12th item in the constant pool.
The value of class_index is an index into the constant pool; the entry it refers to represents the class or interface type that has the field or method as a member.
- In the CONSTANT_Fieldref_info structure, the class_index entry can be a class type or an interface type.
- In the CONSTANT_Methodref_info structure, the class_index entry must be a class type, not an interface type.
- In the CONSTANT_InterfaceMethodref_info structure, the class_index entry must be an interface type, not a class type.
The next two bytes, 0x001C, are name_and_type_index: the constant pool index of the entry that holds the name and descriptor of the field or method.
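To make the byte layout concrete, here is a small sketch that decodes the five bytes 0A 00 0C 00 1C discussed above. The class and method names are invented for the example.

```java
import java.nio.ByteBuffer;

class MethodrefDecode {
    static final int CONSTANT_METHODREF = 10; // tag value 0x0A

    /** Decodes a 5-byte CONSTANT_Methodref_info entry into {class_index, name_and_type_index}. */
    static int[] decode(byte[] entry) {
        ByteBuffer buf = ByteBuffer.wrap(entry);        // big-endian by default
        int tag = buf.get() & 0xFF;                     // u1 tag
        if (tag != CONSTANT_METHODREF) {
            throw new IllegalArgumentException("not a CONSTANT_Methodref, tag=" + tag);
        }
        int classIndex = buf.getShort() & 0xFFFF;       // u2 class_index
        int nameAndTypeIndex = buf.getShort() & 0xFFFF; // u2 name_and_type_index
        return new int[] {classIndex, nameAndTypeIndex};
    }

    public static void main(String[] args) {
        int[] ref = decode(new byte[] {0x0A, 0x00, 0x0C, 0x00, 0x1C});
        System.out.println("class_index=" + ref[0] + ", name_and_type_index=" + ref[1]);
    }
}
```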
We look at its class from class_index:
You can see that its tag is 7, indicating CONSTANT_Class, and the structure is:
CONSTANT_Class_info {
u1 tag;
u2 name_index;
}
Its name_index indicates the name of the class, so let’s look at the item 0x0028:
It is CONSTANT_Utf8 and has the structure:
CONSTANT_Utf8_info {
u1 tag;
u2 length;
u1 bytes[length];
}
Its length is 0x0010 = 16, so reading the next 16 bytes gives its name: java/lang/Object.
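The CONSTANT_Utf8 layout (a u2 length followed by modified UTF-8 bytes) happens to be the same layout that DataInputStream.readUTF reads, so decoding such an entry can be sketched as follows. The entry bytes are built by hand here, and the class name Utf8Decode is only illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Utf8Decode {
    static final int CONSTANT_UTF8 = 1;

    /** Decodes one CONSTANT_Utf8_info entry: u1 tag, u2 length, then length bytes. */
    static String decode(byte[] entry) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(entry))) {
            int tag = in.readUnsignedByte();
            if (tag != CONSTANT_UTF8) {
                throw new IllegalArgumentException("not a CONSTANT_Utf8, tag=" + tag);
            }
            return in.readUTF(); // reads the u2 length and the modified UTF-8 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds the entry for java/lang/Object by hand: 01 00 10 followed by 16 ASCII bytes. */
    static byte[] sampleEntry() {
        byte[] text = "java/lang/Object".getBytes(StandardCharsets.US_ASCII);
        byte[] entry = new byte[3 + text.length];
        entry[0] = CONSTANT_UTF8;
        entry[1] = 0;
        entry[2] = (byte) text.length; // 0x10
        System.arraycopy(text, 0, entry, 3, text.length);
        return entry;
    }

    public static void main(String[] args) {
        System.out.println(decode(sampleEntry()));
    }
}
```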
Then we look at its name_and_type_index pointer, which corresponds to entry 28:
Tag =12 indicates that this is a CONSTANT_NameAndType with the structure:
CONSTANT_NameAndType_info {
u1 tag;
u2 name_index;
u2 descriptor_index;
}
We already know what name_index means, so let's go straight to descriptor_index, which refers to a valid field or method descriptor:
Field descriptors corresponding to different letters are:
A method descriptor consists of parameter descriptors and a return descriptor. The return descriptor uses the same letters as field descriptors, plus V for the return type void.
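One way to experiment with method descriptors is the standard java.lang.invoke.MethodType class, which can parse a descriptor string. The sketch below parses the descriptor that a main(String[]) method would have; the wrapper class DescriptorDemo is just for illustration.

```java
import java.lang.invoke.MethodType;

class DescriptorDemo {
    /** Parses a JVM method descriptor such as ([Ljava/lang/String;)V. */
    static MethodType parse(String descriptor) {
        // Passing null asks MethodType to resolve class names with the system class loader.
        return MethodType.fromMethodDescriptorString(descriptor, null);
    }

    public static void main(String[] args) {
        MethodType mainType = parse("([Ljava/lang/String;)V");
        System.out.println(mainType.parameterType(0).getSimpleName()); // String[]
        System.out.println(mainType.returnType());                     // void
        System.out.println(mainType.toMethodDescriptorString());       // ([Ljava/lang/String;)V
    }
}
```

Here [ marks an array dimension, L...; wraps a class name, and V after the closing parenthesis is the void return descriptor.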
Class access flags
After the constant pool come the class access flags, a u2 value that holds the access information of the class. The flags map to access modifiers as follows:
Each modifier is represented by setting a specific bit in the access flags. As you can see from the diagram, the value in my Class file is 0x0021:
So the class access flags are ACC_PUBLIC | ACC_SUPER (0x0021 = 0x0020 + 0x0001).
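The bit test behind this can be sketched as below. The flag values are the class-level ones from the JVM specification; the class ClassAccessFlags and its decode method are hypothetical helpers for this article.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ClassAccessFlags {
    // Class-level access flags, in ascending bit order.
    private static final Map<Integer, String> FLAGS = new LinkedHashMap<>();
    static {
        FLAGS.put(0x0001, "ACC_PUBLIC");
        FLAGS.put(0x0010, "ACC_FINAL");
        FLAGS.put(0x0020, "ACC_SUPER");
        FLAGS.put(0x0200, "ACC_INTERFACE");
        FLAGS.put(0x0400, "ACC_ABSTRACT");
        FLAGS.put(0x1000, "ACC_SYNTHETIC");
        FLAGS.put(0x2000, "ACC_ANNOTATION");
        FLAGS.put(0x4000, "ACC_ENUM");
        FLAGS.put(0x8000, "ACC_MODULE");
    }

    /** Returns the names of all flags whose bit is set in accessFlags. */
    static List<String> decode(int accessFlags) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Integer, String> flag : FLAGS.entrySet()) {
            if ((accessFlags & flag.getKey()) != 0) {
                names.add(flag.getValue());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(decode(0x0021)); // [ACC_PUBLIC, ACC_SUPER]
    }
}
```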
Class relation information
After the access flags come this_class, super_class (the topmost parent of all classes is Object), the number of implemented interfaces interfaces_count, and the interfaces table.
View my Class file:
The class is 11, the parent class is 12, and the number of interfaces is 0.
Seeing that they are of type CONSTANT_Class, look up their names based on the last two bytes:
They are both of type CONSTANT_Utf8 and, read according to that structure, contain HelloWorld and java/lang/Object respectively. So the class name is HelloWorld, its parent class is java.lang.Object, and it does not implement any interface.
Field information
After the class information comes the field information, which consists of the field count (fields_count) and the field table (field_info). The field count is a u2 value; the structure of the field table is the main thing to look at:
field_info {
u2 access_flags; // Field access flags
u2 name_index; // Field name
u2 descriptor_index; // Descriptor
u2 attributes_count; // Number of field attributes
attribute_info attributes[attributes_count]; // Field attribute table
}
Field access flags: similar to the class access flags and evaluated in the same way.
Field name: an index into the constant pool.
Descriptor: describes the field type and is an index into the constant pool. The field types are:
Number of field attributes: the number of attributes the field has. Attributes hold additional information about the field, such as initial values, annotations, and so on.
Field attributes: the actual contents of the attributes.
Take the Class file I generated:
There are two fields. The first field's bytecode is 00 0A 00 0D 00 0E 00 00: the access flags are ACC_PRIVATE | ACC_STATIC (00 0A = 00 02 + 00 08), the field name index is 00 0D, the descriptor index is 00 0E, and the number of attributes is 00 00.
The contents of the attribute table are described in the Methods section.
Methods
The method section of a Class file consists of the method count and the method table. The method count is a u2 value, followed by the method information, whose structure is as follows:
method_info {
u2 access_flags; // Access flags
u2 name_index; // Method name
u2 descriptor_index; // Descriptor
u2 attributes_count; // Number of attributes
attribute_info attributes[attributes_count]; // Attribute content
}
Methods have many more access tags than fields:
name_index is the index of the method name, and descriptor_index refers to the method descriptor (parameters, return value, etc.). In the constant pool, a method descriptor is written as (parameter1 parameter2 ...) followed by the return type.
Focus on attribute_info, which is structured as follows:
attribute_info {
u2 attribute_name_index; // Attribute name
u4 attribute_length; // Attribute length
u1 info[attribute_length]; // Attribute content
}
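Because every attribute starts with the same name index and length, a reader can skip attributes it does not understand. The sketch below walks a run of attribute_info structures in that way; the sample bytes and constant pool indices are invented, and lengths are assumed to fit in a signed int.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class AttributeWalker {
    /** Returns {attribute_name_index, attribute_length} per attribute, skipping the info bytes. */
    static List<int[]> walk(byte[] data, int attributesCount) {
        ByteBuffer buf = ByteBuffer.wrap(data); // big-endian by default
        List<int[]> result = new ArrayList<>();
        for (int i = 0; i < attributesCount; i++) {
            int nameIndex = buf.getShort() & 0xFFFF; // u2 attribute_name_index
            int length = buf.getInt();               // u4 attribute_length (assumed < 2^31)
            buf.position(buf.position() + length);   // skip info[attribute_length]
            result.add(new int[] {nameIndex, length});
        }
        return result;
    }

    /** Hypothetical bytes: one attribute with a 2-byte body and one with an empty body. */
    static byte[] sample() {
        return new byte[] {
            0x00, 0x11, 0x00, 0x00, 0x00, 0x02, 0x00, 0x12, // name #17, length 2, info = 00 12
            0x00, 0x13, 0x00, 0x00, 0x00, 0x00              // name #19, length 0
        };
    }

    public static void main(String[] args) {
        for (int[] attribute : walk(sample(), 2)) {
            System.out.println("name_index=" + attribute[0] + ", length=" + attribute[1]);
        }
    }
}
```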
There are various attributes:
Take a quick look at common attributes:
Some of the following attributes are not required at runtime; you can disable or request their generation with the -g:none or -g:vars options of javac, respectively.
Code
The Code attribute stores the bytecode and other information of the method and is the execution body of the method. The structure of the Code attribute is:
Code_attribute {
u2 attribute_name_index; // Attribute name -- fixed to Code
u4 attribute_length; // Attribute length (excluding the first 6 bytes)
u2 max_stack; // Maximum depth of operand stack
u2 max_locals; // Maximum number of local variables
u4 code_length; // Method bytecode length
u1 code[code_length]; // Bytecode content
u2 exception_table_length; // The exception processing table length
/* If an exception of the type given by catch_type is thrown in the code between offset start_pc (inclusive) and offset end_pc (exclusive) of the method bytecode, execution jumps to handler_pc. */
{ u2 start_pc;
u2 end_pc;
u2 handler_pc;
u2 catch_type;
} exception_table[exception_table_length]; // Exception handling table contents
u2 attributes_count; // Number of attributes
attribute_info attributes[attributes_count]; // Attribute content
}
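The exception table lookup described in the comment above can be sketched as follows. This is a simplification: a real virtual machine checks whether the thrown class is a subclass of catch_type, while this example only compares constant pool indices. All names and sample values are hypothetical.

```java
class ExceptionTableLookup {
    /** One exception_table row; catchType 0 means "catch everything", as used for finally blocks. */
    static final class Entry {
        final int startPc;
        final int endPc;
        final int handlerPc;
        final int catchType;

        Entry(int startPc, int endPc, int handlerPc, int catchType) {
            this.startPc = startPc;
            this.endPc = endPc;
            this.handlerPc = handlerPc;
            this.catchType = catchType;
        }
    }

    /** Returns handler_pc of the first matching row, or -1 if the exception is not handled here. */
    static int findHandler(Entry[] table, int pc, int thrownType) {
        for (Entry entry : table) {
            boolean inRange = pc >= entry.startPc && pc < entry.endPc; // end_pc is exclusive
            // Simplification: exact index match instead of a subclass check.
            boolean typeMatches = entry.catchType == 0 || entry.catchType == thrownType;
            if (inRange && typeMatches) {
                return entry.handlerPc;
            }
        }
        return -1;
    }

    /** Hypothetical table: offsets 0..7 are guarded, the handler starts at 11, catch_type is #5. */
    static Entry[] sample() {
        return new Entry[] {new Entry(0, 8, 11, 5)};
    }

    public static void main(String[] args) {
        System.out.println(findHandler(sample(), 4, 5)); // 11
        System.out.println(findHandler(sample(), 8, 5)); // -1
    }
}
```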
ConstantValue
The ConstantValue attribute tells the virtual machine to assign a value to a static variable automatically. Only variables decorated with the static keyword (class variables) can use this attribute. Its structure is as follows:
ConstantValue_attribute {
u2 attribute_name_index; // Fixed to ConstantValue
u4 attribute_length; // Fixed to 2
u2 constantvalue_index; // A valid index into the constant pool
}
If the ACC_STATIC flag is set in the access_flags item of the field_info structure, the field is assigned the value represented by its ConstantValue attribute as part of the initialization of the class or interface that declares the field. This happens before the class or interface initialization method of that class or interface is invoked.
Signature
The Signature attribute was added in JDK 1.5 and can appear in the attribute tables of the class, field, and method structures. It records generic signature information for any class, interface, initializer, or member whose generic signature contains type variables or parameterized types. Its structure is as follows:
Signature_attribute {
u2 attribute_name_index; // Fixed to Signature
u4 attribute_length; // Fixed to 2
/* If the Signature attribute is an attribute of the ClassFile structure, the constant pool entry at this index must represent a class signature. If it is an attribute of a method_info structure, the entry must be a method signature. Otherwise, it must be a field signature. */
u2 signature_index; // A valid index into the constant pool
}
The reason for using a dedicated attribute to record generic types is that generics in the Java language are pseudo-generics implemented by type erasure. In the bytecode (the Code attribute), generic information (type variables, parameterized types) is erased after compilation. The benefits of erasure are a simple implementation (mainly changes to the javac compiler, with few internal changes to the virtual machine), easy backporting, and runtime memory savings for some types. The downside is that the runtime cannot treat generic types as ordinary user-defined types in the way that languages with true generics support, such as C#, do. For example, without extra information the runtime could not obtain generic information through reflection. The Signature attribute was added to compensate for this shortcoming, and Java's reflection API is now able to retrieve generic types from this attribute.
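The effect is easy to observe through reflection. In this sketch, the field names is a hypothetical example: its erased type is what the descriptor records, while the generic type is the information that the reflection API recovers.

```java
import java.lang.reflect.Field;
import java.util.List;

class SignatureDemo {
    // Hypothetical field: the descriptor only says Ljava/util/List; after erasure.
    static List<String> names;

    private static Field namesField() {
        try {
            return SignatureDemo.class.getDeclaredField("names");
        } catch (NoSuchFieldException e) {
            throw new IllegalStateException(e);
        }
    }

    /** The erased type, as recorded by the field descriptor. */
    static String erasedType() {
        return namesField().getType().getName();
    }

    /** The generic type, recovered from the generic signature information. */
    static String genericType() {
        return namesField().getGenericType().getTypeName();
    }

    public static void main(String[] args) {
        System.out.println(erasedType());  // java.util.List
        System.out.println(genericType()); // java.util.List<java.lang.String>
    }
}
```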
LineNumberTable
LineNumberTable records the mapping between bytecode offsets and source line numbers. It is not required at runtime, but is generated in the Class file by default. If it is not generated, line numbers are not shown in stack traces when an error occurs, and breakpoints cannot be set by source line when debugging the program. The LineNumberTable attribute has the following structure:
LineNumberTable_attribute {
u2 attribute_name_index; // Fixed to LineNumberTable
u4 attribute_length; // Attribute length
u2 line_number_table_length; // Number of entries
{ u2 start_pc; // Bytecode offset
u2 line_number; // Line number
} line_number_table[line_number_table_length]; // Entry content
}
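A stack trace or debugger resolves a bytecode offset to a source line by finding the entry with the greatest start_pc that does not exceed the offset. A minimal sketch, assuming the entries are sorted by start_pc and using an invented table:

```java
class LineNumberLookup {
    /**
     * Each row is {start_pc, line_number}. Rows are assumed to be sorted by start_pc;
     * this is an assumption of the sketch, not a guarantee of the format.
     * Returns -1 if no row covers the offset.
     */
    static int lineFor(int[][] table, int pc) {
        int line = -1;
        for (int[] row : table) {
            if (row[0] > pc) {
                break;
            }
            line = row[1];
        }
        return line;
    }

    /** Hypothetical table: offsets 0-14 belong to line 7, 15-22 to line 8, 23 and up to line 9. */
    static int[][] sample() {
        return new int[][] {{0, 7}, {15, 8}, {23, 9}};
    }

    public static void main(String[] args) {
        System.out.println(lineFor(sample(), 20)); // 8
    }
}
```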
LocalVariableTable
The LocalVariableTable attribute describes the local variables of a method. It is not required at runtime, and javac emits it only when invoked with -g or -g:vars. If this attribute is not generated, all parameter names are lost when the method is referenced by someone else, and the IDE shows placeholders such as arg0 and arg1 instead of the parameter names. Its structure is as follows:
LocalVariableTable_attribute {
u2 attribute_name_index; // Fixed to LocalVariableTable
u4 attribute_length; // Attribute length
u2 local_variable_table_length; // Number of entries
{ u2 start_pc; // Bytecode offset
u2 length; // Length of the bytecode range in which the variable has a value
u2 name_index; // Local variable name
u2 descriptor_index; // Local variable descriptor
u2 index; // The slot of the local variable in the local variable table of the current stack frame
} local_variable_table[local_variable_table_length]; // Entry content
}
StackMapTable
StackMapTable is an attribute introduced in JDK 1.6, located in the attribute table of the Code attribute. It holds the data of several stack map frames. This attribute is not required at runtime and is used only for type verification of the class. It is used by the new type checker during the bytecode verification phase of class loading, which is intended to replace the older, more expensive type-inference verifier based on data flow analysis. Its structure is as follows:
StackMapTable_attribute {
u2 attribute_name_index; // Fixed to StackMapTable
u4 attribute_length; // Attribute length
u2 number_of_entries; // Number of stack map frames
stack_map_frame entries[number_of_entries]; // Stack map frame details
}
stack_map_frame has the following structure:
union stack_map_frame {
// same_frame { u1 frame_type = SAME; /* 0-63 */ }
same_frame; // The local variable table is the same as in the previous frame, and the operand stack is empty
// same_locals_1_stack_item_frame { u1 frame_type = SAME_LOCALS_1_STACK_ITEM; /* 64-127 */ verification_type_info stack[1]; }
same_locals_1_stack_item_frame; // The current frame has the same local variables as the previous frame, and the operand stack holds 1 item
// same_locals_1_stack_item_frame_extended { u1 frame_type = SAME_LOCALS_1_STACK_ITEM_EXTENDED; /* 247 */ u2 offset_delta; verification_type_info stack[1]; }
same_locals_1_stack_item_frame_extended; // Same as same_locals_1_stack_item_frame, but with an explicit offset_delta for values too large for that frame type
// chop_frame { u1 frame_type = CHOP; /* 248-250 */ u2 offset_delta; }
chop_frame; // The operand stack is empty and the local variable table has K (K = 251 - frame_type) fewer local variables than the previous frame
// same_frame_extended { u1 frame_type = SAME_FRAME_EXTENDED; /* 251 */ u2 offset_delta; }
same_frame_extended; // Same as same_frame, but with an explicit offset_delta to support larger offsets
// append_frame { u1 frame_type = APPEND; /* 252-254 */ u2 offset_delta; verification_type_info locals[frame_type - 251]; }
append_frame; // The current frame has K (K = frame_type - 251) more local variables than the previous frame, and the operand stack is empty
// full_frame { u1 frame_type = FULL_FRAME; /* 255 */ u2 offset_delta; u2 number_of_locals; verification_type_info locals[number_of_locals]; u2 number_of_stack_items; verification_type_info stack[number_of_stack_items]; }
full_frame; // Complete local variable table and operand stack
}
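Since the frame kind is determined entirely by the value range of frame_type, the dispatch can be sketched as a simple range check. The class and method names below are illustrative only; the range 128-246 is treated as reserved.

```java
class StackMapFrameKind {
    /** Maps a u1 frame_type value to the name of the stack map frame it introduces. */
    static String kindOf(int frameType) {
        if (frameType < 0 || frameType > 255) {
            throw new IllegalArgumentException("frame_type is a u1: " + frameType);
        }
        if (frameType <= 63) return "same_frame";
        if (frameType <= 127) return "same_locals_1_stack_item_frame";
        if (frameType <= 246) return "reserved";
        if (frameType == 247) return "same_locals_1_stack_item_frame_extended";
        if (frameType <= 250) return "chop_frame";
        if (frameType == 251) return "same_frame_extended";
        if (frameType <= 254) return "append_frame";
        return "full_frame";
    }

    /** Number of locals removed by a chop_frame (frame_type 248-250). */
    static int choppedLocals(int frameType) {
        return 251 - frameType;
    }

    /** Number of locals added by an append_frame (frame_type 252-254). */
    static int appendedLocals(int frameType) {
        return frameType - 251;
    }

    public static void main(String[] args) {
        System.out.println(kindOf(249) + " removes " + choppedLocals(249) + " locals");
        System.out.println(kindOf(253) + " adds " + appendedLocals(253) + " locals");
    }
}
```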
Exceptions
In addition to the Code attribute, each method can have an Exceptions attribute that holds the exceptions that the method may throw. Its structure is as follows:
Exceptions_attribute {
u2 attribute_name_index; // Fixed to Exceptions
u4 attribute_length; // Attribute length
u2 number_of_exceptions; // Number of entries, the number of exceptions that may be thrown
u2 exception_index_table[number_of_exceptions]; // All exceptions are stored here; each entry is an index into the constant pool
}
Note: Method Exceptions indicate the Exceptions that a method may throw. They are usually specified by the throws keyword. The exception table within Code is an exception handling mechanism generated by a try-catch statement.
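The difference between the two can be seen with reflection, which reports only the exceptions declared with throws. In this sketch, both methods are hypothetical: declared() lists IOException in its throws clause, while handled() deals with it in a try-catch and therefore declares nothing.

```java
import java.io.IOException;
import java.lang.reflect.Method;

class ExceptionsDemo {
    /** Declares IOException with throws, so the method carries a declared exception. */
    static void declared() throws IOException {
        throw new IOException("declared with throws");
    }

    /** Handles the exception locally; this only affects the exception table inside Code. */
    static void handled() {
        try {
            declared();
        } catch (IOException e) {
            // handled locally, nothing is declared on the method itself
        }
    }

    /** Returns how many exception types the named no-argument method declares. */
    static int declaredCount(String methodName) {
        try {
            Method method = ExceptionsDemo.class.getDeclaredMethod(methodName);
            return method.getExceptionTypes().length;
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(declaredCount("declared")); // 1
        System.out.println(declaredCount("handled"));  // 0
    }
}
```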
Method bytecode analysis
A simple analysis of the methods section of the Class file I generated:
For the main method, you can see that its access flags are 00 09 (ACC_STATIC | ACC_PUBLIC), name_index points to entry 15 of the constant pool, descriptor_index points to entry 16, and the number of attributes is 00 01. Then look at its attribute table:
The maximum operand stack depth is 4, the maximum number of local variables is 3 (args, strs and i), and the length of the method bytecode is 54. code stores the instructions for the virtual machine to execute. Here we mainly analyze its code content.
Instructions
The JVM instruction set contains many instructions, which can be broadly divided into:
- Const series: pushes simple numeric constants to the top of the stack. For int values, only -1, 0, 1, 2, 3, 4 and 5 can be pushed this way; other int values use the push series.
- Push series: pushes a small integer to the top of the stack. The instruction takes one operand, the number to push; ldc is used for values that are out of range.
- Ldc series: pushes a numeric or String constant from the constant pool to the top of the stack. The instruction is followed by an operand that gives the index of the constant in the constant pool.
Numeric constants outside the range of the const and push instructions, as well as any string not created with new, are placed in the constant pool.
- Load series:
  - Load series A: pushes a local variable to the top of the stack. The local variable can be a reference type as well as a numeric type.
  - Load series B: pushes an array element to the top of the stack. Which element of which array is used is determined by the contents of the stack.
- Store series:
  - Store series A: stores the value at the top of the stack into a local variable. The local variable can be a reference type as well as a numeric type.
  - Store series B: stores the value at the top of the stack into an array element. Which element of which array is used is determined by the contents of the stack.
- Pop series: pops the top of the stack (or duplicates the top of the stack and pushes the copy).
- Type conversion series: instructions dedicated to type conversion. Their mnemonics have the form x2y, where x can be i, f, l, d and y can be i, f, l, d, c, s, b.
- Arithmetic series: provides the basic addition, subtraction, multiplication, and division operations.
- Object operation series: object manipulation instructions, which can be further subdivided into creation instructions, field access instructions, type checking instructions, and array manipulation instructions.
- Control series: conditional control. It is generally divided into comparison instructions, conditional jump instructions, compare-and-jump instructions, multi-way branch instructions, unconditional jump instructions and so on.
- Function series: method invocation instructions and method return instructions.
- Synchronization control series: the Java virtual machine provides monitorenter and monitorexit to enter and exit critical sections and thereby achieve multi-threaded synchronization.
Specific instructions can be found on the CSDN JVM instruction Set Collation blog.
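The choice between the const, push and ldc series for an int constant depends only on the value range. The following sketch mirrors that rule; it is not taken from any compiler's source, and the class PushInstruction is made up for this example.

```java
class PushInstruction {
    /** Returns the instruction family used to push the given int constant. */
    static String forInt(int value) {
        if (value == -1) {
            return "iconst_m1";
        }
        if (value >= 0 && value <= 5) {
            return "iconst_" + value;   // iconst_0 .. iconst_5
        }
        if (value >= Byte.MIN_VALUE && value <= Byte.MAX_VALUE) {
            return "bipush " + value;   // one-byte operand
        }
        if (value >= Short.MIN_VALUE && value <= Short.MAX_VALUE) {
            return "sipush " + value;   // two-byte operand
        }
        return "ldc";                   // value is stored in the constant pool
    }

    public static void main(String[] args) {
        System.out.println(forInt(5));      // iconst_5
        System.out.println(forInt(10));     // bipush 10
        System.out.println(forInt(1000));   // sipush 1000
        System.out.println(forInt(100000)); // ldc
    }
}
```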
Instruction bytecode analysis
Now that you know the bytecode instructions, let’s put the generated Class file to the test:
The actual code is:
private static int num = 0;

public static void main(String[] args) {
    String[] strs = {"bigkai1", "bigkai2"};
    for (int i = 0; i < 10; i++) {
        num++;
        if (i == 5) continue;
        System.out.println("HelloWorld!");
    }
}
anewarray creates an array of references (its length, 2 here, is taken from the top of the stack) and pushes the array reference to the top of the stack. dup duplicates the top value and pushes the copy. iconst_0 pushes the int 0 to the top of the stack. ldc then pushes a String constant from the constant pool to the top of the stack; its operand 05 refers to the constant pool entry for bigkai1. aastore stores the reference at the top of the stack at the given index of the given array. At this point, the first element bigkai1 is stored in the string array strs.
Next, dup duplicates the array reference at the top of the stack again, iconst_1 pushes the int 1, ldc fetches bigkai2 from the constant pool, and aastore pops the array reference, index and value and assigns bigkai2 to the second element of the array. This completes the assignment of all strings to the array.
iconst_0 pushes the int 0 and istore_2 stores it in local variable 2 (i = 0). iload_2 pushes local variable 2 back to the top of the stack, and bipush pushes a single-byte constant (-128 to 127) to the top, here 10. if_icmpge compares the two int values at the top of the stack: if the first (i) is greater than or equal to the second (10), it jumps to offset 53 (return). Otherwise, getstatic gets the static field of the specified class (num, referenced through the constant pool) and pushes its value to the top of the stack, iconst_1 pushes the int 1, iadd adds the two values, and putstatic writes the result back to num. This completes num++.
iload_2 pushes local variable 2 (i) to the top of the stack, then iconst_5 pushes the int 5, and if_icmpne compares the two ints at the top of the stack. If they are not equal, it jumps to offset 39; otherwise, goto jumps to offset 47. The instruction at offset 47 is iinc, which increments an int local variable by a given amount. It takes two operands: index, the slot of the int local variable, and const, the increment. After it, goto jumps back to offset 17. This implements if (i == 5) continue.
If if_icmpne finds that the two values are not equal, execution continues at offset 39, which is getstatic; it gets java/lang/System.out through the constant pool. ldc then pushes HelloWorld! to the top of the stack, and invokevirtual is executed, which invokes an instance method that is dispatched on the actual type of the object and therefore supports polymorphism. This completes System.out.println("HelloWorld!").
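To check the walk-through, the loop part of the bytecode can be replayed with a tiny hand-written simulation that keeps an operand stack, a local variable array and a program counter. Only the offsets 17, 39, 47 and 53 are named in the text; the remaining offsets are assumed from the instruction sizes, and the three instructions of the println call are collapsed into a counter. This is an illustration, not an interpreter.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class LoopTrace {
    static int num;     // stands in for the static field HelloWorld.num
    static int printed; // counts how often the println call site is reached

    static void run() {
        num = 0;
        printed = 0;
        Deque<Integer> stack = new ArrayDeque<>(); // operand stack
        int[] locals = new int[3];                 // slot 2 holds i
        int pc = 15;                               // assumed offset of the loop initialisation
        while (true) {
            switch (pc) {
                case 15: stack.push(0); pc = 16; break;                          // iconst_0
                case 16: locals[2] = stack.pop(); pc = 17; break;                // istore_2
                case 17: stack.push(locals[2]); pc = 18; break;                  // iload_2
                case 18: stack.push(10); pc = 20; break;                         // bipush 10
                case 20: {                                                       // if_icmpge 53
                    int right = stack.pop();
                    int left = stack.pop();
                    pc = (left >= right) ? 53 : 23;
                    break;
                }
                case 23: stack.push(num); pc = 26; break;                        // getstatic num
                case 26: stack.push(1); pc = 27; break;                          // iconst_1
                case 27: stack.push(stack.pop() + stack.pop()); pc = 28; break;  // iadd
                case 28: num = stack.pop(); pc = 31; break;                      // putstatic num
                case 31: stack.push(locals[2]); pc = 32; break;                  // iload_2
                case 32: stack.push(5); pc = 33; break;                          // iconst_5
                case 33: {                                                       // if_icmpne 39
                    int right = stack.pop();
                    int left = stack.pop();
                    pc = (left != right) ? 39 : 36;
                    break;
                }
                case 36: pc = 47; break;                                         // goto 47 (continue)
                case 39: printed++; pc = 47; break;                              // getstatic, ldc, invokevirtual
                case 47: locals[2] += 1; pc = 50; break;                         // iinc 2, 1
                case 50: pc = 17; break;                                         // goto 17
                case 53: return;                                                 // return
                default: throw new IllegalStateException("unexpected pc " + pc);
            }
        }
    }

    public static void main(String[] args) {
        run();
        System.out.println("num=" + num + ", println calls=" + printed); // num=10, println calls=9
    }
}
```

The loop body runs for i = 0 to 9, so num ends at 10, and the call site at offset 39 is skipped once (i == 5), leaving 9 calls.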
Class file properties
The Class file also comes with a number of attributes, consisting of the length and content of the attributes. The main attributes are:
SourceFile
The SourceFile attribute records the name of the source file from which the current Class file was compiled.
SourceFile_attribute {
u2 attribute_name_index; // Fixed to SourceFile
u4 attribute_length; // Attribute length, fixed to 2
u2 sourcefile_index; // Constant pool index of the source file name
}
BootstrapMethods
The BootstrapMethods attribute supports the invokedynamic instruction: it describes and stores the bootstrap methods that the instruction refers to.
invokedynamic is an instruction added in JDK 1.7 to support dynamically typed languages, in which most type checking is done at run time rather than at compile time; Python is a typical example.
A bootstrap method can simply be thought of as a method that finds a method.
BootstrapMethods_attribute {
u2 attribute_name_index; // Fixed to BootstrapMethods
u4 attribute_length; // Total length of the attribute (excluding the first 6 bytes)
u2 num_bootstrap_methods; // The number of bootstrap methods held in this class
{ u2 bootstrap_method_ref; // Constant pool index of the bootstrap method
u2 num_bootstrap_arguments; // Number of bootstrap arguments
u2 bootstrap_arguments[num_bootstrap_arguments]; // Bootstrap method arguments
} bootstrap_methods[num_bootstrap_methods];
}
InnerClasses
It is used to describe the relationship between an outer class and an inner class:
InnerClasses_attribute {
u2 attribute_name_index; // Fixed to InnerClasses
u4 attribute_length; // Attribute length
u2 number_of_classes; // Number of inner classes
{ u2 inner_class_info_index; // Inner class type
u2 outer_class_info_index; // Outer class type
u2 inner_name_index; // Inner class name
u2 inner_class_access_flags; // Inner class access flags
} classes[number_of_classes]; // Inner class content
}
Access identifiers for inner classes support the following:
Deprecated
Deprecated is used in a class, method, or field structure to indicate that the class, method, or field is deprecated and may be removed in a future release. Its structure is as follows:
Deprecated_attribute {
u2 attribute_name_index; // Fixed to Deprecated
u4 attribute_length; // Fixed to 0
}
This attribute is generated when a class, method, or field is marked @Deprecated.
Conclusion
With Class files, any language can be compiled from source code into Class files and eventually executed on the virtual machine, as long as its compiler follows the Class file specification.