You may have written thousands of lines of code and be fluent in a high-level language, yet not know how that language works under the hood. Take Java, one of the most popular languages, as an example.

Java is known as a “compile once, run anywhere” language, but what does that statement really rest on? Compilation takes the Java source files we write and produces Java bytecode files (.class files); the Java virtual machine then executes those bytecode files. It does not matter where a bytecode file came from, which compiler produced it, or even whether it was written by hand: as long as it complies with the Java Virtual Machine specification, the virtual machine can execute it. This article focuses on the Java bytecode file. Let’s use a concrete Demo to get a deeper understanding:

1 First let’s write a Java source file





javasrc.png

Here is a simple Java program that has only one member variable a and one method testMethod().

2 Next we compile this Java source file into a Java bytecode file, using the javac command or an IDE.





demo.png

Above is the compiled bytecode file, shown as a series of hexadecimal bytes. If you open the same file in an IDE, you may instead see familiar Java code, because the IDE decompiles it for you. The raw bytes shown here are the actual bytecode file, and they are the focus of this article.

All these bytes may give you a headache at first, but that’s okay. We’ll make sense of them piece by piece, and you may come away seeing class files differently. Let’s look at a picture before we start.





java_byte.jpeg

This diagram is an overview of the Java bytecode file, and we will interpret the bytes in exactly this order. There are 10 parts in total, including the magic number, the version number, the constant pool, and so on, and we will explain them one by one.

3.1 Magic number

From the overview diagram above we know that the first 4 bytes are the magic number; in our Demo they are 0xCAFEBABE. What is a magic number? It is a marker that identifies the file type, usually stored in the first few bytes of the file, and 0xCAFEBABE marks a class file. Couldn’t the file type be determined from the filename suffix instead? It could, but the file name (including the suffix) can be changed freely, so the type is also written inside the file, where it identifies the format no matter what the file is called. As for the value itself, CAFE BABE reads as “cafe babe”, a nod to coffee; compare it with the Java icon.





java_icon.png

CAFE BABE = coffee.
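The magic number check is easy to try out. Below is a minimal sketch, not code from the original Demo: the class name MagicCheck and the method looksLikeClassFile are made up for illustration, and it only looks at the first four bytes of a byte array.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration; not part of the JDK.
class MagicCheck {
    static final int CLASS_MAGIC = 0xCAFEBABE;

    /** Returns true if the data starts with the four bytes CA FE BA BE. */
    static boolean looksLikeClassFile(byte[] data) {
        // ByteBuffer is big-endian by default, which matches the class file byte order.
        return data.length >= 4 && ByteBuffer.wrap(data).getInt() == CLASS_MAGIC;
    }

    public static void main(String[] args) {
        // First eight bytes of the Demo class file from this article.
        byte[] demoHeader = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0x00, 0x00, 0x00, 0x33};
        // Content of some other file that was merely renamed to ".class".
        byte[] renamedText = "not a class".getBytes();

        System.out.println(looksLikeClassFile(demoHeader));   // prints true
        System.out.println(looksLikeClassFile(renamedText));  // prints false
    }
}
```

Renaming a text file to Demo.class does not change its first four bytes, which is exactly why the check looks inside the file rather than at its name.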

3.2 Version number

Once we have identified the file type, the next step is the version number. It consists of a minor version number followed by a major version number, each 2 bytes long. In our Demo these bytes are 0x0000 0033: the leading 0000 is the minor version and the following 0033 is the major version. Converted to decimal, the minor version is 0 and the major version is 51. According to Oracle’s documentation, major version 51 corresponds to JDK 1.7, and since the minor version is 0, the file version is 1.7.0. To verify this, run the java -version command to see which JDK you are using, or recompile with a different -target option and check that the version number in the compiled bytecode file changes accordingly.
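The base conversion above can be reproduced in a few lines. This is a small sketch under stated assumptions: ClassVersion, readVersion, and javaRelease are hypothetical names, and the mapping relies on major version 45 corresponding to JDK 1.1 with each later release adding one.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration; not part of the JDK.
class ClassVersion {
    /** Reads the version from the first 8 bytes of a class file. Returns {minor, major}. */
    static int[] readVersion(byte[] header) {
        ByteBuffer buf = ByteBuffer.wrap(header);   // big-endian by default
        buf.getInt();                               // skip the 4-byte magic number
        int minor = buf.getShort() & 0xFFFF;        // u2, masked so it is treated as unsigned
        int major = buf.getShort() & 0xFFFF;        // u2
        return new int[] {minor, major};
    }

    /** Major 45 is JDK 1.1 and each release adds one, so 51 maps to 7 (JDK 1.7). */
    static int javaRelease(int major) {
        return major - 44;
    }

    public static void main(String[] args) {
        byte[] demoHeader = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0x00, 0x00, 0x00, 0x33};
        int[] version = readVersion(demoHeader);
        // prints minor=0 major=51 release=7
        System.out.println("minor=" + version[0] + " major=" + version[1]
                + " release=" + javaRelease(version[1]));
    }
}
```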

Now that we know what the first eight bytes mean, let’s talk about constant pools.

3.3 Constant pool

Immediately after the major version number comes the constant pool. The constant pool is the resource repository of the class file, and many other parts of the file refer to it, for example for class names and interface names. It stores two main kinds of constants: literals and symbolic references. Literals are things like text strings and the values of constants declared final in Java. Symbolic references are things like the fully qualified names of classes and interfaces, field names and descriptors, and method names and descriptors.

Why are fully qualified names of classes and interfaces needed? Doesn’t the system work with memory addresses when it references a class or interface? Think about it: before a class has been loaded, the Java virtual machine has not allocated any memory for it, so there is no address to work with. The virtual machine must first load the class, and loading involves locating exactly the right class (class B in package A, not some other class in another package). A fully qualified name identifies a class uniquely, which is why it is called fully qualified.

Before we go into the specific constant pool analysis, let’s take a look at the constant pool item type table:





jvm_constant.png

The table above describes the structure of 11 constant types; JDK 1.7 added three more (CONSTANT_MethodHandle_info, CONSTANT_MethodType_info, and CONSTANT_InvokeDynamic_info), for a total of 14. Next, we translate the Demo bytecode item by item.

0x0015: The number of constants is not fixed (the constant pool occupies 2+n bytes), so the pool starts with a u2 item that holds the constant count. The hexadecimal value 0x0015 is 21 in decimal, which means 20 constants with indexes 1 to 20. Why 20 when the count is 21? The class file format reserves index 0: an index of 0 means “refers to no constant pool entry”. So we have 20 constants to translate.

Constant #1 (the first of the 20, and so on)
0x0A: every constant starts with a u1 tag. Hexadecimal 0A is decimal 10, which is Methodref_info in the table
0x0004: Class_info index, item #4
0x0011: NameAndType index, item #17

Constant #2
0x09: Fieldref_info
0x0003: Class_info index, item #3
0x0012: NameAndType index, item #18

Constant #3
0x07: Class_info
0x0013: fully qualified name, constant index #19

Constant #4
0x07: Class_info
0x0014: fully qualified name, constant index #20

Constant #5
0x01: Utf8_info
0x0001: the string is 1 byte long (take that many of the following bytes and convert them)
0x61: “a” (hexadecimal to ASCII)

Constant #6
0x01: Utf8_info
0x0001: the string is 1 byte long
0x49: “I”

Constant #7
0x01: Utf8_info
0x0006: the string is 6 bytes long
0x3C 696E 6974 3E: “<init>”

Constant #8
0x01: Utf8_info
0x0003: the string is 3 bytes long
0x2829 56: “()V”

Constant #9
0x01: Utf8_info
0x0004: the string is 4 bytes long
0x436F 6465: “Code”

Constant #10
0x01: Utf8_info
0x000F: the string is 15 bytes long
0x4C 696E 654E 756D 6265 7254 6162 6C65: “LineNumberTable”

Constant #11
0x01: Utf8_info
0x0012: the string is 18 bytes long
0x4C 6F63 616C 5661 7269 6162 6C65 5461 626C 65: “LocalVariableTable”

Constant #12
0x01: Utf8_info
0x0004: the string is 4 bytes long
0x7468 6973: “this”

Constant #13
0x01: Utf8_info
0x000F: the string is 15 bytes long
0x4C 636F 6D2F 6465 6D6F 2F44 656D 6F3B: “Lcom/demo/Demo;”

Constant #14
0x01: Utf8_info
0x000A: the string is 10 bytes long
0x74 6573 744D 6574 686F 64: “testMethod”

Constant #15
0x01: Utf8_info
0x000A: the string is 10 bytes long
0x536F 7572 6365 4669 6C65: “SourceFile”

Constant #16
0x01: Utf8_info
0x0009: the string is 9 bytes long
0x44 656D 6F2E 6A61 7661: “Demo.java”

Constant #17
0x0C: NameAndType_info
0x0007: field or method name, constant index #7
0x0008: field or method descriptor, constant index #8

Constant #18
0x0C: NameAndType_info
0x0005: field or method name, constant index #5
0x0006: field or method descriptor, constant index #6

Constant #19
0x01: Utf8_info
0x000D: the string is 13 bytes long
0x63 6F6D 2F64 656D 6F2F 4465 6D6F: “com/demo/Demo”

Constant #20
0x01: Utf8_info
0x0010: the string is 16 bytes long
0x6A 6176 612F 6C61 6E67 2F4F 626A 6563 74: “java/lang/Object”

So far we have resolved all 20 constants. The next step is to parse the access flags.
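The hand translation of the constant pool can be double-checked with a short program. What follows is only a sketch: ConstantPoolDump and its methods are hypothetical names, the hex string is the Demo’s constant pool copied from the walkthrough above, and only the five tags that occur in the Demo (1, 7, 9, 10, 12) are handled. It leans on the fact that DataInputStream.readUTF reads a 2-byte length followed by modified UTF-8, which is the same layout a Utf8 constant uses.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; handles only the tags used by the Demo.
class ConstantPoolDump {
    // constant_pool_count followed by constants #1 to #20, as listed in the article.
    static final String DEMO_POOL_HEX =
            "0015"
            + "0A00040011" + "0900030012" + "070013" + "070014"
            + "01000161" + "01000149" + "0100063C696E69743E" + "010003282956"
            + "010004436F6465"
            + "01000F4C696E654E756D6265725461626C65"
            + "0100124C6F63616C5661726961626C655461626C65"
            + "01000474686973"
            + "01000F4C636F6D2F64656D6F2F44656D6F3B"
            + "01000A746573744D6574686F64"
            + "01000A536F7572636546696C65"
            + "01000944656D6F2E6A617661"
            + "0C00070008" + "0C00050006"
            + "01000D636F6D2F64656D6F2F44656D6F"
            + "0100106A6176612F6C616E672F4F626A656374";

    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Returns one description per constant; element 0 is a placeholder for the reserved index. */
    static List<String> parse(byte[] data) {
        List<String> pool = new ArrayList<>();
        pool.add("(reserved)");
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readUnsignedShort();   // one more than the number of constants
            for (int i = 1; i < count; i++) {
                int tag = in.readUnsignedByte();  // u1 tag selects the constant type
                switch (tag) {
                    case 1:
                        pool.add("Utf8 " + in.readUTF());
                        break;
                    case 7:
                        pool.add("Class #" + in.readUnsignedShort());
                        break;
                    case 9:
                        pool.add("Fieldref #" + in.readUnsignedShort() + ".#" + in.readUnsignedShort());
                        break;
                    case 10:
                        pool.add("Methodref #" + in.readUnsignedShort() + ".#" + in.readUnsignedShort());
                        break;
                    case 12:
                        pool.add("NameAndType #" + in.readUnsignedShort() + ":#" + in.readUnsignedShort());
                        break;
                    default:
                        // Other tags (Integer, Long, String, ...) are left out of this sketch.
                        throw new IllegalArgumentException("tag not handled in this sketch: " + tag);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pool;
    }

    public static void main(String[] args) {
        List<String> pool = parse(fromHex(DEMO_POOL_HEX));
        for (int i = 1; i < pool.size(); i++) {
            System.out.println("#" + i + " = " + pool.get(i));
        }
    }
}
```

A complete parser would also need the remaining constant types, and would have to account for Long and Double constants occupying two index slots each.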

3.4 Access flags (access_flags)

The access flags record whether this class file describes a class or an interface, whether it is public, whether it is abstract, and, for a class, whether it is declared final. From the source code above, we know that this file describes a public class.





access_flag.png

0x0021: the bitwise OR of 0x0020 (ACC_SUPER) and 0x0001 (ACC_PUBLIC). ACC_PUBLIC marks the class as public. The 0x0020 flag affects how the invokespecial bytecode instruction is treated, which will be explained in a later article on bytecode instructions.
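Decoding the flag value is plain bit masking. Here is a minimal sketch; the class ClassAccessFlags and its decode method are hypothetical names, while the masks and flag names are the class-level access flags from the JVM specification.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the JDK.
class ClassAccessFlags {
    // Class-level access flag masks and their names.
    private static final int[] MASKS = {
        0x0001, 0x0010, 0x0020, 0x0200, 0x0400, 0x1000, 0x2000, 0x4000
    };
    private static final String[] NAMES = {
        "ACC_PUBLIC", "ACC_FINAL", "ACC_SUPER", "ACC_INTERFACE",
        "ACC_ABSTRACT", "ACC_SYNTHETIC", "ACC_ANNOTATION", "ACC_ENUM"
    };

    /** Returns the names of all flags that are set in the given u2 value. */
    static List<String> decode(int flags) {
        List<String> set = new ArrayList<>();
        for (int i = 0; i < MASKS.length; i++) {
            if ((flags & MASKS[i]) != 0) {
                set.add(NAMES[i]);
            }
        }
        return set;
    }

    public static void main(String[] args) {
        System.out.println(decode(0x0021));  // prints [ACC_PUBLIC, ACC_SUPER]
    }
}
```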

3.5 Class index

The class index identifies the fully qualified name of this class. Its value 0x0003 refers to constant #3, and constant #3 in turn refers to constant #19, which is “com/demo/Demo”. In short: #3 -> #19.

3.6 Superclass index

The superclass index is resolved the same way: #4 -> #20 (“java/lang/Object”).

3.7 Interface index

The java_byte.jpeg diagram shows that the interfaces occupy 2+n bytes: the first two bytes hold the number of interfaces, followed by the interface table. Our class does not implement any interfaces, so the count should be 0000. Sure enough, the bytecode file shows 0000 at this position.

3.8 Field table collection

The field table describes the variables declared in a class or interface. Fields here include class-level variables and instance variables, but not local variables declared inside methods. Again, this part occupies 2+n bytes: a 2-byte field count followed by the field entries. Our class has only one field, a, so the count should be 0001. Looking at the file, it is indeed 0001. The next step is to parse that field. The field table structure is shown in the diagram below.





Field table structure.png

0x0002: access flags, private (look up the field access flag table yourself)
0x0005: field name index #5, which is “a”
0x0006: descriptor index #6, which is “I” (the descriptor for int)
0x0000: attribute count is 0, so there is no attribute table

Tip: some of the less important tables (such as the field and method access flag tables) are easy to look up, so they are not reproduced here to save space.
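The field entry is four u2 values in a row, so reading it takes very little code. This is a sketch only: FieldTableSketch and its methods are hypothetical names, and it assumes the field has no attributes, as in the Demo (otherwise the attribute tables would follow).

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration; assumes the first field has no attributes.
class FieldTableSketch {
    static final int ACC_PRIVATE = 0x0002;

    /**
     * Takes the bytes starting at the field count and returns
     * {access_flags, name_index, descriptor_index, attributes_count} of the first field.
     */
    static int[] firstField(byte[] fieldsSection) {
        ByteBuffer buf = ByteBuffer.wrap(fieldsSection);  // big-endian by default
        int fieldsCount = buf.getShort() & 0xFFFF;
        if (fieldsCount == 0) {
            throw new IllegalArgumentException("the class declares no fields");
        }
        int[] field = new int[4];
        for (int i = 0; i < field.length; i++) {
            field[i] = buf.getShort() & 0xFFFF;           // each item is an unsigned u2
        }
        return field;
    }

    static boolean isPrivate(int accessFlags) {
        return (accessFlags & ACC_PRIVATE) != 0;
    }

    public static void main(String[] args) {
        // Field count 0001 followed by the Demo's field entry: 0002 0005 0006 0000.
        byte[] demoFields = {0x00, 0x01, 0x00, 0x02, 0x00, 0x05, 0x00, 0x06, 0x00, 0x00};
        int[] f = firstField(demoFields);
        // prints private=true name=#5 descriptor=#6 attributes=0
        System.out.println("private=" + isPrivate(f[0]) + " name=#" + f[1]
                + " descriptor=#" + f[2] + " attributes=" + f[3]);
    }
}
```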

3.9 Methods

We wrote only one method, testMethod, so the first two bytes (the method count) should be 0001. Looking it up, we find 0x0002. Why is that? Does that mean there are two methods? Let’s keep reading.





Method table structure.png

Above is the method table structure diagram. Following this diagram, we analyze the bytecode below:

Method 1:

0x0001: access flags, ACC_PUBLIC, so the method is public
0x0007: method name index #7, which is “<init>”, the instance constructor that the compiler generated for us. This is why the method count is 2
0x0008: method descriptor index #8, which is “()V”
0x0001: attribute count is 1 (one attribute table)

What is an attribute table? It carries additional information specific to the item it is attached to, and the method above has one. All attribute tables share the same structure: a u2 attribute name index, a u4 attribute length, and then info of that length. The virtual machine specification predefines many attributes, such as Code, LineNumberTable, LocalVariableTable, and SourceFile, which can be looked up online.





PNG property table structure.png

0x0009: the attribute name index is #9 (“Code”)
0x0000 0038: the attribute length is 56 bytes

So next we parse a Code attribute table, whose structure is shown below.





code.png

The first 6 bytes (2 bytes of name index + 4 bytes of attribute length) have already been parsed. The attribute length does not count those 6 bytes, so 56 bytes remain to be parsed.

0x0002: max_stack = 2
0x0001: max_locals = 1
0x0000 000A: code_length = 10
0x2A B700 012A 04B5 0002 B1: the code itself, which can be decoded with the table of virtual machine bytecode instructions:
2A = aload_0, push the reference in local variable 0 (this) onto the stack
B7 = invokespecial, followed by a 2-byte operand; here 00 01 is constant pool index #1
04 = iconst_1, push int 1 onto the stack
B5 = putfield, followed by a 2-byte operand; here 00 02 is constant pool index #2
B1 = return, return void from the current method

Written out with byte offsets, this gives:
0: aload_0
1: invokespecial #1
4: aload_0
5: iconst_1
6: putfield #2
9: return

Bytecode instructions will be covered in more depth later; for now this overview is enough.

0x0000: exception_table_length = 0
0x0002: attributes_count = 2
0x000A: attribute name index #10, so the first attribute table is LineNumberTable





LineNumberTable.png

0x0000 000A: attribute length is 10
0x0002: line_number_table_length = 2

line_number_table is a collection of line_number_table_length entries of type line_number_info. Each line_number_info contains two u2 items: start_pc, the bytecode offset, and line_number, the Java source line number.

0x0000: start_pc = 0
0x0003: line_number = 3
0x0004: start_pc = 4
0x0004: line_number = 4

0x000B: attribute name index #11, so the second attribute table is LocalVariableTable
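The LineNumberTable is what lets a stack trace or debugger map a bytecode offset back to a source line. The sketch below shows that lookup under a simple assumption: an offset belongs to the entry with the largest start_pc that does not exceed it. The class LineNumberLookup and its methods are hypothetical names, and each entry is read as a start_pc followed by a line_number.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration; not part of the JDK.
class LineNumberLookup {
    /** Parses line_number_table_length followed by (start_pc, line_number) pairs. */
    static int[][] parse(byte[] body) {
        ByteBuffer buf = ByteBuffer.wrap(body);         // big-endian by default
        int count = buf.getShort() & 0xFFFF;
        int[][] table = new int[count][2];
        for (int i = 0; i < count; i++) {
            table[i][0] = buf.getShort() & 0xFFFF;      // start_pc: bytecode offset
            table[i][1] = buf.getShort() & 0xFFFF;      // line_number: source line
        }
        return table;
    }

    /** Returns the source line for a bytecode offset, or -1 if no entry covers it. */
    static int lineFor(int[][] table, int pc) {
        int line = -1;
        int bestStart = -1;
        for (int[] entry : table) {
            if (entry[0] <= pc && entry[0] > bestStart) {
                bestStart = entry[0];
                line = entry[1];
            }
        }
        return line;
    }

    public static void main(String[] args) {
        // Body of the constructor's LineNumberTable: 0002, then 0000 0003 and 0004 0004.
        byte[] body = {0x00, 0x02, 0x00, 0x00, 0x00, 0x03, 0x00, 0x04, 0x00, 0x04};
        int[][] table = parse(body);
        System.out.println(lineFor(table, 1));  // prints 3 (invokespecial at offset 1)
        System.out.println(lineFor(table, 6));  // prints 4 (putfield at offset 6)
    }
}
```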





local_variable_table.png




local_variable_info.png

0x0000 000C: attribute length is 12
0x0001: local_variable_table_length = 1
0x0000: start_pc = 0
0x000A: length = 10
0x000C: name_index #12 (“this”)
0x000D: descriptor_index #13 (“Lcom/demo/Demo;”)
0x0000: index = 0

At this point the first method is fully parsed: method <init> has 1 attribute, the Code table, which in turn contains 2 attribute tables (LineNumberTable and LocalVariableTable). Next we parse the second method.
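The 10 code bytes of the constructor can also be decoded mechanically. The following is a toy disassembler, a sketch that knows only the five opcodes appearing in this Demo; MiniDisassembler is a hypothetical name, and a real tool would need the full instruction table from the virtual machine specification.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy disassembler; supports only the opcodes used in the Demo.
class MiniDisassembler {
    static List<String> disassemble(byte[] code) {
        List<String> out = new ArrayList<>();
        int pc = 0;
        while (pc < code.length) {
            int opcode = code[pc] & 0xFF;  // mask so the byte is treated as unsigned
            switch (opcode) {
                case 0x04:
                    out.add(pc + ": iconst_1");
                    pc += 1;
                    break;
                case 0x2A:
                    out.add(pc + ": aload_0");
                    pc += 1;
                    break;
                case 0xB1:
                    out.add(pc + ": return");
                    pc += 1;
                    break;
                case 0xB5:
                    // putfield carries a 2-byte constant pool index as its operand
                    out.add(pc + ": putfield #" + u2(code, pc + 1));
                    pc += 3;
                    break;
                case 0xB7:
                    // invokespecial carries a 2-byte constant pool index as its operand
                    out.add(pc + ": invokespecial #" + u2(code, pc + 1));
                    pc += 3;
                    break;
                default:
                    throw new IllegalArgumentException(
                            "opcode not handled in this sketch: 0x" + Integer.toHexString(opcode));
            }
        }
        return out;
    }

    /** Reads an unsigned big-endian 2-byte value. */
    static int u2(byte[] bytes, int at) {
        return ((bytes[at] & 0xFF) << 8) | (bytes[at + 1] & 0xFF);
    }

    public static void main(String[] args) {
        // Code bytes of <init>: 2A B7 00 01 2A 04 B5 00 02 B1
        byte[] init = {0x2A, (byte) 0xB7, 0x00, 0x01, 0x2A, 0x04, (byte) 0xB5, 0x00, 0x02, (byte) 0xB1};
        for (String line : disassemble(init)) {
            System.out.println(line);
        }
    }
}
```

The key point the sketch makes is that the 00 01 and 00 02 bytes are operands of the preceding instruction, not instructions of their own, which is why the offsets jump from 1 to 4 and from 6 to 9.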

Method 2:

0x0004: access flags, ACC_PROTECTED
0x000E: method name index #14 (“testMethod”)
0x0008: method descriptor index #8 (“()V”)
0x0001: attribute count = 1
0x0009: attribute name index #9 (“Code”)
0x0000 002B: attribute length = 43

Parsing the Code table:
0x0000: max_stack = 0
0x0001: max_locals = 1
0x0000 0001: code_length = 1
0xB1: return (void)
0x0000: exception_table_length = 0
0x0002: attributes_count = 2

First attribute table:
0x000A: attribute name index #10 (“LineNumberTable”)
0x0000 0006: attribute length is 6
0x0001: line_number_table_length = 1
0x0000: start_pc = 0
0x0008: line_number = 8

Second attribute table:
0x000B: attribute name index #11 (“LocalVariableTable”)
0x0000 000C: attribute length is 12
0x0001: local_variable_table_length = 1
0x0000: start_pc = 0
0x0001: length = 1
0x000C: name_index #12 (“this”)
0x000D: descriptor_index #13 (“Lcom/demo/Demo;”)
0x0000: index = 0

At this point, method parsing is complete. Looking back at the overview diagram at the top, the only part left is the attributes.

3.10 Attributes

0x0001: attributes_count = 1, so there is one attribute
0x000F: attribute name index #15 (“SourceFile”)
0x0000 0002: attribute_length = 2
0x0010: sourcefile_index = #16 (“Demo.java”)

The SourceFile attribute records the name of the source file from which this class file was compiled.





source_file.jpeg
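Every attribute we met (Code, LineNumberTable, LocalVariableTable, SourceFile) starts with the same header, so one small reader covers them all. This is a sketch with hypothetical names (AttributeSketch, read, u2); it assumes the attribute length fits in a Java int and leaves interpreting the info bytes to the caller.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration; not part of the JDK.
class AttributeSketch {
    final int nameIndex;  // u2: constant pool index of the attribute name
    final byte[] info;    // attribute_length bytes of attribute-specific data

    AttributeSketch(int nameIndex, byte[] info) {
        this.nameIndex = nameIndex;
        this.info = info;
    }

    /** Reads one attribute at the buffer's current position and advances past it. */
    static AttributeSketch read(ByteBuffer buf) {
        int nameIndex = buf.getShort() & 0xFFFF;
        int length = buf.getInt();      // u4; the 6 header bytes are not included in it
        byte[] info = new byte[length];
        buf.get(info);
        return new AttributeSketch(nameIndex, info);
    }

    /** Interprets the first two info bytes as an unsigned big-endian u2. */
    static int u2(byte[] info) {
        return ((info[0] & 0xFF) << 8) | (info[1] & 0xFF);
    }

    public static void main(String[] args) {
        // The Demo's SourceFile attribute: 000F 0000 0002 0010
        byte[] bytes = {0x00, 0x0F, 0x00, 0x00, 0x00, 0x02, 0x00, 0x10};
        AttributeSketch attr = read(ByteBuffer.wrap(bytes));
        // prints name=#15 length=2 sourcefile_index=#16
        System.out.println("name=#" + attr.nameIndex + " length=" + attr.info.length
                + " sourcefile_index=#" + u2(attr.info));
    }
}
```

Because the length is stored up front, a reader that does not recognize an attribute name can still skip over it, which is how the format stays open to new attributes.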

4 Other notes

Admittedly, writing all of this out by hand is a lot of work, but going through the process yourself teaches you things you would not learn otherwise. Now let’s use the JDK’s own disassembler, javap, to parse the bytecode file: javap -verbose Demo (do not include the .class suffix).





javap_result.png

5 Conclusion

At this point the whole class file has been parsed, and we should be able to read bytecode files on our own from now on. Understanding the structure of the class file matters a great deal for understanding the virtual machine’s execution engine, so this is a basic and important step.

6 Tools: Github.com/zxh0/classp… is a bytecode file analysis tool that is pleasant to use.