Even for experienced Java developers, reading compiled Java bytecode can be tedious. Why do we need to understand this low-level stuff in the first place? Here's a simple story that happened to me last week.

A long time ago, I made some code changes on my machine, compiled a JAR, and deployed it to a server to test a potential fix for a performance problem. Unfortunately, the code was never checked into version control, and for some reason the local changes were later lost without a trace. A few months later, I needed those changes again, but I couldn't find the version I had modified last time!

Fortunately, the compiled code still existed on the remote server. Relieved, I fetched the JAR and opened it in a decompiler... There was just one problem: the decompiler GUI was not a flawless tool. Of all the classes in that JAR, the particular class I wanted to decompile triggered a bug in the UI whenever I opened it, and the decompiler crashed!

Desperate times call for desperate measures. Fortunately, I was familiar with raw bytecode, and I would rather take some time to decompile a few pieces of code by hand than rework the changes and test them all over again. Because I still remembered roughly where in the code to look, reading the bytecode helped me pinpoint the exact changes and rebuild them in source form. (I made sure to learn from my mistake and preserve them this time!)

The nice thing about bytecode is that you only need to learn its syntax once, and then it applies on every platform that Java supports, because it is an intermediate representation of the code and not the actual executable code for the underlying CPU. In addition, bytecode is simpler than native machine code because the JVM architecture is fairly simple, which simplifies the instruction set. Another nice thing is that all instructions in this set are fully documented by Oracle.
Before we get to the bytecode instruction set, however, let's familiarize ourselves with a few things about the JVM that are a prerequisite for moving forward.

JVM data types

Java is statically typed, which affects the design of the bytecode instructions: an instruction expects to operate on values of a specific type. For example, there are several add instructions for adding two numbers: iadd, ladd, fadd, and dadd. They expect operands of type int, long, float, and double, respectively. Most bytecodes share this characteristic of having different forms of the same functionality depending on the operand types.

The data types defined by the JVM are:

1. Primitive types:
   Numeric types: byte (8-bit), short (16-bit), int (32-bit), long (64-bit), char (16-bit unsigned Unicode), float (32-bit IEEE 754 single precision), double (64-bit IEEE 754 double precision)
   boolean type
   returnAddress: a pointer to an instruction
2. Reference types:
   Class types
   Array types
   Interface types

The boolean type has limited support in bytecode. For example, there are no instructions that operate directly on boolean values. The compiler converts boolean values to int, and the corresponding int instructions are used instead.

Java developers should be familiar with all of the above types except returnAddress, which has no equivalent in the programming language.

Stack-based architecture

The simplicity of the bytecode instruction set is largely due to Sun's decision to design a stack-based VM architecture rather than a register-based one. A JVM process uses several memory components, but essentially only the JVM stacks need to be examined in detail in order to follow bytecode instructions:

PC register: For each thread running in a Java program, a PC register holds the address of the currently executing instruction.
JVM stack: For each thread, a stack is allocated that holds local variables, method arguments, and return values.
[Figure: the JVM stacks of three threads]
Heap: Memory shared by all threads that stores objects (class instances and arrays). Object deallocation is managed by the garbage collector.
Method area: For each loaded class, it stores the code of the methods along with a table of symbols (such as references to fields or methods) and constants, known as the constant pool.
The JVM stack is made up of frames. A frame is pushed onto the stack when a method is invoked and popped from the stack when the method completes (either by returning normally or by throwing an exception). Each frame further consists of:

An array of local variables, indexed from 0 to its length minus 1. The length is computed by the compiler. A local variable can hold a value of any type, except that long and double values occupy two local variables.
An operand stack used to store intermediate values, which act as operands for instructions or as arguments to method invocations.
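To make the typed instructions and the two-slot rule for long and double concrete, here is a small sketch. The class and method names (TypedAdds, addInts, and so on) are invented for illustration; the comments note the instructions javac normally emits for static methods like these, which can be checked with javap -c.

```java
// Hypothetical demo class; all names here are illustrative only.
class TypedAdds {
    // Typically compiles to: iload_0, iload_1, iadd, ireturn
    static int addInts(int a, int b) { return a + b; }

    // Each long occupies two local variable slots, so b starts at index 2.
    // Typically compiles to: lload_0, lload_2, ladd, lreturn
    static long addLongs(long a, long b) { return a + b; }

    // Typically compiles to: fload_0, fload_1, fadd, freturn
    static float addFloats(float a, float b) { return a + b; }

    // double also takes two slots per value.
    // Typically compiles to: dload_0, dload_2, dadd, dreturn
    static double addDoubles(double a, double b) { return a + b; }

    // boolean has no dedicated instructions; true is pushed as the int 1.
    // Typically compiles to: iconst_1, ireturn
    static boolean alwaysTrue() { return true; }
}
```

The source for all four add methods is identical apart from the types; only the instruction prefix (i, l, f, d) and the local variable indexes change.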
Bytecode exploration

With an idea of the JVM's internals, we can look at some basic bytecode generated from sample code. Each method in a Java class file has a code segment that consists of a sequence of instructions, each in the following format:

opcode (1 byte) operand1 (optional) operand2 (optional) ...

That is, an instruction consists of a one-byte opcode and zero or more operands that contain the data to operate on. Within the stack frame of the currently executing method, an instruction can push values onto or pop values from the operand stack, and it can also load or store values in the array of local variables. Let's look at an example:
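A minimal sample consistent with the walkthrough that follows (three int locals a, b, and c, with c computed by a single iadd) would look like the sketch below. The class name Test matches the class file discussed next; everything else is a reconstruction and not necessarily the original listing.

```java
// Reconstructed sample; the original listing may have differed in detail.
class Test {
    public static void main(String[] args) {
        int a = 1;      // local variable 1 (slot 0 holds args)
        int b = 2;      // local variable 2
        int c = a + b;  // local variable 3
    }
}
```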
To print the resulting bytecode of the compiled class (assuming it is in a file named Test.class), we run the javap tool:

javap -v Test.class

We get the following result:
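The instruction part of that output, which the walkthrough below goes through one by one, is the sequence iconst_1, istore_1, iconst_2, istore_2, iload_1, iload_2, iadd, istore_3, return at addresses 0 through 8. To follow along, here is a toy interpreter that replays those nine instructions on an operand stack and a local variable array. It is only a sketch: MiniFrame and runMain are invented names, and a real JVM is not implemented this way.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model for illustration only; names are invented.
class MiniFrame {
    // Replays the nine instructions and returns the local variable array.
    static int[] runMain() {
        int[] locals = new int[4];                  // locals=4; slot 0 would hold args
        Deque<Integer> stack = new ArrayDeque<>();  // operand stack, never deeper than 2 here
        String[] code = {
            "iconst_1", "istore_1", "iconst_2", "istore_2",
            "iload_1", "iload_2", "iadd", "istore_3", "return"
        };
        for (String op : code) {
            switch (op) {
                case "iconst_1": stack.push(1); break;
                case "iconst_2": stack.push(2); break;
                case "istore_1": locals[1] = stack.pop(); break;
                case "istore_2": locals[2] = stack.pop(); break;
                case "istore_3": locals[3] = stack.pop(); break;
                case "iload_1":  stack.push(locals[1]); break;
                case "iload_2":  stack.push(locals[2]); break;
                case "iadd":     stack.push(stack.pop() + stack.pop()); break;
                case "return":   return locals;
                default: throw new IllegalStateException("unsupported instruction: " + op);
            }
        }
        return locals;
    }
}
```

After runMain() finishes, slots 1 to 3 of the returned array hold 1, 2, and 3, matching a, b, and c.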
We can see the method descriptor for main: it takes an array of Strings ([Ljava/lang/String;) and has a void return type (V). The flags line below it indicates that the method is public (ACC_PUBLIC) and static (ACC_STATIC). The Code attribute is the most important part. It contains the instructions for the method along with information such as the maximum depth of the operand stack (2 in this case) and the number of local variables allocated in the frame for this method (4 in this case). All local variables are referenced in the instructions above except the first one (at index 0), which holds the reference to the args parameter. The other three local variables correspond to a, b, and c. The instructions at addresses 0 through 8 do the following:

iconst_1: Pushes the integer constant 1 onto the operand stack.
istore_1: Pops the top operand (an int value) from the stack and stores it in the local variable at index 1, which corresponds to variable a.
iconst_2: Pushes the integer constant 2 onto the operand stack.
istore_2: Pops the top operand from the stack and stores it in the local variable at index 2, which corresponds to variable b.
iload_1: Loads the int value from the local variable at index 1 and pushes it onto the operand stack.
iload_2: Loads the int value from the local variable at index 2 and pushes it onto the operand stack.
iadd: Pops the top two int values from the operand stack, adds them, and pushes the result back onto the operand stack.
istore_3: Pops the top operand from the stack and stores it in the local variable at index 3, which corresponds to variable c.
return: Returns from the void method.

Each of these instructions consists of only an opcode, which dictates exactly the operation the JVM executes. The example above has only one method, main. Suppose we need a more elaborate computation for the value of variable c, and we place it in a new method called calc:
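The source for this version is described only through its bytecode in the discussion that follows (two calls to Math.pow with the constant 2.0, a dadd, a call to Math.sqrt, and a d2i). A sketch consistent with that description, offered as a reconstruction and not necessarily the original listing, is:

```java
// Reconstructed sample; the original listing may have differed in detail.
class Test {
    public static void main(String[] args) {
        int a = 1;
        int b = 2;
        int c = calc(a, b);   // the iadd is replaced by a call to calc
    }

    // Square root of a^2 + b^2, truncated to an int.
    static int calc(int a, int b) {
        return (int) Math.sqrt(Math.pow(a, 2) + Math.pow(b, 2));
    }
}
```

For example, calc(3, 4) evaluates to 5, and calc(1, 2) evaluates to 2 because the fractional part of the square root of 5 is dropped by the cast.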
Take a look at the generated bytecode:
The only difference in the code of the main method is that, instead of the iadd instruction, we now have an invokestatic instruction, which simply invokes the static method calc. The key thing to note is that the operand stack holds the two arguments that are passed to calc. In other words, the calling method prepares all arguments of the method to be called by pushing them onto the operand stack in the correct order. invokestatic (like the other invoke instructions discussed later) then pops these arguments and creates a new frame for the invoked method, in which the arguments are placed in its local variable array.

We also notice that the invokestatic instruction occupies 3 bytes: the address jumps from 6 to 9. Unlike the instructions seen so far, invokestatic includes two additional bytes (besides the opcode) that form the reference to the method to be invoked. This reference, shown by javap as #2, is a symbolic reference to the calc method, which is resolved from the constant pool described earlier.

The other new information is, obviously, the code for the calc method itself. It first loads the first integer argument onto the operand stack (iload_0). The next instruction, i2d, converts it to a double by applying a widening conversion. The resulting double replaces the top of the operand stack. The next instruction pushes the double constant 2.0d (taken from the constant pool) onto the operand stack. The static method Math.pow is then invoked with the two operand values prepared so far (the first argument of calc and the constant 2.0d). When Math.pow returns, its result is stored on the operand stack of its caller.
[Figure: the operand stack of calc around the call to Math.pow]
The same procedure is applied to compute Math.pow(b, 2):
The next instruction, dadd, pops the top two intermediate results, adds them, and pushes the sum back onto the stack. Finally, invokestatic calls Math.sqrt on the resulting sum, and the result is cast from double to int using a narrowing conversion (d2i). The resulting integer is returned to the main method, which stores it in c (istore_3).

Now let's modify the example by introducing a Point class that encapsulates XY coordinates.
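A version of the example along these lines is sketched below. Only the names Point and area, the x and y fields, and the variables a and b come from the discussion that follows; the coordinate values, the int return type, and the body of area are assumptions made for illustration.

```java
// Reconstructed sample; coordinates and the body of area are assumed.
class Test {
    public static void main(String[] args) {
        Point a = new Point(1, 1);   // stored in local variable 1
        Point b = new Point(5, 3);   // stored in local variable 2
        int c = a.area(b);           // a is the receiver, b the single argument
    }
}

class Point {
    int x, y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // Area of the axis-aligned rectangle spanned by this point and b.
    public int area(Point b) {
        int length = Math.abs(b.y - this.y);
        int width = Math.abs(b.x - this.x);
        return length * width;
    }
}
```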
The compiled bytecode for the main method looks like this:
The new instructions encountered here are new, dup, and invokespecial. Similar to the new operator in the programming language, the new instruction creates an object of the type specified by the operand passed to it (which is a symbolic reference to the class Point). Memory for the object is allocated on the heap, and a reference to the object is pushed onto the operand stack. The dup instruction duplicates the value at the top of the operand stack, which means that we now have two references to the Point object at the top of the stack. The next three instructions push the constructor's arguments (used to initialize the object) onto the operand stack and then invoke a special initialization method, which corresponds to the constructor. That method is where the fields x and y are initialized. After it finishes, the top three operand stack values have been consumed, and what remains is the original reference to the created object (which is, by now, successfully initialized). Next, astore_1 pops that Point reference and assigns it to the local variable at index 1 (the a in astore_1 indicates that this is a reference value).
The same procedure is repeated to create and initialize the second Point instance, which is assigned to variable b.
The last step loads the references to the two Point objects from the local variables at indexes 1 and 2 (using aload_1 and aload_2, respectively) and calls the area method using invokevirtual, which dispatches the call to the appropriate method based on the actual type of the object. For example, if variable a contained an instance of a SpecialPoint class that extends Point, and that subclass overrode the area method, then the overriding method would be invoked. In this case there is no subclass, so only one area method is available.
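The dispatch behaviour of invokevirtual can be observed from plain Java. In the sketch below, SpecialPoint is the hypothetical subclass mentioned above; its override returns a sentinel value of -1 purely so the two cases are easy to tell apart, and DispatchDemo.areaOf is an invented helper whose single call site compiles to an invokevirtual on Point.area.

```java
// Illustrative classes; the body of area and the sentinel value are assumptions.
class Point {
    final int x, y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    int area(Point other) {
        return Math.abs(other.x - x) * Math.abs(other.y - y);
    }
}

class SpecialPoint extends Point {
    SpecialPoint(int x, int y) {
        super(x, y);
    }

    @Override
    int area(Point other) {
        return -1;  // sentinel so the override is visible in the result
    }
}

class DispatchDemo {
    // The same invokevirtual call site serves both Point and SpecialPoint receivers;
    // the method that actually runs is chosen from the runtime type of a.
    static int areaOf(Point a, Point b) {
        return a.area(b);
    }
}
```

Calling areaOf with a plain Point(1, 1) and Point(5, 3) yields 8, while passing a SpecialPoint as the first argument yields -1, even though the bytecode of areaOf is the same in both cases.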
Notice that even though the area method accepts a single argument, there are two Point references at the top of the stack. The first one (pointA, which comes from variable a) is actually the instance on which the method is invoked (referred to as this in the programming language), and it is passed in the first local variable of the new frame for the area method. The other operand (pointB) is the argument to the area method.

The other way around

You don't need a complete understanding of every instruction and the exact flow of execution to get an idea of what a program does from the bytecode at hand. For example, in my case, I wanted to check whether the code used a Java stream to read a file, and whether the stream was closed properly. Given the following bytecode, it is relatively easy to determine that a stream is indeed used and that it is most likely closed as part of a try-with-resources statement.

    public static void main(java.lang.String[]) throws java.lang.Exception;
      descriptor: ([Ljava/lang/String;)V
      flags: (0x0009) ACC_PUBLIC, ACC_STATIC
      Code:
        stack=2, locals=8, args_size=1
           0: ldc           #2   // class test/Test
           2: ldc           #3   // String input.txt
           4: invokevirtual #4   // Method java/lang/Class.getResource:(Ljava/lang/String;)Ljava/net/URL;
           7: invokevirtual #5   // Method java/net/URL.toURI:()Ljava/net/URI;
          10: invokestatic  #6   // Method java/nio/file/Paths.get:(Ljava/net/URI;)Ljava/nio/file/Path;
          13: astore_1
          14: new           #7   // class java/lang/StringBuilder
          17: dup
          18: invokespecial #8   // Method java/lang/StringBuilder."<init>":()V
          21: astore_2
          22: aload_1
          23: invokestatic  #9   // Method java/nio/file/Files.lines:(Ljava/nio/file/Path;)Ljava/util/stream/Stream;
          26: astore_3
          27: aconst_null
          28: astore        4
          30: aload_3
          31: aload_2
          32: invokedynamic #10, 0   // InvokeDynamic #0:accept:(Ljava/lang/StringBuilder;)Ljava/util/function/Consumer;
          37: invokeinterface #11, 2 // InterfaceMethod java/util/stream/Stream.forEach:(Ljava/util/function/Consumer;)V
          42: aload_3
          43: ifnull        131
          46: aload         4
          48: ifnull        72
          51: aload_3
          52: invokeinterface #12, 1 // InterfaceMethod java/util/stream/Stream.close:()V
          57: goto          131
          60: astore        5
          62: aload         4
          64: aload         5
          66: invokevirtual #14  // Method java/lang/Throwable.addSuppressed:(Ljava/lang/Throwable;)V
          69: goto          131
          72: aload_3
          73: invokeinterface #12, 1 // InterfaceMethod java/util/stream/Stream.close:()V
          78: goto          131
          81: astore        5
          83: aload         5
          85: astore        4
          87: aload         5
          89: athrow
          90: astore        6
          92: aload_3
          93: ifnull        128
          96: aload         4
          98: ifnull        122
         101: aload_3
         102: invokeinterface #12, 1 // InterfaceMethod java/util/stream/Stream.close:()V
         107: goto          128
         110: astore        7
         112: aload         4
         114: aload         7
         116: invokevirtual #14  // Method java/lang/Throwable.addSuppressed:(Ljava/lang/Throwable;)V
         119: goto          128
         122: aload_3
         123: invokeinterface #12, 1 // InterfaceMethod java/util/stream/Stream.close:()V
         128: aload         6
         130: athrow
         131: getstatic     #15  // Field java/lang/System.out:Ljava/io/PrintStream;
         134: aload_2
         135: invokevirtual #16  // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
         138: invokevirtual #17  // Method java/io/PrintStream.println:(Ljava/lang/String;)V
         141: return

We can see java/util/stream/Stream being used where forEach is called, preceded by an invokedynamic instruction that produces a reference to a Consumer. We can also see a chunk of bytecode that calls Stream.close, along with branches that call Throwable.addSuppressed. This is the basic code the compiler generates for a try-with-resources statement.

In summary, the bytecode instruction set is concise and the compiler applies few optimizations when generating it, so disassembling class files lets you examine code without its source, which may be your only option when the source is lost!
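As a rough illustration of the kind of source that yields bytecode like the try-with-resources listing above, consider the sketch below. StreamDemo, readAll, and demo are invented names, the lambda body is a guess (the listing only shows that the lambda captures a StringBuilder), and the temporary file stands in for the input.txt resource. The exact instruction pattern javac emits for try-with-resources also depends on the compiler version.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.stream.Stream;

// Hypothetical example; names and the lambda body are assumptions.
class StreamDemo {
    // Reads every line of the file into one string; the stream is closed
    // automatically when the try block exits, normally or exceptionally.
    static String readAll(Path path) throws IOException {
        StringBuilder data = new StringBuilder();
        try (Stream<String> lines = Files.lines(path)) {
            lines.forEach(line -> data.append(line).append('\n'));
        }
        return data.toString();
    }

    // Self-contained driver: writes two lines to a temporary file and reads them back.
    static String demo() {
        try {
            Path tmp = Files.createTempFile("input", ".txt");
            try {
                Files.write(tmp, Arrays.asList("first", "second"));
                return readAll(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Running javap -c on a class like this is a quick way to see the Stream.close calls that the compiler inserts on your behalf.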