Java object serialization is the process of writing the state of a Java object to a binary stream, typically a file, through the I/O API. Two streams do the work: ObjectOutputStream for writing and ObjectInputStream for reading.

For most people, knowledge of serialization stops at calling readObject and writeObject. They do not know how Java can “restore” an entire object from a binary file, or exactly how an object is stored in that file.

This article walks through the binary output and the rules of the serialization protocol to show what Java objects look like in a file. It can be tedious, but it should improve your understanding of serialization.

An ancient method of serialization

In previous articles on byte streams, we briefly mentioned DataInputStream and DataOutputStream, decorator streams that let us read and write values of the basic data types instead of raw bytes.

Here’s an example. It consists of a People type and a slightly more involved main function that writes a People instance to a file field by field and reads it back.
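A minimal sketch of this field-by-field approach follows. The People fields (a String name and an int age) and the class name ManualSerializationDemo are assumptions made for illustration, and an in-memory byte array stands in for the file so the example is self-contained; wrapping a FileOutputStream and FileInputStream instead would write to disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class ManualSerializationDemo {

    // Hypothetical People type: the field names and types are assumptions.
    static class People {
        String name;
        int age;

        People(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    // Write each field by hand, in a fixed order.
    static byte[] write(People people) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(people.name);
            out.writeInt(people.age);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Read the fields back in exactly the order they were written.
    static People read(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            String name = in.readUTF();
            int age = in.readInt();
            return new People(name, age);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        People restored = read(write(new People("single", 23)));
        System.out.println(restored.name);
        System.out.println(restored.age);
    }
}
```

Swapping the two read calls in read would not fail at compile time; it would simply misinterpret the bytes, which is why the order matters so much with this approach.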

As you can see, this ancient method simply uses DataOutputStream to write the value of each field to the file one by one, and calls that a “serialization operation”.

Restoring the object means reading the fields back one by one, in exactly the order they were written. This is very unfriendly: if a class has a hundred fields, you have to write a hundred write calls and a hundred matching read calls by hand.

Strictly speaking, this is not an implementation of serialization, only a pseudo-serialization. It is enough to know that it exists.

Java standard serialization

One reason to serialize an object to disk is that some objects are important but large, and will not be needed for a while. Keeping them in memory is wasteful, while discarding them means paying to create them again later.

A compromise is to write these objects to a file and read them back from disk when they are needed. That is serialization.

To serialize an object, Java requires that its class implement the java.io.Serializable interface. Serializable declares no methods; it is a “marker interface”.

When the serialization runtime is asked to write an object, it checks whether the object’s class implements Serializable. If it does not, the request is rejected and an exception is thrown:

java.io.NotSerializableException
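The check can be observed with a short sketch. The Plain class and the isRejected helper are hypothetical names introduced only for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

class NotSerializableDemo {

    // Hypothetical class that deliberately does NOT implement Serializable.
    static class Plain {
        String name = "single";
    }

    // Returns true if writeObject rejects the object with NotSerializableException.
    static boolean isRejected(Object candidate) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(candidate);
            return false;
        } catch (NotSerializableException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isRejected(new Plain()));   // true
        System.out.println(isRejected("a string"));    // false, String implements Serializable
    }
}
```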

The general use of serialization is as follows:
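A minimal sketch of that usage, assuming a People class with a String name and an Integer age (the Integer type matches the field type that shows up in the binary later on). The class name StandardSerializationDemo and the serialize and deserialize helpers are illustrative, and a byte array stands in for the file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class StandardSerializationDemo {

    // Assumed shape of People: a String name and an Integer age.
    static class People implements Serializable {
        private String name;
        private Integer age;

        People(String name, Integer age) {
            this.name = name;
            this.age = age;
        }

        String getName() { return name; }
        Integer getAge() { return age; }
    }

    // Serialize any Serializable object into a byte array.
    static byte[] serialize(Object object) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(object);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Restore the first object stored in the byte array.
    static Object deserialize(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        People restored = (People) deserialize(serialize(new People("single", 23)));
        System.out.println(restored.getName());
        System.out.println(restored.getAge());
    }
}
```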

Output result:

single
23

ObjectOutputStream is also a decorator stream, in the sense that all of its byte-level output goes through the OutputStream instance passed to its constructor.

Its implementation is complicated and defines many inner classes. It also wraps a DataOutputStream internally, so the DataOutputStream methods for writing basic data types are available here as well. In addition, it provides a writeObject method, which DataOutputStream lacks, for writing a Java object whose class implements Serializable directly to the stream.

ObjectInputStream does the opposite: it reads the stream and restores the Java object.

The writeObject method takes an Object argument and serializes that object to the stream. It does not simply write the field values; it follows a defined format, just as the compiler generates bytecode files in a defined format.

Because reader and writer follow the same rules, the object can be restored, so let’s look at those rules in detail.

Storage rules for serialization

In the previous section we serialized an instance of People into a file. Now we open the binary.
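If you do not have a hex editor at hand, a few lines of Java are enough to look at the bytes. This is a sketch only: HexDumpDemo and toHex are hypothetical names, and People is assumed to have a String name and an Integer age.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class HexDumpDemo {

    // Assumed shape of People: a String name and an Integer age.
    static class People implements Serializable {
        private String name;
        private Integer age;

        People(String name, Integer age) {
            this.name = name;
            this.age = age;
        }
    }

    static byte[] serialize(Object object) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(object);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Format the first `limit` bytes as space-separated, upper-case hex pairs.
    static String toHex(byte[] bytes, int limit) {
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < Math.min(limit, bytes.length); i++) {
            if (i > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(new People("single", 23));
        System.out.println(toHex(bytes, 6));             // AC ED 00 05 73 72
        System.out.println(toHex(bytes, bytes.length));  // the whole stream
    }
}
```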

Every byte of a serialized object is laid out according to Java’s serialization rules, which specify what each byte is used to store. Let’s take a look.

1. Magic number: almost every binary format starts with one to identify the file type. For an object serialization file the magic number is AC ED, two bytes.

2. Serialization protocol version: two bytes that state which version of the serialization rules produced the file, in this case 00 05. Other values are possible in principle, but version 5 is what you will normally see.

3. Object type tag: the next byte describes what kind of item follows. 0x73 indicates an ordinary Java object. Other values include 0x74 for a string and 0x75 for an array; the full set is defined as the TC_ constants in java.io.ObjectStreamConstants.

Note that strings and arrays are not classified as ordinary Java objects and have their own tags. Our People is an ordinary Java object, so the tag here is 0x73.

4. Class description tag: one byte stating whether what follows is a new class description (0x72) or a reference (0x71) to something already written. This kind of reference is a handle inside the stream, not a Java reference pointer. If you serialize the same object twice, Java does not write it to the file again; it stores a reference instead, more on that later. People is described here for the first time, so the value is 0x72.

5. Class name: the next two bytes give the length of the fully qualified class name, here 23, so the following 23 bytes are the name itself. Decoded, these 23 bytes read TestSerializable.People.

Then look at:

6. Serial version UID: the next eight bytes, 3A through B5, hold the serialVersionUID of the class. Because our People class does not declare this value explicitly, the serialization runtime computes an eight-byte value from the structure of the class (its name, interfaces, fields, and methods).

7. Class description flags: one byte describing how the class is serialized. 0x02 indicates that the class implements Serializable.

8. Number of fields: two bytes giving the number of fields to be serialized, in this case 0x0002, corresponding to our name and age fields.

Each field is then described in turn:

9. Field type code: one byte. 0x4C is the ASCII character L, meaning the field holds an object type rather than a primitive type.

10. Field name: two bytes of length followed by the name. 0x0003 means that the next three bytes are the name of the field, and 0x616765 decodes to age.

11. Field type name: this starts with the three bytes 0x740013. 0x74 marks a string, which is how the type name of an object field is written, so the type name of every such field starts with 0x74 the first time it appears. The next two bytes give its length: 0x0013 is 19, so the following 19 bytes are the type name of the field, Ljava/lang/Integer;.

The second field, name, is described in the same way, so we will not repeat the process here and will continue with what comes after it.

12. End of field description: one byte with the fixed value 0x78 marks the end of the description of this class.

13. Superclass description: one byte. 0x70 means null, that is, there is no serializable superclass left to describe. Object does not implement Serializable, so it is never described.

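After the People class description comes the Integer object held in the age field, serialized in the same way. Its class description ends with 0x7872: 0x78 closes the description of Integer, and 0x72 opens the description of its superclass, Number, which is also serializable. Why? Because an instance of a subclass also carries the state of each of its superclasses, so every serializable class in the hierarchy has to be described.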

This continues until the last 0x7870, which indicates that all the type information has been written. What follows is the data of each field.

The first four bytes, 0x00000017, are the value of our first field, age, which is 23. 0x74 then indicates that the value of the second field is a String; 0x0006 is its length, and the last six bytes are exactly the string single.

That completes the tour of the serialization file format. In summary:

The whole serialization file is divided into two parts, the type descriptions and the field data. It is important to understand that when a field holds an ordinary Java object, the class of that object is described as well, together with each of its serializable superclasses. In this example three classes were described: People, Integer, and Number. The values assigned to their fields are stored in the data part in the same order.
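The layout described above can be checked with a small hand-written reader. The sketch below decodes only the stream header and the first class description, using DataInputStream; it is not a general parser. StreamLayoutDemo, describe, and the People shape (String name, Integer age) are assumptions for illustration, and because People is a nested class here, the class name in the stream differs from the TestSerializable.People seen in the article’s dump.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamConstants;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class StreamLayoutDemo {

    // Assumed shape of People: a String name and an Integer age.
    static class People implements Serializable {
        private String name;
        private Integer age;

        People(String name, Integer age) {
            this.name = name;
            this.age = age;
        }
    }

    static byte[] serialize(Object object) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(object);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Decode the stream header and the first class description, nothing more.
    static List<String> describe(byte[] bytes) {
        List<String> out = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            out.add(String.format("magic=0x%04X", in.readUnsignedShort()));
            out.add("version=" + in.readUnsignedShort());
            out.add(String.format("objectTag=0x%02X", in.readUnsignedByte()));
            out.add(String.format("classDescTag=0x%02X", in.readUnsignedByte()));
            out.add("className=" + in.readUTF());           // 2-byte length + name
            out.add("serialVersionUID=" + in.readLong());   // 8 bytes
            out.add(String.format("flags=0x%02X", in.readUnsignedByte()));
            int fieldCount = in.readUnsignedShort();
            out.add("fieldCount=" + fieldCount);
            for (int i = 0; i < fieldCount; i++) {
                char typeCode = (char) in.readUnsignedByte();
                String description = "field " + typeCode + " " + in.readUTF();
                if (typeCode == 'L' || typeCode == '[') {
                    // Object and array fields carry a type name: a string (0x74)
                    // or a reference (0x71) to a type name written earlier.
                    int tag = in.readUnsignedByte();
                    if (tag == ObjectStreamConstants.TC_STRING) {
                        description += " " + in.readUTF();
                    } else if (tag == ObjectStreamConstants.TC_REFERENCE) {
                        description += String.format(" ref=0x%08X", in.readInt());
                    } else {
                        throw new IllegalStateException("unexpected tag " + tag);
                    }
                }
                out.add(description);
            }
            out.add(String.format("endTag=0x%02X", in.readUnsignedByte()));
            out.add(String.format("superTag=0x%02X", in.readUnsignedByte()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        describe(serialize(new People("single", 23))).forEach(System.out::println);
    }
}
```

The fields are listed as age first and name second, matching the order seen in the dump, because the serialization runtime sorts the fields before writing them.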

Some advanced ideas about serialization

Serialization of shared references

Consider two classes, ClassA and ClassB. Their definitions are almost identical: each declares a single field of type People.

Suppose a ClassA object and a ClassB object share the same People instance. The question is: if I serialize both objects, will the shared People object be serialized twice?

Let’s open the binary, which is a bit more complicated this time:

I circled each 0x7870, which marks the end of one piece of type information. We’ll go through this briefly without every detail.

The first part describes the ClassA type. It shows that ClassA has only one field, that the field is of an object type, and records details such as the type name of the field.

The second part describes the People type, including its name field, and stores the value assigned to that field, the string single.

The third part describes the ClassB type. Less is written for ClassB than for ClassA, even though the two classes have the same internal definition.

The shaded part is the fully qualified name of ClassB, and the red box is its serialVersionUID, which is computed automatically by the serialization runtime because we do not declare it explicitly. After that comes one field of object type with a six-byte name.

0x71 indicates a reference. By convention the type name of the field would be written here, but because it has already been written to the stream, a reference pointing back to what was written earlier for People is stored instead.

The last section would, by convention, hold the field data: the type of the value, its length, and the value itself. But because ClassB’s people field refers to the same object as ClassA’s people field, the People object is not serialized again. Only the reference handle of the People object written above is stored.

In short: if you serialize several objects of the same class into one stream, Java describes that class only once; and if the same object is written several times into one stream, Java stores its data only once and writes references for every later occurrence.
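The effect is easy to confirm from the reading side. In the sketch below, the class bodies are assumptions based on the description above (ClassA and ClassB each hold one People field named people), and SharedReferenceDemo, serializeAll, and deserializeAll are hypothetical helper names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SharedReferenceDemo {

    static class People implements Serializable {
        String name;
        People(String name) { this.name = name; }
    }

    static class ClassA implements Serializable {
        People people;
        ClassA(People people) { this.people = people; }
    }

    static class ClassB implements Serializable {
        People people;
        ClassB(People people) { this.people = people; }
    }

    // Write all objects into ONE stream so that handles can be shared.
    static byte[] serializeAll(Object... objects) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            for (Object object : objects) {
                out.writeObject(object);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Read `count` objects back from one stream, in the order they were written.
    static Object[] deserializeAll(byte[] bytes, int count) {
        Object[] result = new Object[count];
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            for (int i = 0; i < count; i++) {
                result[i] = in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        People shared = new People("single");
        Object[] restored = deserializeAll(serializeAll(new ClassA(shared), new ClassB(shared)), 2);
        ClassA a = (ClassA) restored[0];
        ClassB b = (ClassB) restored[1];
        // Both fields point at the same restored People instance.
        System.out.println(a.people == b.people);   // true
    }
}
```

Sharing only works within a single stream. If the two objects are written through two separate ObjectOutputStream instances, each stream gets its own copy of People and the identity is lost on reading.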

Custom serialization

For every class that implements Serializable, all non-static fields are serialized by default, regardless of their access modifier. Sometimes, though, it is not necessary to serialize every field and you only want to serialize some of them.

We just need to put the transient keyword in front of the fields we don’t want serialized.

private transient String name;

Even if you assign a value to the name field of your object, it is not saved to the file. When you deserialize the object, the name field holds its default value, null.
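A minimal sketch of this behavior, assuming a People class with a transient String name and an Integer age; TransientDemo and roundTrip are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class TransientDemo {

    static class People implements Serializable {
        private transient String name;   // skipped by default serialization
        private Integer age;

        People(String name, Integer age) {
            this.name = name;
            this.age = age;
        }

        String getName() { return name; }
        Integer getAge() { return age; }
    }

    // Serialize the object and immediately read it back.
    static People roundTrip(People people) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(people);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(buffer.toByteArray()))) {
            return (People) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        People restored = roundTrip(new People("single", 20));
        System.out.println(restored.getName());   // null
        System.out.println(restored.getAge());    // 20
    }
}
```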

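In addition, Java lets a class define its own writeObject and readObject methods to implement custom serialization logic. These are not overrides of anything; they are private methods that the serialization runtime looks for.

The declarations of the two methods must match exactly: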

private void writeObject(java.io.ObjectOutputStream s) throws java.io.IOException

private void readObject(java.io.ObjectInputStream s) throws java.io.IOException, ClassNotFoundException

When you serialize an object through ObjectOutputStream’s writeObject method, the serialization runtime checks whether the object’s class declares these methods. If it does, the runtime calls them instead of applying the default logic implemented by the JDK.

Let’s look at an example:
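A minimal sketch of such a class, assuming People has a transient String name and an Integer age. CustomSerializationDemo and roundTrip are hypothetical names, and writing the name with s.writeObject is one possible choice (writeUTF would work as well).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class CustomSerializationDemo {

    static class People implements Serializable {
        private transient String name;   // not handled by default serialization
        private Integer age;

        People(String name, Integer age) {
            this.name = name;
            this.age = age;
        }

        String getName() { return name; }
        Integer getAge() { return age; }

        // Called by the serialization runtime instead of the default logic.
        private void writeObject(ObjectOutputStream s) throws IOException {
            s.defaultWriteObject();   // writes the non-transient fields (age)
            s.writeObject(name);      // then append the transient field by hand
        }

        // Must read back exactly what writeObject wrote, in the same order.
        private void readObject(ObjectInputStream s) throws IOException, ClassNotFoundException {
            s.defaultReadObject();
            name = (String) s.readObject();
        }
    }

    static People roundTrip(People people) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(people);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(buffer.toByteArray()))) {
            return (People) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        People restored = roundTrip(new People("single", 20));
        System.out.println(restored.getName());
        System.out.println(restored.getAge());
    }
}
```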

Here name is marked transient, so the default serialization mechanism skips it. The class defines writeObject and readObject, which first call the default serialization method and then write and read the name field, respectively.

Output result:

single
20

If you are interested, look at the serialized binary file yourself. The class description contains no name field, but the data of the People object is followed by our string single.

Deserialization works the same way in reverse. First a People object is restored by the JDK’s default deserialization mechanism, and then the string that follows it is read and assigned to the name field of that object.

Serialized version problem

We have mentioned the serial version ID several times without saying what it actually does. Used well, it lets you control which serialized data a class will accept; used badly, it makes deserialization fail.

Java recommends that every class that implements Serializable declare a serial version field.

private static final long serialVersionUID = xxxxL;

This value can be understood as a version identifier for the class. When an object is serialized, the serialVersionUID of its class is written to the stream. Deserialization first checks whether the value in the binary file matches the one in the local class, and if they differ, deserialization is rejected with java.io.InvalidClassException.

Declaring this value is not mandatory. If you omit it, the serialization runtime computes one from the structure of the class. But then even a small change to the class can change the value, and files that were serialized earlier can no longer be deserialized, because the computed value is outside your control.

So Java recommends that you declare the version number yourself, so that you control whether previously serialized objects can still be deserialized.
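The value that will be written for a class can be inspected with java.io.ObjectStreamClass. In this sketch the Versioned and Unversioned classes and the uidOf helper are hypothetical names for illustration.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

class SerialVersionDemo {

    // Declares its version explicitly, so the value is under our control.
    static class Versioned implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;
    }

    // Declares nothing, so the runtime computes a value from the class structure.
    static class Unversioned implements Serializable {
        String name;
    }

    // Returns the serialVersionUID that serialization would write for this class.
    static long uidOf(Class<?> type) {
        return ObjectStreamClass.lookup(type).getSerialVersionUID();
    }

    public static void main(String[] args) {
        System.out.println(uidOf(Versioned.class));     // 1
        System.out.println(uidOf(Unversioned.class));   // a computed value
    }
}
```

The JDK’s serialver command-line tool reports the same value for a compiled class, which is handy when you want to pin the computed value of an existing class before changing it.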

That concludes our brief look at serialization. Much of it was explained against the binary file, which can be tedious, but hopefully it has improved your understanding of Java object serialization. If you have questions, leave a comment and we can discuss them.


All the code, images and files in this article are stored in the cloud on my GitHub:

(https://github.com/SingleYam/overview_java)

You are welcome to follow the WeChat official account OneJavaCoder. All articles are also published there.