In general, programs work with data in two forms. One is in memory: the familiar structures such as lists, arrays, and so on. The other is data written to a file or transferred over the network, where it is essentially a stream of bytes. How does the receiver parse the received byte stream? This is where serialization comes in: the data is converted into a self-describing byte stream, which the receiver can then deserialize back into the corresponding in-memory structures.

Native formats for various languages

Many languages have built-in serialization facilities, such as java.io.Serializable in Java, pickle in Python, and so on. They are convenient to use, but come with certain limitations:

  1. If data is serialized in a particular language, it must also be deserialized in that language. This makes communication between components written in different languages difficult, for example a client and a server implemented in different languages.
  2. Because deserialization can instantiate arbitrary classes, these formats easily open the door to security attacks.
  3. These language-specific libraries generally have poor forward and backward compatibility.
  4. Performance is generally poor, both in CPU cost and in the size of the encoded output.
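
To make points 1 and 2 concrete, here is a minimal sketch of Python's built-in pickle in action (the User class is made up for the example). The resulting bytes can realistically only be decoded by Python, and calling pickle.loads on untrusted input can execute arbitrary code:

```python
import pickle

class User:
    # Hypothetical class, just for the example.
    def __init__(self, name: str, favorite_number: int):
        self.name = name
        self.favorite_number = favorite_number

data = pickle.dumps(User("Alice", 42))  # a Python-specific byte stream
user = pickle.loads(data)  # works only in Python; never call loads()
                           # on untrusted data, it can run arbitrary code
print(user.name, user.favorite_number)  # Alice 42
```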

For these reasons, a language’s native serialization and deserialization functions are generally avoided. So what are the alternatives?

JSON, XML and CSV

The most common language-independent serialization standards are JSON and XML. The former is popular largely because browsers support it natively, while the latter is often considered too verbose and complex. And of course, the CSV format is used by a lot of people. All of these formats are human-readable, but each has its problems:

  1. Numbers are handled poorly. In XML and CSV it is almost impossible to distinguish a number from a string of digits (unless special treatment is applied). JSON is better, but it does not distinguish integers from floating-point numbers.
  2. JSON and XML support Unicode strings, but not binary strings. There are ways around this, such as Base64-encoding the binary data (see the sketch after this list), but they come at a price in size.
  3. CSV has no schema, so it is entirely up to the application to define the meaning of each row and column.
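
For point 2, the usual workaround is to Base64-encode binary data before embedding it in JSON. A minimal sketch in Python, showing the size cost:

```python
import base64
import json

raw = bytes([0x00, 0xFF, 0x10, 0x80])  # arbitrary binary data
doc = {"payload": base64.b64encode(raw).decode("ascii")}

encoded = json.dumps(doc)              # now valid JSON
decoded = base64.b64decode(json.loads(encoded)["payload"])
assert decoded == raw

print(len(raw), len(doc["payload"]))   # 4 raw bytes become 8 characters;
                                       # Base64 adds roughly 33% for larger payloads
```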

Despite these issues, JSON, XML, and CSV are good enough for many purposes and remain very popular.

Binary encoding

JSON and XML are fine, but their encodings are somewhat verbose. This is negligible at small scale, but becomes significant with large volumes of data. Hence many binary encodings have been built on top of them, such as MessagePack, BSON, BJSON, and UBJSON for JSON, and WBXML for XML.

Let’s take MessagePack as an example and look at how it encodes a JSON document.
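
Judging from the byte-by-byte walkthrough below (three fields, an 8-byte key userName, a 6-byte string value, and the number 1337 appearing in the later examples), the document was presumably the classic record along these lines:

```json
{
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"]
}
```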

  1. The first byte is 0x83: the high four bits (0x80) indicate that this is an object, and the low four bits (0x03) indicate that the object has three fields.
  2. The second byte is 0xa8: the high four bits (0xa0) indicate that this is a string, and the low four bits (0x08) give its length, eight bytes.
  3. The following eight bytes are the ASCII encoding of userName.
  4. The following byte, 0xa6, works just like 0xa8, but this time the string length is 6.
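
These numbers are easy to verify with the third-party msgpack package for Python; a minimal sketch, assuming the document reconstructed above:

```python
import json
import msgpack  # third-party: pip install msgpack

doc = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

as_json = json.dumps(doc, separators=(",", ":")).encode("ascii")
as_msgpack = msgpack.packb(doc)

print(len(as_json))        # 81 bytes
print(len(as_msgpack))     # 66 bytes
print(hex(as_msgpack[0]))  # 0x83: a map with three entries
```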

After this encoding, the length drops from 81 bytes to 66 bytes: a gain in size, but at the cost of readability. Whether that trade-off is worth it is a matter of opinion. A big reason to hesitate is that the size reduction is not that dramatic. The formats below make the savings much more noticeable.

Thrift and Protocol Buffers

Thrift and Protocol Buffers follow the same central idea as the formats above, but each has its particular strengths. Thrift was invented at Facebook, and Protocol Buffers (Protobuf) was developed at Google.

Let’s look at Thrift first. It requires a schema to be defined up front.
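
Based on the field names and tags referenced below (userName, favoriteNumber, and tags 1, 2, 3), the schema was presumably the classic Person example, along these lines:

```thrift
struct Person {
  1: required string       userName,
  2: optional i64          favoriteNumber,
  3: optional list<string> interests
}
```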

Thrift has two binary encoding formats: BinaryProtocol and CompactProtocol. Let’s first look at how BinaryProtocol serializes the example above.

With BinaryProtocol, the first byte of each field is a type marker, indicating whether the field is a string, an integer, a list, and so on. Instead of the key strings themselves (userName, favoriteNumber, etc.), there is a field tag: the numbers 1, 2, 3 that precede each field name in the schema. Replacing full key names with short numeric tags is what reduces the total size. In this format, the example comes to only 59 bytes.

The other encoding format is Thrift CompactProtocol. For the same example it achieves a much better compression ratio: the same content takes only 34 bytes. The first difference is that it packs the field tag and the type into a single byte. The size of the length field then varies depending on the type. Finally, the number 1337, instead of being stored in 8 bytes, is stored in just 2: the top bit of each byte indicates whether more bytes follow, so values from -64 to 63 need only one byte, and values from -8192 to 8191 need only two.
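
CompactProtocol’s exact bit layout has more details than described here, but the core trick (variable-length integers, with zigzag encoding for signed values) can be sketched in a few lines of Python. This illustrates the technique; it is not Thrift’s actual implementation:

```python
def zigzag(n: int) -> int:
    # Interleave signed values so small magnitudes stay small:
    # 0, -1, 1, -2, 2, ... map to 0, 1, 2, 3, 4, ...
    return (n << 1) ^ (n >> 63)

def varint(n: int) -> bytes:
    # 7 payload bits per byte; the high bit means "more bytes follow".
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

print(len(varint(zigzag(63))))    # 1 byte  (-64..63 fits in one byte)
print(len(varint(zigzag(-64))))   # 1 byte
print(len(varint(zigzag(1337))))  # 2 bytes (-8192..8191 fits in two)
```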

After Thrift, let’s look at Protobuf. Its schema definition is similar.
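
For the same running example, it was presumably a proto2 definition along these lines (note the required, which is discussed below):

```protobuf
message Person {
  required string user_name       = 1;
  optional int64  favorite_number = 2;
  repeated string interests       = 3;
}
```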

Its binary encoding is similar to CompactProtocol, with a size of 33 bytes. It is important to note that the required we set in the schema above makes no difference to the encoded bytes. The only difference is that the presence of the field is checked at runtime. This deserves special attention when dealing with forward and backward compatibility across schema versions, and can cause problems if you are not careful.

Avro

Apache Avro is another binary encoding format, slightly different from Protocol Buffers and Thrift. Avro’s schema comes in two languages: one (Avro IDL) is meant to be human-readable, and the other (based on JSON) is more machine-friendly.
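
For the running example, the two would presumably look like this. First the human-oriented Avro IDL:

```
record Person {
  string               userName;
  union { null, long } favoriteNumber = null;
  array<string>        interests;
}
```

And the equivalent machine-friendly JSON schema:

```json
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "userName",       "type": "string"},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
    {"name": "interests",      "type": {"type": "array", "items": "string"}}
  ]
}
```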

Notice that this schema contains no tag numbers. There is no identifying information in the encoded byte stream either: only lengths and concatenated values, no individual field tags.

Without this information, how can the reading side parse the data? One option is to require that the reader’s schema be identical to the writer’s schema, so no schema information needs to be transferred at all. Another is for the reader to obtain the writer’s schema. When transferring many records that share one schema, the writer’s schema can be written once at the beginning of the file instead of being repeated for every record. Alternatively, a registry of schema versions can be maintained, so the right schema can be found with a simple lookup.
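
As a sketch of the file-based approach (writer’s schema embedded once at the start), here is how the third-party fastavro package behaves with Avro object container files, using the schema assumed above:

```python
import io
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName",       "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
        {"name": "interests",      "type": {"type": "array", "items": "string"}},
    ],
})

buf = io.BytesIO()
# The container file embeds the writer's schema once in its header.
writer(buf, schema, [{"userName": "Martin",
                      "favoriteNumber": 1337,
                      "interests": ["daydreaming", "hacking"]}])

buf.seek(0)
# The reader recovers the writer's schema from the header; no tags needed.
for record in reader(buf):
    print(record["userName"], record["favoriteNumber"])
```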

Conclusion

This article discussed common serialization formats, from textual ones such as JSON, XML, and CSV to binary encodings, with emphasis on Thrift, Protobuf, and Avro.