ProtoBuf, a cross-platform, language-independent, and extensible method of serializing structured data, is widely used for network data exchange and storage. As the Internet develops, systems are becoming more heterogeneous and the need for cross-language communication more obvious. At the same time, gRPC has great potential to replace REST, and ProtoBuf is what gives gRPC its cross-language support and high performance. As engineers, we need to understand how ProtoBuf works in order to lay the foundation for future technology upgrades and selection.

I have summarized my past learning and practical experience into a series of articles to discuss and learn together with you. I hope you find them useful, and corrections are welcome wherever I have made mistakes.

This series of articles mainly includes:

  1. In-depth understanding of ProtoBuf principles and engineering practices (Overview)
  2. In-depth understanding of ProtoBuf principles and engineering practices (Encoding)
  3. In-depth understanding of ProtoBuf principles and engineering practices (Serialization)
  4. In-depth understanding of ProtoBuf principles and engineering practices (Engineering Practices)

1. What is ProtoBuf

Protocol Buffers (ProtoBuf) is a cross-platform, language-independent, and extensible method of serializing structured data for network data exchange and storage.

ProtoBuf is a flexible, efficient, and automated mechanism for serializing structured data. Compared with XML or JSON describing the same information, ProtoBuf produces smaller serialized output, serializes and deserializes faster, and is simpler to use.

Once you have defined the structure of the data to be processed, you can use ProtoBuf's code generation tools to generate the associated code. By describing the data structure once with ProtoBuf, you can easily read and write your structured data in a variety of languages (Proto3 supports C++, Java, Python, Go, Ruby, Objective-C, and C#) and to or from a variety of data streams.

2. Why ProtoBuf

You might think that Google invented ProtoBuf to make serialization faster, but that is not the real reason.

ProtoBuf was originally used by Google to solve problems with the index server request/response protocol. Before ProtoBuf, Google already had a request/response format whose encoding and decoding were handled by hand. It supported multiple protocol versions, but the resulting code was not elegant:

if (protocolVersion == 1) {
    doSomething();
} else if (protocolVersion == 2) {
    doOtherThing();
}
...

An explicitly formatted protocol also makes rolling out a new protocol version very complicated, because developers must ensure that every server between the originator of the request and the server that actually handles it understands the new protocol before they can flip the switch and start using it.

This is the compatibility problem between old and new versions that every server developer has run into.

To solve these problems, ProtoBuf was born.

ProtoBuf originally had the following two features:

  • It’s easier to introduce new fields, and an intermediate server that doesn’t need to check the data can simply parse and pass the data without knowing all the fields.
  • The data format is more self-descriptive and can be processed in a variety of languages (C++, Java, etc.).

This early version of ProtoBuf still required hand-written parsing code.

As the system evolved, ProtoBuf gained more features:

  • Automatically generated serialization and deserialization code removes the need for manual parsing. (The official code generation tool covers most major language platforms.)
  • In addition to being used for data exchange, ProtoBuf is used as a convenient self-describing format for persisting data.

ProtoBuf is now Google's lingua franca for data exchange and storage. There are 48,162 different message types defined in the Google code tree across 12,183 .proto files. They are used both in RPC systems and to persist data in various storage systems.

ProtoBuf was originally designed to solve server-side compatibility issues between old and new protocol versions, hence the apt name "Protocol Buffers". Only later did it develop into a general format for transmitting data.

The official explanation of the name Protocol Buffers:

Why the name “Protocol Buffers”?

The name originates from the early days of the format, before we had the protocol buffer compiler to generate classes for us. At the time, there was a class called ProtocolBuffer which actually acted as a buffer for an individual method. Users would add tag/value pairs to this buffer individually by calling methods like AddValue(tag, value). The raw bytes were stored in a buffer which could then be written out once the message had been constructed.

Since that time, the “buffers” part of the name has lost its meaning, but it is still the name we use. Today, people usually use the term “protocol message” to refer to a message in an abstract sense, “protocol buffer” to refer to a serialized copy of a message, and “protocol message object” to refer to an in-memory object representing the parsed message.

3. How to use ProtoBuf

3.1 ProtoBuf workflow

With a serialization protocol like this, the user only needs to focus on the business object itself, that is, the IDL definition; the serialization and deserialization code is generated by the tool.

3.2 ProtoBuf message definition

ProtoBuf messages are described in the IDL file (.proto). Here is the message descriptor customer.proto used in this example:

syntax = "proto3";

package domain;

option java_package = "com.protobuf.generated.domain";
option java_outer_classname = "CustomerProtos";

message Customers {
    repeated Customer customer = 1;
}

message Customer {
    int32 id = 1;
    string firstName = 2;
    string lastName = 3;

    enum EmailType {
        PRIVATE = 0;
        PROFESSIONAL = 1;
    }

    message EmailAddress {
        string email = 1;
        EmailType type = 2;
    }

    repeated EmailAddress email = 5;
}



Customers contains multiple Customer messages. Each Customer contains an id field, a firstName field, a lastName field, and a collection of email addresses.

In addition to these definitions, there are three declarations at the top of the file that guide the code generator:

  1. First, syntax = "proto3" sets the IDL syntax version. There are currently two versions, proto2 and proto3. Their .proto syntax is not interchangeable, although both use the same wire format. Since Proto3 supports more languages and has a more concise syntax than Proto2, this article uses Proto3.

  2. Second, there is the package domain; declaration. It is used to namespace the generated classes/objects.

  3. Third, there is the option java_package declaration. The generator also uses it to nest the generated sources, but it applies only to Java. Having both settings lets the generator behave differently for Java and JavaScript: the Java classes are created under the package com.protobuf.generated.domain, while the JavaScript objects are created under the package domain. The accompanying option java_outer_classname sets the name of the generated outer Java class, CustomerProtos.

ProtoBuf provides more options and data types. This article will not cover them in detail.

3.3 Code generation

First install the ProtoBuf compiler, protoc (see the detailed installation tutorial). Once it is installed, you can generate the Java source code with the following command:

protoc --java_out=./src/main/java ./src/main/idl/customer.proto

Run the command from the project's root path with two arguments: --java_out, which sets ./src/main/java/ as the output directory for the Java code, and ./src/main/idl/customer.proto, which is the path of the .proto file.

The generated code is very complex, but fortunately its usage is very simple.

        CustomerProtos.Customer.EmailAddress email = CustomerProtos.Customer.EmailAddress.newBuilder()
                .setType(CustomerProtos.Customer.EmailType.PROFESSIONAL)
                .setEmail("[email protected]").build();

        CustomerProtos.Customer customer = CustomerProtos.Customer.newBuilder()
                .setId(1)
                .setFirstName("Lee")
                .setLastName("Richardson")
                .addEmail(email)
                .build();
        // Serialize
        byte[] binaryInfo = customer.toByteArray();
        System.out.println(bytes_String16(binaryInfo));
        System.out.println(customer.toByteArray().length);
        // Deserialize
        CustomerProtos.Customer anotherCustomer = CustomerProtos.Customer.parseFrom(binaryInfo);
        System.out.println(anotherCustomer.toString());

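The usage snippet calls bytes_String16, which is neither part of the generated code nor of the ProtoBuf runtime, and the article does not show it. It presumably prints a byte array as hexadecimal. A minimal sketch of such a helper, under that assumption (the class name HexDump and the method name toHex are hypothetical):

```java
public class HexDump {
    private static final char[] DIGITS = "0123456789abcdef".toCharArray();

    // Renders each byte as two lowercase hex digits, high nibble first.
    // Hypothetical stand-in for the article's bytes_String16 helper.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(DIGITS[(b >> 4) & 0x0F]);
            sb.append(DIGITS[b & 0x0F]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints 08011205
        System.out.println(toHex(new byte[] {0x08, 0x01, 0x12, 0x05}));
    }
}
```

With a helper like this, System.out.println(HexDump.toHex(binaryInfo)) would print the serialized message in the same hex form used in the size comparison later in this article.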

3.4 Performance Data

Using Customers as the model, we construct small, medium, and large objects for a performance comparison.

Serialization time and data size after serialization comparison

Deserialization time

Refer to the official Benchmark for more performance data.

4. Summary

Now that we’ve explained what ProtoBuf is, the context in which it was created, and the basic use of it, let’s summarize.

Advantages:

1. High efficiency

ProtoBuf encodes data as T-(L)-V (tag-length-value) and needs no delimiters such as ", {, }, or : to structure the information. It also applies varint compression at the encoding level. As a result, ProtoBuf serializes the same information into far fewer bytes and uses less network traffic to transmit it, which makes it a good choice for scenarios with tight network resources and high performance requirements.

// Let's make a quick comparison.
// Take the following JSON data:
{"id":1,"firstName":"Chris","lastName":"Richardson","email":[{"type":"PROFESSIONAL","email":"crichardson@email.com"}]}

// Serialized as JSON, the data is 118 bytes:
7b226964223a312c2266697273744e616d65223a224368726973222c226c6173744e616d65223a2252696368617264736f6e222c22656d61696c223a5b7b2274797065223a2250524f46455353494f4e414c222c22656d61696c223a226372696368617264736f6e40656d61696c2e636f6d227d5d7d

// Serialized with ProtoBuf, the data is 48 bytes:
0801120543687269731a0a52696368617264736f6e2a190a156372696368617264736f6e40656d61696c2e636f6d1001
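To make the tag-length-value layout and the varint encoding concrete, the sketch below hand-encodes the same Customer without the ProtoBuf library and reproduces the 48 bytes shown above. It is an illustration under stated assumptions, not the library's implementation: the class name WireEncodeSketch and all of its methods are hypothetical, only wire types 0 (varint) and 2 (length-delimited) are handled, and the email address is the one spelled out by the hex dump.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class WireEncodeSketch {
    static final int WIRE_VARINT = 0; // int32, enum, ...
    static final int WIRE_LEN = 2;    // string, nested message, ...

    // Varint: 7 payload bits per byte, lowest group first; the high bit
    // of each byte signals that another byte follows.
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // Tag = (field number << 3) | wire type, itself written as a varint.
    static void writeTag(ByteArrayOutputStream out, int fieldNumber, int wireType) {
        writeVarint(out, ((long) fieldNumber << 3) | wireType);
    }

    // T-V: tag followed by a varint value (no length needed).
    static void writeVarintField(ByteArrayOutputStream out, int fieldNumber, long value) {
        writeTag(out, fieldNumber, WIRE_VARINT);
        writeVarint(out, value);
    }

    // T-L-V: tag, payload length as a varint, then the payload bytes.
    static void writeBytesField(ByteArrayOutputStream out, int fieldNumber, byte[] payload) {
        writeTag(out, fieldNumber, WIRE_LEN);
        writeVarint(out, payload.length);
        out.write(payload, 0, payload.length);
    }

    static void writeStringField(ByteArrayOutputStream out, int fieldNumber, String s) {
        writeBytesField(out, fieldNumber, s.getBytes(StandardCharsets.UTF_8));
    }

    // Field numbers follow customer.proto: id=1, firstName=2, lastName=3, email=5.
    static byte[] encodeCustomer() {
        ByteArrayOutputStream email = new ByteArrayOutputStream();
        writeStringField(email, 1, "crichardson@email.com");
        writeVarintField(email, 2, 1); // EmailType.PROFESSIONAL = 1

        ByteArrayOutputStream customer = new ByteArrayOutputStream();
        writeVarintField(customer, 1, 1);
        writeStringField(customer, 2, "Chris");
        writeStringField(customer, 3, "Richardson");
        writeBytesField(customer, 5, email.toByteArray()); // nested message
        return customer.toByteArray();
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] encoded = encodeCustomer();
        System.out.println(encoded.length); // prints 48
        System.out.println(toHex(encoded)); // prints the 48-byte hex string above
    }
}
```

Each field costs one tag byte plus its payload (and one length byte for strings and nested messages), whereas JSON repeats every field name and delimiter as text. That accounts for the gap between 118 and 48 bytes.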

In terms of speed, ProtoBuf serializes and deserializes faster than both XML and JSON, and is 20 to 100 times faster than XML.

2. Cross-platform and multi-language support

ProtoBuf is platform independent and allows seamless communication across platforms and languages, for example between Android and PC, or between C# and Java.

Proto3 supports C++, Java, Python, Go, Ruby, Objective-C, C#.

3. Good scalability and compatibility

Backward compatibility, which allows a data structure to evolve while remaining compatible with older versions, is the very problem ProtoBuf was designed to solve. It works because a parser simply skips fields it does not recognize, so old code can still read messages that carry new fields.
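Skipping unrecognized fields is possible because every tag carries a wire type, which tells a reader how many bytes a field occupies even when it knows nothing else about the field. The sketch below imitates an "old" reader that only knows field 1 (id) and skips everything else in the 48-byte message from the size comparison above. It is a simplified illustration, not the parser the ProtoBuf runtime uses: the class OldReaderSketch and its members are hypothetical, and there is no bounds checking for truncated input.

```java
public class OldReaderSketch {
    // The 48-byte Customer message from the size comparison.
    static final String MESSAGE_HEX =
            "0801120543687269731a0a52696368617264736f6e"
            + "2a190a156372696368617264736f6e40656d61696c2e636f6d1001";

    private final byte[] buf;
    private int pos = 0;
    int skipped = 0; // number of top-level fields this reader did not understand

    OldReaderSketch(byte[] buf) {
        this.buf = buf;
    }

    // Decodes one varint: 7 bits per byte, lowest group first.
    long readVarint() {
        long result = 0;
        int shift = 0;
        while (true) {
            byte b = buf[pos++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
    }

    // Reads field 1 (id) and skips every other field using only its wire type.
    int readIdSkippingUnknown() {
        int id = 0;
        while (pos < buf.length) {
            long tag = readVarint();
            int fieldNumber = (int) (tag >>> 3);
            int wireType = (int) (tag & 0x7);
            if (fieldNumber == 1 && wireType == 0) {
                id = (int) readVarint();
                continue;
            }
            switch (wireType) {
                case 0: // varint: read and discard
                    readVarint();
                    break;
                case 1: // fixed 64-bit
                    pos += 8;
                    break;
                case 2: // length-delimited: the length says how far to jump
                    int length = (int) readVarint();
                    pos += length;
                    break;
                case 5: // fixed 32-bit
                    pos += 4;
                    break;
                default:
                    throw new IllegalStateException("unsupported wire type " + wireType);
            }
            skipped++;
        }
        return id;
    }

    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        OldReaderSketch reader = new OldReaderSketch(fromHex(MESSAGE_HEX));
        int id = reader.readIdSkippingUnknown();
        // prints id=1 skipped=3 (firstName, lastName, and email are skipped)
        System.out.println("id=" + id + " skipped=" + reader.skipped);
    }
}
```

The nested EmailAddress message is skipped as a single length-delimited block. The reader never needs to look inside it, which is also why an intermediate server can pass data through without knowing all the fields.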

4. Easy to use

ProtoBuf provides a set of compilation tools that automatically generate the serialization and deserialization boilerplate code, so developers can focus only on the IDL of their business data. This simplifies encoding and decoding work and reduces the complexity of multi-language interaction.

Disadvantages:

Poor readability and lack of self-description

XML and JSON are self-describing; ProtoBuf is not.

ProtoBuf is a binary protocol, so the encoded data is not human-readable. Without the IDL file, the binary data stream cannot be interpreted, which makes debugging harder.

Charles already supports ProtoBuf; you only need to import the data description file. For details, see Charles Protocol Buffers.

On the other hand, because the binary data stream cannot be parsed without the IDL file, ProtoBuf protects data to a certain extent: it raises the bar for cracking core data and reduces the risk of that data being stolen.

5. References

  1. Wikipedia

  2. Serialization and deserialization

  3. The official Benchmark

  4. Charles Protocol Buffers

  5. choose-protocol-buffers

By Li Guanyun