Preface

Recently I have been busy with the open-source work on Trojan, our mobile log SDK, which has been running stably in the ele.me team's app and now integrates log encryption and decryption. Trojan is a useful remedy, even an indispensable one, that helps us keep track of our online users and solve difficult problems.

Without further ado, let's get to today's main topic: Protobuf. It may be unfamiliar to you and you may never have touched it, but that doesn't matter; after reading this post I believe you will have a feel for it. At first, in order to save traffic, we used Protobuf instead of JSON in our backend interfaces. It supports Java, C++, Python and other languages, and once we had tasted how easy it is to use and how much traffic it saves, it quickly found more users. Later we extended it to SQLite and SharedPreferences, using Protobuf to replace JSON or XML as the storage format.

Protobuf

So what exactly is a Protobuf?

Protocol buffers are a flexible, efficient, automated mechanism for serializing structured data — think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages. You can even update your data structure without breaking deployed programs that are compiled against the “old” format.

My English is limited, so here is a rough translation; the general idea is:

Protobuf is a flexible, efficient protocol for serializing structured data that is faster, simpler, and lighter than XML. It supports multiple languages: once the data structure is defined, the Protobuf framework generates source code that makes serialization and deserialization of that structure easy. As requirements change, the data structure can be updated without affecting already deployed programs.

From the above we can conclude that Protobuf has the following advantages:

  1. Code generation mechanism
syntax = "proto3";
package me.ele.demo.protobuf;
option java_outer_classname = "LoginInfo";
message Login {
    string account = 1;
    string password = 2;
}

Protobuf provides a Gradle plugin that automatically generates the LoginInfo class under me.ele.demo.protobuf, complete with APIs for serialization and deserialization.

  2. High efficiency

Data from the Clairvoyant project are more convincing.

Serialization time efficiency comparison:

Data format | 1,000 records | 5,000 records
Protobuf    | 195 ms        | 647 ms
JSON        | 515 ms        | 2,293 ms

Serialization space efficiency comparison:

Data format | 5,000 records
Protobuf    | 22 MB
JSON        | 29 MB

As the data above shows, Protobuf serialization is considerably more efficient than JSON in both time and space. The deserialization comparison is omitted for space reasons.

  3. Supports backward compatibility and forward compatibility

When the client and the server use the same protocol, adding a new field to the protocol does not affect the client.

  4. Support for multiple programming languages

The official source released by Google covers three languages: C++, Java, and Python.

As for the disadvantages: Protobuf is encoded in a binary format, which makes it hard to read directly; it also lacks self-description, since the protocol content is binary, and without the .proto definition the payload tells you nothing about its structure.

Integration

In the project's root build.gradle, configure the following:

dependencies {
        classpath 'com.google.protobuf:protobuf-gradle-plugin:0.8.0'
}

The module's build.gradle is configured as follows:

apply plugin: 'com.google.protobuf'

android {
    sourceSets {
        main {
            // Define the proto file directory
            proto {
                srcDir 'src/main/proto'
                include '**/*.proto'
            }
        }
    }
}

dependencies {
    // Define protobuf dependencies, using the lite version
    compile "Com. Google. Protobuf: protobuf - lite: 3.0.0"
    compile ('com. Squareup. Retrofit2: converter - protobuf: 2.2.0') {
        exclude group: 'com.google.protobuf'.module: 'protobuf-java'
    }
}

protobuf {
    protoc {
        artifact = 'com.google.protobuf:protoc:3.0.0'
    }
    plugins {
        javalite {
            artifact = 'com.google.protobuf:protoc-gen-javalite:3.0.0'
        }
    }
    generateProtoTasks {
        all().each { task ->
            task.plugins {
                javalite {}
            }
        }
    }
}

Apply plugin: ‘com.google.protobuf’ is a Gradle plugin for Protobuf, which helps us to automatically generate source code at compile time through semantic analysis, and provides interfaces for initialization, serialization and anti-sequence of data structures.

The compile “com. Google. Protobuf: protobuf – lite: 3.0.0” is supported by protobuf library version, on the basis of the original, replace with the public set and get methods, reduce the protobuf method number of the generated code.

Defining data structures

Let’s start with the example above:

syntax = "proto3";
package me.ele.demo.protobuf;
option java_outer_classname = "LoginInfo";
message Login {
    string account = 1;
    string password = 2;
}

Here we define a LoginInfo that simply contains account and password fields. Note that syntax = "proto3"; declares the proto3 syntax (proto2 and proto3 differ somewhat in how data structures are defined); option java_outer_classname = "LoginInfo"; sets the class name of the class Protobuf generates; and package me.ele.demo.protobuf; defines the package of the generated class.

After a Clean/Rebuild in Android Studio, the Protobuf plugin automatically generates the LoginInfo class for us.

Protobuf also generates a LoginOrBuilder interface, which mainly declares the set and get methods for each field. The core logic of the generated class is serialization to a CodedOutputStream via writeTo(CodedOutputStream) and deserialization from an InputStream via parseFrom(InputStream).
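
As a quick illustration, here is a minimal sketch of how the generated API is typically used, assuming the proto file above has been compiled into a me.ele.demo.protobuf.LoginInfo class (the account and password values are made up for the example):

import me.ele.demo.protobuf.LoginInfo;

public class LoginInfoDemo {
    public static void main(String[] args) throws Exception {
        // Build an immutable Login message via the generated builder
        LoginInfo.Login login = LoginInfo.Login.newBuilder()
                .setAccount("demo@ele.me")   // hypothetical sample value
                .setPassword("123456")
                .build();

        // Serialize: internally the fields are written to a CodedOutputStream
        byte[] bytes = login.toByteArray();

        // Deserialize: parseFrom reads the binary data back into a Login instance
        LoginInfo.Login parsed = LoginInfo.Login.parseFrom(bytes);
        System.out.println(parsed.getAccount());
    }
}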

Principle analysis

As mentioned above, Protobuf is more efficient than JSON in both time and space. How does it achieve this?

After Protobuf serialization, a message becomes a binary data stream in which every field is written as a key-value pair.

Key is defined as follows:

(field_number << 3) | wire_type

For example, the field account definition:

string account = 1;

During serialization the field name account is not written into the binary stream; instead, the key computed from field_number = 1 using the definition above is written. This is why Protobuf is not human-readable, and it is also a major reason for its efficiency.
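
As a concrete illustration (my own worked example, not taken from the Protobuf source), string fields use wire type 2 (length-delimited), so the key for the account field works out to a single byte:

public class KeyDemo {
    public static void main(String[] args) {
        int fieldNumber = 1;                          // "string account = 1"
        int wireType = 2;                             // 2 = length-delimited (string, bytes, embedded message)
        int key = (fieldNumber << 3) | wireType;      // = 0x0A, the first byte written for this field
        System.out.println(Integer.toHexString(key)); // prints "a"
    }
}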

Data types

int32, uint32, sint32, fixed32 and their 64-bit counterparts are all represented as int (long) in Java. Protobuf handles the sign internally (ZigZag encoding for the sint types), but the generated code contains no validation logic, so nothing prevents, for example, passing a negative number to a uint field. In terms of encoding efficiency, fixed32 is more efficient than int32 when the field value is usually greater than 2^28; sint32 is more efficient than int32 when encoding negative values; and uint32 declares that the field value is always a non-negative integer.
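
If you want to check these trade-offs yourself, CodedOutputStream exposes static compute size methods (mentioned again further below) that report how many bytes a value would occupy. A small sketch, assuming the Protobuf Java runtime is on the classpath:

import com.google.protobuf.CodedOutputStream;

public class EncodedSizeDemo {
    public static void main(String[] args) {
        // A negative int32 is always encoded as a 10-byte varint
        System.out.println(CodedOutputStream.computeInt32SizeNoTag(-1));       // 10
        // The same value as sint32 is ZigZag-encoded first, so it needs only 1 byte
        System.out.println(CodedOutputStream.computeSInt32SizeNoTag(-1));      // 1
        // Large values (> 2^28) need 5 varint bytes, while fixed32 is always 4 bytes
        System.out.println(CodedOutputStream.computeUInt32SizeNoTag(1 << 29)); // 5
        System.out.println(CodedOutputStream.computeFixed32SizeNoTag(1 << 29));// 4
    }
}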

Encoding principle

Protobuf uses CodedOutputStream for serialization and CodedInputStream for deserialization. Both provide write/read methods for primitive types and for Message. Each write method takes a fieldNumber and a value: it first writes a tag combining the fieldNumber and the wire type, then writes the value. The tag is a variable-length int (varint): a byte whose most significant bit (MSB) is 1 indicates that the next byte still belongs to the current value, while an MSB of 0 indicates that the value ends with this byte. After the tag, the field value is written, using a different encoding depending on the field type:

  1. For int32/int64, if the value is greater than or equal to 0, it is varint-encoded directly; otherwise it is encoded as a 64-bit varint, which always produces 10 bytes, so int32/int64 is very inefficient at encoding negative numbers.

  2. For uint32/uint64, the value is also varint-encoded; negative numbers are not validated.

  3. For sint32/sint64, the value is first ZigZag-encoded to preserve the sign and then varint-encoded. ZigZag encoding maps negative numbers onto positive ones and doubles all positive numbers: 0 encodes to 0, -1 to 1, 1 to 2, -2 to 3, and so on, so encoding negative numbers remains quite efficient.

  4. For fixed32/sfixed32/fixed64/sfixed64, the value is written directly as a fixed-length, little-endian integer.

  5. For double, the value is converted to a long and then written as an 8-byte fixed-length little-endian integer.

  6. For float, the value is converted to an int and then written as a 4-byte fixed-length little-endian integer.

  7. For bool, a single byte of 0 or 1 is written.

  8. For string, the value is converted to a byte array using UTF-8; the length of the array is written first as a varint, followed by the whole byte array.

  9. For bytes (ByteString), the length is written first as a varint, followed by the whole byte array.

  10. For enumeration types (wire type WIRETYPE_VARINT), the value assigned when the enum item was defined is written using int32 encoding (so negative numbers are not recommended for enum values, because int32 encoding is inefficient for negatives).

  11. For embedded Message types (wire type WIRETYPE_LENGTH_DELIMITED), the length in bytes of the serialized message is written first, followed by the serialized message itself.

ZigZag encoding is implemented as (n << 1) ^ (n >> 31) for 32-bit values and (n << 1) ^ (n >> 63) for 64-bit values. CodedOutputStream also contains a number of static compute methods for calculating the number of bytes a field may occupy, which I won't detail here.
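
To make the mapping concrete, here is a small standalone sketch of ZigZag encoding and decoding that mirrors the formulas above (and the decode formula shown later); it is my own illustration, not the library source:

public class ZigZagDemo {
    static int encode(int n) { return (n << 1) ^ (n >> 31); }   // arithmetic shift copies the sign bit
    static int decode(int n) { return (n >>> 1) ^ -(n & 1); }   // logical shift, then restore the sign

    public static void main(String[] args) {
        for (int v : new int[]{0, -1, 1, -2, 2}) {
            int e = encode(v);
            System.out.println(v + " -> " + e + " -> " + decode(e)); // 0->0, -1->1, 1->2, -2->3, 2->4
        }
    }
}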

In Protobuf serialization, all types are eventually reduced to a variable-length int/long, a fixed-length int/long, a single byte, or a byte array. Writing a byte is simply an assignment into the internal buffer:

public void writeRawByte(final byte value) throws IOException {
  if (position == limit) {
    refreshBuffer();
  }
  buffer[position++] = value;
}

The 32-bit variable-length integer is implemented as:

public void writeRawVarint32(int value) throws IOException {
  while (true) {
    if ((value & ~0x7F) == 0) {
      writeRawByte(value);
      return;
    } else {
      writeRawByte((value & 0x7F) | 0x80);
      value >>>= 7;
    }
  }
}
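
To see the loop in action, here is a small standalone trace (my own sketch, not library code) of how the value 300 ends up on the wire:

public class VarintTrace {
    public static void main(String[] args) {
        int value = 300;                        // 300 = 0b1_0010_1100
        StringBuilder out = new StringBuilder();
        while (true) {
            if ((value & ~0x7F) == 0) {         // 7 or fewer significant bits left: last byte, MSB = 0
                out.append(String.format("%02X", value));
                break;
            } else {                            // more bytes follow: emit low 7 bits with MSB = 1
                out.append(String.format("%02X ", (value & 0x7F) | 0x80));
                value >>>= 7;
            }
        }
        System.out.println(out);                // prints "AC 02"
    }
}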

For fixed-length types, Protobuf uses little-endian byte order; for example, the 32-bit fixed-length integer write is implemented as:

public void writeRawLittleEndian32(final int value) throws IOException {
    writeRawByte((value      ) & 0xFF);
    writeRawByte((value >>  8) & 0xFF);
    writeRawByte((value >> 16) & 0xFF);
    writeRawByte((value >> 24) & 0xFF);
}

For byte arrays, writeRawByte() could simply be called for each byte, but CodedOutputStream applies some performance optimizations here, which I won't go into. CodedInputStream decodes according to the encoding scheme of CodedOutputStream, so it is not described separately. ZigZag decoding is as follows:

(n >>> 1) ^ -(n & 1)

Repeated field encoding

A repeated field can be encoded in either of two ways:

  1. For each item, the tag is written first, followed by the item's data.

  2. The tag is written once, then the item count, and then the count items, each of which is written as length + data.

In terms of encoding efficiency, the second way seems to me more efficient, but for reasons I don't know Protobuf adopted the first. One reason I can think of is that in the first case every item is relatively independent, so during transmission the receiver can parse each item as soon as it arrives, instead of having to wait for the whole repeated-field packet. For primitive types Protobuf originally also used the first method; it was later found to be inefficient, and the [packed = true] option was introduced, which switches to a third method (a variation of the second that is more efficient for primitive types):

  3. The tag is written first, then the total number of bytes of the field, and then the data for each item.

Currently Protobuf only supports packed on repeated fields of primitive types, so if you add [packed = true] to a non-repeated field or to a repeated field of a non-primitive type, the compiler reports an error when compiling the .proto file.

Conclusion

That's the detailed introduction to Protobuf; a deeper analysis based on the source code is not developed here. Corrections and suggestions are welcome. Finally, thank you very much for your interest in this blog!

References

https://developers.google.com/protocol-buffers/docs/overview
http://www.blogjava.net/DLevin/archive/2015/04/01/424011.html