1. What are Protocol Buffers?

Protocol Buffers is a language-neutral, platform-independent, extensible format for serializing data that can be used for communication protocols, data storage, and more.

Protocol buffers are flexible and efficient at serializing data: smaller, faster, and simpler than XML. Once you have defined the structure of the data to be processed, you can use the Protocol Buffers code generation tool to produce the relevant code, and you can even update the data structure without redeploying the program. Describe your data structure once with Protobuf, and you can easily read and write your structured data across different languages and different data streams.

In short, protocol buffers are a good format for data storage or RPC data exchange: a language-neutral, platform-neutral, extensible format for serializing structured data.

2. Why were Protocol Buffers invented?

You might think that Google invented Protocol Buffers to address serialization speed, but that’s not the real reason.

Protocol buffers were first used at Google to handle the request/response protocol of its indexing servers. Before protocol buffers, Google already had a request/response format with hand-written marshalling and unmarshalling code. It even supported multiple versions of the protocol, but the code was ugly:

 if (version == 3) {
   ...
 } else if (version > 4) {
   if (version == 5) {
     ...
   }
   ...
 }

Formatting protocols this explicitly also makes introducing new protocols complicated: developers must ensure that every server between the request originator and the server actually handling the request understands the new protocol before they can switch it on.

This is the same issue every server developer has encountered: compatibility between low and high versions, and between old and new protocols.

Protocol Buffers were born to solve exactly these problems. They were designed around two properties:

  • New fields can be easily introduced, and intermediate servers that don’t need to examine the data can simply parse and pass the data without knowing all the fields.
  • Data formats are more self-descriptive and can be processed in a variety of languages (C++, Java, etc.)

That early version of Protocol Buffers still required hand-written parsing code.

However, as the system evolved, Protocol Buffers gained more capabilities:

  • Automatically generated serialization and deserialization code eliminates the need for manual parsing. (Automatic code generation tools are officially available for all language platforms)
  • In addition to being used for RPC (remote procedure call) requests, Protocol buffers are beginning to be used as a convenient self-describing format for persistent storage of data (for example, in Bigtable).
  • The server’s RPC interfaces can be declared as part of the protocol, then base classes can be generated using the Protocol Compiler, and users can override them using the actual implementation of the server interface.

Protocol Buffers are now Google’s common language for data. At the time of this writing, 48,162 different message types are defined in the Google code tree, across 12,183 .proto files. They are used both for RPC systems and for persistent storage of data in various storage systems.

Summary:

Protocol Buffers were born to solve server-side compatibility between old and new protocol versions (high and low versions), hence the affectionate name “protocol buffers”. Only later did they develop into a general format for transmitting data.

3. Why the name “Protocol Buffers”?

The name originates from the early days of the format, before we had the protocol buffer compiler to generate classes for us. At the time, there was a class called ProtocolBuffer which actually acted as a buffer for an individual method. Users would add tag/value pairs to this buffer individually by calling methods like AddValue(tag, value). The raw bytes were stored in a buffer which could then be written out once the message had been constructed.

Since that time, the “buffers” part of the name has lost its meaning, but it is still the name we use. Today, people usually use the term “protocol message” to refer to a message in an abstract sense, “protocol buffer” to refer to a serialized copy of a message, and “protocol message object” to refer to an in-memory object representing the parsed message.


4. Defining messages in proto3

The latest version of protocol buffers is proto3, which has some differences from the old version, Proto2. The two versions of the API are not fully compatible.

The names proto2 and proto3 may seem a bit confusing: when Protocol Buffers was first open-sourced, it was already Google’s second internal version, called proto2, which is why the open-source version numbers started at v2. The initial version, proto1, had been in development at Google since early 2001.

In a .proto file, every piece of structured data is called a message.

message helloworld 
{ 
   required int32     id = 1;   // ID 
   required string    str = 2;  // str 
   optional int32     opt = 3;  // optional field 
}

The lines above define a message helloworld with three members: id of type int32, str of type string, and opt, an optional member that may be omitted from the message. (required and optional are proto2 field rules; proto3 drops them.)

Here are a few points to note in Proto3.

syntax = "proto3";

message SearchRequest {
  string query = 1;
  int32 page_number = 2;
  int32 result_per_page = 3;
}

If syntax = "proto3"; is not declared on the first line, the file is parsed as proto2 by default.

1. Assign field numbers

Each field in each message definition has a unique number. These field numbers identify the fields in the binary format of the message and should not be changed once the message type is in use. Note that field numbers in the range 1 through 15 take one byte to encode, including both the field number and the field type (see the Protocol Buffer encoding section for details). Field numbers in the range 16 through 2047 take two bytes. So you should keep the numbers 1 through 15 for your most frequently occurring message elements, and remember to leave some room for frequent elements that may be added in the future.

The minimum field number is 1, and the maximum is 2^29 - 1, or 536,870,911. You also cannot use the numbers 19000 through 19999 (FieldDescriptor::kFirstReservedNumber through FieldDescriptor::kLastReservedNumber), because they are reserved for the Protocol Buffers implementation.

If you use one of these reserved numbers in a .proto file, the Protocol Buffers compiler will report an error.

Likewise, you cannot use any field numbers that were previously reserved. What reserved fields are is explained in the next section.

2. Reserved fields

If you update a message type by removing a field entirely, or commenting it out, future users can reuse its field number when making their own updates to the type. If an old version of the .proto file is loaded later, this can cause serious problems on the server, such as data corruption and privacy bugs. One way to make sure this doesn’t happen is to mark the deleted field numbers (and/or names, which can also cause problems for JSON serialization) as reserved. The Protocol Buffers compiler will report an error if any future user tries to use one of these identifiers.

message Foo {
  reserved 2, 15, 9 to 11;
  reserved "foo", "bar";
}

Note that you cannot mix field names and field numbers in the same reserved statement.

3. Default field rules

  • singular: a well-formed message can have zero or one of this field (but not more than one).
  • repeated: this field can be repeated any number of times (including zero) in a message, and the order of the repeated values is preserved.

In proto3, repeated fields of scalar numeric types use packed encoding by default (see the Protocol Buffer encoding section for the reasons).

4. Correspondence between scalar types of each language

The most common mappings are as follows (see the official documentation for the full table):

| .proto type | C++ | Java | Go |
| --- | --- | --- | --- |
| double | double | double | float64 |
| float | float | float | float32 |
| int32 | int32 | int | int32 |
| int64 | int64 | long | int64 |
| uint32 | uint32 | int | uint32 |
| uint64 | uint64 | long | uint64 |
| sint32 | int32 | int | int32 |
| sint64 | int64 | long | int64 |
| bool | bool | boolean | bool |
| string | string | String | string |
| bytes | string | ByteString | []byte |

5. Enumerations

Enumerated types can be embedded in message.

message SearchRequest {
  string query = 1;
  int32 page_number = 2;
  int32 result_per_page = 3;
  enum Corpus {
    UNIVERSAL = 0;
    WEB = 1;
    IMAGES = 2;
    LOCAL = 3;
    NEWS = 4;
    PRODUCTS = 5;
    VIDEO = 6;
  }
  Corpus corpus = 4;
}

The important rule for enumeration types is that they must have a value that maps to 0:

  • The 0 value serves as the default: a field that is never set reads back as 0.
  • For compatibility with proto2 semantics, where the first listed value is the default, the zero value must be the first entry.

In addition, enumeration values that are not recognized during deserialization are retained in the message, though how they are represented is language-dependent. In languages with open enum types that admit values outside the declared range, such as C++ and Go, an unknown enum value is simply stored as its underlying integer. In languages with closed enum types, such as Java, a special enum case stands in for the unrecognized value, and the underlying integer can be reached through special accessors.

In either case, if the message is serialized again, the unrecognized value is still serialized with it.

5.1 Reserved values in enumerations

If you update an enumeration type by removing an entry entirely, or commenting it out, future users can reuse its numeric value when making their own updates to the type. If an old version of the .proto file is loaded later, this can cause serious problems on the server, such as data corruption and privacy bugs. One way to make sure this doesn’t happen is to mark the numeric values (and/or names, which can also cause problems for JSON serialization) of deleted entries as reserved. The Protocol Buffers compiler will report an error if any future user tries to use one of these identifiers. You can use the max keyword to specify that a reserved range runs up to the maximum possible value.

enum Foo {
  reserved 2, 15, 9 to 11, 40 to max;
  reserved "FOO", "BAR";
}

Note that you cannot mix names and numeric values in the same reserved statement.

6. Nested messages

Protocol Buffers message definitions can be nested, allowing more complex messages to be composed.

message SearchResponse {
  repeated Result results = 1;
}

message Result {
  string url = 1;
  string title = 2;
  repeated string snippets = 3;
}

In the example above, Result is nested in SearchResponse.

More examples:

message SearchResponse {
  message Result {
    string url = 1;
    string title = 2;
    repeated string snippets = 3;
  }
  repeated Result results = 1;
}

message SomeOtherMessage {
  SearchResponse.Result result = 1;
}
message Outer {       // Level 0
  message MiddleAA {  // Level 1
    message Inner {   // Level 2
      int64 ival = 1;
      bool booly = 2;
    }
  }
  message MiddleBB {  // Level 1
    message Inner {   // Level 2
      int32 ival = 1;
      bool booly = 2;
    }
  }
}

7. Enumeration incompatibility

Proto2 message types can be imported and used in proto3 messages, and vice versa. However, proto2 enums cannot be used directly in proto3 syntax (though an imported proto2 message that uses them is fine).

8. Updating messages

If you later need to add a field to a message definition, Protocol Buffers let existing code keep working unchanged, as long as the following 10 rules are observed:

  1. Do not change the field number of any existing field.
  2. If you add new fields, messages serialized with the “old” format can still be parsed by the newly generated code. You should remember the default values of the new elements so that new code interacts correctly with messages generated by old code. Similarly, messages created by new code can be parsed by old code: old binaries simply ignore the new fields when parsing. (See “Unknown fields” for the reasons.)
  3. Fields can be removed as long as their field number is no longer used in the updated message type. You may prefer to rename the field instead, perhaps adding the prefix “OBSOLETE_”, or to mark the field number reserved, so that future users of the .proto cannot accidentally reuse it.
  4. int32, uint32, int64, uint64, and bool are all compatible. This means you can change a field from one of these types to another without breaking forward or backward compatibility. If a number parsed off the wire does not fit the corresponding type, the effect is the same as casting in C++ (for example, a 64-bit number read as int32 is truncated to 32 bits).
  5. sint32 and sint64 are compatible with each other, but not with the other integer types.
  6. string and bytes are compatible as long as the bytes are valid UTF-8.
  7. An embedded message is compatible with bytes if the bytes contain an encoded version of that message.
  8. fixed32 is compatible with sfixed32, and fixed64 with sfixed64.
  9. enum is compatible with int32, uint32, int64, and uint64 on the wire (note that values are truncated if they do not fit). However, client code may treat the message differently after deserialization: for example, an unrecognized proto3 enum value is kept in the message, but how it is represented is language-dependent (as mentioned above). Int fields always simply keep their value.
  10. Changing a single field into a member of a new oneof is safe and binary-compatible. Moving multiple fields into a new oneof may be safe if you are sure no code ever sets more than one of them at once. Moving any field into an existing oneof is not safe. (Note the difference between field and value: field is the field, value is the value.)

9. Unknown fields

Unknown fields are well-formed protocol buffer serialized data representing fields the parser does not recognize. For example, when an old binary parses data sent by a new binary with new fields, those new fields become unknown fields to the old binary.

Proto3 implementations can successfully parse messages with unknown fields; however, the implementation may or may not support the retention of these unknown fields. You should not rely on saving or deleting unknown fields. For most Google Protocol Buffers implementations, unknown fields cannot be accessed by the corresponding Proto runtime in Proto3 and are discarded and forgotten during deserialization. This is a different behavior from Proto2, where unknown fields are always saved and serialized with the message.

10. Map types

A repeated type can be used to represent arrays, and a Map type can be used to represent dictionaries.

map<key_type, value_type> map_field = N;

map<string, Project> projects = 3;

key_type can be any integral or string type (any scalar type from the table above except float, double, and bytes).

Enums are also not valid as keys.

value_type can be any type except another map.

Special attention should be paid to:

  • Map fields cannot be repeated.
  • Wire-format ordering and map iteration order are undefined, so you cannot rely on your map items being in a particular order.
  • When generating text format for a .proto, maps are sorted by key; numeric keys sort numerically.
  • When parsing from the wire, or when merging, if there are duplicate keys the last key seen wins (the override principle). When parsing a map from text format, parsing may fail if there are duplicate keys.

The Protocol Buffer wire format does not support maps directly; a map field is equivalent on the wire to a repeated field of entry messages:

message MapFieldEntry {
  key_type key = 1;
  value_type value = 2;
}

repeated MapFieldEntry map_field = N;

The notation above is exactly equivalent on the wire to the map field, so the effect of a map can be achieved with the repeated form.

11. JSON Mapping

Proto3 supports canonical encoding in JSON, making it easier to share data between systems. The encodings are described type by type in the following table.

If a value is missing or null in the JSON-encoded data, it is interpreted as the appropriate default value when parsed into a protocol buffer. If a field has its default value in the protocol buffer, it is omitted from the JSON-encoded data by default to save space. A concrete implementation of the JSON mapping can provide options to emit fields with default values in the JSON-encoded output.

The JSON implementation of Proto3 provides the following four options:

  • Send fields with Default values: By default, fields with default values are ignored in proto3 JSON output. An implementation can provide an option to override this behavior and output fields with their default values.
  • Ignore unknown fields: By default, the Proto3 JSON parser should reject unknown fields, but may provide an option to ignore unknown fields in the parse.
  • Use proto field name instead of lowerCamelCase name: By default, a Proto3 JSON printer converts the field name to lowerCamelCase and uses it as the JSON name. The implementation might provide an option to use the original field name as the JSON name. The Proto3 JSON parser needs to accept the converted lowerCamelCase name and the original field name.
  • Send enumerations as enumerations instead of strings: The names of enumerations are used by default in JSON output. You can provide an option to use the value of an enumerated value.

5. Defining services in proto3

If you want to use your message types with an RPC (remote procedure call) system, you can define an RPC service interface in the .proto file, and the Protocol Buffer compiler will generate service interface code and stubs in the language of your choice. For example, to define an RPC service with a method that takes a SearchRequest and returns a SearchResponse, you can define it in your .proto file as follows:

service SearchService {
  rpc Search (SearchRequest) returns (SearchResponse);
}

The most straightforward RPC system to use with Protocol Buffers is gRPC: a language- and platform-neutral open-source RPC system developed at Google. gRPC works particularly well with Protocol Buffers and lets you generate the RPC-related code directly from your .proto files using a special Protocol Buffer compiler plug-in.

If you do not want to use gRPC, you can also use Protocol buffers in your own RPC implementation. You can find more information about these in the Proto2 language guide.

There are also ongoing third-party projects to develop RPC implementations for Protocol Buffers.

6. Protocol Buffer naming conventions

Message names use CamelCase, starting with a capital letter. Field names are lowercase with underscore separators.

message SongServerRequest {
  required string song_name = 1;
}

Enumeration type names use CamelCase, starting with a capital letter. Enumerator values are all capitals with underscore separators.

enum Foo {
  FIRST_VALUE = 0;
  SECOND_VALUE = 1;
}

Each enumerated value is terminated with a semicolon, not a comma.

Service and method names are also CamelCase, starting with a capital letter.

service FooService {
  rpc GetSomething(FooRequest) returns (FooResponse);
}

7. Protocol Buffer encoding principles

Before we discuss the principle of Protocol Buffer encoding, we must first talk about Varints encoding.

Base 128 Varints encoding

Varint is a compact way to represent numbers. It uses one or more bytes to represent a number, and the smaller the number, the fewer bytes it uses. This reduces the number of bytes used to represent numbers.

Every byte in a varint except the last has its most significant bit (MSB) set, indicating that more bytes follow. The lower 7 bits of each byte store the number’s two’s-complement representation in groups of 7 bits, least significant group first.

If a number fits in a single byte, the MSB is set to 0. In the example below, 1 can be represented in one byte, so its MSB is 0.

0000 0001

If multiple bytes are needed, the MSB of every byte except the last is set to 1. For example, 300 in varint form:

1010 1100 0000 0010

Read as plain binary, these 16 bits would be 44034 (32768 + 8192 + 2048 + 1024 + 2), not 300.

So how is Varint encoded?

The following code encodes a 32-bit varint:

char* EncodeVarint32(char* dst, uint32_t v) {
  // Operate on characters as unsigneds
  unsigned char* ptr = reinterpret_cast<unsigned char*>(dst);
  static const int B = 128;
  if (v < (1<<7)) {
    *(ptr++) = v;
  } else if (v < (1<<14)) {
    *(ptr++) = v | B;
    *(ptr++) = v>>7;
  } else if (v < (1<<21)) {
    *(ptr++) = v | B;
    *(ptr++) = (v>>7) | B;
    *(ptr++) = v>>14;
  } else if (v < (1<<28)) {
    *(ptr++) = v | B;
    *(ptr++) = (v>>7) | B;
    *(ptr++) = (v>>14) | B;
    *(ptr++) = v>>21;
  } else {
    *(ptr++) = v | B;
    *(ptr++) = (v>>7) | B;
    *(ptr++) = (v>>14) | B;
    *(ptr++) = (v>>21) | B;
    *(ptr++) = v>>28;
  }
  return reinterpret_cast<char*>(ptr);
}
300 = 100101100

Since 300 is over seven bits (Varint only has seven bits to represent a number, MSB is the highest bit to indicate whether there are more bytes to follow), 300 requires two bytes to represent it.

The encoding of Varint, for example 300:

  1. Take the low 7 bits and set the MSB: 010 1100 | 1000 0000 = 1010 1100 (first byte)
  2. Shift right by 7: 100101100 >> 7 = 10 = 0000 0010 (second byte, MSB = 0)
  3. Final varint result: 1010 1100 0000 0010

Varint decoding is just the reverse of the encoding process:

  1. If there are multiple bytes, first drop the MSB of each byte (by ANDing with 0111 1111), leaving 7 bits per byte.
  2. Reverse the order of the groups: at most 5 bytes for 32 bits, so order 1-2-3-4-5 becomes 5-4-3-2-1. The bits within each group stay in place; only the relative positions of the groups change.

The decoding procedure calls GetVarint32Ptr or, if it is larger than one byte, GetVarint32PtrFallback.

inline const char* GetVarint32Ptr(const char* p,
                                  const char* limit,
                                  uint32_t* value) {
  if (p < limit) {
    uint32_t result = *(reinterpret_cast<const unsigned char*>(p));
    if ((result & 128) == 0) {
      *value = result;
      return p + 1;
    }
  }
  return GetVarint32PtrFallback(p, limit, value);
}

const char* GetVarint32PtrFallback(const char* p,
                                   const char* limit,
                                   uint32_t* value) {
  uint32_t result = 0;
  for (uint32_t shift = 0; shift <= 28 && p < limit; shift += 7) {
    uint32_t byte = *(reinterpret_cast<const unsigned char*>(p));
    p++;
    if (byte & 128) {
      // More bytes are present
      result |= ((byte & 127) << shift);
    } else {
      result |= (byte << shift);
      *value = result;
      return reinterpret_cast<const char*>(p);
    }
  }
  return NULL;
}

By now the varint process should be familiar. The code above handles 32-bit varints; the 64-bit algorithm is the same, but writing it with 10 branches instead of 5 would be ugly, so it uses a loop instead. (32-bit needs at most 5 bytes, 64-bit at most 10.)

64-bit Varint encoding implementation:

char* EncodeVarint64(char* dst, uint64_t v) {
  static const int B = 128;
  unsigned char* ptr = reinterpret_cast<unsigned char*>(dst);
  while (v >= B) {
    *(ptr++) = (v & (B - 1)) | B;
    v >>= 7;
  }
  *(ptr++) = static_cast<unsigned char>(v);
  return reinterpret_cast<char*>(ptr);
}

The principle is the same, but it’s solved by a cycle.

64-bit Varint decoding implementation:

const char* GetVarint64Ptr(const char* p, const char* limit, uint64_t* value) {
  uint64_t result = 0;
  for (uint32_t shift = 0; shift <= 63 && p < limit; shift += 7) {
    uint64_t byte = *(reinterpret_cast<const unsigned char*>(p));
    p++;
    if (byte & 128) {
      // More bytes are present
      result |= ((byte & 127) << shift);
    } else {
      result |= (byte << shift);
      *value = result;
      return reinterpret_cast<const char*>(p);
    }
  }
  return NULL;
}

Varint is a compact int. But wait: 300 takes 2 bytes in plain binary, and still takes 2 bytes as a varint. Why call it compact if it uses the same amount of space?!

Varint really is a compact way to represent numbers. It uses one or more bytes per number, and the smaller the number, the fewer bytes it uses. An int32 normally takes 4 bytes, but with varints a small int32 value can be represented in a single byte. Of course there is a trade-off: with varints, large numbers need 5 bytes. Statistically, though, not all numbers in a message are large, so in most cases varints represent numeric data in fewer bytes.

As an int32, 300 would have taken 4 bytes; as a varint it takes 2. Half the size!

1. Message Structure encoding

A message in a protocol buffer is a series of key-value pairs. The binary version of a message simply uses field numbers (the field’s number plus its wire type) as keys; the name and declared type of each field can only be determined on the decoding side by consulting the message type’s definition (the .proto file). This is why people often say protocol buffers are safer than JSON and XML: without the .proto file describing the data structure, the data cannot be interpreted.

Because the tag-value format is used, an optional field that is set appears in the message buffer, and one that is not set simply does not, which also reduces the size of the message.

When a message is encoded, the key and value are connected into a byte stream. When a message is decoded, the parser needs to be able to skip fields that it does not recognize. This way, new fields can be added to the message without breaking older programs that don’t know about them. This is known as “backward” compatibility.

For this reason, the “key” of each pair in a wire-format message is really two values: the field number from the .proto file, plus a wire type that provides just enough information to find the length of the following value. In most language implementations this key is called a tag.

Wire types are assigned as follows; note that 3 and 4 are deprecated, so the wire types in current use are 0, 1, 2, and 5:

| Wire type | Meaning | Used for |
| --- | --- | --- |
| 0 | Varint | int32, int64, uint32, uint64, sint32, sint64, bool, enum |
| 1 | 64-bit | fixed64, sfixed64, double |
| 2 | Length-delimited | string, bytes, embedded messages, packed repeated fields |
| 3 | Start group | groups (deprecated) |
| 4 | End group | groups (deprecated) |
| 5 | 32-bit | fixed32, sfixed32, float |

The key is computed as (field_number << 3) | wire_type; in other words, the last three bits of the key store the wire type.

For example, the field number for message starts with 1, so the corresponding tag might look something like this:

0000 1000

The last three bits give the type of the value: 000, i.e. 0, varint. Shifting right by 3 gives 0001, the field number: 1. So this tag says: field number 1, value type varint.

96 01 = 1001 0110  0000 0001
      → 001 0110  000 0001        (drop the MSB of each byte)
      → 000 0001  ++  001 0110    (reverse the groups of 7 bits)
      → 1001 0110
      → 128 + 16 + 4 + 2 = 150

So the bytes 96 01 decode to the value 150.

message Test1 {
  required int32 a = 1;
}

Given the message structure above, if a is set to 150, the encoded protocol buffer is the three bytes 08 96 01.

The tag contains the field number and the wire type; the wire type determines how to find the length of the value that follows. (See the strings section below for details.)

2. Signed Integers

Wire type 0 covers varints, which as shown above are unsigned. But what about signed numbers?

A negative number is usually represented as a very large integer, because the computer defines the sign bit as the highest bit. If you used ordinary varint encoding for a negative number, it would always be 10 bytes long. For this reason Protocol Buffers define the sint32/sint64 types, which use ZigZag encoding: all integers are mapped to unsigned integers and then varint-encoded, so integers with a small absolute value also get a small varint encoding.

The Zigzag mapping function is:

Zigzag(n) = (n << 1) ^ (n >> 31)    // n is sint32
Zigzag(n) = (n << 1) ^ (n >> 63)    // n is sint64

In this way, -1 is encoded as 1, 1 as 2, -2 as 3, and so on:

| Original signed value | Encoded as |
| --- | --- |
| 0 | 0 |
| -1 | 1 |
| 1 | 2 |
| -2 | 3 |
| 2147483647 | 4294967294 |
| -2147483648 | 4294967295 |

Note that the second operand of the XOR, (n >> 31), is an arithmetic shift, so the result of the shift is either all zeros (if n is positive) or all ones (if n is negative).

When sint32 or sint64 is parsed, its value is decoded back to the original signed version.

3. Non-varint Numbers

Non-varint numbers are simpler. double and fixed64 have wire type 1, which tells the parser to expect a fixed 64-bit block of data. Similarly, float and fixed32 have wire type 5, a fixed 32-bit block. In both cases the value is stored little-endian: low-order bytes first.

This is one reason a Protocol Buffer does not compress to the absolute limit: floating-point types like float and double are not compressed at all.

4. Strings

Wire type 2 is length-delimited encoding: key + length + content, where the key is encoded as usual, the length is a varint, and the content is the number of bytes the length specifies.

For example, suppose the following message format is defined:

message Test2 {
  optional string b = 2;
}

Set the value to “testing” in binary format:

12 07 74 65 73 74 69 6e 67

74 65 73 74 69 6e 67 is the UTF8 code for “testing”.

Here the key is shown in hexadecimal; expanding it:

12 -> 0001 0010
wire type   = last 3 bits = 010 = 2
field number = 0001 0010 >> 3 = 0000 0010 = 2

Length is 7, followed by 7 bytes, which is our string “testing”.

Therefore, wire type 2 data is encoded in T-L-V (tag-length-value) form.

5. Embedded Message

Suppose we define the following nested message:

message Test3 {
  optional Test1 c = 3;
}

Set the field to the integer 150 and the encoded bytes to:

1a 03 08 96 01

The bytes 08 96 01 encode 150, as shown earlier.

1a -> 0001 1010
wire type   = last 3 bits = 010 = 2
field number = 0001 1010 >> 3 = 0000 0011 = 3

Length is 3, indicating that there are three bytes following it, that is, 08, 96, 01.

Strings, bytes, embedded messages, and packed repeated fields are all encoded as T-L-V.

6. Optional and repeated field encoding

A field defined as repeated in proto2 (without the [packed=true] option) produces an encoded message with zero or more key-value pairs carrying the same tag number. These repeated values need not appear consecutively; they may be interleaved with other fields. Although they are unordered relative to other fields, their order relative to each other must be preserved when parsing. In proto3, repeated fields use packed encoding by default (see the packed repeated fields section for details).

For any non-repeating field in Proto3 or optional field in Proto2, the encoded message may or may not have a key-value pair containing the field number.

Typically, an encoded message has at most one instance of a required or optional field, but the parser must handle the case where it appears multiple times. For numeric and string types, if the same field occurs more than once, the parser takes the last value it sees. For embedded message fields, the parser merges the multiple instances, just as the MergeFrom method does: all singular scalar fields take the value from the later instance, singular embedded messages are merged, and repeated fields are concatenated. The consequence of these rules is that parsing the concatenation of two encoded messages produces exactly the same result as parsing the two messages separately and merging them. For example:

MyMessage message;
message.ParseFromString(str1 + str2);

Is equivalent to

MyMessage message, message2;
message.ParseFromString(str1);
message2.ParseFromString(str2);
message.MergeFrom(message2);

This method is sometimes very useful. For example, you can merge messages without knowing their type.
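The last-value-wins rule for scalar fields can be demonstrated with a toy decoder (an illustrative Python sketch that handles only varint fields, not the real library):

```python
def decode_varint(buf: bytes, pos: int):
    """Return (value, new_pos) for a varint starting at pos."""
    result, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7

def parse_scalar_fields(buf: bytes) -> dict:
    """Parse a stream of varint (wire type 0) fields; later values win."""
    fields, pos = {}, 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        assert wire_type == 0, "sketch handles only varint fields"
        value, pos = decode_varint(buf, pos)
        fields[field_number] = value  # later occurrence overwrites earlier
    return fields

# Two encodings of Test1: a = 150, then a = 1.
str1 = bytes.fromhex("089601")  # a = 150
str2 = bytes.fromhex("0801")    # a = 1
print(parse_scalar_fields(str1 + str2))  # {1: 1} -- the last value wins
```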

7. Packed Repeated Fields

Version 2.1.0 introduced packed repeated fields, which are declared like ordinary repeated fields but with [packed=true] appended; they behave the same but are encoded differently. In proto3, repeated fields of scalar numeric types use packed encoding by default. A packed repeated field with no elements does not appear in the encoded message at all. Otherwise, all of the elements of the field are packed into a single key-value pair with wire_type = 2, whose length is the total size of the packed payload. Each element is encoded the same way it normally would be, except that no tag precedes it. For example, take the following message type:

message Test4 {
  repeated int32 d = 4 [packed=true];
}

Suppose the repeated field d is assigned the values 3, 270, and 86942. The encoding is:

22 // tag 0010 0010 (field number 0100 = 4, wire type 010 = 2)

06 // Payload size (length = 6 bytes)
 
03 // first element (varint 3)
 
8E 02 // second element (varint 270)
 
9E A7 05 // third element (varint 86942)

That is: tag-length-value-value-value...

Only repeated fields of primitive numeric types (those that use the varint, 32-bit, or 64-bit wire types) can be declared "packed".

One thing to note: although there is usually no reason to encode a packed repeated field as more than one key-value pair, parsers must be prepared to accept multiple key-value pairs for it. In that case, the payloads are concatenated, and each pair must contain a whole number of elements.

Protocol buffer parsers must be able to parse fields that were compiled as packed as if they were not packed, and vice versa. This makes it possible to add [packed=true] to an existing field in a forward- and backward-compatible way.
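The packed encoding above can be reproduced with a short sketch (illustrative Python, not the official library): one tag, one length, then the varint payloads back to back.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def encode_packed_varints(field_number: int, values) -> bytes:
    """Pack all elements into a single length-delimited key-value pair."""
    payload = b"".join(encode_varint(v) for v in values)
    tag = (field_number << 3) | 2  # packed fields use wire type 2
    return encode_varint(tag) + encode_varint(len(payload)) + payload

# Test4.d = [3, 270, 86942], as in the example above:
encoded = encode_packed_varints(4, [3, 270, 86942])
print(encoded.hex())  # 2206038e029ea705
```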

8. Field Order

Encoding/decoding is independent of field order, which is guaranteed by the key-value mechanism.

If a message has unknown fields, the current Java and C++ implementations write them, in arbitrary order, after the known fields, which are sorted by field number. The current Python implementations do not track unknown fields.
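To see that field order does not matter, a toy decoder (an illustrative Python sketch handling only wire types 0 and 2) can parse the same two fields in either order and get the same result:

```python
def decode_varint(buf: bytes, pos: int):
    """Return (value, new_pos) for a varint starting at pos."""
    result, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7

def parse_message(buf: bytes) -> dict:
    """Tiny decoder for wire types 0 (varint) and 2 (length-delimited)."""
    fields, pos = {}, 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:
            fields[number], pos = decode_varint(buf, pos)
        elif wire_type == 2:
            length, pos = decode_varint(buf, pos)
            fields[number] = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError("sketch handles only wire types 0 and 2")
    return fields

# a = 150 (field 1) and b = "testing" (field 2), in either order:
order1 = bytes.fromhex("089601") + bytes.fromhex("120774657374696e67")
order2 = bytes.fromhex("120774657374696e67") + bytes.fromhex("089601")
print(parse_message(order1) == parse_message(order2))  # True
```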

Advantages and disadvantages of Protocol Buffers

Protocol Buffers have several advantages over XML in serialization:

  • Simpler
  • 3 to 10 times smaller
  • 20 to 100 times faster to deserialize
  • Automatically generated data access classes that are easier to program against

Here’s an example:

If you want to encode a user’s name and email information, use XML as follows:

  <person>
    <name>John Doe</name>
    <email>[email protected]</email>
  </person>


For the same requirement, if protocol buffers are used, the definition file is as follows:

# Textual representation of a protocol buffer.
# This is *not* the binary format used on the wire.
person {
  name: "John Doe"
  email: "[email protected]"
}

Protocol buffers are encoded and transmitted in binary, requiring at most 28 bytes of space and 100-200 ns to deserialize. The equivalent XML requires at least 69 bytes (after removing all whitespace) and 5,000-10,000 ns to deserialize.

That’s the performance advantage. Then there are the coding advantages.

Protocol Buffers come with a code generation tool that produces a friendly data access interface, which makes coding easier for developers. For example, reading the user's name and email in C++ is just a matter of calling the corresponding get methods (the get and set methods for every attribute are generated automatically; you only need to call them):

  cout << "Name: " << person.name() << endl;
  cout << "E-mail: " << person.email() << endl;

XML is a bit more cumbersome to read:

  cout << "Name: "
       << person.getElementsByTagName("name")->item(0)->innerText()
       << endl;
  cout << "E-mail: "
       << person.getElementsByTagName("email")->item(0)->innerText()
       << endl;

Protobuf semantics are cleaner, and nothing like an XML parser is needed, because the Protobuf compiler compiles .proto files into data access classes that handle serialization and deserialization.

With Protobuf there is no need to learn a complex document object model, Protobuf’s programming model is friendly, easy to learn, and it has good documentation and examples, making Protobuf more attractive than other technologies for people who like simple things.

A final nice feature of Protocol Buffers is that they are "backward" compatible: data structures can be upgraded without breaking deployed programs that rely on the "old" format. Your program does not need massive refactoring or migration when a message structure changes; adding a field to a message requires no change to already-published programs, because the encoding is an inherently unordered set of key-value pairs.

Of course protocol buffers are not perfect and have some limitations in their use.

Because Protobuf is not designed for describing text, it is a poor fit for modeling text-based markup documents such as HTML. Also, XML is to some degree self-describing and can be read and edited directly, whereas Protobuf is not: it is stored in binary, and you cannot read anything from the data directly without the .proto definition.

Finally

After reading this Protocol Buffer coding principle, you should understand the following points:

  1. Protocol Buffers use varint to compress integers, and the binary data is very compact; unset optional fields take no space at all, which shrinks messages further. PB messages are therefore smaller, and when used for network transmission the same data consumes less traffic. But the compression is not taken to the limit: float and double fields are not compressed.
  2. Protocol Buffers omit the {, }, :, and other delimiter symbols of JSON and XML, so they are smaller to begin with; with varint encoding on top, and gzip compression on top of that, they are smaller still.
  3. Protocol Buffers use T-V and T-L-V (Tag-Length-Value) encoding, reducing the need for delimiters and making the stored data more compact.
  4. Another core value of Protocol Buffers is the accompanying toolchain: a compiler that generates the get/set code automatically. It simplifies multi-language interoperability and makes encoding and decoding work productive.
  5. Protocol Buffers are not self-describing; the data description lives in separate .proto files, and without them the binary stream cannot be understood. This is an advantage (the data is "encrypted" to a degree) as well as a disadvantage (readability is very poor). So Protocol Buffers are a great fit for RPC calls and for passing data between internal services.
  6. Protocol Buffers are backward compatible: after the data structure is updated, old versions can still be parsed. This is exactly the problem Protocol Buffers were designed to solve. Parsers simply skip fields they do not recognize.

This is the end of the Protocol Buffer encoding part. The next part discusses why Protocol Buffer serialization and deserialization are so fast.


Reference:

thrift-protobuf-compare Benchmarking wiki (jvm-serializers)

GitHub Repo: Halfrost-Field


Source: halfrost.com/protobuf_en…