What is Protobuf

Protobuf, short for Protocol Buffers, is a data description language developed by Google and released to the public in 2008. It describes a lightweight, efficient format for storing structured data, and it is used to serialize and deserialize that data. The design suits it well as a payload format for network communication, for data storage, and as an RPC data exchange format. Serialized messages are small, and each field is stored as a key-value pair (field number and value), which keeps messages compatible across versions. The format is language-independent, platform-independent, and extensible. Protobuf also comes with tools that generate code for serializing and parsing structured data.

The basic unit of data in Protobuf is the message, which is similar to a struct in Go. A message can contain fields of scalar types as well as other, nested messages.

This tutorial describes how to structure your protocol buffer data using the protocol buffer language, including the syntax of .proto files and how to generate data access classes from them. It uses the proto3 version of the language.

Defining a message

Let’s start with a simple example. Say you want to define a message for a search request. Each search request contains a query string, the page number of the results you want, and the number of results per page. The definition in the .proto file is as follows:

syntax = "proto3";

message SearchRequest {
  string query = 1;
  int32 page_number = 2;
  int32 result_per_page = 3;
}

  • The first line of the .proto file specifies that it uses proto3 syntax. If this line is omitted, the protocol buffer compiler assumes proto2. It must be the first non-empty, non-comment line in the file.
  • The SearchRequest definition specifies three fields (name/value pairs), each with a name and a type.
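
To make the three fields more concrete, here is a minimal Python sketch that hand-encodes a SearchRequest into its binary form. It is not the code that the protocol buffer compiler generates; the function names are hypothetical, and only non-negative integers are handled for brevity. Each field is written as a key (field number and wire type) followed by its value, and fields holding their default value are skipped.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_tag(field_number: int, wire_type: int) -> bytes:
    """A field key packs the field number and the wire type into one varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_search_request(query: str, page_number: int, result_per_page: int) -> bytes:
    """Hand-encode SearchRequest (illustrative; non-negative integers only)."""
    out = bytearray()
    if query:  # field 1, wire type 2 (length-delimited)
        data = query.encode("utf-8")
        out += encode_tag(1, 2) + encode_varint(len(data)) + data
    if page_number:  # field 2, wire type 0 (varint)
        out += encode_tag(2, 0) + encode_varint(page_number)
    if result_per_page:  # field 3, wire type 0 (varint)
        out += encode_tag(3, 0) + encode_varint(result_per_page)
    return bytes(out)


print(encode_search_request("hi", 1, 10).hex())  # 0a0268691001180a
```

Note that the field names never appear in the output: only the field numbers 1, 2, and 3 do.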

Specifying the field type

In the example above, all the fields are scalar types: two integers (page_number and result_per_page) and a string (query). However, you can also specify composite types for fields, including enumerations and other message types.

Specifying field numbers

Each field in a message definition has a unique number. These numbers identify your fields in the binary message format and should not be changed once your message type is in use. Note that field numbers 1 through 15 take one byte to encode (the byte holds both the field number and the field’s wire type), while field numbers 16 through 2047 take two bytes. You should therefore reserve the numbers 1 through 15 for the most frequently occurring fields.

The smallest field number you can specify is 1, and the largest is 2^29 - 1 (536,870,911). The numbers 19000 through 19999 are reserved for the protocol buffers implementation and cannot be used in your definitions. You also cannot reuse any field number that is already used or reserved in the current message definition.
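
The one-byte and two-byte boundaries follow from how a field key is encoded: the field number is shifted left by three bits to make room for the wire type, and the result is written as a varint with seven payload bits per byte. The sketch below checks this arithmetic; tag_size is a hypothetical helper name, not part of any Protobuf library.

```python
def tag_size(field_number: int, wire_type: int = 0) -> int:
    """Bytes needed to encode a field key, i.e. (field_number << 3) | wire_type as a varint."""
    if not 1 <= field_number <= 2**29 - 1:
        raise ValueError("field number out of range")
    if 19000 <= field_number <= 19999:
        raise ValueError("19000 through 19999 are reserved for the implementation")
    key = (field_number << 3) | wire_type
    # A varint carries 7 payload bits per byte, so round the bit count up.
    return (key.bit_length() + 6) // 7


for number in (1, 15, 16, 2047, 2048):
    print(number, tag_size(number))
# 1 1
# 15 1
# 16 2
# 2047 2
# 2048 3
```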

Rules for defining fields

Message fields follow one of these rules:

  • Singular: a well-formed encoded message can have zero or one of this field, but not more than one. This is the default field rule in proto3 syntax. In the example above, all three fields are singular, so an encoded SearchRequest contains the query field either once or not at all.
  • Repeated: the field can be repeated any number of times (including zero) in a well-formed message, and the order of the repeated values is preserved. This is effectively an array-typed field.

Add more message types

Multiple message types can be defined in a single .proto file, which is useful when you are defining several related messages. For example, to define the reply format that corresponds to SearchRequest, you could add a SearchResponse message to the same .proto file:

message SearchRequest {
  string query = 1;
  int32 page_number = 2;
  int32 result_per_page = 3;
}

message SearchResponse {
 ...
}

Add comments

Comments in .proto files use C/C++-style syntax: // for single-line comments and /* ... */ for block comments.

/* SearchRequest represents a search query, with pagination options to
 * indicate which results to include in the response. */

message SearchRequest {
  string query = 1;
  int32 page_number = 2;  // Which page number do we want?
  int32 result_per_page = 3;  // Number of results to return per page.
}

Reserved fields

If you update a message type by deleting a field or commenting it out, future developers can reuse its field number when making their own changes to the type. This can cause severe problems, such as data corruption or privacy bugs, if they later load old versions of the same .proto file. One way to prevent this is to mark the field numbers (and names) of deleted fields as reserved. The protocol buffer compiler then reports an error if anyone tries to use those identifiers in the future.

message Foo {
  reserved 2, 15, 9 to 11;
  reserved "foo", "bar";
}

What is generated from your .proto file

When you run the protocol buffer compiler on a .proto file, it generates code in your chosen language for the message types described in the file. The generated code includes getters and setters for field values, methods that serialize messages to an output stream, and methods that parse messages from an input stream.

  • For C++, the compiler generates a .h and .cc file from each .proto, with a class for each message type described in your file.
  • For Java, the compiler generates a .java file with a class for each message type, as well as special Builder classes for creating message class instances.
  • Python is a little different: the Python compiler generates a module with a static descriptor of each message type in your .proto, which is then used with a metaclass to create the necessary Python data access class at runtime.
  • For Go, the compiler generates a .pb.go file with a type for each message type in your file.
  • For Ruby, the compiler generates a .rb file with a Ruby module containing your message types.
  • For Objective-C, the compiler generates a pbobjc.h and pbobjc.m file from each .proto, with a class for each message type described in your file.
  • For C#, the compiler generates a .cs file from each .proto, with a class for each message type described in your file.
  • For Dart, the compiler generates a .pb.dart file with a class for each message type in your file.

Scalar value types

.proto Type | Notes | C++ Type | Java Type | Python Type[2] | Go Type | Ruby Type | C# Type | PHP Type | Dart Type
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
double | | double | double | float | float64 | Float | double | float | double
float | | float | float | float | float32 | Float | float | float | double
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers; if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 | Fixnum or Bignum (as required) | int | integer | int
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers; if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long[3] | int64 | Bignum | long | integer/string[5] | Int64
uint32 | Uses variable-length encoding. | uint32 | int | int/long | uint32 | Fixnum or Bignum (as required) | uint | integer | int
uint64 | Uses variable-length encoding. | uint64 | long | int/long | uint64 | Bignum | ulong | integer/string[5] | Int64
sint32 | Uses variable-length encoding. Signed int value. Encodes negative numbers more efficiently than a regular int32. | int32 | int | int | int32 | Fixnum or Bignum (as required) | int | integer | int
sint64 | Uses variable-length encoding. Signed int value. Encodes negative numbers more efficiently than a regular int64. | int64 | long | int/long | int64 | Bignum | long | integer/string[5] | Int64
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int/long | uint32 | Fixnum or Bignum (as required) | uint | integer | int
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long[3] | uint64 | Bignum | ulong | integer/string[5] | Int64
sfixed32 | Always four bytes. | int32 | int | int | int32 | Fixnum or Bignum (as required) | int | integer | int
sfixed64 | Always eight bytes. | int64 | long | int/long | int64 | Bignum | long | integer/string[5] | Int64
bool | | bool | boolean | bool | bool | TrueClass/FalseClass | bool | boolean | bool
string | A string must always contain UTF-8 encoded or 7-bit ASCII text, and cannot be longer than 2^32. | string | String | str/unicode | string | String (UTF-8) | string | string | String
bytes | May contain any arbitrary sequence of bytes no longer than 2^32. | string | ByteString | str | []byte | String (ASCII-8BIT) | ByteString | string | List
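
The notes about negative numbers and fixed-width types can be checked with a few lines of arithmetic. The following is a rough Python sketch, not library code: encode_varint and zigzag32 are hypothetical helper names. A negative int32 is written as a 64-bit two's-complement varint, which always takes ten bytes, while sint32 first applies ZigZag mapping so that values of small magnitude stay small.

```python
def encode_varint(value: int) -> bytes:
    """Base-128 varint of an unsigned 64-bit value."""
    value &= 0xFFFFFFFFFFFFFFFF  # negatives become 64-bit two's complement
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)
    return bytes(out)


def zigzag32(n: int) -> int:
    """Map signed 32-bit ints to unsigned ones: 0, -1, 1, -2 become 0, 1, 2, 3."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


print(len(encode_varint(-1)))            # 10 (int32 field holding -1)
print(len(encode_varint(zigzag32(-1))))  # 1  (sint32 field holding -1)
print(len(encode_varint(2**28)))         # 5  (uint32; a fixed32 always takes 4)
```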

Default values

When a message is parsed and the encoded message does not contain a particular singular field, the corresponding field in the parsed object is set to that field’s default value. Defaults are type-specific:

  • For strings, the default is the empty string.
  • For bytes, the default is empty bytes.
  • For bools, the default is false.
  • For numeric types, the default is zero.
  • For enumerations, the default is the first defined enum value, which must be 0.
  • For message fields, the field is not set. Its exact value is language-dependent. See the Code Generation Guide for more information.
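
As an illustration of how defaults come about, here is a minimal hand-written decoder for the SearchRequest message from the first example. It is a sketch only, with hypothetical function names, and it does not handle negative integers: the parsed object starts out holding the defaults, and any field missing from the encoded bytes simply keeps them.

```python
def read_varint(buf: bytes, pos: int):
    """Return (value, next_position) for the varint that starts at pos."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_search_request(buf: bytes) -> dict:
    # Start from the proto3 defaults; fields absent from buf keep them.
    msg = {"query": "", "page_number": 0, "result_per_page": 0}
    int_fields = {2: "page_number", 3: "result_per_page"}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x7
        if field_number == 1 and wire_type == 2:
            length, pos = read_varint(buf, pos)
            msg["query"] = buf[pos:pos + length].decode("utf-8")
            pos += length
        elif field_number in int_fields and wire_type == 0:
            msg[int_fields[field_number]], pos = read_varint(buf, pos)
        else:
            raise ValueError(f"unexpected field {field_number}")
    return msg


# Only the query field is present in these bytes.
print(decode_search_request(bytes.fromhex("0a026869")))
# {'query': 'hi', 'page_number': 0, 'result_per_page': 0}
```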

Enumerations

When defining a message type, you might want one of its fields to take only one of a predefined list of values. For example, suppose you want to add a corpus field to each SearchRequest, where the corpus can be UNIVERSAL, WEB, IMAGES, LOCAL, NEWS, PRODUCTS, or VIDEO. You can do this by adding an enum to your message definition with a constant for each possible value.

In the following example, we add an enumeration type named Corpus, and a field of the Corpus type:

message SearchRequest {
  string query = 1;
  int32 page_number = 2;
  int32 result_per_page = 3;
  enum Corpus {
    UNIVERSAL = 0;
    WEB = 1;
    IMAGES = 2;
    LOCAL = 3;
    NEWS = 4;
    PRODUCTS = 5;
    VIDEO = 6;
  }
  Corpus corpus = 4;
}

As you can see, the first constant of the Corpus enum maps to zero. Every enum definition must contain a constant that maps to zero as its first element, because:

  • There must be a zero value so that 0 can be used as the numeric default value.
  • The zero value needs to be the first element for compatibility with proto2 semantics, where the first enum value is always the default.

Using other message types

You can use other message types as field types. For example, suppose you want to include Result messages in each SearchResponse message. To do this, define a Result message type in the same .proto file and then declare a field of type Result in SearchResponse:

message SearchResponse {
  repeated Result results = 1;
}

message Result {
  string url = 1;
  string title = 2;
  repeated string snippets = 3;
}

Importing definitions

In the example above, the Result message type is defined in the same file as SearchResponse. What if the message type you want to use as a field type is already defined in another .proto file?

You can use definitions from other .proto files by importing them. To import another .proto file’s definitions, add an import statement at the top of your file:

import "myproject/other_protos.proto";

By default, you can use definitions only from directly imported .proto files. However, sometimes you may need to move a .proto file to a new location. Instead of moving the file and updating all the call sites in a single change, you can put a placeholder .proto file in the old location that forwards all the imports to the new location using import public. Any code that imports the proto containing the import public statement can transitively rely on the import public dependencies. For example:

// new.proto
// All definitions are moved here

// old.proto
// This is the proto that all clients are importing.
import public "new.proto";
import "other.proto";

// client.proto
import "old.proto";
// You use definitions from old.proto and new.proto, but not other.proto

The protocol compiler searches for imported files in the directories specified on the command line with the -I/--proto_path flag. If no flag is given, it looks in the directory in which the compiler was invoked. In general, you should set --proto_path to the root of your project and use fully qualified names for all imports.

Using proto2 message types

You can import proto2 message types and use them in your proto3 messages, and vice versa. However, proto2 enums cannot be used directly in proto3 syntax.

Nested message types

You can define and use message types inside other message types. In the following example, the Result message is defined inside the SearchResponse message:

message SearchResponse {
  message Result {
    string url = 1;
    string title = 2;
    repeated string snippets = 3;
  }
  repeated Result results = 1;
}

If you want to reuse a nested message type outside its parent message type, refer to it as Parent.Type:

message SomeOtherMessage {
  SearchResponse.Result result = 1;
}

You can nest messages as deeply as you like:

message Outer {                  // Level 0
  message MiddleAA {  // Level 1
    message Inner {   // Level 2
      int64 ival = 1;
      bool  booly = 2;
    }
  }
  message MiddleBB {  // Level 1
    message Inner {   // Level 2
      int32 ival = 1;
      bool  booly = 2;
    }
  }
}

Updating a message type

If an existing message type no longer meets all your needs (for example, you want the message format to have an extra field) but you still want to use code created with the old format, don’t worry! It is simple to update message types without breaking any existing code, as long as you remember the following rules:

  • Do not change the field numbers of any existing fields.
  • If you add new fields, messages serialized by code using the old message format can still be parsed by your newly generated code. Keep in mind the default values for these fields so that new code can properly interact with messages generated by old code. Similarly, messages created by your new code can be parsed by your old code: old binaries simply ignore the new fields when parsing. See the Unknown fields section below for details.
  • Fields can be removed, as long as the field number is not used again in your updated message type. You may want to rename the field instead, perhaps adding the prefix OBSOLETE_, or make the field number reserved, so that future users of your .proto cannot accidentally reuse the number.

Unknown fields

Unknown fields are well-formed protocol buffer serialized data representing fields that the parser does not recognize. For example, when an old binary parses data sent by a new binary with new fields, those new fields become unknown fields in the old binary.

Originally, proto3 messages always discarded unknown fields during parsing, but version 3.5 reintroduced the preservation of unknown fields to match proto2 behavior. In versions 3.5 and later, unknown fields are retained during parsing and included in the serialized output.
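
A rough way to picture this behavior is a parser that walks the encoded records, keeps the raw bytes of every field number it does not know, and appends them again when it re-serializes. The Python sketch below makes that assumption explicit; split_fields is a hypothetical helper and not how any particular Protobuf runtime is implemented.

```python
def read_varint(buf: bytes, pos: int):
    """Return (value, next_position) for the varint that starts at pos."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def split_fields(buf: bytes, known: set):
    """Split an encoded message into raw records of known and unknown fields."""
    known_parts, unknown_parts = [], []
    pos = 0
    while pos < len(buf):
        start = pos
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x7
        if wire_type == 0:    # varint
            _, pos = read_varint(buf, pos)
        elif wire_type == 1:  # 64-bit
            pos += 8
        elif wire_type == 2:  # length-delimited
            length, pos = read_varint(buf, pos)
            pos += length
        elif wire_type == 5:  # 32-bit
            pos += 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        target = known_parts if field_number in known else unknown_parts
        target.append(buf[start:pos])
    return known_parts, unknown_parts


# A newer sender added field 4; this "old binary" only knows fields 1 to 3.
data = bytes.fromhex("0a0268692001")
known, unknown = split_fields(data, {1, 2, 3})
print(unknown)  # [b' \x01']
# Re-serializing keeps the unknown record instead of dropping it.
print((b"".join(known) + b"".join(unknown)) == data)  # True
```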

Maps

If you want to create an associative map as part of your data definition, protocol buffers provides a handy shortcut syntax:

map<key_type, value_type> map_field = N;

Here key_type can be any integral or string type (that is, any scalar type except floating-point types and bytes). Note that an enum is not a valid key_type. The value_type can be any type except another map, so nested maps are not allowed.

For example, if you wanted to create a map called projects, with each Project message associated with a string key, you could define it as follows:

map<string, Project> projects = 3;

  • Map fields cannot be repeated.
  • Map entries are unordered, so you cannot rely on your map items being in a particular order.
  • When generating text format for a .proto, maps are sorted by key. Numeric keys are sorted numerically.
  • When parsing from the wire or when merging, if there are duplicate map keys, the last key seen is used. When parsing a map from text format, parsing may fail if there are duplicate keys.
  • If you provide a key but no value for a map field, the behavior when the field is serialized is language-dependent. In C++, Java, and Python the default value for the type is serialized, while in other languages nothing is serialized.
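
The duplicate-key and ordering rules above can be mimicked with an ordinary dictionary. This is only an analogy written in Python; the function names and the sample entries are made up for illustration.

```python
def merge_map_entries(entries):
    """Fold wire-order (key, value) pairs into a dict; the last duplicate wins."""
    merged = {}
    for key, value in entries:
        merged[key] = value
    return merged


def text_format_order(mapping):
    """Return the items sorted by key, the order used for text-format output."""
    return sorted(mapping.items())


wire_entries = [("beta", 2), ("alpha", 1), ("beta", 3)]  # "beta" appears twice
projects = merge_map_entries(wire_entries)
print(projects["beta"])             # 3
print(text_format_order(projects))  # [('alpha', 1), ('beta', 3)]
```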

Packages

You can add an optional package specifier to a .proto file to prevent name clashes between protocol message types.

package foo.bar;
message Open { ... }

You can then use the package specifier when defining fields of your message type:

message Foo {
  ...
  foo.bar.Open open = 1;
  ...
}

The way a package specifier affects the generated code depends on your chosen language.

Defining services

If you want to use your message types with an RPC (Remote Procedure Call) system, you can define an RPC service interface in a .proto file, and the protocol buffer compiler will generate service interface code and stubs in your chosen language. For example, if you want to define an RPC service with a method that takes a SearchRequest and returns a SearchResponse, you can define it in your .proto file as follows:

service SearchService {
  rpc Search (SearchRequest) returns (SearchResponse);
}

The most straightforward RPC system to use with protocol buffers is gRPC, a language- and platform-neutral open source RPC system developed at Google. gRPC works particularly well with protocol buffers and lets you generate the relevant RPC code directly from your .proto files using a special protocol buffer compiler plugin.

If you do not want to use gRPC, you can also use protocol buffers with your own RPC implementation. You can find out more about this in the Proto2 Language Guide.

JSON mapping

Proto3 supports a canonical encoding in JSON, making it easier to share data between systems. The encoding is described type by type in the table below.

If a value is missing from the JSON-encoded data, or if its value is null, it is interpreted as the appropriate default value when parsed into a protocol buffer. If a field has the default value in the protocol buffer, it is omitted from the JSON-encoded data by default to save space. An implementation may provide an option to emit fields with default values in the JSON-encoded output.

proto3 | JSON | JSON example | Notes
--- | --- | --- | ---
message | object | {"fooBar": v, "g": null, ...} | Generates a JSON object. Message field names are mapped to lowerCamelCase and become JSON object keys. If the json_name field option is specified, the specified value is used as the key instead. Parsers accept both the lowerCamelCase name (or the one specified by the json_name option) and the original proto field name. null is an accepted value for all field types and is treated as the default value of the corresponding field type.
enum | string | "FOO_BAR" | The name of the enum value as specified in proto is used. Parsers accept both enum names and integer values.
map<K,V> | object | {"k": v, ...} | All keys are converted to strings.
repeated V | array | [v, ...] | null is accepted as the empty list [].
bool | true, false | true, false |
string | string | "Hello World!" |
bytes | base64 string | "YWJjMTIzIT8kKiYoKSctPUB+" | The JSON value is the data encoded as a string using standard Base64 encoding with padding. Either standard or URL-safe Base64 encoding, with or without padding, is accepted.
int32, fixed32, uint32 | number | 1, -10, 0 | The JSON value is a decimal number. Either numbers or strings are accepted.
int64, fixed64, uint64 | string | "1", "-10" | The JSON value is a decimal string. Either numbers or strings are accepted.
float, double | number | 1.1, -10.0, 0, "NaN", "Infinity" | The JSON value is a number or one of the special string values "NaN", "Infinity", and "-Infinity". Either numbers or strings are accepted. Exponent notation is also accepted.
Any | object | {"@type": "url", "f": v, ...} | If the Any contains a value that has a special JSON mapping, it is converted as follows: {"@type": xxx, "value": yyy}. Otherwise, the value is converted into a JSON object, and the "@type" field is inserted to indicate the actual data type.
Timestamp | string | "1972-01-01T10:00:20.021Z" | Uses RFC 3339, where generated output is always Z-normalized and uses 0, 3, 6, or 9 fractional digits. Offsets other than "Z" are also accepted.
Duration | string | "1.000340012s", "1s" | Generated output always contains 0, 3, 6, or 9 fractional digits, depending on the required precision, followed by the suffix "s". Any number of fractional digits (also none) is accepted as long as it fits into nanosecond precision, and the suffix "s" is required.
Struct | object | { ... } | Any JSON object. See struct.proto.
Wrapper types | various types | 2, "2", "foo", true, "true", null, 0, ... | Wrappers use the same representation in JSON as the wrapped primitive type, except that null is allowed and preserved during data conversion and transfer.
FieldMask | string | "f.fooBar,h" | See field_mask.proto.
ListValue | array | [foo, bar, ...] |
Value | value | | Any JSON value
NullValue | null | | JSON null
Empty | object | {} | An empty JSON object
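
A few of the rows above can be tried out with a short Python sketch. It is a simplified illustration rather than a conforming implementation: to_lower_camel only handles plain snake_case names, to_json_value covers just the 64-bit integer and bytes rules, and the fields total_hits and raw_token are hypothetical.

```python
import base64
import json


def to_lower_camel(name: str) -> str:
    """Convert a snake_case proto field name to a lowerCamelCase JSON key."""
    head, *rest = name.split("_")
    return head + "".join(part[:1].upper() + part[1:] for part in rest)


def to_json_value(proto_type: str, value):
    """Apply the table's rules for 64-bit integers and bytes; pass other values through."""
    if proto_type in ("int64", "uint64", "fixed64"):
        return str(value)  # 64-bit integers become decimal strings
    if proto_type == "bytes":
        return base64.b64encode(value).decode("ascii")  # standard Base64 with padding
    return value


fields = [
    ("page_number", "int32", 2),
    ("total_hits", "int64", 12345678901),           # hypothetical field
    ("raw_token", "bytes", b"abc123!?$*&()'-=@~"),  # hypothetical field
]
doc = {to_lower_camel(name): to_json_value(kind, value) for name, kind, value in fields}
print(json.dumps(doc))
# {"pageNumber": 2, "totalHits": "12345678901", "rawToken": "YWJjMTIzIT8kKiYoKSctPUB+"}
```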

Generating your classes

To generate the Java, Python, C++, Go, Ruby, Objective-C, or C# code you need to work with the message types defined in a .proto file, run the protocol buffer compiler protoc on the .proto file. If you have not installed the compiler, download the package and follow the instructions in the README. For Go, you also need to install a special code generator plugin for the compiler; you can find it, along with installation instructions, in the golang/protobuf repository on GitHub.

The protocol compiler is invoked as follows:

protoc --proto_path=IMPORT_PATH --cpp_out=DST_DIR --java_out=DST_DIR --python_out=DST_DIR --go_out=DST_DIR --ruby_out=DST_DIR --objc_out=DST_DIR --csharp_out=DST_DIR path/to/file.proto
  • IMPORT_PATH specifies a directory in which to look for .proto files when resolving import directives. If omitted, the current directory is used. Multiple import directories can be specified by passing the --proto_path option multiple times; they are searched in order. -I=IMPORT_PATH can be used as a short form of --proto_path.
  • You can provide one or more output commands:
    • --cpp_out generates C++ code in DST_DIR. See the C++ generated code reference for more.
    • --java_out generates Java code in DST_DIR. See the Java generated code reference for more.
    • --python_out generates Python code in DST_DIR. See the Python generated code reference for more.
    • --go_out generates Go code in DST_DIR. See the Go generated code reference for more.
    • --ruby_out generates Ruby code in DST_DIR. Ruby generated code reference is coming soon!
    • --objc_out generates Objective-C code in DST_DIR. See the Objective-C generated code reference for more.
    • --csharp_out generates C# code in DST_DIR. See the C# generated code reference for more.
    • --php_out generates PHP code in DST_DIR. See the PHP generated code reference for more.
  • You must provide one or more .proto files as input. Multiple .proto files can be specified at once. Although the files are named relative to the current directory, each file must reside in one of the IMPORT_PATHs so that the compiler can determine its canonical name.
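
If you drive the compiler from a build script, the invocation above can be assembled programmatically. The Python sketch below only builds the argument list and prints it; build_protoc_command is a hypothetical helper, and the directory and file names (protos, gen, search.proto) are assumptions made for this example.

```python
import shlex


def build_protoc_command(proto_files, import_paths=(".",), outputs=None):
    """Assemble a protoc argument list; nothing is executed here."""
    outputs = outputs or {"python": "gen"}
    argv = ["protoc"]
    argv += [f"--proto_path={path}" for path in import_paths]         # searched in order
    argv += [f"--{lang}_out={dst}" for lang, dst in outputs.items()]  # one flag per language
    argv += list(proto_files)
    return argv


cmd = build_protoc_command(
    ["protos/search.proto"],
    import_paths=["protos"],
    outputs={"python": "gen", "go": "gen"},
)
print(shlex.join(cmd))
# protoc --proto_path=protos --python_out=gen --go_out=gen protos/search.proto
```

With protoc installed and the output directory created, the list could then be passed to subprocess.run(cmd, check=True).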