Protobuf is mainly known for small data encoding, mainly used for data interaction and data storage, reduce bandwidth, disk, poor mobile network environment to reduce packet size and other scenarios, about the serialization speed mainly depends on the SDK you use, so this article will not care about the serialization speed! This article will be introduced in proto3 syntax!
Agreement on
- The difference between PB3 and PB2 is basically,
- Pb3 introduces default values for all primitive types, so the default values are not serialized during data transfer, which makes it difficult to tell whether the values are set or not! So then PB3 supports it again
optional
! - If your business needs it
required
Keyword, but PB3 is notrequired
Keyword! - And PB3 does not support default Settings, and does not support
group message
So I still recommend if you are doing business development or PB2 is better!
- Basically the syntax is as follows, specific to see the official documentation: developers.google.com/protocol-bu…
syntax = "proto3";
message TestData {
enum EnumType {
UnknownType = 0; // Must start with 0!
Test1Type = 1;
Test2Type = 2;
}
message TestObj {
int64 t_int64 = 1;
}
string t_string = 1;
int64 t_int64 = 2;
bool t_bool = 3;
fixed64 t_fix64 = 4;
repeated int64 t_list_i64 = 5;
map<int64.string> t_map = 6;
EnumType t_enum = 7;
TestObj t_obj = 8;
repeated TestObj t_list_obj = 9;
map<string, TestData> t_map_obj = 10;
}
Copy the code
Let’s not say one of them
Compile:
/ Users/fanhaodong/software/protoc - 3.17.3 - osx - x86_64 / bin/protoc \ - experimental_allow_proto3_optional \ --proto_path=/Users/fanhaodong/go/src/code.xxxxxxxxx/fanhaodong.516/tool/pb_idl/test \ --plugin=protoc-gen-go=/Users/fanhaodong/go/bin/protoc-gen-go \ --go_opt=Mtest.proto=github.com/fanhaodong.516/go-tool/pb_gen/test \ --go_out=/Users/fanhaodong/go/src \ --plugin=protoc-gen-go-grpc=/Users/fanhaodong/go/bin/protoc-gen-go-grpc \ --go-grpc_opt=Mtest.proto=github.com/fanhaodong.516/go-tool/pb_gen/test \ --go-grpc_out=/Users/fanhaodong/go/src \ test.protoCopy the code
-
Pb serialization used core thought is varint, specific can see official articles: developers.google.com/protocol-bu…
-
The goal of this article is to make it easy to serialize and deserialize messages!
Encoding + decoding
About the encoding logic for each type of message: github.com/protocolbuf…
A simple example
Here is a test case, and you can see the output below
func Test_Marshal_Data(t *testing.T) {
var request = test.TestData{
TString: "hello",
TInt64: 520,
}
marshal, err := proto.Marshal(&request)
iferr ! =nil {
t.Fatal(err)
}
t.Log(hex.Dump(marshal))
// 00000000 0a 05 68 65 6c 6c 6f 10 88 04 |.. hello... |
}
Copy the code
Message format introduction (Key encoding introduction)
In particular, how do we analyze the code
The message itself is a series of KV pairs. In PB, the key is the ID of the field and the value is the message body! But in fact, key is the field ID+ field type, each key is applicable to varint encoding, generally read pb source code will be called this tag to represent key!
Pb, provide many types it at the time of transmission for type and made a map classification, probably is the 6 classes, specific can see official articles: developers.google.com/protocol-bu… ! So you can use the three bits, says so tag = (field_number < < 3) | wire_type
The first three bits of the tag represent the type, and the rest represent the ID!
Here to talk about a little skill, actually see the agreement this source code, a lot of bit operations, but in general | said set bit operation, & said get bit operation!
Varint contains only 7 fields, but tag contains only 3 fields, so the remaining 4 bytes store number, i.e. 1<<4 -1 field ID; Official: article introduces developers.google.com/protocol-bu…
Second, the reason why the largest field is 2^29-1 is because nuber is ui32 at most, and there are only 29 bits left for 32-3 types, so it’s 2^29-1
So the code for this concrete implementation is as follows:
type pbCodec struct {
buffer []byte
}
func (p *pbCodec) EncodeTag(fieldNumber uint32, writeType uint8) error {
tag := fieldNumber<<3 | uint32(writeType)
return p.EncodeVarInt(int64(tag))
}
// Will be implemented later
func (*pbCodec) EncodeVarInt(data int64) error {
return nil
}
func (p *pbCodec) DecodeTag(a) (fieldNumber uint32, writeType uint8, err error) {
result, err := p.DecodeVarInt()
iferr ! =nil {
return 0.0, err
}
// select * from 1
writeType = uint8(result&1<<3 - 1)
// 2 removes the lower 3 bits
fieldNumber = uint32(result >> 3)
return
}
Copy the code
Varint coding
The wiki introduce en.wikipedia.org/wiki/Variab…
The minimum value is 1 byte, and the maximum value is int64
- Case1:
For example: data=15 -> 0000 1111,
Encoding logic:
Varint is 0000 1111 because it can be represented in 7 bytes! So there is no need to set up MSB!
Analytic logic:
For example, 0000 1111&1 <<7 == 0 indicates that the MSB is not set. Then, the lower 7 bits are the true data. We can ignore this operation because 8 bits are also 0!
- case2:
For example, data=520 -> 0000 0010 0000 1000
Encoding logic:
Data & (1<<7) -1) = 000 1000, then set MSB 1000 1000, this is the first round;
The remaining bytes in the second round are 0000 0010 0= 4, which can be put down with 7 bytes, so 0000 0100
So the final result is 1000, 1000, 0000, 0100, which is [136,4], and you’ll see that it actually outputs a small endian representation!
Analytic logic:
First, varint actually prints a small endian representation, so we need to start with the low one!
The first byte is fetched 1000 1000, MSB is found, and the result is 000 1000 = 8
Then fetch the second byte, 0000 0100, find it is not MSB, and get the result 000 0100, we need to put it after 000 1000! How to do, actually with simple 000 0100 < < 7 | 000 1000 can get the result is 000, 0100, 000, 1000 = 0000, 0010, 0000, 1000
- Code implementation
func (p *pbCodec) EncodeVarInt(data int64) error {
// 1. Fetch the lower 7 bits (if 7 bytes cannot be dropped!)
// 2. Then set the high 8-bit identifier
// 3. Then move right
for data > (1<<7 - 1) {
p.buffer = append(p.buffer, byte(data&(1<<7- 1) | (1<<7)))
data >>= 7
}
p.buffer = append(p.buffer, byte(data))
return nil
}
func (p *pbCodec) DecodeVarInt(a) (int64, error) {
var (
x int64
n = 0
)
defer func(a) {
p.buffer = p.buffer[n:]
}()
for shift := uint(0); shift < 64; shift += 7 { // The offset starts at 0 and +7 each time
if n >= len(p.buffer) {
return 0, fmt.Errorf("not enough buffer")}// 1. Remove the first self
// 2. Then extract the lower 7 bits
// 3. Since the data is small endian, the retrieved data needs to move the offset
// 4. Then set it to the original data!
b := int64(p.buffer[n])
n++
x |= (b & 0x7F) << shift
if (b & 0x80) = =0 {
return x, nil}}return 0, fmt.Errorf("not support")}Copy the code
Non-varint encoding type
- Fixed type 64
Encoding: Encodes bytes from the lowest to the highest, output the lowest endian
Decoding: Reverse as opposed to encoding
Such as:
520 INT64 00 00 00 00 00 02 08 => 08 02 00 00 00 00 00 00 00 00 00 00 00 00 00Copy the code
- Fixed 32
- string
The encoding, since string is variable length, is len + content, len is varint, and content is UTF8
Field_number + field_type (proto.wirevarint) content_len (varint) contentCopy the code
- embed message
We need to know how big the message body is (like string), so it’s len+payload, and len is also varint.
Field_number + field_type (proto.wirevarint) payload_len (Varint encoding) payloadCopy the code
Zigzag encoding
In fact, we can find that the high bit of varint is set to MBS, and then we know that the highest bit of the negative number must be 1 (because of the inverse code), and then we can find a problem, the varint of the negative number can be very large, at least 10 bytes, so pb proposed sint32, sint64 codes, to solve this problem. The core is zigZag coding
For example, this code:
func Test_Marshal_Data(t *testing.T) {
t.Run("number".func(t *testing.T) {
var request = test.TestData{
TInt64: 520,
}
marshal, err := proto.Marshal(&request)
iferr ! =nil {
t.Fatal(err)
}
t.Log(hex.Dump(marshal))
t.Logf("len: %d\n".len(marshal))
})
t.Run("negative number".func(t *testing.T) {
var request = test.TestData{
TInt64: - 520.,
}
marshal, err := proto.Marshal(&request)
iferr ! =nil {
t.Fatal(err)
}
t.Log(hex.Dump(marshal))
t.Logf("len: %d\n".len(Marshal))})} == output, you can find a large gap!! === RUN Test_Marshal_Data === RUN Test_Marshal_Data/number demo_test.go:45: 00000000 10 88 04|... | demo_test.go:46: len: 3
=== RUN Test_Marshal_Data/negative_number
demo_test.go:56: 00000000 10 f8 fb ff ff ff ff ff ff ff 01|... | demo_test.go:57: len: 11
Copy the code
It’s actually quite simple
And we can see that this is actually the unsigned half for the positive and half for the negative!
For example, 32-bit: (n << 1) ^ (n >> 31)
For example, 64-bit: (n << 1) ^ (n >> 63)
Unique or: The value is 0 for the same and 1 for different
Such as
#- 11111 1111 1111 1111 1111 1111 1111 1111 1111
# d1=uint32(n) << 11111 1111 1111 1111 1111 1111 1111 1111 1110#D2 =uint32(N >> 31) (If negative numbers are moved to the left, 1 is added)1111 1111 1111 1111 1111 1111 1111 1111 1111# d1 ^ d20000 0000 0000 0000 0000 0001
# 10000 0000 0000 0000 0000 0001#n<<10000 0000 0000 0000 0000 10#n>>310000 0000 0000 0000 0000 0000#The output0000 0000 0000 0000 0000 10Copy the code
Code implementation (the 32-bit implementation may be important to note here)
// EncodeZigZag64 does zig-zag encoding to convert the given
// signed 64-bit integer into a form that can be expressed
// efficiently as a varint, even for negative values.
func EncodeZigZag64(v int64) uint64 {
return (uint64(v) << 1) ^ uint64(v>>63)}// EncodeZigZag32 does zig-zag encoding to convert the given
// signed 32-bit integer into a form that can be expressed
// efficiently as a varint, even for negative values.
func EncodeZigZag32(v int32) uint64 {
return uint64((uint32(v) << 1) ^ uint32((v >> 31)))}// DecodeZigZag32 decodes a signed 32-bit integer from the given
// zig-zag encoded value.
func DecodeZigZag32(v uint64) int32 {
return int32((uint32(v) >> 1) ^ uint32((int32(v&1) < <31) > >31))}// DecodeZigZag64 decodes a signed 64-bit integer from the given
// zig-zag encoded value.
func DecodeZigZag64(v uint64) int64 {
return int64((v >> 1) ^ uint64((int64(v&1) < <63) > >63))}Copy the code
Repeated (list)
Protbuf provides a repeated keyword to provide a list of types.
There are two specific encoding implementations of repeated
- Packed (Pb3 default, PB2 v2.1.0 introduced)
- unpacked
The official document: developers.google.com/protocol-bu…
packed
You can use the demo provided on the official website as an example:
message Test4 {
repeated int32 d = 4 [packed=true];
}
Copy the code
Coded output
22 // key (field number 4, wire type 2)
06 // payload size (6 bytes)
03 // first element (varint 3)
8E 02 // second element (varint 270)
9E A7 05 // third element (varint 86942)
Copy the code
In fact, this is our normal idea, ID-type-data! However, there are few supported types, only the following types are supported!
WireVarint = 0
WireFixed32 = 5
WireFixed64 = 1
Copy the code
unpacked
Again, the above example, if unpacked
message Test5 {
repeated int32 d = 4 [packed = false];
}
Copy the code
Output:
00000000 20 03 20 8e 02 20 9e a7 05|... . |20 // key (field number 4, wire type=proto.WireVarint)
03 // (varint 3)
20 // key (field number 4, wire type=proto.WireVarint)
8e 02 // (varint 270)
20 // key (field number 4, wire type=proto.WireVarint)
9e a7 05 // (varint 86942)
Copy the code
map
As can be seen in file desc, a MAP is a repeated KV message.
{
"name": "TestData"."field": [{"name": "t_map"."number": 6."label": 3."type": 11."type_name": ".TestData.TMapEntry"."json_name": "tMap"},]."nested_type": [{"name": "TMapEntry"."field": [{"name": "key"."number": 1."label": 1."type": 3."json_name": "key"
},
{
"name": "value"."number": 2."label": 1."type": 9."json_name": "value"}]."options": {
"map_entry": true}}]."enum_type": []}Copy the code
So it’s easy to code, for example
message TestMapData1 {
// You can't define TMapEntry.
map<int64.string> t_map = 6; } ==> actually generates this code!message TestMapData2 {
message TMapEntry {
int64 key = 1;
string value = 2;
}
repeated TMapEntry t_map = 6;
}
Copy the code
So the code can be roughly:
t_map= {1:"1".2:"2"} = >32 05 08 01 12 01 31 32 05 08 02 12 01 32= >32 // field_number=6 and write_type=proto.WireBytes
05 // entry data length=5
08 // entry data key field_number=1 and write_type=proto.WireVarint
01 // entry data key_value=varint(1)
12 // entry data value field_number=2 and write_type=proto.WireBytes
01 // entry data value len= varint(1)
31 // entry data value="1"
32 // field_number=6 and write_type=proto.WireBytes
05 // entry data length=5
08 02 12 01 32 / / same as above!
Copy the code
field order
Pb coding does not care about the coding sequence of the field, that is, encode when the order is not the same will lead to the output of data is not the same!
The key order of the map is also affected!
So most apis specify whether they support Deterministic or not, and if set to true, the result is always guaranteed to be the same, otherwise it might be different!
But you know, the actual effect is that it must be slower than not to open it, because it needs to order!
Pb protocol overview
This is a similar to BNF paradigm, specific can consult: developers.google.com/protocol-bu…
message: = (tagValue)* You can think of this as "key value" tag := (field << 3) BIT_OR wire_type, encoded as varint value := (varint|zigzag) for wire_type==0 | fixed32bit for wire_type==5 | fixed64bit for wire_type==1 2 | | delimited for wire_type = = group_start for wire_type = = 3 | This is like "open parenthesis" group_end for wire_type = = 4 This is like "close parenthesis" varint: = int32 | int64 | uint32 | uint64 | bool | enum. encoded as varints zigzag := sint32 | sint64, encoded as zig-zag varints fixed32bit := sfixed32 | fixed32 | float, encoded as 4-byte little-endian; memcpy of the equivalent C types (u? int32_t, float) fixed64bit := sfixed64 | fixed64 | double, encoded as 8-byte little-endian; memcpy of the equivalent C types (u? int64_t, double) delimited := size (message | string | bytes | packed), size encoded as varint message := valid protobuf sub-message string := valid UTF-8 string (often simply ASCII); max 2GB of bytes bytes := any sequence of 8-bit bytes; max 2GB packed := varint* | fixed32bit* | fixed64bit*, consecutive values of the type described in the protocol definition varint encoding: Sets MSB of 8-bit byte to indicate "No more bytes" zigzag encoding: SINt32 and SINt64 types use zigzag encoding.Copy the code
Protoc command description
Here is a description of the protoc architecture diagram, generate architecture diagram
The first node protoc is actually written in c++, github.com/protocolbuf…
The second node is the protoc output binary CodeGeneratorRequest, see github.com/protocolbuf…
The third node is the business logic, the back end, we can according to the request to output products to the PB file
The fourth node outputs the CodeGeneratorResponse
At present, although there are a lot of projects in the resolution of protoc AST, are their own implementation of lexical parsing and syntax parsing, so if you can put protoc encapsulation lib libraries directly use go/ Java docking on the line!
Pb some other terms
- Package, the same message and enumeration and enumeration fields cannot exist under the same package
- Include (import), relative pata based on include_path as the root directory
- Options, which can be interpreted as annotations, can mark fields, messages, enumerations, methods!
Refer to the article
- Protobuf coding introduction
- Varint/Zigzag protobuf encoding