0 Foreword

ProtoBuf (Protocol Buffers) is a lightweight, efficient structured-data serialization format developed by Google. It performs better than JSON and XML and is widely used for data transmission. ProtoBuf offers many data types, however, so every user has to decide which type fits which scenario best and saves the most space. To help you understand and use ProtoBuf well, this article focuses on its basic data types and analyzes the usage scenarios and caveats of each.

Note: It is good to have some understanding of ProtoBuf’s syntax and serialization principles before reading this article.


Recommended literature:


[1]
Serialization: This is a bona fide explanation of the Protocol Buffer syntax


[2]
Protocol Buffer Serialization – Why does the Protocol Buffer perform so well?


[3]
Learn the principle of ProtoBuf serialization thoroughly with a complete example

1 The range of basic data types

The range of values for the data type:

In Java:

Floating-point range

  • Float (32 bits) = 1 bit (sign bit) + 8 bits (exponent bits) + 23 bits (mantissa bits). A float carries 6 to 7 significant decimal digits, of which only 6 are guaranteed;
  • Double (64 bits) = 1 bit (sign bit) + 11 bits (exponent bits) + 52 bits (mantissa bits). A double carries 15 to 16 significant decimal digits.

For how floating-point numbers are stored, see:
How are floating-point types (float, double) stored in memory?
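The bit layout above can be inspected directly. The following is a minimal sketch that uses only the JDK; the class name FloatLayoutSketch and the helper fields are made up for illustration.

```java
class FloatLayoutSketch {
    /** Splits a float into its IEEE 754 sign, exponent and mantissa fields. */
    static int[] fields(float f) {
        int bits = Float.floatToIntBits(f);
        int sign = bits >>> 31;              // 1 bit
        int exponent = (bits >>> 23) & 0xFF; // 8 bits, stored with a bias of 127
        int mantissa = bits & 0x7FFFFF;      // 23 bits
        return new int[] {sign, exponent, mantissa};
    }

    public static void main(String[] args) {
        int[] f = fields(170.0f);
        System.out.println("sign=" + f[0] + " exponent=" + f[1]
                + " mantissa=0x" + Integer.toHexString(f[2]));
        // 23 mantissa bits are not enough to hold 170.0000001,
        // so the literal is rounded to the nearest float, 170.0.
        System.out.println(170.0000001f == 170.0f); // true
    }
}
```

With only 23 mantissa bits, neighbouring floats near 170 are about 0.000015 apart, which is why the seventh decimal place cannot survive.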

2 ProtoBuf data type

The mapping between the basic data types of ProtoBuf and the Java data types is as follows:

The mapping is taken from the ProtoBuf website:
Language Guide (proto3)

Note that Java does not distinguish between unsigned and signed integers, so ProtoBuf’s int and uint types both map to Java’s int/long data types.

There are roughly two ways to serialize a ProtoBuf data type. One is variable-length encoding (e.g., Varint): ProtoBuf allocates space according to the value actually stored, using as few bytes as possible without losing precision. For example, the integer 1 in a field defined as int32 takes only 1 byte instead of 4. Note that the saving is at byte granularity (8 bits per byte), not bit granularity, so bits inside a byte can still go unused. The other is fixed-length encoding (e.g., 64-bit, 32-bit), which always takes as much space as the declared type, whether or not some of it is wasted. There is also a more special method, Length-delimited, which is used for array-like data: a field recording the length is written first, followed by the contents in order. The detailed principle is not described here; see the literature recommended above.
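To make the variable-length idea concrete, here is a minimal hand-written sketch of base-128 varint encoding. It is not the ProtoBuf library's own code; the class name VarintSketch and the method encodeVarint are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

class VarintSketch {
    /** Encodes a value as a varint: 7 data bits per byte, high bit set when more bytes follow. */
    static byte[] encodeVarint(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80)); // low 7 bits plus continuation bit
            value >>>= 7;                             // unsigned shift, so negatives terminate
        }
        out.write((int) value);                       // last byte, continuation bit clear
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encodeVarint(1L)));    // [1]
        System.out.println(Arrays.toString(encodeVarint(1000L))); // [-24, 7]
        System.out.println(encodeVarint(-1L).length);             // 10
    }
}
```

Java prints bytes as signed values, so 0xE8 shows up as -24. The same convention applies to every serialized array in the experiments.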

3 Data Experiment

In order to verify the serialization effect of ProtoBuf data types, the following data experiments are designed.

1. First, write a custom proto file that contains the basic data types and generate the Java classes from it. (For how to edit and compile proto files in IDEA in one stop, see the previous article in this series, Protobuf (1): one-stop proto editing and compilation based on IDEA.)

The contents of the proto file are as follows:

// Google Protocol Buffers Version 3.
syntax = "proto3";
option java_package = "learnProto.selfTest";
option java_outer_classname = "MyTest";

import "google/protobuf/timestamp.proto";

message Data{
 uint32 uint32 = 1;
 uint64 uint64 = 2;
 int32 int32 = 3;
 int64 int64 = 4;
 sint32 sint32 = 5;
 sint64 sint64 = 6;
 fixed32 fixed32 = 7;
 fixed64 fixed64 = 8;
 bool bool=9;
 string str = 10;
 float float=11;
 double double=12;
 google.protobuf.Timestamp time = 13;
}

2. Second, assign different values to each data type, serialize them, and observe how many bytes each value takes after serialization.

3. Finally, summarize the results into suggestions for use.

3.1 Integer data experiment

The test code is as follows:

import java.util.Arrays;
import org.junit.Test; // JUnit 4 assumed

public class DemoTest {

    public void convertUint32(int value) {
        // 1. Create the builder
        MyTest.Data.Builder dataBuilder = MyTest.Data.newBuilder();
        // 2. Set the value
        dataBuilder.setUint32(value);
        // 3. Build the message
        MyTest.Data data = dataBuilder.build();
        // 4. Serialize
        byte[] bytes = data.toByteArray();
        System.out.println(value + " serialized data: " + Arrays.toString(bytes) + ", the number of bytes: " + bytes.length);
    }

    // ... convertInt32, convertSint32 and convertFixed32 are similar; only the set method needs to change.

    @Test
    public void test32() {
        System.out.println("=================uint32================");
        convertUint32(1);
        convertUint32(1000);
        convertUint32(Integer.MAX_VALUE);
        convertUint32(-1);
        convertUint32(-1000);
        convertUint32(Integer.MIN_VALUE);
        System.out.println("=================int32================");
        convertInt32(1);
        convertInt32(1000);
        convertInt32(2147483647);
        convertInt32(-1);
        convertInt32(-1000);
        convertInt32(-2147483648);
        System.out.println("=================sint32================");
        convertSint32(1);
        convertSint32(1000);
        convertSint32(2147483647);
        convertSint32(-1);
        convertSint32(-1000);
        convertSint32(-2147483648);
        System.out.println("=================fix32================");
        convertFixed32(1);
        convertFixed32(1000);
        convertFixed32(2147483647);
        convertFixed32(-1);
        convertFixed32(-1000);
        convertFixed32(-2147483648);
    }
}

The running results are as follows:

=================uint32================
1 serialized data: [8, 1], the number of bytes: 2
1000 serialized data: [8, -24, 7], the number of bytes: 3
2147483647 serialized data: [8, -1, -1, -1, -1, 7], the number of bytes: 6
-1 serialized data: [8, -1, -1, -1, -1, 15], the number of bytes: 6
-1000 serialized data: [8, -104, -8, -1, -1, 15], the number of bytes: 6
-2147483648 serialized data: [8, -128, -128, -128, -128, 8], the number of bytes: 6
=================int32================
1 serialized data: [24, 1], the number of bytes: 2
1000 serialized data: [24, -24, 7], the number of bytes: 3
2147483647 serialized data: [24, -1, -1, -1, -1, 7], the number of bytes: 6
-1 serialized data: [24, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1], the number of bytes: 11
-1000 serialized data: [24, -104, -8, -1, -1, -1, -1, -1, -1, -1, 1], the number of bytes: 11
-2147483648 serialized data: [24, -128, -128, -128, -128, -8, -1, -1, -1, -1, 1], the number of bytes: 11
=================sint32================
1 serialized data: [40, 2], the number of bytes: 2
1000 serialized data: [40, -48, 15], the number of bytes: 3
2147483647 serialized data: [40, -2, -1, -1, -1, 15], the number of bytes: 6
-1 serialized data: [40, 1], the number of bytes: 2
-1000 serialized data: [40, -49, 15], the number of bytes: 3
-2147483648 serialized data: [40, -1, -1, -1, -1, 15], the number of bytes: 6
=================fix32================
1 serialized data: [61, 1, 0, 0, 0], the number of bytes: 5
1000 serialized data: [61, -24, 3, 0, 0], the number of bytes: 5
2147483647 serialized data: [61, -1, -1, -1, 127], the number of bytes: 5
-1 serialized data: [61, -1, -1, -1, -1], the number of bytes: 5
-1000 serialized data: [61, 24, -4, -1, -1], the number of bytes: 5
-2147483648 serialized data: [61, 0, 0, 0, -128], the number of bytes: 5

1. uint32: the value range is equivalent to that of int32 (negative numbers can be stored, because proto neither checks nor restricts them). Positive numbers take at most 5 bytes; negative numbers always take 5 bytes. The first byte of each result stores the field number and the wire type of the field in the proto, known as the tag in the principles. A 32-bit value needs up to 5 bytes because the highest bit of each byte records whether the value continues into the next byte (1 means it continues, 0 means it ends). Each byte therefore carries only 7 bits of data, and since 4*7 = 28 < 32, a fifth byte is needed.

2. int32: a positive number takes at most 5 bytes; a negative number always takes 10 bytes. (A negative int32 is sign-extended to 64 bits before it is written as a varint, which keeps int32 wire-compatible with int64.)

3. sint32: ZigZag encoding is applied before the value is stored (ZigZag(n) = (n << 1) ^ (n >> 31) for a 32-bit n). It maps signed numbers to unsigned ones so that values of small magnitude stay small, which solves the problem of negative numbers taking too much space. Positive and negative numbers both take at most 5 bytes, which is space efficient.

4. fixed32: always uses 4 bytes, for positive and negative numbers alike, because the variable-length strategy is abandoned. Suitable for fields in which a large share of the values are large.

The rules of 64-bit are similar to those of 32-bit.
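The byte counts above can be reproduced without the ProtoBuf library. This is a minimal sketch under the encoding rules just described; the class ZigZagSketch and its helper names are made up, and the sizes exclude the 1-byte tag that the experiment output includes.

```java
class ZigZagSketch {
    /** ZigZag mapping for 32-bit values: 0, -1, 1, -2, ... become 0, 1, 2, 3, ... */
    static int zigZag32(int n) {
        return (n << 1) ^ (n >> 31);
    }

    /** Number of varint bytes for a 64-bit value treated as unsigned. */
    static int varintSize(long value) {
        int size = 1;
        while ((value & ~0x7FL) != 0) {
            size++;
            value >>>= 7;
        }
        return size;
    }

    /** uint32 payload: the 32 bits are written as they are. */
    static int uint32Size(int n) {
        return varintSize(n & 0xFFFFFFFFL);
    }

    /** int32 payload: a negative value is sign-extended to 64 bits first. */
    static int int32Size(int n) {
        return varintSize((long) n);
    }

    /** sint32 payload: ZigZag first, then written as an unsigned 32-bit value. */
    static int sint32Size(int n) {
        return varintSize(zigZag32(n) & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 1000, Integer.MAX_VALUE, -1, -1000, Integer.MIN_VALUE}) {
            System.out.println(n + ": uint32=" + uint32Size(n)
                    + " int32=" + int32Size(n) + " sint32=" + sint32Size(n));
        }
    }
}
```

Adding 1 for the tag gives the totals seen in the experiment, for example 6, 11 and 2 bytes for -1 as uint32, int32 and sint32.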

3.2 Experiment with string type data

The test code is as follows:

@Test
public void testStr() {
    System.out.println("=================string================");
    convertStr("");
    convertStr("a");
    convertStr("abc");
    convertStr("啊");
    convertStr("啊啊");
}

The running results are as follows:

=================string================
 serialized data: [], the number of bytes: 0
a serialized data: [82, 1, 97], the number of bytes: 3
abc serialized data: [82, 3, 97, 98, 99], the number of bytes: 5
啊 serialized data: [82, 3, -27, -107, -118], the number of bytes: 5
啊啊 serialized data: [82, 6, -27, -107, -118, -27, -107, -118], the number of bytes: 8

String type: in proto3 the default value of a string is the empty string, which takes no bytes after serialization. A non-empty string takes 1 byte for the tag, 1 byte for the length (for short strings), and the UTF-8 bytes of its content: 1 byte per English (ASCII) character and 3 bytes per Chinese character.
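The string sizes can be checked with the JDK alone. The sketch below assumes a field number below 16 (1-byte tag) and content shorter than 128 bytes (1-byte length); StringSizeSketch and stringFieldSize are made-up names.

```java
import java.nio.charset.StandardCharsets;

class StringSizeSketch {
    /** Serialized size of a short proto3 string field: tag + length + UTF-8 content. */
    static int stringFieldSize(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length == 0) {
            return 0; // the default value is not written at all
        }
        if (utf8.length >= 128) {
            throw new IllegalArgumentException("sketch only covers a 1-byte length prefix");
        }
        return 1 + 1 + utf8.length;
    }

    public static void main(String[] args) {
        System.out.println(stringFieldSize(""));             // 0
        System.out.println(stringFieldSize("a"));            // 3
        System.out.println(stringFieldSize("abc"));          // 5
        System.out.println(stringFieldSize("\u554A"));       // 5, one Chinese character
        System.out.println(stringFieldSize("\u554A\u554A")); // 8
    }
}
```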

3.3 Boolean data experiment

The test code is as follows:

@Test
public void testbool() {
    System.out.println("=================bool================");
    convertBool(false);
    convertBool(true);
}

The running results are as follows:

=================bool================
false serialized data: [], the number of bytes: 0
true serialized data: [72, 1], the number of bytes: 2

bool type: in proto3 the default Boolean value is false, so a false value takes no bytes after serialization. A true value takes 1 byte of data, plus 1 byte for the tag.
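The leading byte in each result (72 here, 82 for the string field, and so on) is the tag. A minimal sketch of how it is formed, assuming the standard wire-type numbers; the class TagSketch and its constant names are made up for illustration.

```java
class TagSketch {
    static final int WIRETYPE_VARINT = 0;           // int32, int64, uint32, uint64, sint32, sint64, bool
    static final int WIRETYPE_FIXED64 = 1;          // fixed64, double
    static final int WIRETYPE_LENGTH_DELIMITED = 2; // string, embedded messages
    static final int WIRETYPE_FIXED32 = 5;          // fixed32, float

    /** The tag packs the field number and the wire type into one varint. */
    static int tag(int fieldNumber, int wireType) {
        return (fieldNumber << 3) | wireType;
    }

    public static void main(String[] args) {
        System.out.println(tag(9, WIRETYPE_VARINT));            // 72, bool field 9
        System.out.println(tag(10, WIRETYPE_LENGTH_DELIMITED)); // 82, string field 10
        System.out.println(tag(7, WIRETYPE_FIXED32));           // 61, fixed32 field 7
        System.out.println(tag(12, WIRETYPE_FIXED64));          // 97, double field 12
    }
}
```

Because three bits go to the wire type, field numbers 1 to 15 fit in a single tag byte, which holds for all fields of the Data message used here.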

3.4 Floating point data experiment

Floating-point data uses fixed-length encoding, so there is little to test in the encoding itself. In practice, however, many floating-point values (such as latitude and longitude coordinates) can be converted into integers of a given precision (accepting some loss of accuracy). In that scenario, is it better to use an integer type or to keep using floating point?

The test code is as follows:

public void convertAndValidInt(long value) {
    // 1. Create the builder
    MyTest.Data.Builder dataBuilder = MyTest.Data.newBuilder();
    // 2. Set the value (the float and double variants differ only in the set and get methods)
    dataBuilder.setInt64(value);
    // 3. Build the message
    MyTest.Data data = dataBuilder.build();
    // 4. Serialize
    byte[] bytes = data.toByteArray();
    System.out.println(value + " serialized data: " + Arrays.toString(bytes) + ", the number of bytes: " + bytes.length);
    // 5. Deserialize to validate the value
    try {
        MyTest.Data parsed = MyTest.Data.parseFrom(bytes);
        System.out.println("deserialized data = " + parsed.getInt64());
    } catch (InvalidProtocolBufferException e) {
        e.printStackTrace();
    }
}

@Test
public void test() {
    System.out.println("================if 7 decimal places are retained (accurate to cm)===============");
    System.out.println("--> converted to an integer, encoded with int64: ");
    convertAndValidInt(1700000001);
    System.out.println("--> still a decimal, encoded with float: ");
    convertAndValidFloat(170.0000001f);
    System.out.println("--> still a decimal, encoded with double: ");
    convertAndValidDouble(170.0000001);
    System.out.println("================if 8 decimal places are retained (accurate to mm)===============");
    System.out.println("--> converted to an integer, encoded with int64: ");
    convertAndValidInt(Long.valueOf("17000000001"));
    System.out.println("--> still a decimal, encoded with float: ");
    convertAndValidFloat(170.00000001f);
    System.out.println("--> still a decimal, encoded with double: ");
    convertAndValidDouble(170.00000001);
}

The running results are as follows:

================if 7 decimal places are retained (accurate to cm)===============
--> converted to an integer, encoded with int64:
1700000001 serialized data: [32, -127, -30, -49, -86, 6], the number of bytes: 6
deserialized data = 1700000001
--> still a decimal, encoded with float:
170.0 serialized data: [93, 0, 0, 42, 67], the number of bytes: 5
deserialized data = 170.0
--> still a decimal, encoded with double:
170.0000001 serialized data: [97, -27, -81, 53, 0, 0, 64, 101, 64], the number of bytes: 9
deserialized data = 170.0000001
================if 8 decimal places are retained (accurate to mm)===============
--> converted to an integer, encoded with int64:
17000000001 serialized data: [32, -127, -44, -99, -86, 63], the number of bytes: 6
deserialized data = 17000000001
--> still a decimal, encoded with float:
170.0 serialized data: [93, 0, 0, 42, 67], the number of bytes: 5
deserialized data = 170.0
--> still a decimal, encoded with double:
170.00000001 serialized data: [97, 100, 94, 5, 0, 0, 64, 101, 64], the number of bytes: 9
deserialized data = 170.00000001

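1. Storing longitude and latitude as float loses precision once 7 or more decimal places are needed: 170.0000001 comes back as 170.0.

2. For longitude, latitude, and similar floating-point values, converting to an integer and using int64 encoding saves more space (6 bytes versus 9 bytes for double in this experiment).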
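The conversion used in the experiment can be written as a pair of small helpers. This is a minimal sketch, assuming a fixed scale of 7 decimal places; CoordinateSketch, toFixed and fromFixed are made-up names.

```java
class CoordinateSketch {
    /** Scale factor for 7 decimal places of a degree (assumption of this sketch). */
    static final double SCALE = 1e7;

    /** Converts degrees to a scaled integer suitable for an int64 (or sint64) field. */
    static long toFixed(double degrees) {
        return Math.round(degrees * SCALE);
    }

    /** Converts the scaled integer back to degrees. */
    static double fromFixed(long fixed) {
        return fixed / SCALE;
    }

    public static void main(String[] args) {
        long fixed = toFixed(170.0000001);
        System.out.println(fixed);                   // 1700000001
        System.out.println(fromFixed(fixed));
        // For comparison, narrowing to float drops the seventh decimal place.
        System.out.println((float) 170.0000001);     // 170.0
    }
}
```

Real coordinates can be negative, in which case the earlier integer results suggest sint64 rather than int64 for the scaled value.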

3.5 Time-stamp data experiment

Many scenarios use timestamps. What type should you choose? The test code is as follows:

@Test
public void testTime() {
    System.out.println("================test timestamp (accurate to seconds)===============");
    System.out.println("--> with int64 encoding: ");
    convertInt64(Long.valueOf("1600229610283"));
    System.out.println("--> with uint64 encoding: ");
    convertUint64(Long.valueOf("1600229610283"));
    System.out.println("--> with fixed64 encoding: ");
    convertFixed64(Long.valueOf("1600229610283"));
    System.out.println("--> with timeStamp encoding: ");
    convertTimeNanos(Long.valueOf("1600229610283"));
    System.out.println("================test timestamp (accurate to milliseconds)===============");
    System.out.println("--> with int64 encoding: ");
    convertInt64(Long.valueOf("1600229610283000"));
    System.out.println("--> with uint64 encoding: ");
    convertUint64(Long.valueOf("1600229610283000"));
    System.out.println("--> with fixed64 encoding: ");
    convertFixed64(Long.valueOf("1600229610283000"));
    System.out.println("--> with timeStamp encoding: ");
    convertTimeNanos(Long.valueOf("1600229610283000"));
}

The running results are as follows:

================test timestamp (accurate to seconds)===============
--> with int64 encoding:
1600229610283 serialized data: [32, -85, -90, -8, -88, -55, 46], the number of bytes: 7
--> with uint64 encoding:
1600229610283 serialized data: [16, -85, -90, -8, -88, -55, 46], the number of bytes: 7
--> with fixed64 encoding:
1600229610283 serialized data: [65, 43, 19, 30, -107, 116, 1, 0, 0], the number of bytes: 9
--> with timeStamp encoding:
1600229610283 serialized data: [106, 8, 8, -64, 12, 16, -85, -90, -66, 109], the number of bytes: 10
================test timestamp (accurate to milliseconds)===============
--> with int64 encoding:
1600229610283000 serialized data: [32, -8, -65, -21, -21, -25, -20, -21, 2], the number of bytes: 9
--> with uint64 encoding:
1600229610283000 serialized data: [16, -8, -65, -21, -21, -25, -20, -21, 2], the number of bytes: 9
--> with fixed64 encoding:
1600229610283000 serialized data: [65, -8, -33, 122, 125, 102, -81, 5, 0], the number of bytes: 9
--> with timeStamp encoding:
1600229610283000 serialized data: [106, 10, 8, -27, -43, 97, 16, -8, -37, -128, -93, 2], the number of bytes: 12

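[Summary] In terms of space, the Timestamp type provided by Google does not reduce the serialized size. After comprehensive comparison, int64 encoding is recommended for timestamp data. Note that the test values have 13 and 16 digits, so they are really epoch milliseconds and microseconds rather than seconds and milliseconds; the conclusion is the same.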
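The Timestamp sizes can be estimated by hand. The sketch below assumes the standard google.protobuf.Timestamp layout (an int64 seconds field numbered 1 and an int32 nanos field numbered 2), an enclosing field number below 16, and non-negative values; TimestampSizeSketch and its methods are made-up names. The 10-byte and 12-byte results in the experiment are consistent with the test value having been split as nanoseconds (seconds = 1600 and 1600229 respectively), which is an inference from the bytes rather than something the test code shows.

```java
class TimestampSizeSketch {
    /** Number of varint bytes for a non-negative value. */
    static int varintSize(long value) {
        int size = 1;
        while ((value & ~0x7FL) != 0) {
            size++;
            value >>>= 7;
        }
        return size;
    }

    /** Size of a plain int64 field: 1 tag byte plus the varint, or nothing for the default 0. */
    static int int64FieldSize(long value) {
        return value == 0 ? 0 : 1 + varintSize(value);
    }

    /** Size of an embedded Timestamp field: tag, length prefix, then the two inner fields. */
    static int timestampFieldSize(long seconds, int nanos) {
        int payload = 0;
        if (seconds != 0) {
            payload += 1 + varintSize(seconds);
        }
        if (nanos != 0) {
            payload += 1 + varintSize(nanos);
        }
        return 1 + varintSize(payload) + payload;
    }

    public static void main(String[] args) {
        System.out.println(int64FieldSize(1600229610283L));            // 7
        System.out.println(timestampFieldSize(1600L, 229610283));      // 10
        System.out.println(timestampFieldSize(1600229L, 610283000));   // 12
        // The same instant split into whole seconds and nanoseconds:
        System.out.println(timestampFieldSize(1600229610L, 283000000)); // 14
    }
}
```

Under these assumptions a Timestamp carrying whole epoch seconds is larger still, so the recommendation to use int64 holds either way.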

4 Summary

For integer data:

1. If there are negative numbers, sint is recommended.

2. If all values are positive, uint, int, and sint all work, but sint adds the ZigZag computation. It is recommended to use int by default and sint where negative numbers are likely. Note: uint and int do not differ in Java, but they do in C, so proto is designed with multiple platforms in mind.

3. If a large share of the values are large, use fixed32 or fixed64.

For string data: avoid Chinese characters where possible, since each one takes 3 bytes in UTF-8.

For the timestamp: int64 encoding is recommended.

For floating-point numbers such as coordinates: it is recommended to convert them to integer data and encode them with int64.

5 References

[2] Protobuf encoding implementation parsing (Java)
[3] Serialization: Protocol Buffer serialization
[7] Language Guide (proto3)
[8] How are floating-point types (float, double) stored in memory?