Definitions
- Serialization: the process of converting data structures or objects into a byte sequence
- Deserialization: the process of converting the byte sequence produced by serialization back into data structures or objects
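As a minimal sketch of the two definitions, the round trip below uses Python's standard `json` module. The `person` dictionary is a hypothetical object chosen only for illustration.

```python
import json

# Hypothetical in-memory object, used only for illustration.
person = {"name": "Mary", "age": 18}

# Serialization: data structure -> byte sequence.
payload = json.dumps(person).encode("utf-8")

# Deserialization: byte sequence -> data structure.
restored = json.loads(payload.decode("utf-8"))
```

After the round trip, `restored` compares equal to `person`, while `payload` is a plain `bytes` value that can be written to a file or sent over a socket.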
Reasons to serialize:
- To persist objects by saving their byte sequence to a local file or a database;
- To send and receive objects over a network as a byte stream;
- To pass objects between processes.
Typical C/S serialization and deserialization
- Interface Description Language (IDL): the parties involved in communication need to agree on the content they exchange. These conventions are described in a language that is independent of any specific development language or platform. This language is called the Interface Description Language (IDL).
- IDL Compiler: converts IDL files into libraries for each target language.
- Stub/Skeleton Lib: the working code responsible for serialization and deserialization. The Stub is deployed on the client side of a distributed system. On the one hand, it receives parameters from the application layer, serializes them, and sends them to the server through the underlying protocol stack. On the other hand, it receives the server's serialized result data, deserializes it, and delivers it to the client application layer. The Skeleton is deployed on the server side and mirrors the Stub: it receives the serialized parameters from the transport layer, deserializes them, and passes them to the server application layer. The results produced by the application layer are then serialized and sent back to the client Stub.
Note: in general, the serializing and deserializing sides need to share the same IDL files
Typical serialization solutions
Examples include XML, JSON, Protobuf, Thrift, and Avro
JSON
Note: JSON and XML are parsed in similar ways, so JSON is used as the example here
JSON data types:
- string
- number: a numeric value, either integer or floating point
- boolean: true or false
- null
JSON structures:
- Key/value pairs, analogous to an object, struct, dictionary, or hash table in other languages
- Ordered list of values, analogous to an array, vector, list, or sequence in other languages
JSON example:
{
"key1": 1,
"key2": ["value2"]
}
For efficiency's sake, streaming is almost the only option: the parser scans the JSON string once, from start to end, and builds the entire data structure as it goes.
Parsing steps:
- Step 1: Lexical analysis, which splits the characters into tokens. For example, the JSON string {"name": "Mary", "age": 18} yields the token stream: { "name" : "Mary" , "age" : 18 }
- Step 2: Build the JSON object/array from the token stream
Token stream
token | meaning |
---|---|
NULL | null |
NUMBER | number |
STRING | string |
BOOLEAN | true/false |
SEP_COLON | : |
SEP_COMMA | , |
BEGIN_OBJECT | { |
END_OBJECT | } |
BEGIN_ARRAY | [ |
END_ARRAY | ] |
END_DOCUMENT | end of the JSON document |
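The first parsing step (characters to tokens) can be sketched with a small regex-based scanner. The token names come from the table above; the regular expressions and the `tokenize` function are assumptions of this sketch and do not cover every corner of the JSON grammar.

```python
import re

# Token names follow the table above; the patterns are simplified assumptions.
TOKEN_SPEC = [
    ("NULL", r"null"),
    ("BOOLEAN", r"true|false"),
    ("NUMBER", r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?"),
    ("STRING", r'"(?:[^"\\]|\\.)*"'),
    ("SEP_COLON", r":"),
    ("SEP_COMMA", r","),
    ("BEGIN_OBJECT", r"\{"),
    ("END_OBJECT", r"\}"),
    ("BEGIN_ARRAY", r"\["),
    ("END_ARRAY", r"\]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Scan text once from start to end and return (token, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":  # whitespace between tokens is dropped
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    tokens.append(("END_DOCUMENT", ""))
    return tokens


tokens = tokenize('{"name": "Mary", "age": 18}')
```

For the sample string, the scanner yields BEGIN_OBJECT, STRING, SEP_COLON, STRING, SEP_COMMA, STRING, SEP_COLON, NUMBER, END_OBJECT and END_DOCUMENT, in that order.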
JSON state machine
The JSON parser is essentially a state machine.
The state machine decides what to expect next from the character it reads:
- '{' : expects a JSON object;
- ':' : expects the value of an object member;
- ',' : expects the next key/value pair of a JSON object, or the next element of a JSON array;
- '[' : expects a JSON array;
- 't' : expects true;
- 'f' : expects false;
- 'n' : expects null;
- '"' : expects a string;
- '0' to '9' : expects a number.
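The character-driven transitions listed above can be sketched as a small recursive parser that decides what to expect from the next character. This is a simplified illustration, not a complete JSON parser: string escape sequences are not handled, error reporting is minimal, and all function names are hypothetical.

```python
def parse_json(text):
    """Parse a JSON document by dispatching on the next character."""
    value, pos = _parse_value(text, _skip_ws(text, 0))
    if _skip_ws(text, pos) != len(text):
        raise ValueError("trailing data after the JSON value")
    return value


def _skip_ws(text, pos):
    while pos < len(text) and text[pos] in " \t\r\n":
        pos += 1
    return pos


def _parse_value(text, pos):
    ch = text[pos]
    if ch == "{":                        # '{' : expect an object
        return _parse_object(text, pos + 1)
    if ch == "[":                        # '[' : expect an array
        return _parse_array(text, pos + 1)
    if ch == '"':                        # '"' : expect a string
        return _parse_string(text, pos + 1)
    if text.startswith("true", pos):     # 't' : expect true
        return True, pos + 4
    if text.startswith("false", pos):    # 'f' : expect false
        return False, pos + 5
    if text.startswith("null", pos):     # 'n' : expect null
        return None, pos + 4
    if ch == "-" or ch.isdigit():        # digit (or sign): expect a number
        return _parse_number(text, pos)
    raise ValueError(f"unexpected character {ch!r} at offset {pos}")


def _parse_string(text, pos):
    end = text.index('"', pos)           # simplification: no escape handling
    return text[pos:end], end + 1


def _parse_number(text, pos):
    end = pos
    while end < len(text) and text[end] in "+-0123456789.eE":
        end += 1
    literal = text[pos:end]
    is_float = any(c in literal for c in ".eE")
    return (float(literal) if is_float else int(literal)), end


def _parse_array(text, pos):
    items = []
    pos = _skip_ws(text, pos)
    if text[pos] == "]":
        return items, pos + 1
    while True:
        value, pos = _parse_value(text, _skip_ws(text, pos))
        items.append(value)
        pos = _skip_ws(text, pos)
        if text[pos] == ",":             # ',' : expect the next element
            pos += 1
        elif text[pos] == "]":
            return items, pos + 1
        else:
            raise ValueError(f"expected ',' or ']' at offset {pos}")


def _parse_object(text, pos):
    obj = {}
    pos = _skip_ws(text, pos)
    if text[pos] == "}":
        return obj, pos + 1
    while True:
        pos = _skip_ws(text, pos)
        if text[pos] != '"':
            raise ValueError(f"expected a string key at offset {pos}")
        key, pos = _parse_string(text, pos + 1)
        pos = _skip_ws(text, pos)
        if text[pos] != ":":             # ':' : expect the member value
            raise ValueError(f"expected ':' at offset {pos}")
        value, pos = _parse_value(text, _skip_ws(text, pos + 1))
        obj[key] = value
        pos = _skip_ws(text, pos)
        if text[pos] == ",":             # ',' : expect the next key/value pair
            pos += 1
        elif text[pos] == "}":
            return obj, pos + 1
        else:
            raise ValueError(f"expected ',' or '}}' at offset {pos}")
```

Calling `parse_json('{"name": "Mary", "age": 18}')` returns a dictionary with the two members. Each helper returns the parsed value together with the position just past it, which is what lets the scan proceed in a single pass.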
Protobuf
Official website: developers.google.com/protocol-bu…
Example:
message Person {
required string name = 1;
required int32 id = 2;
optional string email = 3;
}
T-L-V data storage mode
Definition: Tag-Length-Value, that is, identifier, length (optional), and field value
Advantages:
- No delimiter is needed to separate fields
- Compact storage
- A field that has no value set needs no encoding; it takes its default value when decoded
Encoding principle
- Protocol Buffer encodes each field of the message and stores the data in T-L-V form, producing a binary byte stream
- Protocol Buffer uses a different serialization method for each data type (the figure listing them is not reproduced here)
Note: Varint-encoded data does not store a byte length (Length), so in practice Protocol Buffer stores such data in T-V form
Varint encoding
Definition: a variable-length encoding for integers
Encoding steps:
- Take the lowest 7 bits of the value
  - If this is the last group, set the highest bit to 0 to form one byte
  - Otherwise, set the highest bit to 1 to form one byte
- Shift the value right by 7 bits and take the next lowest 7 bits, repeating until no bits remain
- Concatenate the bytes in the order they were formed
When decoding, a byte whose highest bit is 0 is the last byte of the Varint
Effect: the smaller the value, the fewer bytes are needed to represent it
E.g. any number less than 128 fits in 1 byte, whereas a fixed-width int32 always takes 4 bytes.
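The Varint steps can be sketched in a few lines of Python. The function names `encode_varint` and `decode_varint` are hypothetical, and the sketch only handles non-negative integers.

```python
def encode_varint(value):
    """Encode a non-negative integer as a Varint, lowest 7 bits first."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F           # take the lowest 7 bits
        value >>= 7                   # shift the remaining bits right by 7
        if value:
            out.append(low7 | 0x80)   # more bytes follow: highest bit is 1
        else:
            out.append(low7)          # last byte: highest bit is 0
            return bytes(out)


def decode_varint(data, pos=0):
    """Decode one Varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:        # highest bit 0 marks the last byte
            return result, pos
        shift += 7
```

With this sketch, 104 encodes to the single byte 0x68, while 296 encodes to the two bytes 0xA8 0x02: the low seven bits `0101000` with the continuation bit set, followed by the remaining bits `10`.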
Illustration (figure not reproduced): Varint encoding of 296 (left) and 104 (right), followed by decoding.
Disadvantage: Varint treats a negative number as a very large unsigned integer (its highest bit is 1), so a negative number always takes the maximum number of bytes
Solution: Protocol Buffer defines the sint32 / sint64 types for negative numbers. They first apply Zigzag encoding (which converts signed numbers to unsigned numbers) and then Varint encoding, which reduces the number of encoded bytes
Zigzag encoding
Definition: an encoding that maps signed integers to unsigned integers
Principle: use unsigned numbers to represent signed numbers;
Effect: numbers with a small absolute value can be represented with fewer bytes.
sint32 encoding:
(n << 1) ^ (n >> 31)
- Shift the binary representation left by 1 bit (the whole value moves left and the low bit is filled with 0)
- Shift the binary representation right by 31 bits. This is an arithmetic right shift: for a negative number (sign bit 1) the vacated high bits are filled with 1, and for a non-negative number (sign bit 0) they are filled with 0
- XOR the two results
For sint64, the right shift is 63 bits instead of 31
Note: in two's complement, a negative number is formed by inverting every bit of its absolute value and then adding 1, which leaves the sign bit set to 1
Decoding: (n >>> 1) ^ -(n & 1)
Note: >>> is the unsigned (logical) right shift
Example: for -2, (n << 1) is ...11111100 and (n >> 31) is ...11111111; their XOR is 00000011, so -2 is encoded as 3
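A minimal sketch of the sint32 mapping in Python follows. Python integers are unbounded, so the encoder masks the result to 32 bits to mimic a fixed-width sint32; the function names are hypothetical.

```python
def zigzag_encode32(n):
    """Map a signed 32-bit integer to an unsigned one: (n << 1) ^ (n >> 31)."""
    # Python's >> on a negative int is an arithmetic shift, as the formula needs.
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def zigzag_decode32(z):
    """Inverse mapping: (z >>> 1) ^ -(z & 1)."""
    # z is non-negative here, so Python's >> acts as a logical right shift.
    return (z >> 1) ^ -(z & 1)
```

The mapping interleaves negative and positive values: 0 maps to 0, -1 to 1, 1 to 2, -2 to 3, and so on, which is why small absolute values stay small before Varint encoding is applied.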
T-V storage
Protocol Buffer applies Varint and Zigzag encoding and stores the data in T-V form.
Tag: identifies the message field
- Stores the field's identification number (field_number) and data type (wire_type), that is
Tag = (field_number << 3) | wire_type
field_number: the identification number assigned to the message field in the .proto file
wire_type: a value from 0 to 5, which needs only three bits
enum WireType {
  WIRETYPE_VARINT = 0,
  WIRETYPE_FIXED64 = 1,
  WIRETYPE_LENGTH_DELIMITED = 2,
  WIRETYPE_START_GROUP = 3,
  WIRETYPE_END_GROUP = 4,
  WIRETYPE_FIXED32 = 5
};
- Occupies one byte (a field number of 16 or above needs at least one more byte)
- When decoding, Protocol Buffer uses the Tag to match each value to a field of the message
e.g.:
message person
{
  required int32 id = 1;     // wire_type = 0, field_number = 1
  required string name = 2;  // wire_type = 2, field_number = 2
}
// If a Tag's binary form is 0001 0010:
// field_number = Tag >> 3 (shift right by 3 bits) = 0000 0010 = 2
// wire_type = lowest three bits = 010 = 2
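The Tag arithmetic can be checked with a short sketch. The helper names `make_tag` and `split_tag` are hypothetical; the wire type constants are taken from the WireType enum above.

```python
WIRETYPE_VARINT = 0
WIRETYPE_LENGTH_DELIMITED = 2


def make_tag(field_number, wire_type):
    """Pack a field number and a wire type into one Tag value."""
    return (field_number << 3) | wire_type


def split_tag(tag):
    """Recover (field_number, wire_type) from a Tag value."""
    return tag >> 3, tag & 0b111


id_tag = make_tag(1, WIRETYPE_VARINT)               # int32 id = 1
name_tag = make_tag(2, WIRETYPE_LENGTH_DELIMITED)   # string name = 2
```

Here `name_tag` is 0b00010010, the value used in the example, and `split_tag` recovers field number 2 and wire type 2 from it. A field number of 15 still gives a Tag below 128, while 16 gives exactly 128, which no longer fits in a single Varint byte.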
Value:
The value of the message field, encoded by Protocol Buffer with Varint or Zigzag.
Example:
message Test {
  required int32 id1 = 1;
  required int32 id2 = 2;
}
// Set id1 to 300
test.setId1(300);
// Set id2 to 296
test.setId2(296);
// Resulting binary byte stream (shown as signed bytes) = [8, -84, 2, 16, -88, 2]
The coding process is as follows:
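As a sketch of that process, the snippet below hand-encodes the two int32 fields (id1 = 300, id2 = 296) as Tag followed by Varint value. The helper names are hypothetical, and the code does not use the protobuf library; it only reproduces the byte layout described in this section.

```python
def encode_varint(value):
    """Encode a non-negative integer as a Varint, lowest 7 bits first."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)   # continuation bit set
        else:
            out.append(low7)          # last byte
            return bytes(out)


def encode_int32_field(field_number, value):
    """Encode one non-negative int32 field as Tag + Value (T-V)."""
    tag = (field_number << 3) | 0     # wire_type 0 = Varint
    return encode_varint(tag) + encode_varint(value)


message = encode_int32_field(1, 300) + encode_int32_field(2, 296)

unsigned_bytes = list(message)
# Java prints bytes as signed values, so values above 127 appear negative.
signed_bytes = [b - 256 if b > 127 else b for b in message]
```

`unsigned_bytes` is `[8, 172, 2, 16, 168, 2]`, and `signed_bytes` is `[8, -84, 2, 16, -88, 2]`, matching the byte stream quoted in the example: 8 and 16 are the two Tags, and each is followed by a two-byte Varint.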
Encoding of floating point numbers
Encoding 64-bit (32-bit) floating-point numbers is simple: the encoded data has a fixed size of 64 bits (8 bytes) or 32 bits (4 bytes)
Data is stored in T-V mode, as above.
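A sketch of the fixed-size case using Python's `struct` module follows. Protobuf writes fixed-width values in little-endian byte order; the helper names are hypothetical, and the single-byte Tag assumes a field number below 16.

```python
import struct

WIRETYPE_FIXED64 = 1
WIRETYPE_FIXED32 = 5


def encode_double_field(field_number, value):
    """Tag (wire type 1) followed by 8 little-endian bytes."""
    tag = (field_number << 3) | WIRETYPE_FIXED64   # assumes field_number < 16
    return bytes([tag]) + struct.pack("<d", value)


def encode_float_field(field_number, value):
    """Tag (wire type 5) followed by 4 little-endian bytes."""
    tag = (field_number << 3) | WIRETYPE_FIXED32   # assumes field_number < 16
    return bytes([tag]) + struct.pack("<f", value)


double_field = encode_double_field(1, 1.5)
float_field = encode_float_field(2, 1.5)
```

Whatever the value, `double_field` is always 9 bytes long and `float_field` is always 5 bytes long: one Tag byte plus the fixed-size payload, with no Length part.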
Wire Type = 2
Data storage mode: T-L-V
The Tag is encoded as above
Value can hold three kinds of data:
- String type
- Nested message type (Message): the V of the outer field consists of the encoded fields of the nested message
- Packed repeated fields: the elements of a repeated field are packed into a single T-L-V so that the Tag is not repeated for every element
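The T-L-V layout for wire type 2 can be sketched as Tag, then the payload length as a Varint, then the payload itself. The helper names are hypothetical, and the field numbers in the usage lines are chosen only for illustration.

```python
WIRETYPE_LENGTH_DELIMITED = 2


def encode_varint(value):
    """Encode a non-negative integer as a Varint, lowest 7 bits first."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def encode_length_delimited(field_number, payload):
    """Tag, then the payload length as a Varint, then the payload bytes."""
    tag = (field_number << 3) | WIRETYPE_LENGTH_DELIMITED
    return encode_varint(tag) + encode_varint(len(payload)) + payload


def encode_string_field(field_number, text):
    return encode_length_delimited(field_number, text.encode("utf-8"))


def encode_packed_field(field_number, values):
    """Packed repeated field: one Tag and one Length for all elements."""
    payload = b"".join(encode_varint(v) for v in values)
    return encode_length_delimited(field_number, payload)


name_field = encode_string_field(2, "yano")
packed_field = encode_packed_field(4, [3, 270, 86942])
```

A nested message uses the same helper: its already-encoded fields become the payload, for example `encode_length_delimited(3, name_field)`. In `packed_field` the Tag 0x22 appears once for all three elements, which is the saving a packed repeated field provides.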
Summary
Application scenarios: data storage or RPC data exchange where the amount of data transferred is small and the network is unstable, for example instant messaging (IM)
Note: protobuf is not well suited to storing big data, mainly because repeating the Tag for every record is unnecessary overhead there. See Avro below for a solution
Advantages:
- Serialized data is very compact, about 1/3 to 1/10 the size of the equivalent XML
- Parsing is very fast, about 20 to 100 times faster than the equivalent XML
- Standard IDL and IDL compiler, very engineer friendly
- Cross-platform and cross-language
- Some obscurity: an HTTP packet capture shows only raw bytes (this is obscurity, not real encryption)
- Provides a validation mechanism and is easy to extend
Disadvantages:
- Not human readable
- Less general-purpose; mainly used for internal transmission
- Not self-describing; the .proto file is needed to understand the data structure
Thrift
Website: thrift.apache.org
Thrift request response model:
A Message and a Struct can be likened to the header and the payload in TCP: the Message carries the metadata of a transmission, and the Struct carries the data payload.
Message:
- Name: the name of the invoked method
- Message Type: one of Call, OneWay, Reply and Exception. Only the type ID is transmitted; the type IDs are:
Call ---> 1 Reply ---> 2 Exception ---> 3 OneWay ---> 4
Call and OneWay are used in requests; Reply and Exception are used in responses.
The four types mean the following:
  - Call: invokes a remote method and expects a response.
  - OneWay: invokes a remote method without expecting a response, so steps 3 and 4 of the request/response model do not occur.
  - Reply: processing completed and the response is returned normally.
  - Exception: a processing error occurred.
- Sequence ID: the sequence number, a signed four-byte integer. All outstanding requests on one transport-layer connection must have unique sequence numbers; the client uses them to match responses to requests when responses arrive out of order. The server neither checks the sequence number nor depends on it logically; it simply echoes it back in the response. Note that the Thrift sequence number is different from the unique ID commonly used to prevent non-idempotent requests from being submitted more than once.
Struct:
Example:
struct Person {
1: required i32 age;
2: required string name;
}
Thrift supports multiple serialization protocols, such as Binary, Compact, and JSON.
Binary serialization
Message
Message is encoded in two ways:
The first is the strict encoding
Binary protocol Message, strict encoding, 12+ bytes:
+--------+--------+--------+--------+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+
|1vvvvvvv|vvvvvvvv|unused  |00000mmm| name length                       | name                | seq id                            |
+--------+--------+--------+--------+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+
- vvvvvvvvvvvvvvv is the version, an unsigned 15 bit number fixed to 1 (in binary: 000 0000 0000 0001). The leading bit is 1.
- unused is an ignored byte.
- mmm is the message type, an unsigned 3 bit integer. The 5 leading bits must be 0 as some clients (checked for java in 0.9.1) take the whole byte.
- name length is the byte length of the name field, a signed 32 bit integer encoded in network (big endian) order (must be >= 0).
- name is the method name, a UTF-8 encoded string.
- seq id is the sequence id, a signed 32 bit integer encoded in network (big endian) order.
The second is the non-strict (old) encoding
Binary protocol Message, old encoding, 9+ bytes:
+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+--------+
| name length                       | name                |00000mmm| seq id                            |
+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+--------+
Where name length, name, mmm and seq id are as above.
Because name length must be positive (therefore the first bit is always 0), the first bit allows the receiver to see whether the strict format or the old format is used. Therefore a server and client using the different variants of the binary protocol can transparently talk with each other. However, when strict mode is enforced, the old format is rejected.
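The strict layout can be sketched with Python's `struct` module. This is a hand-rolled illustration, not the Thrift library API: the function names are hypothetical, and the method name "ping" is made up for the example.

```python
import struct

# Leading bit 1, then the 15-bit version number 1, then the unused byte.
VERSION_1 = 0x80010000
CALL, REPLY, EXCEPTION, ONEWAY = 1, 2, 3, 4


def encode_message_header(name, message_type, seq_id):
    """Strict encoding: version and type word, name length, name, seq id."""
    name_bytes = name.encode("utf-8")
    first_word = VERSION_1 | (message_type & 0x07)   # mmm in the low 3 bits
    return (
        struct.pack(">I", first_word)
        + struct.pack(">i", len(name_bytes))         # big endian, signed
        + name_bytes
        + struct.pack(">i", seq_id)
    )


def is_strict(data):
    """The first bit is 1 for the strict format and 0 for the old format."""
    return (data[0] & 0x80) != 0


header = encode_message_header("ping", CALL, 1)
```

For a four-character method name the header is 16 bytes: the fixed 12 bytes plus the name. An old-format message starts with the non-negative name length, so its first bit is 0 and `is_strict` returns False for it.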
The four Message types have the following IDs:
- Call: 1
- Reply: 2
- Exception: 3
- Oneway: 4
Struct
Type name | IDL type name | Occupied bytes | Type ID |
---|---|---|---|
byte | byte | 1 | 3 |
short | i16 | 2 | 6 |
int | i32 | 4 | 8 |
bool | bool | 1 | 2 |
long | i64 | 8 | 10 |
double | double | 8 | 4 |
string | string | 4+N | 11 |
[]byte | binary | 4+N | |
list | list | 1+4+N*X | 15 |
set | set | 1+4+N*X | 14 |
map | map | 1+1+4+NX+NY | 13 |
field | | 1+2+X | |
struct | struct | N*X | 12 |
enum | | | |
union | | | |
exception | | | |
Fixed-length encoding: bool, byte, short, int, long and double are all encoded with a fixed number of bytes
struct ::= ( field-header field-value )* stop-field
field-header ::= field-type field-id
- field-type: the type of the field, a signed 8 bit integer (the Type ID in the table above).
- field-id: the field id, a signed 16 bit integer in big endian order.
- field-value: the encoded field value.
- stop-field: 00000000, which marks the end of a struct
Length prefix encoding (4+N) :
+--------+----------+
|size(4) |content(N)|
+--------+----------+
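The length-prefix layout can be sketched as a 4-byte big-endian size followed by the content. The function names are hypothetical; this is an illustration of the layout, not the Thrift library API.

```python
import struct


def encode_length_prefixed(content):
    """4-byte signed big-endian size, followed by the content bytes."""
    return struct.pack(">i", len(content)) + content


def decode_length_prefixed(data, pos=0):
    """Read one length-prefixed value; return (content, next position)."""
    (size,) = struct.unpack_from(">i", data, pos)
    start = pos + 4
    return data[start:start + size], start + size


encoded = encode_length_prefixed("yano".encode("utf-8"))
```

For the string "yano", `encoded` is the four size bytes `00 00 00 04` followed by the four UTF-8 bytes of the text, 4+N bytes in total.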
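Map encoding (1+1+4+NX+NY): 1 byte for the key type, 1 byte for the value type, 4 bytes for the number of entries, followed by the N key/value pairs
List and set encoding (1+4+N*X): 1 byte for the element type, 4 bytes for the number of elements, followed by the N elements
Note: all keys share one type, and all values share one type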
Compact serialization
Similar to Binary serialization, except that Zigzag and Varint are used to compress integer types. Zigzag and Varint are described in the Protobuf section.
Data example (encoded with the Binary protocol, as the fixed 4-byte integers show): Person(age:18, name:yano)
Generate: [8, 0, 1, 0, 0, 0, 18, 11, 0, 2, 0, 0, 0, 4, 121, 97, 110, 111, 0]
Explanation:
8 // data type is i32
0, 1 // field ID is 1
0, 0, 0, 18 // value of field 1 (age), 4 bytes
11 // data type is string
0, 2 // field ID is 2 (name)
0, 0, 0, 4 // length of the string name, 4 bytes
121, 97, 110, 111 // the 4 ASCII codes of "yano" (UTF-8 encoding)
0 // end (stop field)
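The byte list for Person(age:18, name:yano) can be reproduced with a short hand-written encoder that follows the struct grammar (field header, field value, stop field). The function names are hypothetical, and the code does not use the Thrift library.

```python
import struct

# Type IDs from the Struct table.
T_STOP, T_I32, T_STRING = 0, 8, 11


def field_header(field_type, field_id):
    """1-byte field type followed by a 2-byte big-endian field id."""
    return struct.pack(">bh", field_type, field_id)


def encode_person(age, name):
    """Encode the two Person fields, then the stop field."""
    name_bytes = name.encode("utf-8")
    out = field_header(T_I32, 1) + struct.pack(">i", age)
    out += field_header(T_STRING, 2) + struct.pack(">i", len(name_bytes)) + name_bytes
    out += bytes([T_STOP])
    return out


encoded = list(encode_person(18, "yano"))
```

`encoded` equals `[8, 0, 1, 0, 0, 0, 18, 11, 0, 2, 0, 0, 0, 4, 121, 97, 110, 111, 0]`, the same 19 bytes as in the data example.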
Avro
Official website: avro.apache.org/docs/curren…
Overview: Avro began as a subproject of Hadoop and is also an independent Apache project. It is a high-performance middleware based on binary data transfer, designed to support data-intensive applications and suited to remote or local storage and exchange of large-scale data.
Features:
- Rich data structure types;
- A fast, compressible binary data format; binary serialization saves storage space and network bandwidth;
- A container file for persistent data;
- Support for remote procedure calls (RPC);
- Simple integration with dynamic languages.
Avro relies on schemas. Data is always read and written together with its schema, so no per-value overhead has to be written, which makes serialization fast and the output small. Because the data and its schema are self-describing, Avro works well with dynamic scripting languages. When Avro data is stored in a file, its schema is stored with it, so any program can process the file.
Data structure:
The schema concept is comparable to schema definitions in Mongoose for Node.
Storage mode:
Container file structure:
Comparison and application scenarios
- JSON suits HTTP-based projects that have no extreme performance requirements and value easy debugging, e.g. web platforms;
- Protobuf (PB) is cross-platform, fast to parse, compact when serialized, highly extensible and easy to use. It suits scenarios that transfer small amounts of data and demand low latency and high speed, e.g. real-time communication;
- Avro suits dynamic-language scenarios and the transmission and storage of big data;
- Thrift is a complete framework, not just a serialization solution; its advantages are broad language support and relative maturity.
Parsing performance: (figure not reproduced)
Serialization space overhead: (figure not reproduced)