Summary: FlatBuffers is an open-source, cross-platform, efficient serialization library with interfaces for multiple languages. It implements a serialization format similar to Protocol Buffers. It was written primarily by Wouter van Oortmerssen and open-sourced by Google. Amap uses FlatBuffers in its data compilation and incremental publishing, and this article shares the principles of FlatBuffers based on that experience.

Source | Ali Technology official account

One Preface

FlatBuffers is an open-source, cross-platform, efficient serialization library with interfaces for multiple languages. It implements a serialization format similar to Protocol Buffers. It was written primarily by Wouter van Oortmerssen and open-sourced by Google. Van Oortmerssen originally developed FlatBuffers for Android games and performance-focused applications; it now has interfaces for C++, C#, C, Go, Java, PHP, Python, and JavaScript.

FlatBuffers is used in Amap data compilation and incremental publishing, and its principles are studied and shared here. This article briefly introduces the FlatBuffers schema and then, by analyzing how FlatBuffers serializes and deserializes data, answers the following questions:

  • Question 1: How does FlatBuffers deserialize so quickly (or rather, read data without decoding)?
  • Question 2: How do default values in FlatBuffers take up no storage space (for fields in a table)?
  • Question 3: How does FlatBuffers perform byte alignment?
  • Question 4: How does FlatBuffers achieve forward and backward compatibility (structs excepted)?
  • Question 5: Does FlatBuffers require fields to be added in a particular order?
  • Question 6: How does FlatBuffers automatically generate the encoding and decoding code from the schema?
  • Question 7: How does FlatBuffers automatically generate JSON from the schema?

Two FlatBuffers Schema

FlatBuffers defines data structures in schema files. Schema definitions are straightforward, similar to the Interface Description Language (IDL) used by other frameworks, and the syntax resembles that of C-family languages. (Although FlatBuffers has its own interface definition language for describing the data to serialize, it also supports the .proto format of Protocol Buffers.) The monster.fbs file from the official tutorial serves as the example:

// Example IDL file for our monster's schema.
namespace MyGame.Sample;
enum Color:byte { Red = 0, Green, Blue = 2 }
union Equipment { Weapon } // Optionally add more tables.
struct Vec3 {
  x:float;
  y:float;
  z:float;
}
table Monster {
  pos:Vec3;
  mana:short = 150;
  hp:short = 100;
  name:string;
  friendly:bool = false (deprecated);
  inventory:[ubyte];
  color:Color = Blue;
  weapons:[Weapon];
  equipped:Equipment;
  path:[Vec3];
}
table Weapon {
  name:string;
  damage:short;
}
root_type Monster;

namespace MyGame.Sample;

namespace defines a namespace. Nested namespaces are written with a dot (.) as the separator.

enum Color:byte { Red = 0, Green, Blue = 2 };

enum defines an enumeration type. Unlike an ordinary enumeration, it lets you specify the underlying type; here Color is stored as a byte. Enum values can only be added, never deprecated.

union Equipment {Weapon} // Optionally add more tables

A union is similar to the concept in C/C++: several types can be placed in one union and share a single memory area. Use is mutually exclusive, meaning that the memory area holds only one of the types at a time, which saves memory compared with storing every alternative. A union is declared much like an enum, except that its members are table names rather than named constants. A union can only be a field of a table; it cannot be the root type.

struct Vec3 { x:float; y:float; z:float; }

In a struct, all fields are required, so there are no default values. Fields cannot be added or deprecated, and a struct can contain only scalars or other structs. Structs are meant for data structures that will not change; they use less memory than tables and are faster to access, because a struct is stored inline in its parent table and needs no vtable.

table Monster{};

table is the primary way of defining an object in FlatBuffers. It consists of a name (here, Monster) and a list of fields, and it can contain all the types defined above. Each field consists of three parts: name, type, and default value. If no default is written explicitly, it is 0 or null. No field is required, so each object can omit whichever fields it likes; this is the mechanism behind the forward and backward compatibility of FlatBuffers.

root_type Monster;

root_type specifies the root table of the serialized data.

Points that need special attention when designing a schema:

  • New fields can only be added at the end of a table. Old code ignores the new field and keeps working. New code reading old data gets the default value for the new field.
  • Fields cannot be deleted from the schema even if they are no longer used. Mark them as deprecated instead; no accessors are generated for a deprecated field.
  • If you need a nested vector, wrap the inner vector in a table. Strings are UTF-8; for other encodings, use [byte] or [ubyte] instead.

Three Serialization of FlatBuffers

Simply put, FlatBuffers stores object data in a one-dimensional array held in a ByteBuffer. Each object is divided into two parts: a metadata part, which stores the index, and a real-data part, which stores the actual values. Unlike most in-memory data structures, FlatBuffers uses strict alignment rules and a fixed byte order to keep buffers cross-platform. In addition, for table objects it provides forward/backward compatibility and optional fields to support the evolution of most formats. Besides parsing efficiency, a binary format has another advantage: the binary representation of data is usually more compact. For example, a 10-digit integer such as 1234567890 fits in a 4-byte unsigned integer instead of 10 characters.

Basic principles FlatBuffers follows when serializing:

  • Little-endian mode. FlatBuffers stores all basic data types in little-endian order, because this matches most current processors and speeds up reading and writing.
  • Data is written in the opposite direction from which it is read.

FlatBuffers writes data into the ByteBuffer from the tail toward the head. Because this is the reverse of the ByteBuffer's default growth direction, FlatBuffers cannot rely on the ByteBuffer's position to mark where the valid data is. Instead, it maintains a space variable for that purpose; pay special attention to how this variable grows when analyzing FlatBufferBuilder. Parsing, by contrast, proceeds through the ByteBuffer in its normal order. The benefit is that, when the buffer is parsed from left to right, the summary information (for example, the vtable of a table) is read first.
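To make the two rules above concrete, here is a minimal sketch of a writer that fills a buffer from the tail toward the head and stores 16-bit values in little-endian order. BackwardWriter, PushU16 and ReadU16 are hypothetical names invented for this illustration; this is not the FlatBufferBuilder API, only the same idea in miniature.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical mini-builder: fills a fixed-size buffer from the tail toward
// the head and keeps its own count of used bytes.
class BackwardWriter {
 public:
  explicit BackwardWriter(std::size_t capacity) : buf_(capacity, 0), used_(0) {}

  // Writes a 16-bit value in little-endian order in front of the data
  // written so far.
  void PushU16(std::uint16_t v) {
    assert(used_ + 2 <= buf_.size());
    used_ += 2;
    std::uint8_t* p = buf_.data() + buf_.size() - used_;
    p[0] = static_cast<std::uint8_t>(v & 0xFF);         // low byte first
    p[1] = static_cast<std::uint8_t>((v >> 8) & 0xFF);  // high byte second
  }

  // Start of the valid data; everything before it is still unused.
  const std::uint8_t* Data() const {
    return buf_.data() + buf_.size() - used_;
  }
  std::size_t Size() const { return used_; }

 private:
  std::vector<std::uint8_t> buf_;
  std::size_t used_;
};

// Reads a little-endian 16-bit value in normal front-to-back order.
inline std::uint16_t ReadU16(const std::uint8_t* p) {
  return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}
```

Pushing 150 and then 100 leaves 100 at the front of the valid region, so a reader scanning forward meets the value written last first. This is why data that must be read first, such as a vtable, is written last.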

Serialization for each data type:

1 Scalar type

Scalar types are basic types such as int, double, bool, etc. Scalar types use direct addressing to access data.

Example: short mana = 150; 12 bytes, the storage structure is as follows:

Scalars defined in the schema can be given default values, and a default value in FlatBuffers occupies no storage space. For a scalar in a table, a value equal to the default is not stored: the field's offset in the vtable is set to 0, and the default is recorded in the generated decoding interface. When the decoder sees an offset of 0, it returns the default. Structs do not use a vtable, so their scalars have no default values and must always be stored (the serialization of struct and table types is explained below).

// Computes how many bytes you'd have to pad to be able to write an
// "scalar_size" scalar if the buffer had grown to "buf_size" (downwards in
// memory).
inline size_t PaddingBytes(size_t buf_size, size_t scalar_size) {
    return ((~buf_size) + 1) & (scalar_size - 1);
}

Scalars are aligned to their own size in bytes. The padding is computed by the PaddingBytes function, which every scalar write calls for byte alignment.
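A few worked values may help in reading the formula. The sketch below repeats PaddingBytes as a constexpr function so the results can be checked at compile time; SizeAfterPush is a hypothetical helper added here for illustration and is not part of FlatBuffers.

```cpp
#include <cassert>
#include <cstddef>

// Same formula as PaddingBytes above: the number of bytes needed to round
// buf_size up to a multiple of scalar_size (which must be a power of two).
constexpr std::size_t PaddingBytes(std::size_t buf_size,
                                   std::size_t scalar_size) {
  return ((~buf_size) + 1) & (scalar_size - 1);
}

static_assert(PaddingBytes(0, 4) == 0, "already aligned");
static_assert(PaddingBytes(5, 4) == 3, "5 + 3 = 8");
static_assert(PaddingBytes(6, 2) == 0, "6 is a multiple of 2");
static_assert(PaddingBytes(6, 8) == 2, "6 + 2 = 8");
static_assert(PaddingBytes(13, 8) == 3, "13 + 3 = 16");

// Hypothetical helper: buffer size after aligning for one scalar and then
// writing it.
inline std::size_t SizeAfterPush(std::size_t buf_size,
                                 std::size_t scalar_size) {
  assert(scalar_size != 0 && (scalar_size & (scalar_size - 1)) == 0);
  return buf_size + PaddingBytes(buf_size, scalar_size) + scalar_size;
}
```

The expression (~buf_size) + 1 is the two's-complement negation of buf_size, so the result is (-buf_size) mod scalar_size, which is exactly the distance to the next multiple.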

2 Struct type

Besides the basic types, struct is the only type in FlatBuffers accessed by direct addressing. FlatBuffers specifies that a struct stores data whose layout is fixed by convention and never changes. Once the structure is defined it stays as it is: no field is optional (so there are no default values), and fields cannot be added or deprecated, so structs provide no forward/backward compatibility. Under this rule, FlatBuffers can use direct addressing for structs to speed up access. Fields are stored in declaration order. A struct cannot be the root of a schema file.

Example: Vec3(16, 17, 18) occupies 12 bytes.

A struct defines a fixed memory layout in which every field is aligned to its own size and the struct itself is aligned to its largest scalar member.
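The fixed layout can be checked with an ordinary C++ struct. This is a sketch that assumes a platform with 4-byte floats; Vec3 here is a plain struct written by hand, not the class that flatc generates, and ReadZ is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hand-written stand-in for the schema's Vec3 (assumes 4-byte floats).
struct Vec3 {
  float x;
  float y;
  float z;
};

static_assert(sizeof(Vec3) == 12, "three 4-byte floats, no padding");
static_assert(alignof(Vec3) == 4, "aligned to its largest scalar member");
static_assert(offsetof(Vec3, y) == 4, "fields are stored in declaration order");
static_assert(offsetof(Vec3, z) == 8, "fields are stored in declaration order");

// Direct addressing: z always lives at the same fixed offset, so no vtable
// lookup is needed to find it.
inline float ReadZ(const std::uint8_t* struct_bytes) {
  assert(struct_bytes != nullptr);
  float z;
  std::memcpy(&z, struct_bytes + offsetof(Vec3, z), sizeof z);
  return z;
}
```

Because every offset is a compile-time constant, reading a struct field costs one memory access, which is the speed advantage over tables described above.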

3 Vector type

A vector is the array type declared in the schema. FlatBuffers has no separate class for it, but it has its own storage structure. During serialization, the elements are written from last to first, and the element count is written once the elements are in place. Because the buffer grows toward the head, the count ends up in front of the elements. The data storage structure is as follows:

Example: byte[] treasure = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};

The element count is a 32-bit int, so a vector is aligned to four bytes when its memory is requested.
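As a rough illustration of the resulting layout, the sketch below encodes a byte vector as a 32-bit little-endian element count followed by the elements, then reads it back. EncodeByteVector, VectorLength and VectorAt are hypothetical helpers; for simplicity the data is produced front to back, whereas the real builder writes it in reverse and adds alignment padding.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Lays out a byte vector as: 32-bit little-endian count, then the elements.
inline std::vector<std::uint8_t> EncodeByteVector(
    const std::vector<std::uint8_t>& items) {
  const std::uint32_t n = static_cast<std::uint32_t>(items.size());
  std::vector<std::uint8_t> out;
  for (int shift = 0; shift < 32; shift += 8) {
    out.push_back(static_cast<std::uint8_t>((n >> shift) & 0xFF));
  }
  out.insert(out.end(), items.begin(), items.end());
  return out;
}

// Reads the element count stored in the first four bytes.
inline std::uint32_t VectorLength(const std::uint8_t* vec) {
  return static_cast<std::uint32_t>(vec[0]) |
         (static_cast<std::uint32_t>(vec[1]) << 8) |
         (static_cast<std::uint32_t>(vec[2]) << 16) |
         (static_cast<std::uint32_t>(vec[3]) << 24);
}

// Element i sits right after the 4-byte count.
inline std::uint8_t VectorAt(const std::uint8_t* vec, std::uint32_t i) {
  assert(i < VectorLength(vec));
  return vec[4 + i];
}
```

For the ten-element treasure example this gives 4 + 10 = 14 bytes before any alignment padding.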

4 String type

FlatBuffers strings are encoded as UTF-8, and the encoded bytes are written as a one-dimensional vector. A string is essentially a vector of bytes, so it is created in much the same way as a vector, except that it is null-terminated with a trailing 0. A string is written with the following structure:

Example: String name = "Sword";

The length is a 32-bit int, so a string is aligned to four bytes when its memory is requested.
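The only difference from a byte vector is the trailing zero, which the following sketch makes visible. EncodeString and DecodeString are hypothetical helpers written for this article, not FlatBuffers functions, and alignment padding is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Lays out a string as: 32-bit little-endian length, the UTF-8 bytes, then a
// terminating 0 that is not counted in the length.
inline std::vector<std::uint8_t> EncodeString(const std::string& s) {
  const std::uint32_t n = static_cast<std::uint32_t>(s.size());
  std::vector<std::uint8_t> out;
  for (int shift = 0; shift < 32; shift += 8) {
    out.push_back(static_cast<std::uint8_t>((n >> shift) & 0xFF));
  }
  out.insert(out.end(), s.begin(), s.end());
  out.push_back(0);
  return out;
}

// Reads the length prefix and returns the bytes that follow it.
inline std::string DecodeString(const std::uint8_t* p) {
  const std::uint32_t n = static_cast<std::uint32_t>(p[0]) |
                          (static_cast<std::uint32_t>(p[1]) << 8) |
                          (static_cast<std::uint32_t>(p[2]) << 16) |
                          (static_cast<std::uint32_t>(p[3]) << 24);
  assert(p[4 + n] == 0);  // the terminator follows the last character
  return std::string(reinterpret_cast<const char*>(p + 4), n);
}
```

With "Sword" the encoded form is 4 + 5 + 1 = 10 bytes, and the bytes after the length prefix can also be handed to C APIs as an ordinary null-terminated string.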

5 Union type

The union type is special, and FlatBuffers places two restrictions on its use:

  • Members of a union can only be tables.
  • A union cannot be the root of a schema file.

FlatBuffers has no dedicated union type; instead, a separate class is generated for each member type of the union. The main difference from other types is that the member type must be specified first. When a union is serialized, its type is written first, followed by the offset of its data. When it is deserialized, the type is parsed first, and the data is then resolved as the table that corresponds to that type.
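The "type first, then data" rule can be pictured as a tag plus an untyped reference. The sketch below is an in-memory analogy only: EquipmentType, EquippedField and AsWeapon are hypothetical names, Weapon is a plain struct standing in for the table, and a raw pointer stands in for the offset stored in the buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical type tag for the union; NONE means no member is stored.
enum class EquipmentType : std::uint8_t { NONE = 0, Weapon = 1 };

// Plain struct standing in for the Weapon table.
struct Weapon {
  std::string name;
  std::int16_t damage;
};

// What a table holds for a union field in this sketch: the tag first, then a
// reference to the value (a pointer here, an offset in the real buffer).
struct EquippedField {
  EquipmentType type;
  const void* value;
};

// Reads the tag first and only then interprets the value as that type.
inline const Weapon* AsWeapon(const EquippedField& field) {
  assert(field.type != EquipmentType::NONE || field.value == nullptr);
  return field.type == EquipmentType::Weapon
             ? static_cast<const Weapon*>(field.value)
             : nullptr;
}
```

A reader that does not recognize the tag can skip the value safely, since it never has to interpret bytes whose type it does not know.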

6 Enum type

An enum in FlatBuffers is stored the same way as its underlying type, byte in the example. As with union, FlatBuffers has no dedicated enum type; each enum declared in the schema is compiled into its own generated class.

  • An enum cannot be the root of a schema file.

7 Table type

Table is the cornerstone of FlatBuffers. To cope with changes to the data structure, a table accesses its fields indirectly through a vtable. Each table comes with a vtable, which can be shared among tables with the same layout and which records the fields stored by this particular kind of table instance. The vtable may also indicate that a field is absent (because the buffer was written by an older version of the code, or because the value is not needed for this instance or the field is deprecated), in which case the default value is returned.

Tables have a small memory overhead (vtables are small and shared) and a small access cost (one indirection), but they provide a great deal of flexibility. In special cases a table may even use less memory than the equivalent struct, because fields equal to their default value are not stored in the buffer. This structure means that members of complex types are accessed by relative addressing: the member's offset is first read from the table, and the real data is then read from the address that offset points to.

Structurally, a table has two parts. The first is a summary of the fields stored in the table, called the vtable; the second is the data part, called table_data, which stores the values of the table's members. If a member is a simple type or a struct, its value is stored directly in table_data. If the member is a complex type, table_data stores only the offset of the member's data relative to the position where the offset is written, so getting the real data requires one more relative lookup.

  • The vtable is an array of shorts, (number of fields + 2) * 2 bytes long. The first entry is the size of the vtable, including the entry itself. The second entry is the size of the object that the vtable describes, including the offset to the vtable. The remaining entries give each field's offset relative to the start of the object.
  • table_data begins with a signed int offset that locates the vtable relative to the start of the current table object. Since the vtable can be anywhere, this value can be negative. Because table_data begins with this int, it is four-byte aligned.
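The vtable layout described in the first bullet can be written down in a few lines. BuildVTable is a hypothetical helper and the field offsets are made-up example values; the sketch only shows how the entries are arranged, with 0 marking a field that is not stored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds the vtable entries: [vtable size in bytes, object size in bytes,
// one offset per field]. An offset of 0 means the field is not stored.
inline std::vector<std::uint16_t> BuildVTable(
    std::uint16_t object_size,
    const std::vector<std::uint16_t>& field_offsets) {
  const std::size_t bytes = (field_offsets.size() + 2) * sizeof(std::uint16_t);
  assert(bytes <= 0xFFFF);  // the size itself must fit in a 16-bit entry
  std::vector<std::uint16_t> vtable;
  vtable.push_back(static_cast<std::uint16_t>(bytes));
  vtable.push_back(object_size);
  vtable.insert(vtable.end(), field_offsets.begin(), field_offsets.end());
  return vtable;
}
```

For a three-field table whose second field was left at its default, BuildVTable(12, {4, 0, 8}) yields the entries 10, 12, 4, 0, 8: the vtable is (3 + 2) * 2 = 10 bytes long and describes a 12-byte object.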

The add operation appends to table_data. Because a table is stored through the vtable/table_data mechanism, fields may be added in any order. The vtable records each field's offset relative to the start of the object, in the order defined by the schema, so the correct value can be found by its offset no matter what order the fields were added in. Note that FlatBuffers performs byte alignment each time a field is added.

std::string e_poiId = "1234567890";
double e_coord_x = 0.1;
double e_coord_y = 0.2;
int e_minZoom = 10;
int e_maxZoom = 200;

// Create the string first; nested objects cannot be built while a table is open.
auto nameData = flatBufferBuilder.CreateString(e_poiId);

// add (featureBuilder is the schema-generated table builder, created after nameData)
featureBuilder.add_poiId(nameData);
featureBuilder.add_x(e_coord_x);
featureBuilder.add_y(e_coord_y);
featureBuilder.add_maxZoom(e_maxZoom);
featureBuilder.add_minZoom(e_minZoom);
auto rootData = featureBuilder.Finish();
flatBufferBuilder.Finish(rootData);
blob = flatBufferBuilder.GetBufferPointer();
blobSize = flatBufferBuilder.GetSize();

Add order 1: Final binary size is 72 bytes.

std::string e_poiId = "1234567890";
double e_coord_x = 0.1;
double e_coord_y = 0.2;
int e_minZoom = 10;
int e_maxZoom = 200;

// Create the string first; nested objects cannot be built while a table is open.
auto nameData = flatBufferBuilder.CreateString(e_poiId);

// add (same fields, but an int is placed between the two doubles)
featureBuilder.add_poiId(nameData);
featureBuilder.add_x(e_coord_x);
featureBuilder.add_minZoom(e_minZoom);
featureBuilder.add_y(e_coord_y);
featureBuilder.add_maxZoom(e_maxZoom);
auto rootData = featureBuilder.Finish();
flatBufferBuilder.Finish(rootData);
blob = flatBufferBuilder.GetBufferPointer();
blobSize = flatBufferBuilder.GetSize();

Add order 2: The final binary size is 80 bytes.

The schema files for add order 1 and add order 2 are identical, and so is the data they express. So does the table structure impose any order on adding fields? Not for correctness, but the serialized sizes differ by 8 bytes because of byte alignment. When adding fields, therefore, try to add fields of the same type together to avoid unnecessary alignment padding and obtain a smaller result.
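The effect of the add order can be reproduced with a simplified model that only counts field bytes and alignment padding. It assumes a 4-byte string offset, 8-byte doubles and 4-byte ints, and starts from an empty buffer. SizeAfterAdds, AddOrder1 and AddOrder2 are hypothetical helpers; the model does not reproduce the 72 and 80 byte totals, because the real builder also writes the string, the vtable and the root offset and aligns the finished buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>

// Padding needed to round buf_size up to a multiple of scalar_size.
constexpr std::size_t PaddingBytes(std::size_t buf_size,
                                   std::size_t scalar_size) {
  return ((~buf_size) + 1) & (scalar_size - 1);
}

// Bytes used after pushing fields of the given sizes in order, each aligned
// to its own size, starting from an empty buffer.
inline std::size_t SizeAfterAdds(std::initializer_list<std::size_t> sizes) {
  std::size_t used = 0;
  for (std::size_t size : sizes) {
    assert(size != 0 && (size & (size - 1)) == 0);  // power of two
    used += PaddingBytes(used, size);
    used += size;
  }
  return used;
}

// Order 1: poiId offset (4), x (8), y (8), maxZoom (4), minZoom (4).
inline std::size_t AddOrder1() { return SizeAfterAdds({4, 8, 8, 4, 4}); }

// Order 2: poiId offset (4), x (8), minZoom (4), y (8), maxZoom (4).
inline std::size_t AddOrder2() { return SizeAfterAdds({4, 8, 4, 8, 4}); }
```

In this model order 1 uses 32 bytes and order 2 uses 36: placing an int between the two doubles forces an extra 4 bytes of padding before the second double.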

The forward and backward compatibility of FlatBuffers applies to tables. Every field in a table has a default value, which is 0 or null if not written explicitly. No field is required, so each object can omit whichever fields it likes; this is the mechanism behind the forward and backward compatibility. Note that:

  • New fields can only be added at the end of a table. Old code ignores the new field and keeps working. New code reading old data gets the default value for the new field.
  • Fields cannot be removed from the schema even if they are no longer used. Mark them as deprecated instead; no interface is generated for a deprecated field.

Four Deserialization of FlatBuffers

Deserialization in FlatBuffers is straightforward. Because the offset of each field is saved during serialization, deserializing simply means reading data at the specified offset. The process starts at the root table and follows offsets through the binary stream: read the field's offset from the vtable, then find the field in the corresponding object. For a reference type (string, vector, or table), read the offset stored there, follow it to the value, and read that. For a non-reference type, read the value at the position given by the vtable offset. For scalars there are two cases. A field left at its default value is not stored, and the reader returns the default recorded in the code generated by flatc. A field with a non-default value has both its offset and its value stored in the binary stream, so the value is read directly at that offset.
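The read path can be traced on a tiny hand-assembled buffer. The sketch assumes the convention of the C++ runtime, where the vtable is found by subtracting the signed offset stored at the start of the table. GetShortField and ExampleBuffer are hypothetical, and the 20-byte buffer describes an invented two-field table (mana, then hp), not the Monster schema.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

inline std::uint16_t ReadU16(const std::uint8_t* p) {
  return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

inline std::uint32_t ReadU32(const std::uint8_t* p) {
  return static_cast<std::uint32_t>(p[0]) |
         (static_cast<std::uint32_t>(p[1]) << 8) |
         (static_cast<std::uint32_t>(p[2]) << 16) |
         (static_cast<std::uint32_t>(p[3]) << 24);
}

// Reads a 16-bit field of the root table; field_index is the field's
// position in the schema, counted from 0.
inline std::int16_t GetShortField(const std::uint8_t* buf,
                                  std::size_t field_index,
                                  std::int16_t default_value) {
  assert(buf != nullptr);
  const std::uint8_t* table = buf + ReadU32(buf);  // follow the root offset
  const std::uint8_t* vtable =
      table - static_cast<std::int32_t>(ReadU32(table));
  const std::size_t slot = 4 + 2 * field_index;  // skip the two size entries
  if (slot >= ReadU16(vtable)) return default_value;  // unknown to the writer
  const std::uint16_t field_offset = ReadU16(vtable + slot);
  if (field_offset == 0) return default_value;  // field was not stored
  return static_cast<std::int16_t>(ReadU16(table + field_offset));
}

// Hand-assembled example: mana is absent, hp is stored as 300.
inline const std::uint8_t* ExampleBuffer() {
  static const std::uint8_t kBuf[20] = {
      12, 0, 0, 0,  // root offset: the table starts at byte 12
      8, 0,         // vtable size in bytes
      8, 0,         // table object size in bytes
      0, 0,         // field 0 (mana): absent, the default applies
      4, 0,         // field 1 (hp): stored 4 bytes into the table
      8, 0, 0, 0,   // table start: the vtable lies 8 bytes before it
      0x2C, 0x01,   // hp = 300, little-endian
      0, 0,         // padding
  };
  return kBuf;
}
```

Reading field 0 with default 150 returns 150 without touching table_data, reading field 1 returns 300, and asking for a field index beyond the vtable returns the caller's default, which is how new code reads buffers written by old code. Nothing is copied or allocated along the way.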

The entire deserialization process is zero-copy and needs no additional memory allocation. FlatBuffers can also read an arbitrary field on its own, whereas JSON and Protocol Buffers must parse the entire object to get at one field. This is where the main advantage of FlatBuffers lies: data can be decoded extremely fast, or rather read directly without decoding.

Five Automation of FlatBuffers

Automation in FlatBuffers covers automatic generation of the encoding/decoding interface and automatic generation of JSON, both of which depend on parsing the schema.

1 Schema file parsing

The FlatBuffers schema parser walks the file sequentially with a cursor and recognizes the data structures that FlatBuffers supports, collecting each field's name, type, default value, deprecation flag, and other attributes. Supported keywords: scalar types, non-scalar types, include, namespace, and root_type.

If you need a nested vector, wrap the vector in a table.

2 Automatic generation of the encoding and decoding interface

FlatBuffers generates template-based code, and the encoding/decoding interface is emitted as header (.h) files only. The generated header defines the data structures and provides specialized add, get, and verify functions for each field. The corresponding file is named filename_generated.h.

3 Automatic generation of JSON

The main goal of FlatBuffers is to avoid deserialization. It does so by defining a binary data protocol, a way of converting structured data into binary form, such that the resulting binary structure can be read without further decoding. So to generate JSON automatically, all that is needed is the binary data stream and the schema that defines its structure; the data is read and converted to JSON.

  • The JSON structure mirrors the FlatBuffers structure.
  • Fields equal to their default value are not output to JSON.
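The second rule can be sketched with a toy converter that skips any field still at its default. ShortField and ToJson are hypothetical names, the fields are assumed to have already been read from the buffer, and the real converter works from the parsed schema rather than from a hand-built list.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of one integer field after it has been read.
struct ShortField {
  std::string name;
  int value;
  int default_value;
};

// Emits a flat JSON object and leaves out fields equal to their default.
inline std::string ToJson(const std::vector<ShortField>& fields) {
  std::string out = "{";
  bool first = true;
  for (const ShortField& field : fields) {
    assert(!field.name.empty());
    if (field.value == field.default_value) continue;  // default: skip it
    if (!first) out += ", ";
    out += "\"" + field.name + "\": " + std::to_string(field.value);
    first = false;
  }
  out += "}";
  return out;
}
```

With mana at its default of 150 and hp set to 300 against a default of 100, only hp appears in the output, mirroring the way the binary buffer omits the same field.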

Six Advantages and disadvantages of FlatBuffers

1 Advantages

  • Extremely fast decoding. Serialized data is stored in a buffer that can be written to a file, sent over the network as is, or read directly with no parsing overhead. The only memory needed to access the data is the buffer itself; no additional allocation is required.
  • Extensibility and flexibility. Optional fields give good forward/backward compatibility. FlatBuffers supports writing data members selectively, which not only keeps different versions of an application compatible with the same data structure but also lets programmers choose which fields to write and how to design the structure being transmitted.
  • Cross-platform. C++11 and Java are supported without any dependent libraries, and it works well with compilers such as GCC, Clang, and VS2010. It is simple to use: only a small amount of generated code and a single header dependency are required, so it is easy to integrate into existing systems. The generated C++ code provides simple access and construction interfaces and is compatible with parsing other formats such as JSON.

2 Disadvantages

  • The data is not human-readable; it has to be converted to a readable form before it can be inspected.
  • Backward compatibility has limits; care is needed when adding or removing fields in a schema.

Seven Summary

The biggest advantage of FlatBuffers over other serialization tools is that deserialization is extremely fast, or rather that no decoding is needed at all. Scenarios that decode serialized data frequently stand to benefit from this property.

This article is the original content of Aliyun and shall not be reproduced without permission.