Protocol buffers serialization

The encoding process was covered in the previous article. This article uses Go as an example and walks through serialization and deserialization at the level of the code implementation.

It starts with an example of using Protobuf in Go to serialize and deserialize data.

First, define an example message:

syntax = "proto2"; package example; enum FOO { X = 17; }; message Test { required string label = 1; optional int32 type = 2 [default=77]; repeated int64 reps = 3; optional group OptionalGroup = 4 { required string RequiredField = 5; }}Copy the code

protoc-gen-go generates the corresponding Go structs and getter methods. The generated code can then be used for serialization and deserialization.

	package main

	import (
		"log"

		"github.com/golang/protobuf/proto"
		"path/to/example"
	)

	func main() {
		test := &example.Test{
			Label: proto.String("hello"),
			Type:  proto.Int32(17),
			Reps:  []int64{1, 2, 3},
			Optionalgroup: &example.Test_OptionalGroup{
				RequiredField: proto.String("good bye"),
			},
		}
		data, err := proto.Marshal(test)
		if err != nil {
			log.Fatal("marshaling error: ", err)
		}
		newTest := &example.Test{}
		err = proto.Unmarshal(data, newTest)
		if err != nil {
			log.Fatal("unmarshaling error: ", err)
		}
		// Now test and newTest contain the same data.
		if test.GetLabel() != newTest.GetLabel() {
			log.Fatalf("data mismatch %q != %q", test.GetLabel(), newTest.GetLabel())
		}
		// etc.
	}

In the code above, proto.Marshal() is the serialization step and proto.Unmarshal() is the deserialization step. This section looks at the implementation of serialization; the next section looks at deserialization.

// Marshal takes the protocol buffer
// and encodes it into the wire format, returning the data.
func Marshal(pb Message) ([]byte, error) {
	// Can the object marshal itself?
	if m, ok := pb.(Marshaler); ok {
		return m.Marshal()
	}
	p := NewBuffer(nil)
	err := p.Marshal(pb)
	if p.buf == nil && err == nil {
		// Return a non-nil slice on success.
		return []byte{}, nil
	}
	return p.buf, err
}

On entry, the serialization function first checks whether the Message object implements its own serialization method.

// Marshaler is the interface representing objects that can marshal themselves.
type Marshaler interface {
	Marshal() ([]byte, error)
}

Marshaler is an interface reserved for objects that serialize themselves. If the message implements it, its own Marshal() is called and the result is returned directly; otherwise the default serialization path is taken.
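For illustration, here is a minimal sketch (not part of the library or of any generated code) of a hypothetical message type that opts out of the default encoder by implementing Marshaler. A value passed to proto.Marshal must also satisfy proto.Message, so Reset(), String() and ProtoMessage() are included:

type RawMessage struct {
	payload []byte
}

// These three methods make RawMessage satisfy the proto.Message interface.
func (m *RawMessage) Reset()         { m.payload = nil }
func (m *RawMessage) String() string { return string(m.payload) }
func (m *RawMessage) ProtoMessage()  {}

// Marshal satisfies the Marshaler interface, so proto.Marshal returns this
// byte slice directly instead of walking the struct with reflection.
func (m *RawMessage) Marshal() ([]byte, error) {
	out := make([]byte, len(m.payload))
	copy(out, m.payload)
	return out, nil
}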

	p := NewBuffer(nil)
	err := p.Marshal(pb)
	if p.buf == nil && err == nil {
		// Return a non-nil slice on success.
		return []byte{}, nil
	}

A new Buffer is created and its Marshal() method is called. Once the message has been serialized, the encoded bytes sit in the Buffer's buf field, and that byte slice is what is finally returned.

type Buffer struct {
	buf   []byte // encode/decode byte stream
	index int    // read point

	// pools of basic types to amortize allocation.
	bools   []bool
	uint32s []uint32
	uint64s []uint64

	// extra pools, only used with pointer_reflect.go
	int32s   []int32
	int64s   []int64
	float32s []float32
	float64s []float64
}

Buffer is the buffer manager Protocol Buffers uses for serialization and deserialization. It can be reused across calls to reduce memory allocations. Internally it maintains seven pools: three for basic types and four that are only used by pointer_reflect.go.
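As a sketch of what that reuse might look like (the message slice and the send callback are hypothetical, not from the library; Buffer's Reset() and Bytes() methods are used to clear and read back the internal byte stream):

package pbdemo

import (
	"log"

	"github.com/golang/protobuf/proto"
)

// marshalAll reuses a single Buffer for many messages so the underlying
// byte slice is allocated once and grown as needed.
func marshalAll(msgs []proto.Message, send func([]byte)) {
	buf := proto.NewBuffer(nil)
	for _, m := range msgs {
		buf.Reset() // drop the previous contents but keep the allocated capacity
		if err := buf.Marshal(m); err != nil {
			log.Fatal("marshaling error: ", err)
		}
		send(buf.Bytes()) // Bytes returns the encoded byte stream accumulated in buf
	}
}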

func (p *Buffer) Marshal(pb Message) error {
	// Can the object marshal itself?
	if m, ok := pb.(Marshaler); ok {
		data, err := m.Marshal()
		p.buf = append(p.buf, data...)
		return err
	}

	t, base, err := getbase(pb)
	// Exception handling
	if structPointer_IsNil(base) {
		return ErrNil
	}
	if err == nil {
		err = p.enc_struct(GetProperties(t.Elem()), base)
	}

	// set the Encode number to Encode
	if collectStats {
		(stats).Encode++ // Parens are to work around a goimports bug.
	}
	// maxMarshalSize = 1<< 31-1, which is the maximum protobuf can be encoded.
	if len(p.buf) > maxMarshalSize {
		return ErrTooLarge
	}
	return err
}

The Buffer's Marshal() method again checks whether the object implements the Marshaler interface. If it does, the object serializes itself and the resulting bytes are appended to buf.

func getbase(pb Message) (t reflect.Type, b structPointer, err error) {
	if pb == nil {
		err = ErrNil
		return
	}
	// get the reflect type of the pointer to the struct.
	t = reflect.TypeOf(pb)
	// get the address of the struct.
	value := reflect.ValueOf(pb)
	b = toStructPointer(value)
	return
}

The getbase() function uses reflection to obtain the message's type and a struct pointer to its value. With the struct pointer in hand, the nil case is handled as an error first.

Next comes the call to p.enc_struct(GetProperties(t.Elem()), base):

// Encode a struct.
func (o *Buffer) enc_struct(prop *StructProperties, base structPointer) error {
	var state errorState
	// Encode fields in tag order so that decoders may use optimizations
	// that depend on the ordering.
	// https://developers.google.com/protocol-buffers/docs/encoding#order
	for _, i := range prop.order {
		p := prop.Prop[i]
		if p.enc != nil {
			err := p.enc(o, p, base)
			if err != nil {
				if err == ErrNil {
					if p.Required && state.err == nil {
						state.err = &RequiredNotSetError{p.Name}
					}
				} else if err == errRepeatedHasNil {
					// Give more context to nil values in repeated fields.
					return errors.New("repeated field " + p.OrigName + " has nil element")
				} else if !state.shouldContinue(err, p) {
					return err
				}
			}
			if len(o.buf) > maxMarshalSize {
				return ErrTooLarge
			}
		}
	}

	// Do oneof fields.
	if prop.oneofMarshaler != nil {
		m := structPointer_Interface(base, prop.stype).(Message)
		if err := prop.oneofMarshaler(m, o); err == ErrNil {
			return errOneofHasNil
		} else if err != nil {
			return err
		}
	}

	// Add unrecognized fields at the end.
	if prop.unrecField.IsValid() {
		v := *structPointer_Bytes(base, prop.unrecField)
		if len(o.buf)+len(v) > maxMarshalSize {
			return ErrTooLarge
		}
		if len(v) > 0 {
			o.buf = append(o.buf, v...)
		}
	}
	return state.err
}


As the code above shows, apart from oneof fields and unrecognized fields, which are handled separately, all other field types are serialized through the call p.enc(o, p, base).

The data structure for Properties is defined as follows:

type Properties struct {
	Name     string // name of the field, for error messages
	OrigName string // original name before protocol compiler (always set)
	JSONName string // name to use for JSON; determined by protoc
	Wire     string
	WireType int
	Tag      int
	Required bool
	Optional bool
	Repeated bool
	Packed   bool   // relevant for repeated primitives only
	Enum     string // set for enum types only
	proto3   bool   // whether this is known to be a proto3 field; set for []byte only
	oneof    bool   // whether this is a oneof field

	Default     string // default value
	HasDefault  bool   // whether an explicit default was provided
	CustomType  string
	StdTime     bool
	StdDuration bool

	enc           encoder
	valEnc        valueEncoder // set for bool and numeric types only
	field         field
	tagcode       []byte // encoding of EncodeVarint((Tag<<3)|WireType)
	tagbuf        [8]byte
	stype         reflect.Type      // set for struct types only
	sstype        reflect.Type      // set for slices of structs types only
	ctype         reflect.Type      // set for custom types only
	sprop         *StructProperties // set for struct types only
	isMarshaler   bool
	isUnmarshaler bool

	mtype    reflect.Type // set for map types only
	mkeyprop *Properties  // set for map types only
	mvalprop *Properties  // set for map types only

	size    sizer
	valSize valueSizer // set for bool and numeric types only

	dec    decoder
	valDec valueDecoder // set for bool and numeric types only

	// If this is a packable field, this will be the decoder for the packed version of the field.
	packedDec decoder
}


The Properties structure defines an encoder named enc and a decoder named dec.

The encoder and decoder function types have exactly the same signature:

type encoder func(p *Buffer, prop *Properties, base structPointer) error

type decoder func(p *Buffer, prop *Properties, base structPointer) error


The encoder and decoder functions are initialized in Properties:

// Initialize the fields for encoding and decoding.
func (p *Properties) setEncAndDec(typ reflect.Type, f *reflect.StructField, lockGetProp bool) {
	// The following code has been deleted, similar parts have been omitted
	// proto3 scalar types
	
	case reflect.Int32:
		if p.proto3 {
			p.enc = (*Buffer).enc_proto3_int32
			p.dec = (*Buffer).dec_proto3_int32
			p.size = size_proto3_int32
		} else {
			p.enc = (*Buffer).enc_ref_int32
			p.dec = (*Buffer).dec_proto3_int32
			p.size = size_ref_int32
		}
	case reflect.Uint32:
		if p.proto3 {
			p.enc = (*Buffer).enc_proto3_uint32
			p.dec = (*Buffer).dec_proto3_int32 // can reuse
			p.size = size_proto3_uint32
		} else {
			p.enc = (*Buffer).enc_ref_uint32
			p.dec = (*Buffer).dec_proto3_int32 // can reuse
			p.size = size_ref_uint32
		}
	case reflect.Float32:
		if p.proto3 {
			p.enc = (*Buffer).enc_proto3_uint32 // can just treat them as bits
			p.dec = (*Buffer).dec_proto3_int32
			p.size = size_proto3_uint32
		} else {
			p.enc = (*Buffer).enc_ref_uint32 // can just treat them as bits
			p.dec = (*Buffer).dec_proto3_int32
			p.size = size_ref_uint32
		}
	case reflect.String:
		if p.proto3 {
			p.enc = (*Buffer).enc_proto3_string
			p.dec = (*Buffer).dec_proto3_string
			p.size = size_proto3_string
		} else {
			p.enc = (*Buffer).enc_ref_string
			p.dec = (*Buffer).dec_proto3_string
			p.size = size_ref_string
		}

	case reflect.Slice:
		switch t2 := t1.Elem(); t2.Kind() {
		default:
			logNoSliceEnc(t1, t2)
			break

		case reflect.Int32:
			if p.Packed {
				p.enc = (*Buffer).enc_slice_packed_int32
				p.size = size_slice_packed_int32
			} else {
				p.enc = (*Buffer).enc_slice_int32
				p.size = size_slice_int32
			}
			p.dec = (*Buffer).dec_slice_int32
			p.packedDec = (*Buffer).dec_slice_packed_int32
		
			// (cases for other slice element kinds are omitted; they contain a nested switch)
			default:
				logNoSliceEnc(t1, t2)
				break
			}
		}

	case reflect.Map:
		p.enc = (*Buffer).enc_new_map
		p.dec = (*Buffer).dec_new_map
		p.size = size_new_map

		p.mtype = t1
		p.mkeyprop = &Properties{}
		p.mkeyprop.init(reflect.PtrTo(p.mtype.Key()), "Key", f.Tag.Get("protobuf_key"), nil, lockGetProp)
		p.mvalprop = &Properties{}
		vtype := p.mtype.Elem()
		if vtype.Kind() != reflect.Ptr && vtype.Kind() != reflect.Slice {
			// The value type is not a message (*T) or bytes ([]byte),
			// so we need encoders for the pointer to this type.
			vtype = reflect.PtrTo(vtype)
		}

		p.mvalprop.CustomType = p.CustomType
		p.mvalprop.StdDuration = p.StdDuration
		p.mvalprop.StdTime = p.StdTime
		p.mvalprop.init(vtype, "Value", f.Tag.Get("protobuf_val"), nil, lockGetProp)
	}
	p.setTag(lockGetProp)
}


In the code above, each type gets its own switch case, and each case sets the corresponding enc encoder, dec decoder, and size function. proto2 and proto3 fields are also handled as two separate cases.

The types handled include reflect.Bool, reflect.Int32, reflect.Uint32, reflect.Int64, reflect.Uint64, reflect.Float32, reflect.Float64, reflect.String, reflect.Struct, reflect.Ptr, reflect.Slice, and reflect.Map.

The following sections analyze the code for four representative cases: Int32, String, Map, and slice.

1. Int32

func (o *Buffer) enc_proto3_int32(p *Properties, base structPointer) error {
	v := structPointer_Word32Val(base, p.field)
	x := int32(word32Val_Get(v)) // permit sign extension to use full 64-bit range
	if x == 0 {
		return ErrNil
	}
	o.buf = append(o.buf, p.tagcode...)
	p.valEnc(o, uint64(x))
	return nil
}

Handling Int32 is relatively simple: the tagcode is appended to the buf byte stream first, and the Varint-encoded value follows immediately after it.

// EncodeVarint writes a varint-encoded integer to the Buffer.
// This is the format for the
// int32, int64, uint32, uint64, bool, and enum
// protocol buffer types.
func (p *Buffer) EncodeVarint(x uint64) error {
	for x >= 1<<7 {
		p.buf = append(p.buf, uint8(x&0x7f|0x80))
		x >>= 7
	}
	p.buf = append(p.buf, uint8(x))
	return nil
}

This is the Varint encoding described in the previous article. The same function is used for int32, int64, uint32, uint64, bool, and enum.
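As a standalone sketch of the same scheme (not library code), encoding 300 shows how it works: 300 & 0x7F with the continuation bit set gives 0xAC, and 300 >> 7 = 2 gives the final byte 0x02, so the wire bytes are 0xAC 0x02.

// encodeVarint mirrors EncodeVarint above: emit 7 bits at a time, low bits
// first, setting the high bit of every byte except the last.
func encodeVarint(x uint64) []byte {
	var out []byte
	for x >= 1<<7 {
		out = append(out, byte(x&0x7f|0x80))
		x >>= 7
	}
	return append(out, byte(x))
}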

You can also look at the specific implementations for sint32 and fixed32.

// EncodeZigzag32 writes a zigzag-encoded 32-bit integer
// to the Buffer.
// This is the format used for the sint32 protocol buffer type.
func (p *Buffer) EncodeZigzag32(x uint64) error {
	// use signed number to get arithmetic right shift.
	return p.EncodeVarint(uint64((uint32(x) << 1) ^ uint32((int32(x) >> 31))))
}

For signed sint32, Zigzag and then Varint are used.
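A standalone sketch of just the ZigZag step (not library code): it maps 0→0, -1→1, 1→2, -2→3, 2→4, so small negative values stay small before the Varint step.

// zigzag32 interleaves negative and positive values onto the unsigned range.
func zigzag32(n int32) uint32 {
	return (uint32(n) << 1) ^ uint32(n>>31)
}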

// EncodeFixed32 writes a 32-bit integer to the Buffer.
// This is the format for the
// fixed32, sfixed32, and float protocol buffer types.
func (p *Buffer) EncodeFixed32(x uint64) error {
	p.buf = append(p.buf,
		uint8(x),
		uint8(x>>8),
		uint8(x>>16),
		uint8(x>>24))
	return nil
}

For fixed32, the encoding is just byte shifts; no compression is applied.

2. String

func (o *Buffer) enc_proto3_string(p *Properties, base structPointer) error {
	v := *structPointer_StringVal(base, p.field)
	if v == "" {
		return ErrNil
	}
	o.buf = append(o.buf, p.tagcode...)
	o.EncodeStringBytes(v)
	return nil
}

Serializing the string also takes two steps, putting tagcode in first and then serializing the data.

// EncodeStringBytes writes an encoded string to the Buffer.
// This is the format used for the proto2 string type.
func (p *Buffer) EncodeStringBytes(s string) error {
	p.EncodeVarint(uint64(len(s)))
	p.buf = append(p.buf, s...)
	return nil
}

When serializing a string, its length is first Varint-encoded and written to buf, and the string bytes follow the length. This is the tag-length-value layout.
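For the example message's `required string label = 1` set to "hello", the wire bytes would look like the following hand-assembled sketch (field 1, wire type 2 for length-delimited data):

var encodedLabel = []byte{
	0x0A,                    // tag: (1 << 3) | 2
	0x05,                    // length: len("hello")
	'h', 'e', 'l', 'l', 'o', // the string bytes, copied into the stream unmodified
}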

3. Map

// Encode a map field.
func (o *Buffer) enc_new_map(p *Properties, base structPointer) error {
	var state errorState // XXX: or do we need to plumb this through?

	v := structPointer_NewAt(base, p.field, p.mtype).Elem() // map[K]V
	if v.Len() == 0 {
		return nil
	}

	keycopy, valcopy, keybase, valbase := mapEncodeScratch(p.mtype)

	enc := func() error {
		if err := p.mkeyprop.enc(o, p.mkeyprop, keybase); err != nil {
			return err
		}
		if err := p.mvalprop.enc(o, p.mvalprop, valbase); err != nil && err != ErrNil {
			return err
		}
		return nil
	}

	// Don't sort map keys. It is not required by the spec, and C++ doesn't do it.
	for _, key := range v.MapKeys() {
		val := v.MapIndex(key)

		keycopy.Set(key)
		valcopy.Set(val)

		o.buf = append(o.buf, p.tagcode...)
		if err := o.enc_len_thing(enc, &state); err != nil {
			return err
		}
	}
	return nil
}

The code above serializes a map field, for example:

map<key_type, value_type> map_field = N;

Convert it to the corresponding repeated message form and then serialize it.

message MapFieldEntry {
		key_type key = 1;
		value_type value = 2;
}
repeated MapFieldEntry map_field = N;

For each key-value pair, map serialization writes the tagcode first and then the serialized pair. Because the byte length of each entry is not known in advance, the enc_len_thing() method is used.
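As a hand-assembled sketch, a hypothetical `map<string, int32> m = 7;` containing the single entry "id" → 1 would be encoded as one length-delimited entry: field 7 with wire type 2, followed by a 6-byte body holding key field 1 and value field 2.

var encodedMapEntry = []byte{
	0x3A,                 // tag: (7 << 3) | 2
	0x06,                 // length of the entry body
	0x0A, 0x02, 'i', 'd', // key:   field 1, length-delimited, "id"
	0x10, 0x01,           // value: field 2, varint, 1
}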

// Encode something, preceded by its encoded length (as a varint).
func (o *Buffer) enc_len_thing(enc func() error, state *errorState) error {
	iLen := len(o.buf)
	o.buf = append(o.buf, 0, 0, 0, 0) // reserve four bytes for length
	iMsg := len(o.buf)
	err := enc()
	if err != nil && !state.shouldContinue(err, nil) {
		return err
	}
	lMsg := len(o.buf) - iMsg
	lLen := sizeVarint(uint64(lMsg))
	switch x := lLen - (iMsg - iLen); {
	case x > 0: // actual length is x bytes larger than the space we reserved
		// Move msg x bytes right.
		o.buf = append(o.buf, zeroes[:x]...)
		copy(o.buf[iMsg+x:], o.buf[iMsg:iMsg+lMsg])
	case x < 0: // actual length is x bytes smaller than the space we reserved
		// Move msg x bytes left.
		copy(o.buf[iMsg+x:], o.buf[iMsg:iMsg+lMsg])
		o.buf = o.buf[:len(o.buf)+x] // x is negative
	}
	// Encode the length in the reserved space.
	o.buf = o.buf[:iLen]
	o.EncodeVarint(uint64(lMsg))
	o.buf = o.buf[:len(o.buf)+lMsg]
	return state.err
}

enc_len_thing() first reserves 4 bytes for the length. After the body has been serialized, the actual varint-encoded length is computed: if it needs more than the reserved 4 bytes, the serialized data is shifted right to make room; if it needs fewer, the data is shifted left. The length is then written into the gap between the tagcode and the data.

4. Slice

Finally, an array example, using []int32.

// Encode a slice of int32s ([]int32) in packed format.
func (o *Buffer) enc_slice_packed_int32(p *Properties, base structPointer) error {
	s := structPointer_Word32Slice(base, p.field)
	l := s.Len()
	if l == 0 {
		return ErrNil
	}
	// TODO: Reuse a Buffer.
	buf := NewBuffer(nil)
	for i := 0; i < l; i++ {
		x := int32(s.Index(i)) // permit sign extension to use full 64-bit range
		p.valEnc(buf, uint64(x))
	}

	o.buf = append(o.buf, p.tagcode...)
	o.EncodeVarint(uint64(len(buf.buf)))
	o.buf = append(o.buf, buf.buf...)
	return nil
}

Serializing the slice takes three steps: the tagcode goes in first, then the byte length of all encoded elements, and finally each element in turn. The result is the tag-length-value-value-value layout.
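For a hypothetical packed field `repeated int32 values = 3 [packed=true];` holding {1, 2, 300}, the hand-assembled bytes would be (the example message's reps field is int64 and not marked packed, so this is only an illustration):

var encodedPacked = []byte{
	0x1A,       // tag: (3 << 3) | 2, one length-delimited record for the whole slice
	0x04,       // byte length of all encoded elements
	0x01,       // 1
	0x02,       // 2
	0xAC, 0x02, // 300 as a two-byte varint
}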

This is how the Protocol Buffer is serialized.

Serialization summary:

Protocol Buffer serialization uses Varint and ZigZag to compress integers and signed integers. Floating-point numbers are not compressed (there is room for further improvement here). If an optional or repeated field has no value set, the field simply does not appear in the serialized data.

Both of these measures shrink the data and reduce the work of serialization.

Serialization itself is mostly bit operations, which are very fast. Data is laid out in the binary stream as tag-length-value (or tag-value), which also does away with JSON's delimiters such as braces, commas and colons; dropping those delimiters removes yet another part of the data.

All of this makes serialization very fast.

Protocol buffers deserialization

The realization of deserialization is completely the reverse of the realization of serialization.

func Unmarshal(buf []byte, pb Message) error {
	pb.Reset()
	return UnmarshalMerge(buf, pb)
}

Before deserialization begins, reset the buffer.

func (p *Buffer) Reset(a) {
	p.buf = p.buf[0:0] // for reading/writing
	p.index = 0        // for reading
}

Clear all data in buf and reset the read index.

func UnmarshalMerge(buf []byte, pb Message) error {
	// If the object can unmarshal itself, let it.
	if u, ok := pb.(Unmarshaler); ok {
		return u.Unmarshal(buf)
	}
	return NewBuffer(buf).Unmarshal(pb)
}
Copy the code

Deserialization starts with the function above. If the message passed in does not match the data in buf, the result is unpredictable. As with serialization, a custom Unmarshal() method, if one exists, is called first.

type Unmarshaler interface {
	Unmarshal([]byte) error
}

Unmarshaler is the interface a message implements to unmarshal itself.
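Continuing the hypothetical RawMessage sketch from the serialization section, implementing Unmarshal lets proto.Unmarshal hand the raw bytes straight to the message:

// Unmarshal satisfies the Unmarshaler interface, so the library skips its
// default reflection-based decoder for this type.
func (m *RawMessage) Unmarshal(data []byte) error {
	m.payload = append(m.payload[:0], data...)
	return nil
}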

UnmarshalMerge then calls the Buffer's Unmarshal(pb Message) method.

func (p *Buffer) Unmarshal(pb Message) error {
	// If the object can unmarshal itself, let it.
	if u, ok := pb.(Unmarshaler); ok {
		err := u.Unmarshal(p.buf[p.index:])
		p.index = len(p.buf)
		return err
	}

	typ, base, err := getbase(pb)
	if err != nil {
		return err
	}

	err = p.unmarshalType(typ.Elem(), GetProperties(typ.Elem()), false, base)

	if collectStats {
		stats.Decode++
	}

	return err
}

The Buffer's Unmarshal(pb Message) method takes only one parameter, unlike proto.Unmarshal(), which takes two. The difference is that the two-parameter proto.Unmarshal() first resets the destination message, while the one-parameter method merges into whatever the message already contains without resetting.

Both of these functions eventually call the unmarshalType() method, which does the real deserialization work.

func (o *Buffer) unmarshalType(st reflect.Type, prop *StructProperties, is_group bool, base structPointer) error {
	var state errorState
	required, reqFields := prop.reqCount, uint64(0)

	var err error
	for err == nil && o.index < len(o.buf) {
		oi := o.index
		var u uint64
		u, err = o.DecodeVarint()
		if err != nil {
			break
		}
		wire := int(u & 0x7)
		
		// The following code is omitted
		
		dec := p.dec
		
		// The intermediate code is omitted
		
		decErr := dec(o, p, base)
		if decErr != nil && !state.shouldContinue(decErr, p) {
			err = decErr
		}
		if err == nil && p.Required {
			// Successfully decoded a required field.
			if tag <= 64 {
				// use bitmap for fields 1-64 to catch field reuse.
				var mask uint64 = 1 << uint64(tag-1)
				if reqFields&mask == 0 {
					// new required field
					reqFields |= mask
					required--
				}
			} else {
				// This is imprecise. It can be fooled by a required field
				// with a tag > 64 that is encoded twice; that's very rare.
				// A fully correct implementation would require allocating
				// a data structure, which we would like to avoid.
				required--
			}
		}
	}
	if err == nil {
		if is_group {
			return io.ErrUnexpectedEOF
		}
		if state.err != nil {
			return state.err
		}
		if required > 0 {
			// Not enough information to determine the exact field. If we use extra
			// CPU, we could determine the field only if the missing required field
			// has a tag <= 64 and we check reqFields.
			return &RequiredNotSetError{"{Unknown}"}
		}
	}
	return err
}

The unmarshalType() function is long and handles many cases, such as oneof and WireEndGroup. The actual deserialization happens on the line decErr := dec(o, p, base).

The dec function was initialized in Properties' setEncAndDec(), which was covered in the serialization section above, so it is not repeated here. Each type has its own deserialization function.

Again, here are four examples of deserialization in action.

1. Int32

func (o *Buffer) dec_proto3_int32(p *Properties, base structPointer) error {
	u, err := p.valDec(o)
	if err != nil {
		return err
	}
	word32Val_Set(structPointer_Word32Val(base, p.field), uint32(u))
	return nil
}

The Int32 deserialization code is relatively simple: it restores the original data by reversing the encoding process.

func (p *Buffer) DecodeVarint() (x uint64, err error) {
	i := p.index
	buf := p.buf

	if i >= len(buf) {
		return 0, io.ErrUnexpectedEOF
	} else if buf[i] < 0x80 {
		p.index++
		return uint64(buf[i]), nil
	} else if len(buf)-i < 10 {
		return p.decodeVarintSlow()
	}

	var b uint64
	// we already checked the first byte
	x = uint64(buf[i]) - 0x80
	i++

	b = uint64(buf[i])
	i++
	x += b << 7
	if b&0x80 == 0 {
		goto done
	}
	x -= 0x80 << 7

	b = uint64(buf[i])
	i++
	x += b << 14
	if b&0x80 == 0 {
		goto done
	}
	x -= 0x80 << 14

	b = uint64(buf[i])
	i++
	x += b << 21
	if b&0x80 == 0 {
		goto done
	}
	x -= 0x80 << 21

	b = uint64(buf[i])
	i++
	x += b << 28
	if b&0x80 == 0 {
		goto done
	}
	x -= 0x80 << 28

	b = uint64(buf[i])
	i++
	x += b << 35
	if b&0x80 == 0 {
		goto done
	}
	x -= 0x80 << 35

	b = uint64(buf[i])
	i++
	x += b << 42
	if b&0x80 == 0 {
		goto done
	}
	x -= 0x80 << 42

	b = uint64(buf[i])
	i++
	x += b << 49
	if b&0x80 == 0 {
		goto done
	}
	x -= 0x80 << 49

	b = uint64(buf[i])
	i++
	x += b << 56
	if b&0x80 == 0 {
		goto done
	}
	x -= 0x80 << 56

	b = uint64(buf[i])
	i++
	x += b << 63
	if b&0x80 == 0 {
		goto done
	}
	// x -= 0x80 << 63 // Always zero.

	return 0, errOverflow

done:
	p.index = i
	return x, nil
}

Each byte of a Varint uses its high bit (0x80) as a continuation flag, so the decoder strips that bit from every byte and accumulates the remaining 7 bits with shift operations. The same function is used for int32, int64, uint32, uint64, bool, and enum.
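A standalone sketch of the same decoding loop (not library code); feeding it the 0xAC 0x02 bytes from the earlier encoding example yields (0xAC & 0x7F) + (0x02 << 7) = 44 + 256 = 300.

// decodeVarint accumulates 7 bits per byte until a byte without the
// continuation bit (0x80) is seen; n is the number of bytes consumed.
func decodeVarint(buf []byte) (x uint64, n int) {
	var shift uint
	for i, b := range buf {
		x |= uint64(b&0x7f) << shift
		if b&0x80 == 0 {
			return x, i + 1
		}
		shift += 7
	}
	return 0, 0 // truncated input
}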

You can also look at the deserialization code for sint32 and fixed32.

func (p *Buffer) DecodeZigzag32() (x uint64, err error) {
	x, err = p.DecodeVarint()
	if err != nil {
		return
	}
	x = uint64((uint32(x) >> 1) ^ uint32((int32(x&1) << 31) >> 31))
	return
}

For signed sint32, deserialization reverses the encoding: first Varint-decode, then undo the ZigZag mapping.

func (p *Buffer) DecodeFixed32() (x uint64, err error) {
	// x, err already 0
	i := p.index + 4
	if i < 0 || i > len(p.buf) {
		err = io.ErrUnexpectedEOF
		return
	}
	p.index = i

	x = uint64(p.buf[i-4])
	x |= uint64(p.buf[i-3]) << 8
	x |= uint64(p.buf[i-2]) << 16
	x |= uint64(p.buf[i-1]) << 24
	return
}

Fixed32 deserialization restores the original value by shifting each byte back into place; note that the read index is advanced past the four bytes before they are assembled.

2. String

func (p *Buffer) DecodeRawBytes(alloc bool) (buf []byte, err error) {
	n, err := p.DecodeVarint()
	if err != nil {
		return nil, err
	}

	nb := int(n)
	if nb < 0 {
		return nil, fmt.Errorf("proto: bad byte length %d", nb)
	}
	end := p.index + nb
	if end < p.index || end > len(p.buf) {
		return nil, io.ErrUnexpectedEOF
	}

	if !alloc {
		// todo: check if can get more uses of alloc=false
		buf = p.buf[p.index:end]
		p.index += nb
		return
	}

	buf = make([]byte, nb)
	copy(buf, p.buf[p.index:])
	p.index += nb
	return
}

The string is decoded by first calling DecodeVarint() to read the length, then copying that many bytes. As described in the previous article, string contents are written into the stream unmodified, so deserialization is a straight copy.

3. Map

func (o *Buffer) dec_new_map(p *Properties, base structPointer) error {
	raw, err := o.DecodeRawBytes(false)
	if err != nil {
		return err
	}
	oi := o.index       // index at the end of this map entry
	o.index -= len(raw) // move buffer back to start of map entry

	mptr := structPointer_NewAt(base, p.field, p.mtype) // *map[K]V
	if mptr.Elem().IsNil() {
		mptr.Elem().Set(reflect.MakeMap(mptr.Type().Elem()))
	}
	v := mptr.Elem() // map[K]V

	// Some code is omitted here, mainly for double-indirection placeholders for key-values, as shown in the enc_new_map function in the serialization code

	// Decode.
	// This parses a restricted wire format, namely the encoding of a message
	// with two fields. See enc_new_map for the format.
	for o.index < oi {
		// tagcode for key and value properties are always a single byte
		// because they have tags 1 and 2.
		tagcode := o.buf[o.index]
		o.index++
		switch tagcode {
		case p.mkeyprop.tagcode[0]:
			if err := p.mkeyprop.dec(o, p.mkeyprop, keybase); err != nil {
				return err
			}
		case p.mvalprop.tagcode[0]:
			if err := p.mvalprop.dec(o, p.mvalprop, valbase); err != nil {
				return err
			}
		default:
			// TODO: Should we silently skip this instead?
			return fmt.Errorf("proto: bad map data tag %d", raw[0])
		}
	}
	keyelem, valelem := keyptr.Elem(), valptr.Elem()
	if !keyelem.IsValid() {
		keyelem = reflect.Zero(p.mtype.Key())
	}
	if !valelem.IsValid() {
		valelem = reflect.Zero(p.mtype.Elem())
	}
	v.SetMapIndex(keyelem, valelem)
	return nil
}

Deserializing a map reads the tag of each entry and then deserializes the key and the value. Finally, if keyelem or valelem is invalid, reflect.Zero supplies the zero value before the pair is inserted into the map.

4. Slice

Finally, the array example again, using []int32.

func (o *Buffer) dec_slice_packed_int32(p *Properties, base structPointer) error {
	v := structPointer_Word32Slice(base, p.field)

	nn, err := o.DecodeVarint()
	if err != nil {
		return err
	}
	nb := int(nn) // number of bytes of encoded int32s

	fin := o.index + nb
	if fin < o.index {
		return errOverflow
	}
	for o.index < fin {
		u, err := p.valDec(o)
		if err != nil {
			return err
		}
		v.Append(uint32(u))
	}
	return nil
}

Deserializing the slice takes two steps: decode the byte length (the tagcode has already been consumed by the caller), then decode each value until that length is exhausted.

This is the deserialization of the Protocol Buffer.

Deserialization summary:

Protocol Buffer deserialization reads the binary byte stream directly; it is simply the reverse of the encoding, mostly bit operations. During decoding, the length is what matters most, while the tag identifies the field and its wire type. Since Properties' setEncAndDec() has already installed a decoder for every field, decoding can dispatch on the tag and then work through the data by length.

Parsing XML is considerably more complicated. XML has to be read from a file as a string and converted into a document object model; then the string for the specified node is read from that model, and finally that string is converted to a variable of the target type. This process is expensive: turning an XML file into a document object model typically requires CPU-intensive work such as lexical analysis.

Serialization/deserialization performance

Protocol Buffers have long been considered high performance, and there are plenty of benchmarks that back this up, such as the jvm-serializers benchmarks.

Before we look at the data, we can analyze the advantages of Protocol Buffer versus JSON and XML.

  1. Protobuf compresses integer types significantly with Varint and ZigZag, and it has none of JSON's delimiters such as braces, commas and quotes. Optional fields that are not set are simply absent from the encoded data. Together these measures make Protobuf payloads much smaller than the equivalent JSON.
  2. Protobuf stores data in TLV form, whereas JSON is all text. Comparing string keys is more expensive than comparing numeric field tags, and each Protobuf field carries a length prefix so unneeded fields can be skipped, whereas JSON must be scanned in full. A rough benchmark sketch follows this list.
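The following is only an illustration of how such a measurement might be set up, using the generated Test message from the beginning of the article (the import path is the same placeholder used there); it is not the benchmark behind the referenced chart.

package pbdemo

import (
	"encoding/json"
	"testing"

	"github.com/golang/protobuf/proto"
	"path/to/example"
)

func newMsg() *example.Test {
	return &example.Test{
		Label: proto.String("hello"),
		Type:  proto.Int32(17),
		Reps:  []int64{1, 2, 3},
	}
}

// BenchmarkProtoMarshal measures Protobuf binary encoding of the message.
func BenchmarkProtoMarshal(b *testing.B) {
	msg := newMsg()
	for i := 0; i < b.N; i++ {
		if _, err := proto.Marshal(msg); err != nil {
			b.Fatal(err)
		}
	}
}

// BenchmarkJSONMarshal measures encoding the same struct as JSON for comparison.
func BenchmarkJSONMarshal(b *testing.B) {
	msg := newMsg()
	for i := 0; i < b.N; i++ {
		if _, err := json.Marshal(msg); err != nil {
			b.Fatal(err)
		}
	}
}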

Here’s a graph from the reference link: Is Protobuf 5 times Faster than JSON?

From this experiment, Protobuf is indeed very good at serializing numbers.

Serializing and deserializing numbers is certainly an advantage of Protobuf over JSON and XML, but it has weaker spots too, such as strings. Apart from the tag-length prefix, strings are not processed at all in Protobuf, so when serializing or deserializing a string the copy speed is what determines the overall speed.

As the chart shows, encoding strings in Protobuf is roughly as fast as in JSON.

Conclusion

At this point, the reader should have a complete picture of Protocol Buffers.

Protocol Buffers were not originally created for data transmission but for protocol compatibility between servers; in essence, they introduced a new cross-language, unambiguous Interface Description Language (IDL). Only later did people find them a good way to transmit data and start using them for that.

Reasons to use Protocol Buffers instead of JSON:

  1. Protocol Buffers transmit less data than JSON, and after gzip or 7zip compression the network transfer costs even less.
  2. Protocol Buffers are not self-describing; without the .proto file the binary stream is hard to interpret, which gives a degree of obfuscation, and the data on the wire is a binary stream rather than plaintext.
  3. Protocol Buffers provide a toolchain for automatic code generation.
  4. Protocol Buffers are backward compatible; changing the data structures does not break older versions.
  5. Protocol Buffers fit naturally with RPC calls.

If integers are rare and the payload is mostly floating-point numbers and string data, JSON and Protocol Buffers performance should not differ much. For purely front-end interactions, the choice between JSON and Protocol Buffers makes little difference.

Protocol Buffers are used heavily in interactions with the back end. The author believes that besides their strong performance, their seamless fit with RPC calls is another important factor.


Reference:

Is Protobuf 5 times faster than JSON? Debunking Protobuf performance myths with code (jvm-serializers / thrift-protobuf-compare)

Repo: Halfrost-Field


Source: halfrost.com/protobuf_de…