This article mainly covers the differences between etcd v2 and v3, and explores etcd's internal implementation.

ETCD 3.0

Release notes

So far, etcd has gone through three major versions: etcd 0.4, etcd 2.0, and etcd 3.0. etcd 2.0 met etcd's initial requirements well, including:

  • Focus on key-value storage rather than a complete database;
  • Expose an external API via HTTP + JSON;
  • Provide a watch mechanism for continuously listening to changes on a key, plus an automatic expiration mechanism for TTL-based keys.

In practice, however, some problems surfaced: clients need to communicate with the server frequently, the cluster is under heavy pressure in both space and time, the timing of key garbage collection is unstable, and "microservice" architectures require etcd clusters to support greater concurrency. As a result, etcd 3.0 mainly optimizes the HTTP + JSON communication, the key auto-expiration mechanism, the watch mechanism, data persistence, and so on. Let's look at etcd 3.0 module by module.

Client communication mode

gRPC is a high-performance, cross-language RPC framework based on the HTTP/2 protocol. It uses Protobuf as its serialization and deserialization protocol, declaring data models and RPC service interfaces with Protobuf. Protobuf is much more efficient than JSON: although the etcd v2 client did a lot of optimization on JSON serialization and deserialization, gRPC serialization and deserialization in etcd v3 is still more than twice as fast as in etcd v2.

The etcd v3 client uses gRPC to communicate with the server, and the message protocol is protobuf instead of v2's HTTP+JSON format: binary replaces text, which saves space. gRPC runs on HTTP/2, where a single connection can handle multiple requests, unlike HTTP/1.1 where multiple requests need multiple connections. HTTP/2 also compresses and encodes request headers and data: HEADERS frames carry header content while DATA frames carry body entities. A client can place multiple requests on different streams, each stream is split into numbered frames for binary transmission, and the frames are interleaved on one connection. A client can therefore send many requests over a single connection, which reduces the number of connections and the pressure on the server, while the binary format also transmits faster.

To sum up, there are two main points of optimization:

  • Binary instead of text: gRPC is used for communication, replacing v2's HTTP+JSON format;
  • Fewer TCP connections: with HTTP/2, the same connection can handle multiple requests at the same time, eliminating the need to establish a connection per request.
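As a concrete illustration, here is a minimal sketch of the v3 Go client (the endpoint localhost:2379 is a made-up example): it opens a single gRPC/HTTP2 connection and multiplexes every subsequent request over it as protobuf-encoded messages.

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// One gRPC connection; all later KV/Watch/Lease calls are
	// multiplexed over it as HTTP/2 streams.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	resp, err := cli.Put(ctx, "key1", "v1") // protobuf request over gRPC
	cancel()
	if err != nil {
		panic(err)
	}
	fmt.Println("stored at revision:", resp.Header.Revision)
}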

Expiration mechanism for keys

Keys in etcd v2 expire via a TTL mechanism. For each key with a lifetime, the client must periodically refresh it to keep it from being deleted automatically, and each refresh establishes a new connection to update the key. That is, even when the entire cluster is idle, many clients must communicate with the server periodically just to keep their keys from being deleted.

etcd v3 uses a lease mechanism instead. Each lease has a TTL, and keys are attached to a lease; when the lease expires, all keys attached to it are deleted. The key expiration mechanism makes service registration possible: we register a service's domain name, IP, and other information into etcd, attach a lease to the corresponding keys, and refresh the lease with a periodic heartbeat within the TTL. When the service fails, the heartbeat stops, the corresponding keys are deleted automatically, and we get both service registration and service health checking.

To sum up, v2 is clumsier: the client must maintain communication for every single key. v3 is smarter: all expiring keys are attached to one shared "code", the lease, and the client only needs to keep that lease alive instead of refreshing every key itself.
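Here is a minimal service-registration sketch using leases (it assumes the cli from the previous sketch; the service key and address are made-up examples):

// Grant a lease with a 10-second TTL and attach the registration key to it.
leaseResp, err := cli.Grant(context.Background(), 10)
if err != nil {
	panic(err)
}
_, err = cli.Put(context.Background(), "/services/user-svc/10.0.0.1:8080", "alive",
	clientv3.WithLease(leaseResp.ID))
if err != nil {
	panic(err)
}
// KeepAlive refreshes the single lease in the background; if this process
// dies, the heartbeat stops, the lease expires, and the key disappears.
ch, err := cli.KeepAlive(context.Background(), leaseResp.ID)
if err != nil {
	panic(err)
}
go func() {
	for range ch {
		// drain keep-alive responses
	}
}()

Note that only the lease is refreshed: any number of keys attached to it stay alive with that one heartbeat.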

Watch mechanism

To keep track of key changes, etcd v2 uses an event mechanism that maintains key state so that even deleted keys can still be watched, but the sliding window is limited to the most recent 1,000 events, making it impossible to retrieve changes older than that. Synchronizing data via watch in etcd v2 is therefore not very reliable: after a period of disconnection, intermediate key changes may be lost. etcd v3 supports get and watch on arbitrary historical versions of a key.

In addition, a watch in v2 is essentially many HTTP connections: every watch establishes a new TCP socket connection. When there are too many watching clients, server resources are heavily consumed; with thousands of clients watching thousands of keys, the etcd v2 server's sockets and memory are quickly exhausted. v3 watches reuse connections: multiple watchers can share the same TCP connection, greatly reducing the pressure on the server.

To sum up, there are two main points of optimization:

  • Monitor key updates in real time: solves the v2 problem that a client may fail to perceive key data updates;
  • Multiplexing: think of the select and epoll models; a client that previously needed to establish multiple TCP connections now needs only one.
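Here is a minimal watch sketch (again assuming the cli from the earlier sketch): all watchers created from one client share a single gRPC stream, and WithRev replays history from an old revision, so a reconnecting client misses nothing.

// Watch key1 starting from revision 3; historical events are replayed
// first, then live updates keep arriving on the same channel.
watchCh := cli.Watch(context.Background(), "key1", clientv3.WithRev(3))
for wresp := range watchCh {
	for _, ev := range wresp.Events {
		fmt.Printf("%s %s -> %s (mod rev %d)\n",
			ev.Type, ev.Kv.Key, ev.Kv.Value, ev.Kv.ModRevision)
	}
}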

Data storage model

Etcd is a key-value database. etcd v2 only stores the latest value of a key; the previous value is overwritten directly. To know the history of a key, v2 maintains a window over the key's history, but when data is updated quickly those 1,000 changes are "not enough": records are overwritten before they can be read. To solve this problem, etcd v3 abandoned v2's unreliable "sliding window" design and introduced an MVCC mechanism, using a storage structure whose main index is the history record, keeping every historical change of a key, and supporting fast lock-free queries. Furthermore, v2 keys form a recursive file-directory structure, while v3 switches to a flat key space, which is simpler and supports fast key queries through B-tree index optimization.

Since etcd v3 implements MVCC and saves the historical versions of each key-value pair, the data volume is much larger and the entire database can no longer be held in memory. So etcd v3 abandons the in-memory database and switches to a disk database: all data is stored on disk, with BoltDB as the underlying storage engine.

To sum up, there are three main points of optimization:

  • Save historical data: v2's "sliding window" design is replaced by MVCC, which keeps all historical data;
  • Persist to disk: because historical data must be kept, the data volume grows sharply and no longer fits entirely in memory, so BoltDB is used for storage;
  • Query optimization: v2's hierarchical directory design is abandoned in favor of a flat key space with B-tree index optimization.
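All three points are visible from the client API. A minimal sketch (assuming the cli from the earlier sketches, and that key1 already has some history at revision 3):

// Read key1 as it was at revision 3 (an MVCC historical read).
old, err := cli.Get(context.Background(), "key1", clientv3.WithRev(3))
if err != nil {
	panic(err)
}
for _, kv := range old.Kvs {
	fmt.Printf("%s = %s (create rev %d, mod rev %d)\n",
		kv.Key, kv.Value, kv.CreateRevision, kv.ModRevision)
}

// The key space is flat: a "directory" is just a key prefix, answered
// efficiently by a range scan over the B-tree index.
svcs, err := cli.Get(context.Background(), "/services/", clientv3.WithPrefix())
if err != nil {
	panic(err)
}
fmt.Println("keys under /services/:", svcs.Count)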

Other

Other etcd v3 optimizations, such as mini-transactions and the snapshot mechanism, are not explained here; the related content will be covered later.

Multi-version concurrency control

Why MVCC

In the case of high concurrency there are many read/write operations. etcd v2 is a pure in-memory database, and the whole database has a single stop-the-world big lock; the locking mechanism resolves the data races caused by concurrency, but it also brings some drawbacks:

  • The granularity of the lock is difficult to control; every stop-the-world locks the entire database;
  • Read and write locks block each other;
  • With a lock-based isolation mechanism, a long-running read transaction prevents the objects it reads from being overwritten, and subsequent transactions are blocked until it completes, which significantly hurts concurrency.

MVCC is multi-version concurrency control, introduced in etcd v3. It solves the locking problems well: whenever a data object needs to be changed or deleted, the DBMS does not delete or modify the existing object itself, but creates a new version of it. Concurrent reads can read the older version without locking while writes proceed at the same time, with the advantage that reads are never blocked.

In summary, MVCC maximizes efficient read/write concurrency, especially efficient reads, and is therefore ideal for etcd's "read-heavy, write-light" scenarios.

The data model

Before we introduce the implementation principles of MVCC, we also need to understand the data storage models of V2 and V3.

V2 is an in-memory database, and data is persisted through WAL logs and snapshots. The detailed data persistence methods will be described in the following sections.

V3 supports querying historical versions, so it stores data in a multi-version persistent K-V store. When a persisted key-value pair changes, the old value is preserved first; thus, after a modification, all previous versions of a key can still be read and watched. Storing historical versions greatly increases the data volume, which cannot fit in memory, so v3 persists its data to disk in BoltDB.

So what is BoltDB? BoltDB is a pure-Go K-V store. Its goal is to provide projects with a simple, efficient, and reliable embedded, serializable key-value database rather than a full database server like MySQL. BoltDB also supports transactions, and etcd's transactions are implemented on top of BoltDB's (a usage sketch follows the two questions below). For a fuller understanding, let me expand on two more questions:

  1. v2 provides periodic snapshots; does v3 also need to create snapshots?

The answer is no. After v3 implements MVCC, data is written to the BoltDB database in real time; that is, persistence is "amortized" over each write request to a key, so v3 does not need to take snapshots.

  2. Will all historical data in v3 be saved?

The answer is also no. Although v3 has no snapshots and all data is kept in BoltDB, etcd compacts (deletes) old versions of keys to prevent the data store from growing indefinitely over time.
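Since both answers revolve around BoltDB, here is the usage sketch promised above: a minimal standalone BoltDB (bbolt) program. The file name and bucket name are made-up examples; etcd's real backend layout differs.

package main

import (
	"fmt"

	bolt "go.etcd.io/bbolt"
)

func main() {
	db, err := bolt.Open("demo.db", 0600, nil)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// All writes happen inside a serializable read-write transaction.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("key"))
		if err != nil {
			return err
		}
		return b.Put([]byte("rev={3 0}"), []byte(`key=key1, value="v1"`))
	})
	if err != nil {
		panic(err)
	}

	// Reads run in a read-only transaction and do not block writers.
	_ = db.View(func(tx *bolt.Tx) error {
		v := tx.Bucket([]byte("key")).Get([]byte("rev={3 0}"))
		fmt.Printf("%s\n", v)
		return nil
	})
}

As for compaction, old revisions can be deleted through the v3 API's Compact call (cli.Compact(context.Background(), rev) in clientv3) or via the server's auto-compaction settings; a read at a compacted revision then fails with a "required revision has been compacted" error.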

Preliminary study on MVCC implementation

So how does v3 implement MVCC? Let's first look at the following operations:

etcdctl txn <<< 'put key1 "v1" put key2 "v2"'
etcdctl txn <<< 'put key1 "v12" put key2 "v22"'

BoltDB will then store four records, as follows:

rev={3 0}, key=key1, value="v1"
rev={3 1}, key=key2, value="v2"
rev={4 0}, key=key1, value="v12"
rev={4 1}, key=key2, value="v22"

A revision consists of two parts. The first part is the main rev, which increases by 1 with each transaction; the second part is the sub rev, which starts at 0 within a transaction and increases by 1 for each operation in the same transaction. As the example above shows, the main rev of the first transaction is 3 and that of the second is 4, while the two puts inside each transaction get sub revs 0 and 1.
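A simplified sketch of this revision pair (modeled on etcd's internal mvcc type, but not copied from it):

// revision mirrors the {main, sub} pair described above.
type revision struct {
	main int64 // increases by 1 per transaction
	sub  int64 // position of the operation inside that transaction
}

// Revisions order by main first, then by sub. Because BoltDB keys are
// the encoded revision, a sequential scan returns versions in order.
func (a revision) GreaterThan(b revision) bool {
	if a.main != b.main {
		return a.main > b.main
	}
	return a.sub > b.sub
}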

To query data from BoltDB you must use a revision, but clients always query by key. Therefore etcd also maintains a kvindex in memory that stores the mapping from key to revision, used to speed up queries. kvindex is a secondary in-memory index built on Google's open-source Go B-tree implementation. When a client queries a value by key, etcd first looks up all revisions of the key in kvindex, then fetches the data from BoltDB by revision.
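A toy version of this index, using the same github.com/google/btree library and the revision type from the previous sketch (etcd's real treeIndex keeps a richer per-key history than this):

import "github.com/google/btree"

// keyIndex maps one key to all revisions at which it was modified.
type keyIndex struct {
	key  []byte
	revs []revision
}

// Less orders items by key so the B-tree can search them.
func (k *keyIndex) Less(than btree.Item) bool {
	return string(k.key) < string(than.(*keyIndex).key)
}

func lookup(idx *btree.BTree, key string) []revision {
	item := idx.Get(&keyIndex{key: []byte(key)})
	if item == nil {
		return nil
	}
	return item.(*keyIndex).revs // each revision is then a BoltDB key
}

Usage: build the index with idx := btree.New(32), insert entries with idx.ReplaceOrInsert(&keyIndex{...}), and each revision returned by lookup becomes the BoltDB key for the actual value.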

Log and snapshot management

Data persistence

Etcd uses WAL logs and snapshots to persist data.

The use of WAL logging was covered in "3.3.1 Replicated State Machines", so let's review it. etcd writes every data update to the WAL first, and only applies it to memory after the WAL entry has been synchronized to the distributed nodes via Raft. WAL logs can also implement redo and undo: the WAL records every operation on the data, so when a failure occurs it can be used to recover and roll back the database.
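A toy sketch of this write path (the idea only, not etcd's actual wal package): the log record is appended and fsynced before the change is applied to memory, so that on restart the log can be replayed (redo).

import (
	"fmt"
	"os"
)

type store struct {
	wal *os.File          // append-only log file
	mem map[string]string // in-memory state machine
}

func (s *store) Put(key, value string) error {
	// 1. Append the operation to the WAL and fsync so it is durable.
	if _, err := fmt.Fprintf(s.wal, "PUT %s %s\n", key, value); err != nil {
		return err
	}
	if err := s.wal.Sync(); err != nil {
		return err
	}
	// 2. Only after the record is durable, apply it to memory.
	s.mem[key] = value
	return nil
}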

With WAL logs, why take snapshots periodically? WAL logs are like Redis's AOF log, and snapshots are like Redis's RDB file. When Redis synchronizes data between nodes, it first fully synchronizes the RDB file (the snapshot) and then incrementally synchronizes the AOF log (the incremental data). etcd works the same way: WAL entries are too fine-grained, and synchronizing data purely through WAL logs would be too slow, so we first synchronize all the earlier data (the snapshot file) and then synchronize the subsequent incremental data (the WAL logs). Once WAL data has been written into a snapshot, the old WAL data can be deleted.

Snapshot management

If you know how Redis creates its RDB snapshot files, it is not difficult to understand how etcd creates a snapshot file.


Snapshots are implemented with copy-on-write technology. While a snapshot is being taken, if a piece of data is updated, a copy of it is generated first, as with "key/value pair C" in the figure; data that is not updated during the snapshot is written to disk directly.
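A toy illustration of the copy-on-write idea (a coarse whole-map copy rather than BoltDB's page-level implementation): a snapshot pins the current version, and writers publish a new copy without disturbing it.

package main

import (
	"fmt"
	"sync/atomic"
)

// cowStore publishes its data through an atomic pointer: a snapshot just
// pins the current pointer; writers copy, modify, and swap in a new one.
type cowStore struct {
	data atomic.Pointer[map[string]string]
}

func (s *cowStore) Put(k, v string) {
	for {
		old := s.data.Load()
		next := make(map[string]string, len(*old)+1)
		for key, val := range *old { // copy on write
			next[key] = val
		}
		next[k] = v
		if s.data.CompareAndSwap(old, &next) {
			return
		}
	}
}

func (s *cowStore) Snapshot() map[string]string { return *s.data.Load() }

func main() {
	s := &cowStore{}
	initial := map[string]string{"C": "old"}
	s.data.Store(&initial)

	snap := s.Snapshot() // the snapshot writer works from this version
	s.Put("C", "new")    // a concurrent update copies; snapshot unaffected
	fmt.Println(snap["C"], s.Snapshot()["C"]) // prints: old new
}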

Welcome to like this article; for more articles, follow the WeChat public account "Lou Zai advanced road". Click follow and don't get lost~~