About the authors
Kong Lingzhen, Chief Architect at Douyu, is responsible for planning and building the overall technical architecture of the Douyu platform. He has more than 10 years of experience architecting medium and large Internet products and specializes in architecture and solution design for high-concurrency, high-availability scenarios.
Yu Jing, an operations expert on Douyu's technical support team, is responsible for building Douyu's high-availability infrastructure. He specializes in areas such as registries and monitoring systems, and is also responsible for the basic infrastructure supporting Douyu Live.
Tang Cong, senior engineer at Tencent Cloud, author of the Geektime column "etcd Practice Course" and an active etcd contributor, is mainly responsible for the design and development of Tencent Cloud's large-scale Kubernetes/etcd platform, stateful service containerization, and online/offline workload colocation products.
Peng Chen, product architect for Tencent Cloud container services, has focused on the cloud-native field for many years. He has helped a large number of users containerize and adopt cloud native in production, has rich hands-on experience, and has published many articles on cloud-native technology.
Business background and pain points
Douyu Live, a leading game live-streaming platform, provides hundreds of millions of Internet users with high-quality game live streaming, interaction, and entertainment services every day.
With the boom of the live-streaming market in recent years, Douyu, as an Internet company with an excellent reputation and track record in the industry, has seen explosive user growth, and the stability and technical challenges that this massive user base brings to the platform have become increasingly severe. Douyu's old architecture is shown in the figure below; it carried certain risks and hidden dangers in both business support and architecture design.
Figure 1. Douyu's old architecture
To provide users with a better availability experience, Douyu urgently needed to solve the single-data-center problem and upgrade the old architecture from a single data center to multiple data centers.
Multi-data center challenges
In the process of upgrading from a single-DC to a multi-DC (multi-active) deployment, we faced a number of challenges in ensuring a trouble-free migration and upgrade, such as:
- How can stateful services such as etcd and ZooKeeper be synchronized across multiple data centers?
- Applications have complex tree-like or mesh dependencies on each other. Where should the migration start?
- Along which dimension should the target boundaries be drawn, so that businesses do not end up welded together with no way to proceed?
- If a fault occurs after the migration, how can it be recovered quickly without affecting the services that have already been migrated?

Since many systems are involved in the upgrade from single-active to multi-active, this article, the first in the Douyu Live multi-active transformation series, focuses only on the registry module, so we will first introduce etcd and ZooKeeper, the systems behind the registry.
The role of ZooKeeper/etcd
Dubbo uses a registry to solve the problem of service registration and discovery in large clusters. The following is the registry architecture diagram:
Dubbo supports ZooKeeper as its registry by default. Although newer versions do have an etcd implementation, there is no precedent of it being used at scale in production, and using etcd as a registry is still rare in the Java technology stack.
When ZooKeeper is used as the Dubbo registry, its registration relationship is a tree structure, as shown in the following figure:
ZooKeeper stores data in a tree structure similar to a file system, while etcd stores data as flat key-value pairs. This difference makes synchronizing registration relationships between the two difficult.
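To make the structural difference concrete, here is a minimal illustration (ours, not zk2etcd's actual code; the Dubbo path below is just the conventional /dubbo/&lt;service&gt;/providers layout) of how a tree-shaped ZooKeeper registration path can be flattened into an etcd key, with child listing emulated by a prefix query.

```go
package main

import "fmt"

// zkPathToEtcdKey flattens a ZooKeeper node path into a single etcd key.
// etcd v3 has no directories, so the full path becomes one flat key, and
// "listing children" becomes a prefix Get on "<path>/" instead of zk's Children().
func zkPathToEtcdKey(zkPath string) (key, childPrefix string) {
	key = zkPath
	childPrefix = zkPath + "/"
	return key, childPrefix
}

func main() {
	// Hypothetical Dubbo provider node registered in ZooKeeper.
	k, p := zkPathToEtcdKey("/dubbo/com.example.UserService/providers/provider-url")
	fmt.Println("etcd key:        ", k)
	fmt.Println("children prefix: ", p) // use clientv3.Get(p, clientv3.WithPrefix()) to emulate listing children
}
```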
In addition, when migrating from ZooKeeper to etcd, existing online services must not be disrupted or stopped during the migration, and if the migration fails, it must be possible to fall back to ZooKeeper.
New same-city multi-active architecture
To achieve multi-active deployment, we overcame the above challenges through technical and operational means such as cross-data-center synchronization services, sorting out service dependencies and demarcating boundaries, and controllable changes, and designed the new architecture shown in the figure below:
Figure 2. Douyu's new architecture
Under the new architecture, traffic can be scheduled at fine granularity by domain name or even URL, and the RPC layer can automatically call the nearest instances. The registry part of the architecture is shown below:
Figure 3. The old architecture of the Douyu registry
Multi-active registry solution selection and objectives
During the multi-active registry transformation, we faced several options, as shown in the following table:
For historical reasons, we have two sets of registries, ZooKeeper (hereinafter zk) and etcd, as well as four technology stacks: Java, Go, C++, and PHP. Our registry landscape therefore still had some shortcomings, and we hoped to standardize on etcd to resolve these pain points and achieve the following goals:
Lower maintenance costs: previously we had to operate both zk and etcd, and adapting the multi-active solution to both would have been even harder, doubling the development cost of a multi-active registry. Since etcd is part of Kubernetes, operating etcd is unavoidable anyway, which is the first reason for choosing etcd.
Embrace a more vibrant ecosystem: etcd has cloud-native managed offerings, vendors use etcd to manage Kubernetes clusters at the 10K-node level, etcd has peripheral tools such as proxies, caches, and mirroring, and on the Java side Dubbo also supports etcd as a registry. etcd has better prospects than zk, which is the second reason for choosing etcd.
Stronger cross-language support: etcd communicates over HTTP or gRPC and supports long polling, giving it strong cross-language capabilities, whereas zk requires a dedicated client whose implementations outside Java are immature. We develop in four languages, Java, Go, C++, and PHP, which is the third reason for choosing etcd.
For the above reasons, we chose Plan 4. Its new architecture is shown in the figure below:
Figure 4. Douyu's new registry architecture
Difficulties and challenges of the multi-active registry
To realize the new registry and achieve the design goals described above, we faced the following difficulties and challenges during the transformation:
- How do we solve the multi-data-center synchronization problem for zk? In particular, ZooKeeper's watch mechanism is unreliable and watch events may be lost. (correctness)
- How do we solve multi-data-center synchronization for etcd? From the solution comparison below, you will see that the community currently has no mature, production-ready solution. (correctness)
- How do we solve the performance problem of cross-data-center reads? (performance)
- How do we ensure service stability across data centers? What if a network link, such as an intranet private line, is interrupted? Could the synchronization service's design cause etcd/zk synchronization to fall into an extremely slow full-synchronization path? Is the synchronization service itself highly available? How do we design disaster-recovery test cases, and, operationally, how can we quickly discover hidden dangers, eliminate potential failures, and build a visual, flexible multi-active operations system? (stability, operability)
Analysis of the multi-active registry difficulties
How do we keep new and old services communicating during the migration?
Develop zk2etcd
Many of our Java businesses use the Dubbo framework for service governance, with ZooKeeper as the registry. We want all Java and Go businesses to use etcd as the registry, which also lays the groundwork for cross-language invocation.
Because a large number of businesses are involved, the transformation and migration cycle will be very long, expected to last one to two years. During this period we need to synchronize registration data from ZooKeeper to etcd in real time while ensuring data consistency and high availability. No tool on the market met our needs, so together with the Tencent Cloud TKE team we developed zk2etcd, a service that synchronizes ZooKeeper data to etcd, and have open-sourced it. The overall solution is described in detail in the implementation section.
How do we implement cross-region etcd disaster recovery?
With the zk2etcd synchronization service, we solved the ZooKeeper data migration problem, so that registry data for both new and old businesses is stored in etcd.
The importance of etcd is therefore self-evident: its availability determines our overall availability. Douyu Live's deployment architecture at the time relied heavily on a single core data center, and a failure of that data center would make the whole platform unavailable. The next pain point was therefore to improve etcd's availability, achieving cross-city, cross-region disaster-recovery capability for etcd.
The ideal etcd cross-city synchronization service for Douyu Live should have the following features:
- After etcd is deployed for cross-city disaster recovery, read/write performance does not deteriorate significantly and still meets the basic requirements of business scenarios.
- The synchronization component is production-ready, with complete consistency checking, logging, and metrics monitoring.
- Services without strong consistency requirements can access the etcd cluster in the nearest region, while services requiring strong consistency can access the primary etcd cluster.
- If the primary cluster fails, operations staff can quickly promote the standby cluster to primary based on consistency monitoring.

So what options are there, and what are the advantages and disadvantages of each? We evaluated the following schemes:
- Single-cluster multi-site deployment scheme
- etcd community make-mirror scheme
- etcd community Learner scheme
- Tencent Cloud etcd-syncer scheme

Single-cluster multi-site deployment scheme

The following figure shows the single-cluster multi-site deployment scheme:
In this scheme, the etcd leader node replicates data to the follower nodes in each region via the Raft protocol.
The advantages of this scheme are as follows:
After regional networks are interconnected, deployment is simple and no additional O&M components are required
Data is replicated synchronously across cities; in a three-node deployment, the failure of any single city can be tolerated without data loss.
After introducing its advantages, let’s look at its disadvantages, as follows:
In a three-node deployment, any write request must be acknowledged by at least two nodes. When the nodes are located in different cities, the ping latency rises from a few milliseconds to around 30 ms (Shenzhen to Shanghai), so write performance drops sharply.
By default, etcd reads are linearizable: after a follower receives a read request from a client, it must obtain state from the leader to make sure its local data has caught up before returning data to the client, so as to avoid stale reads. In this process, etcd read latency increases and throughput decreases. (A serializable-read sketch follows after these disadvantages.)
The quality of the network between cities fluctuates easily, causing jitter in service quality.
When a client accesses the etcd cluster, multiple etcd nodes must be configured to avoid a single point of failure, so the client may end up accessing a remote etcd node, increasing request latency.
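For reference, etcd lets a client relax linearizability on a per-request basis: a service that can tolerate slightly stale data could read from the nearest member with a serializable read. The sketch below is our illustration (endpoints are placeholders), not part of the deployment scheme itself.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoints for the etcd members in the local region.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"etcd-local-1:2379", "etcd-local-2:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// WithSerializable() lets the contacted member answer from its own data
	// without confirming with the leader, trading strict linearizability for
	// lower latency when reading from a nearby node.
	resp, err := cli.Get(ctx, "/dubbo/", clientv3.WithPrefix(), clientv3.WithSerializable())
	if err != nil {
		panic(err)
	}
	fmt.Println("keys under /dubbo/:", resp.Count)
}
```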
etcd community make-mirror scheme
After the single-cluster multi-site deployment scheme, let's look at the make-mirror scheme provided by the etcd community. Its schematic diagram is as follows:
In this scheme, independent etcd clusters are deployed in different cities, and cross-city data replication is achieved through the make-mirror tool provided by the etcd community.
The principles of the make-mirror tool are as follows:
- After the synchronization prefix is specified, all data under that prefix is traversed from the primary cluster through the etcd Range read interface and written to the destination etcd cluster (full synchronization).
- It then uses the etcd Watch interface, starting from the revision ("version number") returned by the read request, to listen for all change events after that revision. When make-mirror receives key-value change events pushed by the primary etcd cluster, it writes the data to the hot-standby cluster through the Txn transaction interface (incremental synchronization). A minimal sketch of this flow is given after the pros and cons below.

This scheme has the following advantages:
- The primary etcd cluster keeps high read/write performance and is not affected by cross-region network delay or network quality fluctuations.
- Services that can tolerate brief inconsistency can access the nearest etcd cluster; services that require strong consistency can access the primary etcd cluster over the intranet private line.

Its disadvantages are as follows:
- When a large number of write requests arrive, data in the standby cluster may lag, and stale (dirty) data may be read.
- If the make-mirror synchronization link is interrupted, the community tool re-enters full synchronization after exiting and restarting, which performs poorly and cannot meet production requirements.
- The community tool lacks features such as leader election, data consistency checking, logging, and metrics, and is not production-ready.
- Non-key-value data, such as auth data and lease data, cannot be synchronized.
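The following is a minimal sketch of the make-mirror flow described above, written by us for illustration (the community tool additionally batches incremental writes into Txn transactions and handles many edge cases omitted here).

```go
// Package mirror sketches the make-mirror flow: a full copy by prefix range
// read, then an incremental copy driven by a watch starting after that revision.
// Error handling, retries, Txn batching, lease and auth data are omitted.
package mirror

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func Mirror(ctx context.Context, src, dst *clientv3.Client, prefix string) error {
	// 1) Full synchronization: range-read everything under the prefix from the
	// primary cluster and write it to the destination cluster.
	resp, err := src.Get(ctx, prefix, clientv3.WithPrefix())
	if err != nil {
		return err
	}
	for _, kv := range resp.Kvs {
		if _, err := dst.Put(ctx, string(kv.Key), string(kv.Value)); err != nil {
			return err
		}
	}

	// 2) Incremental synchronization: watch for all changes after the revision
	// the full read observed, and replay each event on the destination cluster.
	wch := src.Watch(ctx, prefix, clientv3.WithPrefix(), clientv3.WithRev(resp.Header.Revision+1))
	for wresp := range wch {
		for _, ev := range wresp.Events {
			switch ev.Type {
			case clientv3.EventTypePut:
				_, err = dst.Put(ctx, string(ev.Kv.Key), string(ev.Kv.Value))
			case clientv3.EventTypeDelete:
				_, err = dst.Delete(ctx, string(ev.Kv.Key))
			}
			if err != nil {
				return err
			}
		}
	}
	return ctx.Err()
}
```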
etcd community Learner scheme
After the make-mirror scheme, let's look at the Learner scheme provided by the etcd community. Its schematic diagram is as follows:
Its core principles are as follows:
- The etcd Raft library has supported learner nodes since 2017; see PR 8751 for details.
- etcd 3.4, released in August 2019, officially introduced the learner node, which joins the cluster as a non-voting member: it does not take part in elections or other votes, and only replicates data.
- After receiving a write request, the leader synchronizes the log to the follower and learner nodes and maintains the log replication progress of each follower and learner in an in-memory data structure named Progress.
- When the gap between a learner's data and the leader's data is small, the learner can be promoted to a voting member and join the cluster normally. (A minimal sketch of the learner add/promote workflow is shown at the end of this subsection.)

The advantages of this scheme are as follows:
- A learner node can replicate any type of data from the etcd cluster, including key-value, auth, and lease data.

Its disadvantages are as follows:
- A learner node only allows serializable reads, so services reading from the nearest (learner) node may read stale data.
- Learner nodes are only supported in etcd 3.4 and later, and only one learner node is allowed.
- If the primary cluster fails, the learner node cannot be quickly promoted to a writable, independent etcd cluster.

After reviewing the existing schemes, we found that none of them met the demands of our production environment, so we implemented a production-ready etcd synchronization service ourselves, which is described in detail in the implementation section.
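For completeness, adding and promoting a learner is only a couple of client calls; the hedged sketch below (placeholder endpoints and peer URLs, no catch-up polling) shows the workflow on etcd 3.4+.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"etcd-primary-1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Add the remote-region node as a non-voting learner (etcd >= 3.4).
	addResp, err := cli.MemberAddAsLearner(ctx, []string{"http://etcd-remote-1:2380"}) // placeholder peer URL
	if err != nil {
		panic(err)
	}
	fmt.Printf("added learner member %x\n", addResp.Member.ID)

	// In practice you would wait until the learner's log has caught up with the
	// leader before promoting it to a voting member.
	if _, err := cli.MemberPromote(ctx, addResp.Member.ID); err != nil {
		panic(err)
	}
}
```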
How do we ensure the stability and operability of the etcd and zk synchronization services?
To ensure the stability of the etcd and zk synchronization services, we simulated five types of common faults and tested the services' ability to self-heal in these typical failure scenarios. The detailed test plan is as follows.
Fault scenarios
Redis flash disconnection (the zk2etcd service depends on Redis), for example during a Redis version upgrade or a non-smooth scale-out.
zk2etcd goes offline, for example due to OOM, container eviction, or host failure.
etcd2etcd goes offline, for example due to OOM, container eviction, or host failure.
The network is interrupted intermittently (flash disconnection).
Weak network environment, for example when the private line is down and traffic temporarily fails over to the public network.
The actual triggers for the above five scenarios vary, but only one trigger needs to be simulated for each.
Drill plan
Redis flash disconnection: modify the hosts file to make Redis unreachable; automatic correction stops. After the simulated Redis outage is restored, automatic correction resumes automatically.
zk2etcd offline: kill the container to simulate zk2etcd crashing; Kubernetes automatically restarts it within 15 seconds, after which synchronization is normal and data is consistent.
etcd2etcd offline: kill the container to simulate etcd2etcd crashing; Kubernetes automatically restarts it within 15 seconds, after which synchronization is normal and data is consistent.
Intermittent network disconnection: modify the hosts file to make zk and etcd unreachable, interrupting synchronization; then revert the hosts file to simulate network recovery. After recovery, synchronization is normal and data is consistent.
Weak network: switch to the public network to simulate a weak network environment; synchronization efficiency drops by less than 4x, and a full synchronization still completes within 1 minute.
In addition, to address operability, both the etcd and zk synchronization services provide detailed metrics and logs. We configured visual dashboards and alert policies for every core and exception scenario.
Implementation of the overall plan
The overall architecture
The multi-active architecture of the etcd clusters is shown in the following diagram:
Legend:
Black solid line: normal private line access
Black dotted line: Access through the public network
Solid red line: private line access after the active/standby switchover occurs in the ETCD cluster
Dotted red line: Public network access after active/standby switchover occurs in the ETCD cluster
The etcd2etcd/zk2etcd data synchronization service diagram is as follows:
zk synchronization service engineering practice
ZooKeeper and etcd have different storage structures, which makes synchronization difficult. ZooKeeper's storage is a tree structure, while etcd v3's is flat. ZooKeeper cannot list all keys by prefix the way etcd can, and etcd cannot list the children of a directory the way ZooKeeper's list-children can, which adds to the difficulty.
How do we detect data changes in ZooKeeper? Unlike etcd, a ZooKeeper watch cannot simply detect the addition of any key; we have to recursively watch all nodes, receive the ChildrenChanged events, list all children under the changed node, compare them with the data in etcd to find the new data, and synchronize it to etcd. Similarly, by recursively watching the deletion events of all nodes, the corresponding data can be deleted from etcd.
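The sketch below shows the idea of that recursive watch using the go-zookeeper client (our illustration; zk2etcd's actual implementation differs in details such as deduplication and session handling).

```go
// Package zkwatch sketches the recursive ZooKeeper watch described above using
// the go-zookeeper client. Re-registration on session loss, deduplication and
// cleanup of watchers for deleted nodes are omitted.
package zkwatch

import (
	"log"
	"time"

	"github.com/go-zookeeper/zk"
)

// WatchTree watches a node and, recursively, every child discovered under it.
// onChange is called with the current child list each time the node is (re)listed.
func WatchTree(conn *zk.Conn, path string, onChange func(path string, children []string)) {
	watched := map[string]bool{}
	go func() {
		for {
			// ChildrenW lists the children and arms a one-shot watch on the node.
			children, _, ch, err := conn.ChildrenW(path)
			if err != nil {
				log.Printf("watch %s: %v", path, err)
				time.Sleep(time.Second)
				continue
			}
			onChange(path, children)
			for _, c := range children {
				if !watched[c] {
					watched[c] = true
					WatchTree(conn, path+"/"+c, onChange) // recurse into newly seen children
				}
			}
			// Block until a NodeChildrenChanged event, then loop to re-list and
			// re-arm the watch (ZooKeeper watches are one-shot).
			<-ch
		}
	}()
}
```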
In addition, ZooKeeper watches have an inherent defect: a watch is one-shot, so it must be re-registered every time an event is received. In theory, events can be lost between two watch registrations, which mainly happens when the same key changes repeatedly in that window. Lost events would break data consistency, so we introduced automatic diff and correction: we compute the difference between the data in ZooKeeper and in etcd, and do so in two rounds each time, because when data changes frequently a single round of diff often contains some "false differences" caused by synchronization not being strongly consistent. Once the diff result is confirmed, the differences are repaired automatically.
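The correction pass can be thought of as a set difference evaluated twice: only keys missing in both rounds are treated as real differences and repaired. A simplified sketch (ours) of that idea:

```go
// Package synccheck sketches the two-round diff: a key is treated as a real
// difference only if it is missing from etcd in both rounds, which filters out
// "false differences" caused by writes still in flight during the first round.
package synccheck

// missing returns keys present in zkKeys but absent from etcdKeys.
func missing(zkKeys, etcdKeys map[string]bool) []string {
	var out []string
	for k := range zkKeys {
		if !etcdKeys[k] {
			out = append(out, k)
		}
	}
	return out
}

// TwoRoundDiff re-reads both sides twice via the snapshot functions and keeps
// only the differences that appear in both rounds; those are then repaired.
func TwoRoundDiff(snapshotZK, snapshotEtcd func() map[string]bool) []string {
	first := map[string]bool{}
	for _, k := range missing(snapshotZK(), snapshotEtcd()) {
		first[k] = true
	}
	var confirmed []string
	for _, k := range missing(snapshotZK(), snapshotEtcd()) {
		if first[k] {
			confirmed = append(confirmed, k)
		}
	}
	return confirmed
}
```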
How do we coexist with etcd2etcd? When the same path contains data written by both etcd2etcd and zk2etcd, zk2etcd's automatic correction logic would compute the difference and "correct" it, but we do not want it to delete data written by etcd2etcd. We solved this by introducing Redis for zk2etcd to store state: whenever zk2etcd writes or deletes data in etcd, it records or removes the corresponding entry in Redis as well:
Then, when zk2etcd's automatic correction computes the diff, it only considers data written by this tool, avoiding accidental deletion of data written by other synchronization tools.
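A hedged illustration of that bookkeeping (the Redis set name below is made up for the example; zk2etcd's actual data layout in Redis may differ): every key the tool writes to etcd is also recorded in a Redis set, and the correction pass only deletes keys that belong to that set.

```go
// Package ownership sketches recording which keys zk2etcd itself wrote, so the
// correction pass never deletes keys written by other tools such as etcd2etcd.
package ownership

import (
	"context"

	"github.com/redis/go-redis/v9"
	clientv3 "go.etcd.io/etcd/client/v3"
)

const ownedSet = "zk2etcd:written-keys" // hypothetical set name

// PutOwned writes the key to etcd and records it in Redis as "owned by zk2etcd".
func PutOwned(ctx context.Context, etcd *clientv3.Client, rdb *redis.Client, key, val string) error {
	if _, err := etcd.Put(ctx, key, val); err != nil {
		return err
	}
	return rdb.SAdd(ctx, ownedSet, key).Err()
}

// Owned reports whether a key was written by this tool and is therefore safe
// for the automatic correction logic to delete.
func Owned(ctx context.Context, rdb *redis.Client, key string) (bool, error) {
	return rdb.SIsMember(ctx, ownedSet, key).Result()
}
```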
etcd2etcd engineering practice
To solve the etcd synchronization problem, we investigated the following two solutions. Let's take a closer look at how they work:
etcd-syncer mirror-plus version
etcd-syncer mirror-plus is an enhanced version of the etcd community's make-mirror. To address the various limitations of make-mirror, it implements the following features and benefits:
- Multiple synchronization modes, including full synchronization and resumable (breakpoint) transfer, so private-line or public-network jitter is not a concern.
- High availability: instances replicating the same data path can be deployed as multiple replicas, enabling quick recovery.
- Consistency checking (full data check and snapshot check).
- Concurrent replication with multiple instances for better performance (different instances handle different paths). In production, it is recommended to run multiple instances, each responsible for a different path.
- Good operability: one-click deployment based on a Kubernetes Deployment, rich metrics and logging, and complete end-to-end test cases covering core scenarios (HTTP/HTTPS, service outages, network anomalies, and so on).

So what are its disadvantages? Because its core principle still depends on etcd's MVCC + watch features, the data cannot be guaranteed to be strongly consistent and only key-value data can be synchronized.
- Resumable transfer depends on how long MVCC historical versions are retained; ideally the business should keep at least one hour of history.
- When a large number of write requests arrive, data in the standby cluster may lag and stale data may be read.
- Non-key-value data, such as auth data and lease data, cannot be synchronized.
etcd-syncer Raft version
To solve the synchronization of all data types and remove the dependence on etcd MVCC historical data, Tencent Cloud also provides a Raft version of etcd-syncer, a synchronization solution based on Raft logs.
Its deployment diagram is shown below: the etcd-syncer synchronization service joins the primary etcd cluster as a learner node.
The leader of the primary etcd cluster synchronizes Raft log data to etcd-syncer via MsgApp/Snapshot messages; etcd-syncer parses the Raft log and applies the requests contained in the log entries, such as Txn/Delete/Auth, to the destination etcd cluster.
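A heavily simplified sketch of the apply side of such a Raft-log based syncer (our illustration, assuming the etcd v3.5 module layout; receiving MsgApp/Snapshot as a learner and handling Txn, lease, and auth requests are omitted):

```go
// Package applier sketches the apply half of a Raft-log based syncer: decode a
// normal Raft entry into etcd's internal request and replay it on the
// destination cluster.
package applier

import (
	"context"

	"go.etcd.io/etcd/api/v3/etcdserverpb"
	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/raft/v3/raftpb"
)

func ApplyEntry(ctx context.Context, dst *clientv3.Client, ent raftpb.Entry) error {
	if ent.Type != raftpb.EntryNormal || len(ent.Data) == 0 {
		return nil // config-change and empty entries are not replayed
	}
	var req etcdserverpb.InternalRaftRequest
	if err := req.Unmarshal(ent.Data); err != nil {
		return err
	}
	switch {
	case req.Put != nil:
		_, err := dst.Put(ctx, string(req.Put.Key), string(req.Put.Value))
		return err
	case req.DeleteRange != nil:
		_, err := dst.Delete(ctx, string(req.DeleteRange.Key))
		return err
	default:
		// Txn, lease and auth requests would be translated the same way (omitted).
		return nil
	}
}
```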
It has the following advantages:
It has all the features and benefits of the etcd-syncer mirror-plus version without relying on etcd MVCC historical data.
Because it synchronizes based on etcd's Raft log, it can synchronize key-value, auth, lease, and other types of data.
It does not depend on a newer version of etcd.
It comes with complete disaster-recovery tests.
grpc-proxy
This solution introduces the grpc-proxy service, which we were using for the first time. To understand its performance, we used etcd's benchmark tool to test reads and writes and hand-wrote a small tool to test watches. Part of the test data follows.
Write test
Direct access to the etcd service through its load-balancing entry point
Access to the etcd service through grpc-proxy
- grpc-proxy can write data normally whether its endpoints are configured with private-line or public-network addresses.
- For a fixed total number of written keys, the more connections and clients, the lower the total time.
- The larger the total number of written keys, the higher the average write latency.
- When 100,000 keys are written in a single run, the direct connection to etcdserver reports "too many requests", whereas grpc-proxy does not; however, grpc-proxy's performance over the public network is lower than over the private line.
- The average latency through grpc-proxy is higher than with a direct connection, but it meets the requirements.

Read test
Direct access to the etcd service through its load-balancing entry point
Access to the etcd service through grpc-proxy
- grpc-proxy can read data normally whether its endpoints are configured with private-line or public-network addresses.
- The average latency through grpc-proxy is higher than with a direct connection, but within an acceptable range.
Watch test
Using an etcdwatcher tool we wrote ourselves, we tested watch behaviour through grpc-proxy: the total number of watchers, the update frequency, and the test duration can be set, and a summary is printed at the end.
./etcdwatch -num=100 -span=500 -duration=10 -endpoint=http://grpc-proxy-addr:23791
test done
total 100 task
0 task failed
current revision is 631490
least revision is 631490
0 task is not synced
Parameter Description:
num: number of watcher tasks
span: update interval, in milliseconds
duration: total test duration, in seconds
current revision: the revision that was written
least revision: the lowest revision synchronized among the num tasks
If failed is 0, the result is normal; if "task not synced" is printed, the watch and the put are out of sync.
In the test results above, failed is 0 and the watch test is normal.
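The etcdwatch tool itself is not shown here, but the test it performs can be reproduced in a few dozen lines of clientv3 code. The sketch below is our re-creation under assumptions (placeholder proxy address, fixed constants instead of flags), not the tool's source.

```go
// Minimal re-creation of the watch lag test: one writer updates a key at a
// fixed interval through grpc-proxy, numWatchers watchers record the latest
// revision they have seen, and at the end the slowest watcher is compared with
// the last written revision.
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

const (
	numWatchers = 100
	span        = 500 * time.Millisecond // update interval
	duration    = 10 * time.Second       // total test time
	key         = "/watch-test"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints: []string{"http://grpc-proxy-addr:23791"}, // placeholder address
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), duration)
	defer cancel()

	seen := make([]int64, numWatchers) // latest revision observed by each watcher
	var wg sync.WaitGroup
	for i := 0; i < numWatchers; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			for resp := range cli.Watch(ctx, key) {
				seen[i] = resp.Header.Revision
			}
		}(i)
	}

	var written int64 // last revision produced by the writer
	ticker := time.NewTicker(span)
	defer ticker.Stop()
	for done := false; !done; {
		select {
		case <-ctx.Done():
			done = true
		case <-ticker.C:
			if resp, err := cli.Put(context.Background(), key, time.Now().String()); err == nil {
				written = resp.Header.Revision
			}
		}
	}
	wg.Wait()

	least := seen[0]
	for _, r := range seen {
		if r < least {
			least = r
		}
	}
	fmt.Printf("current revision %d, least watched revision %d\n", written, least)
}
```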
zk2etcd
We use version 1.2.5, deployed as a Kubernetes Deployment.
Simulating loss of the zk server connection
Scenario: inject an incorrect resolution address into the hosts file.
Symptoms: during the fault, zk disconnection errors were logged, while monitoring metrics showed no other anomalies; after the restart, the number of fixed operations did not spike. (Version 1.2.4 had a bug where full sync ran periodically but did not detect the keys that needed fixing; with that version a spike in fixed operations could be observed after restarting the zk2etcd service instance.)
Simulating loss of the Redis connection
10:07:34 Redis restored
10:16:00 Restarted the synchronization pod (restarted deliberately to check whether full sync works normally)
Symptoms: during the fault, the number of fixed operations did not increase, and no obvious anomaly appeared in other monitoring metrics; after the restart, the number of fixed operations did not spike either.
Simulating loss of the etcd connection
16:15:10 The etcd server is disconnected
The etcd connection is restored
The synchronization pod is restarted
Symptoms: the number of fixed operations did not increase during the fault, and no obvious anomaly appeared in other monitoring metrics.
After the restart, the number of fixed operations rose (it is unclear whether full sync had not taken effect earlier, or whether there simply happened to be updates to repair right after the restart).
Conclusions
As long as the full sync mechanism works properly, every abnormal scenario recovers after the next full sync is triggered.
The minimum recovery time depends on the configured full sync interval (5 minutes by default); the business can tune this parameter according to its tolerance.
In addition, to guard against the case where full sync runs on schedule but fails to detect an anomaly, the zk2etcd service can be restarted as soon as an anomaly is noticed.
In the additional test with etcd accessed over the public network, full sync still completed, but zk and etcd operations took longer (on the order of seconds) than over the intranet.
etcd2etcd
For the etcd2etcd synchronization service, we deploy two replicas via a Kubernetes Deployment.
Multi-replica backup capability
Expectation: the standby worker node takes over the synchronization task within 5 seconds.
Test plan: deploy etcd-syncer with two instances.
Kill the active worker node and observe.
Conclusion: active/standby switchover works properly during both incremental and full synchronization. (Note that if the switchover happens during full synchronization, the new instance continues with incremental synchronization, which may make synchronization slower.)
Resumable (breakpoint) transfer capability
Expectation: synchronization resumes from the breakpoint after recovery.
In the previous test, after the standby node switched to primary and took over synchronization, fast_path changed to 1, which already demonstrates resumable transfer. We added several additional validation scenarios:
(a) Short-term failure
Fault scenario
While synchronizing from the central etcd cluster to the hot-standby cluster, the synchronization service reported errors because the source (central) etcd cluster contained the -etcd-syncer-meta- key (a txn cannot contain the same key more than once), resulting in data differences.
Symptoms
After adding filtering of -etcd-syncer-meta- to the synchronization service's runtime parameters, we observed that, after a period of catching up, the miss count finally dropped and consistency was reached.
(b) Prolonged failure
Fault scenario
Stop the synchronization service's Deployment.
Restart the synchronization service only after data differences have accumulated and a compaction has occurred on the etcd clusters on both sides.
Symptoms
After the data differences and compaction occurred, the synchronization service was restarted; the logs showed that full synchronization was triggered because the requested revision had been compacted.
Synchronization service monitoring metrics: (a) dst miss keys dropped quickly; (b) src miss keys appeared and did not decline.
Analysis
While the synchronization service was stopped, the number of keys in the source etcd changed considerably; the monitoring graph shows a decrease during this period, indicating that keys were deleted.
This also exposed a small problem: when src miss keys occur, they cannot currently be repaired automatically, and manual intervention is required to clean up the redundant keys.
- Reset to trigger full synchronization
When a significant synchronization difference (for example, dst miss) occurs and needs emergency repair, full synchronization can be triggered by setting the --reset-last-synced-rev parameter, which discards the breakpoint-resume position.
Symptoms: due to an anomaly, a dst miss occurred (the yellow-line instance in the figure). To repair it, a new instance was run with the --reset-last-synced-rev parameter added.
Analysis
slow_path is 1, indicating that full synchronization was triggered (the green-line instance in the figure).
The dst miss value of the green-line instance does not grow, indicating that consistency has been reached.
- Network fault
The private line between the two etcd clusters is interrupted:
- during incremental synchronization
- during full synchronization
Test plan
When the private line is interrupted and access switches to the public network, the etcd cluster access addresses in the runtime parameters must be changed, which means a restart occurs. (The restart scenario has already been covered above and is not repeated here.)
Conclusions
The etcd-syncer synchronization service has a solid active/standby mechanism and can switch over in a timely and effective manner.
Resumable transfer after a short failure performs as expected; after a prolonged failure combined with the more complex compaction case, src miss may occur once synchronization resumes, and manual intervention may be required.
The --reset-last-synced-rev parameter works well for repairing src miss anomalies.
About us
For more cloud-native case studies and knowledge, follow the WeChat official account of the same name, [Tencent Cloud Native].
Bonus: reply "Manual" to the official account to receive the "Tencent Cloud Native Roadmap Manual" and "Tencent Cloud Native Best Practices".