Background

As a company’s business becomes more complex and its developer headcount grows, servitization becomes a way to decouple the business from the organizational structure. At the same time, servitization complicates the traffic path: instead of flowing one way from the access layer through the business layer to the persistence layer, traffic may now involve a synchronous call (RPC) or an asynchronous message (message queue) between any two internal services. How to design the traffic topology therefore becomes part of the servitization decision.

Besides the older bus topologies (represented by the ESB), the more likely candidates for contemporary servitization are star topologies (a central Router proxies the traffic between all services) and mesh topologies (services call each other directly, point to point). These two approaches correspond to the so-called server-side service discovery pattern and client-side service discovery pattern, respectively.

When Ele.me started servitization early on, it chose the mesh service topology. This choice has its advantages:

  1. Compared with a central proxy, point-to-point calls between services reduce wasted intranet bandwidth. For example, in containerized or virtualized scenarios the caller and callee may sit on the same host; with a central proxy, traffic that could have stayed local must “go around” the Router once.
  2. It also eliminates the cost and pitfalls of operating and maintaining a centralized proxy. Leaving the HA problem aside, even seamless deployment is hard: for example, when service A is released, reloading the central Router may cut off service B’s connections.

But this choice also increases the overall SOA architecture’s dependence on the service registry and service discovery facilities: the registry becomes an irreducible dependency of every service, and if it fails, all service calls are affected. Our production environment is not static; change happens all the time. Even without manual releases there are machines going up and down, containers drifting between hosts, and so on, and all of these rely on the registry to inform every caller through the register-and-discover mechanism.

This is why Ele.me built its service registration and discovery center as a key piece of middleware.

Existing solutions

Service registration and discovery solutions are common in the industry, many of them based on distributed coordinators such as ZooKeeper, etcd, or Consul. One likely reason is that these coordinators provide both replicated storage and a consistent view of ordering, which can be thought of as “atomicity” in a distributed environment. That matches the requirements of a service registry: throughput is not huge and the total data volume is modest, but the data must be safe, must never be lost, and whatever is registered must be discoverable.

In the early days of Ele.me (when there were only Python services), ZooKeeper was chosen for service registration and discovery, much as in many open-source SOA frameworks (such as Dubbo). Registration means writing the service node under the ZooKeeper directory corresponding to its app_id and cluster; if the write succeeds, the service has started successfully. Discovery means using ZooKeeper’s watch primitive to listen to the instance nodes under the other party’s app_id and cluster, and writing the discovered updates into a local file-system cache used by the caller’s connection pool.
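For illustration, a minimal sketch of that register-and-watch pattern with Kazoo might look like the following. The /apps/<app_id>/<cluster> path layout, the instance payload, and the callback wiring are assumptions made for this example, not the actual schema used back then.

```python
import json

from kazoo.client import KazooClient

# Assumed path layout for the example: /apps/<app_id>/<cluster>/<instance>
zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()


def register(app_id, cluster, host, port):
    """Register an instance as an ephemeral node; it vanishes with the session."""
    path = "/apps/%s/%s/%s_%d" % (app_id, cluster, host, port)
    payload = json.dumps({"ip": host, "port": port}).encode("utf-8")
    zk.create(path, payload, ephemeral=True, makepath=True)


def discover(app_id, cluster, on_change):
    """Watch the instance list and hand every change to the caller's callback."""
    path = "/apps/%s/%s" % (app_id, cluster)
    zk.ensure_path(path)

    @zk.ChildrenWatch(path)
    def _watch(children):
        # e.g. rewrite the local file cache consumed by the RPC connection pool
        on_change(children)
```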

Later, Ele.me introduced Java and Go services. Because we did not want to re-implement the ZooKeeper recipes in every language (anyone who has written a ZooKeeper recipe knows that writing a stable one without race conditions is a real challenge), the logic already implemented in the Python SDK was wrapped into a simple HTTP service, which “pushes” changes to Java, Go, and other services through HTTP Comet long polling to achieve service discovery.
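Seen from the Java or Go side, such a Comet consumer is essentially a long-poll loop. The sketch below shows the idea in Python for brevity; the /watch endpoint, its parameters, and the line-delimited JSON response format are hypothetical.

```python
import json

import requests

REGISTRY = "http://service-registry.example.com"  # hypothetical address


def watch(app_id, cluster, apply_update):
    """Hold a long-polling (Comet) request open and apply each pushed change."""
    while True:
        try:
            # The server holds the connection until a change happens or it times out.
            resp = requests.get(
                "%s/watch" % REGISTRY,
                params={"app_id": app_id, "cluster": cluster},
                stream=True,
                timeout=(3, 90),  # connect timeout, per-poll read timeout
            )
            for line in resp.iter_lines():
                if line:
                    apply_update(json.loads(line))  # e.g. refresh the local cache
        except requests.RequestException:
            continue  # reconnect; keep serving from the last known snapshot meanwhile
```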

It was simple, feasible, and satisfied our needs for a long time. But it had serious pitfalls and eventually gave us a serious P2 incident. That incident made us understand why Netflix developed Eureka instead of relying on ZooKeeper, and it pushed us toward the path of “ZooKeeper Recipe as a Service”.

Evolutionary path

Load – why not connect services directly to ZooKeeper

When Netflix developed Eureka, it had a pretty clear roadmap: a service registration and discovery center has a distinctive capacity profile. The capacity needed for registration is relatively fixed, because the number of services an IDC can host is relatively fixed (containerization may multiply the instance count, but still by a roughly fixed factor). The capacity needed for discovery, however, is hard to predict, because changes in online service dependencies cannot be foreseen in advance [1]. Every time service A takes on one more dependency, the demand on service discovery grows, and the number of services A depends on shifts with every business iteration. With more than 2,000 services releasing every day, it is impossible for the middleware team to predict and plan the total number of dependencies in advance.

Mapped onto ZooKeeper, this suggests an apparently easy way to scale: an odd number of voting members (five at Ele.me) handles “writes”, i.e. service registration, while “reads”, i.e. service discovery, are served by Observer nodes. A ZooKeeper Observer does not take part in voting; it forwards write requests to the Leader for a vote and answers read requests from its local state. But there are two problems with this approach.

First, ZooKeeper’s voting members are not themselves guaranteed to be highly available. If a member node has to be restarted during operations, hits a Full GC, or is separated from the Observers by a network partition, the Observers lose contact with the Leader. Once an Observer’s heartbeats keep failing past the tick cycle and it decides it has been disconnected from the Leader, it stops serving. That does not meet the “highly available” requirement of service discovery: we would prefer to keep a “last known” snapshot when the registry breaks down and continue to accept read-only discovery requests.

The other problem is more serious. The Observer takes the service discovery requests, yet ends up putting extra pressure on the service registration facility (the Leader and Followers that vote). When a ZooKeeper client connects to an Observer it has to establish a session; the session identifies the client in the transaction log and decides when its ephemeral nodes die, which is how ZooKeeper implements distributed locks and distributed membership. Precisely because of that, creating a session has to be written to the transaction log, and that write is voted on by the Leader. In some network-jitter scenarios a large number of clients lose their ZooKeeper sessions; once the network recovers, those clients all re-establish sessions at the same instant. The requests land on the Observers, but Observers cannot “write”, so only the Leader can handle them. Writing to the transaction log on the Leader is not only queued but also requires many I/Os to the Followers for voting, so the response is predictably slow. When requests get slow enough, clients time out. After a timeout a client cannot simply sleep a random number of seconds before reconnecting, because unavailable service discovery affects calls, so it has to rebuild the session immediately; the voting members’ outstanding queue grows even longer, and a vicious circle forms. If at that moment a business team releases without following the node-by-node gray-release strategy, its nodes cannot come back up after the restart: ZooKeeper cannot be written, so the nodes cannot register and therefore cannot be brought back into rotation. When we hit this problem, a critical business release unfortunately did not follow the gray-release strategy. The critical service was unavailable for a long time, which ended up as a P2 incident and caused the company real business loss.

In the post-mortem we found that some Observers had been connected to the main cluster across cities, and a long-distance network jitter triggered the starting condition of this vicious circle. After the incident we concluded that using ZooKeeper as the registry is workable, but when the number of services is very large we cannot let services connect to ZooKeeper directly through an SDK. So over the next few months we took the Python SDK that connected directly to ZooKeeper offline and moved Python services onto the same model as Java and Go: registering and discovering through an HTTP middleware. That HTTP middleware encapsulates the ZooKeeper recipes and connects to the ZooKeeper cluster itself. Since the Observer path was a dead end, we dropped it and let the HTTP service connect to the voting ensemble directly. Because the HTTP middleware (hereafter the service registry) has a limited number of nodes, service discovery no longer puts much pressure on the ZooKeeper cluster. In other words, the registry plays a role similar to a fanout exchange in message queues.

Huskar – ZooKeeper Recipe as a Service

Huskar is the project name of our service registry. The project has a long history and was established before I joined Ele.me; it was originally set up to expose the Python implementation of the ZooKeeper recipes to Java and Go services. After we concluded that “services should not connect directly to ZooKeeper”, it gradually grew into the unified service registry for the whole company.

Huskar’s implementation is a typical Gevent + Gunicorn stack for Python applications [2][3]: multiple processes, each running in-process coroutines to multiplex I/O. The whole thing is built on an internal application framework (Vespene, a.k.a. Zeus-Core), and the ZooKeeper client is the open-source Kazoo [4].
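For reference, a Gunicorn configuration of that shape might look like the sketch below; the numbers are illustrative placeholders, not Huskar’s production values.

```python
# gunicorn.conf.py: multiple worker processes, gevent coroutines inside each
bind = "0.0.0.0:8080"        # placeholder listen address
workers = 4                  # roughly one process per core
worker_class = "gevent"      # coroutine workers for many concurrent Comet connections
worker_connections = 10000   # simultaneous long-polling clients per worker
timeout = 0                  # do not kill workers that hold long-lived connections
```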

When I took over the project it had the basic functionality in place, but also a series of problems: uneven code quality and low test coverage. Given that the project grew out of rapid prototyping, those are understandable. The more serious issue, and the first one we addressed, was a race condition in its service discovery implementation. The service uses HTTP Comet to “push” change events for the services a client subscribes to, which means it must listen for events under multiple clusters of multiple app_ids. That is a recursive watch, and ZooKeeper offers no recursive-watch primitive (etcd seems to have learned this lesson and built one in), so it has to be implemented as a recipe. Originally, Huskar combined Kazoo’s two built-in recipes, ChildrenWatch and DataWatch, to implement recursive watching. Since all of this happens over network communication with ZooKeeper, the design inevitably lost messages and delivered them out of order, which is not good enough for a reliable service discovery middleware.
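A rough sketch of that kind of composition is shown below (this is illustrative, not Huskar’s actual code). The gap it exposes is that the children watch and the per-node data watches fire on independent callbacks with no ordering guarantee between them, and every children change registers data watches again.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181")
zk.start()


def watch_cluster(path, emit):
    """Naive recursive watch: one ChildrenWatch plus a DataWatch per child."""

    def watch_child(child_path):
        @zk.DataWatch(child_path)
        def _on_data(data, stat):
            # Fires independently of _on_children: between the two callbacks a
            # node may already be gone or recreated, so the emitted stream can
            # be duplicated, reordered, or missing steps of the real history.
            emit({"type": "data", "path": child_path, "data": data})

    @zk.ChildrenWatch(path)
    def _on_children(children):
        emit({"type": "children", "path": path, "children": children})
        for child in children:
            watch_child("%s/%s" % (path, child))
```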

When researching solutions we looked first at the community’s most mature ZooKeeper recipe library, Curator, originally developed by Netflix and later donated to the Apache Foundation. Curator’s TreeCache recipe matches this requirement nicely: it watches children recursively and also triggers data watches on those children. However, it does not forward watch events directly. Instead it maintains an in-memory snapshot, kept up to date while the ZooKeeper session is reliable, and when an update arrives it derives the actual change event by comparing meta information in the in-memory snapshot (for example, the zxid of a znode). This approach fits the service discovery scenario perfectly: it guarantees a correct, eventually consistent event stream, keeps the snapshot up to date while the session is reliable, keeps the “last known” snapshot readable when the session is lost, and, once the session resumes, pushes exactly the events that represent the difference by comparing against the snapshot.
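For a sense of what this looks like once ported to Kazoo [5], here is a minimal usage sketch; the /apps/demo-app path and the way events are printed are illustrative only.

```python
from kazoo.client import KazooClient
from kazoo.recipe.cache import TreeCache, TreeEvent

zk = KazooClient(hosts="zk1:2181")
zk.start()

# Mirror a whole subtree (e.g. every cluster and instance of one app) in memory.
cache = TreeCache(zk, "/apps/demo-app")


def on_event(event):
    if event.event_type == TreeEvent.INITIALIZED:
        print("snapshot ready, safe to serve discovery reads")
    elif event.event_type in (TreeEvent.NODE_ADDED,
                              TreeEvent.NODE_UPDATED,
                              TreeEvent.NODE_REMOVED):
        # A diff-derived event: this is what gets pushed out over HTTP Comet.
        print(event.event_type, event.event_data.path)
    elif event.event_type == TreeEvent.CONNECTION_SUSPENDED:
        # Session lost: keep answering reads from the last in-memory snapshot.
        print("serving the last known snapshot")


cache.listen(on_event)
cache.start()
```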

Therefore, wrapping TreeCache’s design into the Huskar service addresses the key issues we had encountered doing service discovery with ZooKeeper:

  1. Highly available and eventually consistent: TreeCache itself provides the guarantee;
  2. Scalable for service discovery: wrapping it as an HTTP service greatly reduces the number of ZooKeeper sessions, so no pressure falls on the voting members; and when discovery capacity runs short, Huskar, being a weakly stateful service (not completely stateless, because of the snapshot), is free to scale horizontally.

That left one last problem: Huskar is implemented in Python, and Kazoo had no TreeCache implementation. Predictably, we ended up porting TreeCache to Kazoo ourselves [5] (facepalm).

Effect evaluation

So far Huskar, the revamped service registration and discovery center, hosts more than 2,000 services across Ele.me, with close to a million instances, configuration items, and switches.

Ele.me currently runs two active-active data centers, each with an independent Huskar cluster. The ZooKeeper clusters underneath are bridged by replication middleware developed by the Beijing Technology Innovation Department, which provides non-strongly-consistent, bidirectional cross-region replication.

In terms of service discovery capacity, Huskar holds roughly 50,000 HTTP Comet long connections per data center each day, and the number keeps growing. At release peaks it pushes about 2k events per second.

Since the revamp went live we have found a few problems in the TreeCache implementation, fixed them, and sent the patches back upstream. Huskar currently has an annual uptime of 99.996% and a zero-incident record.

In the future

Once Huskar reached its stability goal, I began to focus more on productizing its back office (the company’s service configuration center) to strengthen our service governance capabilities. In the words of our director, SOA is easier to build than to govern; perhaps 80% of the effort goes into governance rather than development, especially while productization and platformization are still lacking.

Huskar, as a central “controller” of all services, should not stop at service registration and discovery. To borrow network engineering terminology: if the connection pools and soft load balancing in the SOA service framework (or sidecar process) are the “data plane” that delivers SOA traffic, then Huskar’s role is the “control plane”.

We have already made a few attempts in this direction, and some of them have landed. For example, node changes in service discovery are pushed by Huskar over HTTP Comet; under centralized configuration control, when service A asks to discover service B, Huskar decides which of service B’s cluster member lists to push, achieving “cluster routing” from the control plane, as sketched below. Previously a caller had to specify both the app_id and the cluster it wanted; now it only reports its own app_id and the app_id it wants to call, and Huskar, as the control plane, decides which cluster the traffic goes to. The service provider, and the SRE team responsible for site stability, can then change cluster traffic policies without the callers having to care at all.
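As a hypothetical illustration of that control-plane decision (the rule format and names below are invented for the example, not Huskar’s real data model), the routing step boils down to a lookup keyed by caller and callee:

```python
# Hypothetical centrally managed rules: (caller app_id, callee app_id) -> cluster.
ROUTE_RULES = {
    ("app.order", "app.payment"): "cluster-canary",
}
DEFAULT_CLUSTER = "cluster-stable"


def resolve_cluster(caller_app_id, callee_app_id):
    """Control-plane decision: which cluster's member list this caller receives."""
    return ROUTE_RULES.get((caller_app_id, callee_app_id), DEFAULT_CLUSTER)


def discover(caller_app_id, callee_app_id, member_lists):
    """Return the instance list to push, chosen by the control plane."""
    cluster = resolve_cluster(caller_app_id, callee_app_id)
    return member_lists[callee_app_id][cluster]
```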

However, we are still at an early stage of this experiment. Some service control planes in the industry achieve extremely fine-grained traffic control: they can drive canary releases, and can even integrate with the release platform and container platform to automatically and gradually shift traffic onto a new codebase, rolling back automatically when problems occur. Those who follow the servitization field will know that I am talking about Netflix [6]. Ele.me still has a long way to go to reach that industry benchmark, and that is the direction we will work toward.

About the author

Jiangge Zhang (pine rat oreo & tonyseek) joined Ele.me in March 2016 and works in the framework tools department. I currently lead a three-person project team in Beijing responsible for the service registration and configuration management center (Huskar) project. At the moment I pay most attention to SOA-related topics such as configuration governance and traffic governance, and I try to practice technology-middleware productization in a Technology Product Designer/Manager role.

References

  • [1]: Eureka 2.0 Architecture Overview · Netflix/eureka Wiki · GitHub
  • [2]: What is gevent? – gevent 1.3.2.dev0 documentation
  • [3]: Gunicorn – Python WSGI HTTP Server for UNIX
  • [4]: GitHub – python-zk/kazoo: Kazoo is a high-level Python library that makes it easier to use Apache ZooKeeper.
  • [5]: kazoo.recipe.cache – Kazoo 2.4.0 documentation
  • [6]: Automated Canary Analysis at Netflix with Kayenta – Netflix TechBlog – Medium