
ZooKeeper link avalanche: a review

The points above may all sound reasonable, yet feel abstract if you have never lived through them.

I, however, personally experienced a full link avalanche during JD's 2015 promotion, and it left a deep impression on me.

Even though, at the time, I was just a newcomer who had only recently started working.

Historical review:

That sunny morning, with the promotion already plastered everywhere, my colleagues and I were seated in front of our computers early, watching the system metrics.

Around 9 or 10 o'clock, alarms suddenly multiplied across part of the system: some machines kept alerting that they could not connect to the registry.

The sale was live. Come on, restart them and see if that fixes it.

It didn't. More services and more nodes failed, and the failures spread from one machine room to all nodes.

We realized the registry itself must be down.

We restarted again and again, trying to reconnect to the registry, but to no avail.

Later, colleagues on the platform team told us not to restart for now and to wait for notice. After a long wait we were finally cleared to restart, and sure enough everything reconnected. But by then it was far too late…

What really happened:

First, a registry node went down. Whether it was the sheer number of nodes added for the flash-sale scale-out, or saturated bandwidth, is still unknown; in short, it went down.

Because ZooKeeper guarantees CP, only the nodes that can still reach the Leader continue to provide service. So although the service providers in the faulty machine room were themselves healthy, they could not be discovered and therefore could not serve traffic.
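As a rough illustration, here is a minimal sketch (hypothetical hosts, not JD's code) of how this looks from the application side: once the ZK server a client talks to loses the Leader, the client gets a Disconnected event, and service lookups simply stop working.

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class RegistryConnectionDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical ensemble; 15s session timeout.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                switch (event.getState()) {
                    case SyncConnected:
                        System.out.println("connected: registry usable again");
                        break;
                    case Disconnected:
                        // Our ZK server lost the Leader/quorum. Being CP, it
                        // stops answering rather than risk serving stale data,
                        // so lookups fail even though providers are healthy.
                        System.out.println("disconnected: lookups will fail");
                        break;
                    case Expired:
                        // Session gone: all ephemeral registrations are deleted
                        // and must be re-created after reconnecting.
                        System.out.println("session expired: must re-register");
                        break;
                    default:
                        break;
                }
            }
        });
        Thread.sleep(Long.MAX_VALUE); // keep the demo alive to observe events
    }
}
```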

User requests, however, did not let up. Huge volumes of traffic were diverted to the servers in the healthy machine rooms, which buckled under the load and dragged the registry down with them.

And ZK does not guarantee availability: while electing a Leader, for example, it cannot serve requests at all.

So when large numbers of business systems restarted and tried to reconnect to the registry at the same time, they either failed to connect, or their flood of writes to re-register service nodes knocked the registry over once more.

After all, guaranteeing globally unique node creation under high concurrency costs ZooKeeper a great deal of system resources.
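To make that cost concrete, here is a minimal sketch (not JD's actual code; the /services path layout is assumed) of what every restarting instance does against ZooKeeper:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ServiceRegistrar {
    // Called by every instance on startup; assumes the parent path
    // /services/<service> already exists.
    static void register(ZooKeeper zk, String service, String hostPort)
            throws KeeperException, InterruptedException {
        String path = "/services/" + service + "/" + hostPort;
        try {
            // An ephemeral-node create is a quorum write: the Leader must
            // serialize it and replicate it to a majority before the client
            // gets an ack. Thousands of these at once is the write storm.
            zk.create(path,
                      hostPort.getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.EPHEMERAL); // auto-deleted when the session dies
        } catch (KeeperException.NodeExistsException e) {
            // A concurrent retry already created the node; uniqueness is
            // enforced by the Leader ordering all creates globally.
        }
    }
}
```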

A vicious circle set in: the more we restarted, the less able the registry was to come back up…

Which is why, when the platform team later had everyone restart in batches, the registry returned to normal.

0.5 Registry selection

To sum up: a registry should favor AP, and its design must account for capacity expansion and disaster recovery.

JD's registry optimization scheme:

Replace ZooKeeper's tree model with a KV model backed by MySQL + Redis, and have clients address sharded registries via a registry-addressing layer, enabling horizontal scaling and disaster recovery.
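A minimal sketch of the idea (illustrative only, not JD's implementation; in the real design each shard would be backed by MySQL for durability plus Redis for hot KV reads): the client hashes the service name to address a shard, and registrations become flat KV writes instead of tree operations.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ShardedKvRegistry {
    private final Map<String, Set<String>>[] shards;

    @SuppressWarnings("unchecked")
    public ShardedKvRegistry(int shardCount) {
        shards = new Map[shardCount];
        for (int i = 0; i < shardCount; i++) {
            shards[i] = new ConcurrentHashMap<>();
        }
    }

    // Addressing step: hash the service name to pick a shard, so capacity
    // grows by adding shards instead of scaling one global tree.
    private Map<String, Set<String>> shardFor(String service) {
        return shards[Math.floorMod(service.hashCode(), shards.length)];
    }

    public void register(String service, String hostPort) {
        // A flat KV write: service name -> set of instance addresses.
        shardFor(service)
                .computeIfAbsent(service, k -> ConcurrentHashMap.newKeySet())
                .add(hostPort);
    }

    public Set<String> lookup(String service) {
        return shardFor(service).getOrDefault(service, Set.of());
    }
}
```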

Eureka 2.0's approach:

Eureka 2.0's design separates read clusters from write clusters, an approach also adopted by Ant's open-source SOFARegistry.
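Sketched in code, the idea looks roughly like this (names are hypothetical, not the actual Eureka 2.0 or SOFARegistry API): registrations and heartbeats go to a small write cluster, while lookups are served by a read cluster that can be scaled out independently.

```java
import java.util.List;

public class SplitRegistryClient {
    private final List<String> writeCluster; // accepts registrations/heartbeats
    private final List<String> readCluster;  // serves lookups, scaled independently

    public SplitRegistryClient(List<String> writeCluster, List<String> readCluster) {
        this.writeCluster = writeCluster;
        this.readCluster = readCluster;
    }

    public void register(String service, String hostPort) {
        // Writes go only to the small write cluster, which pushes the
        // resulting data out to the read cluster asynchronously.
        send(writeCluster.get(0), "REGISTER " + service + " " + hostPort);
    }

    public void lookup(String service) {
        // Reads never touch the write cluster, so a registration storm
        // cannot starve discovery traffic.
        send(readCluster.get(0), "LOOKUP " + service);
    }

    private void send(String node, String command) {
        System.out.printf("-> %s : %s%n", node, command); // stand-in for a real RPC
    }
}
```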

In fact, a quick survey shows that today's popular open-source registries are all moving in the same direction: high availability, scalability, and disaster recovery.