At 6 o'clock on Sunday morning I received an alarm call from the company: the number of anomalies had exceeded the threshold. The group leader then said something was wrong on the network side and asked people to look into the cause. After connecting to the VPN, we found that the HBase Table object could not be obtained and the HBase cluster could not be connected.

Our group is a service group; the remaining problems could only be handled by the architecture group, which is specifically responsible for HBase. Still, I was interested in the technology and wanted to understand the cause of the problem, so I consulted the architecture team.

When it comes to communication, that’s another story.

Before troubleshooting, it helps to read up on the HBase architecture and understand how it works. When the Master fails, HBase elects a new Master; when a RegionServer fails, HBase also performs fault recovery. In other words, HBase is highly available.

But if the ZooKeeper cluster that HBase depends on goes down, is the HBase cluster still available?

The problem

Network jitter triggered a Leader election in the ZooKeeper cluster, and the ZK cluster was unavailable for the duration of the election. Services that depend on ZK were affected as a result, and the whole HBase cluster failed.
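This also explains the symptom we saw first: the HBase client discovers the cluster through ZooKeeper, so when the ZK quorum is unreachable the application cannot even obtain a Connection or a Table. Below is a minimal sketch of that dependency; the quorum hosts and table name are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseViaZkDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // The HBase client locates the cluster (hbase:meta, active Master) through
            // ZooKeeper, so an unreachable ZK quorum surfaces as "cannot connect" or
            // "Table not found" on the application side.
            conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");        // hypothetical hosts
            conf.set("hbase.zookeeper.property.clientPort", "2181");

            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("demo_table"))) {
                Result result = table.get(new Get(Bytes.toBytes("row-1")));
                System.out.println("got row: " + result);
            }
        }
    }

With the quorum down, the lookup through ZooKeeper fails and the client never reaches a RegionServer, which matches the "cannot connect to the HBase cluster" symptom.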

Why did it take more than an hour to recover the HBase cluster? The architecture team said something was wrong with their alerting, which is why the process was slow, and that the details would be covered in an incident report.

Even after repeated questioning, that was all the information I could get for now; the people in Beijing tend to ignore questions asked over QQ, which is really frustrating. Since we could not get the real story from them, we had to do our own analysis.

Analysis

How long does the ZooKeeper Leader election take?

How was the HBase cluster recovered?

ZK self-test

I tested this on my local machine. Re-electing the ZooKeeper Leader itself takes very little time, but if there is a lot of data, synchronizing it to the other nodes can take much longer. If it takes far too long, though, that usually points to a design that was not thought through.

In addition, when the ZooKeeper cluster goes from available to unavailable, the connection between the client and the ZK server is broken and the client keeps retrying it.

If reconnection takes longer than the session timeout, the server treats the session as expired and cleans it up. After the session expires, the server notifies the client that the session is gone and a new one must be established; the client then has to restore its ephemeral (temporary) data itself.

But if the ZK server cluster is unavailable and cannot be reconnected at all, then… the client simply keeps retrying until a connection finally succeeds.
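To make the client side of this concrete, here is a minimal sketch of a ZooKeeper client reacting to the states described above (Disconnected, reconnected, Expired) and recreating its ephemeral node after a session expires. The connect string, session timeout, and znode path are hypothetical, and error handling is kept to a minimum.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkSessionDemo implements Watcher {
        // Hypothetical connect string, session timeout, and ephemeral path.
        private static final String CONNECT = "zk1:2181,zk2:2181,zk3:2181";
        private static final int SESSION_TIMEOUT_MS = 30_000;
        private static final String EPHEMERAL_PATH = "/demo-worker-1";

        private volatile ZooKeeper zk;

        public static void main(String[] args) throws Exception {
            ZkSessionDemo demo = new ZkSessionDemo();
            demo.connect();
            Thread.sleep(Long.MAX_VALUE);   // keep the process alive to observe events
        }

        private void connect() throws Exception {
            zk = new ZooKeeper(CONNECT, SESSION_TIMEOUT_MS, this);
        }

        @Override
        public void process(WatchedEvent event) {
            switch (event.getState()) {
                case Disconnected:
                    // Connection lost: the client library retries on its own; the session
                    // (and its ephemeral nodes) survives if it reconnects within the timeout.
                    System.out.println("disconnected, waiting for automatic reconnect");
                    break;
                case SyncConnected:
                    // Connected or reconnected: make sure our ephemeral node exists.
                    System.out.println("connected");
                    recreateEphemeral();
                    break;
                case Expired:
                    // Reconnect took longer than the session timeout: the server has cleared
                    // the session and deleted its ephemeral nodes, so build a new session
                    // and restore the ephemeral data ourselves.
                    System.out.println("session expired, creating a new session");
                    try {
                        connect();
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                    break;
                default:
                    break;
            }
        }

        private void recreateEphemeral() {
            ZooKeeper client = zk;
            if (client == null) {
                return;   // the first SyncConnected can race with the constructor
            }
            try {
                if (client.exists(EPHEMERAL_PATH, false) == null) {
                    client.create(EPHEMERAL_PATH, new byte[0],
                            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                }
            } catch (KeeperException | InterruptedException e) {
                e.printStackTrace();
            }
        }
    }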

Client log:

Server log:

When the ZK node rejoins the cluster, the log shows that the Leader election for this node took 295 s, almost five minutes. A freshly added node, however, spends only about 0.2 s on Leader election.

How long a ZK Leader re-election takes is related to the amount of data stored in ZK, and it also depends on network conditions and cluster size. The exact duration is hard to estimate, but in most cases it completes within 10 s. There can always be surprises, of course.
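Since the duration is hard to pin down in general, one simple way to measure it from the client's point of view is to record when a Disconnected event arrives and how long it takes until the next SyncConnected. A rough sketch along those lines, again with a hypothetical quorum:

    import java.util.concurrent.atomic.AtomicLong;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    /** Rough client-side measurement of how long ZK looks unavailable. */
    public class ZkOutageTimer {
        public static void main(String[] args) throws Exception {
            AtomicLong disconnectedAt = new AtomicLong(0);

            Watcher watcher = (WatchedEvent event) -> {
                long now = System.currentTimeMillis();
                switch (event.getState()) {
                    case Disconnected:
                        disconnectedAt.set(now);
                        System.out.println("lost connection at " + now);
                        break;
                    case SyncConnected:
                        long start = disconnectedAt.getAndSet(0);
                        if (start > 0) {
                            System.out.println("reconnected after " + (now - start) + " ms");
                        }
                        break;
                    default:
                        break;
                }
            };

            // Hypothetical quorum; keep a reference so the client stays alive.
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30_000, watcher);
            Thread.sleep(Long.MAX_VALUE);
        }
    }

Killing or restarting ZK nodes while this runs shows how long clients actually sit without a connection during an election.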

Real situation

The ZK cluster failure was caused by network jitter.

The following comes from the company's internal troubleshooting records; the convention there is to state the fault and its impact first.

Fault: communication within the cluster was interrupted, causing the servers to re-elect a Leader, which took 21,672 ms.

Impact: the RegionServers of the HBase cluster and the YARN cluster went down.

Although the Leader election itself took only about 30 seconds, latency in the cluster stayed high until the ZK cluster was working normally again; for about 5 minutes, communication within the cluster was intermittent, which can be seen in the logs.

HBase recovery

At 6:35 a.m., Zhang Jun called to say there was a problem with HBase: the RegionServers were offline.

Cause: the Replication module stores its replication offsets in ZK. ZK was unavailable because of the network jitter, and each RegionServer failed to connect to ZK four times in a row, sleeping one second between attempts, before going down.

Solution: HBase was restarted at 6:48 a.m. Opening the regions took about 50 minutes, and the impact lasted until around 7:30 a.m.
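The four attempts with a one-second sleep mentioned in the cause above correspond to HBase's retry wrapper around ZooKeeper. As far as I know, these knobs (and the ZK session timeout) are plain configuration properties; a hedged sketch, with property names as I recall them and values chosen only for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ZkRetryTuning {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();

            // How long a ZK session may stay disconnected before it expires.
            conf.setInt("zookeeper.session.timeout", 90_000);

            // Retry behaviour of HBase's RecoverableZooKeeper wrapper: number of retries
            // and the sleep between them (property names as I recall them; verify against
            // your HBase version before relying on this).
            conf.setInt("zookeeper.recovery.retry", 6);
            conf.setInt("zookeeper.recovery.retry.intervalmill", 1000);

            System.out.println("retries = " + conf.get("zookeeper.recovery.retry"));
        }
    }

Longer session timeouts and more retries trade slower failure detection for more tolerance of short ZK outages like this one.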

Reflection:

  • The HBase outage was not discovered first by the people responsible for HBase; they were notified by the service side. No alarm was received because alarming had been turned off, which is a serious operational mistake.
  • Opening regions takes time, yet many regions are not urgent to open; for example, regions that do not belong to the current month's tables see very little traffic.

Future measures to be taken:

  • Alarms have been re-enabled. Turning alarms off should be restricted, and the system should be improved so that they cannot simply be switched off.
  • Adjust the priority of region opening, plus some framework optimizations.

Finally

The exact cause will have to wait until the architecture team writes up the incident report.

References

  • ZooKeeper tutorial
  • Scenario in which HBase hangs up
  • Recovery process of an HBase cluster crash