A service I'm responsible for started throwing a lot of Redis connection timeout exceptions (redis.clients.jedis.exceptions.JedisConnectionException) during a stress test shortly before the rush hour. According to the original business rule, data is first queried from the database and then cached in Redis with a 3-minute expiry.
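For context, here is a minimal sketch of that cache-aside flow with Jedis. The host, the key, and the loadFromDatabase helper are hypothetical, just to illustrate the "query DB, then cache for 3 minutes" rule; this is not the actual business code.

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

public class ProductCache {
    private static final int TTL_SECONDS = 3 * 60; // 3-minute expiry, per the business rule
    private final JedisPool pool = new JedisPool("redis-host", 6379); // hypothetical host

    public String get(String key) {
        try (Jedis jedis = pool.getResource()) {
            String cached = jedis.get(key);
            if (cached != null) {
                return cached;                        // cache hit
            }
            String value = loadFromDatabase(key);     // miss: fall back to the database
            jedis.setex(key, TTL_SECONDS, value);     // cache the result for 3 minutes
            return value;
        }
    }

    private String loadFromDatabase(String key) {
        // placeholder for the real DB query
        return "value-of-" + key;
    }
}
```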

In addition, because of the nature of the business, no degradation or rate-limiting measures were in place, and peak QPS was close to 20,000. Although this was only an exception and did not make the main path of the flow unavailable, we still needed to find the root cause and fix it.

First, we went to the colleagues in charge of Redis to investigate. They told us Redis is perfectly stable and there is no problem: there was no problem in the past, there is none now, and there never will be.

……

That sounded reasonable, and I had nothing to refute it with.

In that case, we had to look for the cause on our own side.

Troubleshooting approach

Viewing Exception Distribution

First, based on experience, we look at our own servers to see which machines are actually throwing the exceptions. In the monitoring system we switch to the per-machine view and check whether the exceptions are evenly distributed. If the distribution is uneven and only a few hosts are extremely high, we can basically narrow the problem down to those machines.

Ah, that feels good. That's exactly what we saw: only a few machines had unusually high exception counts.

But we can't stop there, so let's keep walking through the troubleshooting approach……

Redis situation

Again, based on experience: even though the colleagues in charge of Redis kept insisting that Redis was rock solid, we approached it with suspicion and didn't just take their word for it. This matters, folks, especially at work: don't believe whatever other people tell you. Be like Conan: when something goes wrong, everyone is a suspect. Except yourself, of course; be firm in believing that this is definitely not your fault!

Well, let's check whether any Redis cluster node is under excessive load, using, say, 80% as a rule-of-thumb threshold.

If only one or a few nodes exceed the threshold, there may be a hot-key problem. If most nodes exceed it, Redis as a whole is under heavy pressure.

In addition, check whether there are slow requests. If there are slow requests and their timing lines up with the exceptions, there may be a big-key problem.
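A minimal sketch of doing that spot check from Java with Jedis: pull the slow log and check the memory footprint of a suspect key. The host and key name are hypothetical, and the Slowlog class package differs between Jedis versions, so treat the exact imports as an assumption.

```java
import java.util.List;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.resps.Slowlog; // redis.clients.jedis.util.Slowlog in older Jedis versions

public class RedisHealthCheck {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("redis-host", 6379)) { // hypothetical host
            // SLOWLOG GET: recent commands slower than the slowlog-log-slower-than threshold
            List<Slowlog> slow = jedis.slowlogGet(10);
            for (Slowlog entry : slow) {
                System.out.printf("id=%d time=%d cost=%dus args=%s%n",
                        entry.getId(), entry.getTimeStamp(),
                        entry.getExecutionTime(), entry.getArgs());
            }

            // Spot-check a suspect key for the big-key problem (key name is made up)
            Long bytes = jedis.memoryUsage("hot:product:detail");
            System.out.println("MEMORY USAGE hot:product:detail = " + bytes + " bytes");
        }
    }
}
```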

Hmm……

Redis really was fine. Redis was as steady as an old dog.

CPU

Suppose we're still stuck and don't know what the problem is. No worries, let's look for someone else to blame next: check how the CPU is doing. Maybe ops secretly gave us machines with the wrong spec.

Check how high CPU usage is: whether it exceeds 80%, or, based on experience, whether it is well above the roughly 60% our service usually hits at peak.

Also check whether the CPU is being throttled, and whether the throttling is frequent and long-lasting.

If any of these show up, the blame probably lies with ops: they haven't given us enough machine resources.
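On a containerized host, one rough way to check CPU throttling is to read the cgroup's cpu.stat counters. This is only a sketch: the cgroup v1 path below is an assumption and may differ on your hosts (on cgroup v2 the file is /sys/fs/cgroup/cpu.stat and the time field is throttled_usec).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class CpuThrottleCheck {
    public static void main(String[] args) throws Exception {
        // cgroup v1 CFS statistics for the current container
        Path stat = Paths.get("/sys/fs/cgroup/cpu/cpu.stat");
        List<String> lines = Files.readAllLines(stat);

        long periods = 0, throttled = 0, throttledTimeNs = 0;
        for (String line : lines) {
            String[] parts = line.split("\\s+");
            switch (parts[0]) {
                case "nr_periods":     periods = Long.parseLong(parts[1]); break;
                case "nr_throttled":   throttled = Long.parseLong(parts[1]); break;
                case "throttled_time": throttledTimeNs = Long.parseLong(parts[1]); break; // nanoseconds
            }
        }

        double throttledRatio = periods == 0 ? 0 : (double) throttled / periods;
        System.out.printf("throttled %d of %d periods (%.2f%%), total throttled time %.1f ms%n",
                throttled, periods, throttledRatio * 100, throttledTimeNs / 1_000_000.0);
    }
}
```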

GC pause

Well, ops got off the hook this time.

Let’s see what GC looks like.

Frequent GC and long GC pauses can keep the worker threads from reading the Redis response in time.

How do we judge whether the numbers are too high?

Roughly, again based on hard-won experience, we can compute the average pause per GC as the total GC time per minute divided by the number of GCs in that minute; once that reaches the millisecond level, the impact on Redis read/write latency becomes significant.
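A minimal sketch of computing that average with the standard GarbageCollectorMXBean counters, sampling one minute apart. The one-minute interval is just for illustration, and note that getCollectionTime is an approximate cumulative elapsed time, not a strict pause time for concurrent collectors.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPauseEstimator {
    public static void main(String[] args) throws InterruptedException {
        long countBefore = 0, timeBefore = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            countBefore += gc.getCollectionCount();
            timeBefore += gc.getCollectionTime();   // cumulative, in milliseconds
        }

        Thread.sleep(60_000);                        // sample over one minute

        long countAfter = 0, timeAfter = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            countAfter += gc.getCollectionCount();
            timeAfter += gc.getCollectionTime();
        }

        long gcCount = countAfter - countBefore;
        long gcTimeMs = timeAfter - timeBefore;
        double avgPauseMs = gcCount == 0 ? 0 : (double) gcTimeMs / gcCount;
        System.out.printf("GCs/min=%d, GC time/min=%dms, avg pause=%.1fms%n",
                gcCount, gcTimeMs, avgPauseMs);
    }
}
```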

To be safe, we also compare against historical monitoring and confirm that the current level is roughly in line with the past.

Okay, excuse me. Let’s move on.

Network

For the network, we mainly look at the TCP retransmission rate, which most larger companies already have monitoring for.

TCP retransmission rate = number of TCP packets retransmitted per unit time / total number of TCP packets sent in that time

We can think of TCP retransmission rate as a simple measure of network quality and server stability.

In our experience, the lower the TCP retransmission rate, the healthier the network. If it stays above 0.02% (a threshold based on our own environment), or suddenly spikes, we can start to suspect a network problem.
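A minimal sketch of computing that rate on a Linux host by sampling the kernel's TCP counters in /proc/net/snmp (RetransSegs over OutSegs). The 10-second sampling window is arbitrary, and the path and field names are Linux-specific.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class TcpRetransRate {
    // Returns {OutSegs, RetransSegs} read from /proc/net/snmp (Linux only)
    static long[] readTcpCounters() throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("/proc/net/snmp"));
        String[] header = null;
        for (String line : lines) {
            if (!line.startsWith("Tcp:")) continue;
            String[] fields = line.split("\\s+");
            if (header == null) { header = fields; continue; }   // first "Tcp:" line holds field names
            int outIdx = Arrays.asList(header).indexOf("OutSegs");
            int retransIdx = Arrays.asList(header).indexOf("RetransSegs");
            return new long[]{Long.parseLong(fields[outIdx]), Long.parseLong(fields[retransIdx])};
        }
        throw new IOException("Tcp counters not found");
    }

    public static void main(String[] args) throws Exception {
        long[] before = readTcpCounters();
        Thread.sleep(10_000);                                     // sample over 10 seconds
        long[] after = readTcpCounters();

        long sent = after[0] - before[0];
        long retrans = after[1] - before[1];
        double rate = sent == 0 ? 0 : (double) retrans / sent;
        System.out.printf("retransmission rate = %d / %d = %.4f%%%n", retrans, sent, rate * 100);
    }
}
```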

If the chart looks like an electrocardiogram, like the one we saw, a network problem is pretty much inescapable.

Container host

Some machines may be virtual machines (VMs), so their CPU monitoring metrics can be inaccurate, especially in I/O-intensive scenarios. In that case you need other means to check the state of the underlying host.

Finally

After this series of fancy moves, starting from the machines we had pinpointed and ruling out a pile of other causes, we finally located the problem: a few individual machines had an abnormally high TCP retransmission rate at peak time. Following the solution provided by ops, [restart the problem machines], we got the problem nicely resolved.

However, this is only treating the symptom. How do we fix it for good?

That's what I wrote about in my other article: has anyone ever told you how to solve the trickier problem of cache penetration?