Author | interface rompin (English name Leo), a coder

The Arthas official community is running a prize essay contest. [Click to submit]

1. The anomaly appears

The CPU usage of one RegionServer in our HBase cluster suddenly spiked to 100%. After that RegionServer was restarted, its CPU load gradually climbed back to the peak. Even after restarting the cluster several times, the CPU load stayed high; RegionServers kept going down, and the HBase cluster was on the verge of being unusable.

2. What the anomaly looked like

On the CDH monitoring page, almost all core metrics except CPU looked normal: disk and network I/O were very low, memory was sufficient, and the compaction queue and flush queue were normal as well.

The Prometheus monitoring showed much the same picture, so no screenshots are included here.

The numbers in the monitoring metrics only show us the symptoms, not the cause of the anomaly. So our next reaction was to look at the logs.

(Screenshot from our WeCom / Enterprise WeChat discussion)

Meanwhile, the log contained a lot of noisy output like this.

It turned out that this output was just irrelevant information; it did not help us analyze the problem and even got in the way of locating it.

However, the large number of scan responseTooSlow warnings in the log seemed to tell us that many time-consuming scan operations were taking place on the HBase server, which might well be the main cause of the high CPU load. For various reasons, though, we did not focus on this at first, because similar warnings had also appeared frequently in the past.

3. Enter Arthas

Neither the monitoring nor the logs let us be 100% sure which operations were causing the high CPU load. With the top command we could only see that the HBase process was consuming a lot of CPU, as shown in the figure below.

Without further analysis, we still could not tell which threads inside the HBase-related process were causing the problem. To analyze a Java process you might reach for jstack or jstat -gcutil, but the tool used here is neither of those, nor even async-profiler on its own: it is Arthas. async-profiler is a powerful tool in its own right, but Arthas integrates it and offers far more; it really is a miracle worker.

I had heard about Arthas for a long time but assumed it was only useful for analyzing web applications such as Spring Boot. After going through its official documentation over the last couple of days, I realized how ignorant I had been. For an introduction to Arthas and how to get started, please refer to its documentation, which is more detailed and friendly than any third-party source.

  • Github.com/alibaba/art…
  • Arthas official documentation
  • Github.com/jvm-profili…

4. Using Arthas to analyze the abnormal HBase process

4.1 Run Arthas

java -jar /data/arthas/arthas-boot.jar --target-ip 0.0.0.0
  • --target-ip defaults to 127.0.0.1; set it to 0.0.0.0 so that the web console can also be used

4.2 Arthas startup screen

The top command shows that the PID of the abnormal HBase process is 1214, which is the HRegionServer process. In the arthas-boot process list, enter the serial number 1 and press Enter to attach to that process and get the Arthas command line.

4.3 dashboard

Run the dashboard command and press Enter to view the overall CPU usage of the process. As shown in the figure, thread 59 has the highest CPU usage.

4.4 thread

Enter thread and press Enter to view the execution status of all threads in the process.

4.5 thread -n 3

Print the three threads that consume the most resources.

4.6 thread -n 3 -i 5000

Print the three threads that consume the most resources within a 5-second sampling interval.
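To recap, these are the Arthas commands used in this section; thread 59 below is just the hot thread found in the dashboard example above, so substitute whatever thread ID shows up in your own session.

dashboard            # overall overview of the process; shows the busiest threads
thread               # list all threads and their states
thread -n 3          # print the three busiest threads with their stack traces
thread -n 3 -i 5000  # the same, but sampled over a 5000 ms interval
thread 59            # print the stack trace of one specific thread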

4.7 Using async-profiler to generate flame graphs

The simplest commands to generate a flame graph:

profiler start

After letting it run for a while, say 30 seconds:

profiler stop

View it on the Web Console.
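If you would rather get a standalone file instead of browsing the web console, newer Arthas versions let you choose the output format and path when stopping the profiler. The flags below are an assumption about your Arthas version, so check profiler --help first.

profiler stop --format html --file /tmp/hbase-cpu-flame.html   # flags may differ across Arthas versions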

A primer on flame graphs:

A tool for viewing the CPU flame graph of a JVM process.

The flame graph clearly shows that the threads occupying the most CPU, the ones with the longest green bars, are all doing scan operations.

5. High CPU load caused by scan operations

Through the analysis above, we can finally confirm that the high CPU load is caused by scan operations. Our HBase queries go through an API we wrapped around happybase (happybase.readthedocs.io/en/latest/).

In fact, normal scan operations returned results without problems, and the tables behind the abnormal queries were not particularly large, so we ruled out hot-spotting. The business side's query logic can be abstracted as follows:

from happybase.connection import Connection
import time
start = time.time()
con = Connection(host='ip', port=9090, timeout=3000)
table = con.table("table_name")
try:
    res = list(table.scan(filter="PrefixFilter('273810955|')",
                      row_start='\x0f\x10&R\xca\xdf\x96\xcb\xe2\xad7$\xad9khE\x19\xfd\xaa\x87\xa5\xdd\xf7\x85\x1c\x81ku ^\x92k',
                      limit=3))
except Exception as e:
    pass
end = time.time()
print 'timeout: %d' % (end - start)

The combination of PrefixFilter and row_start is used to implement paged queries. row_start is an encrypted user_id containing special characters, and the logs show that all the time-consuming queries carry such garbled-looking characters. So at first we suspected that the abnormal queries were related to these characters.

However, when we later tried to reproduce the problem in tests, we found the following:

# will time out
res = list(table.scan(filter="PrefixFilter('273810955|')", row_start='27', limit=3))

# will not time out
res = list(table.scan(filter="PrefixFilter('273810955|')", row_start='27381095', limit=3))

So even when the filter and row_start parameters contain no garbled characters, an abnormal combination of filter and row_start can still cause high CPU load. When row_start is set to a value smaller than the prefix, the range of data to scan presumably becomes much larger; similarly, if a full-table scan is triggered, the CPU load will inevitably rise.
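As a contrast, here is a minimal sketch of what a consistent paged query might look like, where row_start always stays inside the range matched by the PrefixFilter; the table name, prefix and the last_row bookkeeping are placeholders for illustration, not the real business data.

from happybase.connection import Connection

con = Connection(host='ip', port=9090, timeout=3000)
table = con.table("table_name")

prefix = '273810955|'
last_row = None  # row key of the last record returned by the previous page, if any

# For the first page, start at the prefix itself; for later pages, start just
# after the last row already returned. Either way row_start stays inside the
# prefix range, so the scan does not have to walk rows that the PrefixFilter
# will reject anyway.
row_start = prefix if last_row is None else last_row + '\x00'
rows = list(table.scan(filter="PrefixFilter('%s')" % prefix,
                       row_start=row_start,
                       limit=3))
print(rows)

happybase also has a row_prefix argument on scan() that covers the pure prefix case by itself, which may be simpler when no explicit start key is needed.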

6. Frequently creating connections or using a connection pool causes scan threads to keep increasing

Our common HBase access code is wrapped around happybase and also uses happybase's connection pool. In further testing, we used Arthas to monitor the thread situation while either using a connection pool or repeatedly creating connections inside a loop, and found that scan threads piled up badly. The test code is as follows:

6.1 The connection is created outside the loop and reused

from happybase.connection import Connection
import time
con = Connection(host='ip', port=9090, timeout=2000)
table = con.table("table")
for i in range(100):
    try:
        start = time.time()
        res = list(table.scan(filter="PrefixFilter('273810955|')",
                              row_start='\x0f\x10&R\xca\xdf\x96\xcb\xe2\xad7$\xad9khE\x19\xfd\xaa\x87\xa5\xdd\xf7\x85\x1c\x81ku ^\x92k',
                              limit=3))
    except Exception as e:
        pass
    end = time.time()
    print 'timeout: %d' % (end - start)

Once the program starts running, open Arthas to monitor the HRegionServer process and run the thread command to check the thread usage at that point:

A few threads are RUNNING, most are WAITING. The CPU load at this point:

6.2 The connection is created inside the loop, used, and then discarded

The code is as follows:

from happybase.connection import Connection
import time
for i in range(100):
    try:
        start = time.time()
        con = Connection(host='ip', port=9090, timeout=2000)
        table = con.table("table")
        res = list(table.scan(filter="PrefixFilter('273810955|')",
                              row_start='\x0f\x10&R\xca\xdf\x96\xcb\xe2\xad7$\xad9khE\x19\xfd\xaa\x87\xa5\xdd\xf7\x85\x1c\x81ku ^\x92k',
                              limit=3))
    except Exception as e:
        pass
    end = time.time()
    print 'timeout: %d' % (end - start)

As you can see in the figure below, the number of RUNNING threads keeps increasing, and so does the CPU they consume.

At this point the CPU usage rose sharply from the previously stable level:

6.3 Accessing HBase through a connection pool

The previous experiment had pulled the CPU up, so the cluster was restarted to bring the CPU back to its earlier stable state before continuing with our tests. The test code:

Without specifying a timeout

from happybase import ConnectionPool
import time
pool = ConnectionPool(size=1, host='ip', port=9090)
for i in range(100):
    start = time.time()
    try:
        with pool.connection(2000) as con:
            table = con.table("table")
            res = list(table.scan(filter="PrefixFilter('273810955|')",
                                  row_start='\x0f\x10&R\xca\xdf\x96\xcb\xe2\xad7$\xad9khE\x19\xfd\xaa\x87\xa5\xdd\xf7\x85\x1c\x81ku ^\x92k',
                                  limit=3))
    except Exception as e:
        pass
    end = time.time()
    print 'timeout: %d' % (end - start)

With no timeout specified, only one thread keeps running, because my connection pool size is set to 1.

The CPU load is not too high yet; but if the connection pool were larger, or the concurrency higher, there would presumably be more of these long-running scan threads, and CPU usage would skyrocket.
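As an illustration only, here is a hedged sketch of what such a higher-concurrency test could look like: a larger pool plus several client threads issuing the same problematic scan. The pool size, thread counts and row keys are made up for the example and are not values from the original test.

import threading

from happybase import ConnectionPool

pool = ConnectionPool(size=10, host='ip', port=9090, timeout=2000)

def worker():
    for _ in range(20):
        try:
            with pool.connection(2000) as con:
                table = con.table("table")
                # Same problematic scan as above: row_start does not share the
                # PrefixFilter prefix, so the server-side scan range is huge.
                list(table.scan(filter="PrefixFilter('273810955|')",
                                row_start='27',
                                limit=3))
        except Exception:
            pass

threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print('done')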

Specifying a timeout

from happybase import ConnectionPool
import time
pool = ConnectionPool(size=1, host='ip', port=9090, timeout=2000)
for i in range(100):
    start = time.time()
    try:
        with pool.connection(2000) as con:
            table = con.table("table")
            res = list(table.scan(filter="PrefixFilter('273810955|')",
                                  row_start='\x0f\x10&R\xca\xdf\x96\xcb\xe2\xad7$\xad9khE\x19\xfd\xaa\x87\xa5\xdd\xf7\x85\x1c\x81ku ^\x92k',
                                  limit=3))
    except Exception as e:
        pass
    end = time.time()
    print 'timeout: %d' % (end - start)

In this test I specified a timeout in the connection pool, expecting the connection to time out and be cut off promptly so the next query could proceed. At this point, the server-side threads handling scan requests look like this:

The number of server threads in the RUNNING state handling scan requests keeps growing and consumes a lot of CPU.

7. hbase.regionserver.handler.count

Based on blog posts from several experts, as well as my own understanding of this parameter: every RPC request (read or write) a client sends to the server is handled by a server-side thread pool, and this pool guarantees that at any point in time 30 threads can be running, while the remaining requests either block or sit in a queue waiting to be processed. Once the server thread pool is filled with scan requests, these large, time-consuming operations hog the CPU, other ordinary read and write requests are inevitably affected badly, and the cluster gradually becomes unusable.

8. Restricting scan requests to a small number of queues

First of all, hbase.regionserver.handler.count cannot simply be made small: if it is too small, read and write latency will be high under heavy concurrency, because most requests will be stuck in the queue. Ideally, reads and writes would use different thread pools, and within read handling, scan and get would use different thread pools as well, achieving resource isolation at the thread-pool level. If it were up to me, my first instinct would probably be the simple, crude approach of three thread pools: a write pool, a get pool, and a scan pool, with the scan pool given very few core threads so that it consumes very little and cannot grow without bound. But is that really how it works? I have not studied the source code carefully. HBase does provide the following parameters for separating read and write resources; the descriptions below are excerpted from the HBase official documentation and run through Google Translate. Hbase.apache.org/2.1/book.ht…

hbase.regionserver.handler.count

Description: the number of RPC listener instances spun up on RegionServers. The Master uses the same property for the count of master handlers. Too many handlers can be counter-productive. Make it a multiple of the CPU count. If the workload is mostly read-only, a handler count close to the CPU count works well. Start with twice the CPU count and tune from there. Default: 30

hbase.ipc.server.callqueue.handler.factor

Description: the factor that determines the number of call queues. A value of 0 means a single queue shared by all handlers; a value of 1 means each handler has its own queue. Default: 0.1

hbase.ipc.server.callqueue.read.ratio

Description: splits the call queues into read queues and write queues. The specified interval (which should be between 0.0 and 1.0) is multiplied by the number of call queues. A value of 0 means the call queues are not split, so read and write requests are pushed onto the same set of queues. A value lower than 0.5 means there are fewer read queues than write queues. A value of 0.5 means there are the same number of read and write queues. A value greater than 0.5 means there are more read queues than write queues. A value of 1.0 means all but one queue are used to dispatch read requests. Example: given 10 call queues in total, a read.ratio of 0 means the 10 queues contain both read and write requests; a read.ratio of 0.3 means 3 queues contain only read requests and 7 contain only write requests; a read.ratio of 0.5 means 5 queues contain only read requests and 5 contain only write requests; a read.ratio of 0.8 means 8 queues contain only read requests and 2 contain only write requests; a read.ratio of 1 means 9 queues contain only read requests and 1 contains only write requests. Default: 0

hbase.ipc.server.callqueue.scan.ratio

Description: given the number of read call queues (computed from the total number of call queues multiplied by callqueue.read.ratio), the scan.ratio property splits the read call queues into short-read queues and long-read queues. A value lower than 0.5 means there are fewer long-read queues than short-read queues. A value of 0.5 means there are the same number of short-read and long-read queues. A value greater than 0.5 means there are more long-read queues than short-read queues. A value of 0 or 1 means the same set of queues is used for gets and for scans. Example: given 8 read call queues in total, a scan.ratio of 0 or 1 means the 8 queues contain both long-read and short-read requests; a scan.ratio of 0.3 means 2 queues contain only long-read requests and 6 contain only short-read requests; a scan.ratio of 0.5 means 4 queues contain only long-read requests and 4 contain only short-read requests; a scan.ratio of 0.8 means 6 queues contain only long-read requests and 2 contain only short-read requests. Default: 0

The official site explains these parameters in detail. According to their meaning, configuring suitable ratios should separate the read and write queues, and the get and scan queues. However, after adjusting the parameters, the tests above showed that this still could not really keep the number of RUNNING threads under control.
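To make the interplay of these ratios more concrete, here is a small back-of-the-envelope sketch based purely on my reading of the documentation quoted above; the settings are hypothetical and the rounding is illustrative, so the exact rounding HBase uses may differ.

# Hypothetical settings, not the values from our cluster.
handler_count = 30                # hbase.regionserver.handler.count
callqueue_handler_factor = 0.3    # hbase.ipc.server.callqueue.handler.factor
read_ratio = 0.5                  # hbase.ipc.server.callqueue.read.ratio
scan_ratio = 0.5                  # hbase.ipc.server.callqueue.scan.ratio

# Total number of call queues is the handler count scaled by the factor.
total_queues = max(1, int(handler_count * callqueue_handler_factor))  # 9

# read.ratio splits the call queues into read queues and write queues.
read_queues = int(total_queues * read_ratio)                          # 4
write_queues = total_queues - read_queues                             # 5

# scan.ratio further splits the read queues into long-read (scan) queues
# and short-read (get) queues.
scan_queues = int(read_queues * scan_ratio)                           # 2
get_queues = read_queues - scan_queues                                # 2

print(total_queues, write_queues, get_queues, scan_queues)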

One question remains: what exactly is the relationship between these queues and the thread pool as I understand it? Are they the same thing? To answer that, we will have to read the source code and see what is really going on.

9. Summary

That, in a somewhat rambling way, is a record of the whole process of locating the problem. The description is not very detailed; I have simply tried to reconstruct the scene at the time and lay out the general line of thinking, partly so I do not forget it later, and partly in the hope that it gives others some inspiration. Along the way I received pointers from quite a few experts, and I would like to give them special thanks here.

Arthas's essay contest is in full swing

The Arthas community is officially holding an essay call. If you have:

  • Problems located using Arthas
  • Source-code interpretations of Arthas
  • Suggestions for Arthas
  • Any other Arthas-related content, no restrictions

You are welcome to take part in the essay contest; there are prizes to be won. Click to submit.

"Alibaba Cloud Native focuses on microservices, Serverless, containers, Service Mesh and other technical fields, follows cloud-native trends and large-scale cloud-native production practices, and aims to be the official account that best understands cloud-native developers."