While tuning the callback interface of our self-built storage service, QPS was stuck at 1W (10,000), and repeated monitoring and analysis failed to reveal the performance bottleneck. In the end the cause turned out to be inaccurate database CPU monitoring that misled the investigation. This post records the whole tuning process
1. Source code analysis
Analyzing the callback interface's source code, its processing involves:
- 1 Redis query
- 2 DB operations: 1 query and 1 write
- 5 local cache reads
- 1 external interface call
The biggest performance bottleneck is likely to be in DB operations
2. Resource allocation
service | configuration
---|---
mysql | 1 master, 1 slave; 16C 128G
redis | 8 nodes, 16G
callback service | 2C 4G
3. Test analysis
Next, stress tests were run with the concurrency and the number of callback nodes as the only variables
concurrency | qps | callback nodes | avg rt/ms | p95 rt/ms | p99 rt/ms | callback cpu
---|---|---|---|---|---|---
5 * 10 | 1922 | 1 | 28.83 | 57 | 83 | 190% |
5 * 10 | 4103 | 2 | 11.89 | 27 | 48 | 180% |
10 * 10 | 4060 | 2 | 23.28 | 63 | 105 | 170% |
10 * 10 | 6028 | 3 | 13.94 | 42 | 77 | 180% |
15 * 10 | 6060 | 3 | 21.15 | 60 | 94 | 170% |
15 * 10 | 7560 | 4 | 15.87 | 52 | 94 | 160% |
20 * 10 | 7830 | 4 | 21.27 | 64 | 111 | 170% |
20 * 10 | 9860 | 5 | 16.08 | 51 | 89 | 170% |
25 * 10 | 9960 | 5 | 19.63 | 62 | 121 | 170% |
25 * 10 | 9790 | 6 | – | – | – | 170% |
30 * 10 | 10700 | 6 | 21.35 | 68 | 115 | 170% |
40 * 10 | 11170 | 6 | – | – | – | 180% |
40 * 10 | 10170 | 10 | 24.04 | 65 | 116 | 100% |
40 * 20 | 10700 | 10 | 41.78 | 109 | 142 | 100% |
In summary, each callback node provides roughly 2000 QPS, but beyond 5 nodes adding more nodes leaves the total QPS stuck at about 1W
4. Influencing factors
The factors affecting the callback interface, as listed in the source code analysis above, are:
- Redis query
- DB operations
- Local cache reads
- External interface call
Next, isolate each factor by commenting out the corresponding processing logic in the code (single-variable exclusion) and re-running the test
factor commented out | qps
---|---
Redis query | 1W
DB operations | 2.5W
External interface call | 1W
Ps: since the local cache is read in multiple places and the change would be too invasive, the local cache reads were not tested this way
From these results, Redis queries and external interface calls can be ruled out as performance constraints. MySQL is a possible one, but this alone does not prove that MySQL is what keeps the QPS from rising
5. Database monitoring
Since MySQL is now the prime suspect, the next step is to analyze the database further
5.1 Database CPU
Start with the CPU monitoring for the database. The database instance is limited to 8C on a 16C machine, and according to the CPU monitoring the peak usage is about 4C, only 50% of the limit, far from being a bottleneck
5.2 Query Time
Since the slow query threshold is set to 0, the slow query data does not mean much, so look instead at the processing time of SELECT and INSERT statements in the database. The monitoring shows an average INSERT query time of 6 ms and an average SELECT query time of 289 µs. From the last row of the test table, the average response time of the interface is 41.78 ms, so the 6.289 ms spent in the database (6 ms + 289 µs) accounts for only about 15% of it
6. Client
Monitoring alone cannot pin the problem down, so next verify whether the bottleneck lies in the client generating the load, to avoid chasing the wrong target
6.1 Client Resource Monitoring
Check the client with dstat: its CPU, memory, and bandwidth are not bottlenecks
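For reference, a typical dstat invocation for this kind of check (the interval is arbitrary):

```bash
# CPU, memory, and network throughput, refreshed every 5 seconds
dstat -cmn 5
```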
6.2 Adding a Client
To exclude other potential factors such as the number of connections, the number of clients was increased from 20 to 40 and then 60. Whether the total concurrency stayed the same or doubled, QPS remained around 1W. So is something other than the number of callback nodes at play?
6.3 InfluxDB
The data reported by the client goes to InfluxDB and is processed synchronously, and InfluxDB's CPU usage was found to reach about 90%. After upgrading InfluxDB from 4C to 8C and running the stress test again, its CPU stayed below 52%, but QPS was still around 1W
6.4 Adding Extra Pressure
Monitoring alone can only rule out a saturated client CPU; an experiment is needed to check whether the client is the cause
Experiment design:
- The client provides 20 x 10 concurrency and runs for about 2 minutes; observe the QPS curve
- An additional source provides 50 more concurrent requests while the client's load continues; run for about 2 minutes and observe the QPS curve (one way to generate this extra load is sketched after this list)
- Stop the additional load while the client's load continues; run for about 2 minutes and observe the QPS curve
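The original tooling for the extra load is not specified; as a hedged illustration, a second load generator such as wrk could provide the additional 50 concurrent connections (the endpoint URL is a placeholder):

```bash
# 50 extra concurrent connections for ~2 minutes against the callback endpoint
wrk -t5 -c50 -d120s http://callback-service:8080/lluozh/callback
```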
The experimental results from running the above scheme are as follows: from them, the load-generating client can be definitively ruled out as the cause
7. Thread lock
Next, check whether thread locks on the server are what keep the QPS from rising
In fact, since QPS stays at 1W even when the callback service scales from 5 to 10 nodes, thread locks and similar in-process causes can largely be ruled out, but capturing thread data can confirm this
7.1 Obtaining a Thread Snapshot
jstack 1 > 1.txt
For details on thread snapshot analysis see [tuning tools]; the snapshot is opened with the jca tool
7.2 Thread Status
The thread states are basically Runnable and Waiting on condition
7.3 Waiting on condition
The threads in Waiting on condition are mostly Tomcat worker threads polling the task queue:
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
at org.apache.tomcat.util.threads.TaskQueue.poll(TaskQueue.java:89)
at org.apache.tomcat.util.threads.TaskQueue.poll(TaskQueue.java:33)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
A few threads are in Druid's getConnectionDirect
No obvious anomaly can be identified from the thread dump
8. Flame graph
Since the load-generating client has been ruled out and the database monitoring reveals nothing, the next step is to look at the server-side stacks and analyze where the interface spends its time
Use the arthas profiler (or perf) to capture stack samples, then open the generated file and view the flame graph. In updateOrSaveFileInfo, a SELECT query runs first and then an INSERT writes the data, and the flame graph shows that the service time is almost entirely spent on these two database operations
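A rough sketch of capturing the flame graph with the arthas profiler (exact options vary by arthas version; the output path is illustrative):

```bash
# inside an arthas session attached to the callback service process
profiler start --event cpu
# keep the load running for a while, then stop and export the flame graph
profiler stop --format html --file /tmp/callback-flamegraph.html
```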
9. Link analysis
9.1 Method Entry
Execute the command
trace com.lluozh.llz.controller.CallbackController handleLluozhCallback '#cost > 10' -n 10
Results obtained
`---ts=2021-05-21 20:32:17; thread_name=http-nio-8080-exec-2836; id=132e; is_daemon=true; priority=5; TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@2b9aeedb
    `---[35.16514ms] com.lluozh.llz.service.callback.impl.CallbackServiceImpl:updateOrSaveFileInfo()
        +---[0.005561ms] com.lluozh.llz.model.po.FileInfoPo:getAppId() #88
        +---[0.003188ms] com.lluozh.llz.model.po.FileInfoPo:getFileKey() #88
        +---[21.122075ms] com.lluozh.llz.daoService.FileInfoService:existFileInfo() #88
        `---[14.001955ms] com.lluozh.llz.daoService.FileInfoService:addFileInfo() #101
The results are basically consistent with the flame graph
9.2 Drilling Down Step by Step
Following the stack information above, drill down step by step to the INSERT SQL operation
trace com.lluozh.llz.daoService.FileInfoService addFileInfo '#cost > 10' -n 10
Results obtained
`---ts=2021-05-21 20:45:31; thread_name=http-nio-8080-exec-457; id=32c; is_daemon=true; priority=5; TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@49ced9c7
    `---[21.410149ms] com.sun.proxy.$Proxy135:addFileInfo()
        `---[21.38085ms] com.sun.proxy.$Proxy134:addFileInfo()
This stack reveals the dynamic proxy com.sun.proxy, and almost all of the time is spent inside it
9.3 Tracing the Dynamic Proxy
With dynamic proxies, how do you go down to get more detailed stack information?
Ps: it is not easy to work out the full call chain by reading Druid's source code; in practice the flame graph analysis already reveals all the underlying calls
Going back to the flame graph captured above and looking at the logic executed behind com.sun.proxy, the main Druid methods are:
com/alibaba/druid/pool/DruidDataSource.getConnection
com/alibaba/druid/pool/DruidPooledConnection.prepareStatement
com/alibaba/druid/pool/DruidPooledPreparedStatement.execute
9.4 Tracing Druid Execution
So let’s trace the execution of Druid
- getConnection
Execute the command
trace com.alibaba.druid.pool.DruidDataSource getConnection '#cost > 10' -n 10
Results obtained
`---ts=2021-05-21 20:57:22; thread_name=http-nio-8080-exec-584; id=425; is_daemon=true; priority=5; TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@49ced9c7
    `---[19.703046ms] com.alibaba.druid.pool.DruidDataSource:getConnection()
        `---[19.686816ms] com.alibaba.druid.pool.DruidDataSource:getConnection() #109
            `---[19.678128ms] com.alibaba.druid.pool.DruidDataSource:getConnection()
                `---[19.646259ms] com.alibaba.druid.pool.DruidDataSource:getConnection() #1296
                    `---[19.634726ms] com.alibaba.druid.pool.DruidDataSource:getConnection()
                        +---[0.003328ms] com.alibaba.druid.pool.DruidDataSource:init() #1300
                        `---[19.564124ms] com.alibaba.druid.pool.DruidDataSource:getConnectionDirect() #1306
As the flame graph shows, getConnection takes only a small share of the overall database time (the server-side insert logic accounts for 18.48% of the samples, while getConnection accounts for just 1.07%), so the time is not being spent acquiring connections
Ps: when tuning the size of the server-side database connection pool, tracing the share of time spent in getConnection makes the tuning targeted
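For example, arthas's monitor command reports per-cycle call counts and average RT for getConnection, which makes that share easy to watch (a sketch; the 10-second cycle is arbitrary):

```bash
# success/failure counts and average RT every 10 seconds
monitor -c 10 com.alibaba.druid.pool.DruidDataSource getConnection
```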
- prepareStatement
Execute the command
trace com.alibaba.druid.pool.DruidPooledConnection prepareStatement '#cost > 10' -n 10
Results obtained
`---ts=2021-05-21 21:01:01; thread_name=http-nio-8080-exec-633; id=456; is_daemon=true; priority=5; TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@49ced9c7
    `---[29.744699ms] com.alibaba.druid.pool.DruidPooledConnection:prepareStatement()
        +---[0.004268ms] com.alibaba.druid.pool.DruidPooledConnection:checkState() #335
        +---[0.004607ms] com.alibaba.druid.pool.DruidPooledConnection:getCatalog() #338
        +---[0.004426ms] com.alibaba.druid.pool.DruidPooledPreparedStatement$PreparedStatementKey:<init>() #338
        +---[0.003182ms] com.alibaba.druid.pool.DruidConnectionHolder:isPoolPreparedStatements() #340
        +---[0.004073ms] com.alibaba.druid.pool.PreparedStatementHolder:<init>() #348
        +---[0.003288ms] com.alibaba.druid.pool.DruidConnectionHolder:getDataSource() #349
        +---[0.003013ms] com.alibaba.druid.pool.DruidAbstractDataSource:incrementPreparedStatementCount() #349
        +---[0.00486ms] com.alibaba.druid.pool.DruidPooledConnection:initStatement() #355
        +---[0.005498ms] com.alibaba.druid.pool.DruidPooledPreparedStatement:<init>() #357
        `---[0.003386ms] com.alibaba.druid.pool.DruidConnectionHolder:addTrace() #359
Druid source
@Override
public PreparedStatement prepareStatement(String sql) throws SQLException {
    checkState();
    PreparedStatementHolder stmtHolder = null;
    PreparedStatementKey key = new PreparedStatementKey(sql, getCatalog(), MethodType.M1);
    boolean poolPreparedStatements = holder.isPoolPreparedStatements();
    if (poolPreparedStatements) {
        stmtHolder = holder.getStatementPool().get(key);
    }
    if (stmtHolder == null) {
        try {
            stmtHolder = new PreparedStatementHolder(key, conn.prepareStatement(sql));
            holder.getDataSource().incrementPreparedStatementCount();
        } catch (SQLException ex) {
            handleException(ex, sql);
        }
    }
    initStatement(stmtHolder);
    DruidPooledPreparedStatement rtnVal = new DruidPooledPreparedStatement(this, stmtHolder);
    holder.addTrace(rtnVal);
    return rtnVal;
}
From the stack information and source code, the time consumed in prepareStatement corresponds to this code's call path in the flame graph. The underlying call is the MySQL driver's native processing, so try tracing that underlying method
trace com.mysql.cj.jdbc.ClientPreparedStatement getInstance '#cost > 10' -n 10
Results obtained
`---ts=2021-05-21 21:03:43; thread_name=http-nio-8080-exec-697; id=497; is_daemon=true; priority=5; TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@49ced9c7
    `---[16.192273ms] com.mysql.cj.jdbc.ClientPreparedStatement:getInstance()
        `---[16.181172ms] com.mysql.cj.jdbc.ClientPreparedStatement:<init>() #134
At this point there is no obvious way to drill down further
- execute
Execute the command
trace com.alibaba.druid.pool.DruidPooledPreparedStatement execute '#cost > 10' -n 10
Results obtained
`---ts=2021-05-21 21:15:27; thread_name=http-nio-8080-exec-590; id=42b; is_daemon=true; priority=5; TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@49ced9c7
    `---[32.167451ms] com.alibaba.druid.pool.DruidPooledPreparedStatement:execute()
        +---[0.002757ms] com.alibaba.druid.pool.DruidPooledPreparedStatement:checkOpen() #488
        +---[0.002721ms] com.alibaba.druid.pool.DruidPooledPreparedStatement:incrementExecuteCount() #490
        +---[0.00353ms] com.alibaba.druid.pool.DruidPooledPreparedStatement:transactionRecord() #491
        +---[0.002565ms] com.alibaba.druid.pool.DruidPooledConnection:beforeExecute() #495
        `---[0.003494ms] com.alibaba.druid.pool.DruidPooledConnection:afterExecute() #503
Druid source
@Override
public boolean execute() throws SQLException {
    checkOpen();
    incrementExecuteCount();
    transactionRecord(sql);
    // oracleSetRowPrefetch();
    conn.beforeExecute();
    try {
        return stmt.execute();
    } catch (Throwable t) {
        errorCheck(t);
        throw checkException(t);
    } finally {
        conn.afterExecute();
    }
}
This shows that the MySQL operation time goes almost entirely into stmt.execute(), not into extra processing such as checkOpen()
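To double-check at the driver level, the Connector/J statement execution could also be traced directly (a sketch; the class name matches the driver version seen above):

```bash
trace com.mysql.cj.jdbc.ClientPreparedStatement execute '#cost > 10' -n 10
```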
9.5 Trace results under different pressures
Compare and analyze the service trace results under high and low pressure
- Trace at low pressure
[arthas@1]$ trace com.lluozh.cstore.controller.CallbackController handleLluozhCallback -n 100
Press Q or Ctrl+C to abort.
Affect(class count: 2, method count: 2) cost in 387 ms, listenerId: 8
`---ts=2021-05-21 16:53:18; thread_name=http-nio-8080-exec-2263; id=fb7; is_daemon=true; priority=5; TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@2b9aeedb
    `---[14.628907ms] com.lluozh.cstore.controller.CallbackController$$EnhancerBySpringCGLIB$$e624886:handleLluozhCallback()
        `---[14.378576ms] org.springframework.cglib.proxy.MethodInterceptor:intercept() #-1
            `---[11.086846ms] com.lluozh.cstore.controller.CallbackController:handleLluozhCallback()
                +---[0.003396ms] org.apache.commons.lang3.StringUtils:isBlank() #312
                +---[0.003682ms] javax.servlet.http.HttpServletRequest:getAttribute() #315
                +---[0.00307ms] org.apache.commons.lang3.StringUtils:isBlank() #316
                +---[0.00373ms] org.slf4j.Logger:info() #319
                +---[0.003959ms] javax.servlet.http.HttpServletRequest:getHeader() #322
                +---[0.002916ms] org.apache.commons.lang3.StringUtils:isBlank() #323
                +---[0.003681ms] javax.servlet.http.HttpServletRequest:getRequestURI() #324
                +---[0.003218ms] javax.servlet.http.HttpServletRequest:getQueryString() #325
                +---[0.003068ms] org.apache.commons.lang3.StringUtils:isBlank() #326
                +---[0.003989ms] org.apache.commons.lang3.StringUtils:isEmpty() #339
                +---[0.003264ms] org.apache.commons.lang3.StringUtils:isEmpty() #343
                +---[0.003291ms] org.apache.commons.lang3.StringUtils:isEmpty() #346
                +---[0.003914ms] org.apache.commons.lang3.StringUtils:isEmpty() #350
                +---[0.004014ms] org.apache.commons.lang3.StringUtils:isEmpty() #353
                +---[1.73458ms] com.lluozh.cstore.component.CacheService:getAppConfigCommonFromLocalCache() #357
                +---[0.003084ms] org.apache.commons.lang3.StringUtils:isNotEmpty() #362
                +---[2.699534ms] com.lluozh.cstore.component.CacheService:getBucketInfoFromLocalCache() #366
                +---[1.111017ms] com.lluozh.cstore.service.AbstractStoreService:getDownloadUrl() #370
                +---[0.00313ms] org.slf4j.Logger:info() #371
                +---[1.626661ms] com.lluozh.cstore.component.CacheService:getSessionInfoFromRedis() #373
                +---[0.002987ms] com.lluozh.cstore.dto.response.store.UploadSessionInfoDto:getExpireDays() #374
                +---[0.002685ms] com.lluozh.cstore.model.po.BucketInfoPo:getServiceProviderId() #379
                +---[2.673879ms] com.lluozh.cstore.service.callback.CallbackService:updateOrSaveFileInfo() #380
                +---[0.08924ms] com.lluozh.cstore.model.po.BucketInfoPo:getServiceProviderId() #384
                +---[0.004951ms] com.lluozh.cstore.model.dto.StoreFileEventDto:<init>() #384
                +---[0.004444ms] com.lluozh.cstore.component.FridayService:storeFileEvent() #384
                +---[0.003681ms] com.google.common.collect.Lists:newArrayList() #387
                +---[0.009182ms] com.lluozh.cstore.controller.CallbackController:processReturnBodyFields() #392
                +---[0.003904ms] org.slf4j.Logger:info() #393
                +---[0.003003ms] com.lluozh.cstore.dto.response.store.UploadSessionInfoDto:getExpireDays() #398
                `---[0.003562ms] com.lluozh.cstore.constants.ApiResult:success() #412
- Trace at high pressure
`---ts=2021-05-21 16:55:53; thread_name=http-nio-8080-exec-2477; id=1117; is_daemon=true; priority=5; TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@2b9aeedb
    `---[5.209409ms] com.lluozh.cstore.controller.CallbackController:handleLluozhCallback()
        +---[0.003767ms] org.apache.commons.lang3.StringUtils:isBlank() #312
        +---[0.004069ms] javax.servlet.http.HttpServletRequest:getAttribute() #315
        +---[0.003483ms] org.apache.commons.lang3.StringUtils:isBlank() #316
        +---[0.004073ms] org.slf4j.Logger:info() #319
        +---[0.004323ms] javax.servlet.http.HttpServletRequest:getHeader() #322
        +---[0.003516ms] org.apache.commons.lang3.StringUtils:isBlank() #323
        +---[0.003849ms] javax.servlet.http.HttpServletRequest:getRequestURI() #324
        +---[0.003775ms] javax.servlet.http.HttpServletRequest:getQueryString() #325
        +---[0.003475ms] org.apache.commons.lang3.StringUtils:isBlank() #326
        +---[0.003478ms] org.apache.commons.lang3.StringUtils:isEmpty() #339
        +---[0.003234ms] org.apache.commons.lang3.StringUtils:isEmpty() #343
        +---[0.003204ms] org.apache.commons.lang3.StringUtils:isEmpty() #346
        +---[0.003379ms] org.apache.commons.lang3.StringUtils:isEmpty() #350
        +---[0.003201ms] org.apache.commons.lang3.StringUtils:isEmpty() #353
        +---[0.006058ms] com.lluozh.cstore.component.CacheService:getAppConfigCommonFromLocalCache() #357
        +---[0.003683ms] org.apache.commons.lang3.StringUtils:isNotEmpty() #362
        +---[0.004469ms] com.lluozh.cstore.component.CacheService:getBucketInfoFromLocalCache() #366
        +---[0.012478ms] com.lluozh.cstore.service.AbstractStoreService:getDownloadUrl() #370
        +---[0.00316ms] org.slf4j.Logger:info() #371
        +---[0.482134ms] com.lluozh.cstore.component.CacheService:getSessionInfoFromRedis() #373
        +---[0.004054ms] com.lluozh.cstore.dto.response.store.UploadSessionInfoDto:getExpireDays() #374
        +---[0.003648ms] com.lluozh.cstore.model.po.BucketInfoPo:getServiceProviderId() #379
        +---[4.027026ms] com.lluozh.cstore.service.callback.CallbackService:updateOrSaveFileInfo() #380
        +---[0.197707ms] com.lluozh.cstore.model.po.BucketInfoPo:getServiceProviderId() #384
        +---[0.005363ms] com.lluozh.cstore.model.dto.StoreFileEventDto:<init>() #384
        +---[0.004561ms] com.lluozh.cstore.component.FridayService:storeFileEvent() #384
        +---[0.028581ms] com.google.common.collect.Lists:newArrayList() #387
        +---[0.006182ms] com.lluozh.cstore.controller.CallbackController:processReturnBodyFields() #392
        +---[0.003908ms] org.slf4j.Logger:info() #393
        +---[0.0034ms] com.lluozh.cstore.dto.response.store.UploadSessionInfoDto:getExpireDays() #398
        `---[0.004344ms] com.lluozh.cstore.constants.ApiResult:success() #412
Nothing unusual was found in the comparison
10. Packet capture analysis
The stack information locates the time inside the execute call, but the trace cannot break that time down any further, so it is hard to tell whether it is consistent with the average MySQL execution time obtained earlier
Most problems between components can be settled by capturing packets, which is the clearest way for each component to show what it is actually doing
Capture the MySQL packets on the server; for details, see [Network Protocol] MySQL protocol data packet analysis
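A minimal capture sketch, assuming MySQL listens on the default port 3306 and the pcap file is analyzed offline (the interface name is a placeholder):

```bash
# capture full MySQL packets on the server into a pcap file
tcpdump -i eth0 -s 0 -w mysql.pcap 'tcp port 3306'
```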
The response time curve aggregated from the captured packets:
Although there is no averaged figure over time, this data gives a sense of the gap between the request/response round trip seen on the server and the average response time reported by MySQL
Ps: the chart contains roughly 1/3 SELECT operations, 1/3 INSERT operations, and 1/3 transaction_read_only queries
11. Database analysis
So what else is happening on the database side to cause this? This time I asked the DBA to log in to the production database directly and check its resource usage
11.1 Database top
During the stress test, top on the production database host shows the database using a full 8C, i.e. 800%. This is completely inconsistent with the CPU data from the monitoring dashboard
11.2 Threads Activity
Here is MySQL's Threads Activity data at runtime: the number of threads actively processing SQL is close to 300. According to the DBA, with close to 300 active SQL threads an 8C database is already running at full load.
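For reference, a quick way to see this from the database side (a sketch; it assumes shell access to the MySQL host and a configured client login):

```bash
top -p "$(pgrep -d, mysqld)"                             # per-process CPU; ~800% means 8 cores fully used
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_running';"    # threads actively executing SQL
mysql -e "SHOW PROCESSLIST;"                             # what each connection is doing
```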
So top and Threads Activity viewed directly on the database side show that the database is fully loaded. Before confirming why the database monitoring is inaccurate, first verify that the database really is the cause
11.3 Upgrading the Database
Upgrade the database core limit from 8C to 12C, and test the same concurrency again
QPS changed from 1W to 1.5W
At this point, the CPU of mysql went from 800% to nearly 1200%
12. The problem of database monitoring
Back to the question of why the database monitoring is inaccurate: why is the CPU usage it shows only about half of the actual usage?
12.1 Grafana Dashboard Configuration
Look at how the database CPU panel in the Grafana dashboard obtains its data
12.2 container_cpu_usage_seconds_total
So what does namespace_pod_name_container_name:container_cpu_usage_seconds_total mean?
It reports the average CPU usage over the last 5 minutes as the node's current CPU usage
Because each stress test lasts only about 2-3 minutes and is followed by a 2-3 minute idle interval, the 5-minute average comes out at roughly half of the real peak usage
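Assuming the dashboard is backed by Prometheus, the effect of the window can be checked directly against its HTTP API; the hostname, label matcher, and use of the raw metric here are illustrative:

```bash
# 5m window: smears a 2-3 minute burst across 5 minutes and roughly halves it
curl -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total{pod=~"mysql.*"}[5m]))'

# 1m window: much closer to the actual usage during the test
curl -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total{pod=~"mysql.*"}[1m]))'
```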
12.3 Modifying the Dashboard Configuration
Modify the dashboard configuration to use the average CPU over the last 1 minute instead of the last 5 minutes. The data is still not strictly real-time, but what is collected and displayed is now reasonably accurate