1. Cause
The business side reported that Gremlin query performance after the graph service upgrade was five times slower than in the previous version. Below are the troubleshooting notes and thoughts.
2. Approach and Steps
- At this point the old and new versions of the database coexist on the same cluster and traffic has not been cut over yet, so monitoring cannot isolate the relevant information. It is better to start from the server side.
- Since HugeGraph separates storage from compute, we first need to determine which layer the slowness comes from: the graph server, the client, or HBase, layer by layer. Start by finding the API through which the graph server handles Gremlin calls, and dissect it with Alibaba's open-source tool Arthas.
(⚠️ Note: each of the queries below uses a different Gremlin statement, because the graph itself has a cache: vertices and edges that have already been queried are added to the cache and subsequent lookups no longer follow the path we are tracing. Prepare several query cases in advance.)
2.1 Preparations:
Download and start Arthas: curl -O https://arthas.aliyun.com/arthas-boot.jar && java -jar arthas-boot.jar
2.2 Link Tracing
- Run trace com.baidu.hugegraph.api.gremlin.GremlinAPI get, then execute the query with Postman and check the Gremlin call chain and its timing. In the trace command, com.baidu.hugegraph.api.gremlin.GremlinAPI is the class to trace and get is the method to trace:
```
`---ts=2021-05-10 16:10:28; thread_name=grizzly-http-server-28; id=63; is_daemon=false; priority=5; TCCL=sun.misc.Launcher$AppClassLoader@232204a1
    `---[45097.334926ms] com.baidu.hugegraph.api.gremlin.GremlinAPI:get() [throws Exception]
        +---[0.016083ms] javax.ws.rs.core.HttpHeaders:getHeaderString() #114
        +---[0.009815ms] javax.ws.rs.core.UriInfo:getRequestUri() #115
        +---[0.034164ms] javax.ws.rs.core.UriInfo:getQueryParameters() #116
        +---[0.009963ms] com.baidu.hugegraph.api.gremlin.GremlinAPI:client() #117
        +---[45034.476917ms] com.baidu.hugegraph.api.gremlin.GremlinClient:doGetRequest() #95    // focus on the longest call
        +---[0.045994ms] com.codahale.metrics.Histogram:update() #118
        +---[0.024552ms] javax.ws.rs.core.Response:getLength() #119
        +---[0.019595ms] com.codahale.metrics.Histogram:update() #95
        `---[62.2564ms] com.baidu.hugegraph.api.gremlin.GremlinAPI:transformResponseIfNeeded() #120 [throws Exception]
            `---throw:com.baidu.hugegraph.exception.HugeGremlinException #152 [null]
```
- Once we locate the call that takes the most time, we trace that function in turn and execute the query again from the front end:
trace com.baidu.hugegraph.api.gremlin.GremlinClient doGetRequest
```
`---ts=2021-05-10 16:15:09; thread_name=grizzly-http-server-27; id=62; is_daemon=false; priority=5; TCCL=sun.misc.Launcher$AppClassLoader@232204a1
    `---[2395.877787ms] com.baidu.hugegraph.api.gremlin.GremlinClient:doGetRequest()
        +---[0.025715ms] javax.ws.rs.core.MultivaluedMap:entrySet() #65
        +---[0.012735ms] com.baidu.hugegraph.util.E:checkArgument() #66
        +---[0.136357ms] javax.ws.rs.client.WebTarget:queryParam() #70
        +---[0.113385ms] javax.ws.rs.client.WebTarget:request() #72
        +---[0.032459ms] javax.ws.rs.client.Invocation$Builder:header() #73
        +---[0.027921ms] javax.ws.rs.client.Invocation$Builder:accept() #74
        +---[0.015461ms] javax.ws.rs.client.Invocation$Builder:acceptEncoding() #75
        `---[2395.244523ms] javax.ws.rs.client.Invocation$Builder:get() #76
```
Seeing that the time is spent inside the client's query call, we move on to the client side.
- Run:
trace com.baidu.hugegraph.backend.store.hbase.HbaseTable query -n 10 '#cost>100'
In this command, '#cost>100' filters for calls that took more than 100 ms, and -n 10 ends the trace after 10 matching invocations.
```
`---ts=2021-05-10 17:52:20; thread_name=gremlin-server-exec-13; id=28d1; is_daemon=false; priority=5; TCCL=sun.misc.Launcher$AppClassLoader@232204a1
    `---[104.269486ms] com.baidu.hugegraph.backend.store.hbase.HbaseTable:query()
        +---[0.002985ms] com.baidu.hugegraph.backend.query.Query:limit() #156
        +---[104.241873ms] com.baidu.hugegraph.backend.store.hbase.HbaseTable:query() #162
        |   `---[104.232068ms] com.baidu.hugegraph.backend.store.hbase.HbaseTable:query()
        |       +---[0.002668ms] com.baidu.hugegraph.backend.query.Query:empty() #167
        |       +---[0.002552ms] com.baidu.hugegraph.backend.query.Query:conditions() #184
        |       +---[0.002382ms] com.baidu.hugegraph.backend.query.Query:ids() #186
        |       +---[0.002212ms] com.baidu.hugegraph.backend.query.Query:ids() #187
        |       `---[104.200252ms] com.baidu.hugegraph.backend.store.hbase.HbaseTable:queryById() #188    // focus on the longest
        `---[0.007368ms] com.baidu.hugegraph.backend.store.hbase.HbaseTable:newEntryIterator() #95
```
- As can be seen above, queryById takes the longest and appears to go into HBase. At this point we can basically confirm that the problem is at the HBase level; let's continue to verify:
trace com.baidu.hugegraph.backend.store.hbase.HbaseTable queryById -n 10 '#cost>100'
```
`---ts=2021-05-10 17:55:23; thread_name=gremlin-server-exec-20; id=28e0; is_daemon=false; priority=5; TCCL=sun.misc.Launcher$AppClassLoader@232204a1
    `---[117.206504ms] com.baidu.hugegraph.backend.store.hbase.HbaseTable:queryById()
        +---[0.002969ms] com.baidu.hugegraph.backend.store.hbase.HbaseTable:table() #210
        +---[0.002292ms] com.baidu.hugegraph.backend.id.Id:asBytes() #95
        `---[117.1856ms] com.baidu.hugegraph.backend.store.hbase.HbaseSessions$HbaseSession:get() #95
```
- Continue tracing:
trace com.baidu.hugegraph.backend.store.hbase.HbaseSessions$HbaseSession get -n 10 '#cost>100'
```
`---[294.896811ms] com.baidu.hugegraph.backend.store.hbase.HbaseSessions$Session:get()
    `---[294.891196ms] com.baidu.hugegraph.backend.store.hbase.HbaseSessions$Session:get() #427
        `---[294.885125ms] com.baidu.hugegraph.backend.store.hbase.HbaseSessions$Session:get()
            +---[0.001963ms] org.apache.hadoop.hbase.client.Get:<init>() #658
            +---[0.032304ms] com.baidu.hugegraph.backend.store.hbase.HbaseSessions:access$100() #663
            +---[294.828674ms] org.apache.hadoop.hbase.client.Table:get() #664    // goes into HBase to get the data
            +---[0.003014ms] com.baidu.hugegraph.backend.store.hbase.HbaseSessions$RowIterator:<init>() #95
            `---[0.001837ms] org.apache.hadoop.hbase.client.Table:close() #95
```
As shown above, the bottleneck is not in the server or the client; the HBase query itself is slow.
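To double-check that the latency really comes from HBase rather than from the graph layers, we can issue a Get directly with the HBase client and time it ourselves. Below is a minimal sketch, not part of the original troubleshooting: the table name netlabpro:g_ie is inferred from the HDFS path that appears later, the ZooKeeper quorum is a placeholder, and the row key is passed in as a plain string even though HugeGraph actually stores ids in its own binary encoding (Id.asBytes()).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseGetLatencyCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical ZooKeeper quorum; replace with the real cluster's address
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");
        // Table name inferred from the HDFS path /home/hbase/data/netlabpro/g_ie
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("netlabpro", "g_ie"))) {
            // A sampled row key passed on the command line; HugeGraph's real keys
            // are binary-encoded ids, this string form is only for illustration
            byte[] rowKey = Bytes.toBytes(args[0]);
            long start = System.nanoTime();
            Result result = table.get(new Get(rowKey));
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("get returned " + result.size() + " cells in " + elapsedMs + " ms");
        }
    }
}
```

If such a standalone Get is also slow, the graph server and client are exonerated, which matches the trace results above.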
2.3 Why Is the HBase Query Slow?
Our old and new graph databases live on the same HBase cluster. If one is slow, both should be slow, unless something is different between them. Right, the heterogeneous storage policy we set up earlier: the three HDFS replicas use a 1 SSD + 2 HDD policy so that read I/O is served from SSD to guarantee performance. Let's check the storage policy on HDFS right away:
hdfs storagepolicies -getStoragePolicy -path /home/hbase/data/netlabpro/g_ie
Sure enough, no policy is set on that path; the heterogeneous storage setup seems to have failed. But how could it fail? Oh, silly me: although the policy was set after the upgrade, the business side later dropped and recreated the table because the import process failed, among other issues, and the recreated table never got the policy again. Also, because the index data was large back then, we had set the policy at the table level rather than the db level, so when the region information was rebuilt the previous policy disappeared. So now we set it at the db level (child paths inherit the policy by default) to prevent it from disappearing again:
~/software/hadoop/bin/hdfs storagepolicies -setStoragePolicy -path /home/hbase/data/netlabpro/ -policy ONE_SSD
OK. Run a compaction on the table so its data is rewritten under the new policy, then query the table again.
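Since an HDFS storage policy only applies to newly written blocks, the existing HFiles have to be rewritten for the data to actually land on SSD, which is exactly what a major compaction does (the hbase shell's major_compact command achieves the same thing). Below is a minimal sketch of triggering it through the HBase Admin API, again with the table name inferred from the HDFS path above and a placeholder ZooKeeper quorum:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class TriggerMajorCompaction {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical ZooKeeper quorum; replace with the real cluster's address
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");
        // Table name inferred from the HDFS path /home/hbase/data/netlabpro/g_ie
        TableName table = TableName.valueOf("netlabpro", "g_ie");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // majorCompact() only submits the request; the compaction runs
            // asynchronously on the region servers and rewrites the HFiles,
            // so the new blocks follow the ONE_SSD storage policy
            admin.majorCompact(table);
        }
    }
}
```

Once the compaction has finished, re-run the prepared test queries to confirm the latency is back to normal.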