The background,

First of all, an online analysis application was found to be abnormal, and the analysis data had not been produced for several days. Log in online and check error. Log.

YCYX-Task

Two, the investigation process

1. Preliminary positioning

jinfo

First, we use jINFO PID to view the heap-related parameters of the current JVM:

jstat

Next, we use the command jstat -gcutil pid 1s 5 to check the current heap usage within 5 seconds:

jmap

In addition to the jstat command, we can also use jmap-heap PID to view the current JVM heap:

jmap -F -histo pid | head -n 13
-F
jmap -histo:live

java.util.HashMap
java.util.ArrayList

Jmap -f -dump:file= a.binpid dump: 4.02 gb tar -czvf a.tar.gz a.bin.pid dump: 4.02 GB jmap -f -dump:file= a.bin.pid Dump: 4.02 GB tar -czvf a.tar.gz a.bin.pid

2. MAT in-depth analysis

Adjust MAT maximum heap memory

Pull the package a.tar.gz back to the local directory and unpack it. However, since A. bar is too large, MAT will definitely overflow memory when opened, so adjust the maximum heap memory of MAT software:

[MAT under the root directory of MemoryAnalyzer. Ini] - startup plugins/org. The eclipse equinox. Launcher_1. 5.0 v20180512-1130. The jar -- the launcher. The library The plugins/org. Eclipse equinox. The launcher. The win32. Win32. X86_64_1. 1.700 - vmargs - Xmx6g v20180518-1200Copy the code

-Xmx changed to 6G!

MAT analyzes large objects

YCYX-Task
java.lang.Object[]
Details

jmap -histo pid | head -n 13
ArrayList
HashMap

310000

Code read check

Combined with the JAVA class reported in the error.log log, the number of lines of code that reported the problem, and the fact that the problem should be in an ArrayList, it’s easy to locate the relevant code block:

  /** * Analyze data according to the specified start and end time *@paramBeginTime Start time *@paramEndTime indicates the endTime */
   @Override public void exec(String beginTime, String endTime) {
        List<Map<String, Object>> emlWithLoginList = new ArrayList<>();
        List<Map<String, Object>> emlList = new ArrayList<>();

        while (true) {
            try {
				// Exit if the analysis has reached the end time.
            	if (DateUtil.compareTime(beginTime, endTime) > 0) {
                    break;
                }

                // Each loop is pushed forward 10 hours, YCYX_PERIOD=10 hours
                String tmpTime=DateUtil.addHours(beginTime, YCYX_PERIOD);
                
                //1. Construct the request
                BoolQueryBuilder bqb = QueryBuilders.boolQuery();
                bqb.must(QueryBuilders.rangeQuery(CREATE_TIME).gt(beginTime).lte(tmpTime));
                bqb.must(QueryBuilders.termQuery(IS_DELETE, IS_FIELD_VALUE));
                bqb.must(QueryBuilders.existsQuery(TOS));
                bqb.must(QueryBuilders.existsQuery(FROMS));
                bqb.must(QueryBuilders.existsQuery(SEND_TYPE));
                bqb.must(QueryBuilders.existsQuery(SESSION_ID));

                log.debug("emlAnalysis begin at " + beginTime + ", and end at " + lastTime);

                SearchSourceBuilder requestBuilder = new SearchSourceBuilder().query(bqb).size(PAGE_SIZE).sort(CREATE_TIME, SortOrder.ASC).fetchSource(new String[] {"*"},
                        new String[] {EML_CONTENT});

                //2. Initiate a request
                EsHelper.getResponseScroll(EsCluster.DEFAULT, INF_EML_INDEX, "14m", requestBuilder.toString(), result -> {
                            
                            //3
                            EsSearchHit[] hits = result.getHits().getHits();
                            List<Map<String, Object>> loginList;
                            for (EsSearchHit hit : hits) {
                            	// Add the mail Map to the list
                             	Map<String, Object> emlMap = hit.getSource();
                                emlList.add(emlMap);
                                
                                // Add the mail +MailLogin Map to another list
                                Map<String, Object> emlWithLoginMap = new HashMap(emlMap);               
                                String sessionId = emlMap.get(SESSION_ID).toString();
                                MailLogin mailLogin = EsService.getMailLoginBySessionId(sessionId);
                                emlWithLoginMap.put("Login_Key", mailLogin );
                                emlWithLoginList.add(emlWithLoginMap);
                            }
                            return true;
                        });

                //4. Class A check logic
                checkDSYJEml(emlWithLoginList);
				emlWithLoginList.clear();

                //5. Class B check logic
                checkYCYXEml(emlList);
				emlList.clear();
                
                beginTime = tmpTime;
            } catch (Throwable e) {
                log.error("", e); }}}Copy the code

EmlWithLoginList = emlWithLoginList = emlWithLoginList = emlWithLoginList = emlWithLoginList = emlWithLoginList = emlWithLoginList = emlWithLoginList = emlWithLoginList

However, through MAT, we checked all the keys of Map elements in the List in question, but did not find the relevant element called Login_Key, so we speculated that the List in question should be this emlList!

Every 10 hours

Therefore, we first guess whether the ES data in the current period is too large.

Select * from CREATE_TIME(@createtime); select * from CREATE_TIME(@createtime);

Therefore, find the first and last element of the problem List through MAT, namely, get the start and end time of this query:

Figure shows that query this batch of data, the program got 310,000 pieces of data!

And I went to Kibana to check the amount of data during this period:

Suddenly lost, this situation is a little unreasonable, think for a long time, so, and carefully looked at the code:

clear()
emlWithLoginList
emlList

So it’s pretty obvious:

  1. If you’re executingcheckYCYXEml(emlList)When an exception occurs, it will be directly affected by the followingcatchTo capture;
  2. So they won’t go down thereemlList.clear()The code, at the same time, doesn’t walkbeginTime = timeTime;
  3. Because the try catch is inside the while loop, the loop continues after an exception is thrown, and because it is not executedbeginTime = timeTime, so the queried data is still the data of the previous period;
  4. At the same time, whyemlWithLoginListThere is no problem, because in front of the exception code, normal can be doneclear()Operation.

Error. Log: checkYCYXEml(emlList) : error. Log: error.

If this is the case, this code will repeatedly query the data from 2019-11-01 16:00:00 to 2019-11-02 02:00:00, and continuously add it to the emlList, and finally crash the JVM!

So the problem List in MAT should have multiple identical elements.

How do you test this?

Because these data have an EMlkey field, which is a unique primary key, corresponding to the _id of this record in ES, MAT can be used to find out whether multiple elements in the emlList have the same EMlkey according to the EMLkey of a certain data, which can be proved.

MAT groups strings

-XX:+UseStringDeduplication

String emlkey = String emlkey = String emlkey = String emlkey = String emlkey

merge shortest paths to gc roots

854742740486326718e

emlkey
document_id
_id

Finally, the problem has been basically located and it is easy to fix. The two Clear() methods are moved to finally after catch to ensure that 100% of them will be called. In addition, the problem of IceTimeOut has been solved by the colleagues who interconnect with middleware.