Analysis and resolution of a sudden burst of frequent full GCs in an online service

Background

One afternoon in April, just back at work and still drowsy from the spring weather, I was not entirely awake when DingTalk alerts started coming in: one of our online services was running full GC very frequently. Sure enough, a few minutes later the downstream services began firing call-timeout alarms as well.

Our company’s internal APM tool is Pinpoint. It showed that during the 13:50 ~ 14:03 window, a large number of service responses took more than 5000 ms.

Locating the problem instance

The red line represents full GC. This instance was essentially unavailable: requests routed to it mostly timed out, while the other instances were normal during this period.

Quick recovery

  • Take the problematic instance out of service, remembering to dump the heap first (e.g. jmap -dump:live,format=b,file=heap.hprof <pid>)

Problem analysis

  • Cause analysis
    1. Based on the symptoms above, the assumption was that some infrequent request, or a particular scenario, loaded a large amount of data into memory, and that request happened to be handled by the problem instance.
    2. Because the service recovered on its own after a while and the service logs revealed nothing useful, the analysis hit a bottleneck. But if this problem ever recurred, the service would again be essentially unavailable, so we still had to dig out the root cause and fix it thoroughly to avoid a bigger impact later.
    3. Our service loads data in only a few ways: database queries or responses from external interfaces. The dump file showed that most of the objects were our internal entity objects (I forgot to take a screenshot here), so the likely cause was a database query returning a huge result set.
  • Solution
    • JVM parameter tuning: adjust JVM parameters to make this problem as unlikely as possible
    • Code logic tuning: find the problematic code and fix it

JVM parameter tuning

The overall idea is to minimize the number of "ephemeral objects" promoted to the old generation, so that the old generation does not fill up too quickly and trigger a major GC or full GC, which would stop the world and impact the business. Note that this tuning only mitigates the problem; it cannot prevent the extreme cases caused by the code itself.

-Xmx5g 
-Xms5g 
-XX:MaxMetaspaceSize=512M 
-XX:MaxTenuringThreshold=15 
-XX:MetaspaceSize=512M 
-XX:NewSize=2560M 
-XX:MaxNewSize=2560M 
-XX:SurvivorRatio=8 
-XX:+UseConcMarkSweepGC 
-XX:+PrintGCApplicationStoppedTime 
-XX:+UseCMSCompactAtFullCollection 
-XX:CMSInitiatingOccupancyFraction=85 
-Xloggc:/opt/zcy/modules/agreement-center/gc.log 
-XX:CMSFullGCsBeforeCompaction=2 
-XX:+CMSScavengeBeforeRemark 
-XX:+UseCMSInitiatingOccupancyOnly
  • Resize the young generation: -XX:NewSize=2560M, -XX:MaxNewSize=2560M. The heap is 5 GB; setting the young generation to 2560M, about half of the heap, lets as many objects as possible be allocated and collected in the young generation
  • Raise the object age threshold for promotion to the old generation: -XX:MaxTenuringThreshold=15. Under CMS the default is 6; raising it to 15 keeps objects in the young generation as long as possible, so that collection completes there
  • Adjust the ratio of Eden to the Survivor spaces: -XX:SurvivorRatio=8, so the Eden size is 2560M * 0.8 = 2048M
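As a quick sanity check on the sizing above, the Eden/Survivor arithmetic can be worked out directly (a toy calculation, not JVM code; the 8:1:1 split follows from -XX:SurvivorRatio=8):

```java
public class GenerationSizing {
    // Eden takes ratio parts out of (ratio + 2) parts of the young generation
    static int edenMb(int youngMb, int survivorRatio) {
        return youngMb * survivorRatio / (survivorRatio + 2);
    }

    // Each of the two survivor spaces takes 1 part out of (ratio + 2)
    static int survivorMb(int youngMb, int survivorRatio) {
        return youngMb / (survivorRatio + 2);
    }

    public static void main(String[] args) {
        int heapMb = 5 * 1024;  // -Xmx5g / -Xms5g
        int youngMb = 2560;     // -XX:NewSize=2560M / -XX:MaxNewSize=2560M
        int ratio = 8;          // -XX:SurvivorRatio=8 → Eden : S0 : S1 = 8 : 1 : 1

        // Eden = 2560 * 8/10 = 2048M, each survivor = 256M, old gen = 2560M
        System.out.println("eden=" + edenMb(youngMb, ratio) + "M"
                + " survivor=" + survivorMb(youngMb, ratio) + "M"
                + " old=" + (heapMb - youngMb) + "M");
    }
}
```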

Code logic tuning

The approach here is to put limits on batch data queries: find the code that issues unbounded batch queries against the database, and fix it.

  • Option 1: use a MyBatis plugin to append a global limit to query statements, capping the maximum amount of data returned. In our business, however, there are often tens of thousands of legitimately linked rows (the data model design is arguably unreasonable here), so a limit larger than tens of thousands loses its meaning: some tables have wide rows, and tens of thousands of records can amount to hundreds of megabytes. Under heavy request volume the problem would still occur, and this approach still would not locate the offending code. The project has far too much code for reading it to be a reliable way to find the problem; that would come down to luck.
  • Option 2: also via a MyBatis plugin, count the number of rows in each query result. When the count exceeds a threshold, print a WARN log and monitor it in real time; from the log, the whole call chain can be traced and the faulty code found.

I chose option 2 here; the plugin code is as follows:

@Intercepts(@Signature(type = ResultSetHandler.class, method = "handleResultSets", args = {Statement.class}))
@Slf4j
public class QueryDataSizeInterceptor implements Interceptor {

    /** Threshold above which a WARN log is printed for the query result size. */
    private Integer querySizeLimit;

    /** Whether the interceptor is enabled. */
    private Boolean isOpen;

    public QueryDataSizeInterceptor(Integer querySizeLimit, Boolean isOpen) {
        this.querySizeLimit = querySizeLimit;
        this.isOpen = isOpen;
    }

    @Override
    public Object intercept(Invocation invocation) throws Throwable {
        try {
            if (isOpen) {
                processIntercept(invocation.getArgs());
            }
        } catch (Throwable throwable) {
            log.warn("QueryDataSizeInterceptor.failed,cause:{}", Throwables.getStackTraceAsString(throwable));
        }
        return invocation.proceed();
    }

    private void processIntercept(final Object[] queryArgs) {
        Statement statement = (Statement) queryArgs[0];
        try {
            HikariProxyResultSet resultSet = (HikariProxyResultSet) statement.getResultSet();
            MetaObject metaObject1 = SystemMetaObject.forObject(resultSet);
            RowDataStatic rs = (RowDataStatic) metaObject1.getValue("delegate.rowData");
            if (Objects.nonNull(rs) && !rs.wasEmpty() && rs.size() >= querySizeLimit) {
                MetaObject metaObject2 = SystemMetaObject.forObject(statement);
                String sql = (String) metaObject2.getValue("delegate.originalSql");
                log.warn("current.query.size.is.too.large,size:{},sql:{}", rs.size(), sql);
            }
        } catch (Throwable throwable) {
            log.warn("QueryDataSizeInterceptor.failed,cause:{}", Throwables.getStackTraceAsString(throwable));
        }
    }

    @Override
    public Object plugin(Object target) {
        return Plugin.wrap(target, this);
    }

    @Override
    public void setProperties(Properties properties) {
    }
}

Most of this is MyBatis plugin boilerplate; the core logic is very simple:

private void processIntercept(final Object[] queryArgs) {
    Statement statement = (Statement) queryArgs[0];
    try {
        HikariProxyResultSet resultSet = (HikariProxyResultSet) statement.getResultSet();
        MetaObject metaObject1 = SystemMetaObject.forObject(resultSet);
        RowDataStatic rs = (RowDataStatic) metaObject1.getValue("delegate.rowData");
        // Print a WARN log when the result size exceeds the configured threshold
        if (Objects.nonNull(rs) && !rs.wasEmpty() && rs.size() >= querySizeLimit) {
            MetaObject metaObject2 = SystemMetaObject.forObject(statement);
            String sql = (String) metaObject2.getValue("delegate.originalSql");
            log.warn("current.query.size.is.too.large,size:{},sql:{}", rs.size(), sql);
        }
    } catch (Throwable throwable) {
        log.warn("QueryDataSizeInterceptor.failed,cause:{}", Throwables.getStackTraceAsString(throwable));
    }
}

Logic: when the number of rows returned by a query exceeds the configured threshold, a WARN log is printed. On the logging platform, we then configure an alarm on that log.
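For completeness, one way to wire the interceptor up is to register it programmatically on the MyBatis Configuration. This is a sketch: the threshold of 10000 and the enabled flag are arbitrary assumptions, and sqlSessionFactory is assumed to be built elsewhere.

```java
// Sketch: register the interceptor on an existing SqlSessionFactory.
// The threshold (10000) and switch (true) would normally come from application config.
org.apache.ibatis.session.Configuration configuration = sqlSessionFactory.getConfiguration();
configuration.addInterceptor(new QueryDataSizeInterceptor(10000, true));
```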

The problem recurs

With the log in place, when the problem happened again we could locate the corresponding code by traceId. It showed that more than 300,000 rows had been loaded from the database into memory; no wonder full GC was triggered.

Long total = protocolQualificationManager.count(criteria);

if (total == 0) {
    return Response.ok(new Paging<>(0L, Collections.EMPTY_LIST));
}
//List<AgProtocolQualification> result = agProtocolQualificationDao.paging(criteria);
List<AgProtocolQualification> result = protocolQualificationManager.paging(criteria);
Set<Long> protocolIds = FluentIterable.from(result).transform(k -> k.getProtocolId()).toSet();

// There is a problem with this query
List<AgProtocol> protocols = agProtocolDao.queryByIds(Lists.newArrayList(protocolIds));

The code looks fine at first glance, so let's look at the mapper for the corresponding query:

<select id="queryByIds" parameterType="java.util.List" resultMap="defaultResultMap">
    SELECT
    <include refid="allColumns"/>
    FROM
    ag_protocol
    <where>
        <if test="ids != null and ids.size != 0">
            and id in
            <foreach collection="ids" open="(" close=")" separator="," item="id">
                #{id}
            </foreach>
        </if>

        <!-- Added afterwards: prevent a full-table query when ids is empty -->
        <if test="ids == null or ids.size == 0">
            and false
        </if>
        <include refid="not_deleted"/>
    </where>
</select>
  • If ids is null or empty, the in condition is not generated at all, so the query returns every not_deleted row in the table: more than 300,000 records
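On top of the mapper-level guard, a defensive check on the caller side costs nothing. A minimal sketch follows; the DAO here is a hypothetical stand-in for illustration, not the real agProtocolDao:

```java
import java.util.*;

public class EmptyIdGuardDemo {
    // Hypothetical stand-in for the DAO: the real mapper's dynamic SQL silently
    // drops the IN clause when the id list is empty and scans the whole table.
    static List<String> queryByIds(List<Long> ids) {
        List<String> rows = new ArrayList<>();
        for (Long id : ids) {
            rows.add("protocol-" + id);
        }
        return rows;
    }

    // Guard on the caller side: never let an empty id collection reach the mapper.
    static List<String> safeQueryByIds(Set<Long> protocolIds) {
        if (protocolIds == null || protocolIds.isEmpty()) {
            return Collections.emptyList();
        }
        return queryByIds(new ArrayList<>(protocolIds));
    }

    public static void main(String[] args) {
        System.out.println(safeQueryByIds(Collections.emptySet()));           // []
        System.out.println(safeQueryByIds(new TreeSet<>(Arrays.asList(7L)))); // [protocol-7]
    }
}
```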

Pitfalls and lessons

  • In dynamic SQL, when none of the conditions match, the query must not fall through to a full-table scan; return empty directly, or add a guard in the code or in the mapper SQL
  • Optimize business data structures and add limits in the code
  • Restrictions should also be applied at the database level; if a large-scale deletion had happened here instead, the business impact could have been even greater