1. Incident recap

Yesterday afternoon I received a 504 alarm, which is obviously a timeout alarm. I was busy with other things at the time, so I only took a quick glance at it. But the method that triggered the alarm was a familiar one that I had written myself, so I was a little surprised at first.

After the surprise passed, I went back to the work at hand.

An hour later the same alarm came in again. Clearly this was not a one-off; something had to be wrong, so the investigation began.

The alarming interface is getControllerAMethod in ControllerA at the Controller layer, which calls multiple microservices and finally assembles the results to return. The problem lies with ServiceM: the getServiceMMethod method in the ServiceM service has very simple logic, mainly two database queries that fetch data from MySQL and return it to the caller.

The call chain is shown below

2. Environment introduction

**Language:** Go

**MySQL DB:** ****

**Database interaction:** database/sql (the company has added some customization and wrapping on top of it, but underneath it is still database/sql)

What follows is the step-by-step investigation of this problem.

3. Gut reaction: start with the SQL statement

According to the alarm information and log details, it is a timeout problem.

My first reaction was to check whether the SQL statement was a slow query (though in my heart I knew this was extremely unlikely). The SQL statement is simple and looks like this:

```sql
select a, b, c from tableA where a in (a1, a2, a3)
```

Even without looking at the execution plan, I could tell the query would hit the index.

Still, I checked the execution plan and the actual execution; the analysis results and the response time both looked fine.

The gut-reaction check turned up nothing, so on to the next one.

4. Upstream: was the context time budget exhausted?

Since it is a timeout problem, the cause may be either upstream or downstream. The first step already ruled out a downstream timeout caused by a slow query.

Could it be an upstream timeout, then? That is clearly possible: a Go context is passed down the call chain, and all the service invocations share the total time budget it carries.

And as the figure above shows, ServiceM is also the last step of the upstream interface, so if the services before it take too long, the time left for ServiceM can be squeezed down to almost nothing.
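As a minimal sketch of that budget-sharing behaviour (the service names and numbers here are invented for illustration, not taken from the real project): every hop draws from the same context deadline, so a fast last hop can still time out if the earlier hops have eaten the budget.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// callService stands in for a real microservice call; it either finishes its
// simulated work or gives up when the shared context deadline expires.
func callService(ctx context.Context, name string, cost time.Duration) error {
	select {
	case <-time.After(cost):
		return nil
	case <-ctx.Done():
		return fmt.Errorf("%s: %w", name, ctx.Err())
	}
}

func main() {
	// One total budget for the whole request; every downstream call shares it.
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()

	fmt.Println(callService(ctx, "ServiceA", 200*time.Millisecond)) // <nil>
	fmt.Println(callService(ctx, "ServiceB", 250*time.Millisecond)) // <nil>
	// Only ~50ms of the budget is left, so the last hop times out even
	// though 100ms would normally count as fast.
	fmt.Println(callService(ctx, "ServiceM", 100*time.Millisecond)) // ServiceM: context deadline exceeded
}
```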

However, the logs showed that in the ServiceM layer the SQL queries behind getServiceMMethod almost always returned within tens of milliseconds.

So it did not look like the timeout was caused by insufficient time left upstream.

5. Downstream preliminary investigation: Rows became the main suspect

Since the upstream time budget was sufficient, the problem most likely lay in the downstream service interface.

Here is a pseudo-code version of the getServiceMMethod method:

```go
rows, err = db.Query(sql1)
if err != nil { ... }
defer rows.Close()
for rows.Next() {
    rows.Scan(...)
}

rows, err = db.Query(sql2)
if err != nil { ... }
defer rows.Close()
for rows.Next() {
    rows.Scan(...)
}
```

After reading the code I started to get a little excited. I thought I had it: it was probably a problem with rows.

In the article "Go component learning: are you using the database/sql connection pool correctly?" I focused on the pitfalls of not closing rows properly: if the database connection is not released correctly while iterating over rows, the connection leaks, and subsequent queries cannot get a connection and time out.

The only thing I could think of was that the same rows variable is used for both queries. Could there be some situation where the connection is closed by the first query and the second query then uses the already-closed connection, causing the timeout?

By this point the alarm was firing more and more often, so I changed the two queries from sharing one rows variable to using two separate rows variables, each explicitly closed, then submitted the code, and after the tests passed it went online.
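The change was roughly the following (a sketch of the shape of the fix with placeholder SQL and column names, not the actual production code): each query gets its own rows variable, its own deferred Close, and its error is checked after iteration.

```go
package service

import (
	"context"
	"database/sql"
)

// Sketch of the fix; sql1/sql2 and the scanned columns are placeholders.
func getServiceMMethod(ctx context.Context, db *sql.DB, sql1, sql2 string) error {
	rows1, err := db.QueryContext(ctx, sql1)
	if err != nil {
		return err
	}
	defer rows1.Close() // first result set has its own Close
	for rows1.Next() {
		var a, b, c string
		if err := rows1.Scan(&a, &b, &c); err != nil {
			return err
		}
	}
	if err := rows1.Err(); err != nil {
		return err
	}

	rows2, err := db.QueryContext(ctx, sql2)
	if err != nil {
		return err
	}
	defer rows2.Close() // second result set likewise
	for rows2.Next() {
		var a, b, c string
		if err := rows2.Scan(&a, &b, &c); err != nil {
			return err
		}
	}
	return rows2.Err()
}
```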

6. A brief period of calm

Once the code went live, the effect was immediate.

The alarms stopped at once and there were no more timeout messages in the log. Everything was calm again, which convinced me that I had found the cause and fixed the problem.

But when I got home I was still a little uneasy, so from 11 o'clock I took out my computer and began to simulate, verify, and reconstruct the cause of the alarm.

7. Three hours later, I realized the calm might be an illusion

For three hours, from 11 o'clock until 2 o'clock, instead of finding the cause, I found that my fix might not have solved the problem at all.

The code in question is not complicated; as shown above, there are only two select queries.

I wrote the same code and tested it locally: it neither timed out nor failed to get a connection. Even with both maxConn and IdleConn set to 1, both queries came back fine.
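My local test was along these lines (the DSN and table are placeholders for my test database): with the pool squeezed down to a single connection, two sequential queries that are fully iterated and closed both still come back without waiting.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN for a local test database.
	db, err := sql.Open("mysql", "root:rootroot@/dqm?charset=utf8&parseTime=True&loc=Local")
	if err != nil {
		log.Fatal(err)
	}
	db.SetMaxOpenConns(1) // maxConn = 1
	db.SetMaxIdleConns(1) // idleConn = 1

	for i := 0; i < 2; i++ {
		rows, err := db.Query("select * from test")
		if err != nil {
			log.Fatal(err)
		}
		for rows.Next() {
			// drain the result set so the connection can be reused
		}
		rows.Close() // the single connection goes back to the pool here
	}
	fmt.Println("both queries finished, no waiting for a connection")
}
```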

Unless, as described in "Go component learning: are you using the database/sql connection pool correctly?", the loop exits rows.Next() early without a rows.Close() call, in which case the next query cannot get a connection, as in the code below:

```go
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql"
)

func oneConnWithRowsNextWithError() {
	db, _ := sql.Open("mysql", "root:rootroot@/dqm?charset=utf8&parseTime=True&loc=Local")

	db.SetMaxOpenConns(1)
	rows, err := db.Query("select * from test where name = 'jackie' limit 10")
	if err != nil {
		fmt.Println("query error")
	}

	i := 1
	for rows.Next() {
		i++
		if i == 3 {
			// break out early without rows.Close(), so the single
			// connection is never returned to the pool
			break
		}
		fmt.Println("close")
	}

	// with MaxOpenConns=1 and the connection still held by rows,
	// this second query blocks waiting for a connection
	row, _ := db.Query("select * from test")
	fmt.Println(row, rows)
}
```

However, the original code did have the defer rows.Close() call, so the connections would definitely be released eventually; there should be no connection leak.

At this point, Murphy's law came to mind.

So I started digging into the source code and, combined with the logs, found an important clue: many query tasks were being actively cancelled, and the cancel is the one returned by context.WithTimeout (this code lives in our in-house wrapper around database/sql).

The cancel fires if the QueryContext method in database/sql returns an error:

```go
// QueryContext executes a query that returns rows, typically a SELECT.
// The args are for any placeholder parameters in the query.
func (db *DB) QueryContext(ctx context.Context, query string, args ...interface{}) (*Rows, error) {
	var rows *Rows
	var err error
	for i := 0; i < maxBadConnRetries; i++ {
		rows, err = db.query(ctx, query, args, cachedOrNewConn)
		if err != driver.ErrBadConn {
			break
		}
	}
	if err == driver.ErrBadConn {
		return db.query(ctx, query, args, alwaysNewConn)
	}
	return rows, err
}
```

My first thought was again that the upstream budget was insufficient and the context had been cancelled directly, but that had already been ruled out.

So I kept scrolling down through several more layers of code and wondered whether a parameter called QueryTimeout in our self-developed wrapper was too small. But a look at the configuration (this is important, more on it later) showed 800ms, which was clearly enough.
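I do not own that wrapper, so the following is only a guess at the shape of the mechanism it describes (the function name and signature are hypothetical): each query caps the caller's context with QueryTimeout, and if no pooled connection frees up before either deadline fires, QueryContext returns a context error and the query shows up in the logs as cancelled.

```go
package dbwrap

import (
	"context"
	"database/sql"
	"time"
)

// queryRowCount is a hypothetical sketch, not the real in-house wrapper.
func queryRowCount(ctx context.Context, db *sql.DB, queryTimeout time.Duration,
	query string, args ...interface{}) (int, error) {
	// The upstream budget is further capped by QueryTimeout (800ms in our config).
	qctx, cancel := context.WithTimeout(ctx, queryTimeout)
	defer cancel()

	// If every pooled connection is busy and none frees up before the
	// deadline, QueryContext gives up with a context error instead of
	// ever running the query.
	rows, err := db.QueryContext(qctx, query, args...)
	if err != nil {
		return 0, err
	}
	defer rows.Close()

	n := 0
	for rows.Next() {
		n++
	}
	return n, rows.Err()
}
```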

With more and more questions in my head, I reluctantly went to bed.

8. Murphy's law still works

At 1 pm this afternoon the familiar alarm arrived, as it was bound to (though only once).

As mentioned earlier, that cancel was probably an important clue: since the query was cancelled rather than run to completion, there must have been a timeout, and the timeout was likely caused by failing to get a connection.

getServiceMMethod had already been checked and had no connection leak, but could there be a leak somewhere else? I went through all the code in the ServiceM service, and every place that uses a Rows object closes it properly.

After the investigation, hope was dashed.

So far, I’ve mentally ruled out a connection leak.

During this time I also asked our DBA, because the logs showed several queries taking hundreds of milliseconds around 8:00 this morning, and I suspected an abnormal database state. The DBA replied that the database was fine.

I also looked at the monitoring myself: compared with the past, the DB status and traffic showed nothing abnormal, and the number of connections in the pool had no big swings.

A colleague mentioned that a new feature had gone live a few days earlier and that traffic was higher than before, so I also reviewed the code for the new feature, but found nothing wrong with it.

9. Preparing a roundabout fix

I still had not found the root cause I was after, and I began to wonder whether I could sidestep this unknown problem some other way. After all, sometimes the best way to solve a problem is not to solve it head-on but to go around it.

One idea was to increase the timeout of the ServiceM method.

Another was to add caching to the getServiceMMethod method of the ServiceM service to absorb some of the requests.

All right, let’s call it a day and try them tomorrow.

Before stopping, though, I stood back and took a quick mental inventory of this alarm.

10. A flash of inspiration: I found the truth

Upstream alarm, downstream timeout ->

Ruled out: not enough time left upstream ->

Ruled out: downstream rows not being closed ->

Ruled out: slow queries or database anomalies ->

Confirmed: the timeout happens downstream ->

It cannot get a connection ->

Why can it not get a connection? ->

Then I remembered the configuration I had checked earlier: QueryTimeout=800ms, IdleConn=X, maxConn=Y.

Taking another look at the QPS monitoring for getServiceMMethod, I thought I had finally found the truth.

From early morning the QPS kept climbing, and around 8 am it broke through Y, while maxConn=Y.

The QPS exceeded maxConn, so query tasks had to queue for a connection. Some still had not obtained one when the timeout expired, which triggered the cancel mentioned above; and because getServiceMMethod in ServiceM had spent its whole budget waiting, the upstream raised the timeout alarm.

Meanwhile, other tasks finished their queries during that window and returned their connections to the pool, so later requests could use them directly; after all, there is a waiting window before the timeout fires, which is why only some requests failed.
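The failure mode can be reproduced in a few lines (a self-contained sketch with a made-up DSN and numbers, not our production config): open a pool smaller than the concurrency, give every query a deadline like the wrapper's QueryTimeout, and the queries that spend too long waiting for a connection come back with a context deadline error while the rest succeed.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"sync"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN; the numbers below are chosen only to trigger the effect.
	db, err := sql.Open("mysql", "root:rootroot@/dqm?charset=utf8&parseTime=True&loc=Local")
	if err != nil {
		panic(err)
	}
	db.SetMaxOpenConns(2) // pool (maxConn) smaller than the concurrency
	db.SetMaxIdleConns(2)

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ { // more concurrent queries than connections
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// Per-query budget, like the wrapper's 800ms QueryTimeout.
			ctx, cancel := context.WithTimeout(context.Background(), 800*time.Millisecond)
			defer cancel()

			// sleep(0.5) makes each query hold its connection long enough
			// that the waiting queries exhaust their budget.
			rows, err := db.QueryContext(ctx, "select sleep(0.5)")
			if err != nil {
				fmt.Printf("query %d: %v\n", id, err) // context deadline exceeded
				return
			}
			defer rows.Close()
			for rows.Next() {
			}
			fmt.Printf("query %d: ok\n", id)
		}(i)
	}
	wg.Wait()
}
```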

Then why had this kind of alarm never fired before? My understanding is that the traffic used to be lower; the new feature my colleague launched recently increased the number of calls to this interface and exposed the problem.

11. Summary

In the end the cause turned out to be simple: the configured connection count was too small, so later query tasks could not get a connection before the timeout expired.

However, the timeout spanned the Controller layer, the Service layer, and the DB layer at the bottom, and any one of them could have been the source, which made it relatively hard to track down.

Here are a few post-mortem takeaways:

  • Pay special attention to recently changed code. If there were no alarms for a long time and they started after a recent release, the alarm is very likely related to that release (this timeout was indeed related to the newly released code: the code itself was fine, but the release increased the traffic).
  • Use tools wisely. Monitoring, logging, and similar system tools can surface useful clues and connections.
  • Investigate and narrow down in order. For bugs that are hard to locate, check, verify, and eliminate possibilities in a fixed order, for example the call order, until the cause is found.

Welcome to JackieZheng