Background

Yesterday I was working from home because of the epidemic. After getting up, I did my routine check of the incremental data update service I am responsible for, opened the queue's monitoring page, and found a backlog of messages.

Troubleshooting

The consumer side

So I opened the company's K8s monitoring page and went into the service. The consumption log on the console had been stuck for 6 hours. I then opened a shell in the container to check the service's threads and found that the CPU usage of the Java process was barely changing. It looked as if the threads were hung.

Contacting ops

The service had been running for a long time and I had not changed it recently, so I asked ops whether cluster resources had been adjusted lately. Ops said no, then helped me check cluster usage for the whole data group and found no problem: resources were sufficient. When I described the symptoms, ops suggested it might be an OOM, but that clearly did not look like the case to me.

Troubleshooting further

Ops said resources were fine, the program had been running for half a year, and nothing had changed recently, so I was baffled. Colleagues had reported this same problem to me before: all of the data group's tasks, including crawlers and ETL jobs, are hosted on the data task platform, which I also maintain. At the time I did not get to the root cause, and simply set up scheduled restarts of their services to prevent this kind of hang.

Next I went to the monitoring page to check the load on the API that serves messages, and both memory and CPU usage looked healthy. The traffic view on the ops monitoring page was broken, so I could not see the service's traffic, which made troubleshooting harder.

I had no choice but to look from inside the service's container, so I went in and tried to run jps. The command was not installed, so I installed the JDK tools with the following command:

yum install java-1.8.0-openjdk-devel.x86_64

jps showed that the Java process in the container has process ID 1:

1 com.patsnap.analysis.postporcess.rdapi.dpp.PatentTextAPIInvokServiceV4

Finding the problem

I then ran jstack -l 1 to dump the stacks of the Java process. There was a lot of output, so I picked out the useful part:

"pool-1-thread-3" #13 prio=5 os_prio=0 tid=0x00007f84d8535000 nid=0x13 runnable [0x00007f84a34bf000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        - locked <0x00000000fb338c70> (a java.io.BufferedInputStream)
        at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
        at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593)
        - locked <0x00000000fb338cc8> (a sun.net.www.protocol.http.HttpURLConnection)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
        - locked <0x00000000fb338cc8> (a sun.net.www.protocol.http.HttpURLConnection)
        at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
        at com.patsnap.base.core.util.HttpMethodUtils.request(HttpMethodUtils.java:234)
        at com.patsnap.base.core.util.HttpMethodUtils.sendPost(HttpMethodUtils.java:48)
        at com.patsnap.base.core.util.ActiveMQUtils.sendMsg(ActiveMQUtils.java:200)
        ...

The service fetches messages with three worker threads, and the dump showed all three in this same state. They are all RUNNABLE, which at first glance looks fine, but every one of them is sitting in socketRead0, reading from the network. A thread waiting inside a native socket read is still reported as RUNNABLE, so could this be where things were stuck? I searched online and found that many people had run into the same problem.
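The state reported in the dump can be reproduced in isolation. The following is a minimal sketch, not the service's code: the class name BlockedReadState, the thread name, and the local "silent" server (a socket that accepts connections but never sends anything) are all made up for illustration. It reads from that server without a timeout and then inspects the reader thread's state.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class BlockedReadState {

    /** Blocks a reader thread on a server that never replies and returns the reader's state. */
    static Thread.State stateWhileBlockedOnRead() throws Exception {
        // Hypothetical stand-in for an overloaded upstream: the OS completes the
        // TCP handshake, but nothing is ever written back to the client.
        try (ServerSocket silentServer = new ServerSocket(0)) {
            CountDownLatch connected = new CountDownLatch(1);
            Socket[] holder = new Socket[1];

            Thread reader = new Thread(() -> {
                try {
                    Socket s = new Socket("127.0.0.1", silentServer.getLocalPort());
                    holder[0] = s;
                    connected.countDown();
                    s.getInputStream().read(); // no read timeout: waits until data arrives or the socket closes
                } catch (IOException ignored) {
                    // Raised when the socket is closed at the end of the demo.
                }
            }, "demo-reader");
            reader.setDaemon(true);
            reader.start();

            if (!connected.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("reader could not connect");
            }
            Thread.sleep(500); // give the reader time to enter the blocking read

            Thread.State state = reader.getState();
            holder[0].close(); // unblock the reader so the demo can finish
            reader.join(2000);
            return state;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("state while waiting for data: " + stateWhileBlockedOnRead());
    }
}
```

On a HotSpot JVM the reader is reported as RUNNABLE even though it is idle, which is why the thread state alone does not reveal a hang; the top frames of the stack do.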

Locating the problem

With this in mind, I followed the stack trace to the underlying method that uses HttpURLConnection:

public static String request(String method, String urlPath, Map<String, String> headers, JSONObject param) throws Exception {
    // ...
    URL url = new URL(urlPath);
    conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST"); // method names are case-sensitive and must be upper case
    conn.setDoOutput(true);
    conn.setUseCaches(false);
    // neither setConnectTimeout nor setReadTimeout is called before connecting
    conn.connect();
    // ...
}

Neither connectTimeout nor readTimeout is set anywhere, so I read the Javadoc for these two settings.

connectTimeout is the maximum time allowed for establishing the connection:

* Sets a specified timeout value, in milliseconds, to be used
* when opening a communications link to the resource referenced
* by this URLConnection.  If the timeout expires before the
* connection can be established, a
* java.net.SocketTimeoutException is raised. A timeout of zero is
* interpreted as an infinite timeout.
*
* <p> Some non-standard implementation of this method may ignore
* the specified timeout. To see the connect timeout set, please
* call getConnectTimeout().

readTimeout applies after the connection is established, while the response is being read from the server. If no data arrives within the specified time, an exception is thrown:

* Sets the read timeout to a specified timeout, in
* milliseconds. A non-zero value specifies the timeout when
* reading from Input stream when a connection is established to a
* resource. If the timeout expires before there is data available
* for read, a java.net.SocketTimeoutException is raised. A
* timeout of zero is interpreted as an infinite timeout.
*
*<p> Some non-standard implementation of this method ignores the
* specified timeout. To see the read timeout set, please call
* getReadTimeout().

There it is: "A timeout of zero is interpreted as an infinite timeout." Because the code never sets either value, both stay at their default of 0, so a read can wait forever.

That explains why the thread was blocked at exactly this point. All three threads were blocked the same way, which is why the service appeared hung.
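To confirm what an unconfigured connection actually uses, the timeouts can be read back with getConnectTimeout() and getReadTimeout(), as the Javadoc suggests. This is a small sketch; the class name DefaultTimeouts and the URL are placeholders, and the URL is never contacted because openConnection() only creates the connection object.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class DefaultTimeouts {

    /** Returns {connectTimeout, readTimeout} in milliseconds for a connection that was never configured. */
    static int[] defaults(String urlPath) throws Exception {
        // openConnection() does not open a socket; it only builds the object.
        HttpURLConnection conn = (HttpURLConnection) new URL(urlPath).openConnection();
        return new int[] { conn.getConnectTimeout(), conn.getReadTimeout() };
    }

    public static void main(String[] args) throws Exception {
        int[] d = defaults("http://127.0.0.1:8080/"); // placeholder URL, never contacted
        System.out.println("connectTimeout=" + d[0] + " ms, readTimeout=" + d[1] + " ms");
    }
}
```

Both values come back as 0, the value the Javadoc describes as an infinite timeout.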

Solving the problem

I set both connectTimeout and readTimeout to 60 seconds, rebuilt the service image, and released it. Sure enough, the problem was solved: the service consumes messages normally, and although requests still time out occasionally, the service no longer blocks.
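A fix along these lines might look like the sketch below. It is not the actual HttpMethodUtils code: the class name TimeoutRequest, the post helper, and the local silent server used for the demonstration are assumptions made for illustration, and the demo uses 500 ms rather than 60 seconds so that it finishes quickly.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class TimeoutRequest {

    /** Sends a POST with explicit timeouts and returns the HTTP status code. */
    static int post(String urlPath, String body, int connectTimeoutMs, int readTimeoutMs) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlPath).openConnection();
        conn.setConnectTimeout(connectTimeoutMs); // upper bound on establishing the connection
        conn.setReadTimeout(readTimeoutMs);       // upper bound on waiting for response data
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setUseCaches(false);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        try {
            return conn.getResponseCode(); // throws SocketTimeoutException instead of blocking forever
        } finally {
            conn.disconnect();
        }
    }

    /** Demonstration only: a local server that accepts connections but never answers. */
    static boolean timesOutAgainstSilentServer() throws Exception {
        try (ServerSocket silentServer = new ServerSocket(0)) {
            post("http://127.0.0.1:" + silentServer.getLocalPort() + "/", "{}", 500, 500);
            return false;
        } catch (SocketTimeoutException e) {
            return true; // the worker thread gets control back and can retry or skip
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("timed out: " + timesOutAgainstSilentServer());
    }
}
```

With the timeouts in place, a stalled upstream surfaces as a SocketTimeoutException that the caller can log and retry, rather than a worker thread that never returns.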

Reflecting on the problem

In all the time the service had been running, this problem had rarely appeared. We recently migrated our services from AWS to Tencent Cloud. On AWS, many services running on EC2 or Fargate connected directly to MQ to fetch messages. After the migration, the domain name I had been using could no longer be reached directly, so I switched to fetching messages through the RESTful API. After the switch, a large number of data processing services went online and put heavy pressure on the API message proxy service, which at the time ran only two load-balanced instances, and many connect timeouts and read timeouts followed. When I later looked at the API proxy's traffic, it was indeed much higher. I have now added 2 more instances, which finally resolved the problem.

Conclusion

When troubleshooting, do not cling to the mindset that "the code has not changed and the service has run fine for a long time". That mindset only makes the problem take much longer to find. Use system tools to locate the problem first, fix it, and then reflect on why it happened.

Keep it up!

A few blog posts I recommend, which I used while troubleshooting:

1: How to use thread stack to locate problems

2: jmap, jstat, jstack

3: A must for Java programmers: the jstack command explained