Preface

A few days ago I ran into a very strange Dubbo-related problem in our test environment. I searched online and found no similar posts or articles, so I wrote this one.

I hope it helps anyone who has hit this problem or is hitting it now.

The symptom

One day a tester redeployed a Dubbo application in the test environment and found that the application “didn’t start up.”

But after a few hours it recovered on its own and was able to provide Dubbo services.

In fact, my follow-up investigation showed that the application had not failed to start. It was starting extremely slowly, so it only began serving requests after a very long startup.

How slow? About two hours.

As a result, the testers no longer dared to deploy anything for verification in the test environment: every feature check and every bug fix meant a two-hour wait, which no one could stand 😂.

Repeated observation confirmed it: the app really did take about two hours to come up.

Trying to solve it

In the end the tester gave up and asked me, a seasoned “accident report writer,” to take a look.

I didn’t take it seriously when I heard the symptom:

No need to think hard. The main thread must be blocked, so check whether initialization is hanging because the database, Zookeeper, or something similar cannot be reached. That is what handling many incidents had taught me.

So I replied to the tester and asked him to check with operations first. Don’t interrupt my slacking off unless absolutely necessary 🐳.

Early the next morning the tester’s WeChat avatar started flashing. I was ready to receive another “all hail the expert 👍” reply, but what I got was “the network is fine, nobody changed anything, and if this isn’t fixed we’re going on strike 🤬”.

All right, there’s no getting away from it.

First of all, the general direction for this kind of problem was still right: the main thread is blocked. But I could no longer guess blindly at what was blocking it.

After restarting the application, I ran jstack with the process ID to print a thread snapshot to the terminal, then scrolled to the end to see what the main thread was doing.
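jstack is an external tool, but the same kind of information is available from inside the JVM. As a rough in-process analogue (not what was used in this investigation, and with a hypothetical class name `MainThreadSnapshot`), the sketch below uses the standard `Thread.getAllStackTraces()` to print the state and stack of a thread with a given name:

```java
import java.util.Map;

public class MainThreadSnapshot {

    /** Returns the state and stack frames of every live thread with the given name. */
    public static String describeThread(String threadName) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            if (!thread.getName().equals(threadName)) {
                continue; // only the requested thread is of interest
            }
            out.append('"').append(thread.getName()).append('"')
               .append(" state=").append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The JVM's startup thread is normally named "main".
        System.out.print(describeThread("main"));
    }
}
```

Taking several such snapshots a few minutes apart and comparing them is the same idea as running jstack repeatedly: if the top frames never change, the thread is stuck there.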

The first few snapshots were normal:

Load Spring -> connect to Zookeeper -> connect to Redis, all executed in order without blocking.

After a while the application still had not come up, so I ran jstack again and got the following snapshot:

Digging into the source

I waited more than ten minutes, and jstack returned the same snapshot every time.

As you can see, the main thread is stuck at line 303 of Dubbo’s ServiceConfig.java.

So I looked up the source code at that location:

In a nutshell, the logic here obtains the local IP address and registers it with Zookeeper so that other services can call this one.

Following the stack further down, the thread is stuck in Inet4AddressImpl.getLocalHostName.

However, this is a native method, and our application cannot influence it at all. All the stack tells us is that the call to this native method is taking a very long time.
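To check whether the local hostname lookup is slow on its own, independent of Dubbo, one option is to time `InetAddress.getLocalHost()` directly. This is a public JDK method that resolves the local hostname, and the sketch assumes it goes through the same native lookup seen in the stack. The class and method names are made up for illustration:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class LocalHostTiming {

    /** Resolves the local host once and returns how long the call took, in milliseconds. */
    public static long timeLocalHostLookupMillis() {
        long start = System.nanoTime();
        try {
            InetAddress local = InetAddress.getLocalHost();
            System.out.println("Resolved " + local.getHostName() + " -> " + local.getHostAddress());
        } catch (UnknownHostException e) {
            // A failed lookup is still worth timing: it may fail only after a long wait.
            System.out.println("Lookup failed: " + e.getMessage());
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("Lookup took " + timeLocalHostLookupMillis() + " ms");
    }
}
```

On a healthy machine this should return almost immediately. If it takes seconds, the delay is in name resolution rather than in the application.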

So the investigation seemed to hit a dead end here, with not much more to be done at the code level.

The final solution

Since this is a native method, the problem has nothing to do with the application itself (which makes sense, since the problem appeared all of a sudden).

Could it be a problem with the server itself? Since the native method obtains the hostname of the local machine, could the hostname be involved?

(This was tested on my own Alibaba Cloud server; the hostname in the real test environment is different.)

After getting the server’s hostname, I pinged it, and something strange happened:

The command hung at first, from a few seconds up to tens of seconds, before printing the IP for the hostname and the latency.

When I pinged the IP directly, the output came back right away.

Finally, I tried adding the corresponding entry to the /etc/hosts file:

xx.xx.xx.xx(ip) hostname

After that, pinging the hostname was just as fast as pinging the IP directly.

So I restarted the app again, and everything was fine.

Conclusion

Finally, I will try to analyze the cause of this problem based on the fix:

  • When Dubbo starts, it obtains the local IP by resolving the server’s hostname: the DNS server returns the current IP address for that hostname.
  • There was a network problem on the DNS server, or between the local server and the DNS server, which made this lookup take much longer (this is a guess).
  • Adding the IP address to the local hosts file acts like a local cache. The locally configured address is used first, so the round trip to the DNS server is skipped and the lookup becomes fast.

While the problem was solved, a few questions remained:

The first is why the interaction with the DNS server was so slow. Even a slow lookup should not take two hours to return, the way the application did. I do not fully understand this part; readers with relevant experience are welcome to leave a comment.

The second is whether Dubbo could be more robust when it depends on external resources, although I’m sure very few people have run into this problem.

When startup has not completed for a long time, could it give a hint, for example by throwing an exception and exiting, and tell the developer the likely cause to make troubleshooting easier?
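As a sketch of what such a hint could look like (this is not Dubbo’s actual behavior, and `GuardedLookup`, `callWithTimeout`, and the 5-second limit are hypothetical), the lookup can run on a separate daemon thread with a time limit, so that a stuck call turns into an exception naming the likely cause:

```java
import java.net.InetAddress;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class GuardedLookup {

    /** Runs the lookup on a daemon thread and gives up after timeoutMillis. */
    public static <T> T callWithTimeout(Callable<T> lookup, long timeoutMillis) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "hostname-lookup");
            worker.setDaemon(true); // a stuck native call must not keep the JVM alive
            return worker;
        });
        try {
            Future<T> future = executor.submit(lookup);
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new IllegalStateException(
                "Local hostname lookup did not finish within " + timeoutMillis
                    + " ms; check DNS or add the hostname to /etc/hosts", e);
        } finally {
            executor.shutdownNow(); // best effort; a blocked native call may ignore the interrupt
        }
    }

    public static void main(String[] args) throws Exception {
        InetAddress local = callWithTimeout(InetAddress::getLocalHost, 5_000);
        System.out.println("Local address: " + local.getHostAddress());
    }
}
```

The native call itself cannot be cancelled, so the worker thread may stay blocked in the background. The point is only that the main thread fails fast with a readable message instead of hanging silently.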

Your likes and shares are the biggest support for me