Upgrade process and symptoms

Dubbo version was upgraded from 2.7.4 to 2.7.8. 2. It took more than 3 weeks for the development, test and pre-release environment to run.

The 2020-09-24 15:03:40. 664, dar: N/A, INFO, org. Apache. Zookeeper. ClientCnxn, ClientCnxn. Java, 1299, onConnected Session establishment complete on server xx/x:2181, sessionid = 0xe72dc060e3f1d06, Negotiated timeout = 40000 2020-09-24 15:03:40.666, [TID:N/A], INFO, 参 见, 参 见 org.apache.dubbo.remoting.zookeeper.curator.CuratorZookeeperClient, Log4jLogger.java, 66, info [DUBBO] Curator zookeeper client instance initiated successfully, session id is e72dc060e3f1d06, dubbo version: 2.7.8, current host: 10.3.107.40 15:03:53 2020-09-24. 734, dar: N/A, WARN, org. Apache. The zookeeper. ClientCnxn, ClientCnxn. Java, 1164, run Session 0xe72dc060e3f1d06 for server xx/xx:2181, unexpected error, closing socket connection and attempting reconnect java.io.IOException: Packet len18594568 is out of range! at org.apache.zookeeper.ClientCnxnSocket.readLength(ClientCnxnSocket.java:113) at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:79) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) at Org. Apache. Zookeeper. ClientCnxn $SendThread. Run (ClientCnxn. Java: 1141) the 2020-09-24 15:03:55. 540, dar: N/A, WARN, org.apache.dubbo.remoting.zookeeper.curator.CuratorZookeeperClient, Log4jLogger.java, 81, warn [DUBBO] Curator zookeeper connection of session e72dc060e3f1d06 timed out. connection timeout value is 3000, Session expire timeout value is 60000, Dubbo version: 2.7.8, current host: 10.3.107.40 15:03:56 2020-09-24. 165, dar: N/A, INFO, org. Apache. The zookeeper. ClientCnxn, ClientCnxn. Java, 879, Initiating session 2020-09-24 15:03:56.346, [TID:N/A], INFO, org.apache.zookeeper.ClientCnxn, ClientCnxn.java, 1299, onConnected Session establishment complete on server xx/xx:2181, sessionid = 0xe72dc060e3f1d06, Negotiated A timeout = 40000 2020-09-24 15:04:08. 135, dar: N/A, WARN, org.apache.dubbo.remoting.zookeeper.curator.CuratorZookeeperClient, Log4jLogger.java, 81, warn [DUBBO] Curator zookeeper connection recovered from connection lose, reuse the old session e72dc060e3f1d06, Dubbo version: 2.7.8, current host: 10.3.107.40 15:04:53 2020-09-24. 046, dar: N/A, WARN, org. Apache. The zookeeper. ClientCnxn, ClientCnxn. Java, 1164, run Session 0xe72dc060e3f1d06 for server xx/xx:2181, unexpected error, closing socket connection and attempting reconnect java.io.IOException: Packet len18594616 is out of range! at org.apache.zookeeper.ClientCnxnSocket.readLength(ClientCnxnSocket.java:113) at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:79) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) at Org. Apache. Zookeeper. ClientCnxn $SendThread. Run (ClientCnxn. Java: 1141) the 2020-09-24 15:04:54. 626, dar: N/A, WARN, org.apache.dubbo.remoting.zookeeper.curator.CuratorZookeeperClient, Log4jLogger.java, 81, warn [DUBBO] Curator zookeeper connection of session e72dc060e3f1d06 timed out. connection timeout value is 3000, Session expire timeout value is 60000, Dubbo version: 2.7.8, current host: 10.3.107.40Copy the code

4. And then there are millions of logs like this

The 2020-09-24 15:37:50. 309, dar: N/A, the ERROR, org. Apache. Curator. Framework. Imps. CuratorFrameworkImpl, CuratorFrameworkImpl.java, 694, logError Background operation retry gave up org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862) at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:647) at org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152) at org.apache.curator.framework.imps.GetChildrenBuilderImpl$2.processResult(GetChildrenBuilderImpl.java:187) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:604) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)Copy the code

5. O&m receives an alarm, indicating that ZK is down

6. Roll back the application

Question why

There are mainly the following doubts:

  1. Why is there a problem in the online environment while there is no problem in the other environment? Is there a problem with ZK online? It doesn’t exist
  2. org.apache.dubbo.metadata.report.MetadataReportWhat is it for? Why 2.7.4 has no problem and 2.7.8 has a problem?

Investigation process:

  1. Packet lenxx is out of range!
The 2020-09-24 15:03:53. 734, dar: N/A, WARN, org. Apache. Zookeeper. ClientCnxn, ClientCnxn. Java, 1164, run Session 0xe72dc060e3f1d06 for server xx/xx:2181, unexpected error, closing socket connection and attempting reconnect java.io.IOException: Packet len18594568 is out of range! at org.apache.zookeeper.ClientCnxnSocket.readLength(ClientCnxnSocket.java:113) at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:79) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)Copy the code

I encountered this problem when operating ZK. At that time, there was a node with a long string of chaotic numbers. When obtaining this node, I reported such an error. I did some research on the Internet,

public static final int packetLen = Integer.getInteger("jute.maxbuffer",4096 * 1024); protected void readLength() throws IOException { int len = incomingBuffer.getInt(); if (len < 0 || len >= ClientCnxn.packetLen) { throw new IOException("Packet len" + len + " is out of range!" ); } incomingBuffer = ByteBuffer.allocate(len); }Copy the code

The jute. Maxbuffer size can be used to fix this problem. But why did you report this error? Because some node is too big? However, there is no such error in version 2.7.4. Could it be caused by something listening on ZK in the new version?

2. Find a critical log in the log center and find that MetadataReport is involved in it

Injection of @DubboReference dependencies is failed; nested exception is java.lang.IllegalStateException: Failed to receive INITIALIZED event from zookeeper, pls. check if url zookeeper://xx:2181/org.apache.dubbo.metadata.report.MetadataReport? client=&dubbo.config-center.root-path=/ is correctCopy the code

In the current version of Dubbo, the metadata center is just a place to store metadata, and there is no complicated logic involved. And we have enabled the metadata center for all applications through the configuration center, and changed the metadata center to Redis. Why is this thing generated here?

3, inMetadataReportInstanceInit method, which involves fetchingMetadataReportWill eventually lead toZookeeperDynamicConfigurationInitialize the

4. Why does it not appear in other environments, but only in everyday life? The metadata center is redis configured for all clients in other environments through dubbo Global Configuration center, but not online. This results in the use of the default metadata center, and ZK. Dubbo uses the registry as the metadata center by default when the client is not configured with a metadata center