Author | Halimao, a Java and Go enthusiast (github.com/halimao/)

The official Arthas community is holding an essay contest with prizes.

What if a production bug cannot be reproduced in development? What if there are no logs at the key locations? Don’t panic, my young friend. Let the powerful Arthas carry you through the production environment.

When I was first introduced to Arthas, I was struck by the watch command’s ability to capture a method’s input parameters and return values. It lets you track the result of each step and read the current value of a variable, much like single-stepping in a local debugger. In the past, when there was not enough information to locate an online problem, we had to add log statements, and pinning the problem down could mean restarting the application repeatedly. With Arthas, there is no need to add log statements or restart the application. I spent about one weekend afternoon running the official demo locally and getting familiar with the common commands. That gave me a general idea of what Arthas can do, which problems it can solve, and how. When the need arose, it came in handy.

Here’s a particular bug that came knocking.

Background: The same chat and dating product is released under one main brand and several new brands. The servers share a common data set, but all brand-related text shown to users must be replaced with the matching brand’s copy. Users of the main brand and the new brands can chat with each other in the same group.

Symptoms:

  • In a production group, when a chat message contains content that needs to be replaced, some main-brand users see the correct text while others see the new brand’s text.
  • For the same user, some group chat messages are displayed correctly and others are not.
  • New-brand users always see the group chat messages with the copy replaced correctly.

First, the relevant code (decompiled directly with the Arthas jad command):

Group chat message delivery method:

Copy replacement logic in the write method:

PublishMessage is the downstream message class. replaceMsgMap holds the pre-generated copy for each new brand, while the main brand uses the original message copy.

At first glance, the copy replacement logic looks fine (the problem is in fact right here; take a moment to think about it), and it felt like yet another mysterious bug (never chalk up a programming bug to the supernatural). The code seemed correct, and a whole morning of local single-step debugging failed to reproduce the issue, so it had to be located in a live environment. Time for Arthas.

Because the production environment forwards a large volume of messages, attaching directly to the production process is risky and makes it hard to observe a single message. I therefore attached to the pre-release environment instead: the request volume is controllable, the data is consistent with production, and since only read operations are performed, production is not affected.

Use the Arthas watch command to observe the input parameters of the write method.

  • -x sets how many levels deep the printed objects are expanded; raise it to see the fields of parameters and results. The default value is 1.
  • -b observes the method before it is invoked.

As you can see, the publishMessage and userSession parameters are displayed. The data can then be observed by triggering messages in the pre-release environment.

I set up a test group containing my own main-brand test users and some new-brand users. The first several group chat messages showed nothing unusual; after one more new-brand user and one more main-brand test user were added, the problem reproduced far more often. I then observed the publishMessage value passed for each group member for the same group chat message, and found that when a new-brand user was traversed first and a main-brand user afterwards, the publishMessage copy for the main-brand user was actually the new-brand copy! I was startled: the bug came from such a basic mistake. You have probably guessed the cause by now.

Here’s why:

  • Every group member in the loop is passed the same publishMessage object, so each change to payload affects the publishMessage passed to the members that follow.
  • replaceMsgMap stores only the new-brand copy.
  • When a main-brand user looks up copy by appName and finds nothing, payload is not set, so the payload left in the most recently passed publishMessage is used.
  • Group members are traversed in random order.

If a main-brand user is traversed after a new-brand user, the payload field of the publishMessage has already been set to the new-brand text. The main brand finds no matching text in replaceMsgMap, so payload is not updated, and the payload set for the previously traversed user is reused. The text is therefore displayed incorrectly.
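The failure mode described above can be reproduced in a few lines. This is only a minimal sketch, not the decompiled production code: the class SharedPayloadBug, the nested PublishMessage stand-in, the deliver helper, the brand keys "mainBrand" and "newBrand", and the copy strings are all assumptions made for illustration. Only the names payload, replaceMsgMap, appName, and write come from the article.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SharedPayloadBug {
    // Hypothetical stand-in for the downstream message class.
    static class PublishMessage {
        String payload;
        PublishMessage(String payload) { this.payload = payload; }
    }

    // Assumed shape of the buggy logic: payload is overwritten only when
    // a brand-specific copy exists, and the instance is shared by all recipients.
    static String write(PublishMessage msg, String appName, Map<String, String> replaceMsgMap) {
        String replaced = replaceMsgMap.get(appName);
        if (replaced != null) {
            msg.payload = replaced; // mutates the shared message
        }
        return msg.payload; // the text this recipient ends up seeing
    }

    // Sends one group message to each recipient in order and records what each one sees.
    static List<String> deliver(List<String> recipientAppNames) {
        Map<String, String> replaceMsgMap = new HashMap<>();
        replaceMsgMap.put("newBrand", "Welcome to NewBrand"); // only new-brand copy is stored
        PublishMessage shared = new PublishMessage("Welcome to MainBrand");
        List<String> seen = new ArrayList<>();
        for (String appName : recipientAppNames) {
            seen.add(write(shared, appName, replaceMsgMap));
        }
        return seen;
    }

    public static void main(String[] args) {
        // The main-brand user traversed after the new-brand user gets the wrong copy.
        System.out.println(deliver(Arrays.asList("mainBrand", "newBrand", "mainBrand")));
        // prints [Welcome to MainBrand, Welcome to NewBrand, Welcome to NewBrand]
    }
}
```

Because the traversal order is random, the same main-brand user sometimes comes before a new-brand user and sometimes after, which matches the intermittent symptoms.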

The fix: add the main brand’s text to replaceMsgMap as well, so that the payload field is updated every time a main-brand user is traversed and the text is always displayed correctly.
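A minimal sketch of that fix, under stated assumptions: the class SharedPayloadFix, the nested PublishMessage stand-in, the deliver helper, the brand keys, and the copy strings are hypothetical and chosen only for illustration. The write logic itself is left unchanged; the only difference is that replaceMsgMap now also contains an entry for the main brand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SharedPayloadFix {
    // Hypothetical stand-in for the downstream message class.
    static class PublishMessage {
        String payload;
        PublishMessage(String payload) { this.payload = payload; }
    }

    // Same assumed replacement logic as before: overwrite payload when a copy exists.
    static String write(PublishMessage msg, String appName, Map<String, String> replaceMsgMap) {
        String replaced = replaceMsgMap.get(appName);
        if (replaced != null) {
            msg.payload = replaced;
        }
        return msg.payload;
    }

    // Sends one group message to each recipient in order and records what each one sees.
    static List<String> deliver(List<String> recipientAppNames) {
        Map<String, String> replaceMsgMap = new HashMap<>();
        replaceMsgMap.put("newBrand", "Welcome to NewBrand");
        // The fix: the main brand gets its own entry, so every recipient resets payload.
        replaceMsgMap.put("mainBrand", "Welcome to MainBrand");
        PublishMessage shared = new PublishMessage("Welcome to MainBrand");
        List<String> seen = new ArrayList<>();
        for (String appName : recipientAppNames) {
            seen.add(write(shared, appName, replaceMsgMap));
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(deliver(Arrays.asList("newBrand", "mainBrand", "newBrand", "mainBrand")));
        // prints [Welcome to NewBrand, Welcome to MainBrand, Welcome to NewBrand, Welcome to MainBrand]
    }
}
```

An alternative design would be to copy the message for each recipient instead of mutating a shared instance; the sketch follows the approach described in the article, which keeps the shared object and makes sure every brand has an entry.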

Throughout the whole troubleshooting process, there was no need to add logs or restart the online application, yet enough information was available to locate the problem. Spending half a day exploring and experimenting with Arthas can save a lot of effort when you need it. For details, see the official documentation.

Install and start Arthas with one click

  • Method 1: Implement Arthas one-click remote diagnosis using Cloud Toolkit

Cloud Toolkit is a free local IDE plug-in released by Alibaba Cloud to help developers develop, test, diagnose, and deploy applications more efficiently. The plug-in deploys local applications to any server, or to the cloud (ECS, EDAS, ACK, ACR, mini programs, etc.), with one click. It also has built-in Arthas diagnostics, Dubbo tools, a terminal, file upload, Function Compute, and a MySQL executor. It supports not only the mainstream versions of IntelliJ IDEA, but also Eclipse, PyCharm, Maven, and others.

To use Arthas this way, it is recommended to install Cloud Toolkit as an IDEA plug-in: t.tb.cn/2A5CbHWveOX…

  • Method 2: Download directly

Address: github.com/alibaba/art…

