Author: Tao Jianhui
This is an internal blog post I wrote in May 2020, hoping that the students of R&D and technical support could help users quickly locate bugs and solve problems. I did another iteration in December 2020, and did some internal training on it. I have also been following the questions in the wechat group for some time and found that quite a few users find it difficult to analyze TDengine’s logs or any other distributed system’s logs. Now I’m going public with this blog post, which uses analyzing TDengine logs as an example to describe a method that, if you can master it, will make analyzing distributed system logs extremely easy.
TDengine is a clustered system. Every operation involves logical nodes such as APP, TAOSC, MNode and VNode. These nodes communicate with each other through sockets. And during testing, there may be multiple instances of TDengine, which complicates the analysis. For an operation, how to collate logs of different logical nodes and filter them efficiently becomes the key to analyze the problem.
This article summarizes a set of methods for finding bugs quickly.
Enable the log function
TDengine each independent module has its own debugFlag, including TAOSC, DNode, VNode, Mnode, TSDB, WAL, sync, Query, RPC, Timer, etc. At present, the log output of each module can be controlled to:
- Fatal/Error, Error, Error will be displayed in logs
- Warning, Warning, logs display WARN
- Info: important information
- Debug indicates general information
- Trace, very detailed and recurring debugging information
- Dump, raw data
The output log can control the output to:
- file
- The screen
All the above control is controlled by a byte of the debugFlag, which contains the following bit diagram:
Therefore, if you want to print error, warning, INFO, and debug to a log file, set debug to: 135; If you also want to output trace logs, debug should be set to: 143; If only error and warning are printed, debug is set to 131. In normal cases, you are advised to set debug to 135.
The debug flag setting of each module is controlled by the taos. CFG configuration file. The parameters of each module and the module name displayed in logs are as follows:
If too many parameters are configured, the simplest way is to set the debugFlag parameter for the debugging. After setting this parameter, the debugFlag parameter is set to the same debugFlag parameter for all modules except TMR logs. The default value of the debugFlag is 0. If the debugFlag is not 0, all log configuration parameters will be overwritten. Util is not recommended except in special cases. The debugFlag of timer is 135, and 131 is appropriate.
The log file
TDengine generates client and server logs and stores them in the log directory. The default log directory is /var/log/taos, which can be specified by modifying the configuration parameter logDir in taos.cfg
- The client log file is named Taoslogy.x (because multiple clients can run, multiple log files can be generated on one machine)
- The log file on the server is taosdlog.x
The size of a log file is controlled. When it reaches a certain number of lines (controlled by numOfLogLines in taos.cfg), a new log file is generated. But TDengine only keeps two log files, whose names end in 0 or 1, alternating.
Log format:
Log files, from left to right, are divided into four large blocks
- Time stamps, to the point of subtlety
- Thread ID, because it is multi-threaded, this parameter is important, because only the log output of the same thread is guaranteed timing, is output according to the design flow
- Module name, three letters
- Logs output by each module
Steps for analyzing logs
When a test or customer reports a Bug, either manually or automatically, a specific action is performed. This operation typically involves executing an SQL statement. This problem can be caused by the client or the server code. The following uses create Table as an example to explain log analysis. Time stamps are removed in the figure to facilitate focused interpretation.
Look at the client first
The first thing you need to view is the client log, as shown in the following screenshot:
- First find the SQL statement that failed and search for “SQL:” in the client log to see it (the second line of the screenshot). If SQL result:success is displayed, SQL result:success is displayed. Otherwise, SQL result: XXXX is displayed, where XXXX is an error message. How to quickly find failed SQL, you need to know the approximate time range, SQL statement, so that the search will be fast.
- Taoc log data, one of the most important parameters is pSql, which is an address assigned to internal SQL Obj. Taosc places this address information at the top of all TAOSC logs, immediately after TSC. This value is critical for traditional client and server logging. In the screenshot above, it is highlighted with a green background.
- The pSql parameter is passed as aHandle to the RPC module, which prints it out in all messages (green background). Because many modules will call RPC module, RPC will also print out who called, for example, in the screenshot, is called by TSC, will print out RPC TSC.
- The RPC will send the create-table message to the server, and the RPC log will be printed out (screenshot line 8), telling which End Point of the DNode it is sent to. The screenshot shows that it is sent to hostname: 9BE7010a917E, port is 6030. If there is a problem, we need to check the log of the server where the End Point is located.
- As you can see, the RPC module received a response from the server, but to avoid the resource consumption of the conversion, the log shows only the IP address in hexadecimal (line 9 of the screenshot, 0x20012AC) and the port number. The RPC module’s log is critical because it colludes logical nodes.
If you look at the server
After analyzing the client logs, the server logs are critical. The following uses create-table as an example, please take a screenshot:
- From the client log, find the pSql, the value is 0x5572C4FAB3A0, so in taosdlog directly search 0x5572C4FAB3A0, you can see the log with green background in the screenshot. PSql is therefore an important parameter for stringing client and server logs together.
- The create-table operation is handled by an MNode. In the screenshot, because the first table is created, the vNode needs to be created first, then the table needs to be created and other operations, involving many modules, I won’t explain in detail.
- Finally, the MNode creates the table and sends back the response via the RPC module (line 52, last line of the screenshot), so it is clear that the server is working properly.
Note: After receiving the message, the DNode module will distribute the message to the message queue of MNode and VNode according to the message type. The worker thread then consumes the messages in the message queue and processes them. For VNode, the sub-modules TSDB, WAL, sync and CQ are all executed by single thread. Therefore, after finding pSql (the second line of the screenshot), you need to look down according to the thread ID to know the whole process, which is easy to analyze.
A few key points
- Find the failed SQL statement first
- Find the value of pSql and copy it, assuming XXXXX
- Grep “XXXXX” taoslogx.x, find the client log related to this SQL query, and see if the problem can be found
- Open the Taosdlog server log, search for the pSql value XXXXX, and check the timestamp to see if the operation failed
- The server logs are then analyzed
RPC module messages are critical. It is important to note that for each RPC message, parse code: xx is printed. This is the result of protocol parsing. 0 indicates that there is no problem, and any other value indicates that protocol parsing failed. But at the same time, the message itself contains code: 0xXX, which is the error code brought by the sender. Usually, the server sends the error code to the client. If the message is correct, the code is 0, otherwise, an error is reported.
Another log matching method
When a client sends a message through the RPC module, the log contains similar messages
sig:0x01000000:0x01000000:55893
Copy the code
This is the RPC’s source ID, dest ID, and transcation ID, which together uniquely identify links from a client. Each new message is sent by adding one to the transcation ID, which is unique for a period of time (transcation ids are two bytes and loop).
Version 1.6 can only rely on the string sig to match the client and server logs, but requires a lot of context, so it is cumbersome and inefficient.
Version 2.0 passes pSql to the server side so that client-server log matching is greatly accelerated.
Be familiar with logging methods
- Start by understanding the design of TDengine and the flow of each major operation.
- Turn on all log switches (set the debugFlag to 135), execute all SQL statements once, and check the corresponding client and server logs against each SQL statement.
View the SQL statements executed by the client
The client generates a lot of logs to view the SQL statements executed to analyze and repeat problems. How many ways are there to find out what SQL statements are being executed
- If the client log is enabled, run the following command: grep “SQL:” taoslog* to view all SQL statements executed in the log.
- If you are using TAOS to execute SQL statements manually, please view the hidden file taos_history in the home directory. Taos_history contains all history commands executed by TaOS.
- Configure the client. In the configuration file, set parameter tscEnableRecordSql to 1. That is, the SQL statements entered by the client are printed to a separate file (tscnote-XXXX. 0, where XXXX is pid).
- For the Resetful interface, setting the httpEnableRecordSql parameter to 1 in the Taosd configuration file prints the HTPP request to a separate file (httpNote.0) in the same directory as the server logs.
Dynamic Change Logs
Sometimes the server or client cannot be restarted, but the log Settings are incorrect. In this case, you need to dynamically set the log Settings. Perform the following steps:
Show dnodes; Alter Dnode ID debugFlag 143; // Set the corresponding debugFlagCopy the code
The id of the second step is obtained in the first step.
Sometimes, you need to output subsequent logs to a new file to facilitate the query and search of logs. Run the following command:
alter dnode id resetlog;
Copy the code
Sometimes the shell cannot be connected at all. In this case, you can send the SIGUSR1 command to the process on the machine running Taosd, for example:
kill -SIGUSR1 pidxxx
Copy the code
Starting from original: mp.weixin.qq.com/s/LUz1niYaj…