The "Java Engineers Becoming Gods" roadmap has 18K stars on GitHub — come take a look!


This article introduces a real case from our production environment. The problem occurred during a big promotion period and had a significant impact on our online cluster. Here is a brief review of it.

For readability, the investigation and resolution process described here may not match the actual sequence of events exactly, but the underlying ideas are the same.

Problem process

During a promotion, an online application suddenly triggered a large number of alarms indicating that disk usage was too high, at one point exceeding 80%.

We immediately logged on to the affected machine and checked its disk usage with the df command:

$ df
Filesystem     1K-blocks    Used Available Use% Mounted on
/               62914560 58911440 4003120  93% /
/dev/sda2       62914560 58911440 4003120  93% /home/admin

The machine's disk was filling up fast because of the high request volume during the promotion, so our first guess was that excessive logging had exhausted the disk.

As background: our online machines are configured to automatically compress and clean up logs, triggered when a single file reaches a certain size or when the machine's disk usage reaches a certain threshold.

However, log cleanup was not triggered on the promotion day, so the machine's disk filled up.

After investigating, we found that some of the application's log files took up a lot of disk space and were still growing:

du -sm * | sort -nr
512 service.log.20201105193331
256 service.log
428 service.log.20201106151311
286 service.log.20201107195011
356 service.log.20201108155838

du -sm * | sort -nr lists the size (in MB) of each file in the current directory, sorted in descending order.

Therefore, after talking with the ops engineers, we decided on emergency treatment.

The first step was to clean up log files manually. The ops engineers logged in to the server and manually removed some unimportant log files:

rm service.log.20201105193331

However, after executing the cleanup command, the disk usage on the machine did not decrease, but continued to increase.

$ df
Filesystem     1K-blocks    Used Available Use% Mounted on
/               62914560 58911440 4003120  93% /
/dev/sda2       62914560 58911440 4003120  93% /home/admin

So we set out to find why the disk space was not released after the logs were deleted. Using lsof, we found that a process was still holding the deleted file open:

$ lsof | grep deleted
sls    11526  root   3r   REG   253,0   2665433605  104181296  /home/admin/****/service.log.20201105193331 (deleted)

lsof | grep deleted lists all open files and filters for those in the deleted state.
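A quick way to enumerate every deleted-but-still-open file on a machine is to walk /proc directly (a sketch that assumes a Linux /proc filesystem; lsof +L1, which lists open files with a link count below 1, gives similar output where lsof is installed):

```shell
# Walk every process's fd table and print descriptors whose target
# was unlinked; readlink shows " (deleted)" appended to the path.
for fd in /proc/[0-9]*/fd/*; do
  target=$(readlink "$fd" 2>/dev/null)
  case $target in
    *' (deleted)') echo "$fd -> $target" ;;
  esac
done
```

This needs no extra tooling, which is handy on minimal production images where lsof is missing.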

After verification, this process turned out to be an SLS agent that continuously reads log content from the machine.

SLS is a logging service of Alibaba, providing one-stop data collection, cleaning, analysis, visualization and alarm functions. To put it simply, it will collect the logs on the server and persist them for query and analysis.

All of our online logs are collected through SLS, so our analysis pointed to SLS's log reading as the reason the disk space was not released.

At this point, the problem was basically located. Let's pause the story here and cover the background knowledge behind it.

Background knowledge

In Linux, file deletion is controlled by link counts: a file is only deleted when no links to it remain.

In general, each file has two counters: i_count and i_nlink. On Linux, a file is only truly deleted when both i_nlink and i_count reach 0.

  • i_count is the number of current users (callers) of the file;
  • i_nlink is the number of links on the storage medium (the hard link count);
  • i_count is an in-memory reference counter, while i_nlink is an on-disk reference counter.

When a process opens (references) a file, i_count increases; when a hard link is created to the file, i_nlink increases.
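The i_nlink side is easy to observe from the shell (a sketch assuming GNU coreutils' stat, whose %h format prints the hard link count):

```shell
# Watch the on-disk link count (i_nlink) change as hard links come and go.
tmp=$(mktemp -d)
echo data > "$tmp/a"
stat -c %h "$tmp/a"      # 1: one directory entry
ln "$tmp/a" "$tmp/b"     # hard link: same inode, second entry
stat -c %h "$tmp/a"      # 2
rm "$tmp/b"              # unlink one entry
stat -c %h "$tmp/a"      # back to 1
rm -r "$tmp"
```

Both names point at the same inode, so the count is a property of the file itself, not of any one name.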

On Linux and Unix systems, deleting a file with rm or a file manager merely unlinks it from the file system's directory structure: it decrements the on-disk reference count i_nlink, but does not touch i_count.

If a file is being called by a process and the user uses the rm command to “delete” the file, the file cannot be found by using the ls command or other file management commands, but it does not mean that the file is actually deleted from the disk.

Since there is still a process running normally, reading or writing to the file, the file is not actually “deleted”, so the disk space is always occupied.

This is exactly the mechanism behind our online problem: a process was still working on the log file, so the rm operation did not actually delete it, and the disk space was not freed.
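The whole failure mode can be reproduced in a few lines of shell (a Linux-only sketch; fd 3 here plays the role of the SLS process's open descriptor):

```shell
# Reproduce: a held-open descriptor keeps a "deleted" file alive.
f=$(mktemp)
exec 3> "$f"                    # open fd 3 for writing: i_count > 0
head -c 1048576 /dev/zero >&3   # put 1 MiB into the file
rm "$f"                         # unlink: i_nlink -> 0, space NOT freed
ls -l /proc/$$/fd/3             # target shows "... (deleted)"
exec 3>&-                       # close fd: i_count -> 0, space freed
```

Between the rm and the final close, df still counts that megabyte as used, exactly as we saw in production.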

Problem solving

After understanding the online problem phenomenon and the relevant background knowledge above, we can think of a way to solve this problem.

We needed to get rid of the SLS process's reference to the log file so that the file would actually be deleted and the disk space truly freed.

$ kill -9 11526
$ df
Filesystem     1K-blocks    Used Available Use% Mounted on
/               62914560 50331648 12582912  80% /
/dev/sda2       62914560 50331648 12582912  80% /home/admin

Think carefully about the consequences before running kill -9 on a server; do it carelessly and you may be told not to come back the next day.
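There is a gentler alternative to kill -9 worth knowing, assuming a Linux /proc filesystem: open the deleted file through the holding process's descriptor with truncation, which frees the space without killing the process. In our case that would have targeted PID 11526 and fd 3 from the lsof output; the demo below is self-contained:

```shell
# Free the space of a deleted-but-open file without killing the holder.
# In the article's case this would be:  : > /proc/11526/fd/3
f=$(mktemp)
exec 3> "$f"
head -c 1048576 /dev/zero >&3   # 1 MiB held open
rm "$f"
: > /proc/$$/fd/3               # open-with-truncate via /proc frees the blocks
stat -c %s /proc/$$/fd/3        # now 0
exec 3>&-
```

The holder keeps running with a valid (now empty) file, so this is safer when the process cannot be restarted during a promotion.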

Later, after a review, we found that there are two main reasons for such a problem:

  • 1. Online logs are printed too much and too often
  • 2. The pulling speed of SLS logs is too slow

After in-depth analysis, we found that the application printed a lot of process logs. Initially, logs were printed for troubleshooting online problems or data analysis, but during the rush period, the log volume increased rapidly, resulting in a rapid increase in disk space usage.

In addition, the application shared an SLS project with several other large applications, which slowed down SLS's pulling speed so much that it could not keep up with the log volume.

Afterwards, we also summarized some improvements. For the second problem, we split the SLS configuration of the application and managed it independently.

For the first problem, we introduced a log degradation policy during promotions: once the disk becomes overloaded, logging is degraded.
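A minimal sketch of such a degradation trigger, assuming GNU df and a hypothetical log-level.conf file that the application polls (the path and threshold are illustrative, not our real setup):

```shell
# Hypothetical degradation trigger: when disk usage crosses 80%,
# write a higher log level to a file the application polls.
usage=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')
if [ "$usage" -ge 80 ]; then
  echo ERROR > /tmp/log-level.conf
fi
```

In practice the trigger would push the level through a configuration service rather than a local file, but the shape of the check is the same.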

For log degradation, I developed a general-purpose tool that pushes the log level through a configuration item and changes the log output level dynamically online. The configuration change can also be wired into our contingency-plan platform, to run on a schedule or as an emergency action during promotions, which avoids this problem.

I'll share the design ideas and code of this log degradation tool in the next article.

Thoughts

Every post-mortem after a big promotion shows that most failures are caused by a pile-up of small problems.

Analyzing a problem often requires knowledge beyond pure development skills: operating systems, computer networking, databases, and even hardware.

So I’ve always believed that a great programmer is judged by his ability to solve problems!

About the author: Hollis, a person with a unique pursuit of Coding, is a technical expert of Alibaba, co-author of “Three Courses for Programmers”, and author of a series of articles “Java Engineers becoming Gods”.

If you have any comments, suggestions, or want to communicate with the author, you can follow the public account [Hollis] and directly leave a message to me in the background.