The incident
One day, the lovely product manager ran over to Chen PI and said that the xx service, which had not been used for a long time and had not had a recent release, was down!! We needed to deal with it quickly and send out an incident report.
The service was down, so the first step was to restart it and restore it as soon as possible. Sure enough, the service restarted quickly and ran normally.
As an internal service, it was not connected to the ELK log analysis platform. However, when I asked the operations staff to download the service's log files for analysis, they said there were no log files; only a single log file, created after that day's restart, was left!!
Without logs, how am I supposed to write an incident report? Isn't this sacrificing the programmer to the heavens? Somebody, help!!
Troubleshooting
First of all, the obvious move is to shift the blame!! Did operations run the all-purpose command rm -rf /? However, operations stubbornly refused to admit it, so I had to look elsewhere.
Since the logs were not deleted by hand, could there be a script on the server that periodically cleans up log files? After checking, no such script touches the logs of this service.
Well, if we rule out homicide, could it be suicide? That means analyzing the project itself. Since this project was originally developed and maintained by someone else, Chen PI pulled the project and opened the log configuration file logback.xml. Analysis showed that logs are written to a local log file, and that log files are rolled by date and by size: a single file is at most 50MB, and the log files of the last 7 days are kept. At first glance nothing is wrong. Part of the configuration is as follows:
<property name="maxHistory" value="7" />
<property name="maxFileSize" value="50MB" />
<appender name="logFile" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <File>./logs/server.log</File>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
        <FileNamePattern>./logs/server%d{yyyy-MM-dd}_%i.log</FileNamePattern>
        <MaxHistory>${maxHistory}</MaxHistory>
        <timeBasedFileNamingAndTriggeringPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedFNATP">
            <maxFileSize>${maxFileSize}</maxFileSize>
        </timeBasedFileNamingAndTriggeringPolicy>
    </rollingPolicy>
    <encoder class="ch.qos.logback.classic.encoder.PatternLayoutEncoder">
        <charset>UTF-8</charset>
        <pattern>${layout}</pattern>
    </encoder>
</appender>
I asked the cute product manager when he noticed the problem and which feature he had used last. He said that he wanted to log in to the system today to export some analysis data, so that he could raise more requirements for development. The last time he had used the system was a few days before the Chinese New Year.
Good grief, the last use was before the New Year, so the service had probably been down for a long time; because so few people use it, nobody noticed. Still, even if the service died before the New Year, it should have kept at least the log files from the 7 days before it died.
Then I remembered that operations had already restarted the service by the time I asked for the log files. Combined with the settings in the project's log configuration file, the truth came out.
The service had been down for at least 7 days, and the log configuration keeps files for at most 7 days. Once the service was restarted, every log file older than 7 days was deleted according to the configured policy, and a new log file was created for the day of the restart! So, who is going to take the blame?
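The reasoning above can be checked with a small model. What follows is a minimal Java sketch, not Logback's actual cleanup code. It assumes one archived file per day, a hypothetical timeline, and a simplified rule that any file dated more than maxHistory days before the restart date is removed at the first rollover after the restart. The class and method names are made up for illustration.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Simplified model of time-based log retention. This is NOT Logback's real implementation.
class RetentionSketch {

    // Returns the archive dates that are still inside the retention window.
    // Assumption: a file dated d survives if d is at most maxHistory days before 'today'.
    static List<LocalDate> survivors(List<LocalDate> archiveDates, LocalDate today, int maxHistory) {
        LocalDate cutoff = today.minusDays(maxHistory);
        List<LocalDate> kept = new ArrayList<>();
        for (LocalDate d : archiveDates) {
            if (!d.isBefore(cutoff)) {
                kept.add(d);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical timeline: the service logs for 5 days and then dies.
        List<LocalDate> archives = new ArrayList<>();
        LocalDate firstDay = LocalDate.of(2021, 2, 1);
        for (int i = 0; i < 5; i++) {
            archives.add(firstDay.plusDays(i));
        }
        // It is restarted 20 days after the last log file was written.
        LocalDate restartDay = LocalDate.of(2021, 2, 25);
        // Every archive is older than the 7-day window, so nothing survives.
        System.out.println(survivors(archives, restartDay, 7)); // prints []
    }
}
```

If the service had been restarted within the 7-day window, the same call would return the old files, which is why the logs would normally still have been there.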
A further inquiry with the product manager confirmed that the service had died before the New Year. Back then he had used the system to export several hundred thousand records; the export kept spinning, and after waiting a long time he simply closed the system. In addition, the JVM memory configured for the service is not very large, so it most likely ran out of memory (OOM) at that point. To address this, the service's memory was increased and the amount of data that can be exported was limited.
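One way to limit the export size is to reject oversized requests up front and read the rest in pages, so that only one page is held in memory at a time. The following is a minimal Java sketch under assumptions: the cap of 50,000 rows, the page size of 5,000, and all names are hypothetical and not taken from the real service.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical export guard. The limits and names are illustrative assumptions.
class ExportLimitSketch {
    static final int MAX_EXPORT_ROWS = 50_000; // assumed cap per export request
    static final int PAGE_SIZE = 5_000;        // assumed number of rows loaded at once

    // Returns {offset, limit} pairs covering the requested rows,
    // or throws if the request is negative or exceeds the cap.
    static List<int[]> pages(int requestedRows) {
        if (requestedRows < 0 || requestedRows > MAX_EXPORT_ROWS) {
            throw new IllegalArgumentException(
                "Export size must be between 0 and " + MAX_EXPORT_ROWS + " rows, got " + requestedRows);
        }
        List<int[]> result = new ArrayList<>();
        for (int offset = 0; offset < requestedRows; offset += PAGE_SIZE) {
            int limit = Math.min(PAGE_SIZE, requestedRows - offset);
            result.add(new int[] {offset, limit});
        }
        return result;
    }

    public static void main(String[] args) {
        // 12,000 rows are split into pages of 5,000 + 5,000 + 2,000.
        System.out.println(pages(12_000).size()); // prints 3
    }
}
```

Each pair could then drive one paged query, with the rows written to the export file before the next page is loaded.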
Optimal solution
To keep this kind of incident from happening again, the following measures can be taken:
- For disaster recovery, keep more than one copy of the logs. Besides the local files on the server, back up the log files to another location or connect the service to the ELK log analysis platform.
- Extend the log retention configured for the project (maxHistory) from 7 days to 30 days; the right duration depends on your own scenario.
- When a service has an incident, preserve the scene first: keep the logs and the service's thread stacks, and only then restore the service.
- Connect the service to a monitoring platform, so that the person in charge is notified promptly and can fix the problem in time, before the delay leads to a more serious incident.
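For the point about preserving thread stacks, here is a minimal Java sketch that captures the stacks of all live threads from inside the JVM using the standard ThreadMXBean API. It assumes the process is still responsive enough to run the code. For a hung process, an external tool such as the JDK's jstack is the more usual choice. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for keeping a thread dump before a restart.
class ThreadDumpSketch {

    // Builds a plain-text dump with the name, state and full stack of every live thread.
    static String captureThreadDump() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // false, false: skip lock details, which keeps the call supported on every JVM.
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            sb.append('"').append(info.getThreadName()).append("\" state=")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    // Writes the dump to a timestamped file under 'dir' and returns the file path.
    static Path saveThreadDump(Path dir) throws IOException {
        Files.createDirectories(dir);
        Path file = dir.resolve("thread-dump-" + System.currentTimeMillis() + ".txt");
        return Files.write(file, captureThreadDump().getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(captureThreadDump());
    }
}
```

The saved file should be stored outside the directory covered by the log retention policy, so that a later restart cannot clean it up along with the old logs.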
As for Logback itself, I have recently been reading the documentation on the official website. I will sort out some useful points and post them later.
Logback official website: logback.qos.ch/