I haven’t written anything for a full ten days; it has been a really busy stretch.
Friends who have added me on WeChat know that I went on a spring outing last weekend. The department organized a trip to Wailingding Island. The environment was quite good and there were few visitors at this time of year, so it was well worth the trip.
Today I want to talk about a production problem from that same weekend.
1 What happened
At noon on Sunday, I came back from Wailingding Island and went straight to the office because something had gone wrong in production. The problem was this: some HBase nodes had hung, and some data was lost as a result. Customers whose data was lost would get stuck when they applied for credit or a loan. Once it was clear that the data could not be recovered quickly, we decided to work around the problem at the system level. At that point I put a question to two senior colleagues: although this data is an input parameter to the rules, the rules might not actually use it to make decisions, so could we confirm with the rules team whether it is used? The only answer I got was that the data went live a long time ago, so it must be in use. That left us analyzing the data on the system side. The lost data turned out to be raw data rather than processed data, and the raw data does not go into the rules, so we simply modified the code that fetches the data source.
During a simple regression test, the testers noticed something strange: old data was being overwritten. We checked the various SQL configurations and found nothing wrong, since many model and rule inputs are configured the same way. We then started debugging what turned out to be a historical issue and still could not find the cause. At almost 11 p.m., a colleague contacted the rules team and learned that the data behind the stuck cases was not used by the borrowing rules at all. In other words, the stuck cases could be resolved simply by turning off that data source. Without further debate, we decided to fix the stuck cases first and analyze the historical problem carefully afterwards. After wrapping up, I got home at 1:30 a.m.
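Turning off a data source that the rules do not use can be sketched as a simple per-source switch. This is only an illustration under assumptions: the function `collect_rule_inputs`, the loader functions, and the source names are all hypothetical and are not taken from the actual system.

```python
from typing import Callable, Dict

# Hypothetical type: a loader takes a customer id and returns that source's fields.
Loader = Callable[[str], Dict[str, object]]


def collect_rule_inputs(
    customer_id: str,
    loaders: Dict[str, Loader],
    enabled: Dict[str, bool],
) -> Dict[str, object]:
    """Gather rule inputs, skipping any data source that is switched off."""
    inputs: Dict[str, object] = {}
    for name, load in loaders.items():
        if not enabled.get(name, True):
            # The source has been closed by an operator: do not call it at all.
            continue
        inputs.update(load(customer_id))
    return inputs


def load_profile(customer_id: str) -> Dict[str, object]:
    # Hypothetical healthy source.
    return {"age": 30}


def load_lost_source(customer_id: str) -> Dict[str, object]:
    # Stands in for the source whose underlying data was lost.
    raise LookupError(f"no data for customer {customer_id}")


loaders = {"profile": load_profile, "lost_source": load_lost_source}

# With the broken source switched off, the request no longer fails.
result = collect_rule_inputs("c-1", loaders, {"lost_source": False})
print(result)
```

The switch only makes sense once someone has confirmed that no rule depends on the disabled source, which is exactly the question that took until 11 p.m. to answer.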
2 Analysis
This week I kept following up on the historical production problem and finally found that it was a bug in the system framework, which caused private data to be overwritten by public data during data processing. I have been thinking about this incident ever since. In hindsight, the stuck cases could have been resolved quickly, yet it took a full 10 hours of struggle. There must be a reason for that.
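The post does not describe how the framework bug actually worked. As a purely illustrative sketch, one common way for private data to be overwritten by public data is a merge applied in the wrong order. The function names and fields below are hypothetical.

```python
def merge_buggy(private: dict, public: dict) -> dict:
    # In a dict merge the later mapping wins, so public values
    # overwrite private values that share the same key.
    return {**private, **public}


def merge_fixed(private: dict, public: dict) -> dict:
    # Apply public data first so that private data takes precedence.
    return {**public, **private}


# Hypothetical example data.
public_data = {"credit_limit": 1000, "region": "default"}
private_data = {"credit_limit": 5000}

print(merge_buggy(private_data, public_data))
print(merge_fixed(private_data, public_data))
```

A bug of this shape stays invisible as long as the two sets of keys never collide, which is one reason a long-running system can appear to have "always worked".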
2.1 Inertial thinking
Inertial thinking means that a person keeps thinking the way they did in the past, just as an object keeps moving under inertia. This kind of habitual thinking often creates blind spots and shuts out the possibility of innovation or change.
Two patterns of habitual thinking showed up in the process above.
- One is that the colleagues who had been through the whole development of the system flatly rejected the proposal to confirm whether the rules use the lost data. Because I had not been through that history, I looked at the problem as an outsider, which is how I came up with the idea of first determining whether the data is in use. The habitual assumption here: the data went live long ago and was used then, so it must still be used today.
- The other is that I kept trying to solve the historical production problem by checking the business code and the SQL configuration, because everything had been used this way before without any problem. The habitual assumption here: this had always worked fine before, so this time the problem must be in the business code or the SQL configuration.
The point is this: when an approach worked before, we tend to treat it as the correct answer the next time the same kind of problem comes up. But what was right before is only a reference answer, not the correct answer. This is a matter of mindset. If you treat it as a reference answer, your thinking stays divergent: when the reference answer turns out to be wrong, you can look for other reference answers or find a different solution. If you treat it as the correct answer, your thinking becomes rigid: you lock yourself inside that answer and cannot get out.
With that in mind, what can we do to break out of the rut? I am not sure whether the following two points are right, but after thinking them over I have decided to try putting them into practice.
- Tell yourself this is habitual thinking. In The Miracle of Mindfulness there are the examples of washing dishes and eating oranges. They are all about feeling the dishes and feeling the oranges. Anyone who works out will also know that when your muscles ache, you should feel that sensation. Instead of blaming yourself for falling back into the rut, confront it head-on: tell yourself that this is the rut, that the reference answer is wrong, and go find another answer.
- The empty-cup state. If there is no good reference answer, empty your mind and solve the problem in the normal way, based on what you actually see in front of you.
2.2 Prioritize
The most urgent thing at the time was to resolve the stuck cases in production. While working on that, we discovered a historical bug. At that point the fix for the stuck cases had already been verified, so it should have gone straight to production to relieve the immediate emergency, with the historical bug handled afterwards. In reality, the focus on the historical bug delayed the hotfix release for a long time.
Setting priorities is very important, not only in emergencies but also in daily work. There are many things to do every day, and you need to learn what to do first and what to do later. The four-quadrant work method can help with this. What is the four-quadrant work method? See below.
(Note: the quadrants here differ somewhat from those in mathematics. The third and fourth quadrants are swapped relative to the mathematical convention, because here the quadrants are ordered by importance and urgency: the first is important and urgent, the second is important but not urgent, the third is urgent but not important, and the fourth is neither.)
Each task is measured on two dimensions: importance and urgency. Of the two problems described above, the stuck cases belong in the first quadrant, while the historical bug belongs in the second. They should therefore be handled separately: first resolve the stuck cases, then the historical bug. Had I been aware of this at the time, we could have released the hotfix for the stuck cases as soon as it was verified and analyzed the historical bug afterwards.
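The classification above can be sketched in a few lines. This is a minimal illustration, not a planning tool: the `Task` class and the `quadrant` function are hypothetical names, and the numbering of the third and fourth quadrants follows the common convention (urgent but not important, then neither), which is assumed here.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    important: bool
    urgent: bool


def quadrant(task: Task) -> int:
    """Return the quadrant number (1 to 4) for a task."""
    if task.important and task.urgent:
        return 1  # important and urgent: do it now
    if task.important:
        return 2  # important but not urgent: schedule it
    if task.urgent:
        return 3  # urgent but not important (assumed convention)
    return 4      # neither important nor urgent


tasks = [
    Task("analyze historical bug", important=True, urgent=False),
    Task("release hotfix for stuck cases", important=True, urgent=True),
]

# Sorting by quadrant puts the hotfix ahead of the historical bug.
ordered = sorted(tasks, key=quadrant)
for task in ordered:
    print(quadrant(task), task.name)
```

Both tasks are important; what separates them is urgency, and that alone is enough to decide which one goes to production first.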
3 Summary
After this incident, I made myself calm down and think: about what went wrong, about the nature of the mistake, about how to avoid making the same mistake again, and about how to improve through concrete action. Making a mistake is not terrible; making the same mistake again is. Well, with this, I have grown a little more. I hope this write-up gives you some inspiration too.
You are welcome to follow the WeChat public account LieBrother, so we can share and improve together.