A few years ago, I took over a development team midway through a project. The team was building a business system for the financial industry. Development and testing were almost finished, and the system was fairly complex: three or four subsystems, many scheduled tasks, and integrations with several third-party systems.
After getting to know the team, I told everyone that we needed to build monitoring and alerting for this system (it goes by many names). Monitoring and alerting is the protective armor of a business system, and it matters a great deal. At that time, some people had no concept of monitoring and alerting and did not see why it was important. I suspect some thought I was just a new manager trying to make a mark. Most people's view was: I have implemented every requirement from product, and the testers have checked my work, so why create extra work for myself?
Indeed, when building software we often see this attitude: from the moment requirements arrive, developers tend to assume that once every designed function is finished, every test has passed, and the system is deployed online, it has been delivered and everything is fine. I took that for granted in the first few years of my career, too. In fact, this view is one-sided: we are still responsible for the system after it goes online. Why?
Because there is no such thing as a problem-free, bug-free system. After a system goes online, failures are inevitable; the only questions are how soon and how many.
What we can do is learn of a failure the moment it happens and deal with it quickly, so that its impact does not spread. Ideally, we spot the first signs of trouble and fix the problem before it occurs. That is what monitoring and alerting for a business system is for.
Back to the financial business system I took over. After the system went online, all kinds of problems appeared, just as expected. The developers were busy and constantly reacting. While I kept impressing on everyone the importance of monitoring and alerting, I started setting it up. Little by little, the team moved from reactive to proactive, and there were fewer and fewer problems. Here are some of the experiences that left the deepest impression.
1. The most common thing users encounter is bugs
No matter how well you test your system, there is no guarantee that it is bug-free; the bugs just have not shown up yet. Bugs are inevitable, but if you as a developer only hear about them from operations or customer service, that is a problem. A user hits a bug, the user calls customer service, customer service goes to find development: think how slow and inefficient that chain is. And if users cannot be bothered to report the bug and development never finds out, how many users will it affect?
Later, through instrumentation points in the code, response codes, exception handling, and so on, we largely achieved our goal: developers learn of a bug the moment it occurs, fix it right away, and ship the fix.
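As an illustration of the exception-handling approach, here is a minimal Python sketch. It is not the code we actually used; the decorator name `report_errors`, the `notify` callback, and the `transfer` function are all hypothetical, and a plain list stands in for a real notification channel.

```python
import functools
import traceback
from typing import Callable, List


def report_errors(notify: Callable[[str], None]):
    """Wrap a function so any unhandled exception is reported, then re-raised."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Send the function name, the error, and the stack trace to developers.
                notify(f"[BUG] {func.__name__}: {exc!r}\n{traceback.format_exc()}")
                raise  # the caller still sees the original failure
        return wrapper
    return decorator


sent: List[str] = []  # stand-in for a real channel such as SMS or email


@report_errors(sent.append)
def transfer(amount: int) -> int:
    """Hypothetical business function used only for this example."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    return amount
```

With something like this in place, the first failing call produces a report for the developers, whether or not the user ever contacts customer service.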
2. Some problems can be detected before they occur
At the time we ran into this problem: one of our functions needed to call a third-party interface. During integration the third party returned results quickly, within 1 to 2 seconds, so we set the timeout to 5 seconds; if no result came back within 5 seconds, we treated the call as failed. That left a margin of three seconds on top of two. There was no problem at launch, but after a few months, calls to the third-party interface often failed. Investigation showed that the interface was timing out: the third party's responses had become much slower than before and often exceeded 5 seconds.
Once we knew the cause, it was easy to fix. Afterwards we looked at the historical data and found that the response time had not slowed down suddenly; it had crept up gradually, from 1 second to 2 seconds to 3 seconds and beyond. If we had monitored the response time, raised a warning as it approached 5 seconds, and contacted the third party to make adjustments, the problem would never have occurred.
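The kind of early warning described above could be sketched in Python as follows. This is an illustration under assumptions, not what we ran: the class name `LatencyMonitor`, the 60% warning ratio, and the window size are all made up for the example, and only the 5-second timeout comes from the story.

```python
from collections import deque
from statistics import mean
from typing import Deque, Optional


class LatencyMonitor:
    """Track recent response times and warn before they reach the timeout."""

    def __init__(self, timeout_s: float = 5.0, warn_ratio: float = 0.6,
                 window: int = 100):
        self.timeout_s = timeout_s
        self.warn_at = timeout_s * warn_ratio  # assumed warning threshold
        self.samples: Deque[float] = deque(maxlen=window)  # sliding window

    def record(self, elapsed_s: float) -> Optional[str]:
        """Store one measurement; return a warning message if the trend is bad."""
        self.samples.append(elapsed_s)
        avg = mean(self.samples)
        if avg >= self.warn_at:
            return (f"average response time {avg:.2f}s is nearing "
                    f"the {self.timeout_s:.0f}s timeout")
        return None
```

Averaging over a window rather than reacting to a single slow call is a design choice: it ignores one-off spikes and reacts to the slow upward drift that caught us out.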
3. Rely on a third party, but not too much
Besides the interface described above, our business system depends on several other third-party systems. Third-party systems are built by people too, and they also have problems. When they fail, we are affected, and our users assume it is our system that is broken. Explaining does not help.
How do we minimize the impact of third parties on us? First, for the same service, integrate with as many third-party providers as possible and route between them, so that if one fails we can quickly switch to another. Second, test the third-party services proactively: before the users' peak period, we initiate a few transactions ourselves to check that the third-party services are working normally. The more important a third-party service (such as payment) is to our system, the more precautions we should take.
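Both ideas, routing across providers and probing them ahead of peak time, can be sketched in a few lines of Python. Everything here is hypothetical: the provider functions, their names, and the request format are invented for the example and do not describe any real payment gateway.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical provider type: takes a request dict, returns a result or raises.
Provider = Callable[[dict], dict]


def call_with_failover(providers: List[Tuple[str, Provider]],
                       request: dict) -> Tuple[str, dict]:
    """Try each provider in order; return (name, result) from the first success."""
    errors = []
    for name, provider in providers:
        try:
            return name, provider(request)
        except Exception as exc:
            errors.append(f"{name}: {exc!r}")  # remember why this one failed
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def probe(providers: List[Tuple[str, Provider]],
          test_request: dict) -> Dict[str, bool]:
    """Send a test request to every provider and record which ones are healthy."""
    health = {}
    for name, provider in providers:
        try:
            provider(test_request)
            health[name] = True
        except Exception:
            health[name] = False
    return health


def broken_gateway(request: dict) -> dict:
    raise ConnectionError("gateway down")  # simulated outage


def backup_gateway(request: dict) -> dict:
    return {"status": "ok", "amount": request["amount"]}


PROVIDERS = [("primary", broken_gateway), ("backup", backup_gateway)]
```

In this sketch `probe` would run on a schedule shortly before peak hours, and any provider reported as unhealthy would trigger an alert or be moved to the end of the routing order.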
4. Does the system we built feel slow to users?
Besides bugs and outages, another thing that easily drives users crazy is a system that is too slow: clicking a button and getting no response for a long time, or waiting ages for a page to refresh. Slowness has many causes; some are solved by upgrading the server configuration, some by optimizing the database and SQL, and some by optimizing code. As long as we can detect that the system is slow and roughly locate the cause, there is usually a way to optimize it.
Not long after our system went live, we started monitoring its speed. First we identified the system's peak hours; then, from the user's perspective, we recorded the response time of each function and optimized the slow ones promptly. Note that it is not enough to look at the response time of a single service or the execution time of a single SQL statement; you must measure from the user's perspective, because some functions involve multiple services or multiple SQL executions. No single service or SQL statement is slow on its own, but added together the total time can be quite long.
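To make the point about measuring from the user's perspective concrete, here is a small Python sketch. The `RequestTimer` class, the step labels, and the threshold are assumptions made for illustration; the idea is simply to time each step of one user-facing operation and judge slowness on the total.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class RequestTimer:
    """Time every step of one user-facing operation and judge the total."""

    def __init__(self, name: str, slow_threshold_s: float):
        self.name = name
        self.slow_threshold_s = slow_threshold_s  # assumed per-operation budget
        self.steps: Dict[str, float] = {}

    @contextmanager
    def step(self, label: str) -> Iterator[None]:
        """Measure one service call or SQL execution inside the operation."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.steps[label] = self.steps.get(label, 0.0) + elapsed

    @property
    def total_s(self) -> float:
        """Total time the user waited, summed over all recorded steps."""
        return sum(self.steps.values())

    def is_slow(self) -> bool:
        return self.total_s >= self.slow_threshold_s
```

A caller would wrap each service call or query in `with timer.step("label"):` and check `timer.is_slow()` at the end. Three steps of 0.4 seconds each would all pass a per-step limit of 1 second, yet the user would still wait about 1.2 seconds.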
5. Alerts can be delivered in many flexible ways
Once a problem is detected, how do you alert the developers? When we were building monitoring and alerting, many people proposed building a dedicated system where alerts could be viewed and where the resolution of each problem could be recorded. I said our goal is for developers to see problems promptly, and they will see nothing if they are not logged in. Keep the purpose in mind, not just the means. Email, SMS, WeChat, and QQ can all notify developers of a problem, and SMS, WeChat, and QQ are more real-time than a dedicated system.
Later, we used SMS and WeChat for important problems and email for less important ones. To make sure no alert went unseen, we generally notified several people at once so that they could remind each other. In the end we never built a dedicated system; we relied on email, SMS, WeChat, and so on, and it worked fine.
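The routing rule described above is simple enough to sketch in Python. This is only an illustration: the severity names, the `ROUTES` table, and the sender signature are assumptions, and the fake senders below just record messages instead of calling any real SMS, WeChat, or email API.

```python
from typing import Callable, Dict, List, Tuple

Sender = Callable[[str, str], None]  # hypothetical signature: (recipient, message)

# Assumed routing table: urgent problems use real-time channels, others use email.
ROUTES: Dict[str, List[str]] = {
    "critical": ["sms", "wechat"],
    "minor": ["email"],
}


def dispatch(severity: str, message: str, recipients: List[str],
             senders: Dict[str, Sender]) -> List[Tuple[str, str]]:
    """Send the message to every recipient on every channel for this severity."""
    delivered = []
    for channel in ROUTES[severity]:
        for person in recipients:  # several people, so they can remind each other
            senders[channel](person, message)
            delivered.append((channel, person))
    return delivered


outbox: List[Tuple[str, str, str]] = []  # records what the fake senders "sent"


def fake_sender(channel: str) -> Sender:
    """Build a stand-in sender that records messages instead of sending them."""
    def send(recipient: str, message: str) -> None:
        outbox.append((channel, recipient, message))
    return send


SENDERS: Dict[str, Sender] = {
    name: fake_sender(name) for name in ("sms", "wechat", "email")
}
```

Swapping a fake sender for a real one changes nothing in `dispatch`, which is the point: the delivery channel is a detail, and the goal is that the right people see the problem in time.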
Beyond the above, monitoring and alerting covers much more. There is monitoring of server storage, memory, and CPU, which is the most basic and most important kind and is usually handled by specialists (I am not one, so I have not covered it here). There is monitoring of scheduled task execution. There is also, for example, checking whether anyone is deliberately abusing the system to send messages. Monitoring and alerting goes by many names and can be implemented in many ways; there is no need to agonize over them. “It doesn’t matter whether a cat is black or white; if it catches mice, it is a good cat.”
The longer I work, the more strongly I feel the importance of monitoring and alerting (perhaps that is also age), because the cost of a business system failure is too high. If the system serves Internet users, then bugs, or even just a system that feels sluggish, can drive users away. At a time when acquiring a user keeps getting more expensive, it is a shame to lose them so easily. Of course, that is not to say that systems for other kinds of users can afford to fail: no matter who is using the system, when it breaks, it reflects badly on you as a developer.
Monitoring and alerting for a business system is not glamorous; it is the hero behind the scenes. It is not just a system; more importantly, it reflects our attitude and sense of responsibility.
About me: a programmer with more than 15 years of experience, manager of a 100-person technical team, founder of a game start-up that did not make money, and a sufferer of writing phobia who goes by zhen · si Ape. I was not good at writing before, but I recently decided to tackle my weakness head-on and share my experience and expertise through writing. Follow my WeChat official account (Siape Wai) for more articles.