About the author

Cai Xianghua, Technology Manager at the China Merchants Bank Credit Card Center and GOPS gold medal lecturer.

To begin, I would like to share a quote with you:

We didn’t do anything wrong, but somehow we’ve lost. 

                                        Stephen Elop (Nokia CEO)

The word VUCA has come up in several posts in the high-performance operations community. It stands for volatility, uncertainty, complexity, and ambiguity, and it describes the times we work in: our requirements are becoming less and less clear-cut. Development used to follow a waterfall model, where a delivery could take months and the requirements were fixed. Now, with more competitors and a steady stream of new demands, we face a great deal of uncertainty. For example, when a group-buying business model like Pinduoduo’s takes off, the business wants requirements delivered in a matter of weeks.

The constant change that VUCA brings therefore poses a major challenge to the underlying monitoring system, both in terms of the technology stack and in terms of the professional skills of the staff. In many cases, that change also means the challenges themselves are unpredictable.

1. Fintech challenges

What do you think is the top priority of operations?

On the surface, operations and maintenance (O&M) is a cost center that consumes a lot of money and resources, so for us the first priority of O&M is to stop losses. The only way to avoid unnecessary losses is to keep online systems highly available and services stable.

According to one consultancy, every minute of downtime costs a business money. For the financial sector, especially banks, the figures tend to be much higher.

So the changes brought about by VUCA have brought us a lot of trouble.

The biggest problem is infrastructure expansion: the capacity of the infrastructure grows exponentially.

The second problem is the introduction of new technologies and products. Banks used to favor minicomputers, but now we run more and more virtualization and big data platforms. Some enterprises are moving away from IOE (IBM, Oracle, EMC) and introducing domestic brands such as Huawei and Inspur. Compatibility and stability issues crop up from time to time when these are combined into a virtualized cluster. How to monitor these new products and spot potential problems is also a big challenge for our O&M staff.

The third problem is that the technology stack each person must master keeps getting deeper. When I first entered the IT industry, I was a system administrator who only knew Windows Server. Now, in addition to Windows, I need to be proficient in Linux, virtualization, storage, server hardware and software, and Python scripting, as well as some algorithms, security, DevOps, and agile practices such as Scrum. As technology in the IT industry keeps evolving, the required depth keeps increasing, which is reflected in the salaries offered for current openings and in the skills expected of candidates.

Finally, there is collaboration between teams. Development, business, and O&M staff often raise requirements from their own perspectives:

Development focuses on the quality and efficiency of code delivery; the business focuses on business availability and operational metrics; O&M pays more attention to the stability of the underlying infrastructure. Meeting these changing needs from all sides is another challenge the VUCA era brings us.

2. Monitoring targets in the Fintech environment

What are the objectives of monitoring in the FinTech environment?

Before the mobile Internet era, we operated for a long time as a traditional bank, with relatively few servers and staff. Now, in the era of the mobile Internet and financial technology, a lot has changed. We have two apps: a mobile banking app and the Handheld Life app.

Internet-facing applications have led to exponential growth in our base environment. We now have thousands, even tens of thousands, of servers. Our architecture is moving from single-instance application servers to more advanced approaches such as microservices and distributed systems, and the number of applications has grown considerably.

We still use traditional development languages such as Java and C#, but we are gradually adopting popular languages such as Python and Ruby. As the environment changes, we are also introducing DevOps, Scrum, agile, and related concepts and methodologies into the organization.

For operations staff, there is a lot to monitor. Generally speaking, the monitoring levels form a pyramid. We usually start from the operating system layer; above it we need to monitor middleware and applications, and below it we need to monitor the virtualization layer, storage, hardware, and so on. At each level there is a variety of products to cover: at the database level, for example, we monitor MySQL, SQL Server, and others.

In addition, each enterprise has its own threshold standards for monitoring. Even within a team, development and operations have their own boundaries. In an enterprise, operations often has to work with multiple development teams, so operations staff must understand an ever-growing number of technologies, sometimes even reading code to work out how to deploy monitoring and which metrics to monitor.

All in all, deploying a reasonably complete monitoring system in a sound and effective way is very complicated from every angle: technology, platform, and organization.

You have probably run into a similar problem yourselves: the operating system, database, and network administrators all say the systems they are responsible for are healthy, yet the business reports that something is abnormal or even unavailable. In the final analysis, this is because some monitoring is not in place and the monitoring depth is insufficient. I divide monitoring depth into four parts:

The first part is availability monitoring: this is the simplest level of monitoring, such as monitoring whether a port is alive or not.

The second part is performance monitoring: for example, is CPU utilization persistently high even though the system appears to be running normally?

The third part is log monitoring: mainly application logs, followed by security audit logs, system logs, and so on. Log monitoring helps ensure that our staff’s operations are compliant. Application logs also provide a reference for subsequent full-link monitoring and fault location.

The last part is custom monitoring: business requirements lead us to define many indicators of our own, such as the business volume over the past ten minutes. These KPIs can be queried in other ways, but if they can be queried directly within the monitoring system as additional custom monitoring, we can better meet the business’s monitoring needs and understand the state of the enterprise’s business from a business perspective.

To sum up, the monitoring depth includes the following four aspects: availability monitoring, performance monitoring, log monitoring, and custom monitoring.
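To make the difference between the first two depths concrete, here is a minimal Python sketch. It is not how Zabbix implements these checks; the function names, the two-second timeout, and the 90% threshold are assumptions chosen for illustration. The availability check only asks whether a TCP port accepts a connection, while the performance check asks whether a metric has stayed high across every recent sample.

```python
import socket


def port_is_alive(host: str, port: int, timeout: float = 2.0) -> bool:
    """Availability check: True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


def cpu_is_sustained_high(samples: list[float], threshold: float = 90.0) -> bool:
    """Performance check: True only if every recent sample exceeds the threshold.

    `samples` are CPU utilization percentages, oldest first (hypothetical input).
    An empty list is treated as "no evidence of a problem".
    """
    return bool(samples) and all(s > threshold for s in samples)
```

A host can pass the first check and still fail the second, which is exactly the "everything looks healthy but the business is slow" situation described above.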

3. Why Zabbix?

There are three routes to a monitoring product: first, fully self-developed (such as Xiaomi’s Open-Falcon); second, secondary development on top of an open source product (such as Zabbix or Nagios); third, a purely commercial product (such as SCOM or Tivoli).

Self-developed products can meet your own needs well, but the development cycle is long and the demands on developers are high, so there may not be much output in the short term.

Commercial products come with solid after-sales support, but they are mostly closed source, and vendors tend to stick to the letter of the contract. Personalized requirements are hard to implement or take a very long time, which makes customized monitoring difficult. At the same time, very few commercial monitoring products are capable of full-stack monitoring.

Secondary development on top of an open source product has a learning cost, but it offers good compatibility and customizability, and there is no need to spend heavily on commercial licenses and support. For enterprises with some research and development capability, it is an ideal choice.

In addition, because our platforms are heterogeneous, we wanted a single product that could cover most of our monitoring needs and give us unified monitoring. After weighing everything, we chose Zabbix, because it strikes a good balance between the breadth and depth of monitoring.

What are the features of Zabbix?

  1. Open source and free, with community support. There is no split between a community edition and a commercial edition.

  2. Distributed and highly available. Much monitoring software runs as a single point, without caching, HA, and so on. Zabbix has these necessary features.

  3. Low-level discovery and automatic discovery. As the number of monitored devices grew, we found that adding monitoring by hand was tedious: items were duplicated or omitted, and only after a failure did we realize that some necessary monitoring had never been added. Low-level discovery and automatic discovery are therefore very useful and greatly improve the accuracy and timeliness of monitoring.

  4. Full stack level monitoring. I mentioned the platforms we need to monitor and the different products within each platform. It is true that we can build multiple sets of independent platforms to complete monitoring, but is there a platform that can achieve unified monitoring at the whole stack level? Zabbix can.

  5. Customizable. Many organizations are now introducing DevOps pipelines, but few make monitoring part of them. Monitoring should be an important part of the overall DevOps pipeline: CI/CD is mostly about continuous delivery, but operations staff need continuous monitoring after delivery. Zabbix provides a standard API that integrates well with the DevOps pipeline.

4. Best practices and case sharing

Finally, I’d like to share some best practices about Zabbix that might be useful for using Zabbix (or any other monitoring system) in a real-world environment.

4.1 Distributed automatic monitoring

Our monitoring system used to be a single point. As the number of servers grows, we pay more attention to the stability and availability of the monitoring system itself.

As the number of servers continues to grow, so does the number of applications, which puts more pressure on the monitoring system. At the same time, we need a single platform for monitoring rather than several, because multiple platforms inevitably lead to monitoring noise or duplicated monitoring.

In addition, we need to minimize manual addition of monitoring, which can leave objects unmonitored, or leave necessary monitoring items missing even when the object itself has been added to the monitoring system. Monitoring must therefore be automated rather than manual to ensure it is effective.

In the figure above, there is a Proxy in each network area that acts as the collector. It gathers all the monitoring data in its area and communicates with the master, so only the necessary ports need to be opened on the firewall, which improves the security of the entire system. This is distributed monitoring; users can access the Zabbix web interface to see the current status of the systems in real time.

Automated monitoring is easy in Zabbix. The administrator needs to define only three elements: the monitored hosts, the templates (the monitoring items for a certain type of host and the thresholds that trigger alarms on those items), and the person in charge (who is responsible for managing a certain type of host).

Here’s how it works: Zabbix periodically scans the user-defined network segments (for example 192.168.0.0/16; 123.1.0.0/24). When a Windows server is found in a segment, Zabbix automatically adds it to the monitoring system and links it to a Windows template. At the same time, we define these Windows servers as being managed by the administrators of the Windows team and associate them accordingly.

So the administrator only has to define the host network segments, the template associations, and the team associations by hand. Once these are defined, Zabbix scans periodically and, whenever a host matches the rules, associates it automatically, removing the inaccuracy of human intervention.

This kind of automatic monitoring avoids both missed monitoring and monitoring noise. There is no need to add monitoring manually: you only define the rules and hand all the ongoing work over to Zabbix, which minimizes the unpredictability of human intervention in the whole process.
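The rule matching described in this section can be sketched in a few lines of Python. This is only an illustration of the idea (segment plus host type maps to a template and a team), not Zabbix’s own discovery code; the `DiscoveryRule` class, the template names, and the team names are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DiscoveryRule:
    segment: str   # network segment to scan, in CIDR notation
    os_type: str   # what kind of host the rule applies to
    template: str  # template to link when the rule matches (hypothetical name)
    team: str      # team that becomes responsible (hypothetical name)


# Illustrative rules; the segment comes from the text, the rest is assumed.
RULES = [
    DiscoveryRule("192.168.0.0/16", "windows", "Template OS Windows", "windows-team"),
    DiscoveryRule("192.168.0.0/16", "linux", "Template OS Linux", "linux-team"),
]


def match_rule(ip: str, os_type: str, rules=RULES) -> Optional[DiscoveryRule]:
    """Return the first rule whose segment contains `ip` and whose OS type matches."""
    addr = ipaddress.ip_address(ip)
    for rule in rules:
        if addr in ipaddress.ip_network(rule.segment) and rule.os_type == os_type:
            return rule
    return None  # host is outside every defined segment, so it is left alone
```

For example, `match_rule("192.168.3.7", "windows")` returns the Windows rule, so the host would be linked to the Windows template and assigned to the Windows team, while a host at `10.0.0.1` matches nothing.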

4.2 Two-Dimension Management

In practice, a single server may carry a database, middleware, and an operating system that all need monitoring.

Different perspectives bring different monitoring requirements: the database team only cares about database alarms, the middleware team about middleware alarms, and the operating system team about operating system alarms, and each team wants to receive only the alarms relevant to it. Besides keeping alarms effective, we also need to keep monitoring permissions to a minimum for the sake of security compliance.

In Zabbix, we organize monitoring along two dimensions: a platform dimension and a service-line dimension.

The platform dimension refers to what database, middleware, or operating system is running on the monitored host.

The service-line dimension refers to the service line to which the monitored host belongs. For example, a Linux server for the User system that also runs Tomcat middleware is placed in the P-app-Tomcat, P-OS-Linux, and S-business-user groups.

All P groups are linked to the corresponding templates to achieve standardized monitoring; S groups are linked to personnel, so that only the people associated with a service can view its monitoring information and receive its alarms. As a result, nothing is missed, monitoring noise is reduced, and monitoring becomes automatic and standardized.
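A small Python sketch may help show how the two dimensions work together. The group names are the ones from the example above; the template names, the recipient name, and the `resolve` function are assumptions for illustration and do not come from Zabbix itself.

```python
# P groups (platform dimension) map to templates; names on the right are hypothetical.
GROUP_TEMPLATES = {
    "P-app-Tomcat": "Template App Tomcat",
    "P-OS-Linux": "Template OS Linux",
}

# S groups (service-line dimension) map to the people who may see and receive alarms.
GROUP_RECIPIENTS = {
    "S-business-user": ["user-system-team"],
}


def resolve(host_groups):
    """Return (templates, recipients) for a host, given the groups it belongs to."""
    templates = sorted(
        GROUP_TEMPLATES[g]
        for g in host_groups
        if g.startswith("P-") and g in GROUP_TEMPLATES
    )
    recipients = sorted(
        {
            person
            for g in host_groups
            if g.startswith("S-")
            for person in GROUP_RECIPIENTS.get(g, [])
        }
    )
    return templates, recipients
```

Calling `resolve(["P-app-Tomcat", "P-OS-Linux", "S-business-user"])` yields both templates for the host and a single recipient list, so what is monitored and who is told are decided independently.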

4.3 Alarm Notification

With monitoring deployed, let’s move on to the alarm notification mechanism. In general, we follow three principles: layered notification, multi-channel notification, and detailed notification content.

Layered notification: We set different alarm levels and alarm policies according to severity. There are five alarm levels in Zabbix (not counting “Not classified”), and in production we use three of them: Disaster, Warning, and Information.

Disaster is defined as a problem that directly affects production and requires the administrator to deal with it immediately. These alarms are sent right away to the corresponding administrator and their immediate manager, as well as to the staff on duty in the 7×24 monitoring center.

For potential or less urgent problems, we set the level to Warning. These alarms are sent only to the corresponding administrator, who can choose to deal with them immediately or later. Information-level alarms are generally used for testing or for minor issues.

Multi-channel notification: Alarms are delivered through a large-screen display, email, or SMS, to make sure they reach users as quickly as possible.

Detailed notification content: As can be seen in the figure below, when an alarm is sent by SMS or another channel, we try to include all the relevant information: the alarm status, the specific trigger content, the server and its IP address, the current value, and the contact person and phone number.

This lets the people concerned understand the state of the monitored object right away; it also lets the 7×24 duty staff reach exactly the right administrator without delay.
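The layered routing and the detailed message can be sketched together in Python. The routing table follows the description above for Disaster and Warning; sending Information alarms to nobody is an assumption, as are the `Alarm` class, the field names, and the message layout, which only mirror the list of fields mentioned in the text.

```python
from dataclasses import dataclass

# Who is notified per severity. Disaster and Warning follow the text;
# the empty list for Information is an assumption (test or minor alarms).
ROUTES = {
    "Disaster": ["administrator", "immediate manager", "7x24 duty staff"],
    "Warning": ["administrator"],
    "Information": [],
}


@dataclass
class Alarm:
    status: str    # e.g. PROBLEM or OK
    severity: str  # Disaster, Warning, or Information
    trigger: str   # what fired
    host: str
    ip: str
    value: str     # current value of the monitored item
    contact: str
    phone: str


def recipients_for(alarm: Alarm) -> list[str]:
    """Layered notification: pick recipients by severity."""
    return ROUTES.get(alarm.severity, [])


def format_message(alarm: Alarm) -> str:
    """Detailed content: everything the duty staff needs in one message."""
    return (
        f"[{alarm.status}] {alarm.severity}: {alarm.trigger}\n"
        f"Host: {alarm.host} ({alarm.ip})\n"
        f"Current value: {alarm.value}\n"
        f"Contact: {alarm.contact} {alarm.phone}"
    )
```

Keeping the routing table separate from the message format means the same text can go to the large screen, email, and SMS channels without change.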

 

 

4.4 Panel Display

Whatever the monitoring system, it needs a good view display function. Traditional commercial software is often hard to customize and not easy to use. Zabbix provides rich graphical presentation capabilities.

With each iteration of Zabbix, the presentation of views has improved. The panel shown above is from Zabbix 2.2, and we split it into two parts. Each cloud in the top half is a class of business system, such as the user login system or the security system. The colors tell you the health of each system:

For example, red means a Disaster alarm exists in the system and yellow means a Warning alarm exists, which helps the administrator deal with faults promptly. The shape also tells you whether a system is in maintenance mode (an orange rectangle indicates that it is).

The lower part provides an alarm list, from which you can learn a lot: the severity of each alarm, which alarms are active, when each alarm occurred, and how long it has lasted.

From version 3.4 onward, including the current version 4.0, much more customization is possible, with a wider variety of presentation.

If you are not satisfied with Zabbix’s presentation, you can also turn to third-party tools such as Grafana and ECharts.

4.5 Automatic Out-of-band management

Out-of-band management is an advanced feature. Often a server goes down for no apparent reason, and we have to go into the machine room, find the server, and log in from a mobile workbench, which is very troublesome.

Zabbix can automate out-of-band management. The pain points of out-of-band management are as follows:

  1. Unreliable! Manual inspection of the machine room is not timely and there are omissions.

  2. Too many pitfalls! Potential problems such as firmware defects cannot be detected in time.

  3. Too complicated! Multiple management platforms such as HP, DELL, and Huawei cannot be unified.

  4. Too expensive! You need to purchase additional KVM devices and KVM switches, and licensing and machine room space are costs as well.

Various vendors provide their own out-of-band management software. The figure above shows the out-of-band management interfaces of HP, Dell, and Huawei respectively. Their basic functions are similar: restart the server, configure the software and hardware of the server, and upgrade the firmware.

We connect all the management ports on these machines to an out-of-band network, where a DHCP server assigns IP addresses automatically. OOBProxy is a Zabbix proxy that periodically scans the entire network segment. Once a server is found online, it is added to Zabbix and the out-of-band template is applied. This template is based on the IPMI standard, so brand differences do not need to be considered. The data collected through Zabbix also provides standardized data for the CMDB. In effect, this solution is a first version of an alternative KVM platform that eliminates expensive hardware and licensing costs.

4.6 Continuous integration/continuous delivery

When an application is released, a series of operations must be performed on the server, middleware, and application, which causes monitoring noise. How to reduce noise during releases and how to integrate with other platforms are things a monitoring platform needs to consider. On the other hand, when expanding an existing business, scientific and reasonable capacity planning is also one of the monitoring system’s responsibilities. For example, when we launch a new flash-sale application, Zabbix data can serve as a reference for estimating how many servers and resources the system will need.

Most organizations doing continuous integration have version control (Git) and continuous integration (Jenkins) platforms, along with a configuration management tool (Ansible) to manage the various base platforms.

In the figure above, Jenkins calls the standard Zabbix API to put the corresponding monitored objects into maintenance mode, which avoids monitoring noise during a release. Zabbix also synchronizes the data it collects to the CMDB, which is shared with other DevOps platforms to keep our online configuration up to date and available. Zabbix also connects to notification channels such as WeChat, SMS, and email to notify users of alarms right away.
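As a rough illustration of the maintenance-mode step, the Python sketch below builds the JSON-RPC request body that a pipeline job could send to the Zabbix API before a release. It only constructs the payload and does not send it. The field names follow the `maintenance.create` method as documented for Zabbix 4.0-era releases, to the best of my understanding; newer releases changed how hosts and authentication are passed, so check the API reference for your version. The function name, the maintenance name format, the host ID, and the token are hypothetical.

```python
import json
import time


def build_maintenance_request(host_ids, duration_s, auth_token, now=None, request_id=1):
    """Build a JSON-RPC body asking Zabbix to create a one-time maintenance window.

    Assumes the Zabbix 4.0-style API: hosts passed as `hostids` and the session
    token passed in the top-level `auth` field.
    """
    start = int(time.time()) if now is None else int(now)
    return {
        "jsonrpc": "2.0",
        "method": "maintenance.create",
        "params": {
            "name": f"release-{start}",           # hypothetical naming scheme
            "active_since": start,                 # Unix timestamps
            "active_till": start + duration_s,
            "hostids": list(host_ids),
            "timeperiods": [
                # timeperiod_type 0 means a one-time period
                {"timeperiod_type": 0, "start_date": start, "period": duration_s}
            ],
        },
        "auth": auth_token,
        "id": request_id,
    }


if __name__ == "__main__":
    body = build_maintenance_request(["10084"], 1800, "EXAMPLE-TOKEN")
    print(json.dumps(body, indent=2))
```

A release job would POST this body to the Zabbix API endpoint before deploying and remove the maintenance window afterwards, so alarms raised by the restart itself never reach the on-call staff.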

As a result, Zabbix provides a standard API across the whole continuous integration and delivery process and integrates closely with the DevOps pipeline. It also collects all kinds of data that can serve as a reference for capacity planning, rather than leaving resource needs to be decided after the fact.

In summary, you can use the best practices described above to evaluate which monitoring platform best suits your enterprise’s needs. The advantage of Zabbix is that it is open source and free. Zabbix may not match the in-house monitoring platforms of BAT (Baidu, Alibaba, Tencent) in functionality, but it lets SMEs carry out full-stack monitoring with very few resources.

Zabbix is also lean and efficient in terms of management cost and resource expenditure, and monitoring with it does not require much manpower. To be clear, no monitoring platform is a panacea, and Zabbix is no exception. Some very deep requirements, such as detailed analysis of a single application, cannot be automated by any platform. Overall, though, Zabbix covers 80% of monitoring needs. The biggest benefit of using Zabbix is that it reduces human involvement and enables full-stack, unobtrusive monitoring through advanced features such as automatic discovery and low-level discovery, together with tight integration with other systems in the DevOps pipeline.

More and more Chinese documentation is now available, and more and more members of the Zabbix Chinese community are contributing their experience. Community resources are rich and there are regular offline events; if you are interested, you can follow the open source community’s official account.