About the author

Wang Wei is a senior MySQL DBA at JD.com (Jingdong Mall) with years of MySQL operations experience in the gaming and e-commerce industries. He is committed to MySQL automation and self-service operations, and focuses on MySQL database architecture, tuning, operations, and Zabbix monitoring.

Main text

With the rapid growth of JD's business, MySQL is used ever more widely and the number of servers is growing quickly, which places increasingly high demands on JD's MySQL DBA team. A monitoring system provides the accurate data that database management and maintenance depend on; it is the eyes and ears of database operations staff.

Accurate, timely, and effective monitoring lets operations staff know exactly how the production systems are running. By analyzing monitoring data, they can judge the running status of each monitored database, predict likely problems, and work out an appropriate optimization plan in time, keeping the whole system running normally and efficiently. This goes a long way toward protecting the databases and avoiding unnecessary losses. It is therefore worth gaining a deeper understanding of Zabbix.

I. Introduction to Zabbix features

Zabbix is an enterprise-grade open source solution that provides distributed system monitoring and network monitoring through a web interface. It can monitor a wide range of network parameters to keep server systems running safely, and it provides a flexible notification mechanism that helps system administrators locate and resolve problems quickly. That is how Baidu Encyclopedia defines Zabbix. There are many monitoring tools on the market, so why choose Zabbix? Let's look at some of its features:

1. Automatically discover servers and network devices.

This feature is of limited practical value. It does not work well where many applications and mixed device types coexist, and it makes the overall operation and management of Zabbix harder. In practice it causes various problems, especially with a large number of devices, so it is not recommended. Instead, add devices to Zabbix monitoring in combination with the CMDB, so that devices are located and templates attached more accurately.

2. Low-level discovery (LLD).

This is convenient and practical: out of the box it automatically discovers file system partitions and multiple network interfaces. It can also be customized, for example to monitor multiple MySQL instances on one host.

3. Distributed monitoring system and centralized Web management.

Zabbix supports both active and passive monitoring. The mode is defined from the client's point of view: in active mode the agent pushes monitoring data to the server, and in passive mode the server pulls monitoring data from the agent. Active mode is recommended because it reduces the load on the server. Zabbix can also monitor at second-level granularity, which some monitoring software cannot do and which matters for critical services.

4. Wide range of support.

Zabbix can monitor the wide variety of devices, operating systems, services, and logs found on the market. Devices can be monitored with Zabbix's own agent or without an agent, for example over SNMP.

5. Flexible monitoring item settings.

Zabbix already supports many common monitoring items. Users can also write scripts to define custom items and combine multiple thresholds so that alarms are precise. For example, a disk alarm can be set to fire only when disk usage exceeds 80% and the remaining space is less than 50 GB.

6. High-level business views of monitored resources, with both vertical and horizontal comparison.

For example, for a sharded database cluster, one performance metric from every master can be placed in a single graph, making it easy to compare the load across masters. Multiple graphs can also be combined into one screen, so that many performance metrics can be compared at a glance.

7. Flexible user permission settings.

Zabbix supports custom events and email notification, as well as alarm escalation and audit logging.

8. Fault self-healing based on Zabbix alarm.

Given a standardized fault-handling process in which alarms are graded and classified, low-level faults that have a fixed remedy can be handled automatically on the basis of Zabbix alarms, so that they recover quickly. This matters: done well, it maximizes service availability and stability while reducing the risk of human error and the cost of staff time.

9. Powerful built-in API.

Almost every configuration operation available in the Zabbix server's web interface can also be performed through its API, which users can build on to meet their own operations automation needs.
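The combined threshold described under feature 5 can be sketched in a few lines. This is a minimal Python illustration of the logic, not Zabbix trigger syntax; the function name `disk_alarm` and its parameters are hypothetical, and 50G is taken here to mean 50 GiB.

```python
GIB = 1024 ** 3  # bytes per GiB


def disk_alarm(total_bytes: int, free_bytes: int,
               max_used_pct: float = 80.0,
               min_free_bytes: int = 50 * GIB) -> bool:
    """Return True only when BOTH conditions hold: usage is above the
    percentage threshold AND free space is below the absolute threshold."""
    used_pct = 100.0 * (total_bytes - free_bytes) / total_bytes
    return used_pct > max_used_pct and free_bytes < min_free_bytes


# A 10 TiB disk at 85% usage still has 1.5 TiB free, so no alarm is raised;
# a 200 GiB disk at 85% usage has only 30 GiB free, so the alarm fires.
print(disk_alarm(10 * 1024 * GIB, 1536 * GIB))
print(disk_alarm(200 * GIB, 30 * GIB))
```

Combining the two conditions avoids noisy alarms on very large disks, where 80% usage can still leave plenty of free space.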

One of Zabbix's biggest drawbacks is the lack of alarm merging, which can cause alarm storms in extreme cases. Many other monitoring tools lack this function as well; users can add it through secondary development so that related alarms are merged.
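One possible shape for such alarm merging is sketched below. It is only an assumption about how the secondary development might work, not the author's implementation: alarms are grouped per host into fixed time windows, and each group produces a single notification. The tuple layout and the 60-second window are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alarm = Tuple[int, str, str]  # (unix timestamp, host, message)


def merge_alarms(alarms: List[Alarm], window: int = 60) -> List[str]:
    """Group alarms per host into fixed time windows and emit one
    notification per (host, window) bucket instead of one per alarm."""
    buckets: Dict[Tuple[str, int], List[str]] = defaultdict(list)
    for ts, host, message in sorted(alarms):
        buckets[(host, ts // window)].append(message)

    notifications = []
    for (host, _), messages in buckets.items():
        if len(messages) == 1:
            notifications.append(f"{host}: {messages[0]}")
        else:
            joined = "; ".join(messages)
            notifications.append(f"{host}: {len(messages)} alarms: {joined}")
    return notifications
```

In an alarm storm, a host that raises dozens of alarms within one window would then produce a single summary message.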

II. Optimization of Zabbix

Many enterprises use Zabbix to monitor only 100 to 200 devices, yet after half a year performance becomes very poor: monitoring graphs take a long time to open, or do not open at all. This problem is common, and the main cause is that Zabbix was never properly planned and optimized. With proper optimization and architecture, Zabbix can easily monitor tens of thousands of devices.

1. Optimization of configuration file parameters

To monitor a large number of devices, tune the Zabbix parameters, including the number of processes, cache sizes, and timeouts, according to the actual monitoring workload, and disable monitoring modes that are not in use, such as VMware and Java.
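As a rough illustration, the small Python helper below renders such settings in the key=value format used by zabbix_server.conf. The parameter names are standard zabbix_server.conf settings, but every value is an illustrative assumption, not the author's configuration, and must be tuned against the real workload.

```python
def render_server_conf(settings: dict) -> str:
    """Render key=value lines in the format used by zabbix_server.conf."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


# All values below are assumptions chosen for illustration only.
EXAMPLE_SETTINGS = {
    "StartPollers": 100,          # processes polling passive items
    "StartTrappers": 50,          # processes receiving active agent data
    "StartDBSyncers": 8,          # processes writing history to the database
    "CacheSize": "512M",          # configuration cache
    "HistoryCacheSize": "256M",   # history cache
    "TrendCacheSize": "256M",     # trend cache
    "ValueCacheSize": "1G",       # value cache used by trigger evaluation
    "Timeout": 10,                # seconds to wait for agents and checks
    "StartVMwareCollectors": 0,   # 0 = VMware monitoring not started
    "StartJavaPollers": 0,        # 0 = Java (JMX) polling not started
}

print(render_server_conf(EXAMPLE_SETTINGS))
```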


2. Optimization of monitoring items and alarm items

The more items are monitored, the greater the load on the Zabbix database and on Zabbix itself. Keep only the items that are necessary, and drop those that do not help operations, to avoid wasting system resources. The most typical example is the Zabbix MySQL monitoring template from Percona, which is popular online. It has more than 200 items, including everything in SHOW GLOBAL STATUS. Many of these items mean little for operations, yet they weigh heavily on the performance of both the database and Zabbix. Once monitoring reaches a certain scale, performance becomes poor.

Prefer numeric items over character items. Character data takes up a large amount of storage in the database and makes triggers more troublesome to write, and Zabbix processes numbers more efficiently. If the business does need character items, collect them less often to reduce the processing load.

In trigger expressions, the functions last() and nodata() are the fastest, while min(), max(), and avg() are the slowest. Choose the faster functions where possible. Also make sure the trigger logic is correct, because incorrect logic can slow down database queries.

3. Optimization of Zabbix database

Partitioning the history tables is essential because it makes removing historical data easy. Zabbix's own deletion of historical data should also be turned off. Without partitioning and a deletion policy, Zabbix queries and any secondary development built on them slow down over time, eventually to the point where data cannot be queried at all. For details on table partitioning, see: https://www.zabbix.org/wiki/Docs/howto/mysql_partition.
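A minimal sketch of how daily partition maintenance might be scripted, assuming the table is already partitioned by RANGE on its clock column (a Unix timestamp). The helper names and the partition naming scheme are hypothetical, and day boundaries are computed in UTC, which may need adjusting to the server's time zone.

```python
from datetime import date, datetime, timedelta, timezone


def add_partition_sql(table: str, day: date) -> str:
    """Build an ALTER TABLE statement that adds one daily RANGE partition
    on the `clock` column, holding the rows for `day`."""
    next_day = day + timedelta(days=1)
    # Exclusive upper bound: midnight UTC at the start of the next day.
    upper = int(datetime(next_day.year, next_day.month, next_day.day,
                         tzinfo=timezone.utc).timestamp())
    return (f"ALTER TABLE `{table}` ADD PARTITION "
            f"(PARTITION p{day:%Y_%m_%d} VALUES LESS THAN ({upper}))")


def drop_partition_sql(table: str, day: date) -> str:
    """Build the statement that removes the whole partition for `day`."""
    return f"ALTER TABLE `{table}` DROP PARTITION p{day:%Y_%m_%d}"


print(add_partition_sql("history", date(2016, 8, 1)))
print(drop_partition_sql("history", date(2016, 7, 1)))
```

Dropping a partition removes a whole day of data in one metadata operation, which is why it is much cheaper than row-by-row deletion.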

The SQL statement that disables Zabbix's own deletion of historical data:

[Image: SQL statement]

The corresponding setting in the web interface is located at:

[Image: location of the setting in the web interface]

It is recommended to add an index on the time field of the history tables, since this field is used frequently in secondary development. It is also recommended to enable InnoDB compression for the history tables, as follows:

[Image: enabling InnoDB compression on the history tables]

For MySQL, Percona Server 5.6 is recommended, with thread_handling=pool-of-threads set to enable the thread pool. Other MySQL configuration parameters are not covered here; see the configuration file at this link: http://wangwei007.blog.51cto.com/68019/1623329.

4. Architecture optimization of Zabbix monitoring system

The main principle of Zabbix architecture optimization is to shift load from the server to the proxies, and from the proxies to the agents. The passive mode is adopted for monitoring items. Both the server and the proxies are made highly available so that no single point of failure can make monitoring unavailable. The Zabbix architecture and flow chart below is for reference only:

[Image: Zabbix architecture and flow chart]

III. Zabbix automation

The essence of operations automation is to free operations staff from repetitive work, simplify what remains, improve efficiency, and reduce human error. The basic principle is that anything that can be automated should never be done by hand; operations staff should cultivate the good habit of being "lazy". Automation rests on accurate basic information and on standardized rules for all configuration information.

1. Monitoring automation specification

A host naming convention lets a name convey its meaning: from the host name alone you can tell roughly which business the machine serves and what role it plays. When a problem occurs, operations staff can quickly gauge the scope and severity of the impact. A host name generally encodes equipment room information, service information, service registration, service role, and IP address.

A host group naming convention mainly makes it easy for people in the same business line or R&D team to view the monitoring of their own hosts and receive their alarms. It also makes it convenient to build group graphs in Zabbix for display on screens and for performance comparison. For example, a sharded database cluster can be defined as one host group with a summary graph, so that R&D can see at a glance whether the performance metrics are balanced across shards.

An alarm level specification determines who receives an alarm, how it is sent, and how it is escalated. The level and the monitoring item can also drive automatic handling: higher levels are handled first, while lower levels can be handled in batches.

A specification for suspending alarms during host maintenance. Alarms are very important, so suspending monitoring must be done with care. Zabbix's built-in Maintenance feature is not recommended, mainly because a host in maintenance is still shown on the monitoring home page; although it is marked, this makes the page less convenient for operations staff to read. Secondary development is recommended instead: by convention, the triggers of a host under maintenance are disabled and are re-enabled automatically when maintenance ends.
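A minimal sketch of the trigger-toggling convention above, assuming it is implemented through the Zabbix API's trigger.update method, where status 0 means enabled and 1 means disabled. The helper name is hypothetical, and passing the token in the auth field follows older API versions; newer versions may expect it in an HTTP header instead.

```python
import json


def trigger_status_request(trigger_id: str, enabled: bool,
                           auth_token: str, request_id: int = 1) -> str:
    """Build the JSON-RPC body that enables or disables one trigger.
    Sending it to the API endpoint is left to the caller."""
    payload = {
        "jsonrpc": "2.0",
        "method": "trigger.update",
        "params": {
            "triggerid": trigger_id,
            "status": 0 if enabled else 1,  # 0 = enabled, 1 = disabled
        },
        "auth": auth_token,  # assumption: token obtained earlier via login
        "id": request_id,
    }
    return json.dumps(payload)


# Disable a trigger when maintenance starts, re-enable it afterwards.
print(trigger_status_request("13938", enabled=False, auth_token="<token>"))
print(trigger_status_request("13938", enabled=True, auth_token="<token>"))
```

A maintenance tool built this way can also record who disabled which triggers and when they are due to be restored.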

Manually modifying host monitoring configuration is not recommended. Manual changes are easy to forget and inefficient, and they can leave settings and rules in disarray, leading to complicated follow-up problems and poor maintainability. Secondary development on Zabbix should record information such as the reason for each change and the period during which it takes effect. Adding, deleting, and changing monitoring should all be done automatically, with every specification enforced and carried out by programs.

2. Automation of deployment configuration

Deploying the Zabbix server and client is relatively simple, and many tutorials are available online. Script the whole deployment process and combine it with the CMDB to deploy in batches and add hosts to monitoring automatically. For the deployment process, see: http://wangwei007.blog.51cto.com/68019/1047953.

3. Daily operation and maintenance automation

Zabbix provides a rich API that can be called to configure Zabbix in a standardized way. The documentation for each version, covering the various Zabbix operations, is available at http://www.zabbix.com/documentation.php.

The API documentation also describes the data dictionary of the tables in the Zabbix database and what each field means. Secondary development and operations automation for Zabbix are mainly done by calling the Zabbix API and reading the Zabbix database. Writing directly to the original Zabbix tables is not recommended and is generally unnecessary. For an example of calling the API from Python, see: wangwei007.blog.51cto.com/68019/11399… .
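As a rough sketch of what an API call looks like, the code below builds a JSON-RPC 2.0 request and posts it to the API endpoint using only the Python standard library. The endpoint URL, credentials, and helper names are placeholders and assumptions; the login parameter names and the way the token is passed differ between Zabbix versions, so check the documentation for the version in use.

```python
import json
import urllib.request


def build_request(method: str, params: dict, auth: str = None,
                  request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 request body for the Zabbix API."""
    body = {"jsonrpc": "2.0", "method": method,
            "params": params, "id": request_id}
    if auth is not None:
        body["auth"] = auth  # token style used by older API versions
    return body


def call_api(url: str, body: dict):
    """POST the request and return the 'result' field of the reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json-rpc"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        reply = json.loads(response.read().decode("utf-8"))
    if "error" in reply:
        raise RuntimeError(reply["error"])
    return reply["result"]


# Hypothetical usage against a placeholder endpoint:
# url = "http://zabbix.example.com/api_jsonrpc.php"
# token = call_api(url, build_request(
#     "user.login", {"user": "api_user", "password": "secret"}))
# hosts = call_api(url, build_request(
#     "host.get", {"output": ["hostid", "host"]}, auth=token))
```

Keeping request construction separate from the HTTP call makes it easy to log or review every configuration change before it is sent.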

4. Automatic processing of alarm

A Zabbix action can be configured to run system commands, which can be used to handle specified alarms automatically when it is safe to do so. The settings are shown below:

[Image: action settings]

Users can catalog common alarms that have a fixed remedy and script that remedy so it runs automatically once a threshold is reached, for example cleaning up logs in a fixed location, so that the alarm recovers quickly.
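A minimal sketch of such a remedy script, assuming the fixed remedy is to delete log files older than a given age from one known directory. The function name, the `*.log` pattern, and the seven-day default are hypothetical; a real script should restrict itself to paths that are safe to clean.

```python
import time
from pathlib import Path


def clean_old_logs(log_dir: str, max_age_days: float = 7,
                   pattern: str = "*.log") -> list:
    """Delete files matching `pattern` directly under `log_dir` whose
    modification time is older than `max_age_days`; return their names."""
    cutoff = time.time() - max_age_days * 86400
    deleted = []
    for path in sorted(Path(log_dir).glob(pattern)):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            deleted.append(path.name)
    return deleted
```

An action could run a script like this when a disk-space trigger fires, and the returned list can be written to a log for later auditing.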

5. Use of LLD

LLD is a very useful Zabbix extension for monitoring hosts that run multiple instances of services such as MySQL and Redis, as well as ports, disk partitions, and network interfaces. Users can define custom discovery rules that automatically generate the specified items, triggers, graphs, and so on. This is flexible and greatly simplifies monitoring operations.
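A custom discovery rule is typically backed by a script that prints JSON listing the discovered entities. The sketch below shows that output for several MySQL instances on one host. The macro name `{#MYSQLPORT}` is a user-defined assumption, the port list is hard-coded for illustration, and the enclosing "data" key follows the format expected by older Zabbix versions.

```python
import json


def mysql_discovery(ports) -> str:
    """Return low-level discovery JSON with one entry per MySQL instance,
    exposing each port through the user-defined macro {#MYSQLPORT}."""
    return json.dumps({"data": [{"{#MYSQLPORT}": str(port)} for port in ports]})


# A real script would detect the ports, for example from running
# processes or configuration files, instead of hard-coding them.
print(mysql_discovery([3306, 3307]))
```

Item, trigger, and graph prototypes in the discovery rule can then reference `{#MYSQLPORT}`, so each instance gets its own set automatically.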

6. Secondary development of Zabbix

Secondary development on Zabbix focuses on further analysis of the monitoring data, making the most of that data to serve and guide operations better.

Detailed historical data is stored in the following tables, according to the type of data collected:

history, history_log, history_str, history_str_sync, history_sync, history_text, history_uint, history_uint_sync, events

Trend data is stored in the trends and trends_uint tables. It is calculated from the detailed monitoring data: for each monitoring item, the minimum, maximum, and average values are generated every hour.
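The hourly aggregation can be illustrated with a short sketch. It mirrors the idea only and is not Zabbix's own implementation; the input is assumed to be (clock, value) pairs for a single item, with clock as a Unix timestamp.

```python
from collections import defaultdict


def hourly_trends(samples):
    """Aggregate (clock, value) samples of one item into
    {hour_start: (min, avg, max)}, keyed by the start of each hour."""
    by_hour = defaultdict(list)
    for clock, value in samples:
        by_hour[clock - clock % 3600].append(value)
    return {hour: (min(values), sum(values) / len(values), max(values))
            for hour, values in sorted(by_hour.items())}


samples = [(3600, 1.0), (3660, 3.0), (7200, 5.0)]
print(hourly_trends(samples))
```

Because a trend row summarizes a whole hour, long-range reports can read the trend tables instead of scanning the much larger history tables.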

Historical data can be used to generate summary reports of performance parameters automatically, such as a top-100 list of the most heavily loaded performance metrics, helping operations staff to identify hidden risks. Analyzing the alarm history reveals which monitoring items alarm most often, so that their availability can be assessed.

Finally, I would like to thank our DBA team leader, Fan Jiangang, for his guidance and advice on this article, as well as for his constant attention to and support of MySQL monitoring.
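Finding the most frequently alarming items is a simple counting exercise once the alarm history has been exported. The sketch below assumes each alarm is represented as a (host, trigger name) pair; how those pairs are read from the events data is left out and would depend on the Zabbix version.

```python
from collections import Counter


def top_alarming(events, n: int = 100):
    """Return the n most frequent (host, trigger name) pairs with counts."""
    return Counter(events).most_common(n)


events = [("db1", "Slave lag"), ("db1", "Slave lag"), ("db2", "Disk full")]
print(top_alarming(events, n=1))
```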
