In recent years, as big data has remained popular, the related technologies have gradually matured, both in the open source community and in the in-house systems of large companies. Once we can reliably collect, process, and store big data, which real business scenarios can we apply these capabilities to so that the data becomes valuable?
At present, the most common practical applications of big data include advertisement recommendation, BI reports, user portrait analysis, log search, and machine learning. Aside from offline computing, real-time computing has another important application scenario: monitoring. Whether it is PV, UV, and transaction volume statistics or abnormal data alarms, the sooner the results arrive, the better.
Traditional single-dimension monitoring is already well handled in this data processing pipeline. However, more and more business data is multi-dimensional and multi-indicator, so multi-dimensional monitoring has become a new trend in business monitoring.
This article introduces in detail the design concept and overall architecture of Zhiyun multi-dimensional monitoring, as well as how its capabilities were used to quickly achieve multi-dimensional monitoring coverage of Tencent SNG business.
Zhiyun multi-dimensional monitoring originates from Hubble, the big data real-time processing platform of Tencent SNG, formerly known as the MM (Mobile Monitor) system. Around 2012, with the explosive growth of smartphones, most of the company’s products moved from PC to mobile. As the number of mobile users grew rapidly, monitoring the quality of the mobile side became more and more important.
The MM system came into being at this stage. It collects dimensions such as “region, operator, version, command word, SET, APN” and indicators such as “number of requests, success rate, average delay, packet sending, packet returning, and matching rate” from mobile terminals, and processes them through Storm for real-time monitoring, analysis, and alarms.
Dimensions: Used to describe data. They are the attributes and characteristics of a data object, such as carrier and version.
Metrics (also called indicators in this article): Used to measure data. They are the values of a data object, such as success rate and total count.
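To make the distinction concrete, here is a minimal sketch of one monitoring record split into dimensions and metrics. The class names, field names, and sample values are assumptions chosen for illustration; they are not the actual MM or Hubble schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimensions:
    """Attributes that describe a data object (field names are illustrative)."""
    region: str
    carrier: str
    version: str
    command: str


@dataclass
class Metrics:
    """Values that measure a data object (field names are illustrative)."""
    requests: int
    successes: int
    total_delay_ms: float

    @property
    def success_rate(self) -> float:
        # Derived metric: computed from raw counts, never stored directly.
        return self.successes / self.requests if self.requests else 0.0

    @property
    def avg_delay_ms(self) -> float:
        return self.total_delay_ms / self.requests if self.requests else 0.0


# One combination of dimension values identifies one set of metric values.
dims = Dimensions(region="Guangdong", carrier="CarrierA", version="1.2.0", command="login")
metrics = Metrics(requests=200, successes=190, total_delay_ms=24000.0)
report = {dims: metrics}
```

Because the dimensions are immutable and hashable, a dimension combination can serve directly as the lookup key for its metrics, which mirrors how multi-dimensional data is grouped.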
Although the MM system met the requirements of mobile terminal monitoring for a period of time, services diversified and personalized requirements increased, so the scenarios the MM system could cover became increasingly limited.
The limitations were mainly reflected in the following aspects:
• Single data source. Only data from a single Tencent system could be accessed, and other data sources could not be connected;
• Fixed real-time processing logic. Only fixed dimensions and indicators could be cleaned and aggregated in a fixed way, and custom processing logic was not supported;
• Outdated technology stack. The system still used an early Storm version and Impala, could not keep up with the exponential growth of business data, and often suffered from data loss and slow queries;
• Poor scalability and O&M. Hardcoded values were everywhere in the code, many functions were not abstracted into configuration items, and the expansion steps were complicated.
To address these problems, the MM system was rebuilt and officially renamed Hubble.
The design focused on the following aspects:
• Backward compatibility with the original functions. The new multi-dimensional monitoring and analysis capability must not affect the normal use of services that are already connected;
• Generalization. Commonly used real-time processing functions are encapsulated into components that can be combined like building blocks;
• Lower the barrier to entry. The data processing flow is encapsulated into interface configuration, so access personnel do not need to write Storm code;
• Optimize the architecture. Upgrading the back-end architecture improves data accuracy and query speed and reduces data link delay;
• Optimize the experience. Unify the interface style, improve the interaction design, and provide friendly error prompts.
Therefore, the first problem Zhiyun multi-dimensional monitoring had to solve was how to let users who cannot write code generate Storm topologies that match their processing requirements.
Let’s take a look at the three most common Storm development modes:
1. Java. The highest barrier but the most flexible: logic is implemented through the various Spout and Bolt interfaces provided by Storm;
2. Pig On Storm. Developing with Pig is like wrapping a higher-level language on top of Java, which lowers the barrier to development;
3. SQL On Storm. SQL is more widely known and easier to use than Pig.
But none of these three methods gets rid of programming; there is always a development and debugging process. Why not create a system that generates Storm topologies through interface configuration alone? The technical implementation is no more complex than the three methods above. Most teams do not do this because interface configuration is so heavily abstracted and encapsulated that it loses some flexibility and cannot support every business scenario.
Hubble’s goal was to stay compatible with the MM system and to meet most new business scenarios, so it struck a balance between ease of use and generality and created a Storm development environment driven entirely by interface configuration. Thus the data processing plant of Zhiyun multi-dimensional monitoring was born.
First, the common big data analysis scenarios were classified and summarized.
Then these operations were encapsulated as configuration items on the interface:
• Filtering: 1. Ensure data integrity and validity, for example no null or dirty values; 2. Filter out redundant information by an identifying field, for example keeping only records whose plateForm is Android.
• Formatting: 1. Time format conversion; 2. Data type conversion; 3. URL encoding and decoding.
• Translation: 1. Internal and external IP information translation; 2. Database table dictionary translation; 3. Splitting by delimiter; 4. The four arithmetic operations; 5. UDF translation.
• Forwarding: 1. Forward to the DC channel of SNG; 2. Forward to CDB; 3. Forward to Kafka.
• Grouping: Configure the statistical period, time field, and grouping fields.
• Aggregation statistics (with support for filtering on aggregates): 1. Count (PV) and distinct count (UV); 2. Maximum and minimum; 3. First value and last value; 4. Sum and average.
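The operations listed above can be chained like building blocks. The following is a minimal sketch in plain Python, not generated Storm code: every function name, the record fields other than plateForm, and the carrier dictionary are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone


def is_valid(record, required):
    """Filtering: reject records with missing or empty required fields."""
    return all(record.get(f) not in (None, "") for f in required)


def format_record(record):
    """Formatting: convert epoch seconds into a minute-level time string."""
    out = dict(record)
    moment = datetime.fromtimestamp(record["ts"], tz=timezone.utc)
    out["minute"] = moment.strftime("%Y-%m-%d %H:%M")
    return out


def translate(record, carrier_names):
    """Translation: map a carrier id to a readable name via a dictionary."""
    out = dict(record)
    out["carrier"] = carrier_names.get(record["carrier_id"], "unknown")
    return out


def aggregate(records, group_fields):
    """Grouping and aggregation: PV (count) and UV (distinct users) per group."""
    groups = defaultdict(lambda: {"pv": 0, "users": set()})
    for r in records:
        key = tuple(r[f] for f in group_fields)
        groups[key]["pv"] += 1
        groups[key]["users"].add(r["uid"])
    return {k: {"pv": v["pv"], "uv": len(v["users"])} for k, v in groups.items()}


raw = [
    {"ts": 60, "uid": "u1", "plateForm": "Android", "carrier_id": 1},
    {"ts": 70, "uid": "u1", "plateForm": "Android", "carrier_id": 1},
    {"ts": 80, "uid": "u2", "plateForm": "Android", "carrier_id": 2},
    {"ts": 90, "uid": "u3", "plateForm": "iOS", "carrier_id": 1},
    {"ts": 95, "uid": None, "plateForm": "Android", "carrier_id": 1},
]
carrier_names = {1: "CarrierA", 2: "CarrierB"}  # hypothetical dictionary table

valid = [r for r in raw if is_valid(r, ["ts", "uid", "plateForm", "carrier_id"])]
android = [r for r in valid if r["plateForm"] == "Android"]
prepared = [translate(format_record(r), carrier_names) for r in android]
result = aggregate(prepared, ["minute", "carrier"])
```

Each stage takes and returns plain records, so stages can be reordered or omitted per business, which is the property the interface configuration relies on.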
With the functions above, users can perform a variety of data processing in Zhiyun multi-dimensional monitoring, which results in a Storm topology similar to the one shown below.
This step is only a preliminary aggregation of the data. For example, if the statistical period is one minute, Storm groups and aggregates the data received within a one-minute sliding time window and writes the result to the corresponding OLAP storage. The result data finally displayed on the interface still requires a second aggregation at query time. The whole process is as follows:
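The two aggregation stages can be sketched as follows. This is an illustration under stated assumptions: fixed one-minute buckets stand in for the time window, an in-memory dictionary stands in for the OLAP storage, and the field names are hypothetical.

```python
from collections import defaultdict


def pre_aggregate(events, period_s=60):
    """Stage 1: per period and dimension combination, keep additive counters."""
    rows = defaultdict(lambda: {"requests": 0, "successes": 0})
    for e in events:
        bucket = e["ts"] - e["ts"] % period_s  # start of the period
        key = (bucket, e["carrier"], e["version"])
        rows[key]["requests"] += 1
        rows[key]["successes"] += 1 if e["ok"] else 0
    return dict(rows)


def roll_up_by_carrier(rows):
    """Stage 2: at query time, merge stored rows across time and version."""
    totals = defaultdict(lambda: {"requests": 0, "successes": 0})
    for (_bucket, carrier, _version), counters in rows.items():
        totals[carrier]["requests"] += counters["requests"]
        totals[carrier]["successes"] += counters["successes"]
    # The success rate is derived from the summed counters, not averaged.
    return {
        carrier: {**c, "success_rate": c["successes"] / c["requests"]}
        for carrier, c in totals.items()
    }


events = [
    {"ts": 5, "carrier": "A", "version": "1.0", "ok": True},
    {"ts": 20, "carrier": "A", "version": "1.0", "ok": False},
    {"ts": 65, "carrier": "A", "version": "2.0", "ok": True},
    {"ts": 70, "carrier": "B", "version": "1.0", "ok": True},
]
rows = pre_aggregate(events)
by_carrier = roll_up_by_carrier(rows)
```

Storing additive counters in the first stage is a deliberate choice in this sketch: ratios such as success rate cannot be averaged across rows correctly, but they can always be recomputed from summed requests and successes.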
Druid is currently the main OLAP storage engine recommended by Zhiyun multi-dimensional monitoring. Druid is a time-series database that works well in multi-dimensional analysis scenarios and makes up for the query speed and concurrency shortcomings of the earlier Impala setup. In addition, traditional databases such as PostgreSQL and MySQL are preferred for small volumes of service data to save server resources, and business data that requires full-text retrieval is stored in Elasticsearch (ES).
Overall structure of data processing plant:
Characteristics of data processing plants:
Through the data processing plant, all kinds of business data can be connected to Zhiyun multi-dimensional monitoring for processing and then flow into the application ecosystem.
The two most important applications in this ecosystem are multi-dimensional drill-down analysis and multi-dimensional alarms. Once business data is connected to Zhiyun multi-dimensional monitoring, a default page is generated under the multi-dimensional analysis menu.
This page is divided into four parts: service tree navigation, dimension filter conditions, indicator trend chart, and data analysis. Dimensions can be filtered and translated again, and indicators support more complex compound calculations and formatting.
The following is a typical example of locating a service fault:
Step 1: Find a dip in the indicator trend chart and click the corresponding time point to enter drill-down analysis;
Step 2: Sort by number of requests and look for AppIDs with a high number of requests and a low success rate to identify the abnormal AppID, then click to drill down into a second dimension;
Step 3: Continue drilling down through the command word and return code dimensions to find the abnormal values. The conclusion is that for the service with AppID A, command word B, and return code C, the number of requests increased sharply, which caused the success rate to decline.
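The three steps above can be sketched as repeated group-and-sort operations over pre-aggregated rows. The function name, the row layout, and the sample numbers are assumptions for illustration only and do not reflect how the platform queries its OLAP storage.

```python
def drill_down(rows, dimension, filters=None):
    """Group rows by one dimension, restricted to the values already chosen."""
    filters = filters or {}
    totals = {}
    for r in rows:
        if any(r[k] != v for k, v in filters.items()):
            continue
        t = totals.setdefault(r[dimension], {"requests": 0, "successes": 0})
        t["requests"] += r["requests"]
        t["successes"] += r["successes"]
    result = [
        {
            "value": value,
            "requests": t["requests"],
            "success_rate": t["successes"] / t["requests"],
        }
        for value, t in totals.items()
        if t["requests"] > 0
    ]
    # Sort by number of requests, largest first, as in Step 2.
    result.sort(key=lambda item: item["requests"], reverse=True)
    return result


rows = [
    {"appid": "A", "command": "B", "return_code": "C", "requests": 900, "successes": 0},
    {"appid": "A", "command": "B", "return_code": "0", "requests": 100, "successes": 100},
    {"appid": "A", "command": "X", "return_code": "0", "requests": 50, "successes": 50},
    {"appid": "Z", "command": "B", "return_code": "0", "requests": 200, "successes": 198},
]

by_appid = drill_down(rows, "appid")
by_command = drill_down(rows, "command", {"appid": "A"})
by_code = drill_down(rows, "return_code", {"appid": "A", "command": "B"})
```

Each call narrows the data with the values selected so far, so the top entry of each result is the candidate to drill into next.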
Hubble multi-dimensional drill-down analysis is mainly used to help developers locate which combination of fine-grained dimensions a business anomaly occurs under. It is also often used to analyze data such as the share of users on each major carrier and the ratio of iOS to Android users.
An analysis interface alone is not enough. When indicators become abnormal, users want to receive an alarm notification immediately, so Zhiyun multi-dimensional monitoring also supports multi-dimensional alarms. Traditional single-dimension monitoring is a single curve with time on the X-axis and the value on the Y-axis, whereas multi-dimensional monitoring is much more complicated to design. For example, when dimension 1 is A and dimension 2 is B, an alarm is generated if indicator 1 is lower than 90%; for each combination of dimensions 3 and 4, an alarm is generated if indicator 2 is lower than 80%. Therefore, the overall page configuration is as follows:
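The two example rules can be expressed as data plus a small matcher. This is a minimal sketch under assumptions: the AlarmRule class, its fields, and the convention that None means "any value, evaluated per combination" are hypothetical and are not the platform's configuration format.

```python
from dataclasses import dataclass


@dataclass
class AlarmRule:
    name: str
    dimensions: dict  # dimension -> required value, or None for any value
    metric: str
    threshold: float  # alarm when the metric falls below this value

    def fires(self, row):
        for dim, wanted in self.dimensions.items():
            if dim not in row:
                return False  # the rule does not apply to this row
            if wanted is not None and row[dim] != wanted:
                return False
        value = row.get(self.metric)
        return value is not None and value < self.threshold


def evaluate(rules, rows):
    """Return (rule name, row) pairs for every rule that fires on a row."""
    return [(rule.name, row) for rule in rules for row in rows if rule.fires(row)]


rules = [
    AlarmRule("rule1", {"dim1": "A", "dim2": "B"}, "indicator1", 0.90),
    AlarmRule("rule2", {"dim3": None, "dim4": None}, "indicator2", 0.80),
]
rows = [
    {"dim1": "A", "dim2": "B", "indicator1": 0.85},
    {"dim1": "A", "dim2": "X", "indicator1": 0.50},
    {"dim3": "P", "dim4": "Q", "indicator2": 0.75},
    {"dim3": "P", "dim4": "R", "indicator2": 0.95},
]
alarms = evaluate(rules, rows)
```

Here the first rule fires only for the exact combination A and B, while the second is checked against every combination of dimensions 3 and 4, so a single rule covers what would otherwise be many separate single-dimension alarm items.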
First configure the alarm rules and conditions, then configure the subscription rules, including subscription conditions and notification methods. Convergence rules and filtering rules can also be configured if necessary. When an alarm is generated, the alarm itself is already the result of an analysis, so after receiving it, development or O&M staff can resolve the problem quickly without a separate fault-locating process.
With multi-dimensional drill-down analysis and multi-dimensional alarms, businesses can migrate their alarms from traditional monitoring systems to Zhiyun.
In fact, we sorted out common service scenarios. Dimensions and indicators on the client and server are as follows. In traditional single-dimension monitoring, the number of alarm items required is the number of indicators multiplied by the number of values of each dimension, whereas in Zhiyun multi-dimensional monitoring only a single data entry is needed. With the multi-dimensional monitoring capability of Zhiyun, the monitoring granularity can be broadened from an IP to a microservice, and from one monitoring item per indicator to one dashboard per microservice.
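A quick calculation shows how fast the single-dimension approach grows. The dimension names and cardinalities below are made-up numbers for illustration, not measurements from any Tencent service.

```python
from math import prod


def single_dimension_items(cardinalities, n_indicators):
    """Alarm items needed when every dimension combination is its own item."""
    return n_indicators * prod(cardinalities.values())


# Hypothetical service: number of distinct values per dimension.
cardinalities = {"carrier": 4, "version": 10, "command": 50}
n_indicators = 3  # for example requests, success rate, average delay

traditional_items = single_dimension_items(cardinalities, n_indicators)
multi_dimensional_items = 1  # one connected data entry, per the text above
```

With these assumed numbers the traditional approach needs 3 × 4 × 10 × 50 = 6000 items, and adding one more dimension multiplies that figure again.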
We have also made good progress in the field of machine learning. For example, for the manual multi-dimensional analysis case above, we have implemented a “multi-dimensional root cause analysis algorithm” that recommends the abnormal combination of dimensions. Alarms also no longer need manually set thresholds: the system learns outliers from historical data and models, and uses them to generate and converge alarms.
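The article does not describe the model used, so the following sketch substitutes a deliberately simple technique, a z-score test against historical values, only to illustrate what alarming without a hand-set threshold can look like. The function name, the parameter k, and the sample data are assumptions.

```python
from statistics import mean, pstdev


def is_outlier(history, value, k=3.0):
    """Flag a value that deviates more than k standard deviations from history."""
    center = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # Perfectly flat history: any different value counts as an outlier.
        return value != center
    return abs(value - center) > k * spread


# Hypothetical success rates observed at the same time on previous days.
history = [0.99, 0.98, 0.99, 0.97, 0.98, 0.99]

normal_point = is_outlier(history, 0.98)
abnormal_point = is_outlier(history, 0.80)
```

The acceptable range here is derived from the data itself, so a stable indicator gets a tight band and a noisy one gets a wide band without anyone configuring a fixed percentage.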
At present, more than 200 services within Tencent use multi-dimensional monitoring, and the server scale has exceeded 1,000. Zhiyun multi-dimensional monitoring has also been delivered to customers, and inquiries are welcome.