This article was written in 2019; the original link has since gone 404, so it is republished here. Decisions come from data, data comes from collection, and collection comes from sorting out requirements and rules; all of it happens because of the creativity and execution of engineers. This article runs about 5,000 words and is mainly an overview, with a technical depth suited to intermediate front-end engineers. It takes about 10 minutes to read.
Background: it had been nearly 5 years since Xiaocai launched its first app in 2014. Although the technical department had 80 people, 20 of them front end, we still knew nothing about how users were using our 8 apps, 4 mini programs, 6 H5 mall systems, and 10+ PC CRM/IM/ERP/TMS mid-platform operation systems: online exceptions, device distribution, and the PV/UV conversion of marketing activities were all invisible. Working blind like this is obviously not how an Internet company operates, and it also violates the front end’s technical values of “tools are king, efficiency first”.
Historical issues and pain points:
- Neither our toB nor our toC products collected multi-end data
- Our business spans cross-end products: RN apps, PC Web (React single-page and multi-page applications), H5, and mini programs
- Tracking points were scattered, undefined, and unclassified
- Alibaba has SPM for position tracking, SCM for resource tracking, and the “golden token” events for exposure and interaction; we had none of these
- Online user exceptions could not be traced (for B-end users, availability and stability are hard requirements)
- For example, we had no grip on the overall quality picture: whether an error came from the front end or the back end was judged by experience, manual log reading, and users’ active feedback
- User access behaviors, device characteristics, and application performance were invisible to us
- For example, we could not determine peak activity hours (to avoid releasing during them), OS/device distribution (for compatibility work), or which pages were slow (for optimization)
- The effect of business data could not be tracked
- For example, the conversion of marketing campaigns and the time spent in payment flows could not inform business decisions
In fact, there were many more pain points. To keep ourselves from flying blind, we decided to build a monitoring and tracking toolchain for our front-end products, whether through community solutions, paid services, or self-development. Over recent years we made several attempts, passing through three stages in total:
Only this year did we finally settle on stage three: community open source plus complete self-development. We dropped almost all paid products and put one front-end engineer on it full time; from kickoff to launch the whole project took about 60 person-days (yes, the Xiaocai front end is that hardcore).
The monitoring system introduced here belongs to the quality evaluation and tracking system of the Xiaocai front end. For the client quality system itself, I have planned two articles that are not yet written:
- How to Build a Front-end Quality Tracking Platform from 0 to 1 (Part 1)
- How to Build a Front-end Quality Tracking Platform from 0 to 1 (Part 2)
In this article we do not discuss the quality system; we focus only on the implementation of monitoring and tracking.
What the system looks like
To give you an intuitive sense of what we built and how much of it, and to make the technical selection judgments behind it easier to follow, here are the data flow diagram, product function diagram, and system architecture diagram up front, to form a first impression and raise questions.
From the moment a user visits a page to the moment a decision is made based on the data, the data flows step by step through the core stages of collection, forwarding, decryption, persistence, and processing:
From a product perspective, the whole pipeline, from collection to tracking, divides into the following functional modules:
At the implementation level, data comes in layer by layer, and each layer’s tasks are taken over and processed by a corresponding system. The technical architecture of the whole system looks like this:
In terms of core systems and modules, it breaks down roughly as follows:
- Design and implementation of the collection SDKs (PC/H5/Mini Program/RN/Node.js)
- Design and implementation of the data forwarder DataTransfer
- The monitoring console Dashboard, front end and back end
- Design and implementation of the task controller Controller
- Design and implementation of the task executor Inspector
Limited by space, we only introduce the design of each system without going into technical details. If you want more detail, or higher-resolution architecture diagrams, you can add me on WeChat: Codingdreamer.
There are a lot of terms in the diagrams, so a little background reading may help as a warm-up. I’ll briefly describe how they relate in plain language before we get into each system. The relationships are:
- Data flow: after data is collected on the page, it is sent through Nginx to a data forwarding service, which mindlessly throws rows of access logs at Kafka, the message hub; Kafka passes them on to ELK, a dedicated log processing suite that provides rich log query capabilities; the data is then sunk into the data warehouse according to our rules, where product and business colleagues consume it through the report platform’s SQL-table-chart production and presentation tools.
- Data management: developers also need a monitoring console that displays real-time data from ES and generates tasks based on abnormal data. This business logic lives in the console, which notifies the task controller through Kafka; the controller handles the purer task management, delegating the finer-grained execution to the executors via Kafka and writing task results back.
- Data persistence: all logs are stored in ES, all issues in MySQL, and the tracking state of issues in Redis as a task stack; meanwhile, all rule-sorted data sinks into the data warehouse and is finally surfaced on the report platform.
The client SDK
Data comes from collection, and collection comes from sorting out requirements and rules: what data must be collected on the client, and for what purpose. Taking PC/H5 as an example, the data to collect falls into the following categories (a sketch of a possible report payload follows the list):
- User data
- Basic information, such as anonymous or non-anonymous user IDs and IP addresses
- Device information
- For example, operating system type and version
- For example, browser type and version
- For example, App version number
- Behavioral data
- For example, the user’s traffic source
- For example, the user’s access path
- For example, clicks, scrolled areas, and so on
- Performance data
- For example, script load time, API response time, etc.
- Exception data
- For example, front-end script loading errors and script runtime errors
- For example, back-end API request timeouts, abnormal returned data, or parameter contract errors
- Custom data
- For example, access data for special blocks on a campaign page, form submissions, invitation sources, etc.
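As a loose illustration, these categories can be mapped onto a single report envelope. The sketch below is hypothetical; none of the field names come from our actual schema:

```typescript
// Hypothetical report envelope; every field name here is illustrative.
interface ReportEvent {
  app: string;                                // application id
  sessionId: string;                          // one per page visit
  user?: { id?: string; anonymousId?: string; ip?: string };
  device?: {
    os: string;
    osVersion: string;
    browser?: string;
    browserVersion?: string;
    appVersion?: string;
  };
  type: 'behavior' | 'performance' | 'error' | 'api' | 'custom';
  payload: Record<string, unknown>;           // category-specific fields
  timestamp: number;
}
```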
Sorting out these data requirements and mapping them onto the browser host environment, the SDK needs to implement the following functions:
- API request monitoring
- Log breadcrumbs
- Performance reporting
- Custom event reporting
- SPA/MPA route/page-switch recording and reporting
- Error capture (onerror/onunhandledrejection)
- UI event reporting (manually enabled, for error backtracking and page heat maps)
None of this is technically difficult: it mainly means listening for specific browser events (onerror, onunhandledrejection, onpopstate, etc.) and wrapping certain methods of global objects (such as XHR’s send) in an AOP style, collecting the necessary data without getting in the developer’s way.
When wrapping functions or objects, take care to avoid wrapping them twice. What really takes time is compatibility debugging, especially on mobile, where there are real difficulties: the bare “Script error.” frequently reported on iOS devices, fingerprint loss, and the low delivery rate of sendBeacon reports on iOS, for example.
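A minimal sketch of this AOP-style wrapping, assuming a generic report callback (everything here is illustrative rather than our SDK’s actual code):

```typescript
// Sketch: wrap XHR's send and listen for global errors; illustrative only.
const WRAPPED = '__monitor_wrapped__';

type Report = (event: Record<string, unknown>) => void;

export function instrument(report: Report): void {
  const proto = XMLHttpRequest.prototype as any;
  if (!proto.send[WRAPPED]) {
    // Guard flag prevents double wrapping if instrument() runs twice
    const originalSend = proto.send;
    proto.send = function (this: XMLHttpRequest, ...args: unknown[]) {
      const start = Date.now();
      this.addEventListener('loadend', () => {
        report({
          type: 'api',
          url: this.responseURL,
          status: this.status,
          duration: Date.now() - start,
        });
      });
      return originalSend.apply(this, args);
    };
    proto.send[WRAPPED] = true;
  }

  // Script errors and unhandled promise rejections
  window.addEventListener('error', (e) =>
    report({ type: 'error', message: e.message, source: e.filename, line: e.lineno }),
  );
  window.addEventListener('unhandledrejection', (e) =>
    report({ type: 'error', message: String(e.reason) }),
  );
}
```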
Since multiple SDKs need to be managed centrally, the SDK project is built as a monorepo, written in TypeScript, with Rollup (plus tsc for some of the packages) as the bundler.
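A build of this shape might use a Rollup config like the following (a sketch; the entry point, output names, and formats are assumptions):

```typescript
// rollup.config.ts (sketch): bundle one SDK package as ESM and UMD.
import typescript from '@rollup/plugin-typescript';

export default {
  input: 'src/index.ts',
  output: [
    { file: 'dist/monitor.esm.js', format: 'es' },
    { file: 'dist/monitor.umd.js', format: 'umd', name: 'Monitor' },
  ],
  plugins: [typescript()],
};
```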
Data forwarder DataTransfer
Module diagram
After data is collected on the client, it is reported to the server according to certain policies. Nginx attaches the IP information and hands the data to the forwarding service, which transfers it straight to Kafka, from where it is synchronized into ELK. Here is how DataTransfer relates to the other components in the overall architecture:
Module functions
The data forwarder is a Node service that acts as a data porter, forwarding data to the Kafka cluster; along the way it also does extra work such as decrypting and validating the data and adding required fields.
Module implementation
When forwarding to Kafka fails, the data is written to a local log and shipped to Logstash by Filebeat. Why this degradation path? Our ES and Kafka clusters were going through Alibaba Cloud’s migration from the classic network to a VPC, and the transition period was unstable for a few months, so we walk on two legs: Kafka gets priority, and when Kafka is unstable, Filebeat takes over and pushes the data to Logstash. Each transfer instance has a backup Filebeat alongside it; this is a transitional solution.
In fact, OpenResty would fit the Nginx + transfer role better here, with good performance. We didn’t adopt it because the whole facility is built and maintained by the front-end team, and a transfer service built with Eggjs avoids introducing an extra language stack that would raise maintenance costs later. So we went with Nginx + Node transfer (+ Filebeat).
As for implementation, we again use TS, with Eggjs as the service framework. For Eggjs we wrote an egg-kafka plugin for sending and receiving Kafka messages, which was one of the side gains of this project.
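For illustration, here is a stripped-down sketch of the forward-with-degradation logic, using kafkajs directly rather than our egg-kafka plugin; the topic name and backup path are made up:

```typescript
import { appendFileSync } from 'fs';
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'data-transfer', brokers: ['kafka:9092'] });
const producer = kafka.producer();
const ready = producer.connect();

// Enrich the record, try Kafka first, and fall back to a local file on failure.
export async function forward(rawLog: string, clientIp: string): Promise<void> {
  const record = JSON.stringify({
    ...JSON.parse(rawLog),   // decryption and validation omitted in this sketch
    ip: clientIp,            // attached at the Nginx layer, passed through
    receivedAt: Date.now(),
  });
  try {
    await ready;
    await producer.send({ topic: 'frontend-logs', messages: [{ value: record }] });
  } catch {
    // Degradation path: Filebeat tails this file and ships it to Logstash
    appendFileSync('/var/log/transfer/backup.log', record + '\n');
  }
}
```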
The monitoring console Dashboard (front end and back end)
Module diagram
If data does not feed into the team’s decision-making, collecting it loses its value. Decision-making here is mostly the aggregated display of information plus the creation, update, and review of various tasks, and that role is played by the monitoring console’s Dashboard.
The relationship between the Dashboard and the other modules in the overall architecture is shown below:
Why not use Kibana
Of the three ELK components, Kibana is the most pleasant to use, so why not use it? In fact we do use it, but only for a few small scenarios where its flexibility helps. Our business attributes, such as user identity, issue progress, task tracking, future linkage between exceptions and repository branches, and quality report evaluation, are too strong, and Kibana is clearly not flexible enough for us to extend in those directions, so we didn’t build the Dashboard on it.
Dashboard data sources
The Dashboard has two main data sources: ES and Alibaba Cloud RDS (MySQL).
Query Elasticsearch logs
ES plays the role of the log database and provides the query capability, while the Dashboard simply pulls real-time data out of ES: the number of exceptions in the last 15 minutes, an application’s PV/UV so far today, the distribution of device types accessing it, and so on. Below is the overview dashboard of one application:
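A widget like “exceptions in the last 15 minutes” boils down to a simple ES count query. A sketch with the v7-style JavaScript client, where the index and field names are assumptions:

```typescript
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://elasticsearch:9200' });

// Count error events for one application over the last 15 minutes.
async function recentErrorCount(appId: string): Promise<number> {
  const { body } = await es.count({
    index: 'frontend-monitor-*',
    body: {
      query: {
        bool: {
          filter: [
            { term: { app: appId } },
            { term: { type: 'error' } },
            { range: { timestamp: { gte: 'now-15m' } } },
          ],
        },
      },
    },
  });
  return body.count;
}
```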
Or an application’s exceptions:
For each specific exception, we can click Details to see the corresponding exception detail and stack, the device and vendor, and the user’s ID, IP, browser version, page, time, and location, down to finer-grained data such as the user’s role. Combined with exception replay and sourcemaps, this makes it easy to follow up and fix the problem at the first opportunity.
You can even view the errors of a single API request:
MySQL persists business data
The data stored in MySQL is the issue information that the Controller (described in the next section) produces by classifying and abstracting the raw error data stored in ES, plus the rules developers define to monitor a particular issue and the corresponding task information.
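Roughly, the shapes involved look like this (a sketch; our actual table design differs, and every field name below is illustrative):

```typescript
// Illustrative issue row persisted in MySQL.
type IssueStatus = 'open' | 'assigned' | 'resolved' | 'reopened' | 'ignored';

interface Issue {
  id: number;
  fingerprint: string;     // hash that groups raw ES errors into one issue
  title: string;
  app: string;
  side: 'frontend' | 'backend';
  status: IssueStatus;
  assignee?: string;
  firstSeenAt: Date;
  lastSeenAt: Date;
  occurrences: number;
}

// Illustrative alarm rule a developer can attach to an issue.
interface AlarmRule {
  issueId: number;
  threshold: number;       // e.g. 100 occurrences...
  windowSeconds: number;   // ...per 60 seconds
  notifyChannel: string;   // e.g. a DingTalk group webhook
}
```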
Here is our abstraction of an issue:
For each issue, whether it belongs to the front end or the back end, who has followed up and fixed it, and how the fix went can all be tracked:
At the same time, we group a user’s series of actions under the concept of a session. When data is reported, a sessionId is generated and later surfaced in the issue details, so that when developers debug an error they can see what the user actually did and solve the problem faster:
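Generating the sessionId itself is trivial; a sketch (the fallback branch covers older WebViews without crypto.randomUUID):

```typescript
// One sessionId per page visit, attached to every report from that visit.
const sessionId: string =
  typeof crypto !== 'undefined' && 'randomUUID' in crypto
    ? crypto.randomUUID()
    : `${Date.now()}-${Math.random().toString(36).slice(2, 10)}`;
```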
At present this page is still rather rough; after all, we have no dedicated designer.
The rough life cycle of an issue is as follows, with most of the state-changing actions performed by developers on the Dashboard:
Viewing the front end and back end together helps solve problems faster, and with the exception recognition and judgment mechanism, exceptions can be pushed proactively to the front-end and back-end DingTalk groups; a recurring problem will even @ the people involved. We will come back to issue judgment later:
Generating alarm task information
When developers need to give an issue special “care”, they can set alarm rules for it, and the Controller described below will generate alarm tasks according to those rules. Here is a simple alarm task:
We can also see the task in action:
The console has many other practical functions besides, such as issue assignment, classification, update streams, alarm tasks, error message backtracking, and weekly quality reports.
Technical implementation of the Dashboard
On the back end, the project is built on Cross, our Eggjs-based wrapper, which integrates Kafka, Studs, and so on; Kafka is used to communicate with the other modules. The front end uses AntDesign Pro (UmiJS/DVA/BizCharts); given how changeable and business-heavy this part is, we skipped TS here.
Task controller Controller
Module diagram
The Dashboard’s biggest job, as mentioned above, is to consume, modify, and manage issue data, while the Controller is what abstracts that issue data out of the raw data for the Dashboard to consume. The Controller does more than that, and we expand on it below; first, its relationship to the other modules in the overall architecture, to give you a general sense of its role:
Functions of the Controller
The Controller plays a crucial role in the whole architecture, mainly in the following respects:
- Discovering online bugs, generating new issues, classifying them, and sending warnings
- Warning when a bug fires too many times within a window (e.g. 100 times per minute)
- Notifying an issue’s owner when an issue marked as resolved occurs again
- Generating quality reports so that development leads have a more quantitative picture of client-side quality
- Not executing tasks itself: the Controller is only responsible for sending messages via Kafka to the Inspectors, the task executors described below
Technical implementation of the Controller
The task controller is a pure back-end project that still communicates with the other modules via Kafka; because its functionality is explicit, it is written in TypeScript. The back-end framework is again our team’s own Eggjs-based Cross, and the alarm task queue is implemented with Redis.
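As a hypothetical sketch of the rule-checking loop, reusing the AlarmRule shape from earlier (the queue and topic names are made up, and the ES count query is stubbed out):

```typescript
import Redis from 'ioredis';
import { Kafka } from 'kafkajs';

interface AlarmRule {
  issueId: number;
  threshold: number;
  windowSeconds: number;
}

const redis = new Redis();
const kafka = new Kafka({ clientId: 'controller', brokers: ['kafka:9092'] });
const producer = kafka.producer();
const ready = producer.connect();

// Stub: how often the issue fired in the window (ES count query omitted).
declare function recentOccurrences(issueId: number, windowSeconds: number): Promise<number>;

async function checkRule(rule: AlarmRule): Promise<void> {
  const count = await recentOccurrences(rule.issueId, rule.windowSeconds);
  if (count < rule.threshold) return;  // e.g. fewer than 100 hits in 1 minute
  const task = JSON.stringify({ issueId: rule.issueId, count, firedAt: Date.now() });
  await redis.lpush('alarm-task-queue', task);  // the Redis task stack
  await ready;
  await producer.send({ topic: 'inspector-tasks', messages: [{ value: task }] });
}
```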
Task executor Inspector
The role of the executor
As the task controller’s little brother, the task executor Inspector does something very pure: it parses the task sent from the Controller, queries the raw data, and finally returns the result. Since there may be many tasks, we designed it to run as multiple instances, taking advantage of Kafka’s unicast delivery (each message goes to exactly one consumer in a group) to execute tasks as fast as possible.
Implementation of the executor
The executor is likewise developed in TS on Eggjs, and is relatively simple.
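The fan-out works because multiple Inspector instances share one Kafka consumer group, so each task is delivered to exactly one instance. A sketch with kafkajs (the topic and group names are assumptions):

```typescript
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'inspector', brokers: ['kafka:9092'] });
const consumer = kafka.consumer({ groupId: 'inspectors' });

async function run(): Promise<void> {
  await consumer.connect();
  await consumer.subscribe({ topic: 'inspector-tasks' });
  await consumer.run({
    // Each message is consumed by exactly one instance in the group
    eachMessage: async ({ message }) => {
      const task = JSON.parse(message.value!.toString());
      // Query the raw logs in ES for this task, then write the result back
      // (details omitted in this sketch).
      void task;
    },
  });
}

run();
```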
Observations after going all in
The construction of this system was not accomplished overnight. Over the past two years we used many existing community products and developed imperfect solutions of our own, accumulating a lot of experience along the way. Without that accumulation, we could not have built the current system.
The troublesome part of the whole system is the design rather than the implementation. Before starting work, be clear about your application scenarios and the problems you expect to solve, that is, figure out the What and the Why, and only then decide whether to invest the manpower in design and implementation; after all, building the whole system requires some technical strength and a lot of manpower.
So far, the system is still not optimal:
- For example, the Dashboard and the Controller both do similar things with MySQL and ES, which could be abstracted into two dedicated services
- For example, we don’t yet alarm on long page load times
- For example, we haven’t yet aggregated the solutions to all problem types into a “bug resolution encyclopedia”, or exception wiki
- For example, we haven’t bound the platform’s issues to our code repository so that fixing an issue closes it automatically, i.e. linking exceptions to repository code branches; without that, we naturally cannot make a more valuable evaluation of engineers’ code quality
But even so, in the three months since launch, we can see the behavioral data and exceptions of our products, put better-informed effort into improving the product experience, track and fix problems more efficiently, and watch users’ negative feedback shrink day by day. I think this is a good start.
Two more Dashboard screenshots: