Contents

  • Preface
  • Background and requirements
    • What problems to solve
    • General industry solutions
    • Customization
  • System architecture and integration
    • Basic composition
    • System integration and association
    • Efficient operations
    • Summary
  • Data collection and analysis
    • Data collection
    • Data entry
    • Data analysis
  • Problem discovery and solution
    • Automated integration testing
    • Data aggregation
    • Database
  • Conclusion

Preface

The previous article, “Front-end Monitoring SDK Development Share”, covered the implementation of the client SDK. This article is divided into four sections (background and requirements, system architecture and integration, data collection and analysis, problem discovery and solution) and describes how we built our front-end monitoring system.

1. Background and requirements

1.1 What problems to solve

Clients often run into problems such as:

  • Hangs
  • No response
  • Stuttering
  • Service exceptions
  • Bugs that cannot be reproduced

Faced with problems that occur on the client, front-end developers are often helpless. Before we can solve them, we need to know what is happening on the client, so the following come to mind:

  • Collect errors, to resolve errors and compatibility problems
  • Collect performance data, to solve problems such as slow queries and slow loading
  • Collect interface requests, to discover interface errors and connect with server-side monitoring
  • Collect auxiliary information, for comprehensive analysis

To collect all this, we need a front-end monitoring platform that can collect, process, store, and query data. Plenty of platforms and open source projects are available for direct use.

1.2 General industry solutions

Front-end technology has developed to the point where most of us are familiar with front-end monitoring and use it in our projects to some degree. Examples include building a monitoring service on the open source project Sentry, using the paid platform Alibaba Cloud ARMS, or relying on the monitoring built into mini programs.

(1) Sentry. Sentry's main function is error collection. It has clients for most popular languages on both the client and the server side, but it does not support mini programs. Some large companies have built and open-sourced mini program SDKs based on the data structure Sentry reports, though these currently get little attention. Beyond errors, its other front-end monitoring capabilities are relatively weak.

(2) Alibaba Cloud ARMS. ARMS offers a fairly complete set of functions and supported clients, including mini programs, but it is a paid service. Overall its features are comprehensive and well suited to the domestic environment.

(3) Built-in mini program monitoring. WeChat mini programs keep improving their built-in monitoring, and its features are gradually becoming richer, but it only supports mini programs.

These open source or hosted monitoring services always have some shortcomings, such as:

  • The systems are fragmented
  • It is hard to add custom data and queries
  • Features are updated slowly and bugs take a long time to get fixed
  • Secondary development is difficult

1.3 Customization

We used Sentry in the early days, but as the company grew, we found that Sentry no longer met several of our requirements.

Building a front-end monitoring system from zero to one is expensive. In the early days there may be no users at all, and it is unclear whether the system will be recognized or sustained. So we looked for an open source project that was easy to modify and already had some functional modules. At the end of 2019 we found zanePerfor: it already had many features, its Node + Mongo stack fit our front-end technology stack, and it supported WeChat mini programs.

After that, we made long-term modifications and iterations on its code, gradually turning it into a front-end monitoring system better suited to the company's internal environment. In the next section, we cover the basic structure of the current system, its integration with other internal systems, and how operations services keep it running reliably.

2. System architecture and integration

2.1 Basic composition

  • Client SDK

    • Web
    • Mini program
    • iOS
    • Android
  • Server: Node + EggJs

  • Storage: Redis, Mongo + mongooseJs (ORM)

  • Management console: Vue + ElementUI

The first requirement of front-end monitoring is collecting client data. To make integration easy for clients, we develop and package a unified SDK that collects the data. In the early stage we prioritized the Web and WeChat mini programs; as the system iterated, native clients became supported as well.

The data collected by the SDK has to be received through a server interface. On the server we use Node + EggJs. Node suits I/O-intensive scenarios and fits the front-end technology stack. EggJs is easy to use and well documented, and most front-end developers who use Node should be able to pick it up quickly.

After the server receives and processes the data, we store it in a database. Mongo is used for persistent storage: as a document database it is easy to extend, and its JSON-like structure works well with Node, which makes it a natural fit for a logging system. Redis is used for caching: it is an easy-to-use, high-performance key-value database that is mainstream and familiar to most people.

Finally, a management console is needed for querying and managing the data. It is built with Vue + ElementUI, which is simple and fast.

The following figure is the current technical diagram of the system:

The client SDK collects data and reports it. When the Node server receives the data, it first stores it in Redis. The Node service then pulls data from Redis at a rate matched to its consumption capacity, processes and analyzes it, and stores it in Mongo.
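The buffer-then-consume flow described above can be sketched roughly as follows. This is a minimal illustration, not the system's actual code: a plain in-memory class stands in for the Redis list, a plain array stands in for the Mongo collection, and the names ReportBuffer, consumeOnce, and storedAt are all hypothetical.

```javascript
// In-memory stand-in for a Redis list (assumption: the real system uses Redis here).
class ReportBuffer {
  constructor() {
    this.items = [];
  }
  // Called by the report endpoint: enqueue the raw payload and return immediately.
  push(report) {
    this.items.push(JSON.stringify(report));
  }
  // Called by the consumer: take at most batchSize reports, oldest first.
  pull(batchSize) {
    return this.items.splice(0, batchSize).map((raw) => JSON.parse(raw));
  }
  get length() {
    return this.items.length;
  }
}

// Hypothetical consumer step: pull a batch sized to what the worker can handle,
// process each report, then persist it.
function consumeOnce(buffer, store, batchSize) {
  const batch = buffer.pull(batchSize);
  for (const report of batch) {
    store.push({ ...report, storedAt: Date.now() });
  }
  return batch.length;
}

const buffer = new ReportBuffer();
const mongoStandIn = []; // plain array standing in for the Mongo collection

buffer.push({ type: 'error', msg: 'x is not defined' });
buffer.push({ type: 'ajax', url: '/api/list' });
buffer.push({ type: 'page', url: '/home' });

// The consumer only takes two reports this round; the third waits in the buffer.
const consumed = consumeOnce(buffer, mongoStandIn, 2);
```

The point of the design is that accepting a report and storing it are decoupled: a burst of reports only lengthens the buffer, while the consumer keeps writing at its own pace.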

After the first version of our front-end monitoring was in place, we also connected it to the company's existing quality systems, which enriched its functions, improved ease of use, and reduced our workload.

2.2 System integration and association

  • SSO system
  • Internal navigation page (yellow pages)
  • Local log systems: Finder and Elasticsearch
  • APM system: SkyWalking
  • Alarm platform
  • Operation log platform
  • SPA platform

(1) Access the internal SSO system

The company has an SSO (single sign-on) system. It registers an account for each employee, who can log in with an account and password, by scanning a code with enterprise WeChat, or by scanning a code with WeChat. Unified login not only removes the need for separate accounts, passwords, and login methods per system, but also makes it easier to jump between systems and to call each other's interfaces.

(2) Add to the internal navigation page

Our project is listed on the internal navigation page for easy access.

(3) Local log system

The Node back-end service of the front-end monitoring system produces local service logs. We write them to an agreed directory, and the operations service collects them and provides two search systems: Finder and Elasticsearch. Finder organizes logs by time and folder structure, so it looks like browsing the local log files on the server, which is what we are used to most of the time. Elasticsearch is better suited to searching.

(4) APM system: SkyWalking

Besides the run logs it produces locally, the Node back-end service of the front-end monitoring system should itself be monitored by the back-end monitoring system. APM (application performance management) collects data through various probes, gathers key indicators, and presents them, providing a systematic solution for application performance management and fault management. Internally we use SkyWalking, and most of our services are written in Java. SkyWalking also provides a Node probe that our project can use. Once connected, the performance and call chains of the back-end Node service can be queried in the SkyWalking console.

TraceId: the APM tool generates a traceId on the server that identifies the context of one call, and from it you can query the chain of operations the call went through. For a front-end HTTP request, the back-end service can return the traceId of that call to the client in a response header, and the SDK probe of front-end monitoring can collect it. With the traceId collected, front-end monitoring is linked to back-end monitoring: in the management console you can not only view front-end network logs but also query the back-end call chain by traceId.
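As a rough sketch of the traceId collection step, the SDK could read the id from the response headers and attach it to the network log it reports. The header name x-trace-id and the functions extractTraceId and buildAjaxLog are assumptions made for illustration; the article does not say which header the back end uses.

```javascript
// Hypothetical header name; the real one depends on how the back end exposes the trace id.
const TRACE_HEADER = 'x-trace-id';

// Look up the trace id case-insensitively in a plain headers object.
function extractTraceId(headers) {
  for (const [name, value] of Object.entries(headers || {})) {
    if (name.toLowerCase() === TRACE_HEADER) {
      return String(value);
    }
  }
  return null; // the response carried no trace id
}

// Attach the trace id to the network log the SDK is about to report.
function buildAjaxLog(url, status, headers) {
  return { url, status, traceId: extractTraceId(headers) };
}

const ajaxLog = buildAjaxLog('/api/order/list', 500, { 'X-Trace-Id': 'abc123' });
const logWithoutTrace = buildAjaxLog('/api/order/list', 200, {});
```

In a browser, a custom response header on a cross-origin request is only readable by script if the server lists it in Access-Control-Expose-Headers, so the back end would need to expose whichever header it chooses.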

(5) Alarm platform

The front-end monitoring system needs to push alarm emails in real time or on a schedule. The internal alarm platform provides alarm policy configuration and pulls front-end monitoring data, so connecting to it reduces the workload of the front-end monitoring system.

(6) Operation log platform

At the gateway layer, the operation log platform intercepts requests made from our management console and records users' operations in the system. This helps us trace sensitive operations, raise alarms, and so on.

(7) SPA platform

The SPA platform is a static resource publishing platform developed in-house. Business projects use it for semi-automated management, configuration injection, static resource management, and so on. When front-end monitoring collects errors from minified code, it needs the sourceMap file to map them back to the source code. Most front-end monitoring solutions require the sourceMap file to be uploaded to the monitoring system manually. With the SPA platform, resources are managed in one place, and we can store sourceMap files in an agreed location through internal configuration, which spares the business side from manual uploads and improves ease of use.

Once the system's functions are implemented and the related systems are integrated, we rely on operations services to keep it running stably online.

2.3 Efficient operations

  • Log collection
  • Automated builds
  • Containerization
  • Load balancing
  • Health checks and graceful shutdown
  • And so on

Health checks and graceful shutdown

Front-end monitoring receives data reported by business applications at all times. A restart must not interrupt the service, and a shutdown must not lose data. When we update the code and rebuild, the new container is started in advance; once it is up, the old service is gradually shut down and replaced. Before shutting down, the old service receives a notification and stops accepting new tasks, and it only shuts down after all in-flight tasks have finished.
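The "stop accepting, then drain" behavior of the old service can be sketched as below. This is only an illustration under assumptions: TaskTracker and its methods are hypothetical names, and how the shutdown notice actually reaches the service depends on the deployment platform.

```javascript
// Tracks in-flight tasks so that shutdown can wait for them to finish.
class TaskTracker {
  constructor() {
    this.accepting = true;
    this.inFlight = 0;
    this.onDrained = null;
  }
  // Returns false once shutdown has begun, so the caller can refuse new work.
  start() {
    if (!this.accepting) return false;
    this.inFlight += 1;
    return true;
  }
  // Marks one task as done and checks whether shutdown can complete.
  finish() {
    this.inFlight -= 1;
    this.maybeDrain();
  }
  // Stops accepting new tasks; callback runs once all in-flight tasks have finished.
  shutdown(callback) {
    this.accepting = false;
    this.onDrained = callback;
    this.maybeDrain();
  }
  maybeDrain() {
    if (!this.accepting && this.inFlight === 0 && this.onDrained) {
      const callback = this.onDrained;
      this.onDrained = null;
      callback();
    }
  }
}

const tracker = new TaskTracker();
let closed = false;

tracker.start();                                  // one report is being processed
tracker.shutdown(() => { closed = true; });       // the shutdown notice arrives
const acceptedAfterNotice = tracker.start();      // new work is refused
const closedBeforeFinish = closed;                // still waiting for the in-flight task
tracker.finish();                                 // last task completes, callback fires
```

In a Node service, the shutdown call would typically be wired to a process signal handler such as process.on('SIGTERM', ...), with the callback closing the server and exiting.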

A complete online system is inseparable from operations services. They do many things we are not even aware of day to day, and we developers should pay attention to them and understand them.

2.4 Summary

Independent internal systems that each handle one task, connections between those systems, and operations services that keep them running together form the company's internal ecosystem. This removes a lot of repetitive work and lets each system go deeper in its own field.

So far we have covered the background and the composition of the front-end monitoring system. The next section explains its core, data collection and analysis, in a little more detail.

3. Data collection and analysis

3.1 Data collection

(1) Performance. We collect performance information for native cold and hot startup, Web page loading, static resources, Ajax interfaces, and so on, including loading time, HTTP protocol version, and response body size. This provides data to support improving overall service quality and to track down slow queries.

(2) Errors. Native and JS errors, static resource loading errors, and Ajax request errors are general-purpose and easy to understand. Business interface errors need more explanation:

When a client sends an Ajax request to a back-end business interface, the interface returns a JSON structure that generally contains two fields: an error code and a message. The error code is a status code defined internally by the service interface. Normal business responses follow an internal convention such as error code == 0. If the error code is not 0, the response may be an unexpected or a foreseeable exception, and such error data needs to be collected.

Because different teams or interfaces may follow different conventions, we only provide a default method, which is called after the Ajax response arrives. The business side writes its own judgment logic in this method based on its convention and the JSON in the response. Something like this:

errcodeReport(res) {
  // Only inspect plain-object responses that carry an errcode field.
  if (
    Object.prototype.toString.call(res) === '[object Object]' &&
    Object.prototype.hasOwnProperty.call(res, 'errcode') &&
    res.errcode !== 0
  ) {
    return { isReport: true, errMsg: res.errmsg, code: res.errcode };
  }
  return { isReport: false };
}

(3) Auxiliary information. Besides the two kinds of hard metrics above, we need a lot of other information, such as the user's access path, click behavior, user ID, device version, device model, UV/UA identifiers, traceId, and so on. The problems we need to solve often cannot be diagnosed directly, and in some cases front-end monitoring has to be correlated with other systems, so this soft information is also very important.

Collected data cannot go straight into the database; it has to be processed first. The data passes through two stages before it is stored.

3.2 Data entry

(1) SDK filtering. Some third-party interfaces and resources generate a steady volume of requests and often report errors, and there are always some internal interfaces we do not want to collect. Collecting this data pollutes what we see in the management console. The client SDK therefore provides a filtering configuration: the client can filter out interfaces that do not need to be collected for its particular service, and the SDK can also have built-in filters for interface paths agreed on across the company.
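A filtering configuration of this kind might look roughly like the sketch below. The option name ignoreUrls, the function shouldCollect, and the example URLs are hypothetical; the actual SDK options are not shown in this article.

```javascript
// Hypothetical option shape: ignoreUrls accepts strings (substring match)
// or regular expressions.
const sdkConfig = {
  ignoreUrls: ['thirdparty.example.com', /^\/internal\/heartbeat/],
};

// Returns true when the request should be collected, false when a rule filters it out.
function shouldCollect(url, config) {
  const rules = config.ignoreUrls || [];
  return !rules.some((rule) =>
    rule instanceof RegExp ? rule.test(url) : url.includes(rule)
  );
}

const collectThirdParty = shouldCollect('https://thirdparty.example.com/track.js', sdkConfig);
const collectHeartbeat = shouldCollect('/internal/heartbeat?ts=1', sdkConfig);
const collectBusiness = shouldCollect('/api/order/list', sdkConfig);
```

Filtering in the SDK rather than on the server means the unwanted requests are never reported at all, which also saves bandwidth and server load.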

(2) Server-side processing. When the server handles data reported by the SDK, it processes the data a second time before saving it, for example splitting the raw data into fields to make queries easier.
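As one possible example of such splitting, a reported URL could be broken into host, path, and query fields so that each can be queried separately. The field names and the function splitReport are assumptions for illustration, not the system's actual schema.

```javascript
// Split the reported URL into separately queryable fields before saving.
function splitReport(raw) {
  const parsed = new URL(raw.url);
  return {
    ...raw,
    host: parsed.host,
    path: parsed.pathname,
    query: parsed.search.replace(/^\?/, ''), // drop the leading "?"
  };
}

const storedDoc = splitReport({
  type: 'ajax',
  url: 'https://xxx.com/api/list?page=2&size=10',
  duration: 120,
});
```

With the path stored on its own, a query for one interface does not need to pattern-match against full URLs.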

3.3 Data analysis

Once the data is in the database, we can analyze and query it. In addition to the basic query functions of the management console, we provide charts and daily reports for different scenarios.

3.3.1 Daily reports

Daily reports pick out key information and analysis results from the existing data. The platform is always there, but people do not necessarily pay attention to it. The current audience for daily reports is developers and testers.

3.3.2 Charts

Charts are just one form of data presentation, for example real-time page loading charts and real-time error charts. Most of the time we do not watch them continuously, but when we release a new version we can keep them open to watch for changes. Charts complement the other views of the data.

The first three sections gave a rough introduction to the functions of the front-end monitoring system. The fourth section shares the problems we ran into while building it and how we solved them.

4. Problem discovery and solution

4.1 Automated integration testing

The JS SDK is a standalone library that needs long-term maintenance and updates and is used in many business projects. As the code and feature set grew, manual testing became more and more expensive, development was accompanied by a strong sense of insecurity, and test coverage was extremely low. Starting from scratch, we began building out automated integration tests.

Our JS SDK mainly installs probes that listen to the running state of business projects and collect information, so integration testing is our focus. The Web SDK runs in a browser environment, not in a Node environment. We currently have two kinds of tests: terminal tests and browser tests.

Terminal tests let us support a continuous integration environment (after code is pushed to the repository, the tests run in an environment provided by the hosting platform). Browser tests run the code in the most realistic environment and can also cover browser compatibility.

4.2 Data aggregation

In the management console we need to aggregate and categorize data so that the monitoring data can be viewed more clearly. Several things got in the way of aggregation.

(1) Dynamic routing

Some interfaces carry parameters in dynamic route segments, such as xxx.com/api/getuid/15501/detail. Requests to the same interface then cannot be aggregated by URL, so by default the SDK replaces the dynamic parameter part of such interfaces with *. The link above becomes xxx.com/api/getuid/*/detail.

This lets the server classify them as one interface. The current rule is a regular expression that matches segments made up entirely of digits, and the dynamic parameters of our existing internal interfaces happen to be all digits. If a dynamic parameter is not purely numeric, it cannot be recognized automatically. The only option then is the configuration list the SDK provides: the business side configures the relevant links one by one and the SDK replaces them. That is neither business-friendly nor maintainable, so in such cases the best advice is to pass the parameter in the query string (after ?) or in the request body.
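The all-digits replacement rule can be sketched as follows. This is a minimal version written for illustration, assuming the rule is applied per path segment and leaves the query string untouched; the function name normalizeDynamicPath is hypothetical.

```javascript
// Replace every path segment made up only of digits with "*".
function normalizeDynamicPath(url) {
  const queryStart = url.indexOf('?');
  const path = queryStart === -1 ? url : url.slice(0, queryStart);
  const rest = queryStart === -1 ? '' : url.slice(queryStart);
  const normalized = path
    .split('/')
    .map((segment) => (/^\d+$/.test(segment) ? '*' : segment))
    .join('/');
  return normalized + rest;
}

const numericId = normalizeDynamicPath('xxx.com/api/getuid/15501/detail');
const mixedId = normalizeDynamicPath('xxx.com/api/getuid/ab12/detail');
const withQuery = normalizeDynamicPath('xxx.com/api/user/42?id=7');
```

The second call shows the limitation described above: a segment such as ab12 is not purely numeric, so it is left as-is and the URL would not aggregate without an explicit configuration entry.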

(2) Interface error messages are not fixed

After collecting Ajax business error responses, front-end monitoring aggregates identical errors by interface path + error message. We found one interface whose error message contained a random value, possibly the unique ID of the configuration being requested. Roughly as follows:

{
  errMsg: 'an exception occurred (d1nbj1AZ5)', // random value in brackets
  errcode: 1
}

The solution is to agree with the server side that any such extra value is returned in a separate field instead of being embedded in the error message.
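To see why the random value breaks aggregation, consider a simplified grouping by interface path + error message. The key format, the function names, and the configId field below are all hypothetical and only illustrate the idea.

```javascript
// Aggregation key: interface path + error message.
function aggregationKey(log) {
  return `${log.path}|${log.errMsg}`;
}

// Count how many logs fall into each aggregation group.
function countByKey(logs) {
  const counts = new Map();
  for (const log of logs) {
    const key = aggregationKey(log);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Random id embedded in the message: every occurrence becomes its own group.
const groupsWithRandomId = countByKey([
  { path: '/api/config', errMsg: 'an exception occurred (d1nbj1AZ5)' },
  { path: '/api/config', errMsg: 'an exception occurred (k9QwE2xLm)' },
]);

// Random id moved to a separate field: both occurrences share one group.
const groupsWithSeparateField = countByKey([
  { path: '/api/config', errMsg: 'an exception occurred', configId: 'd1nbj1AZ5' },
  { path: '/api/config', errMsg: 'an exception occurred', configId: 'k9QwE2xLm' },
]);
```

Keeping the id in its own field loses nothing: it is still stored with each log and can be queried, it just no longer splits one error into many groups.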

4.3 Database

4.3.1 Table design

When we iterated to support native clients, there were several options for the table structure. For a logging system, we ultimately chose a denormalized design that trades storage space for throughput.

4.3.2 Query optimization

(1) Indexes

As the volume of data grew, queries became slower and slower. We adjusted the indexes, building an index on the time field, and the query conditions sent by the client default to a reasonable time range to keep queries fast. We also use compound indexes where appropriate.
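The default time range can be illustrated with a small helper that builds a Mongo-style filter. This is a sketch under assumptions: the field names appId and createTime, the 24-hour default, and the function buildLogFilter are hypothetical and not taken from the actual system.

```javascript
const DEFAULT_RANGE_MS = 24 * 60 * 60 * 1000; // hypothetical default: last 24 hours

// Build a query filter that always carries a bounded time range.
// params.from and params.to are optional millisecond timestamps.
function buildLogFilter(params, now = Date.now()) {
  const end = params.to === undefined ? now : params.to;
  const start = params.from === undefined ? end - DEFAULT_RANGE_MS : params.from;
  return {
    appId: params.appId,
    createTime: { $gte: new Date(start), $lte: new Date(end) },
  };
}

// No range supplied by the client: the filter falls back to the last 24 hours.
const defaultFilter = buildLogFilter({ appId: 'shop' }, 1700000000000);

// Explicit range supplied by the client: it is used as given.
const explicitFilter = buildLogFilter(
  { appId: 'shop', from: 1699990000000, to: 1699995000000 },
  1700000000000
);
```

Because every query is bounded in time, an index on the time field can narrow the scan; a compound index such as { appId: 1, createTime: -1 } (again with assumed field names) would be one way to serve a filter of this shape.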

(2) Slow queries

In the early days, some slow queries, such as mapReduce statements used by minute-level scheduled tasks, dragged down database performance. We eliminated them by optimizing the queries and replacing inefficient statements. The database's average CPU usage over 24 hours was 78% before the optimization went live and 27% after. Below is a screenshot from the test environment after optimization, with the peak dropping significantly.

4.3.3 Dedicated database

Initially, the front-end monitoring system shared a cloud database with other business systems. For a period of time front-end monitoring had slow database queries, and they happened to come from a minute-level scheduled task. As a result the shared database accumulated many slow query records and its CPU usage stayed high. On top of that, front-end monitoring constantly writes monitoring data with a certain level of concurrency. Front-end monitoring was dragging down the whole database, and with so many slow query logs, other services had trouble finding their own entries in the slow log. Later, the operations team provisioned a dedicated database for front-end monitoring.

Before the migration, the shared database was a 12-core, 32 GB, three-node configuration (one primary and two secondaries). After the migration, the dedicated database is a 2-core, 4 GB, three-node configuration. Although the configuration is smaller, average response time is faster now that front-end monitoring has its own database. The data stored by front-end monitoring is log data, and it should not be allowed to degrade database performance and affect business data services. Separating the logging system's database from business services is therefore reasonable.

Of course, there were many other problems and details during the system's iterations, which are not listed here.

Conclusion

The front-end monitoring system has now been running for more than a year and serves dozens of the company's applications. It still has many shortcomings, and we keep planning and iterating on it. This article is a record of the work so far. We will keep going.
