Author: cjinhuo. Do not reproduce without authorization.

Background

This article follows the previous one in the front-end monitoring platform series, JS SDK (open source). Its main purpose is to discuss the feature design and implementation of the server side.

Technology stack

nestjs

Nestjs has good TypeScript support, rich decorators, and an out-of-the-box dependency injection container.

redis

redis.hash

Error reporting is a frequent operation, and querying the database on every report would waste resources. A hash is therefore used to store the association between an apikey and its project, along with other commonly needed data. Each hash exists separately from the data of other projects, which makes it equivalent to a namespace. (Note that the keys inside a hash cannot be given their own expiration time.)

redis.string

Used to store user ID information, project ID information, and some frequently used data. Remember to set an expiration time, otherwise data that has gone unused for a long time will remain in Redis.

redis.list

Used to implement a queue similar to RabbitMQ for batch computation and batch writes to the database, as described below.

redis.bitmap

A bitmap is used to count the occurrences of a tag. It is fast, supports high concurrency, and takes very little memory, because it stores the value of each tag as a single binary bit. First, a quick review of bit, byte, and word.

Bit&Byte&Word

Bit = Binary digIT = 0 or 1

Byte = a sequence of 8 bits = 00000000, 00000001, … , or 11111111

Word = a sequence of N bits where N = 16, 32, 64 depending on the computer

Scenario 1: Suppose you need to record the check-ins of an entire company (100,000 employees) for all 365 days of a year:

  1. Use a database: create a table containing the time and the employee ID, and insert a row into it each time someone checks in. Disadvantages: if a large number of people check in at the same time, the database connection pool is exhausted and the data cannot be updated in real time. Advantages: detailed data is kept, so you can query who checked in at a specific time on a specific day.
  2. Use a bitmap: the date of each day is the bitmap key and the userId is the offset, so one userId occupies one binary bit in memory. With consecutively incrementing userIds, the check-ins of 100,000 people take 12,500 B (about 12 KB) in memory. (A hash table that stores each entry in a word would need at least 16 times as much memory.) To determine whether a person has checked in, you only need to check whether the bit for that userId is set. Disadvantages: you need an extra map to hold detailed data, such as the exact time a user checked in on a given day. Advantages: in-memory speed, high concurrency support, and real-time query, insert, and update.

Employee check-in – Bitmap
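The bitmap approach in scenario 1 can be sketched as follows. This is a minimal in-memory stand-in, not the author's code: the `Bitmap` class imitates the Redis SETBIT, GETBIT, and BITCOUNT commands, and the names `checkIn`, `hasCheckedIn`, and `checkInsByDate` are hypothetical.

```typescript
// In-memory stand-in for one Redis bitmap (SETBIT / GETBIT / BITCOUNT).
class Bitmap {
  private bytes: Uint8Array;

  constructor(bitCapacity: number) {
    // One bit per userId, so capacity / 8 bytes, rounded up.
    this.bytes = new Uint8Array(Math.ceil(bitCapacity / 8));
  }

  sizeInBytes(): number {
    return this.bytes.length;
  }

  setBit(offset: number): void {
    this.bytes[offset >> 3] |= 1 << (offset & 7);
  }

  getBit(offset: number): boolean {
    return (this.bytes[offset >> 3] & (1 << (offset & 7))) !== 0;
  }

  // Number of bits that are set, i.e. number of people who checked in.
  count(): number {
    let total = 0;
    for (let i = 0; i < this.bytes.length; i++) {
      let b = this.bytes[i];
      while (b !== 0) {
        total += b & 1;
        b >>= 1;
      }
    }
    return total;
  }
}

const EMPLOYEES = 100000;

// One bitmap per day; the date string plays the role of the Redis key.
const checkInsByDate: Record<string, Bitmap> = {};

function checkIn(date: string, userId: number): void {
  const bitmap = checkInsByDate[date] || (checkInsByDate[date] = new Bitmap(EMPLOYEES));
  bitmap.setBit(userId);
}

function hasCheckedIn(date: string, userId: number): boolean {
  const bitmap = checkInsByDate[date];
  return bitmap !== undefined && bitmap.getBit(userId);
}
```

Checking in twice sets the same bit again, so duplicates cost nothing, and a full day for 100,000 employees occupies 12,500 bytes regardless of how many people check in.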

Scenario 2: Take the error tags in this system as an example. One project has multiple errors, and one error has multiple events (see the relationship between events and errors). Suppose you want to store the tag set of each error for later use in tag visualization reports or search: which browser versions reported the error, the set of IPs, the set of custom tags, and so on. There are at least 10 such tags. Implemented with bitmaps:

Use a bitmap to store each tag set of an error. An error may be reported from many browser versions and many IP addresses, so one error may have 12 to 15 bitmaps, with each bitmap holding the set of values for one tag. There may be hundreds of thousands of IP addresses, which take up only a small amount of memory (on the order of KB) in a bitmap and are deduplicated automatically. However, a bitmap stores only bit positions, so another table is needed to map each position back to its actual value, such as the IP. That is not easy to maintain, so mysql was chosen as the storage method instead.

Conclusion:

  1. If you need to do a lot of operations on detailed data, it is recommended to put it into the database

  2. If real-time statistics are required and the data volume is large, you can use a bitmap to quickly look up, deduplicate, and delete data

Reference links:

  • Comics: What is Bitmap?
  • Storage scheme for user labels

mysql

Stores the result data for users, teams, projects, errors, error-level tags, and project-level tags.

SLS instead of ES

For certain reasons, Alibaba Cloud's Log Service (SLS) is used instead of Elasticsearch. However, SLS does not support updating data once it has been inserted, so some tables, such as the tag collection tables, are still kept in mysql.

Features

An overview of the process

Table structure design

Overview of table structure

As shown in the figure above, there are 10 tables plus 1 SLS (Log Service) store. Here is what each table does and how the tables relate to each other.

events

Stores user behavior, the error stack, tag information, and error information. Every reported error is eventually stored in events, which live in SLS. The reason for using SLS is that it needs only a few hundred milliseconds to run a multi-condition search over terabytes of data.

errors table

Stores the error status, the number of users affected by the error, the number of error events, the error level, the ID of the project the error belongs to, the developer responsible for the error, and so on.

The relationship between event and error

One error corresponds to multiple events. For example, two computers access the same page at the same time, and an interface error is reported on that page, such as an HTTP 500 internal server error. mito-sdk generates an errorId through hashCode from the interface address, the error type (HTTP), the status code, and the request method. If these parameters are the same, the errorId generated by hashCode is also the same, so one error is pushed to mysql and the two events are pushed to SLS separately. Although they belong to the same error, the tags of the two events differ: IP, browser version, and so on.
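To illustrate how identical parameters lead to the same errorId, here is a minimal sketch. The article only says that mito-sdk uses a hashCode; the Java-style 32-bit string hash, the field names in `HttpErrorInfo`, and the `getErrorId` helper are assumptions made for this example.

```typescript
interface HttpErrorInfo {
  type: string;   // error type, e.g. "HTTP"
  url: string;    // interface address
  method: string; // request method
  status: number; // status code
}

// Java-style 32-bit string hash. An assumption: the SDK's real hashCode may differ.
function hashCode(input: string): number {
  let hash = 0;
  for (let i = 0; i < input.length; i++) {
    // Equivalent to hash * 31 + charCode, wrapped to a signed 32-bit integer.
    hash = ((hash << 5) - hash + input.charCodeAt(i)) | 0;
  }
  return hash;
}

// Events whose four parameters match get the same errorId.
function getErrorId(info: HttpErrorInfo): number {
  return hashCode([info.type, info.url, info.method, info.status].join("|"));
}
```

Two events with the same address, type, status code, and method map to one error, while their IP and browser tags are stored per event.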

project_tag and error_tag tables

Given the relationship between event and error, the error_tag table collects the tag types and counts of all events under one error, giving the tag set statistics for that error. The project_tag table collects the tag types and counts of all events under all errors of one project. It provides the data for the drop-down lists used in multi-tag search.
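The two levels of tag collection can be sketched as a single pass over a batch of events. This is an illustrative in-memory version only: the `TaggedEvent` shape and the `aggregateTags` helper are hypothetical, and the real tables live in mysql.

```typescript
interface TaggedEvent {
  projectId: number;
  errorId: number;
  tags: Record<string, string>; // tag name -> tag value, e.g. browser -> "Chrome 90"
}

// tag name -> tag value -> number of events carrying that value
type TagCounts = Record<string, Record<string, number>>;

function addTags(target: TagCounts, tags: Record<string, string>): void {
  const names = Object.keys(tags);
  for (let i = 0; i < names.length; i++) {
    const name = names[i];
    const value = tags[name];
    const values = target[name] || (target[name] = {});
    values[value] = (values[value] || 0) + 1;
  }
}

// Builds error-level counts (error_tag) and project-level counts (project_tag).
function aggregateTags(events: TaggedEvent[]): {
  byError: Record<number, TagCounts>;
  byProject: Record<number, TagCounts>;
} {
  const byError: Record<number, TagCounts> = {};
  const byProject: Record<number, TagCounts> = {};
  for (let i = 0; i < events.length; i++) {
    const event = events[i];
    addTags(byError[event.errorId] || (byError[event.errorId] = {}), event.tags);
    addTags(byProject[event.projectId] || (byProject[event.projectId] = {}), event.tags);
  }
  return { byError, byProject };
}
```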

project table

Holds the project name and the apikey (the link between the SDK and the project).

user_project table

The join table that associates the user table with the project table. It holds the user ID and the project ID.

team table

Stores the team name and the team's notification mode.

user_team table

The join table that associates the user table with the team table. It holds the user ID and the team ID.

sourcemap table

Bundled .js files need their .map files to restore production code to its source. This table stores the addresses of the .js and .map files.

collect table

Stores the user ID and the error ID, indicating the errors a user has saved.

Error collection

The SDK on the client is responsible for collecting errors and reporting them to the server, so the server must provide an interface that receives error information. Because the SDK runs on every client, concurrency can be high. If each error were processed as soon as it arrived, with affected users and other statistics computed and written to the database immediately, the database connection pool would be exhausted, and the server itself would gradually buckle as concurrency grows.

Error collection

Error collection itself is handled by mito-sdk: the SDK collects client error information, is configured with a DSN, and reports to the specified server.

Bulk storage

Caching to redis

To relieve pressure on the server during peak periods, the information needed from each event is extracted and pushed onto a redis.list as events arrive, and a scheduled task then consumes the redis.list gradually.

sdkToEnd

Get error tags

  1. User's real IP: the Nginx reverse proxy adds an x-real-ip field to the request header (nginx configuration: proxy_set_header x-real-ip $remote_addr;). The geographic location and carrier are then looked up from the IP
  2. The user-agent field gives the browser version, system version, device, and so on
  3. The data reported by the SDK gives the current error type, SDK version, custom tag, trackerId (unique user identifier), traceId (unique identifier of the interface request), and so on
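The three sources above can be combined into one tag object per event. The sketch below is a simplified assumption of how that might look: the `SdkReport` and `EventTags` shapes, the `extractTags` helper, and the hand-written user-agent patterns are all hypothetical, and the IP-to-location lookup is left out because it needs an external database.

```typescript
interface SdkReport {
  type: string;
  sdkVersion: string;
  trackerId: string;
  customTag?: string;
}

interface EventTags {
  ip: string;
  browser: string;
  os: string;
  errorType: string;
  sdkVersion: string;
  trackerId: string;
  customTag: string;
}

// Very small user-agent parser; a real server would likely use a dedicated library.
function parseBrowser(ua: string): string {
  // Order matters: Edge user agents also contain "Chrome/".
  const names = ["Edge", "Chrome", "Firefox", "Safari"];
  const patterns = [/Edg\/(\d+)/, /Chrome\/(\d+)/, /Firefox\/(\d+)/, /Version\/(\d+).*Safari/];
  for (let i = 0; i < patterns.length; i++) {
    const match = patterns[i].exec(ua);
    if (match) return names[i] + " " + match[1];
  }
  return "unknown";
}

function parseOs(ua: string): string {
  if (/Windows NT/.test(ua)) return "Windows";
  if (/Android/.test(ua)) return "Android"; // before Linux: Android UAs contain "Linux"
  if (/iPhone|iPad/.test(ua)) return "iOS"; // before macOS: iOS UAs contain "Mac OS X"
  if (/Mac OS X/.test(ua)) return "macOS";
  if (/Linux/.test(ua)) return "Linux";
  return "unknown";
}

function extractTags(headers: Record<string, string | undefined>, report: SdkReport): EventTags {
  const ua = headers["user-agent"] || "";
  return {
    ip: headers["x-real-ip"] || "unknown", // set by the Nginx reverse proxy
    browser: parseBrowser(ua),
    os: parseOs(ua),
    errorType: report.type,
    sdkVersion: report.sdkVersion,
    trackerId: report.trackerId,
    customTag: report.customTag || "",
  };
}
```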

Collect error tags

There are so many tags, so why collect them? There are several benefits:

  • Better search: multi-tag search can filter errors more precisely, as shown in the figure below:

    Multiple search
  • More tag information in the error details

    Error tag information
  • Better statistics over tag sets

    Tag set statistics

Batch writes to the database

Overview of batch storage

For example, a scheduled task that runs every 3 minutes takes the first 100 items from the redis.list. While the server is processing them, new errors may be pushed onto the list, so the list has to be checked again at the end, when the processed items are trimmed off.

Take out the data

Trimming redis

Purpose of the cache and the scheduled task: they keep the connection pool from being exhausted by frequent database reads and writes during peak periods. The batch size of the scheduled task can be increased as appropriate. For example, processing 10,000 items in one 3-minute scheduled task is well within reach.
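The take-then-trim cycle can be sketched with an in-memory list standing in for redis.list. The `ListQueue` class imitates the RPUSH, LRANGE, and LTRIM commands, and `consumeBatch` is a hypothetical helper, not the author's code. The point it shows is that only the items that were actually read get trimmed, so events pushed during processing are kept for the next run.

```typescript
// In-memory stand-in for a Redis list.
class ListQueue<T> {
  private items: T[] = [];

  // RPUSH: append to the tail.
  push(item: T): void {
    this.items.push(item);
  }

  // LRANGE start stop: stop is inclusive, as in Redis.
  range(start: number, stop: number): T[] {
    return this.items.slice(start, stop + 1);
  }

  // LTRIM start -1: keep everything from start to the end.
  trimFrom(start: number): void {
    this.items = this.items.slice(start);
  }

  size(): number {
    return this.items.length;
  }
}

// Reads up to batchSize items, handles them, then removes exactly those items.
function consumeBatch<T>(queue: ListQueue<T>, batchSize: number, handle: (batch: T[]) => void): number {
  const batch = queue.range(0, batchSize - 1);
  if (batch.length === 0) return 0;
  handle(batch); // compute statistics and write to the database in bulk
  // Trim by the number of items read, not by the current length:
  // new events may have been pushed while handle() was running.
  queue.trimFrom(batch.length);
  return batch.length;
}
```

In a running server this function would be called from a scheduled task, for example every 3 minutes with a batch size of 100 as in the text.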

Alarm rules and implementations

Once errors are collected, developers must be notified so they can deal with them in time, either by fixing the error and updating its status or at least by knowing that the error exists. Notifications must not be too frequent, however, so a set of alarm rules is needed, and these rules can be defined per project.

Alarm rules

HTTP Alarm Rules

Take HTTP_ERROR as an example. The first occurrence is level P4 and triggers no notification. As the number of events and the number of affected users grow, the level is escalated step by step. When the number reaches 60, the error is escalated one level and the responsible person is notified again. When it reaches 500, the error is escalated to P1; by then there are enough events to assume that the error must be fixed (or deliberately ignored). All of this is customizable, and each project should set its own thresholds according to the size of its user base.

Different projects correspond to different levels

If project A has 50 daily active users and project B has 10,000 daily active users, you need to adjust the FetchRule accordingly.
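A threshold-based rule of this kind can be sketched as a sorted list of steps. Only the P4 starting level, the 60-event step, and the 500-event P1 step come from the text; which level the 60-event step maps to, the P2 threshold, the `RuleStep` shape, and the `scaleRule` helper are assumptions for illustration and do not describe the real FetchRule format.

```typescript
type Level = "P4" | "P3" | "P2" | "P1";

interface RuleStep {
  level: Level;
  minEvents: number; // event count at which this level is reached
  notify: boolean;   // whether reaching this level sends a notification
}

// Steps must be sorted by minEvents in ascending order.
const baseRule: RuleStep[] = [
  { level: "P4", minEvents: 1, notify: false },
  { level: "P3", minEvents: 60, notify: true },  // level name is assumed
  { level: "P2", minEvents: 200, notify: true }, // hypothetical threshold
  { level: "P1", minEvents: 500, notify: true },
];

// Returns the highest step whose threshold the event count has reached.
function stepFor(eventCount: number, rule: RuleStep[]): RuleStep {
  let current = rule[0];
  for (let i = 0; i < rule.length; i++) {
    if (eventCount >= rule[i].minEvents) current = rule[i];
  }
  return current;
}

// Notify only when a new event pushes the error into a higher, notifying step.
function shouldNotify(previousCount: number, newCount: number, rule: RuleStep[]): boolean {
  const before = stepFor(previousCount, rule);
  const after = stepFor(newCount, rule);
  return after.minEvents > before.minEvents && after.notify;
}

// Scales thresholds for projects with fewer or more users.
function scaleRule(rule: RuleStep[], factor: number): RuleStep[] {
  return rule.map((step) => ({
    level: step.level,
    minEvents: Math.max(1, Math.round(step.minEvents * factor)),
    notify: step.notify,
  }));
}
```

With this shape, a small project could use `scaleRule(baseRule, 0.1)` so that P1 is reached at 50 events instead of 500.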

There are five error statuses

Error status
  • Unresolved

When an error is first reported, its default status is unresolved. It can be changed to: resolving or ignored

  • Resolving

If you see an error on the monitoring platform and think it needs to be fixed, but do not want to keep being notified about the same error, change its status to resolving. You then have two hours to fix it. If the error is reported again within those two hours, the responsible person is not notified. The server automatically changes the status to resolved after two hours. If the error is triggered again after that, the status switches to reopened and the corresponding developer is notified according to the normal alarm rules. The resolving status cannot be changed manually

  • Ignored

If an error has reached level P1 but you do not want to fix it, or it does not need fixing at all, you can ignore it

The error status has been ignored

If the error is ignored, the corresponding person is not notified even when the error rises to level P1. The status can be changed to: resolving

  • Reopened

If an error has been changed to resolved and is then reported again, the status changes to reopened, and subsequent alarm notifications follow the normal procedure. The status can be changed to: resolving or ignored
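The transitions described above can be written down as a small state table. This is a sketch derived from the text, not the server's actual code: the `Status` names, `canChange`, and `onErrorReported` are hypothetical, and the manual transitions out of resolved are not specified in the article, so the table assumes there are none.

```typescript
type Status = "unresolved" | "resolving" | "resolved" | "ignored" | "reopened";

// Status changes a user may make by hand.
const manualTransitions: Record<Status, Status[]> = {
  unresolved: ["resolving", "ignored"],
  resolving: [], // cannot be changed manually; the server moves it to resolved
  resolved: [],  // not specified in the text; assumed to have none
  ignored: ["resolving"],
  reopened: ["resolving", "ignored"],
};

function canChange(from: Status, to: Status): boolean {
  return manualTransitions[from].indexOf(to) !== -1;
}

// What happens to the status when the same error is reported again,
// and whether the alarm rules are allowed to send a notification.
function onErrorReported(status: Status): { status: Status; mayNotify: boolean } {
  switch (status) {
    case "resolving":
      return { status: "resolving", mayNotify: false }; // silenced for two hours
    case "ignored":
      return { status: "ignored", mayNotify: false };   // silenced even at P1
    case "resolved":
      return { status: "reopened", mayNotify: true };   // reported again after the fix
    default:
      return { status: status, mayNotify: true };       // unresolved or reopened
  }
}
```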

Implementation

After the status is changed to resolving, how can notifications be suppressed for two hours and the status changed to resolved automatically once the two hours are up?

The first idea is to set a 2-hour expiry in redis and register a setTimeout whose callback runs the follow-up logic after 2 hours. There is a problem, though: if the server is redeployed, the setTimeout is lost. The solution is redis.hash:

Delay changing resolution status

redis.hash serves as a persistent cache in which the hash key is the error ID and the hash value is the expiration timestamp. After a release, the server can re-read redis.hash and restart the setTimeout timers. During the two hours, if the error is reported again, no alarm is sent as long as the status is resolving
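A minimal sketch of this idea follows, with a plain object standing in for redis.hash and an injected scheduler standing in for setTimeout so the logic can be exercised without waiting. All names here (`ExpiryStore`, `startResolving`, `restoreTimers`, `isSilenced`) are hypothetical.

```typescript
const TWO_HOURS_MS = 2 * 60 * 60 * 1000;

// Stand-in for redis.hash: error ID -> expiration timestamp in milliseconds.
type ExpiryStore = Record<string, number>;

// Stand-in for setTimeout, injected so that it can be replaced in tests.
type Scheduler = (delayMs: number, run: () => void) => void;

function finishResolving(store: ExpiryStore, errorId: string, onResolved: (id: string) => void): void {
  if (store[errorId] === undefined) return; // already handled
  delete store[errorId];                    // HDEL
  onResolved(errorId);                      // e.g. set the status to resolved in mysql
}

// Called when a user changes an error to "resolving".
function startResolving(
  store: ExpiryStore, errorId: string, now: number,
  schedule: Scheduler, onResolved: (id: string) => void
): void {
  store[errorId] = now + TWO_HOURS_MS;      // HSET, survives a redeploy
  schedule(TWO_HOURS_MS, () => finishResolving(store, errorId, onResolved));
}

// Called once at server start-up: timers are lost on redeploy, the hash is not.
function restoreTimers(
  store: ExpiryStore, now: number,
  schedule: Scheduler, onResolved: (id: string) => void
): void {
  const ids = Object.keys(store);
  for (let i = 0; i < ids.length; i++) {
    const errorId = ids[i];
    const remaining = Math.max(0, store[errorId] - now); // overdue entries fire at once
    schedule(remaining, () => finishResolving(store, errorId, onResolved));
  }
}

// True while the error is inside its two-hour quiet window.
function isSilenced(store: ExpiryStore, errorId: string, now: number): boolean {
  const expiresAt = store[errorId];
  return expiresAt !== undefined && now < expiresAt;
}
```

In a real server the scheduler could simply be `(delayMs, run) => { setTimeout(run, delayMs); }`, and `restoreTimers` would run during application bootstrap.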

Manual reporting: emergency notification

mito-sdk supports manual reporting, for example for errors caught in a try/catch block. An error reported this way triggers a notification every time it is passed in

Mito.log reports a P1 error

Multi-tag search

Multi-tag data display

The tag drop-down lists are populated from the project_tag table, which collects all tags at the project level. The table is filled by the batch collection described above (batch writes to the database), and the tag data changes each time you switch projects. Once tags are selected, the server searches SLS to retrieve an array of errorIds, then filters, sorts, and paginates that array by querying the errors table, and finally returns the result to the front end

Multiple tag search process
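The two-step search can be sketched with in-memory arrays standing in for SLS and the errors table. The record shapes and the helpers `searchErrorIds` and `pageErrors` are hypothetical, and sorting by event count is just one possible ordering.

```typescript
interface EventRecord {
  errorId: number;
  tags: Record<string, string>;
}

interface ErrorRecord {
  id: number;
  eventCount: number;
}

// Step 1 (SLS in the article): find the errorIds of events matching every selected tag.
function searchErrorIds(events: EventRecord[], filters: Record<string, string>): number[] {
  const filterNames = Object.keys(filters);
  const ids: number[] = [];
  for (let i = 0; i < events.length; i++) {
    const event = events[i];
    const matches = filterNames.every((name) => event.tags[name] === filters[name]);
    if (matches && ids.indexOf(event.errorId) === -1) ids.push(event.errorId);
  }
  return ids;
}

// Step 2 (errors table in the article): filter, sort, and paginate by errorId.
function pageErrors(errors: ErrorRecord[], ids: number[], page: number, pageSize: number): ErrorRecord[] {
  return errors
    .filter((error) => ids.indexOf(error.id) !== -1)
    .sort((a, b) => b.eventCount - a.eventCount) // most frequent errors first
    .slice((page - 1) * pageSize, page * pageSize);
}
```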

Error details (front end)

Here are some of the front-end components that show error details:

User behavior stack

The user behavior stack is collected by mito-sdk, and the number of entries it keeps is configurable. Its main purpose is to show the context of an error: for an interface error, for example, you can see what the user did before it was triggered.

Sourcemap restoration

Use source-map to restore the bundled JS files to the .vue or .js files of the development environment.

Before sourcemap restoration

After sourcemap restoration

For an example, see noError: restore js with the sourcemap package

Other tags

Closing

Open-source monitoring SDK: React, mini programs, and more hooks will be supported

A SaaS service that is free to use may be released later

Look forward to the next article: Page performance monitoring practices
