Author: cjinhuo. Do not reproduce without authorization.
Background
Following the previous article in the front-end monitoring platform series, JS SDK (open source), this article covers the functional design and implementation of the server side
Technology stack
nestjs
Nestjs has good TS support, rich decorators, and an out-of-the-box dependency injection container
redis
redis.hash
Error reporting is a high-frequency operation, so querying the database on every report wastes resources. A hash is therefore used to store frequently needed data, such as the mapping between an apikey and its project. Each hash is kept separate from the others, so it plays the role of a namespace (note that the keys inside a hash cannot be given an expiration time)
redis.string
Used to store user ID information, project ID information, and other frequently used data. Remember to set an expiration time, otherwise data that has not been used for a long time will remain in Redis
redis.list
Used as a lightweight queue, similar to RabbitMQ, for batch computation and batch writes to the database, as described below
redis.bitmap
A bitmap is used to count how many items carry a given label. It is fast, supports high concurrency, and takes very little memory, because each label value is stored as a single binary bit. First, a reminder about bit, byte, and word.
Bit&Byte&Word
Bit = Binary digIT = 0 or 1
Byte = a sequence of 8 bits = 00000000, 00000001, … , or 11111111
Word = a sequence of N bits where N = 16, 32, 64 depending on the computer
Scenario 1: Suppose I need to record the check-ins of an entire company (100,000 employees) for all 365 days of the year:
- Use the database: create a table containing the time and the employee ID, and insert a row into it each time someone checks in. Disadvantages: if a large number of people check in at the same time, the database connection pool is exhausted and the data cannot be updated in real time. Advantages: detailed data is kept, so you can query who checked in at a given time on a given day.
- Use a bitmap, with each day's date as the key and the userId as the value: each userId occupies one binary bit in memory. With continuously incrementing userIds, the check-ins of 100,000 people take about 12,500 B ≈ 12 KB of memory (a hash table, which is equivalent to storing a word per user, needs at least 16 times as much). To determine whether a person has checked in, you only need to check whether the bit for that userId is set. Disadvantages: you need to build an extra map to hold the detailed data, such as the exact time a user checked in on a given day. Advantages: it runs in memory, so it is fast, supports high concurrency, and allows real-time queries, inserts, and updates.
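The bitmap approach in scenario 1 can be sketched in memory as follows. This is only an illustration of the idea, not the platform's code: the `Bitmap` class and its method names are hypothetical, and in Redis the same operations correspond to the SETBIT, GETBIT, and BITCOUNT commands.

```typescript
// One bit per userId; the whole day's check-ins live in a single byte array.
class Bitmap {
  private readonly bytes: Uint8Array;

  constructor(size: number) {
    // 8 users fit in one byte, rounded up.
    this.bytes = new Uint8Array(Math.ceil(size / 8));
  }

  // Mark userId as checked in. Setting the same bit twice changes nothing.
  set(userId: number): void {
    this.bytes[userId >> 3] |= 1 << (userId & 7);
  }

  // Has userId checked in?
  get(userId: number): boolean {
    return (this.bytes[userId >> 3] & (1 << (userId & 7))) !== 0;
  }

  // Number of set bits, i.e. how many users checked in.
  count(): number {
    let total = 0;
    for (let i = 0; i < this.bytes.length; i++) {
      let byte = this.bytes[i];
      while (byte !== 0) {
        byte &= byte - 1; // clear the lowest set bit
        total++;
      }
    }
    return total;
  }

  get byteLength(): number {
    return this.bytes.length;
  }
}

// One bitmap per day; 100,000 employees need 12,500 bytes.
const checkIns = new Bitmap(100000);
checkIns.set(42);
checkIns.set(99999);
checkIns.set(42); // a repeated check-in is deduplicated for free
```

The trade-off from the text is visible here: the bitmap answers "did user 42 check in?" and "how many checked in?" cheaply, but it holds no check-in time or any other detail.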
Scenario 2: Take the error tags in this system as an example. One project has multiple errors, and one error has multiple events (see the relationship between events and errors). Suppose you want to store the tag set of each error for tag visualization reports or for search later: which browser versions reported the error, the set of IPs, the set of custom tags, and so on. There will be at least 10 kinds of tags. Implemented with bitmaps:
Use one bitmap per tag set of each error. An error may be reported from many browser versions and many IP addresses, so one error may need 12 to 15 bitmaps, each holding one tag set. For example, there may be hundreds of thousands of IP addresses, which take up only a few KB of memory in a bitmap and are deduplicated automatically. However, a bitmap stores only bits, so another table has to be built just to map each bit in the bitmap to its actual value, such as the IP address. That is not easy to maintain, so mysql was chosen as the storage method
Conclusion:
- If you need to do a lot of operations on detailed data, it is recommended to put it into the database
- If real-time statistics are required and the data volume is large, you can use a bitmap to quickly look up, deduplicate, and delete data
Reference links:
- Comics: What is Bitmap?
- Storage scheme for user labels
mysql
Store result data for users, teams, projects, errors, error-level tags, and project-level tags
SLS instead of ES
For various reasons, Alibaba Cloud's Log Service (SLS) is used instead of Elasticsearch. However, SLS does not support updating data once it has been inserted, so some tables, such as the tag collection tables, are still kept in mysql
Features
Process overview
Table structure design
As shown in the figure above, there are 10 tables plus 1 SLS (log service) store. Here is what each table does and how the tables relate to each other
events
It is used to store user behavior, the error stack, tag information, and error information. Every reported error is eventually stored in events. Events are kept in SLS because SLS can run a multi-condition search over TB-scale data in a few hundred milliseconds.
Errors table
It is used to store the error status, the number of users affected by the error, the number of error events, the error level, the project ID of the error, the developer of the error, and so on
The relationship between event and Error
One error corresponds to multiple events. For example, two computers access the same page at the same time, and an interface on this page fails, say with a 500 internal server error. Mito-sdk generates an errorId with hashCode from the interface address, the error type (HTTP), the status code, and the request method. If these parameters are the same, the errorId generated by hashCode is also the same, so a single error is pushed to mysql while the two events are pushed to SLS separately. Although they belong to the same error, the tags of the two events differ: IP, browser version, and so on
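A minimal sketch of how such an errorId could be derived. The `hashCode` here is the common Java-style 32-bit string hash, chosen as an assumption for illustration; the hash that mito-sdk actually uses, as well as the `HttpErrorInfo` and `generateErrorId` names, may differ.

```typescript
interface HttpErrorInfo {
  url: string;        // interface address
  type: string;       // error type, e.g. "HTTP"
  statusCode: number; // HTTP status code
  method: string;     // request method
}

// Java-style 32-bit string hash (assumption, not necessarily the SDK's hash).
function hashCode(input: string): number {
  let hash = 0;
  for (let i = 0; i < input.length; i++) {
    // hash * 31 + charCode, truncated to a signed 32-bit integer
    hash = ((hash << 5) - hash + input.charCodeAt(i)) | 0;
  }
  return hash;
}

// Only the fields that identify the error are hashed; per-event data such as
// IP or browser version is left out on purpose.
function generateErrorId(info: HttpErrorInfo): number {
  return hashCode([info.type, info.url, info.statusCode, info.method].join("|"));
}

// Two computers hit the same failing interface: same error, two events.
const fromComputerA = generateErrorId({ url: "/api/user", type: "HTTP", statusCode: 500, method: "GET" });
const fromComputerB = generateErrorId({ url: "/api/user", type: "HTTP", statusCode: 500, method: "GET" });
const otherError = generateErrorId({ url: "/api/user", type: "HTTP", statusCode: 404, method: "GET" });
```

Because the tags are excluded from the hash input, events from different IPs and browsers collapse into one error row while each event keeps its own tags.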
Project_tag and error_tag tables
With the relationship between event and error in mind: the error_tag table collects the tag types and counts of all events under one error, giving the tag set of that error. The project_tag table collects the tag types and counts of all events under all errors of one project, and provides the drop-down data for multi-tag search
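The two levels of aggregation can be sketched like this. The in-memory maps stand in for the error_tag and project_tag tables, and all type and function names are hypothetical.

```typescript
type Tags = Record<string, string>;

interface TaggedEvent {
  projectId: string;
  errorId: string;
  tags: Tags;
}

// tag type -> tag value -> number of events carrying it
type TagCounts = Map<string, Map<string, number>>;

function addTags(counts: TagCounts, tags: Tags): void {
  for (const key of Object.keys(tags)) {
    const values = counts.get(key) ?? new Map<string, number>();
    values.set(tags[key], (values.get(tags[key]) ?? 0) + 1);
    counts.set(key, values);
  }
}

// Every event feeds both the error-level and the project-level counts.
function aggregateTags(events: TaggedEvent[]): { byError: Map<string, TagCounts>; byProject: Map<string, TagCounts> } {
  const byError = new Map<string, TagCounts>();
  const byProject = new Map<string, TagCounts>();
  for (const item of events) {
    if (!byError.has(item.errorId)) byError.set(item.errorId, new Map());
    if (!byProject.has(item.projectId)) byProject.set(item.projectId, new Map());
    addTags(byError.get(item.errorId)!, item.tags);
    addTags(byProject.get(item.projectId)!, item.tags);
  }
  return { byError, byProject };
}

const tagSummary = aggregateTags([
  { projectId: "p1", errorId: "errA", tags: { browser: "Chrome 90", ip: "1.1.1.1" } },
  { projectId: "p1", errorId: "errA", tags: { browser: "Chrome 90", ip: "2.2.2.2" } },
  { projectId: "p1", errorId: "errB", tags: { browser: "Safari 14", ip: "1.1.1.1" } },
]);
```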
Project table
Used to hold the project name, apikey (the link between SDK and project)
User_project table
The connection table used to associate the user table with the project table. It holds the user ID and the project ID
Team table
Store the team name and team notification mode
User_team table
The join table used to associate the user table with the team table. It holds the user ID and the team ID
Sourcemap table
Packaged .js files need their .map files to restore production code to the original source. This table stores the addresses of the .js and .map files
Collect table
The user ID and error ID are stored, indicating which errors the user has saved (bookmarked)
Error collecting
The SDK on the client is responsible for collecting errors and reporting them to the server, so the server must provide an interface that receives the error information. Because the SDK runs on every client, concurrency will be high. If each error were processed the moment it arrived, with the affected users and other data calculated and written to the database immediately, the database connection pool would be exhausted, and the server itself would gradually buckle as concurrency increased.
Error collecting
Error collection is handled by mito-SDK: the SDK collects error information on the client and, once the DSN is configured, reports it to the specified server
Bulk storage
Cache to redis
To relieve pressure on the server during peak periods, the information needed from each incoming event is first extracted and pushed onto the redis.list, and a scheduled task then consumes the redis.list gradually
Get error tags
- User's real IP: the Nginx reverse proxy adds it to the request header as the x-real-ip field (nginx configuration: proxy_set_header x-real-ip $remote_addr;). From the IP you can obtain the geographic location and the carrier
- From the user-agent field you can get the browser version, system version, device, and so on
- From the data reported by the SDK you can get the current error type, SDK version, custom tags, trackerId (unique user identifier), traceId (unique identifier of the interface request), and so on
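A rough sketch of pulling tags out of the request headers. The regular expressions below are deliberately simplistic and are an assumption for illustration; a real service would more likely use a dedicated user-agent parsing library, and the geographic lookup by IP is left out. `RequestHeaders`, `RequestTags`, and `extractTags` are hypothetical names.

```typescript
type RequestHeaders = Record<string, string | undefined>;

interface RequestTags {
  ip: string;
  browser: string;
  os: string;
}

function extractTags(headers: RequestHeaders): RequestTags {
  const ua = headers["user-agent"] ?? "";

  // Very small subset of browsers, for illustration only.
  const chrome = /Chrome\/(\d+)/.exec(ua);
  const firefox = /Firefox\/(\d+)/.exec(ua);
  let browser = "unknown";
  if (chrome) browser = `Chrome ${chrome[1]}`;
  else if (firefox) browser = `Firefox ${firefox[1]}`;

  let os = "unknown";
  if (/Windows NT/.test(ua)) os = "Windows";
  else if (/Android/.test(ua)) os = "Android";
  else if (/Mac OS X/.test(ua)) os = "macOS";

  // x-real-ip is the header set by the nginx reverse proxy.
  return { ip: headers["x-real-ip"] ?? "unknown", browser, os };
}

const chromeOnWindows = extractTags({
  "x-real-ip": "203.0.113.7",
  "user-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
});
```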
Collect error tags
There are so many tags, so why collect them? There are several benefits:
- Better search: several search conditions can be combined to filter errors more accurately, as shown in the figure below:
- More tag information in the error details (figure: error tag information)
- Better tag statistics
Batch update to the database
For example, a scheduled task that runs every 3 minutes takes the first 100 items from the redis.list. New errors may be pushed while the server is processing them, so the calculation needs to be run again at the end.
Purpose of the cache and the scheduled task: to keep the connection pool from being exhausted by frequent database reads and writes during peak periods. The batch size of the scheduled task can be increased as appropriate. For example, processing 10,000 items in each 3-minute run is more than enough.
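The buffering and batching idea can be sketched without Redis. In the sketch below an in-memory array stands in for the redis.list, and a batch is collapsed into one update per error, so the database sees one write per error instead of one per event. All names are hypothetical, and the scheduling itself (for example a 3-minute timer) is left out.

```typescript
interface QueuedEvent {
  errorId: string;
  trackerId: string; // unique user identifier
}

interface ErrorUpdate {
  errorId: string;
  eventCount: number;
  userCount: number;
}

// Stand-in for redis.list: push on arrival, take a batch from the front.
class EventQueue {
  private items: QueuedEvent[] = [];

  push(item: QueuedEvent): void {
    this.items.push(item);
  }

  // Remove and return at most `max` items from the front of the queue.
  takeBatch(max: number): QueuedEvent[] {
    return this.items.splice(0, max);
  }

  get size(): number {
    return this.items.length;
  }
}

// Collapse a batch into one update per error.
function aggregateBatch(batch: QueuedEvent[]): ErrorUpdate[] {
  const grouped = new Map<string, { eventCount: number; users: Set<string> }>();
  for (const item of batch) {
    let entry = grouped.get(item.errorId);
    if (!entry) {
      entry = { eventCount: 0, users: new Set<string>() };
      grouped.set(item.errorId, entry);
    }
    entry.eventCount++;
    entry.users.add(item.trackerId);
  }
  const updates: ErrorUpdate[] = [];
  grouped.forEach((entry, errorId) => {
    updates.push({ errorId, eventCount: entry.eventCount, userCount: entry.users.size });
  });
  return updates;
}

const pendingQueue = new EventQueue();
pendingQueue.push({ errorId: "errA", trackerId: "u1" });
pendingQueue.push({ errorId: "errA", trackerId: "u1" });
pendingQueue.push({ errorId: "errB", trackerId: "u2" });
pendingQueue.push({ errorId: "errA", trackerId: "u3" });
pendingQueue.push({ errorId: "errB", trackerId: "u2" });

// One scheduled run: take a batch of up to 3 items and aggregate it.
const batchUpdates = aggregateBatch(pendingQueue.takeBatch(3));
```

Note that `userCount` here is only distinct within one batch; counting affected users across batches needs a persistent set of trackerIds per error.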
Alarm rules and implementations
Once errors are collected, developers need to be notified so that they can resolve them promptly, whether by updating the error status or simply by being made aware that the error exists. Notifications must not be too frequent, however, so a set of alarm rules is needed, and the rules can be defined separately for each project.
Alarm rules
Take HTTP_ERROR as an example. The first occurrence is level P4, and no notification is sent at this point. As the number of events and the number of affected users grow, the error is upgraded step by step. When the count reaches 60, it is upgraded again and the responsible person is notified again. When the count reaches 500, it is upgraded to level P1. By then the number of events is large enough that we can assume the error must be fixed (or deliberately ignored). Of course, all of this can be customized, and each project should set its own levels according to the size of its user base
Different projects correspond to different levels
If project A has 50 daily active users and project B has 10,000 daily active users, the FetchRule needs to be adjusted for each project.
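A sketch of what per-project level rules might look like. The rule shape and every name below are hypothetical, not the platform's actual FetchRule. Only "P4 on the first error, no notification" and "P1 at 500 events" come from the text; the level assigned at 60 events and all of project A's thresholds are assumptions.

```typescript
type Level = "P4" | "P3" | "P2" | "P1";

interface LevelRule {
  minEvents: number; // the rule applies once the event count reaches this value
  level: Level;
  notify: boolean;
}

// Project B (10,000 daily active users): thresholds taken from the text where stated.
const projectBRules: LevelRule[] = [
  { minEvents: 1, level: "P4", notify: false },
  { minEvents: 60, level: "P2", notify: true }, // level at 60 is an assumption
  { minEvents: 500, level: "P1", notify: true },
];

// Project A (50 daily active users): much lower thresholds, all assumed.
const projectARules: LevelRule[] = [
  { minEvents: 1, level: "P4", notify: false },
  { minEvents: 5, level: "P2", notify: true },
  { minEvents: 20, level: "P1", notify: true },
];

// Pick the rule with the highest threshold that the count has reached.
function resolveLevel(eventCount: number, rules: LevelRule[]): LevelRule | undefined {
  let matched: LevelRule | undefined;
  for (const rule of rules) {
    if (eventCount >= rule.minEvents && (!matched || rule.minEvents > matched.minEvents)) {
      matched = rule;
    }
  }
  return matched;
}

// Notify only when the level changes, so the same level never alerts twice.
function shouldNotify(previousCount: number, currentCount: number, rules: LevelRule[]): boolean {
  const before = resolveLevel(previousCount, rules);
  const after = resolveLevel(currentCount, rules);
  return after !== undefined && after.notify && after.level !== before?.level;
}
```

With these rules, 20 events already mean P1 for project A but still P4 for project B, which is the point of keeping the thresholds per project.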
There are five types of error states
- Unresolved
When an error is first reported, its default status is Unresolved. It can be changed to: Resolving or Ignored
- Resolving
If you see an error on the monitoring platform and decide it needs to be fixed, but do not want to keep being notified about the same error, change its status to Resolving. You then have two hours to fix it. If the error is reported again during those two hours, the responsible person is not notified, and after two hours the server automatically changes the status to Resolved. If the error is triggered again after that, it is reopened and the corresponding developer is notified according to the normal alarm rules. The current status cannot be changed
- Ignored
If an error has reached level P1 but you do not want to fix it, or it does not need fixing at all, you can ignore it
Once an error is ignored, the responsible person is not notified even if it is raised to level P1. The current status can be changed to: Resolving
- Reopened
If an error has been changed to Resolved and is then reported again, its status changes to Reopened, and subsequent alarm notifications follow the normal procedure. The current status can be changed to: Resolving or Ignored
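The allowed status changes can be written down as a small transition table. This is a sketch with hypothetical names; the text does not say whether a Resolved error can be changed by hand, so the empty list for "resolved" is an assumption.

```typescript
type ErrorStatus = "unresolved" | "resolving" | "resolved" | "ignored" | "reopened";

// Manual transitions a user may make from each status.
const manualTransitions: Record<ErrorStatus, ErrorStatus[]> = {
  unresolved: ["resolving", "ignored"],
  resolving: [],                      // cannot be changed; the server resolves it after two hours
  resolved: [],                       // assumption: only a new report moves it on
  ignored: ["resolving"],
  reopened: ["resolving", "ignored"],
};

function canChange(from: ErrorStatus, to: ErrorStatus): boolean {
  return manualTransitions[from].indexOf(to) !== -1;
}

// Status after the same error is reported again.
function statusAfterReport(current: ErrorStatus): ErrorStatus {
  return current === "resolved" ? "reopened" : current;
}
```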
Implementation
When the status is changed to Resolving, how do we send no notification for two hours and then automatically change the status to Resolved?
A first idea is to set a 2-hour expiration in redis and add a setTimeout whose callback runs the follow-up logic after 2 hours. But there is a problem: if the server is redeployed, the setTimeout is lost. The fix is redis.hash:
Use redis.hash as a persistent cache, where the hash key is the error ID and the hash value is the expiration timestamp. After a release, the server re-reads the redis.hash and restarts the setTimeout for each entry. If the error is reported during the two hours while its status is Resolving, no alarm notification is sent
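A sketch of the "persist the deadline, re-arm the timer after restart" idea. A `Map` stands in for the redis.hash, and the scheduler is injected so the logic does not depend on real timers; `ResolvingTracker` and its methods are hypothetical names, not the platform's code.

```typescript
type Schedule = (delayMs: number, callback: () => void) => void;

const TWO_HOURS_MS = 2 * 60 * 60 * 1000;

class ResolvingTracker {
  private readonly store: Map<string, number>; // errorId -> expiration timestamp (ms)
  private readonly schedule: Schedule;
  private readonly onExpire: (errorId: string) => void;

  constructor(store: Map<string, number>, schedule: Schedule, onExpire: (errorId: string) => void) {
    this.store = store;
    this.schedule = schedule;
    this.onExpire = onExpire;
  }

  // Called when a user switches an error to "resolving".
  markResolving(errorId: string, now: number): void {
    const expiresAt = now + TWO_HOURS_MS;
    this.store.set(errorId, expiresAt);
    this.arm(errorId, expiresAt, now);
  }

  // Called once after a restart: re-arm a timer for every persisted deadline.
  restore(now: number): void {
    this.store.forEach((expiresAt, errorId) => this.arm(errorId, expiresAt, now));
  }

  // While muted, incoming reports for this error trigger no notification.
  isMuted(errorId: string, now: number): boolean {
    const expiresAt = this.store.get(errorId);
    return expiresAt !== undefined && now < expiresAt;
  }

  private arm(errorId: string, expiresAt: number, now: number): void {
    // Only the remaining time is scheduled, never the full two hours again.
    this.schedule(Math.max(0, expiresAt - now), () => {
      this.store.delete(errorId);
      this.onExpire(errorId); // e.g. set the status to "resolved"
    });
  }
}
```

In a running server the `schedule` argument could simply wrap `setTimeout`, for example `(ms, cb) => { setTimeout(cb, ms); }`, and the `Map` would be replaced by reads and writes against the Redis hash.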
Manual reporting: emergency notification
Mito-sdk supports manual data reporting, for example for an error caught in a try/catch block. An error reported this way triggers a notification every time it is passed in
Multiple tag search
To build the data for the tag drop-down box, the project_tag table is used. It collects all tags at the project level and is filled in batches as described above (batch update to the database), so the tag data differs each time you switch projects. Once the tags are selected, a search in SLS returns an array of errorIds. The errors table is then queried to filter, sort, and page by that errorId array, and the result is returned to the front end
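The two-step search can be sketched with in-memory arrays standing in for SLS and the errors table. The function and type names are hypothetical, and sorting by event count is just one example ordering.

```typescript
interface LoggedEvent {
  errorId: string;
  tags: Record<string, string>;
}

interface ErrorRow {
  errorId: string;
  eventCount: number;
}

// Step 1 (SLS in the article): errorIds whose events match every selected tag.
function searchErrorIds(log: LoggedEvent[], selected: Record<string, string>): string[] {
  const keys = Object.keys(selected);
  const ids = new Set<string>();
  for (const entry of log) {
    if (keys.every((key) => entry.tags[key] === selected[key])) ids.add(entry.errorId);
  }
  const result: string[] = [];
  ids.forEach((id) => result.push(id));
  return result;
}

// Step 2 (errors table in the article): filter by errorId, sort, and page.
function pageErrors(rows: ErrorRow[], ids: string[], page: number, pageSize: number): ErrorRow[] {
  return rows
    .filter((row) => ids.indexOf(row.errorId) !== -1)
    .sort((a, b) => b.eventCount - a.eventCount) // most frequent errors first
    .slice((page - 1) * pageSize, page * pageSize);
}

const eventLog: LoggedEvent[] = [
  { errorId: "e1", tags: { browser: "Chrome 90", os: "Windows" } },
  { errorId: "e2", tags: { browser: "Chrome 90", os: "macOS" } },
  { errorId: "e3", tags: { browser: "Firefox 88", os: "Windows" } },
  { errorId: "e1", tags: { browser: "Chrome 90", os: "macOS" } },
];
const errorRows: ErrorRow[] = [
  { errorId: "e1", eventCount: 10 },
  { errorId: "e2", eventCount: 50 },
  { errorId: "e3", eventCount: 5 },
];

const chromeErrorIds = searchErrorIds(eventLog, { browser: "Chrome 90" });
const firstPage = pageErrors(errorRows, chromeErrorIds, 1, 1);
```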
Error details (front end)
The front-end components below show some of the error details:
User behavior stack
The user behavior stack is the information collected by mito-SDK. You can configure the number of stacks. The main function is to view the context of an error, such as an interface error, and to see what the user did before the trigger.
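A bounded behavior stack of this kind can be sketched as follows. This illustrates the idea of a configurable stack size only; the class, its field names, and the `maxSize` option are hypothetical and are not mito-SDK's actual API.

```typescript
interface Breadcrumb {
  type: string;    // e.g. "click", "route", "xhr"
  message: string;
  time: number;    // timestamp in ms
}

class BreadcrumbStack {
  private items: Breadcrumb[] = [];
  private readonly maxSize: number;

  constructor(maxSize: number) {
    this.maxSize = maxSize;
  }

  // Record a user action; once the stack is full the oldest entry is dropped.
  push(item: Breadcrumb): void {
    this.items.push(item);
    if (this.items.length > this.maxSize) this.items.shift();
  }

  // Copy of the current stack, attached to an error when it is reported.
  snapshot(): Breadcrumb[] {
    return this.items.slice();
  }
}

const behaviorStack = new BreadcrumbStack(2);
behaviorStack.push({ type: "route", message: "/home -> /profile", time: 1 });
behaviorStack.push({ type: "click", message: "button#save", time: 2 });
behaviorStack.push({ type: "xhr", message: "POST /api/user 500", time: 3 });
```

Keeping only the most recent entries bounds the size of each report while still showing what the user did just before the error.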
Sourcemap restoration
Use source-map to restore the packaged JS files to the original .vue or .js files from the development environment.
For an example, see noError: restore js with the sourcemap package
Other tags
Closing
Open-source monitoring SDK: support for React, mini programs, and more hooks is planned
A SaaS version may be released later, with the service free to use
Look forward to the next article: Page performance monitoring practices