Foreword
This article comes from Yu Fei on our team, who introduces our self-developed error monitoring platform. Corrections and criticism from the experts are welcome.
I. Background
Pain points
One day, product says: "Advertiser XXX reports that registration on our page is broken!" Another day, operations says: "This campaign page is down on media XXX!"
Our company serves advertising pages at a scale of nearly 100 million. If they run "naked" online, we know nothing about the problems they hit. Problems are discovered on the business side first and we only hear about them when the business team asks, which is very embarrassing.
Build or buy
The company has four business divisions, and each division has no fewer than three projects, so there are at least 12 projects. Keep in mind that there are many business lines; this matters later.
We can build the platform ourselves or use a third-party service. Here is how several common third-party services compare.
- Fundebug: the paid version starts at 159 yuan/month and the data is stored by the third party; keeping the data ourselves costs 300,000 yuan/year, which is expensive.
- FrontJS: the advanced edition is 899/month and the professional edition is 2,999/month.
- Sentry: $80/month.
Take Sentry as an example: for 12 projects it comes to nearly 100,000 a year. We roughly estimated that an MVP would take 2 people 1.5 months, that is 90 person-days. At a salary of 15,000 per person per month, the total cost is 45,000, and it is a one-time cost.
So from a cost perspective we chose to build it ourselves. There are other reasons besides cost. For example, we want custom features on top of the system: integrating it with the company's permission and user system, managing to-do items per user, ranking users by number of errors, and so on.
For the security of business data, we also want to host the system ourselves.
Therefore, considering cost, security, and extensibility, we chose to develop the platform in-house.
II. Product design
What kind of product do we want? Starting from first principles, the key problem to solve is "how to locate the problem". What information do we need, following 5W1H?
Error information
Error monitoring can be summed up in one sentence: collect page errors, report them, and then analyze the symptoms.
Analyzing this sentence with the 5W1H method, we find several points that need attention.
- What: what error occurred, such as a logic error, data error, network error, or syntax error.
- When: the time of occurrence, such as a timestamp.
- Who: which users are affected, including the number of error events, IP, and device information.
- Where: which pages are affected, including the page, the advertising slot (specific to our company), and the media (specific to our company).
- Why: the cause of the error, including the error stack, line and column, and SourceMap.
- How: how to locate and solve the problem, for which we also need to collect system information.
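The 5W1H breakdown above can be read as the shape of a single error report. The sketch below is only illustrative: the helper `buildErrorReport` and every field name in it are assumptions made for this example, not the platform's actual schema.

```javascript
// Hypothetical helper: maps the 5W1H questions onto one report object.
// All field names are illustrative assumptions.
function buildErrorReport(error, context) {
  return {
    // What: the kind of error and its message
    type: error.name || 'Error',
    message: String(error.message || ''),
    // When: timestamp in milliseconds
    time: context.now,
    // Who: the affected user and device
    userId: context.userId,
    ua: context.ua,
    // Where: page plus business dimensions
    page: context.page,
    slotId: context.slotId,
    mediaId: context.mediaId,
    // Why: stack and position, resolved later through the SourceMap
    stack: error.stack || '',
    lineno: context.lineno,
    colno: context.colno,
    // How: environment information that helps reproduce the problem
    system: context.system,
    network: context.network,
  };
}

const report = buildErrorReport(new TypeError('x is not a function'), {
  now: Date.now(), userId: 'u-1', ua: 'test-agent', page: '/activity',
  slotId: 's-1', mediaId: 'm-1', lineno: 10, colno: 5,
  system: 'iOS', network: '4g',
});
console.log(report.type, report.message);
```

Keeping the six groups visible in one object makes it easier to check, field by field, that each question can actually be answered from a report.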
Architectural layers
First of all, we need to sort out what features we need.
How do we obtain the information above and ultimately locate the error?
First, we need to collect errors. How do we learn about errors on a page on the user's device? They have to be reported. So the first layer appears: we need a collection and reporting end.
How are they reported? Anyone who has worked with a backend knows that you need an interface. You need a server that receives the reported errors, then filters and aggregates them. That gives us the second layer: a collection and aggregation end.
Once enough raw information has been collected, how do we use it? We need to organize it according to our own rules. Writing SQL-like queries every time would be very inefficient, so we need a visual platform to display it. Hence the third layer: the visual analysis end.
It feels like we are done with the error monitoring platform. But then you notice something: every time we release, and for a while afterwards, developers keep staring at the screen. Is this a human-eye monitoring mode? We should solve this with code, and so the fourth layer naturally appears: the monitoring and alarm end.
To sum up, we need four ends: collection and reporting, collection and aggregation, visual analysis, and monitoring and alarm.
III. System design
Like functions, each stage defines its input and output, and the core capability it must handle.
Now let's see how to implement the four ends mentioned above.
Collection and reporting end (SDK)
The input is every kind of error, and the output is errors captured and reported. The core is collecting the different types of errors; the rest is non-core but still necessary work.
Error types
Let's take a look at what types of errors we need to handle.
Common JS execution errors
- SyntaxError
A syntax error raised while the code is being parsed.

    // Run in the console
    const xx,

window.onerror does not catch a SyntaxError, but syntax errors are usually found at build time or even during local development.
- TypeError
A value is not of the expected type.

    // Run in the console
    const person = void 0
    person.name

- ReferenceError
A reference to an undeclared variable.

    // Run in the console
    nodefined

- RangeError
A value is not in its permitted range or set.

    // Run in the console
    (function fn() { fn() })()

Network errors
- ResourceError
A resource failed to load.

    new Image().src = '/remote/image/notdeinfed.png'

- HttpError
An HTTP request failed.

    // Run in the console
    fetch('/remote/notdefined', {})

Collecting errors
Every failure shows up as an error, so how do we catch errors?
try/catch
Ordinary runtime errors are caught, but syntax errors and asynchronous errors are not.

    // Normal runtime error: caught
    try {
      console.log(notdefined);
    } catch (e) {
      console.log('Exception caught:', e);
    }

    // Syntax error: not caught
    try {
      const notdefined,
    } catch (e) {
      console.log('Exception caught:', e);
    }

    // Async error: not caught
    try {
      setTimeout(() => {
        console.log(notdefined);
      }, 0);
    } catch (e) {
      console.log('Exception caught:', e);
    }

try/catch allows fine-grained handling, but its shortcomings are obvious: as shown above, it misses syntax errors and asynchronous errors.
window.onerror
window.onerror collects pure JS errors. When a JS runtime error occurs, window fires an error event that implements the ErrorEvent interface.

    /**
     * @param {String} message Error message
     * @param {String} source  File in which the error occurred
     * @param {Number} lineno  Line number
     * @param {Number} colno   Column number
     * @param {Object} error   Error object
     */
    window.onerror = function (message, source, lineno, colno, error) {
      console.log('Exception caught:', { message, source, lineno, colno, error });
    };

Let's verify which of the following errors can be caught.

    // Normal runtime error: caught
    window.onerror = function (message, source, lineno, colno, error) {
      console.log('Exception caught:', { message, source, lineno, colno, error });
    };
    console.log(notdefined);

    // Syntax error: not caught
    window.onerror = function (message, source, lineno, colno, error) {
      console.log('Exception caught:', { message, source, lineno, colno, error });
    };
    const notdefined,

    // Async error: caught
    window.onerror = function (message, source, lineno, colno, error) {
      console.log('Exception caught:', { message, source, lineno, colno, error });
    };
    setTimeout(() => {
      console.log(notdefined);
    }, 0);

    // Resource error: not caught
    <script>
      window.onerror = function (message, source, lineno, colno, error) {
        console.log('Exception caught:', { message, source, lineno, colno, error });
        return true;
      };
    </script>
    <img src="https://yun.tuia.cn/image/kkk.png">

So how do we catch the resource errors that window.onerror misses?
window.addEventListener
When a resource (such as an image or script) fails to load, the element that loaded it fires an error event that uses the Event interface. These error events do not bubble up to window, but they can be intercepted in the capture phase, which window.onerror cannot do.

    // Image, script, and CSS loading errors: caught
    <script>
      window.addEventListener('error', (error) => {
        console.log('Exception caught:', error);
      }, true);
    </script>
    <img src="https://yun.tuia.cn/image/kkk.png">
    <script src="https://yun.tuia.cn/foundnull.js"></script>
    <link href="https://yun.tuia.cn/foundnull.css" rel="stylesheet"/>

    // new Image error: not caught
    <script>
      window.addEventListener('error', (error) => {
        console.log('Exception caught:', error);
      }, true);
    </script>
    <script>
      new Image().src = 'https://yun.tuia.cn/image/lll.png'
    </script>

    // fetch error: not caught
    <script>
      window.addEventListener('error', (error) => {
        console.log('Exception caught:', error);
      }, true);
    </script>
    <script>
      fetch('https://tuia.cn/test')
    </script>

new Image is rarely used, and code that uses it can handle its own errors.
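A capture-phase listener like the one above receives both resource-load failures and ordinary runtime errors, so the SDK has to tell them apart before reporting. The following is a minimal sketch of that distinction; the helper `classifyErrorEvent`, its output fields, and the `report` function mentioned in the comment are assumptions for illustration.

```javascript
// Hypothetical helper: turns the event received by a capture-phase 'error'
// listener into a plain object. `win` is the window the listener is attached to.
function classifyErrorEvent(event, win) {
  const target = event.target;
  // Resource failures are dispatched on the element (img, script, link...)
  if (target && target !== win && target.tagName) {
    return {
      kind: 'resource',
      tag: String(target.tagName).toLowerCase(),
      url: target.src || target.href || '',
    };
  }
  // Runtime errors are dispatched on window as an ErrorEvent
  return {
    kind: 'runtime',
    message: event.message,
    source: event.filename,
    lineno: event.lineno,
    colno: event.colno,
  };
}

// In a browser this would be wired up roughly as:
// window.addEventListener('error', (e) => report(classifyErrorEvent(e, window)), true);

// Stand-in objects so the sketch can also run outside a browser
const fakeWindow = {};
const resourceResult = classifyErrorEvent(
  { target: { tagName: 'IMG', src: 'https://yun.tuia.cn/image/kkk.png' } },
  fakeWindow
);
const runtimeResult = classifyErrorEvent(
  { target: fakeWindow, message: 'notdefined is not defined', filename: 'a.js', lineno: 1, colno: 13 },
  fakeWindow
);
console.log(resourceResult.kind, runtimeResult.kind);
```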
But what about fetch, which is used everywhere? fetch returns a Promise, and a Promise's error is not caught this way.
Promise error
- Common Promise error
try/catch cannot catch an error thrown inside a Promise.

    // try/catch cannot handle the error from JSON.parse because it is thrown inside a Promise
    try {
      new Promise((resolve, reject) => {
        JSON.parse('');
        resolve();
      });
    } catch (err) {
      console.error('in try catch', err);
    }

    // The catch method is required
    new Promise((resolve, reject) => {
      JSON.parse('');
      resolve();
    }).catch((err) => {
      console.log('in catch fn', err);
    });

- Async error
try/catch cannot catch an error thrown inside an async function unless the call is awaited.

    const getJSON = async () => { throw new Error('inner error'); };
    // The rejection propagates to whoever awaits makeRequest()
    const makeRequest = async () => {
      return JSON.parse(await getJSON());
    };

    try {
      // Not caught: makeRequest() is not awaited, so the rejection escapes the try/catch
      makeRequest();
    } catch (err) {
      console.error('in try catch', err);
    }

    try {
      // Caught: the call is awaited (inside an async function or an ES module)
      await makeRequest();
    } catch (err) {
      console.error('in try catch', err);
    }

- Chunk import error
import() returns a Promise, so the error can be caught in two ways.

    // Promise catch method
    import(/* webpackChunkName: "incentive" */ './index').then((module) => {
      module.default();
    }).catch((err) => {
      console.error('in catch fn', err);
    });

    // await with try/catch
    try {
      const module = await import(/* webpackChunkName: "incentive" */ './index');
      module.default();
    } catch (err) {
      console.error('in try catch', err);
    }

Summary: catching Promise errors globally
All three cases above boil down to Promise errors, which can be caught with unhandledrejection.

    // Handle Promise errors globally
    window.addEventListener('unhandledrejection', function (e) {
      console.log('Exception caught:', e);
    });
    fetch('https://tuia.cn/test')

To avoid missing Promise exceptions, use unhandledrejection to listen globally for uncaught Promise errors.
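The value delivered by unhandledrejection (the event's `reason`) can be an Error, a string, or any other value passed to `reject()`, so it usually needs normalizing before it is reported. A minimal sketch, assuming a hypothetical helper `normalizeRejection` and a made-up report shape:

```javascript
// Hypothetical helper: `reason` is PromiseRejectionEvent.reason, which may be
// an Error, a string, or any other value passed to reject().
function normalizeRejection(reason) {
  if (reason instanceof Error) {
    return { type: reason.name, message: reason.message, stack: reason.stack || '' };
  }
  let message;
  try {
    message = typeof reason === 'string' ? reason : JSON.stringify(reason);
  } catch (e) {
    // For example, circular structures cannot be serialized
    message = String(reason);
  }
  return { type: 'UnhandledRejection', message: String(message), stack: '' };
}

// In a browser (report is a hypothetical reporting function):
// window.addEventListener('unhandledrejection', (e) => report(normalizeRejection(e.reason)));

const fromError = normalizeRejection(new TypeError('Failed to fetch'));
const fromValue = normalizeRejection({ code: 500 });
console.log(fromError.type, fromValue.message);
```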
Vue error
Because Vue catches all errors thrown in single-file components and in code that extends Vue.extend, errors inside Vue do not reach window.onerror directly; they are passed to Vue.config.errorHandler instead.

    /**
     * Catch Vue errors globally and re-throw them so that window.onerror handles them
     */
    Vue.config.errorHandler = function (err) {
      setTimeout(() => {
        throw err;
      });
    };

React error
React declares an error boundary component with componentDidCatch.

    class ErrorBoundary extends React.Component {
      constructor(props) {
        super(props);
        this.state = { hasError: false };
      }
      static getDerivedStateFromError(error) {
        // Update state so that the next render shows the fallback UI
        return { hasError: true };
      }
      componentDidCatch(error, errorInfo) {
        // You can also report the error log to the server
        logErrorToMyService(error, errorInfo);
      }
      render() {
        if (this.state.hasError) {
          // You can render any custom fallback UI
          return <h1>Something went wrong.</h1>;
        }
        return this.props.children;
      }
    }

    class App extends React.Component {
      render() {
        return (
          <ErrorBoundary>
            <MyWidget />
          </ErrorBoundary>
        );
      }
    }

However, error boundaries do not catch errors in React event handlers, errors in asynchronous code, or errors thrown by the error boundary itself.
Cross-domain problem
In general, if the error you receive is just "Script error.", it is almost certainly a cross-domain problem.
If the current page and the remote JS are on different domains and the remote JS throws, window.onerror only reports "Script error.". It can be solved in the following two ways.
- Configure Access-Control-Allow-Origin on the back end and add crossorigin to the script on the front end.

    <script src="http://yun.tuia.cn/test.js" crossorigin></script>

    const script = document.createElement('script');
    script.crossOrigin = 'anonymous';
    script.src = 'http://yun.tuia.cn/test.js';
    document.body.appendChild(script);

- If you cannot modify the server's response headers, consider a try/catch workaround that re-throws the error.

    <!doctype html>
    <html>
    <head>
      <title>Test page in http://test.com</title>
    </head>
    <body>
      <script src="https://yun.dui88.com/tuia/cdn/remote/testerror.js"></script>
      <script>
        window.onerror = function (message, url, line, column, error) {
          console.log(message, url, line, column, error);
        }
        try {
          foo(); // Call the foo method defined in testerror.js
        } catch (e) {
          throw e;
        }
      </script>
    </body>
    </html>

Without the try/catch, console.log prints only "Script error."; with the try/catch, the real error is captured.
Let's look at the scenarios. Remote JS is usually involved in one of three common ways.
- An error occurs when calling a method of the remote JS
- An error occurs in an event handler inside the remote JS
- An error occurs in a callback such as setTimeout
Method call scenario
By wrapping the original method in a function, every call runs inside try/catch.

    <!doctype html>
    <html>
    <head>
      <title>Test page in http://test.com</title>
    </head>
    <body>
      <script src="https://yun.dui88.com/tuia/cdn/remote/testerror.js"></script>
      <script>
        window.onerror = function (message, url, line, column, error) {
          console.log(message, url, line, column, error);
        }
        function wrapErrors(fn) {
          // don't wrap function more than once
          if (!fn.__wrapped__) {
            fn.__wrapped__ = function () {
              try {
                return fn.apply(this, arguments);
              } catch (e) {
                throw e; // re-throw the error
              }
            };
          }
          return fn.__wrapped__;
        }
        wrapErrors(foo)()
      </script>
    </body>
    </html>

Try removing wrapErrors to see the difference.
Event scenario
The native method can be hijacked.

    <!doctype html>
    <html>
    <head>
      <title>Test page in http://test.com</title>
    </head>
    <body>
      <script>
        const originAddEventListener = EventTarget.prototype.addEventListener;
        EventTarget.prototype.addEventListener = function (type, listener, options) {
          const wrappedListener = function (...args) {
            try {
              return listener.apply(this, args);
            } catch (err) {
              throw err;
            }
          };
          return originAddEventListener.call(this, type, wrappedListener, options);
        };
      </script>
      <div style="height: 9999px;">http://test.com</div>
      <script src="https://yun.dui88.com/tuia/cdn/remote/error_scroll.js"></script>
      <script>
        window.onerror = function (message, url, line, column, error) {
          console.log(message, url, line, column, error);
        }
      </script>
    </body>
    </html>

Try removing the code that wraps EventTarget.prototype.addEventListener to see the difference.
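The third scenario, an error inside a callback such as setTimeout, can be handled with the same wrapping idea. The sketch below is one possible way to do it, not the article's own code: `wrapTimer` and `onError` are hypothetical names, and a stand-in object replaces window so the behaviour can be checked outside a browser.

```javascript
// Hypothetical helper: replaces target.setTimeout so that every callback runs
// inside try/catch. `onError` is called before the error is re-thrown.
function wrapTimer(target, onError) {
  const originSetTimeout = target.setTimeout;
  target.setTimeout = function (callback, delay, ...args) {
    // Strings and other non-function handlers are passed through untouched
    if (typeof callback !== 'function') {
      return originSetTimeout.call(target, callback, delay, ...args);
    }
    const wrapped = function (...cbArgs) {
      try {
        return callback.apply(this, cbArgs);
      } catch (err) {
        onError(err);
        throw err; // re-throw so that a global handler still receives it
      }
    };
    return originSetTimeout.call(target, wrapped, delay, ...args);
  };
}

// Stand-in for window with a scheduler that runs callbacks immediately,
// which keeps the demonstration synchronous.
const fakeWindow = {
  setTimeout(fn, delay, ...args) { fn(...args); return 1; },
};

const seen = [];
wrapTimer(fakeWindow, (err) => seen.push(err.message));
try {
  fakeWindow.setTimeout(() => { throw new Error('boom in callback'); }, 0);
} catch (err) {
  // The re-thrown error surfaces here only because the fake scheduler is synchronous
}
console.log(seen);
```

In a page this would be called as `wrapTimer(window, handler)` before the remote script is loaded, so that callbacks registered by that script are wrapped.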
Reporting interface
Why not report directly with a GET/POST/HEAD request to an interface?
The reason is easy to guess. The tracking domain is usually not the current domain, so every interface request would be cross-domain.
Why not request some other file resource (JS/CSS/TTF)?
For those resources, after the node is created the browser does not actually send the request until the object is inserted into the DOM tree. Loading JS/CSS resources can also block page rendering and hurt the user experience.
An image beacon does not need to be inserted into the DOM: as soon as an Image object is created in JS the request is sent, and nothing is blocked. In a browser environment without JS, a plain img tag can still be used for tracking.
So we report through new Image. One question remains: among image formats, why a 1×1 transparent GIF rather than PNG/JPEG/BMP?
First, 1×1 pixel is the smallest legal image. Second, because the report goes through an image, the image should be transparent so that it does not affect how the page is displayed; a transparent image only needs one bit to mark the transparent color and stores no color data, which saves space. Since transparency is required, JPEG is ruled out directly.
For the same response, GIF saves 41% of the traffic compared with BMP and 35% compared with PNG, so GIF is the best choice.
- It works across domains
- No cookies
- No need to wait for the server to return data
- Use a 1×1 GIF
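The image-based reporting described above can be sketched as follows. The endpoint URL, the `_t` cache-busting parameter, and the function names are assumptions for illustration; only the `new Image().src = ...` technique comes from the text.

```javascript
// Hypothetical endpoint, for illustration only.
const ENDPOINT = 'https://log.example.com/r.gif';

// Serializes the payload into the query string of the GIF request.
function buildBeaconUrl(endpoint, payload) {
  const params = new URLSearchParams();
  Object.keys(payload).forEach((key) => {
    const value = payload[key];
    if (value !== undefined && value !== null) {
      params.append(key, String(value));
    }
  });
  // A timestamp makes each URL unique so repeated reports are not served from cache
  params.append('_t', String(Date.now()));
  return endpoint + '?' + params.toString();
}

function sendByImage(endpoint, payload) {
  const url = buildBeaconUrl(endpoint, payload);
  // Image only exists in browsers; elsewhere the URL is just returned
  if (typeof Image !== 'undefined') {
    new Image().src = url; // no DOM insertion needed, the request starts immediately
  }
  return url;
}

const beaconUrl = sendByImage(ENDPOINT, { message: 'x is not defined', lineno: 3, colno: undefined });
console.log(beaconUrl);
```

Because everything travels in the query string, the payload has to stay small; long stacks would need truncating before being passed in.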
Non-blocking loading
Try to keep the loading of the SDK's JS from affecting the page.
Errors are first cached by a global error listener, the SDK is then loaded asynchronously, and the SDK processes and reports the cached errors.

    <!DOCTYPE html>
    <html lang="en">
    <head>
      <script>
        (function (w) {
          w._error_storage_ = [];
          function errorhandler() {
            // Record the current error
            w._error_storage_ && w._error_storage_.push([].slice.call(arguments));
          }
          w.addEventListener && w.addEventListener("error", errorhandler, true);
          var times = 3,
            appendScript = function appendScript() {
              var sc = document.createElement("script");
              sc.async = true;
              sc.src = './build/skyeye.js'; // Depends on where you put it
              sc.crossOrigin = "anonymous";
              sc.onerror = function () {
                // Retry loading the SDK up to three times
                times--;
                times > 0 && setTimeout(appendScript, 1500);
              };
              document.head && document.head.appendChild(sc);
            };
          setTimeout(appendScript, 1500);
        })(window);
      </script>
    </head>
    <body>
      <h1>This is a test page (new)</h1>
    </body>
    </html>

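Once the SDK has loaded, it needs to drain the errors cached in `_error_storage_` by the inline script. The article does not show that part, so the following is only a sketch of what it might look like; `flushErrorStorage` and `report` are hypothetical names.

```javascript
// Hypothetical SDK-side helper. `w` is the window object and `report` is the
// SDK's own reporting function; both names are assumptions.
function flushErrorStorage(w, report) {
  const cached = Array.isArray(w._error_storage_) ? w._error_storage_ : [];
  // Setting the queue to null makes the inline handler stop caching,
  // because it checks `w._error_storage_ &&` before pushing
  w._error_storage_ = null;
  cached.forEach((args) => report.apply(null, args));
  return cached.length;
}

// Stand-in for window with two errors cached before the SDK loaded.
// Each entry is the arguments array stored by the inline handler.
const fakeWindow = {
  _error_storage_: [[{ message: 'first' }], [{ message: 'second' }]],
};
const reported = [];
const flushed = flushErrorStorage(fakeWindow, (event) => reported.push(event.message));
console.log(flushed, reported);
```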
Collection and aggregation end (log server)
At this stage, the input is the error records received by the interface, and the output is valid data written to storage. One core function is data cleaning, which also keeps the service from being overloaded. The other core function is storing the data.
The overall flow is: error identification -> error filtering -> error receiving -> error storage.
Error identification (with SDK)
Before aggregating, we need to identify errors along different dimensions, that is, to locate a single error entry and a single error event.
Single error entry
Generate an ID for each error entry from the date and a random value.

    const errorKey = `${+new Date()}@${randomString(8)}`
    function randomString(len) {
      len = len || 32;
      let chars = 'ABCDEFGHJKMNPQRSTWXYZabcdefhijkmnprstwxyz2345678';
      let maxPos = chars.length;
      let pwd = '';
      for (let i = 0; i < len; i++) {
        pwd += chars.charAt(Math.floor(Math.random() * maxPos));
      }
      return pwd;
    }

Single error event
First, we need to be able to locate the same error event (different users, same error type and error message).
By summing the character codes of message, colno, and lineno, we can generate the eventKey of the error.

    const eventKey = compressString(String(e.message), String(e.colno) + String(e.lineno));
    function compressString(str, key) {
      let chars = 'ABCDEFGHJKMNPQRSTWXYZ';
      if (!str || !key) {
        return 'null';
      }
      let n = 0,
        m = 0;
      for (let i = 0; i < str.length; i++) {
        n += str[i].charCodeAt();
      }
      for (let j = 0; j < key.length; j++) {
        m += key[j].charCodeAt();
      }
      let num = n + '' + key[key.length - 1].charCodeAt() + m + str[str.length - 1].charCodeAt();
      if (num) {
        num = num + chars[num[num.length - 1]];
      }
      return num;
    }

In the event list, each error event groups the actual error entries that belong to it.
Error filtering (with SDK)
Domain filtering
Filter out "Script error" raised on the page itself, which may come from other JS injected by the WebView.
We only care about problems in our own remote JS, so we filter by the company domain name.

    // Pseudocode
    if (!e.filename || !e.filename.match(/^(http|https):\/\/yun./)) return true

Repeated reports
How do we avoid reporting the same data over and over? The error's key is cached, and once the same error has been reported more than a threshold number of times, further reports are skipped.

    // Pseudocode: storeStorage, has, and getItem are assumed in-memory helpers
    const TIMES = 6; // Default number of reports allowed per key
    const storeStorage = {};
    const has = (key) => Object.prototype.hasOwnProperty.call(storeStorage, key);
    const getItem = (key) => storeStorage[key].value;
    // Returns true when the key has reached the limit and the report should be skipped
    export function setItem(key, repeat = TIMES) {
      if (!key) {
        key = 'unknown';
      }
      if (has(key)) {
        const value = getItem(key);
        // Core logic: skip once the count reaches the limit
        if (value >= repeat) {
          return true;
        }
        storeStorage[key] = {
          value: value + 1,
          time: Date.now()
        };
      } else {
        storeStorage[key] = {
          value: 1,
          time: Date.now()
        };
      }
      return false;
    }

Error receiving
When handling the receiving interface, pay attention to traffic control. This is where back-end development spends most of its effort: coping with highly concurrent traffic.
Error logging
The receiver uses Koa, with a simple implementation that receives the report and writes it to disk.

    // Pseudocode
    module.exports = async ctx => {
      const { query } = ctx.request;
      // Do a simple check of the fields
      check(['mobile', 'network', 'ip', 'system', 'ua' /* , ... */], query);
      ctx.type = 'application/json';
      ctx.body = { code: '1', msg: 'Data reported successfully' };
      // Write the log to disk here, using whichever log library you choose
    };

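The `check` call in the handler above is not shown in the article. One way such a field check could look is sketched below; the implementation, the helper `findMissingFields`, and the error message are assumptions.

```javascript
// Hypothetical implementation of the field check; the real one is not shown
// in the article. Returns the names of required fields that are missing or empty.
function findMissingFields(required, query) {
  return required.filter((name) => {
    const value = query[name];
    return value === undefined || value === null || value === '';
  });
}

// Throws when any required field is absent, so the handler can reject the report
function check(required, query) {
  const missing = findMissingFields(required, query);
  if (missing.length > 0) {
    throw new Error('Missing fields: ' + missing.join(', '));
  }
}

const query = { mobile: 'iPhone', network: '4g', ip: '', system: 'iOS' };
console.log(findMissingFields(['mobile', 'network', 'ip', 'system', 'ua'], query));
```

Rejecting incomplete reports at the door keeps empty or malformed entries out of the disk logs.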
Peak clipping mechanism
For example, set a limit of 2,000 requests per second, decrement the remaining quota on each request, and reset the quota periodically.

    // Pseudocode
    // 1000 ms
    const TICK = 1000;
    // Upper limit of 2000 per second
    const MAX_LIMIT = 2000;
    // Remaining number of requests this server accepts in the current tick
    let maxLimit = MAX_LIMIT;
    /**
     * Start the periodic reset
     */
    const task = () => {
      setTimeout(() => {
        maxLimit = MAX_LIMIT;
        task();
      }, TICK);
    };
    task();
    const check = () => {
      if (maxLimit <= 0) {
        throw new Error('Report limit exceeded');
      }
      maxLimit--;
      // Execute the business code...
    };

Sampling
If the threshold is exceeded, reports can be sampled.

    // Collect only 20%
    if (Math.random() < 0.2) {
      collect(data); // Record the error message
    }

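The two snippets above can be combined so that sampling only kicks in after the per-tick quota is used up. The following is a sketch of that combination under stated assumptions: `createLimiter` and its options are hypothetical, and the random source is injectable purely to make the behaviour checkable.

```javascript
// Hypothetical limiter: under the limit everything is accepted, above it only a
// sample is kept. `random` is injectable so the behaviour can be tested.
function createLimiter({ maxPerTick, sampleRate, random = Math.random }) {
  let remaining = maxPerTick;
  return {
    // Call once per tick (for example from a timer) to refill the quota
    reset() { remaining = maxPerTick; },
    // Returns true when the report should be recorded
    accept() {
      if (remaining > 0) {
        remaining--;
        return true;
      }
      return random() < sampleRate;
    },
  };
}

// With a fixed random value of 0.5 and a 20% sample rate, nothing above the
// quota is accepted until the quota is reset.
const limiter = createLimiter({ maxPerTick: 2, sampleRate: 0.2, random: () => 0.5 });
const decisions = [limiter.accept(), limiter.accept(), limiter.accept()];
limiter.reset();
decisions.push(limiter.accept());
console.log(decisions);
```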
Error storage
How do we aggregate the logs written to disk? We need a storage solution.
Once a storage solution is chosen and configured, it periodically pulls the data from disk.
We compared the commonly used solutions: Alibaba Cloud Log Service (SLS), ELK (Elasticsearch, Logstash, Kibana), and Hadoop/Hive (store the data in Hadoop and query it with Hive).
After comparing them on the following aspects we chose Log Service, mainly because it needs no setup, costs little, and its query capabilities meet our needs.
Feature | ELK-style systems | Hadoop + Hive | Log Service |
---|---|---|---|
Log delay | 1 to 60 seconds | Minutes to hours | Real time |
Query delay | Less than 1 second | Minutes | Less than 1 second |
Query capability | Good | Good | Good |
Scalability | Machines must be prepared in advance | Machines must be prepared in advance | 10x expansion within seconds |
Cost | Higher | Lower | Very low |
Log delay: how long after a log is generated it can be queried. Query delay: the amount of data scanned per unit of time. Query capability: keyword query, combined-condition query, fuzzy query, numerical comparison, context query. Scalability: rapid response to a hundredfold increase in traffic. Cost: cost per GB.
See the Log Service documentation for API usage.
Visual Analysis side (visual platform)
Main functions
This part is mainly about designing the product features sensibly, keeping it small and elegant. For the aggregation details, refer to Alibaba Cloud SLS.
- The home page chart can cover one day, four hours, or one hour. For the one-day view, the aggregated error count is split into 24 buckets.
- The home page list aggregates the data within the selected time range and shows the error file, error key, number of events, error type, time, and error message.
- The error details page shows the event list, basic information, device information, and a device ratio chart (see the event list above).
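The 24-bucket aggregation behind the home page chart can be illustrated with a small function. In practice the counting would be done by the log store's query; the version below is only an in-memory sketch, and `bucketErrors` and its parameters are hypothetical.

```javascript
// Hypothetical aggregation helper: counts error timestamps (ms) per bucket.
// rangeStart and rangeMs describe the window shown in the chart.
function bucketErrors(timestamps, rangeStart, rangeMs, bucketCount = 24) {
  const buckets = new Array(bucketCount).fill(0);
  const bucketMs = rangeMs / bucketCount;
  for (const ts of timestamps) {
    const offset = ts - rangeStart;
    if (offset < 0 || offset >= rangeMs) continue; // outside the chart window
    buckets[Math.floor(offset / bucketMs)] += 1;
  }
  return buckets;
}

const HOUR = 60 * 60 * 1000;
const dayStart = Date.UTC(2021, 0, 1);
const counts = bucketErrors(
  [dayStart, dayStart + 10, dayStart + HOUR, dayStart + 23 * HOUR + 59, dayStart + 24 * HOUR],
  dayStart,
  24 * HOUR
);
console.log(counts[0], counts[1], counts[23]);
```

The same function covers the four-hour and one-hour views by passing a different `rangeMs` and `bucketCount`.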
Error lists
At first we built a list of errors to be handled, a "my errors" list, and a list of resolved errors. But errors were not bound to people, so everything depended on individual initiative: everyone had to go to the platform and pick up errors on their own, and the results were poor.
Later we added a list of errors by author, and a daily DingTalk report reminds the people responsible. Critical errors are sent to the responsible person through a real-time alarm, which is covered below.
How it works:
- At build time, use git commands to write the author, author email, and time into the file header.
- In the visualization service, request the file at the error URL, match the author from it, and return the result to the display end.
SourceMap
Build with webpack's hidden-source-map. Compared with source-map, the trailing source map comment is omitted, but index.js.map is still emitted to the output directory. This avoids leaking the source map in the production environment.

    webpackJsonp([1], [function(e,t,i){...}, function(e,t,i){...}, function(e,t,i){...}, function(e,t,i){...}, ...])
    // No source-map link is generated

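For reference, the option mentioned above is set through webpack's `devtool` field. The fragment below is a sketch: only the `devtool` line reflects the text, while the `mode`, `entry`, and `output` values are placeholders.

```javascript
// webpack.config.js (sketch): only the devtool line is the point here,
// the mode, entry, and output values are placeholders.
module.exports = {
  mode: 'production',
  entry: './src/index.js',
  output: { filename: 'index.js' },
  // Emits index.js.map but leaves the sourceMappingURL comment out of index.js
  devtool: 'hidden-source-map',
};
```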
Based on the URL of the file that reported the error, and following the directory layout and rules agreed within the team, locate the SourceMap that was uploaded at build time.

    const sourcemapUrl = ('xxxfolder/' + url + 'xxxHash' + '.map')

Take the reported line, column, and source, and resolve them with the third-party library source-map.

    const sourceMap = require('source-map');
    // Map the line and column in the built file back to the position in the source file
    const getPosition = async (map, lineno, colno) => {
      const consumer = await new sourceMap.SourceMapConsumer(map);
      const position = consumer.originalPositionFor({
        line: lineno,
        column: colno
      });
      position.content = consumer.sourceContentFor(position.source);
      return position;
    };

If you are interested in how SourceMap works, see "SourceMap and front-end exception monitoring".
Error alarms
Alarm settings
- Each business line sets its own threshold, error time span, and alarm polling interval
- Alarms are sent to the corresponding group through a DingTalk webhook
- Errors are listed by author in a daily report
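The three settings in the first item (threshold, time span, polling interval) suggest a simple rule check that a poller runs on each interval. The sketch below shows one way to express it; `shouldAlarm`, the rule shape, and the numbers are all made up for illustration.

```javascript
// Hypothetical rule check: fires when at least `threshold` errors fall inside
// the last `spanMs` milliseconds. A poller would call this every `pollMs`.
function shouldAlarm(errorTimes, rule, now) {
  const windowStart = now - rule.spanMs;
  const recent = errorTimes.filter((t) => t > windowStart && t <= now);
  return recent.length >= rule.threshold;
}

// Per-business-line rules; the names and numbers are made up
const rules = {
  activity: { threshold: 3, spanMs: 60 * 1000, pollMs: 30 * 1000 },
};

const now = 1000000;
const quiet = shouldAlarm([now - 90000, now - 5000], rules.activity, now);
const noisy = shouldAlarm([now - 50000, now - 20000, now - 1000], rules.activity, now);
console.log(quiet, noisy);
```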
IV. Expansion
Behavior collection
By collecting user actions, we can see clearly how an error came about.
Categories
- UI behavior: click, scroll, focus/blur, long press
- Browser behavior: request, forward/back, redirect, new page, close
- Console behavior: log, warn, error
Collection methods
- Click behavior
Use addEventListener to listen for click events globally and collect the event and the DOM element name. These are reported together with the error information.
- Sending requests
Listen to the onreadystatechange callback of XMLHttpRequest.
- Page navigation
Listen to window.onpopstate, which fires when the page navigates through its history.
- Console behavior
Override info and the other methods of the console object.
If you are interested, see "Behavior monitoring".
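As an illustration of the console case, the sketch below records console calls into a small bounded buffer that could be attached to the next error report. `createBreadcrumbs`, `wrapConsole`, and the record shape are hypothetical; a stand-in console object is used so the real console is left alone.

```javascript
// Hypothetical breadcrumb buffer: keeps only the most recent `limit` actions.
function createBreadcrumbs(limit) {
  const items = [];
  return {
    add(type, data) {
      items.push({ type, data, time: Date.now() });
      if (items.length > limit) items.shift();
    },
    list() { return items.slice(); },
  };
}

// Wraps console methods so each call is recorded and then passed through
function wrapConsole(consoleObj, crumbs, levels = ['log', 'info', 'warn', 'error']) {
  levels.forEach((level) => {
    const origin = consoleObj[level];
    if (typeof origin !== 'function') return;
    consoleObj[level] = function (...args) {
      crumbs.add('console.' + level, args.map(String).join(' '));
      return origin.apply(consoleObj, args);
    };
  });
}

const crumbs = createBreadcrumbs(2);
const output = [];
const fakeConsole = {
  log: (...a) => output.push(a.join(' ')),
  warn: (...a) => output.push(a.join(' ')),
};
wrapConsole(fakeConsole, crumbs);
fakeConsole.log('page loaded');
fakeConsole.warn('slow request');
fakeConsole.log('clicked', 'buy');
console.log(crumbs.list().map((c) => c.type + ': ' + c.data));
```

Bounding the buffer keeps the extra payload small, at the cost of losing older actions.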
Problems encountered
Because some of this involves private details, the following has been desensitized.
Empty log problem
After the gray release went online, we found some empty logs in SLS. What happened?
First, let's recall which parts of the pipeline could be at fault.
Working backwards from SLS collection, the stages are: writing the log to disk, the server receiving the report, and the SDK reporting. We checked them in turn.
One step back, the disk logs already contained empty entries, so we had to look at the receiver and the SDK.
Using the control-variable method, we first added an empty check in the SDK so that empty logs would not be reported. Result: no effect.
Next we handled the received data in Node: if it was empty, no log was written. Result: still no effect.
So we began to suspect the log printing itself. We studied the API of the third-party log library and tried various things, still with no effect.
When in doubt, read the source. We walked through the main call flow of the log library's source code and found no problem, which left us confused.
The code logic looked fine, so we wondered whether the data was the problem. We reduced the number of reported fields until only a single field was left, and after release the problem was gone.
Was the data in some field too long? Neither the code logic nor the logs along the way suggested that.
So we bisected: we added the fields back half at a time and finally pinned it on one particular field. Whenever that field was reported, the problem appeared, which was quite unexpected.
We thought about the pipeline again. Apart from the log library, the code was basically our own logic, so we suspected the log library did something special with that field.
Searching confirmed it: the log library uses a field with that name for its own purposes in master-worker mode (Node's cluster mode), which conflicted with the field we reported, so the entry was lost.
Log loss
Having solved the previous problem we felt a sense of achievement, but we were celebrating too early.
During local testing, a teammate kept refreshing the page to report the errors on it, and noticed that the number of entries reported locally did not match the number in the log service, which had far fewer.
Since I had worked in back-end development for more than two years right after graduating, I was somewhat sensitive to data loss in I/O operations. My intuition was that it was related to multiple processes: several processes writing to the same file might be running into a file deadlock.
So we removed the multi-process setup and tried to reproduce the problem with a single process. Nothing was lost.
We found two places where cluster (master-worker) mode could be configured: the log library and the deployment tool.
By default, the log library runs in master-worker mode, but the deployment tool has no notion of master and worker processes, which inevitably leads to deadlock when writing to disk and therefore to lost logs. I wondered whether the community had a third-party library that solves this.
A quick Google search found one. It provides message passing between the master process and the worker processes: the master process is responsible for writing all messages to the log, and each worker process passes its messages to the master.
V. Recommended reading and references
Handling exceptions
- How do I gracefully handle front-end exceptions?
source-map
- SourceMap and front-end exception monitoring
React errors
- React, elegant catch exception
Script error
- Capture and report JavaScript errors with window.onerror | Product Blog, Sentry
- What the heck is "Script error"? | Product Blog, Sentry
Overall
- Front-end monitoring | Allan: how to implement a multi-terminal error monitoring platform
- Building a front-end monitoring system step by step: JS error monitoring
- Hand-writing a front-end monitoring system
My talk at the open day
- "Ending naked running" slides (PPT)
Contribution from: a transparent self-developed front-end error monitoring