○ I. Background

Pain points

One day the product team says: "Advertiser XXX reports that registration on our page doesn't work!" Another day operations says: "This campaign is dead on media XXX!"

Our company runs advertising pages online at a scale approaching the 100-million level. If they run "naked" in production, we have no idea what is going wrong; problems surface on the business side first, and being asked about them by the business team is embarrassing.

Choosing a solution

The company has four business divisions, each with at least three projects, so there are at least 12 projects. Keep this in mind for later: there are many business lines.

We can build the system ourselves or use a third-party service. Here is how a few common third-party options compare.

  • Fundebug: paid plans start at 159 yuan/month. Data is stored with the third party, and keeping the data on your own side costs 300,000 yuan/year. Still expensive.
  • FrontJS: the advanced edition is 899/month and the professional edition is 2,999/month.
  • Sentry: $80/month.

Take Sentry as an example: for these 12 projects it comes to nearly 100,000 a year. We roughly estimated that an MVP would take 2 people 1.5 months, that is 90 person-days. At a salary of 15,000 per person per month, the total cost would be 45,000, and it is a one-off cost.

So on cost alone we would build it ourselves, but there are other reasons too. For example, we want custom features on top of this system: integrating it with the company's permission and user system, managing to-dos per user, ranking users by number of errors, and so on.

For the security of business data, we also want to build the system ourselves.

Therefore, considering cost, security, and extensibility, we chose to develop our own.

○ II. Product design

What kind of product do we want? Starting from first principles, the key problem to solve is "how to locate the problem". Using 5W1H, what information do we need?

Error information

Error monitoring can be summed up in one sentence: collect page errors, report them, and then analyze them to find the cause.

Analyzing this sentence with the 5W1H rule surfaces several points that need our attention.

  1. What: what kind of error occurred (logic error, data error, network error, syntax error, etc.).
  2. When: when it occurred, for example a timestamp.
  3. Who: how many users are affected, including the number of error events, IP, and device information.
  4. Where: which pages are affected, including page, ad slot (specific to our company), and media (specific to our company).
  5. Why: what caused the error, including error stack, line and column, and SourceMap.
  6. How: how to locate and solve the problem; for this we also need to collect system information.

Architectural layers

First, let's sort out what capabilities we need.

How do we obtain the information above and ultimately locate the error?

First we need to collect errors. But how do we learn about an error on a page on the user's device? It has to be reported. So the first layer appears: we need a collection and reporting end.

How is it reported? Having worked with the back end for so long, you surely know 🙃: you need an interface. A server has to receive the reported errors, then filter and aggregate them. So the second layer appears: we need a collection and aggregation end.

Now that we have collected enough raw information, how do we use it? We need to organize it according to our own rules. Writing SQL-like queries every time would be very inefficient, so we need a visual platform to display it. Hence the third layer: the visual analysis end.

It feels like we are done; this is what everyone thinks of as an error monitoring platform 🙅. But then you notice something: at every release, and for a while afterwards, developers keep staring at the screen. What is that, human-eyeball monitoring mode? So we need to solve this with code, and naturally the fourth layer, the monitoring and alarm end, comes into being.

So say it out loud, what do we need 🙈: a collection and reporting end, a collection and aggregation end, a visual analysis end, and a monitoring and alarm end.

○ III. System design

As with a function, define the input and output of each stage and the core functionality it has to handle.

Now let's see how to implement the four ends mentioned above.

Collection and reporting end (SDK)

The input is all errors, and the output is captured and reported errors. The core is handling the collection of different error types; the rest is non-core but necessary work.

Error types

Let's take a look at what types of errors we need to handle.

Common JS execution errors

  1. SyntaxError

A syntax error found while parsing

// Run in the console
const xx,

window.onerror does not catch SyntaxError; syntax errors are usually found at build time, or even earlier during local development.

  2. TypeError

The value is not of the expected type

// Run in the console
const person = void 0
person.name

  3. ReferenceError

A reference to an undeclared variable

// Run in the console
nodefined

  4. RangeError

A value is not in its permitted range or set

(function fn() { fn() })()

Network error

  1. ResourceError

Resource loading error

new Image().src = '/remote/image/notdefined.png'

  2. HttpError

HTTP request error

// Run in the console
fetch('/remote/notdefined', {})

Collecting errors

Every case above is an error, so how do we catch them?

try/catch

Ordinary runtime errors can be caught, but syntax errors and asynchronous errors cannot.

// Ordinary runtime error, can be caught ✅
try {
  console.log(notdefined);
} catch(e) {
  console.log('Exception caught:', e);
}

// Syntax error, cannot be caught ❌
try {
  const notdefined,
} catch(e) {
  console.log('Exception caught:', e);
}

// Async error, cannot be caught ❌
try {
  setTimeout(() => {
    console.log(notdefined);
  }, 0)
} catch(e) {
  console.log('Exception caught:', e);
}

try/catch has the advantage of fine-grained handling, but its drawbacks are also obvious.

window.onerror

For pure JS error collection, use window.onerror. When a JS runtime error occurs, the window fires an error event of the ErrorEvent interface.

/**
 * @param {String} message Error message
 * @param {String} source  File in which the error occurred
 * @param {Number} lineno  Line number
 * @param {Number} colno   Column number
 * @param {Object} error   Error object
 */
window.onerror = function(message, source, lineno, colno, error) {
   console.log('Exception caught:', {message, source, lineno, colno, error});
}

First, let's verify which of the following errors can be caught.

// Ordinary runtime error, can be caught ✅
window.onerror = function(message, source, lineno, colno, error) {
  console.log('Exception caught:', {message, source, lineno, colno, error});
}
console.log(notdefined);

// Syntax error, cannot be caught ❌
window.onerror = function(message, source, lineno, colno, error) {
  console.log('Exception caught:', {message, source, lineno, colno, error});
}
const notdefined,

// Async error, can be caught ✅
window.onerror = function(message, source, lineno, colno, error) {
  console.log('Exception caught:', {message, source, lineno, colno, error});
}
setTimeout(() => {
  console.log(notdefined);
}, 0)

// Resource error, cannot be caught ❌
<script>
  window.onerror = function(message, source, lineno, colno, error) {
    console.log('Exception caught:', {message, source, lineno, colno, error});
    return true;
  }
</script>
<img src="https://yun.tuia.cn/image/kkk.png">

What can we do about the resource errors that window.onerror cannot catch?

window.addEventListener

When a resource (such as an image or script) fails to load, the element that loaded it fires an error event of the Event interface. These error events do not bubble up to the window, but they can be intercepted in the capture phase; window.onerror cannot observe the capture phase.

// Image, script, and CSS loading errors, can be caught ✅
<script>
  window.addEventListener('error', (error) => {
    console.log('Exception caught:', error);
  }, true)
</script>
<img src="https://yun.tuia.cn/image/kkk.png">
<script src="https://yun.tuia.cn/foundnull.js"></script>
<link href="https://yun.tuia.cn/foundnull.css" rel="stylesheet"/>

// new Image error, cannot be caught ❌
<script>
  window.addEventListener('error', (error) => {
    console.log('Exception caught:', error);
  }, true)
</script>
<script>
  new Image().src = 'https://yun.tuia.cn/image/lll.png'
</script>

// fetch error, cannot be caught ❌
<script>
  window.addEventListener('error', (error) => {
    console.log('Exception caught:', error);
  }, true)
</script>
<script>
  fetch('https://tuia.cn/test')
</script>

new Image is used less often, and its errors can be handled where it is created.

But what about the widely used fetch? fetch returns a Promise, and a Promise error cannot be caught this way.

Promise error

  1. Ordinary Promise error

try/catch cannot catch an error thrown inside a Promise

// try/catch cannot handle the JSON.parse error because it is thrown inside a Promise
try {
  new Promise((resolve, reject) => {
    JSON.parse('')
    resolve();
  })
} catch (err) {
  console.error('in try catch', err)
}

// The catch method is required
new Promise((resolve, reject) => {
  JSON.parse('')
  resolve();
}).catch(err => {
  console.log('in catch fn', err)
})

  2. Async error

try/catch cannot catch an error wrapped in an async function unless the call is awaited

const getJSON = async () => {
  throw new Error('inner error')
}

// A try/catch inside the async function does not see the rejection either
const makeRequest = async () => {
  try {
    // 'inner error' is not caught: getJSON() is not awaited, so its rejection stays unhandled
    // (the catch below only sees JSON.parse failing on the Promise object)
    JSON.parse(getJSON());
  } catch (err) {
    console.log('outer', err);
  }
};

try {
  // Without await, this try/catch cannot catch a rejection from makeRequest
  makeRequest()
} catch (err) {
  console.error('in try catch', err)
}

try {
  // With await, a rejection would be caught here
  await makeRequest()
} catch (err) {
  console.error('in try catch', err)
}

  3. Dynamic import (chunk) error

import() returns a Promise, so the error can be caught in two ways.

// Promise catch method
import(/* webpackChunkName: "incentive" */'./index').then(module => {
    module.default()
}).catch((err) => {
    console.error('in catch fn', err)
})

// await method, try/catch
try {
    const module = await import(/* webpackChunkName: "incentive" */'./index');
    module.default()
} catch(err) {
    console.error('in try catch', err)
}

Summary: catching Promise errors globally

All three cases boil down to Promise errors, which can be caught with unhandledrejection.

// Handle Promise errors globally
window.addEventListener("unhandledrejection", function(e){
  console.log('Exception caught:', e);
});
fetch('https://tuia.cn/test')

To avoid missing Promise exceptions, use unhandledrejection to listen globally for uncaught Promise errors.
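
The global hooks covered so far (runtime error events, capture-phase resource errors, and unhandledrejection) can all feed one normalizing function. The following is a minimal sketch of that idea, not the SDK's actual code: the payload fields (`kind`, `message`, `tag`, and so on), `normalizeErrorEvent`, `installGlobalHandlers`, and the `report` callback are all hypothetical names.

```javascript
// Sketch only: the payload shape and the `report` callback are assumptions.
function normalizeErrorEvent(event) {
  // Promise rejections carry their cause in `reason`
  if (event.type === 'unhandledrejection') {
    const reason = event.reason;
    return {
      kind: 'promise',
      message: reason && reason.message ? reason.message : String(reason),
      stack: reason && reason.stack ? reason.stack : '',
    };
  }
  // Resource errors are dispatched on the element, which has src or href
  const target = event.target;
  if (target && (target.src || target.href)) {
    return {
      kind: 'resource',
      message: 'Failed to load ' + (target.src || target.href),
      tag: String(target.tagName || '').toLowerCase(),
    };
  }
  // Everything else is treated as a JS runtime error (ErrorEvent)
  return {
    kind: 'js',
    message: event.message || '',
    source: event.filename || '',
    lineno: event.lineno || 0,
    colno: event.colno || 0,
    stack: event.error && event.error.stack ? event.error.stack : '',
  };
}

function installGlobalHandlers(win, report) {
  const handler = (event) => report(normalizeErrorEvent(event));
  // true = capture phase, which is what makes resource errors visible
  win.addEventListener('error', handler, true);
  win.addEventListener('unhandledrejection', handler);
}

// Browser usage (guarded so the snippet also loads outside a browser)
if (typeof window !== 'undefined') {
  installGlobalHandlers(window, (payload) => console.log('Exception caught:', payload));
}
```

Keeping the normalizer a pure function makes it easy to unit-test with plain objects, while the listener registration stays a thin wrapper.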

Vue error

Vue captures errors thrown in all single-file components and in code that extends Vue.extend, so errors inside Vue are not delivered directly to window.onerror; they are handed to Vue.config.errorHandler.

/**
 * Catch Vue errors globally and re-throw them so that window.onerror handles them
 */
Vue.config.errorHandler = function (err) {
  setTimeout(() => {
    throw err
  })
}

React errors

React lets you declare an error boundary component with componentDidCatch.

class ErrorBoundary extends React.Component {
  constructor(props) {
    super(props);
    this.state = { hasError: false };
  }

  static getDerivedStateFromError(error) {
    // Update state so that the next render can show the degraded UI
    return { hasError: true };
  }

  componentDidCatch(error, errorInfo) {
    // You can also report the error log to the server
    logErrorToMyService(error, errorInfo);
  }

  render() {
    if (this.state.hasError) {
      // You can customize the demoted UI and render it
      return <h1>Something went wrong.</h1>;
    }

    return this.props.children;
  }
}

class App extends React.Component {
  render() {
    return (
      <ErrorBoundary>
        <MyWidget />
      </ErrorBoundary>
    )
  }
}

However, error boundaries do not catch errors in React event handlers, in asynchronous code, or thrown by the error boundary itself.

Cross-domain problem

In general, when an error shows up as "Script error.", it is almost certainly a cross-origin problem.

If the current page and the remote JS are on different domains and the remote JS throws, window.onerror receives "Script error.". There are two ways to solve this.

  • Configure Access-Control-Allow-Origin on the back end and the crossorigin attribute on the front-end script tag.
<script src="http://yun.tuia.cn/test.js" crossorigin></script>

const script = document.createElement('script');
script.crossOrigin = 'anonymous';
script.src = 'http://yun.tuia.cn/test.js';
document.body.appendChild(script);

  • If you cannot change the server's response headers, consider working around it with try/catch and re-throwing the error.
<!doctype html>
<html>
<head>
  <title>Test page in http://test.com</title>
</head>
<body>
  <script src="https://yun.dui88.com/tuia/cdn/remote/testerror.js"></script>
  <script>
  window.onerror = function (message, url, line, column, error) {
    console.log(message, url, line, column, error);
  }

  try {
    foo(); // Call the foo method defined in testerror.js
  } catch (e) {
    throw e;
  }
  </script>
</body>
</html>

Without the try/catch, console.log prints "Script error."; with the try/catch, the error can be captured.

Let's look at the scenarios. Remote JS typically fails in one of three ways.

  • An error when calling a method of the remote JS
  • An error in an event handler inside the remote JS
  • An error in a callback such as setTimeout

Method call scenario

By wrapping a function, you can decorate the original method so that it runs inside a try/catch.


<!doctype html>
<html>
<head>
  <title>Test page in http://test.com</title>
</head>
<body>
  <script src="https://yun.dui88.com/tuia/cdn/remote/testerror.js"></script>
  <script>
  window.onerror = function (message, url, line, column, error) {
    console.log(message, url, line, column, error);
  }

  function wrapErrors(fn) {
    // don't wrap function more than once
    if (!fn.__wrapped__) {
      fn.__wrapped__ = function () {
        try {
          return fn.apply(this, arguments);
        } catch (e) {
          throw e; // re-throw the error
        }
      };
    }
    return fn.__wrapped__;
  }

  wrapErrors(foo)()
  </script>
</body>
</html>

You can try removing wrapErrors and see the difference.

Event scenarios

The native method can be hijacked.

<!doctype html>
<html>
<head>
  <title>Test page in http://test.com</title>
</head>
<body>
  <script>
    const originAddEventListener = EventTarget.prototype.addEventListener;
    EventTarget.prototype.addEventListener = function (type, listener, options) {
      const wrappedListener = function (...args) {
        try {
          return listener.apply(this, args);
        }
        catch (err) {
          throw err;
        }
      }
      return originAddEventListener.call(this, type, wrappedListener, options);
    }
  </script>
  <div style="height: 9999px;">http://test.com</div>
  <script src="https://yun.dui88.com/tuia/cdn/remote/error_scroll.js"></script>
  <script>
  window.onerror = function (message, url, line, column, error) {
    console.log(message, url, line, column, error);
  }
  </script>
</body>
</html>

You can try removing the code that wraps EventTarget.prototype.addEventListener and see the difference.
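
The third scenario listed earlier, an error thrown inside a callback such as setTimeout, can be approached with the same hijacking idea. This is a minimal sketch under assumptions: `wrapTimer`, `host`, and the `onError` callback are hypothetical names, not part of the original SDK.

```javascript
// Sketch only: patches host.setTimeout so callback errors pass through onError.
function wrapTimer(host, onError) {
  const originSetTimeout = host.setTimeout;
  host.setTimeout = function (callback, delay, ...args) {
    // Leave non-function arguments (e.g. code strings) untouched
    if (typeof callback !== 'function') {
      return originSetTimeout.call(host, callback, delay, ...args);
    }
    const wrapped = function (...cbArgs) {
      try {
        return callback.apply(this, cbArgs);
      } catch (err) {
        onError(err); // the full error object is available here
        throw err;    // re-throw so global handlers still fire
      }
    };
    return originSetTimeout.call(host, wrapped, delay, ...args);
  };
  // Return a function that restores the original method
  return function restore() {
    host.setTimeout = originSetTimeout;
  };
}
```

In a page this might be installed as `wrapTimer(window, reportError)` before the remote scripts load, where `reportError` stands for whatever reporting function the SDK exposes.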

Reporting interface

Why not report directly with a GET/POST/HEAD request to an API?

The reason is easy to guess. The tracking domain is generally not the current page's domain, so every API request would be cross-origin.

Why not request some other file resource (JS/CSS/TTF)?

After a resource node is created, the browser does not actually send the request until the object is inserted into the DOM tree. Loading JS/CSS resources can also block page rendering and hurt the user experience.

An image beacon does not need to be inserted into the DOM: creating an Image object in JS is enough to send the request, and nothing is blocked. Even in a browser environment without JS, an img tag can still send the beacon.

So we report through new Image. One last question: among image formats, why a 1×1 transparent GIF rather than PNG/JPEG/BMP?

First, 1×1 pixel is the smallest legal image. Second, since the beacon is an image, it should be transparent so that it does not affect how the page looks; and a transparent image only needs one binary flag marking the color as transparent, with no color data to store, which saves bytes. Because transparency is required, JPEG is ruled out immediately.

For the same response, GIF saves 41% of the traffic compared with BMP and 35% compared with PNG, so GIF is the best choice.

  • Works across domains
  • Carries no cookies
  • No need to wait for the server to return data

So we use a 1×1 GIF.
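
To make the image beacon concrete, here is a minimal sketch. It assumes a hypothetical endpoint `REPORT_URL` that answers with a 1×1 GIF, and the payload field names are illustrative; neither comes from the original SDK.

```javascript
// Sketch only: REPORT_URL and the payload fields are hypothetical.
const REPORT_URL = 'https://log.example.com/report.gif';

// Serialize the payload into a query string on the GIF URL
function buildReportUrl(base, payload) {
  const params = new URLSearchParams();
  for (const [key, value] of Object.entries(payload)) {
    if (value === undefined || value === null) continue; // skip empty fields
    params.append(key, typeof value === 'object' ? JSON.stringify(value) : String(value));
  }
  params.append('_t', String(Date.now())); // cache buster
  return base + '?' + params.toString();
}

// Fire-and-forget: no DOM insertion and no response handling
function report(payload) {
  if (typeof Image === 'undefined') return; // not running in a browser
  const img = new Image();
  img.src = buildReportUrl(REPORT_URL, payload);
}
```

Because everything travels in the query string, long values such as stack traces may need truncating to stay within practical URL length limits.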

Non-blocking loading

Keep the impact of loading the SDK's JS to a minimum.

First cache the errors seen by the window error listener, then load the SDK asynchronously, and let the SDK process and report the cached errors.

<!DOCTYPE html>
<html lang="en">
<head>
    <script>
        (function(w) {
            w._error_storage_ = [];
            function errorhandler() {
                // Record the current error
                w._error_storage_ && w._error_storage_.push([].slice.call(arguments));
            }
            w.addEventListener && w.addEventListener("error", errorhandler, true);
            var times = 3,
            appendScript = function appendScript() {
                var sc = document.createElement("script");
                sc.async = !0,
                sc.src = './build/skyeye.js', // Depends on where you put it
                sc.crossOrigin = "anonymous",
                sc.onerror = function() {
                    times--,
                    times > 0 && setTimeout(appendScript, 1500)
                },
                document.head && document.head.appendChild(sc);
            };
            setTimeout(appendScript, 1500);
        })(window);
    </script>
</head>
<body>
    <h1>This is a test page (new)</h1>
</body>
</html>
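
On the SDK side, the cached entries in `_error_storage_` have to be replayed once the script has loaded. A minimal sketch of that step, assuming hypothetical names `drainErrorStorage` and `handle` (the original does not show how the SDK consumes the queue):

```javascript
// Sketch only: `drainErrorStorage` and `handle` are hypothetical names.
function drainErrorStorage(w, handle) {
  const cached = Array.isArray(w._error_storage_) ? w._error_storage_ : [];
  // Replay in arrival order; each entry is the listener's arguments array
  for (const args of cached) {
    handle.apply(null, args);
  }
  // A null queue makes the loader's `w._error_storage_ &&` guard stop recording
  w._error_storage_ = null;
  return cached.length;
}
```

After draining, the SDK would register its own listeners so that later errors are reported directly rather than cached.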

Collection and aggregation end (log server)

At this stage, the input is the error records received by the interface, and the output is valid data written to storage. One core function is data cleaning, which also relieves excessive pressure on the service; the other is storing the data.

The overall flow is: error identification -> error filtering -> error receiving -> error storage.

Error identification (with the SDK)

Before aggregating, we need to be able to identify errors along different dimensions, that is, to locate a single error entry and a single error event.

Single error entry

Generate an ID for each error entry from the timestamp and a random string.

const errorKey = `${+new Date()}@${randomString(8)}`

function randomString(len) {
    len = len || 32;
    let chars = 'ABCDEFGHJKMNPQRSTWXYZabcdefhijkmnprstwxyz2345678';
    let maxPos = chars.length;
    let pwd = '';
    for (let i = 0; i < len; i++) {
        pwd += chars.charAt(Math.floor(Math.random() * maxPos));
    }
    return pwd;
}

Single error event

First, we need to be able to identify the same error event (different users, same error type and error message).

An eventKey for the error event can be generated by summing the ASCII codes of message, colno, and lineno.

const eventKey = compressString(String(e.message), String(e.colno) + String(e.lineno))

function compressString(str, key) {
    let chars = 'ABCDEFGHJKMNPQRSTWXYZ';
    if (!str || !key) {
        return 'null';
    }
    // Sum the character codes of the message and of the position key
    let n = 0,
        m = 0;
    for (let i = 0; i < str.length; i++) {
        n += str[i].charCodeAt();
    }
    for (let j = 0; j < key.length; j++) {
        m += key[j].charCodeAt();
    }
    let num = n + '' + key[key.length - 1].charCodeAt() + m + str[str.length - 1].charCodeAt();
    if (num) {
        // Append a letter chosen by the last digit
        num = num + chars[num[num.length - 1]];
    }
    return num;
}

In the figure below, each error event in the event list groups the actual error entries beneath it.

Error filtering (with the SDK)

Domain filtering

Filter out "Script error." on the page, which may come from other JS injected by the WebView.

We only care about problems in our own remote JS, so we filter by company domain.

// Pseudocode
if (!e.filename || !e.filename.match(/^(http|https):\/\/yun./)) return true

Duplicate reports

How do we avoid reporting the same data repeatedly? Cache the errorKey and stop reporting once the number of repeats exceeds a threshold.

// Pseudocode

const storeStorage = {}; // in-memory cache: key -> { value, time }; could be persisted to localStorage
const TIMES = 6; // Maximum number of reports per key

export function setItem(key, repeat = TIMES) {
    if (!key) {
        key = 'unknown';
    }

    if (key in storeStorage) {
        const { value } = storeStorage[key];

        // Core logic: once the count reaches the limit, skip reporting
        if (value >= repeat) {
            return true;
        }
        storeStorage[key] = {
            value: value + 1,
            time: Date.now()
        }
    } else {
        storeStorage[key] = {
            value: 1,
            time: Date.now()
        }
    }
    return false;
}

Receiving errors

When handling the receiving interface, pay attention to traffic control. This is where back-end development spends most of its effort: dealing with highly concurrent traffic.

Error logging

The receiver uses Koa; a simple implementation receives the report and writes it to disk.

// Pseudocode

module.exports = async ctx => {
  const { query } = ctx.request;

  // Do a simple check of the fields
  check(['mobile', 'network', 'ip', 'system', 'ua', /* ... */], query);

  ctx.type = 'application/json';
  ctx.body = { code: '1', msg: 'Data reported successfully' };

  // Write the log to disk here, using whichever logging library you choose
};

Peak clipping mechanism

For example, set a threshold of 2,000 requests per second, decrement the remaining quota on every request, and reset the quota periodically.

// Pseudocode

// 1000 ms
const TICK = 1000;
// Upper limit per second: 2000
const MAX_LIMIT = 2000;
// Number of requests this server will still accept in the current tick
let maxLimit = MAX_LIMIT;

/**
 * Start the periodic reset
 */
const task = () => {
  setTimeout(() => {
    maxLimit = MAX_LIMIT;
    task();
  }, TICK);
};
task();

const check = () => {
  if (maxLimit <= 0) {
    throw new Error('Report limit exceeded');
  }
  maxLimit--;
  // Run the business code...
};

Sampling

Once the threshold is exceeded, reports can be sampled.

// Collect only 20%
if(Math.random() < 0.2) {
  collect(data)      // Record the error message
}
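
Peak clipping and sampling can be combined into one small limiter: accept everything while quota remains, then fall back to sampling. The sketch below is one possible arrangement under assumptions; `createLimiter`, its option names, and the injectable `random` function are hypothetical and only there to make the behavior easy to test.

```javascript
// Sketch only: the limits and the injectable `random` are illustrative.
function createLimiter({ maxPerTick = 2000, sampleRate = 0.2, random = Math.random } = {}) {
  let remaining = maxPerTick;
  return {
    // Call once per tick (for example from a timer) to refill the quota
    reset() {
      remaining = maxPerTick;
    },
    // Returns true if this report should be recorded
    accept() {
      if (remaining > 0) {
        remaining--;
        return true;
      }
      // Over the threshold: keep only a sample
      return random() < sampleRate;
    },
  };
}
```

A server might create one limiter per process, call `reset()` from a one-second timer, and drop any report for which `accept()` returns false.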

Storing errors

How do we aggregate the logs that have been written to disk? This is where a storage solution comes in.

Once a storage solution is chosen and configured, it periodically pulls the data from disk. So we need to choose one.

We compared the commonly used options: Alibaba Cloud Log Service (SLS), ELK (Elasticsearch, Logstash, Kibana), and Hadoop/Hive (store the data in Hadoop and query it with Hive).

After comparing them on the aspects below, we chose Log Service. The main considerations: nothing to build ourselves, low cost, and query features that meet our needs.

Feature          | ELK-type system               | Hadoop + Hive                 | Log Service
Log latency      | 1 to 60 seconds               | Minutes to hours              | Real-time
Query latency    | Less than 1 second            | Minutes                       | Less than 1 second
Query capability | Good                          | Good                          | Good
Scalability      | Provision machines in advance | Provision machines in advance | 10x expansion within seconds
Cost             | Higher                        | Lower                         | Very low

  • Log latency: how long after a log is generated it can be queried.
  • Query latency: how much data can be scanned per unit of time.
  • Query capability: keyword queries, combined-condition queries, fuzzy queries, numeric comparison, context queries.
  • Scalability: the ability to respond quickly to a hundredfold increase in traffic.
  • Cost: cost per GB.

See the Log Service documentation for API usage.

Visual analysis end (visualization platform)


Main features

This part is mainly about designing the product features sensibly and keeping the product small and elegant. For the aggregation itself, refer to Aliyun SLS.

  1. Home page chart: the range can be one day, four hours, or one hour, and error counts are aggregated into buckets (24 buckets for one day).
  2. Home page list: aggregates data within the selected time range and shows error file, error key, number of events, error type, time, and error message.
  3. Error details: event list, basic information, device information, and device share chart (see the event list above).

Lists

At first I built a to-do error list, a "my errors" list, and a resolved errors list, but errors were not bound to people. It relied too much on individual initiative: everyone had to come to the platform and deal with errors on their own, so it did not work well.

Later we added an error author ranking and a daily DingTalk report that reminds the responsible people to deal with their errors. Critical errors are assigned to the responsible person through real-time alarms, which are covered in the alarm section.

Specific principles:

  • At build time, use git commands to obtain the author, author email, and time, and pack them into the file header.
  • In the visualization service, request the URL of the file that raised the error, match the author from its header, and return it to the display end.
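
The two steps above can be sketched as a pair of functions: one that formats the header at build time and one that parses it back in the visualization service. The banner format and the names `buildBanner` and `parseBanner` are assumptions for illustration; the original does not specify the header layout.

```javascript
// Sketch only: the banner format and function names are assumptions.
const BANNER_PATTERN = /^\/\*! author:(.*?) email:(.*?) time:(.*?) \*\//;

// Build step: produce the header comment prepended to each bundle
function buildBanner({ author, email, time }) {
  return `/*! author:${author} email:${email} time:${time} */`;
}

// Visualization service: recover the author from the fetched file content
function parseBanner(source) {
  const match = BANNER_PATTERN.exec(source);
  if (!match) return null;
  return { author: match[1], email: match[2], time: match[3] };
}
```

The values could come from something like `git log -1 --format="%an|%ae|%cI"` during the build, and in a webpack build the resulting string could be passed to BannerPlugin; both are suggestions rather than what the team necessarily used.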

SourceMap

Build with webpack's hidden-source-map. Compared with source-map, the reference comment at the end of the bundle is omitted, but index.js.map is still emitted to the output directory. This avoids leaking source maps in the production environment.

webpackJsonp([1], [function(e,t,i){...}, function(e,t,i){...}, function(e,t,i){...}, function(e,t,i){...}, ...])
// No source-map link is generated

Based on the URL of the file that reported the error, locate the source map that was built and uploaded earlier, following the directory layout and naming rules agreed within the team.

const sourcemapUrl = ('xxxfolder/' + url + 'xxxHash' + '.map')

Obtain the reported line, column, and source, and resolve them with the third-party library source-map.

const sourceMap = require('source-map')

// Look up the original source position from the generated line and column
const getPosition = async (map, lineno, colno) => {
  const consumer = await new sourceMap.SourceMapConsumer(map)

  const position = consumer.originalPositionFor({
    line: lineno,
    column: colno
  })

  // Attach the original source content for display
  position.content = consumer.sourceContentFor(position.source)

  return position
}

If you are interested in how SourceMap works, read further: SourceMap and front-end exception monitoring.

Error alarms

Alarm settings

  1. Each business line sets its own threshold, error time window, and alarm polling interval.
  2. Alarms are sent to the corresponding group through a DingTalk webhook.
  3. The error author ranking is sent out as a daily report.
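
The threshold and time window in the first setting amount to a simple count over recent error timestamps. A minimal sketch, assuming hypothetical rule fields `threshold` and `spanMs`; the function name `shouldAlarm` is also illustrative:

```javascript
// Sketch only: rule fields (threshold, spanMs) are hypothetical names.
function shouldAlarm(timestamps, rule, now = Date.now()) {
  const windowStart = now - rule.spanMs;
  // Count the errors that fall inside the time window
  const count = timestamps.filter((t) => t > windowStart && t <= now).length;
  return { alarm: count >= rule.threshold, count };
}
```

A polling job per business line could call this at its configured interval with the timestamps queried from the log store, and push a message to the group when `alarm` is true.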

○ IV. Expansion

Behavior collection

Collecting user actions makes it much clearer why an error occurred.

Categories

  • UI behavior: click, scroll, focus/blur, long press
  • Browser behavior: request, forward/back, navigation, new page, close
  • Console behavior: log, warn, error

How to collect

  1. Click behavior

Use addEventListener to listen for click events globally and collect the event and the DOM element name. These are reported together with the error information.

  2. Sending requests

Listen to the onreadystatechange callback of XMLHttpRequest.

  3. Page navigation

Listen for window.onpopstate, which fires when the page navigates through its history.

  4. Console behavior

Override info and the other methods on the console object.
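
As an illustration of the console case, the sketch below keeps a bounded list of recent behavior records and wraps console methods so each call is recorded before the original runs. `createBreadcrumbs`, `wrapConsole`, and the record shape are hypothetical names, not the original implementation.

```javascript
// Sketch only: `createBreadcrumbs` and the record shape are assumptions.
function createBreadcrumbs(limit = 20) {
  const items = [];
  return {
    add(category, data) {
      items.push({ category, data, time: Date.now() });
      if (items.length > limit) items.shift(); // keep only the newest entries
    },
    list() {
      return items.slice();
    },
  };
}

// Wrap console methods so each call is recorded before the original runs
function wrapConsole(target, breadcrumbs, levels = ['log', 'info', 'warn', 'error']) {
  for (const level of levels) {
    const origin = target[level];
    if (typeof origin !== 'function') continue;
    target[level] = function (...args) {
      breadcrumbs.add('console.' + level, args.map(String));
      return origin.apply(target, args);
    };
  }
}
```

The same `add` call could be used from a global click listener or an XMLHttpRequest hook, and `list()` would be attached to the payload when an error is reported.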

If you are interested, see the article on behavior monitoring.

Problems encountered

Because some private details are involved, the following has been desensitized.

Empty log problem

After the gray release went live, we found some empty logs in SLS. What happened?

First, let's recall which stages of the pipeline might be at fault.

Trace the pipeline: before SLS collects them, logs pass through disk logging, server-side receiving, and SDK reporting. Let's check each in turn.

Going one step back, we found that the disk logs already contained empty entries, so we had to look at the receiver and the SDK.

Using controlled variables, we first added an empty check in the SDK so that empty logs would never be reported. Result: no effect 😅.

Next we handled the received data in Node: if the data was empty, no log was written. Still no effect.

So we began to ask whether log printing itself was the problem. We studied the API of the third-party log library and tried all sorts of things, still to no avail; my face went black 🌚.

What now? "When in doubt, read the source." We checked the log library's source for problems, walked through the main call flow, and found nothing wrong. Confused 🙃.

The code logic all looked fine, so we began to suspect the data. We reduced the number of reported fields, eventually down to a single field, and after release the problem was gone.

Was the data in some field too long? But nothing in the code logic or in the logs along the way pointed to that kind of error.

So we used bisection, adding fields back half at a time, and finally narrowed it down to one field: whenever that field was reported, the problem appeared. That was quite unexpected.

Thinking over the pipeline again: apart from the log library, the code was basically our own logic, so we went back to the log library and suspected that it was doing something with that field.

Searching its code, we found that the log library gives that field name a special meaning in its master/worker mode (think of Node's cluster mode), which conflicted with the field we reported, so the entry was lost.

Log Loss

With the previous problem solved, I was happy and felt a surge of accomplishment. But I was promptly hit over the head: I had celebrated too early.

During local testing, a teammate kept refreshing the page for fun to report the current page's errors, and noticed that the number of entries reported locally did not match the number in the log service, which had far fewer.

Having worked in back-end development for more than two years right after graduating, I am still somewhat sensitive to data loss in I/O operations. My intuition was that it was a multi-process problem: we suspected that multiple processes writing the same file were causing a file locking problem.

So we removed multi-processing and repeated the reproduction steps with a single process. No entries were lost.

We found two places where cluster (master/worker) mode can be configured: the log library and the deployment tool.

By default the log library uses master/worker mode, but the deployment tool has no notion of master and worker processes, so file writes inevitably contend with each other and logs are lost. I wondered whether the community had a third-party library that solves this.

A Google search quickly turned up such a library, which provides message passing between the master process and the worker processes. The principle: the master process is responsible for writing all log messages, and the workers pass their messages to the master.
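
The principle can be shown in a few lines. This is a sketch of the idea only, not the third-party library's API: `createLogger`, the `{ type: 'log', line }` message shape, and the injected `send` and `write` functions are all assumptions.

```javascript
// Sketch only: `createLogger` and the message shape are assumptions;
// the library the team adopted may work differently.
function createLogger({ isPrimary, send, write }) {
  return {
    // Called by application code in any process
    log(line) {
      if (isPrimary) {
        write(line); // only the master touches the file
      } else {
        send({ type: 'log', line }); // workers forward to the master
      }
    },
    // Called in the master for each message received from a worker
    onWorkerMessage(message) {
      if (message && message.type === 'log') write(message.line);
    },
  };
}
```

With Node's cluster module, `send` in a worker could be `process.send`, and the master could feed `onWorkerMessage` from each worker's `message` event, so only one process ever appends to the log file.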

○ V. Recommended reading and references

Exception handling

How do I gracefully handle front-end exceptions?

source-map

SourceMap and front-end exception monitoring

React errors

React: catching exceptions gracefully

Script error

Capture and report JavaScript errors with window.onerror | Product Blog, Sentry
What the heck is "Script error"? | Product Blog, Sentry

Overall

Front-end monitoring | Allan: how to implement a multi-terminal error monitoring platform
Building a front-end monitoring system step by step: JS error monitoring
Hand-rolling a front-end monitoring system

My open-day talk

"Ending naked running" slides (PPT)