1 Requirement Background: Why does the front end need monitoring

As front-end engineers, we build a lot of pages. Do you know the average daily PV of your pages? Do you know when usage of a given feature peaks? Do you lack the data to back up your position when a requirement review turns into a dispute? When an online problem occurs, do you get pulled into a chat group with no logs to prove your innocence?

As front-end engineers on the front line of the business, we need this information. Whether for business understanding, requirements discussion, troubleshooting, or experience optimization, monitoring is our first step from “front-end engineer” to “engineer”.

2 Requirement analysis: What does the front end need to monitor

From the business perspective, we need to know how users use our pages, including PV, UV, access periods, access duration, and so on.

For troubleshooting, we need a snapshot of the user's session, including the API requests and page errors at the time the problem occurred.

For experience optimization, we need real performance data, including page load and resource load times.

Thus we arrived at the first version of the requirements for front-end monitoring.

| Requirement story | Monitoring items |
| --- | --- |
| As a front-end engineer, I want usage data for the pages/features I build after they go live, so I can understand their business value more deeply and feel a sense of accomplishment. | PV, custom page events |
| As a front-end engineer, I want the API request and error data of the pages I build, so I can judge whether the business functions on the page are healthy and check the data when a problem occurs. | API requests, page errors |
| As a front-end engineer, I want performance data for the pages I build, so I can see how “laggy” the experience users complain about really is, and compare before and after optimization. | Page performance, resource loading |

3 High-level design: How to design the monitoring system

Based on the requirement stories above, we can straightforwardly design three blocks: data collection, log reporting, and log query.

  • Data collection: Collects the data to be monitored on the client side. The client here can be a browser, an app, or a mini program, so one SDK can be implemented per container to collect monitoring data. This article covers the design for browsers only.
  • Log reporting: The collected data is sent to the server through HTTP requests and stored in the database as logs.
  • Log query: Supports query of logs based on conditions and provides corresponding UI functions.

Fortunately, there is a mature open-source solution for log storage and query: ELK, an acronym for three open-source projects.

  • ElasticSearch is a distributed search engine that collects, analyzes, and stores data.
  • Logstash is a server-side data processing pipeline that captures, transforms, and sends data from multiple sources at the same time to “repositories” such as ElasticSearch.
  • Kibana provides a user-friendly Web interface for Logstash and ElasticSearch that helps users aggregate, analyze, and search important data logs.

We use only the “E” and “K” of ELK: log data is processed by our own HTTP server and, after processing, written directly into ElasticSearch.

The following will focus on the specific design and implementation of JS-SDK.

4 Detailed design: How to implement the monitoring SDK

4.1 Define the usage mode

As with most components and functions, detailed design begins by defining how it will ultimately be used — how you want others to use it.

<script src="xxxx/{version}/jssdk.js?token=123" async></script>

We want the JS SDK to be as simple as possible to integrate, as simple as a single line of script.

4.2 Overall SDK structure

4.3 SDK Initialization

The SDK has a single main entry, which is essentially a self-executing function.

(function () {
  // Get the query parameters from the script src
  const params = getScriptQuery();
  initSDK(params);
})();

The main entry is simple: it extracts the SDK's initialization parameters and then calls the initialization function.
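Although the article does not show `getScriptQuery`, a minimal sketch of the src-parsing step might look like the following. The name `parseScriptQuery` and the way the `src` string is obtained are assumptions for illustration, not the SDK's actual code.

```typescript
// Hypothetical sketch of parsing query parameters off the SDK's own script tag.
// In a browser, the src string would typically come from
// (document.currentScript as HTMLScriptElement).src.
export function parseScriptQuery(src: string): { [key: string]: string } {
  const params: { [key: string]: string } = {};
  const qs = src.split('?')[1] || '';
  for (const pair of qs.split('&')) {
    if (!pair) continue;
    const idx = pair.indexOf('=');
    const key = idx === -1 ? pair : pair.slice(0, idx);
    const value = idx === -1 ? '' : pair.slice(idx + 1);
    params[decodeURIComponent(key)] = decodeURIComponent(value);
  }
  return params;
}
```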

function initSDK(opt: SdkConfig) {
  // Built-in default parameters
  const config: SdkConfig = assign({
    sendPV: true,       // Whether to report page PV
    sendApi: true,      // Whether to report API requests
    sendResource: true, // Whether to report resource requests
    sendError: true,    // Whether to report JS errors
    sendPerf: true,     // Whether to report page performance
  }, opt, window.$watchDogConfig);

  window.$watchDogConfig = config;

  // Start each monitoring module
  config.sendPV && watchPV(config);
  config.sendApi && watchApi(config);
  config.sendResource && watchResource(config);
  config.sendError && watchError(config);
  config.sendPerf && watchPerf();
  watchCustom(); // Supports custom log reporting
}

initSDK is also straightforward. It takes the initialization parameters, merges the built-in defaults, the src query parameters, and the global config variable, and writes the result back to the global variable. It then runs the initialization function of each monitoring module in turn. The implementation of each monitoring item is introduced below.

4.4 Implementation Principles of monitoring Items

4.4.1 Interface Monitoring

Interface monitoring is implemented by overriding prototype methods of the browser's built-in XMLHttpRequest object.

export function watchApi(config: SdkConfig) {
  function hijackXHR() {
    const proto = window.XMLHttpRequest.prototype;
    const originalOpen = proto.open;
    const originalSend = proto.send;

    proto.open = function(method: string, url: string) {
      this._ctx = {
        method,
        url: url || '',
        start: getNow(),
      };
      const args = [].slice.call(arguments);
      originalOpen.apply(this, args as any);
    };

    proto.send = function(body: any) {
      const that = this;
      const ctx = that._ctx;

      function handler() {
        if (ctx && that.readyState === 4) {
          try {
            const url = that.responseURL || ctx.url;

            // Discard the SDK's own report requests
            if (url.indexOf(SERVER_HOST) >= 0) {
              return;
            }
            // Report the API request
            sender.report('api', [{
              url,
              httpMethod: ctx.method,
              httpCode: that.status,
              time: getNow() - ctx.start,
            }]);
          } catch (err) {
            sender.reportError(err);
          }
        }
      }

      const originalFn = that.onreadystatechange;

      if (originalFn && isFunction(originalFn)) {
        that.onreadystatechange = function () {
          const args = [].slice.call(arguments);
          handler.apply(this, args as any);
          originalFn.apply(this, args as any);
        };
      } else {
        that.onreadystatechange = handler;
      }

      const args = [].slice.call(arguments);
      return originalSend.apply(this, args as any);
    };
  }

  window.XMLHttpRequest && hijackXHR();
}

If the business site being monitored uses fetch to make requests, we also need to override the browser's built-in fetch function for the sake of the SDK's completeness. The implementation code is not posted here; it is left to readers to try.
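As a starting point for that exercise, here is one possible sketch. `hijackFetch`, the `Reporter` type, and the injected `getNow` are illustrative names rather than the SDK's real internals, and the key point is that the wrapper hands the response (or rejection) back to business code untouched.

```typescript
type Reporter = (type: string, logs: object[]) => void;

// Sketch: wrap the global fetch the same way XMLHttpRequest is hijacked.
export function hijackFetch(report: Reporter, getNow: () => number = Date.now) {
  const g = globalThis as any;
  if (typeof g.fetch !== 'function') return; // fetch unsupported: nothing to wrap
  const originalFetch = g.fetch;

  g.fetch = function (input: any, init?: any) {
    const url = typeof input === 'string' ? input : (input && input.url) || '';
    const method = (init && init.method) || 'GET';
    const start = getNow();

    return originalFetch.call(g, input, init).then(
      (res: any) => {
        report('api', [{ url, httpMethod: method, httpCode: res.status, time: getNow() - start }]);
        return res; // pass the untouched response through to business code
      },
      (err: any) => {
        // Network failure: no HTTP status exists, so 0 is used here (an assumption)
        report('api', [{ url, httpMethod: method, httpCode: 0, time: getNow() - start }]);
        throw err; // preserve the rejection for the caller
      }
    );
  };
}
```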

4.4.2 JS error monitoring

JS errors can be caught via window.onerror. Let's first check its definition on MDN.

The code itself is relatively easy to write, but note that if window.onerror is already being listened to by the business site, we should preserve the original event callback.

export function watchError(config: SdkConfig) {
  const originalOnError = window.onerror;

  function errorHandler(message: string, source?: string, lineno?: number, colno?: number, error?: Error) {
    if (originalOnError) {
      try {
        originalOnError.call(window, message, source, lineno, colno, error);
      } catch (err) {}
    }

    if (error != null) {
      // Report the script error
      sender.report('error', [{
        message,
        file: source || '',
        line: '' + (lineno || ''),
        col: '' + (colno || ''),
        stack: error.stack || '',
      }]);
    }
  }

  window.onerror = function (message, source, lineno, colno, error) {
    errorHandler(message as string, source, lineno, colno, error);
  };
}


One more detail: most of our business pages are now built with Promises. When a Promise is rejected and no rejection handler is attached, the browser fires an unhandledrejection event (see the MDN definition). Reporting these errors is left to readers to practice.
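As a hint for that practice, a sketch might look like this. The names `watchRejection` and `report`, and the injectable event target, are assumptions for illustration; in the SDK the target would be `window`, and the event shape follows the PromiseRejectionEvent interface.

```typescript
type ErrorReporter = (type: string, logs: object[]) => void;

// Sketch: report Promise rejections that no business code handled.
export function watchRejection(
  report: ErrorReporter,
  target: { addEventListener: (t: string, h: (e: any) => void) => void } = (globalThis as any).window
) {
  function handler(event: { reason?: any }) {
    const reason = event && event.reason;
    const isErr = reason instanceof Error;
    report('error', [{
      message: 'Unhandled rejection: ' + (isErr ? reason.message : String(reason)),
      file: '',
      line: '',
      col: '',
      stack: isErr ? reason.stack || '' : '',
    }]);
  }
  if (target && target.addEventListener) {
    target.addEventListener('unhandledrejection', handler);
  }
  return handler; // exposed so it can be removed or tested directly
}
```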

4.4.3 Monitoring resource loading

For resource loading, we want to know which resources the page loads, what their URLs are, whether they loaded successfully, and how long they took. All resources loaded on the page can be retrieved via performance.getEntries(). Before implementing, let's look at the PerformanceEntry definition on MDN.

In the concrete implementation, performance.getEntriesByType('resource') returns loading information for all resources, both static and dynamic, so we need to filter further by initiatorType.

export function watchResource(config: SdkConfig) {
  function reportAssets() {
    if (isFunction(performance.getEntriesByType)) {
      const entries = performance.getEntriesByType('resource') as PerformanceResourceTiming[];

      // Filter out performance entries for non-static resources
      const resourceEntries = arrayFilter(entries, function(entry) {
        return ['fetch', 'xmlhttprequest', 'beacon'].indexOf(entry.initiatorType) === -1;
      });

      // Report the logs
      if (resourceEntries.length) {
        sender.report('resource', arrayMap(resourceEntries, function(entry) {
          return {
            url: entry.name,
            httpCode: 200,
            time: Math.round(entry.duration),
          };
        }));
      }

      // Clear this round's performance entries
      if (isFunction(performance.clearResourceTimings)) {
        performance.clearResourceTimings();
      }

      // Collect again every 2 seconds
      setTimeout(reportAssets, 2000);
    }
  }

  if (document.readyState === 'complete') {
    reportAssets();
  } else {
    addEventListener(window, 'load', reportAssets);
  }
}

In a real business page, resources load gradually: some are lazy-loaded, and some are requested only after user interaction. So collecting all resource-loading information in a single pass is impossible; we can only collect periodically and repeatedly. Overall, a first round of resource-loading logs can be collected once the page is DOM-ready or loaded.

After each round of collection, be sure to call clearResourceTimings to clear the performance entries, to avoid collecting duplicate resources in the next round.

Here is a small question: the entry objects returned by performance.getEntriesByType('resource') carry no HTTP status code, so how do we tell whether a resource loaded successfully? This is where combination comes in. Remember what appears in the browser console when a resource 404s during page load?

This error can be caught with an error event listener on window (in the capture phase, since resource-load errors do not bubble up to window.onerror), and there lies the solution. You can either extract the failed resource's URL from the error message, or use the src attribute of the error event's target as the resource URL. Once you have the failed resource's URL, performance.getEntriesByName can find the corresponding entry object, and the log can be reported.
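A sketch of the listener side of this idea follows. The names `resourceErrorToLog` and `watchResourceError` and the httpCode convention are hypothetical, and the performance.getEntriesByName lookup is omitted for brevity.

```typescript
type ResReporter = (type: string, logs: object[]) => void;

// Hypothetical helper: turn a failed element (img/script/link) into a log entry.
export function resourceErrorToLog(target: any): { url: string; httpCode: number; time: number } | null {
  const url = (target && (target.src || target.href)) || '';
  if (!url) return null; // runtime JS errors carry no src/href; window.onerror handles those
  // The browser only reports "failed"; the exact status code is unknown,
  // so 0 is used as a stand-in (an assumption, not the SDK's real convention).
  return { url: String(url), httpCode: 0, time: 0 };
}

// Sketch: resource-load errors do not bubble, so listen in the capture phase.
export function watchResourceError(report: ResReporter, win: any = (globalThis as any).window) {
  if (!win || !win.addEventListener) return;
  win.addEventListener('error', function (event: any) {
    const log = resourceErrorToLog(event && event.target);
    if (log) report('resource', [log]);
  }, true); // third argument: capture phase
}
```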

4.4.4 Monitoring Page Performance

The implementation of page performance monitoring is relatively simple. In general, performance data is collected when the page is DOM-ready or loaded, and it only needs to be reported once. For what to report, see the properties defined on PerformanceTiming.
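For illustration, here is a sketch of deriving a few common metrics from those properties. The metric names and the local NavTiming type are assumptions; the arithmetic follows the attribute order of the Navigation Timing (Level 1) specification.

```typescript
// Minimal local view of the PerformanceTiming fields used below
interface NavTiming {
  navigationStart: number;
  domainLookupStart: number; domainLookupEnd: number;
  connectStart: number; connectEnd: number;
  requestStart: number; responseStart: number;
  domContentLoadedEventEnd: number; loadEventEnd: number;
}

// Sketch: compute commonly reported metrics; in the SDK this would run once,
// after onload, on window.performance.timing.
export function collectPerf(t: NavTiming) {
  return {
    dns: t.domainLookupEnd - t.domainLookupStart,    // DNS lookup
    tcp: t.connectEnd - t.connectStart,              // TCP handshake
    ttfb: t.responseStart - t.requestStart,          // time to first byte
    domReady: t.domContentLoadedEventEnd - t.navigationStart,
    load: t.loadEventEnd - t.navigationStart,        // full page load
  };
}
```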

4.4.5 PV Monitoring

Before SPAs (single-page applications) became popular, collecting page PV was as simple as reporting one PV log on page load. In the SPA scenario the implementation is a little more involved, so let's look at a simple version first.

export function watchPV(config: SdkConfig) {
  let lastVisit = '';

  function onLoad() {
    sender.reportPV();
    lastVisit = location.href;
  }

  function onHashChange() {
    sender.reportPV({
      spa: config.spa,
      from: lastVisit
    });
    lastVisit = location.href;
  }

  addEventListener(window, 'hashchange', onHashChange);
  addEventListener(window, 'load', onLoad);

  addEventListener(window, 'beforeunload', function () {
    removeEventListener(window, 'hashchange', onHashChange);
    removeEventListener(window, 'load', onLoad);
  });
}

The monitoring JS SDK does not need to care whether a page is an SPA or not. PV logs are reported on page load and on single-page route changes; as long as they are tagged in the log data, they can be distinguished later in PV statistics.

Why is the implementation more complex in the SPA scenario? The code above listens for the hashchange event, but in a real SPA project that alone would most likely not be enough. The reason is that the front end usually relies on a single-page routing library, and implementations differ: single-page routing comes in hash mode and browser-history mode (called hash history and browser history in React Router).

To accommodate the various routing scenarios, we also need to override the pushState and replaceState functions on the browser's built-in history object. The concrete implementation is not given here; it is left to readers to explore.
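As a hint for that exploration, one possible sketch follows. `hijackHistory`, `onRouteChange`, and the injectable history/location objects are illustrative assumptions, not the SDK's actual code.

```typescript
// Sketch: wrap pushState/replaceState so browser-history-mode route changes
// also produce a PV report, mirroring what hashchange gives us for hash mode.
export function hijackHistory(
  onRouteChange: (from: string, to: string) => void,
  hist: any = (globalThis as any).history,
  loc: any = (globalThis as any).location
) {
  if (!hist) return;
  ['pushState', 'replaceState'].forEach(function (name) {
    const original = hist[name];
    if (typeof original !== 'function') return;
    hist[name] = function (state: any, title: string, url?: string) {
      const from = loc ? String(loc.href) : '';
      const result = original.apply(this, arguments as any);
      // Report after the route has actually changed
      onRouteChange(from, url != null ? String(url) : from);
      return result;
    };
  });
}
```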

4.4.6 Custom Report

In real business pages, we sometimes need to proactively collect user actions, such as button clicks, so the monitoring JS SDK must also provide custom reporting. Referring to professional behavior-analytics platforms such as Baidu Analytics and Umeng, the custom reporting function is usually used like this.

// Define an empty array before use
window.$watchDogEvents = window.$watchDogEvents || [];
// Push data into the array
window.$watchDogEvents.push(['Custom name', 'Custom content']);

The $watchDogEvents in the code above is a global variable agreed between the JS SDK and the user. Why is the custom reporting API designed as this odd array? Because the JS SDK is loaded asynchronously: the business page may push events before the SDK has finished loading, and a plain array can buffer them until the SDK takes over.

class CustomEventTrigger {
  push(args: string[]) {
    if (args instanceof Array && args[0]) {
      sender.report('custom', [{
        ext1: args[0],       // eventName, mandatory
        ext2: args[1] || '',
        ext3: args[2] || '',
        ext4: args[3] || '',
        ext5: args[4] || '',
      }]);
    }
  }
}

export function watchCustom() {
  const originalLogs = window.$watchDogEvents || [];

  const trigger = new CustomEventTrigger();
  window.$watchDogEvents = trigger;

  // Flush custom logs pushed before SDK initialization
  setTimeout(() => {
    for (let i = 0; i < originalLogs.length; i++) {
      trigger.push(originalLogs[i]);
    }
  }, 0);
}

As the code above shows, when the JS SDK initializes, the $watchDogEvents array is replaced by an object instance with a push function, the CustomEventTrigger. Any custom data already pushed into the $watchDogEvents array by the business page before this is consumed after the SDK runs, so no data is lost. Subsequent calls to $watchDogEvents.push go directly to the CustomEventTrigger.

Admittedly, pushing an array into a global array is an ugly API. That's fine: we can wrap it in our own npm package, for example with a function definition like the following.

function sendEvent(eventName: string, field1?: string, field2?: string, field3?: string, field4?: string): void
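A minimal sketch of such a wrapper might look like this. It only pushes into the agreed-upon global, so it works whether it runs before or after the SDK swaps the array for its trigger object; the cap of five values mirrors the ext fields of the CustomEventTrigger.

```typescript
// Sketch: an npm-packaged friendly API over the $watchDogEvents convention.
export function sendEvent(eventName: string, field1?: string, field2?: string, field3?: string, field4?: string): void {
  const g = globalThis as any;
  // Works before SDK load (plain array) and after (CustomEventTrigger instance)
  g.$watchDogEvents = g.$watchDogEvents || [];
  g.$watchDogEvents.push([eventName, field1 || '', field2 || '', field3 || '', field4 || '']);
}
```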

4.5 Design of log Reporting

At this point, the implementation principles of all monitoring items in the JS SDK have been introduced. Finally, let's look at how logs are reported to the server.

4.5.1 Report Request

The network request that reports logs can be made in three ways.

| Reporting method | Advantages | Disadvantages |
| --- | --- | --- |
| `new Image().src` | Simple, well supported, no cross-domain problems | URL length is limited, so batched log reporting is impractical |
| POST request | Request body has no length limit; logs can be reported in batches | Requires cross-domain (CORS) support; a POST cannot be sent as the page exits |
| WebSocket | Connection is established only once; subsequent logs perform better | Heavy pressure on the reporting server; the connection count easily overflows |

Weighing these trade-offs against what we report, we adopted the POST method and designed the following two optimizations to address its weaknesses.

4.5.2 Log Queue Mechanism

The POST requests that report log data are cross-domain, and if sent too frequently they may affect the page's normal business requests, because browsers limit the number of concurrent TCP connections. So we designed a log queue: a log is not sent immediately after collection; instead, the queue is flushed every two seconds.

let queue: ReportLog[] = []; // Log queue awaiting report
let t: any = null;           // Records the setTimeout ID

// The actual report request
function sendByXhr(body: ReportLog) {
  const xhr = new XMLHttpRequest();
  xhr.open('POST', `${SERVER_HOST}/api/collect`);
  xhr.onreadystatechange = function () {
    if (xhr.readyState === 4 && +xhr.status !== 200) {
      const retry = body._re || 0;
      // Re-queue within the allowed number of retries
      if (retry < MAX_RETRY_OF_REPORT_LOG) {
        body._re = retry + 1;
        sendLog(body);
      }
    }
  };
  xhr.withCredentials = true;
  xhr.setRequestHeader('Content-Type', 'application/json; charset=UTF-8');
  xhr.send(JSON.stringify(body));
}

// Aggregate and flush the queued logs
function batchFlushQueue() {
  clearTimeout(t);
  t = null;
  // Aggregate the queued logs by type, then send the request
  sendByXhr(mergedLogFromQueue);
  // Clear this round's log queue
  queue = [];
}

// Called by each monitoring item; reporting is deferred
export function sendLog(log: ReportLog) {
  queue.push(log);
  if (!t) {
    // Flush every 2 seconds
    t = setTimeout(batchFlushQueue, 2000);
  }
}

Another small detail: the actual POST report can fail, so a retry mechanism is added. Within the maximum retry count, a failed request's log body is re-queued and reported again in the next round.
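The merge step behind mergedLogFromQueue is elided in the snippet above; one possible sketch of it follows, with an assumed log shape of `{ type, data }`. The exact body shape the server accepts is an assumption here.

```typescript
// Assumed log shape: each queued item carries a type and a batch of entries
type QueuedLog = { type: string; data: object[]; _re?: number };

// Sketch: group queued logs by type so one request body carries, e.g.,
// all 'api' entries together instead of one request per log.
export function mergeQueue(queue: QueuedLog[]): QueuedLog[] {
  const byType: { [t: string]: QueuedLog } = {};
  const merged: QueuedLog[] = [];
  for (const log of queue) {
    const existing = byType[log.type];
    if (existing) {
      existing.data = existing.data.concat(log.data);
    } else {
      const copy = { type: log.type, data: log.data.slice() };
      byType[log.type] = copy;
      merged.push(copy);
    }
  }
  return merged;
}
```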

4.5.3 Handling of page exit

Because of the queue mechanism, unreported logs may still sit in the queue when the page is closed or hidden. Something special has to be done, or that data is lost. Try listening for the beforeunload event and sending a POST request inside it, and you will find the browser never issues the request at all.

window.addEventListener('beforeunload', function () {
  // Sending a POST request => fails ❌
  // Sending a new Image().src request ✅
});

Since we chose POST for log reporting, the approach above is not feasible. After some digging, I found a great API: navigator.sendBeacon. As it has browser-compatibility constraints, the JS SDK must feature-detect it and degrade gracefully when it is unavailable.

function sendByBeacon(body: ReportLog) {
  try {
    if (navigator.sendBeacon && navigator.sendBeacon(`${SERVER_HOST}/api/collect`, JSON.stringify(body))) {
      return;
    }
  } catch (e) {}

  // Fall back to XHR when sendBeacon is unsupported or fails
  sendByXhr(body);
}

Finally, is it enough to call sendBeacon in beforeunload? On the mobile web, a user may simply press the HOME button and return to the desktop, in which case our last round of logs is still lost. The safer approach is to also use the visibilitychange and pagehide events; the demonstration is omitted here, and readers can test their effect.
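As a starting point for that test, here is a sketch of wiring both events. `flushOnExit`, the injectable window object, and the reset-on-visible behavior are assumptions for illustration.

```typescript
// Sketch: flush remaining queued logs when the page is hidden or unloading.
// `flush` stands in for a sendBeacon-backed batchFlushQueue.
export function flushOnExit(flush: () => void, win: any = (globalThis as any).window) {
  if (!win || !win.addEventListener) return;
  let flushed = false;

  function onExit() {
    if (flushed) return; // pagehide and visibilitychange can both fire on one exit
    flushed = true;
    flush();
  }

  win.addEventListener('pagehide', onExit);
  win.addEventListener('visibilitychange', function () {
    const doc = win.document;
    if (doc && doc.visibilityState === 'hidden') {
      onExit();
    } else {
      flushed = false; // page came back: allow flushing on the next hide
    }
  });
}
```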

4.6 Monitoring SDK completion

At this point, the front-end monitoring JS SDK is complete.

For a JS SDK with such comprehensive monitoring items, the final build is under 10 KB, and less than 4 KB gzipped. Compared with similar monitoring products, it is very lean.

5 Summary and prospect: the effect of front-end monitoring

Finally, let's look at the effect of front-end monitoring on real business pages. On the server side, we use ELK: processed log data is stored in ElasticSearch, and logs can be conveniently queried in Kibana.

The figure above shows the request curve of an important API on a business page after a gray-release expansion. With front-end monitoring in place, we can watch each release or gray operation with much more peace of mind. In addition, adding business status codes to the API request logs makes it even easier to see whether the service is currently healthy.

The image above shows log visualization with Grafana, which reads the same data source as Kibana and makes it easy to view dashboard-level aggregate data.

With Grafana and Kibana, the basic capabilities of the front-end monitoring platform are in place.


In the near future, front-end monitoring will continue to focus on three things:


1. More convenient log troubleshooting tools

After using Kibana for a while, you find that when actually investigating a user's problem, you have to check many logs and enter many query conditions, which is inefficient. What we want is to trace all matching logs by user ID over a given time period and build a user behavior trail: which page was opened at what time, which requests the page made, and when errors were caught. We could even do real-time troubleshooting of online user cases.

2. A complete mini program monitoring ecosystem

At present, mini program platforms are flourishing, but only WeChat's accompanying monitoring capability is adequate. Real products often need to ship to many mini program platforms, so we have also built a feature-complete, framework-agnostic, multi-platform-consistent mini program monitoring SDK, which will be rolled out more widely.

3. Alarm capability

Logs are the foundation, and troubleshooting tools locate and analyze problems after they occur. The more important capability of monitoring is alarming, which detects problems ahead of time. Built on the PV, API, and JS error logs of front-end monitoring, well-designed alarm rules bring out the full value of the logs.

In fact, several system failures we have been through have already proven the effectiveness of front-end alarms. Customizing the most appropriate alarm rules for the business characteristics of each project is worth continued exploration.


If you are interested in what we do, welcome to join us!

Resumes can be sent to: [email protected]
