During the live broadcast of IMWebConf 2020, the Tencent Classroom page that plays the FLV stream crashed. A stability problem like this is serious, so the troubleshooting process is recorded here as a cautionary note.

Symptom

While IMWebConf 2020 was being streamed live, the Tencent Classroom page playing the FLV stream crashed in the browser.

Locating the problem

Confirming a memory leak

A quick search shows that, network problems aside, a page crash is usually caused by a memory leak.

Reproducing it reliably

Colleagues on the audio/video team first ran experiments to reproduce the problem, building test pages around the live-streaming code:

  • Blank test page: XHR requests the FLV resource. Page memory grows to a threshold, drops abruptly to 50 MB, and then stops growing while the request continues. No crash.
  • Blank test page: flv.js only pulls the stream without playing it, with no extra parameters. Page memory shows fluctuations of more than 200 MB. No crash.
  • Blank test page: flv.js pulls and plays the stream, with no extra parameters. Page memory shows fluctuations of more than 200 MB. No crash.
  • Blank test page: flv.js pulls and plays the stream with the same parameters as the classroom page. Page memory shows fluctuations of more than 200 MB. No crash.
  • Sample test page: Loki-Player in FLV mode pulls and plays the stream. Memory surges until the page crashes.
  • Sample test page: the IMweb-TCPlayer wrapper package pulls the stream as on the classroom page. Memory surges until the page crashes.
  • Blank test page: TCPlayer pulls and plays the stream. Memory stays stable. No crash.
  • Sample test page: flv.js pulls and plays the stream. The problem still occurs. This page includes eruda.

The problem reproduces reliably. It is worth noting that this is not the same as a traditional JS memory leak:

The Memory panel in DevTools shows nothing wrong, yet the task manager shows the tab's memory usage skyrocketing.

Narrowing down the code

Based on experiments 5, 6, and 8 above (the Loki-Player page, the IMweb-TCPlayer page, and the flv.js sample page with eruda), and on the fact that the leak is outside the JavaScript heap, we compared the code differences. The leak is most likely related to logic that hijacks network requests:

Both the eruda debugging tool and the internal reporting tool that the player depends on contain logic that hijacks network requests. Because ordinary users never load eruda, the internal reporting tool must be the culprit.

For how Chrome implements network requests, see:

  • Multi-process architecture
  • Multi-process resource loading

Supplementary reference:

  • From browser multi-process to JS single thread: a comprehensive walkthrough of the JS runtime mechanism
  • Inside look at modern web browser (part 2)

Our multi-process application can be viewed in three layers. At the lowest layer is the Blink engine which renders pages. Above that are the renderer process (simplistically, one-per-tab), each of which contains one Blink instance. Managing all the renderers is the browser process, which controls all network accesses.

The renderer process reads the response data for its requests from the browser process over IPC, and this inter-process communication (IPC) is implemented with shared memory.

Combine this with Chrome's documents on how memory is measured:

  • Fix Memory Problems
  • Memory Usage Backgrounder
  • Consistent Chrome Memory Metrics

Define the memory footprint of a process to be the amount of memory that would become available to the system if the process were killed. More specifically, define:

  • Physical Memory Footprint: (Number of physical pages that would become available if the process were killed).
  • Swapped Memory Footprint: (Number of pre-compression pages in swap or compressed memory that would become available if the process were killed).
  • Memory Footprint: Physical + Swapped.

Shared memory is therefore counted under Swapped Memory Footprint in the Memory column of the task manager, which is how the problem becomes visible there. We then manually commented out the hijacking logic of the internal reporting tool and, as expected, the problem no longer reproduced. Reviewing the hijacking code, we found:

origFetch.apply(this, args).then((response) => {
  try {
    loggerFetch.status = response.status;
    loggerFetch.cost = Date.now() - loggerFetch.start;
    loggerFetch.resContentType = response.headers.get('Content-Type');

    // This clone().text() call keeps the FLV stream's response data referenced
    const cloned = response.clone();

    // A live HTTP-FLV stream keeps sending data, so this Promise does not
    // settle until the live stream ends. In most cases the page has crashed
    // from memory pressure long before that.
    cloned.text().then((text) => {
      loggerFetch.response = text;
      Network.trigger('fetch', loggerFetch);
    });
  } catch (e) {
    Network.trigger('fetch', loggerFetch);
  }
  return response;
});

Because clone().text() on the Response instance waits for the complete body of the HTTP-FLV stream, the response data received so far stays referenced (see an introduction to JavaScript GC) and is only released when the live broadcast ends. In most cases the page has crashed from memory pressure long before then.
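The behaviour can be reproduced outside the player with a small sketch. It assumes a runtime that provides the standard Response, ReadableStream, and TextEncoder globals (a modern browser or Node.js 18+). The function name demoPendingClone and the chunk contents are made up for illustration and are not part of the reporting tool.

```javascript
// Hypothetical demo: shows that text() on a cloned Response stays pending
// for as long as the underlying stream is still open.
async function demoPendingClone() {
  let controller;
  // A stream we control by hand, standing in for a live HTTP-FLV response.
  const body = new ReadableStream({
    start(c) {
      controller = c;
    },
  });

  const response = new Response(body, {
    headers: { 'Content-Type': 'video/x-flv' },
  });
  const cloned = response.clone();

  let settled = false;
  const textPromise = cloned.text().then((text) => {
    settled = true;
    return text;
  });

  // The "server" keeps sending data; every chunk is buffered for text().
  const encoder = new TextEncoder();
  controller.enqueue(encoder.encode('chunk-1'));
  controller.enqueue(encoder.encode('chunk-2'));

  // Give the runtime time to process the chunks.
  await new Promise((resolve) => setTimeout(resolve, 10));
  const settledWhileOpen = settled;

  // Only closing the stream (the live broadcast ending) lets text() resolve.
  controller.close();
  const text = await textPromise;

  return { settledWhileOpen, text };
}
```

In this sketch settledWhileOpen reports whether text() had resolved while the stream was still open, and text holds everything that was buffered in the meantime. On a real live stream that buffer keeps growing for the whole broadcast.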

Fixing the problem

The response content is now read only for whitelisted content types: application/json, text/plain, and text/xml.

const defaultOpts = {
  // Whitelist and blacklist of content types whose response body is recorded
  resContentTypePatterns: {
    includes: [/application\/json/, /text\/xml/, /text\/plain/],
    excludes: [/video/],
  },
};

if (multiMatch(loggerFetch.resContentType, defaultOpts.resContentTypePatterns)) {
  // ...
}
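The article does not show multiMatch itself. Below is a minimal sketch of what such a helper could look like, assuming that excludes take precedence over includes and that a missing Content-Type never matches. Both rules are assumptions for illustration, not the tool's confirmed behaviour.

```javascript
// Hypothetical helper: the real multiMatch is not shown in the article.
// Returns true only when `value` matches at least one `includes` pattern
// and no `excludes` pattern.
function multiMatch(value, patterns) {
  // A missing or empty Content-Type is treated as "do not record".
  if (typeof value !== 'string' || value === '') {
    return false;
  }
  const { includes = [], excludes = [] } = patterns || {};

  // Assumption: the blacklist wins over the whitelist.
  if (excludes.some((re) => re.test(value))) {
    return false;
  }
  return includes.some((re) => re.test(value));
}

// Example with the patterns from the fix above.
const examplePatterns = {
  includes: [/application\/json/, /text\/xml/, /text\/plain/],
  excludes: [/video/],
};
const recordJson = multiMatch('application/json; charset=utf-8', examplePatterns);
const recordFlv = multiMatch('video/x-flv', examplePatterns);
```

With a check like this placed before response.clone(), a streaming video response is never cloned or read, so its data is not held by the reporting tool.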

Additional notes

Why does the internal reporting tool hijack window.fetch?

Because the internal reporting tool automatically captures each Ajax request's duration, return code, and other data in order to monitor CGI performance and success rates.
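A hook of this kind can collect duration and status without ever reading the response body. The sketch below is one way to do that. The names installFetchTimer, report, and the record fields are hypothetical and do not reflect the internal tool's real API.

```javascript
// Hypothetical sketch: wraps target.fetch to record duration and status only.
// It never clones or reads the body, so streaming responses are unaffected.
function installFetchTimer(target, report, now = Date.now) {
  const origFetch = target.fetch;

  // Monitoring is best-effort: nothing in here may break the page.
  const finish = (record, fill) => {
    try {
      fill(record);
      record.cost = now() - record.start;
      report(record);
    } catch (e) {
      // Intentionally ignored.
    }
  };

  target.fetch = function hookedFetch(...args) {
    const record = { start: now() };
    const promise = origFetch.apply(this, args);

    promise.then(
      (response) => finish(record, (r) => { r.status = response.status; }),
      (error) => finish(record, (r) => { r.error = String(error); })
    );

    // Return the original promise so callers get the untouched Response.
    return promise;
  };

  // Allow the hook to be removed again.
  return function uninstall() {
    target.fetch = origFetch;
  };
}
```

In a browser this could be installed with installFetchTimer(window, (record) => console.log(record)). Returning the original promise rather than a derived one keeps the hook from changing what callers observe.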

Conclusion

This experience is a reminder to treat the code we write with respect. Anything that hijacks a global method must be implemented carefully and rigorously, with solid fallback logic. Only then can we keep the business available and stable and do right by our users.

In addition, monitoring the core pages of audio/video scenarios so that memory leaks are exposed early is a real challenge. The team plans to fill this gap later.









