Preface
I came across an article a few days ago that really struck a chord:
Bytedance Interviewer: Please implement large file upload and breakpoint resume
The author implements, from scratch, chunked upload of large files, resumable upload, instant upload, pause, and more, giving a thorough yet approachable analysis of the whole interview question.
Rather than just heaping on more praise, I decided to ride the momentum: I recorded a video of the full coding process and followed up with this article. Do read the article above before this one — below are the extensions I've made:
- Computing the hash is time-consuming. Besides web-worker, we can borrow from React's Fiber architecture and use requestIdleCallback to exploit the browser's idle time without blocking the main thread
- The file hash is computed only to decide whether the file already exists, which is what enables instant upload. So we can borrow the Bloom filter's idea of trading a little accuracy for time: hash a sample of the file instead of all of it
- The previous article used web-worker to keep the hash computation from blocking the main thread, but a large file produces so many slices that firing all the HTTP requests at the browser at once still hangs it (I tried a 4GB file and it froze outright). We can control the concurrency of the async requests, which I believe is itself a top interview question
- The upload progress of each slice doesn't need to be displayed in a table; a grid of square progress cells is more intuitive (as shown in the figure)
- How do we retry when a concurrent upload errors? For example, allow each slice two retries and terminate after the third failure
- Since file sizes vary, fixing every slice at the same size is a bit awkward. We can borrow the slow start strategy from the TCP protocol: set an initial size, then dynamically adjust the size of the next slice based on how the upload went, keeping slice size matched to the current network speed
- Small experience optimizations, such as a reminder if you leave the page while uploading
- File fragment cleanup on the server

In the demo, slices that already exist (instant upload) are green, the ones being uploaded are blue, and the concurrency is 4. Enough talk — let's get coding.
Computing the file hash in time slices
This is in fact the time-slicing idea at the core of React's Fiber architecture: use the browser's idle time to compute the large diff, and let any higher-priority task, such as animation or input, interrupt the diff work along the way. The total amount of computation isn't reduced at all, but the user's interactive experience improves greatly.
This is probably the most popular way to open React Fiber
requestIdleCallback
The window.requestIdleCallback() method queues a function to be called during the browser's idle periods. This lets developers perform background and low-priority work on the main event loop. The callback is passed a deadline parameter, from which it can learn the remaining time in the current frame, as follows
requestIdleCallback(myNonEssentialWork);

function myNonEssentialWork(deadline) {
  // deadline.timeRemaining() returns the time left in the current frame
  // Keep working while the frame has time and the task queue is not empty
  while (deadline.timeRemaining() > 0 && tasks.length > 0) {
    doWorkIfNeeded();
  }
  // If tasks remain, schedule another idle callback
  if (tasks.length > 0) {
    requestIdleCallback(myNonEssentialWork);
  }
}
The structure of deadline is as follows
interface Deadline {
  didTimeout: boolean // whether the callback ran past the agreed deadline
  timeRemaining(): DOMHighResTimeStamp // time remaining for the task to execute
}
In the two frames shown in the figure, the task and rendering take up only part of each frame instead of filling it; the stretch marked "idle period" is the idle time. How much idle time a frame has depends on the amount and complexity of work done in it, so idle time is not constant. Each call to deadline.timeRemaining() returns the time (in ms) left until the end of the frame in which the idle callback runs.
Time slice calculation
Let's pick up the code from the previous article and rework our calculateHash
async calculateHashIdle(chunks) {
  return new Promise(resolve => {
    const spark = new SparkMD5.ArrayBuffer();
    let count = 0;
    // Read one chunk and feed its content into the incremental md5
    const appendToSpark = async file => {
      return new Promise(resolve => {
        const reader = new FileReader();
        reader.readAsArrayBuffer(file);
        reader.onload = e => {
          spark.append(e.target.result);
          resolve();
        };
      });
    };
    const workLoop = async deadline => {
      // There are chunks left and the current frame still has time
      while (count < chunks.length && deadline.timeRemaining() > 1) {
        await appendToSpark(chunks[count].file);
        count++;
        if (count < chunks.length) {
          // Not finished yet: update the progress
          this.hashProgress = Number(((100 * count) / chunks.length).toFixed(2));
        } else {
          // Finished: resolve with the final hash
          this.hashProgress = 100;
          resolve(spark.end());
        }
      }
      // Out of frame budget: continue in the next idle period
      if (count < chunks.length) {
        window.requestIdleCallback(workLoop);
      }
    };
    window.requestIdleCallback(workLoop);
  });
},
While the hash is being computed, try typing into an input box on the page: no lag at all. That's the power of time slicing.
Sampling the hash
Computing the file's MD5 is only for deciding whether the file already exists, so we can design a sampling hash that trades a small loss of accuracy for speed. The idea is as follows
- Cut the file into 2MB slices
- Take the entire first and last slices, and 2 bytes each from the start, middle, and end of every other slice
- Merge the sampled contents and compute the md5 — call it the shadow hash
- If this hash says the file exists, there is a small chance of misjudgment; if it says the file does not exist, that is 100% accurate. This is the same idea as a Bloom filter, and the two hashes can be used together: rule files out with the sample hash first
- On my machine a 1.5GB file takes about 20 seconds for the full hash but only about 1 second for the sample, which is good enough for quickly determining that a file does not exist
- Clever, right?
sample md5: 1028.006103515625ms
full md5:   21745.13916015625ms
async calculateHashSample() {
  return new Promise(resolve => {
    const spark = new SparkMD5.ArrayBuffer();
    const reader = new FileReader();
    const file = this.container.file;
    // File size
    const size = this.container.file.size;
    let offset = 2 * 1024 * 1024;
    // The first 2MB slice is taken in full
    let chunks = [file.slice(0, offset)];
    let cur = offset;
    while (cur < size) {
      if (cur + offset >= size) {
        // Take the last slice in full as well
        chunks.push(file.slice(cur, cur + offset));
      } else {
        // For middle slices, take 2 bytes each from the start, middle and end
        const mid = cur + offset / 2;
        const end = cur + offset;
        chunks.push(file.slice(cur, cur + 2));
        chunks.push(file.slice(mid, mid + 2));
        chunks.push(file.slice(end - 2, end));
      }
      cur += offset;
    }
    // Concatenate the samples and hash the result
    reader.readAsArrayBuffer(new Blob(chunks));
    reader.onload = e => {
      spark.append(e.target.result);
      resolve(spark.end());
    };
  });
}
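A natural way to use the two hashes together, Bloom-filter style, is to let the cheap sample hash rule files out and only pay for the full hash when the server thinks it has seen the file before. A minimal sketch of that flow — checkFileExists is a hypothetical API (it assumes the server indexes uploads by both hashes), not something from the article's code:

// Hedged sketch of a two-stage existence check.
// checkFileExists(hash) is hypothetical: it asks the server whether
// a file with this hash has already been fully uploaded.
async handleVerify() {
  // Stage 1: the cheap sample hash. "Not found" is 100% reliable,
  // so we can start uploading right away.
  const sampleHash = await this.calculateHashSample();
  const { exists } = await checkFileExists(sampleHash);
  if (!exists) {
    return this.uploadChunks();
  }
  // Stage 2: "found" may be a false positive, so confirm with the
  // full, time-sliced hash before declaring an instant upload.
  const fullHash = await this.calculateHashIdle(this.chunks);
  const { exists: confirmed } = await checkFileExists(fullHash);
  if (!confirmed) {
    return this.uploadChunks();
  }
  // The file really is on the server already: instant upload, send nothing
  this.hashProgress = 100;
}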
Network request concurrency control
After the hash is computed, firing off hundreds of HTTP requests at once — each setting up its own TCP connection — kills the browser. And if I remember right, controlling the concurrency of async requests is itself a Bytedance interview question.
The idea isn't hard: put the async requests in a queue. With a concurrency of 3, say, start three requests at once, and whenever one finishes, launch the next. The idea is clear, so on to the code.
We track the free channels with max: starting a request does max--, and a finishing request does max++.
+async sendRequest(forms, max = 4) {
+  return new Promise(resolve => {
+    const len = forms.length;
+    let idx = 0;
+    let counter = 0;
+    const start = async () => {
+      // While there are pending requests and free channels
+      while (idx < len && max > 0) {
+        max--; // occupy a channel
+        console.log(idx, "start");
+        const form = forms[idx].form;
+        const index = forms[idx].index;
+        idx++;
+        request({
+          url: '/upload',
+          data: form,
+          onProgress: this.createProgresshandler(this.chunks[index]),
+          requestList: this.requestList
+        }).then(() => {
+          max++; // release the channel
+          counter++;
+          if (counter === len) {
+            resolve();
+          } else {
+            start();
+          }
+        });
+      }
+    };
+    start();
+  });
+}

async uploadChunks(uploadedList = []) {
  // Build a form for every chunk the server doesn't already have
  const list = this.chunks
    .filter(chunk => uploadedList.indexOf(chunk.hash) == -1)
    .map(({ chunk, hash, index }) => {
      const form = new FormData();
      form.append("chunk", chunk);
      form.append("hash", hash);
      form.append("filename", this.container.file.name);
      form.append("fileHash", this.container.hash);
      return { form, index };
    });
-    .map(({ form, index }) =>
-      request({
-        url: "/upload",
-        data: form,
-        onProgress: this.createProgresshandler(this.chunks[index]),
-        requestList: this.requestList
-      })
-    );
-  // Direct full concurrency
-  await Promise.all(list);
+  // Control the concurrency
+  const ret = await this.sendRequest(list, 4);
  if (uploadedList.length + list.length === this.chunks.length) {
    // All chunks are up; ask the server to merge them
    await this.mergeRequest();
  }
},
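Stripped of the upload specifics, the same queue idea fits in a dozen lines. Here is a standalone sketch (limitConcurrency is my name for it, not from the article) that runs an array of promise-returning functions with at most max in flight:

// Run tasks (functions returning promises) with limited concurrency.
// Resolves with all results once every task has finished.
function limitConcurrency(tasks, max = 4) {
  return new Promise((resolve, reject) => {
    const results = [];
    let idx = 0;  // next task to start
    let done = 0; // finished tasks
    const runNext = () => {
      if (done === tasks.length) return resolve(results);
      // Start tasks while there are pending ones and free channels
      while (idx < tasks.length && max > 0) {
        const cur = idx++;
        max--; // occupy a channel
        tasks[cur]().then(res => {
          results[cur] = res;
          max++; // release the channel
          done++;
          runNext();
        }, reject);
      }
    };
    runNext();
  });
}

Called as limitConcurrency(list.map(({ form }) => () => request({ url: '/upload', data: form })), 4), it behaves like sendRequest above.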
See? I've done another of Bytedance's interview questions along the way — though I'm still not sure I'd pass their interview.
Slow start policy implementation
Borrowing from TCP congestion control, the task here is to dynamically adjust the slice size to the current network conditions

- Each chunk carries a size value, so the number of progress cells is no longer known up front; createFileChunk needs modifying, and each request gets timed
- Say we want each slice to upload in about 30 seconds
- The initial size is 1M; if the upload takes 10 seconds, the next slice grows to 3M
- If the upload takes 60 seconds, the next slice shrinks to 500KB, and so on
- Combining concurrency with slow start gets a bit involved and I haven't worked it out yet, so to demonstrate the logic I upload only one slice at a time, in a new handleUpload1 function

async handleUpload1() {
  // @todo the size adjustment could be gentler
  // @todo combine concurrency with slow start
  // Slow-start upload logic
  const file = this.container.file;
  if (!file) return;
  this.status = Status.uploading;
  const fileSize = file.size;
  let offset = 1024 * 1024; // initial slice size: 1M
  let cur = 0;
  let count = 0;
  this.container.hash = await this.calculateHashSample();
  while (cur < fileSize) {
    // Cut off a slice of the current size
    const chunk = file.slice(cur, cur + offset);
    cur += offset;
    const chunkName = this.container.hash + "-" + count;
    const form = new FormData();
    form.append("chunk", chunk);
    form.append("hash", chunkName);
    form.append("filename", file.name);
    form.append("fileHash", this.container.hash);
    form.append("size", chunk.size);

    let start = new Date().getTime();
    await request({ url: '/upload', data: form });
    const now = new Date().getTime();
    const time = ((now - start) / 1000).toFixed(4);
    // Ratio of the actual time to the 30-second target
    let rate = time / 30;
    // Clamp the factor so the size never changes more than 2x at once
    if (rate < 0.5) rate = 0.5;
    if (rate > 2) rate = 2;
    console.log(`slice ${count}: size ${this.format(offset)}, took ${time}s, ${rate}x the 30s target, next size ${this.format(offset / rate)}`);
    // Adjust the size of the next slice
    offset = parseInt(offset / rate);
    count++;
  }
}
Throttle the network to Slow 3G in devtools to see the effect
slice 0: size 1024.00KB, took 13.2770s, 0.5x the 30s target, next size 2.00MB
slice 1: size 2.00MB, took 25.4130s, 0.8471x the 30s target, next size 2.36MB
slice 2: size 2.36MB, took 14.1260s, 0.5x the 30s target, next size 4.72MB
done
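On the first @todo above: one way to make the adjustment gentler is to replace the hard clamp with an inverse-trigonometric mapping, so the resize factor varies smoothly and stays inside (0.5, 1.5). A sketch of the idea, not code from the original article:

// Map (upload time / target time) through atan so the factor changes
// smoothly: time == target gives 1 (size unchanged), a very slow upload
// approaches 1.5 (shrink the next slice), a very fast one approaches 0.5
// (grow it). Use it as: offset = parseInt(offset / smoothRate(time))
function smoothRate(time, target = 30) {
  return 1 + Math.atan(time / target - 1) / Math.PI;
}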
Progress bar optimization
This is a small optimization that lets us see at a glance which chunks already exist and how many are uploading concurrently; it's inspired by the look of a disk-defragmenter scan
<div class="cube-container" :style="{width:cubeWidth+'px'}">
<div class="cube"
v-for="chunk in chunks"
:key="chunk.hash">
<div
:class="{ 'uploading':chunk.progress>0&&chunk.progress<100, 'success':chunk.progress==100 }"
:style="{height:chunk.progress+'%'}"
>
<i v-if="chunk.progress>0&&chunk.progress<100" class="el-icon-loading" style="color:#F56C6C;"></i>
</div>
</div>
</div>
.cube-container
  width 100px
  overflow hidden
.cube
  width 14px
  height 14px
  line-height 12px
  border 1px solid black
  background #eee
  float left
  >.success
    background #67C23A
  >.uploading
    background #409EFF
// Keep the grid roughly square: width = ceil(sqrt(slice count)) * 16px per cell
cubeWidth() {
  return Math.ceil(Math.sqrt(this.chunks.length)) * 16;
},
Here's the effect once more 🐶
Concurrent upload: errors and retries
- When a request errors, the catch handler puts the task back into the queue
- On error, set that slice's progress to -1 so its cell renders red
- An array retryArr records the error count per slice: [1, 0, 2] means slice 0 has errored once and slice 2 twice
- Once a slice has errored too many times (twice in the code below), reject outright
First, make the backend fail randomly to simulate errors
if (Math.random() < 0.5) {
  // Fail with 50% probability
  console.log('simulated random error');
  res.statusCode = 500;
  res.end();
  return;
}
async sendRequest(urls, max = 4) {
-  return new Promise(resolve => {
+  return new Promise((resolve, reject) => {
    const len = urls.length;
    let idx = 0;
    let counter = 0;
+   const retryArr = [];
    const start = async () => {
      // While there are pending tasks and free channels
-     while (idx < len && max > 0) {
+     while (counter < len && max > 0) {
+       // Don't pick tasks with a plain cursor; pick by status instead,
+       // so that both waiting and errored tasks can be sent (= retries)
+       const i = urls.findIndex(v => v.status == Status.wait || v.status == Status.error);
+       if (i === -1) break; // nothing sendable now; in-flight requests will call start() again
+       urls[i].status = Status.uploading;
        max--; // occupy a channel
-       const form = urls[idx].form;
-       const index = urls[idx].index;
-       idx++;
+       const form = urls[i].form;
+       const index = urls[i].index;
+       if (typeof retryArr[index] == 'number') {
+         console.log(index, 'start retry');
+       }
        request({
          url: '/upload',
          data: form,
          onProgress: this.createProgresshandler(this.chunks[index]),
          requestList: this.requestList
        }).then(() => {
+         urls[i].status = Status.done;
          max++; // release the channel
          counter++;
          if (counter === len) {
            resolve();
          } else {
            start();
          }
-       });
+       }).catch(() => {
+         urls[i].status = Status.error;
+         if (typeof retryArr[index] !== 'number') {
+           retryArr[index] = 0;
+         }
+         // count this error
+         retryArr[index]++;
+         // give up once the same slice has errored twice
+         if (retryArr[index] >= 2) {
+           return reject();
+         }
+         console.log(index, retryArr[index], 'error');
+         // mark the cell red; the error status keeps the task in the queue
+         this.chunks[index].progress = -1;
+         max++; // release the channel, but counter is not incremented
+         start();
+       });
      }
    };
    start();
  });
}
As shown in the figure, a cell turns red when its slice errors, and the slice is then retried
File fragment cleanup
If many users abandon uploads halfway, the leftover slices are useless, so consider cleaning them up periodically. We can use node-schedule to manage the scheduled task, e.g. scan the target directory once a day and delete fragments that haven't been modified for a month
// For this demo: scan every 5 seconds and delete fragments older than 1 minute
const fse = require('fs-extra')
const path = require('path')
const schedule = require('node-schedule')

// Delete a fragment file if it has expired
function remove(file, stats) {
  const now = new Date().getTime()
  const offset = now - stats.ctimeMs
  if (offset > 1000 * 60) {
    // The fragment is older than 60 seconds
    console.log(file, 'expired, wasting space, delete it')
    fse.unlinkSync(file)
  }
}

async function scan(dir, callback) {
  const files = fse.readdirSync(dir)
  files.forEach(filename => {
    const fileDir = path.resolve(dir, filename)
    const stats = fse.statSync(fileDir)
    if (stats.isDirectory()) {
      return scan(fileDir, remove)
    }
    if (callback) {
      callback(fileDir, stats)
    }
  })
}

// *    *    *    *    *    *
// ┬    ┬    ┬    ┬    ┬    ┬
// │    │    │    │    │    │
// │    │    │    │    │    └ day of week (0 - 7) (0 or 7 is Sun)
// │    │    │    │    └───── month (1 - 12)
// │    │    │    └────────── day of month (1 - 31)
// │    │    └─────────────── hour (0 - 23)
// │    └──────────────────── minute (0 - 59)
// └───────────────────────── second (0 - 59, OPTIONAL)
let start = function (UPLOAD_DIR) {
  // Run every 5 seconds
  schedule.scheduleJob("*/5 * * * * *", function () {
    console.log('start scanning')
    scan(UPLOAD_DIR)
  })
}
exports.start = start
start scanning
/upload/target/625c.../625c... expired, wasting space, delete it
/upload/target/625c.../625c... expired, wasting space, delete it
/upload/target/625c.../625c... expired, wasting space, delete it
/upload/target/625c.../625c... expired, wasting space, delete it
/upload/target/625c.../625c...-12 expired, wasting space, delete it
Follow-up expansion and thinking
I'll leave a few questions to think about here and implement them in a future article — it keeps the momentum going
- requestIdleCallback compatibility: how would you implement a fallback yourself? (a rough sketch follows this list)
- React doesn't rely on it either: it ships its own scheduling logic, its own requestIdleCallback equivalent; I'll write that up some other time
- Coordinating concurrency with slow start
- Combining the sampling hash, the full hash, and time slicing
- Sliced download of large files (a sketch also follows this list)
- The slicing logic is the same: fetch Content-Length with an axios.head request
- Then use the HTTP Range header to download each slice; beyond that, the logic mirrors uploading
- Small experience optimizations
- Tips such as a reminder before leaving the page, and so on
- The slow-start change should be smoother, e.g. use a trigonometric function to confine the rate of change smoothly between 0.5 and 1.5 (see the sketch in the slow start section)
- Push upload progress over a websocket
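For the first question, a rough fallback can be faked with setTimeout and a fixed time budget. This is the widely used shim — a sketch, not a real scheduler (React, for its part, builds its own scheduling on top of other primitives rather than relying on requestIdleCallback):

// Fallback for browsers without requestIdleCallback (e.g. Safari).
// It pretends every "frame" grants up to 50ms, losing real frame
// alignment but keeping the calling code working unchanged.
window.requestIdleCallback =
  window.requestIdleCallback ||
  function (handler) {
    const start = Date.now();
    return setTimeout(() => {
      handler({
        didTimeout: false,
        timeRemaining() {
          return Math.max(0, 50 - (Date.now() - start));
        }
      });
    }, 1);
  };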
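And for sliced download, the skeleton mirrors the upload. A hedged sketch (downloadInChunks is a made-up name): a HEAD request for the total size, then one ranged GET per slice; the sequential loop here could be swapped for the concurrency control from earlier:

// Download a file in byte-range slices and reassemble it as a Blob
const axios = require('axios');

async function downloadInChunks(url, chunkSize = 2 * 1024 * 1024) {
  // Content-Length tells us how many slices are needed
  const head = await axios.head(url);
  const size = Number(head.headers['content-length']);
  const chunks = [];
  for (let cur = 0; cur < size; cur += chunkSize) {
    const end = Math.min(cur + chunkSize, size) - 1;
    // Range is inclusive on both ends: bytes=start-end
    const { data } = await axios.get(url, {
      responseType: 'arraybuffer',
      headers: { Range: `bytes=${cur}-${end}` }
    });
    chunks.push(data);
  }
  return new Blob(chunks);
}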
Thinking and summarizing
- Any seemingly simple requirement becomes complex as the scale grows — and so does life
- File upload is simple; large file upload is complex. Single machine is simple; distributed is hard
- Even a simple leftPad function (left-padding a string) drew plenty of discussion once performance mattered — the doubling/binary approach beats array join
- On the implementation of the left-pad function
- Any knowledge point is worth digging into deeply
- Next time the product manager says a requirement is simple, I'll have something to say about it
- I'm going to follow up on the previous post and record a hands-on video walkthrough of large file upload — stay tuned
Code
The first half of the code is adapted from @yeyan1996; the later code mainly illustrates the ideas and the implementation is rough, so go easy on me: github.com/shengxinjin…
Welcome to follow me
References
Some of the images are taken directly from the links below
- A guide to all kinds of file uploads for front-end beginners
- Bytedance Interviewer: Please implement large file upload and breakpoint resume
- This is probably the most popular way to open React Fiber
- The Bloom filter in plain language
- requestIdleCallback
- TCP congestion control
- File slice download implemented in Python
- requestIdleCallback — background task scheduling