How to efficiently upload large files?

Overview of the process

The overall process is roughly divided into:

  1. Hashing the file chunks;
  2. Concurrent upload;
  3. Processing request responses and tracking upload progress

The following sections analyze these steps for different situations. In addition, drawing on how HTTP/TCP works, we add corresponding optimizations for resumable upload (breakpoint continuation), retransmission, chunking, and slow start.

Hashing the chunks

  • Creating a file hash method

In the simple single-file case, spark-md5 gives the most straightforward way to hash the file

import SparkMD5 from 'spark-md5'
const spark = new SparkMD5.ArrayBuffer()

// Hash an entire file with spark-md5
function hash(file) {
  return new Promise(resolve => {
    const read = new FileReader()
    read.readAsArrayBuffer(file)
    read.onload = function () {
      spark.append(this.result)
      const result = spark.end()
      resolve(result)
    }
  })
}

If the file is very large, computing the hash of the whole file takes a long time: hashing a 10GB file can take several minutes. That cost is too high, so for large files we hash slices instead.

  • Sampling hash for large files

For files larger than 10MB, we slice the file, sample part of each slice, and hash only the samples, which speeds up the computation by tens of times.

var blobSlice = File.prototype.slice || File.prototype.mozSlice || File.prototype.webkitSlice

// Hash a file; for files over 10MB, hash only a sample
function hash(file) {
  return new Promise(async resolve => {
    let file_ = file
    const read = new FileReader()
    if (file.size > 10 * 1024 * 1024) {
      file_ = await getSample(file)
    }
    read.readAsArrayBuffer(file_)
    read.onload = function () {
      spark.append(this.result)
      const result = spark.end()
      resolve(result)
    }
  })
}

// Build the sample: the first 2MB plus a 20% sample of each of 10 slices of the remainder
function getSample(file) {
  return new Promise(resolve => {
    const start = 2 * 1024 * 1024
    const chunks = [blobSlice.call(file, 0, start)]
    let surplusFile = blobSlice.call(file, start)
    const step = parseInt(surplusFile.size / 10) // split the remainder into 10 slices
    const useSize = parseInt(step / 5)           // sample 20% of each slice
    let i = 0
    while (i < 10) {
      const slice = blobSlice.call(surplusFile, start + step * i, start + step * (i + 1))
      chunks.push(blobSlice.call(slice, 0, useSize))
      i++
    }
    resolve(new Blob(chunks, { type: file.type || '' }))
  })
}
  • Batch hashing of the chunks

Chunk hashing can be organized in two ways. One is to fire an upload request as soon as each chunk's hash is computed, which leads to frequent switching back and forth on the call stack. The other is to compute all chunk hashes first and then send the requests together; however, hashing everything in one batch inevitably ties up the JS main thread. There are two ways to mitigate that:

  1. Use requestIdleCallback to schedule the chunk hash computation during browser idle time
  2. Compute the hash in a Web Worker and notify the main thread when the calculation is complete

The former can be starved when other tasks occupy the JS thread for a long time; the latter avoids that problem but makes it harder to report hash progress from the main thread. Here we use requestIdleCallback (a sketch of the Web Worker alternative follows the code below for comparison).

/** Split the file into chunks and hash each chunk during idle time */
function splitHash(files, chunkSize) {
  const res = []
  return new Promise(resolve => {
    let current = 0
    const chunksCount = Math.ceil(files.size / chunkSize)
    const read = new FileReader()
    let currentSlice = null
    function fn() {
      const start = chunkSize * current
      const end = start + chunkSize > files.size ? files.size : start + chunkSize
      currentSlice = blobSlice.call(files, start, end)
      read.readAsArrayBuffer(currentSlice)
    }
    read.onload = function () {
      current++
      // Set hash progress
      setProgress1((current / chunksCount).toFixed(2) * 100)
      spark.append(this.result)
      res.push({ hash: spark.end(), file: currentSlice })
      if (current >= chunksCount) {
        return resolve(res)
      }
      requestIdleCallback(fn)
    }
    requestIdleCallback(fn)
  })
}
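
For comparison, here is a minimal sketch of the Web Worker alternative mentioned above; the worker file name, the CDN URL for spark-md5, and the message format are assumptions for illustration rather than part of the original code.

// hash.worker.js: a sketch of hashing chunks inside a Web Worker (file name is hypothetical)
importScripts('https://cdn.jsdelivr.net/npm/spark-md5/spark-md5.min.js') // assumed CDN build of spark-md5

self.onmessage = async function (e) {
  const { file, chunkSize } = e.data
  const spark = new self.SparkMD5.ArrayBuffer()
  const chunksCount = Math.ceil(file.size / chunkSize)
  const res = []
  for (let current = 0; current < chunksCount; current++) {
    const start = chunkSize * current
    const end = Math.min(start + chunkSize, file.size)
    const buffer = await file.slice(start, end).arrayBuffer()
    spark.append(buffer)
    res.push({ hash: spark.end(), index: current })
  }
  // Notify the main thread when all chunk hashes are ready
  self.postMessage(res)
}

The main thread would create the worker with new Worker('hash.worker.js'), post { file, chunkSize } to it, and receive the chunk hashes in its onmessage handler.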

Request encapsulation

  • Controlling request timeouts locally
/** A request wrapper with a locally controlled timeout */
function request(options, timeOut = 30 * 1000) {
  const controller = new AbortController()
  const signal = controller.signal
  const timeoutFunction = function (timer) {
    return new Promise(resolve => {
      setTimeout(() => {
        // Simulate a response body for the timed-out request
        resolve(new Response(JSON.stringify({
          code: 999,
          message: 'time out',
          success: false,
        })))
        controller.abort() // abort the request locally; its status becomes 'canceled'
      }, timer)
    })
  }
  return Promise.race([
    timeoutFunction(timeOut),
    fetch(options.url, {
      signal: signal,
      method: 'post',
      body: options.data,
    })
  ])
}
  • Regarding the browser's limit on concurrent HTTP requests: an HTTP connection can only handle one request at a time, so a manager is needed to coordinate them, namely the connection pool manager.

A thread pool creates a number of threads and puts them in an idle queue before any task arrives. These threads are all asleep: they have been started, consume no CPU, and occupy only a small amount of memory. When a new client request comes in, the pool wakes one of the sleeping threads to handle it; once the request has been processed, the thread goes back to sleep.

Different browsers have different strategies for controlling concurrency

Here the default number of concurrent upload requests is 3, which leaves room for other HTTP operations under the same domain name.

/** Upload the form-data chunks with a small concurrent pool, retrying each failed chunk up to three times */
const failMap = {} // failure count per chunk hash
function concurrentRequest(formDataArr) {
  let loadedNum = 0
  const total = formDataArr.length
  const results = []
  const concurrentNum = 3
  let i = 0
  return new Promise(resolve => {
    async function doing() {
      while (i < concurrentNum && formDataArr.length) {
        i++
        const data = formDataArr.shift()
        const response = await request({ url: url_, data: data })
        const res = await response.json()
        i--
        if (res.success && res.code === 0) {
          loadedNum++
          // Set upload progress (setProgress2 is assumed to be the UI progress setter, analogous to setProgress1)
          setProgress2((loadedNum / total).toFixed(2) * 100)
        } else if ((failMap[data.get('hash')] || 0) < 3) {
          if (data && data.get) {
            failMap[data.get('hash')] = failMap[data.get('hash')] ? failMap[data.get('hash')] + 1 : 1
            formDataArr.unshift(data) // retransmit if it has failed fewer than three times
          }
        } else {
          // Custom error code: a chunk failed too many times
          resolve({ code: -1, success: false })
        }
        if (formDataArr.length) doing()
        results.push({ hash: data.get('hash'), code: res.code })
        if (loadedNum === total) resolve({ code: 0, data: results, success: true })
      }
    }
    doing()
  })
}

Slow start: finding a suitable chunk size

Here we borrow the slow-start principle from TCP, which HTTP runs on. If a newly established connection immediately sent large packets and the network had problems, those packets would pile up in routers and could exhaust their buffer space, causing congestion. TCP therefore does not let a new connection send large amounts of data from the start: it begins with small packets, uses the acknowledgements it receives to estimate the other side's receiving rate, and gradually increases the amount of data per send until it reaches a stable value and enters a high-speed transfer phase. During slow start, the TCP channel is in a low-speed phase.

Slow start has a drawback: it can take hundreds of milliseconds for the client and server to approach maximum speed. For large streaming transfers this matters little, because the slow-start time is absorbed over the whole transfer. But many HTTP connections, especially short, bursty ones, terminate before the maximum window is ever reached, so the performance of many web applications is constrained by the round-trip time between server and client: slow start limits the available throughput, which hurts small transfers. Besides throttling new connections, TCP also implements slow-start restart (SSR) for idle connections. Because of these drawbacks, in our slow-start phase we do not chunk immediately; by default we only expect a single probe chunk to upload within 10 seconds, and the expected time can be passed in as a parameter to adapt to the server and network.

/** Slow start: probe for a suitable chunk size and get the chunks that have already been uploaded */
function slowBegin(file, totalHash, chunkSize = 1024 * 1024 * 2, expectTime = 10 * 1000) {
  let slowTestCount = 0 // number of probe chunks sent so far
  return new Promise(resolve => {
    let lastFile = file
    let size = chunkSize
    let startTime = new Date()
    const request_test = async function () {
      let f = blobSlice.call(lastFile, 0, size)
      lastFile = blobSlice.call(lastFile, size)
      let formDataArr = new FormData()
      formDataArr.append("totalHash", totalHash)
      formDataArr.append("hash", await hash(f))
      formDataArr.append("file", f)
      formDataArr.append("fileName", file.name)
      formDataArr.append("index", slowTestCount)
      request({
        url: url_,
        data: formDataArr,
      }).then(async response => {
        slowTestCount++
        let res = await response.json()
        if (res && res.success) {
          if ((new Date() - startTime) < expectTime) {
            // Fast enough: keep this chunk size and return the hashes the server already has
            resolve({
              chunkSize: size,
              lastFile,
              loadedHash: res.data,
              slowTestCount,
            })
          } else {
            // Too slow: halve the chunk size and probe again
            size = size / 2
            startTime = new Date()
            request_test()
          }
        }
      })
    }
    request_test()
  })
}
  • Entry point
/** Entry point */
async function work({ file }) {
  if (file) {
    const totalHash = await hash(file)
    const { lastFile, chunkSize, loadedHash, slowTestCount } = await slowBegin(file, totalHash)
    let splitRes = await splitHash(lastFile, chunkSize)
    // Mark the chunks the server already has (resumable upload)
    splitRes.forEach(item => {
      if (loadedHash.includes(item.hash)) {
        item.uploaded = true
      }
    })
    const formDataArr = []
    splitRes.forEach((item, index) => {
      if (!item.uploaded) {
        let formData = new FormData()
        formData.append("hash", item.hash)
        formData.append("file", item.file)
        formData.append("totalHash", totalHash)
        formData.append("index", index + slowTestCount)
        formDataArr.push(formData)
      }
    })
    concurrentRequest(formDataArr)
  }
}
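
As a usage sketch, the flow can be kicked off from a file input; the element id and the url_ upload endpoint below are assumptions, not part of the original code.

// Usage sketch: wire the upload flow to a file input (element id and endpoint are hypothetical)
const url_ = '/api/upload' // assumed upload endpoint referenced by request(), slowBegin() and concurrentRequest()
document.querySelector('#file-input').addEventListener('change', e => {
  work({ file: e.target.files[0] })
})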

Closing thoughts

The overall implementation simply imitates HTTP/TCP's slow start, resumable upload, and chunked hashing. Resumable upload requires the server to return an array of the hashes of chunks that have already been uploaded, and each run has to re-chunk the file and mark those chunks as done. Another option is to store the prepared chunks in the local IndexedDB; in my experiments the binary data came back garbled, and I later found that converting the chunks to Base64 before storing works, though that increases the storage and transfer cost.
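
As a rough illustration of the IndexedDB idea above (the database name, store name, and record shape are made up here, and the Base64 conversion mirrors the workaround described rather than a recommendation), an uploaded chunk could be persisted like this:

// Sketch: persist an uploaded chunk in IndexedDB as Base64 (names below are hypothetical)
function saveChunk(hash, blob) {
  return new Promise((resolve, reject) => {
    const reader = new FileReader()
    reader.readAsDataURL(blob) // Base64-encode the chunk, per the workaround described above
    reader.onload = () => {
      const open = indexedDB.open('upload-cache', 1)
      open.onupgradeneeded = () => {
        open.result.createObjectStore('chunks', { keyPath: 'hash' })
      }
      open.onsuccess = () => {
        const tx = open.result.transaction('chunks', 'readwrite')
        tx.objectStore('chunks').put({ hash: hash, dataUrl: reader.result })
        tx.oncomplete = () => resolve()
        tx.onerror = () => reject(tx.error)
      }
      open.onerror = () => reject(open.error)
    }
  })
}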