Large file upload technology implementation

Requirements analysis

For large file uploads, there are at least a few things we want to do:

  • Cut large files into slices and upload them in pieces
  • If some slices fail to upload, prompt the user to re-upload them; slices that already succeeded should not be uploaded again
  • Ideally, show upload progress

The project architecture

In this project we use Vite + Vue 3 + Element Plus on the front end and the Koa framework on the back end

Create the Vue 3 project with Vite

// https://cn.vitejs.dev/guide/#scaffolding-your-first-vite-project
// Execute the following statement to complete the creation
yarn create vite big-upload-ui --template vue
cd big-upload-ui
// Install the required libraries
yarn add element-plus

Create the back-end project with Koa scaffolding

// Koa2 scaffolding
npm install koa-generator -g
// Scaffolding creates the project
koa2 server
cd server
yarn
// Install the corresponding library
yarn add koa-body fs-extra
// Delete some unnecessary files and import koa-body globally and configure to create the upload route

Upload large files in fragments

The front-end approach

File slicing

We select the file with a plain input element, so getting the selected File is easy. The core of file slicing is the slice method of the File object, which works much like the array slice method: calling it returns the specified byte range of the file as a Blob. If you are not familiar with the File object, it is worth reading up on it first.

Uniquely identifying a file

Now there is a bigger problem: how does the back end know whether two uploads are the same file? Obviously the file name is not a good unique identifier. So we use MD5 to digest the file content and obtain a unique hash value.

To generate the hash value we use the spark-md5 library. Hashing consumes a lot of CPU and can freeze the page, so to keep the experience smooth we compute the hash in a worker thread with a web worker. If you are not familiar with web workers, it is worth reading up on them as well.

Create a hash.js file to compute the hash

// Import the script
self.importScripts('/spark-md5.js');

// Generate the hash of the file
self.onmessage = (e) => {
  const { fileChunkList } = e.data;
  const spark = new self.SparkMD5.ArrayBuffer();
  let percentage = 0;
  let count = 0;
  const loadNext = (index) => {
    const reader = new FileReader();
    reader.readAsArrayBuffer(fileChunkList[index].file);
    reader.onload = (e) => {
      count++;
      spark.append(e.target.result);
      if (count === fileChunkList.length) {
        self.postMessage({
          percentage: 100,
          hash: spark.end(),
        });
        self.close();
      } else {
        percentage += 100 / fileChunkList.length;
        self.postMessage({
          percentage,
        });
        // Recursively compute the next slice
        loadNext(count);
      }
    };
  };
  loadNext(0);
};

In the worker thread we receive the slice list fileChunkList, use FileReader to read each slice as an ArrayBuffer, and feed it into spark-md5 incrementally. After each slice is processed, a progress event is sent to the main thread via postMessage, and once all slices are done the final hash is sent to the main thread.
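For reference, a minimal sketch of the main-thread side that spawns this worker could look like the following. It assumes hash.js and spark-md5.js are served from the site root and that hashPercentage is a data field used to display hashing progress; the exact names are assumptions, not fixed by the article.

// Main-thread side: compute the file hash in the web worker
calculateHash(fileChunkList) {
  return new Promise((resolve) => {
    this.worker = new Worker('/hash.js');
    this.worker.postMessage({ fileChunkList });
    this.worker.onmessage = (e) => {
      const { percentage, hash } = e.data;
      this.hashPercentage = percentage; // Hashing progress for the UI
      if (hash) {
        // All slices processed, the final hash is available
        resolve(hash);
      }
    };
  });
},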

File upload

Even when all of our file slices have been uploaded successfully, the server will not merge them on its own. We need to send a merge request to tell the server to merge the slices.

Defining data structures

Ok, so now we can read the file, split it into slices, and identify it uniquely. To keep the project tidy, let's define the data structure first.

Slice (chunk):
  • chunk: the Blob returned by file.slice
  • size: chunk.size
  • index: the index of the current chunk within the file
  • fileHash: the hash of the whole file
  • chunkHash: the hash of the chunk; here we use `${fileHash}-${index}` as the chunk hash
  • percentage: the upload progress of the current chunk

Code implementation

<template>
  <h1>Large file upload</h1>
  <input type="file" @change="handleFileChange" />
  <el-button @click="handleUpload">Upload</el-button>
</template>

<script>
const SIZE = 3 * 1024 * 1024; // Chunk size

export default {
  data() {
    return {
      file: null, // The selected file
      hash: '', // File hash
      chunkList: [], // Chunk list
    };
  },
  methods: {
    handleFileChange(e) {
      const [file] = e.target.files;
      if (!file) {
        this.file = null;
        return;
      }
      this.file = file;
    },
    // Split the file into chunks
    createFileChunk(file, size = SIZE) {
      const fileChunkList = [];
      let cur = 0;
      while (cur < file.size) {
        fileChunkList.push({ file: file.slice(cur, cur + size) });
        cur += size;
      }
      return fileChunkList;
    },
    // Upload all chunks
    async uploadChunks() {
      const requestList = this.chunkList
        .map(({ chunk, chunkHash, index, fileHash }) => {
          const formData = new FormData();
          formData.append('chunk', chunk);
          formData.append('chunkHash', chunkHash);
          formData.append('fileHash', fileHash);
          return { formData, index };
        })
        .map(async ({ formData, index }) =>
          this.request({
            url: 'http://localhost:8080/upload-chunk',
            method: 'post',
            data: formData,
          })
        );
      await Promise.all(requestList);
      await this.mergeRequest(); // Merge the chunks
    },
    // Notify the server to merge the chunks
    async mergeRequest() {
      await this.request({
        url: 'http://localhost:8080/merge',
        method: 'post',
        headers: { 'content-type': 'application/json' },
        data: JSON.stringify({
          filename: this.file.name,
          fileSize: this.file.size,
          size: SIZE,
          hash: this.hash,
        }),
      });
    },
    // Click handler
    async handleUpload() {
      if (!this.file) {
        console.log('Please select a file');
        return;
      }
      // Split the file into chunks
      const fileChunkList = this.createFileChunk(this.file);
      // Compute the file hash
      this.hash = await this.calculateHash(fileChunkList);
      // Build the chunk list
      this.chunkList = fileChunkList.map(({ file }, index) => ({
        chunk: file,
        size: file.size,
        chunkHash: `${this.hash}-${index}`,
        fileHash: this.hash,
        index,
        percentage: 0,
      }));
      // Upload the chunks
      await this.uploadChunks();
    },
  },
};
</script>

The back-end approach

From the analysis above, the backend first accepts the chunks and stores them in the specified directory, and then merges the chunks into the original file upon receiving the merge request

Here we specify that the final directory structure for a successfully uploaded file will be the following

+ target
	+ fileHash-chunks
		+ chunkHash
	fileHash.ext (the merged file)

We keep everything under the target directory: all the chunks for a file go into a fileHash-chunks folder, and after merging they become a single file named with the fileHash. (PS: we add a suffix to the folder name because the filesystem does not allow a file and a folder with the same name; this tripped us up for quite a while at first.) If the description is unclear, look at the code below; it is mainly a few constraints on the logic.

Configure koa-body

const koaBody = require('koa-body');

app.use(koaBody({ multipart: true, formidable: { maxFileSize: 200 * 1024 * 1024 } }));

The upload routes

const router = require('koa-router')();
const path = require('path');
const fse = require('fs-extra');

// Large file storage directory
const UPLOAD_DIR = path.resolve(__dirname, '..', 'target');
// Extract the file name extension
const extractExt = (filename) => filename.slice(filename.lastIndexOf('.'), filename.length);

/**
 * Create a read stream for path, pipe it into writeStream, and delete the file after writing
 * @param {String} path
 * @param {WriteStream} writeStream
 */
const pipeStream = (path, writeStream) =>
  new Promise((resolve) => {
    const readStream = fse.createReadStream(path);
    readStream.on('end', () => {
      fse.unlinkSync(path);
      resolve();
    });
    readStream.pipe(writeStream);
  });

/**
 * Merge all chunks into filePath
 * @param {String} filePath  path of the merged file
 * @param {String} chunkDir  directory where the chunks are stored
 * @param {Number} size      size of each chunk
 */
async function mergeFileChunk(filePath, chunkDir, size) {
  // Get the chunk list
  const chunkPaths = await fse.readdir(chunkDir);
  // Sort by slice subscript. Otherwise, direct reading of the directory may get out of order
  chunkPaths.sort((a, b) => a.split('-')[1] - b.split('-')[1]);
  await Promise.all(
    chunkPaths.map((chunkPath, index) =>
      pipeStream(
        path.resolve(chunkDir, chunkPath),
        // Specify a location to create a writable stream
        fse.createWriteStream(filePath, {
          start: index * size,
          end: (index + 1) * size,
        })
      )
    )
  );
  fse.rmdirSync(chunkDir); // Delete the directory where the slices are saved after merging
}

// Upload a chunk
router.post('/upload-chunk', async (ctx, next) => {
  const { chunkHash, fileHash } = ctx.request.body;
  const { chunk } = ctx.request.files;
  const chunkDir = path.resolve(UPLOAD_DIR, `${fileHash}-chunks`);
  // The slice directory does not exist; create it
  if (!fse.existsSync(chunkDir)) {
    await fse.mkdirs(chunkDir);
  }
  await fse.move(chunk.path, `${chunkDir}/${chunkHash}`);
  ctx.body = { code: 0, data: '', msg: 'Upload successful' };
});

// Merge the chunks
router.post('/merge', async (ctx, next) => {
  const { filename, fileSize, size, hash } = ctx.request.body;
  const ext = extractExt(filename);
  const filePath = path.resolve(UPLOAD_DIR, `${hash}${ext}`);
  const chunkDir = path.resolve(UPLOAD_DIR, `${hash}-chunks`);
  await mergeFileChunk(filePath, chunkDir, size);
  ctx.body = { code: 0, data: '', msg: 'Merge successful' };
});

module.exports = router;

At this point a simple large file upload is complete.

Progress bar function

There are two kinds of upload progress: the progress of each slice, and the progress of the whole file. The whole-file progress is derived from the per-slice progress, so we implement the per-slice progress first.

Slice progress bar

XMLHttpRequest natively supports upload progress monitoring: we only need to listen to upload.onprogress. We add an onProgress parameter to our original request method and register the listener on the XMLHttpRequest instance.

Each slice needs its own upload progress, so we write a factory method that returns a progress handler bound to a given chunk object.

// Item is our chunk object
createProgressHandler(item) {
  return (e) => {
    item.percentage = parseInt(String((e.loaded / e.total) * 100));
  };
},
// Bind this method to the onProgress parameter when uploading slices
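The request wrapper itself is never shown in full in this article. As a rough sketch (the parameter names are assumptions based on how it is called elsewhere), an XMLHttpRequest-based version with onProgress support might look like this; the requestList parameter is unused here and only comes into play in the pause section below.

// A minimal Promise-based XHR wrapper with upload-progress support (sketch)
request({ url, method = 'post', data, headers = {}, onProgress = (e) => e, requestList }) {
  return new Promise((resolve) => {
    const xhr = new XMLHttpRequest();
    xhr.upload.onprogress = onProgress; // Per-chunk upload progress
    xhr.open(method, url);
    Object.keys(headers).forEach((key) => xhr.setRequestHeader(key, headers[key]));
    xhr.onreadystatechange = () => {
      if (xhr.readyState === 4 && xhr.status === 200) {
        resolve(xhr.response);
      }
      // Error handling is added later, in the retry section
    };
    xhr.send(data);
  });
},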

The progress bar of the file

The upload progress of the whole file can be obtained by adding up the uploaded portion of each slice and dividing by the total file size, so we use a Vue computed property here.

computed: {
  // Calculate the total upload progress for each chunk
  uploadPercentage() {
    if (!this.file || !this.chunkList.length) return 0;
    const loaded = this.chunkList.map((item) => item.size * item.percentage).reduce((acc, cur) => acc + cur);
    return parseInt((loaded / this.file.size).toFixed(2));
  },
},

The basic function of uploading large files is almost complete.

For the features that follow I will focus on the approach and key points rather than pasting every line of code; the complete code is linked at the end.

Instant upload

Before uploading, we ask the server whether it already has the file. If it does not, we start uploading; if it does, we can be lazy and skip the upload entirely. That is "instant upload".

So we need to implement a verify interface that asks the server whether it already has this file. Because we hashed the file earlier, the hash uniquely identifies it, so the interface simply checks whether a file with that hash exists in our target directory.
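A minimal sketch of such a verify route on the Koa side, reusing the helpers from the routes file above (the /verify path and the shouldUpload field are my own naming, not fixed by the article):

// Instant upload: check whether the merged file already exists on the server
router.post('/verify', async (ctx, next) => {
  const { filename, hash } = ctx.request.body;
  const ext = extractExt(filename);
  const filePath = path.resolve(UPLOAD_DIR, `${hash}${ext}`);
  if (fse.existsSync(filePath)) {
    // The file is already on the server; the front end can skip uploading entirely
    ctx.body = { code: 0, data: { shouldUpload: false }, msg: 'File already exists' };
  } else {
    ctx.body = { code: 0, data: { shouldUpload: true }, msg: 'File does not exist' };
  }
});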

Resumable upload

Resumable upload means that if an upload fails partway through, the next time we upload the same file we only re-upload the chunks that failed before and skip the chunks that already succeeded.

Pause

First we implement a pause button manually: clicking it stops the uploads currently in flight, which also lets us simulate upload failures.

The idea, of course, is to modify our request method. Before we do, note that an XMLHttpRequest object can actively abort its own in-flight connection; look this up if you are not familiar with it.

So all we need is a shared array: every time we send a request we push the current XMLHttpRequest object into it, and when a request succeeds we remove its object. When the pause button is clicked we walk through this array and call each XMLHttpRequest's abort method to cancel the uploads in progress.
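A sketch of that bookkeeping, assuming the request wrapper from the progress section receives the shared requestList array:

// Inside the request wrapper, after xhr.send(data):
//   requestList && requestList.push(xhr);
// and once a request succeeds, remove its xhr from the array:
//   if (requestList) {
//     const i = requestList.findIndex((item) => item === xhr);
//     requestList.splice(i, 1);
//   }

// Pause button handler: abort every request still in flight
handlePause() {
  this.requestList.forEach((xhr) => xhr && xhr.abort());
  this.requestList = [];
},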

Resume upload

Resuming means starting the upload again, but only for the chunks that were not uploaded successfully before. That gives us two requirements:

  • Know which chunks were successfully uploaded to the server
  • Before uploading, skip the chunks that have already been uploaded successfully

For requirement 1, we extend the instant-upload verify interface: besides telling us whether the file already exists on the server, it also returns which chunks of the current file have already been uploaded successfully. In other words, it reads the list of file names in the fileHash-chunks folder under target and returns them (a sketch follows).
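A sketch of that extension on the server, building on the verify route sketched earlier (createUploadedList is an assumed helper name):

// List the chunk names already stored for a given file hash
const createUploadedList = async (fileHash) => {
  const chunkDir = path.resolve(UPLOAD_DIR, `${fileHash}-chunks`);
  return fse.existsSync(chunkDir) ? await fse.readdir(chunkDir) : [];
};

// In the verify route, when the merged file does not exist yet, also return the uploaded chunks:
// ctx.body = { code: 0, data: { shouldUpload: true, uploadedList: await createUploadedList(hash) }, msg: '' };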

For requirement 2, when we build the chunk array we just check whether each chunk is already in the uploaded list; if it is, we set its progress to 100. When uploading, we only send requests for the chunks that have not been uploaded yet (sketched below).
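On the front end this only changes handleUpload and the chunk-list construction. A sketch, assuming verifyUpload is a small helper that calls the verify route and returns { shouldUpload, uploadedList }:

async handleUpload() {
  if (!this.file) return;
  const fileChunkList = this.createFileChunk(this.file);
  this.hash = await this.calculateHash(fileChunkList);
  // Ask the server whether the file, or some of its chunks, already exists
  const { shouldUpload, uploadedList = [] } = await this.verifyUpload(this.file.name, this.hash);
  if (!shouldUpload) {
    this.$message.success('Instant upload: the file already exists on the server');
    return;
  }
  this.chunkList = fileChunkList.map(({ file }, index) => ({
    chunk: file,
    size: file.size,
    chunkHash: `${this.hash}-${index}`,
    fileHash: this.hash,
    index,
    // Chunks already on the server start at 100%
    percentage: uploadedList.includes(`${this.hash}-${index}`) ? 100 : 0,
  }));
  // uploadChunks filters out the chunks that are already uploaded
  await this.uploadChunks(uploadedList);
},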

Conclusion

At this point our large file upload is complete.

The complete code

Refer to the article

Open questions

  • Failed slice uploads are not handled
  • The browser might crash if you upload too many slices and send too many network requests
  • If the file is too large, the hash computation will be very, very slow even if we are using a Web worker
  • Could we do the reverse: large file download?

Optimizations

It takes too long to compute the hash of a large file

At first we used requestIdleCallback to spread the hash computation over idle time, borrowing the idea from React's Fiber implementation, but with very large files there was still a long freeze. In the end we decided to compute a sampled hash, trading a little accuracy for a lot of time.

For hashing we use a smaller slice size, say 2 MB:

  • When computing the hash, we split the large file into 2 MB slices to get another chunks array
  • We keep the first chunk (chunks[0]) and the last chunk (chunks[-1]) in full
  • For the other chunks (chunks[1, 2, 3, ...]) we slice again at a much smaller size, such as 2 KB, taking a small piece from the head, the middle, and the tail of each chunk
  • Finally, we combine them into a new file, and we fully compute the hash value of the new file.

At this point it won't take much time to compute the hash of a large file. Note that only the hash uses sampling; for uploading we still use the original slicing method.

The modified hash calculation method

// Compute the hash using the web-worker
calculateHash(fileChunkList) {
return new Promise((resolve) => {
  // Add worker attributes
  // this.worker = new Worker('/hash.js');
  // this.worker.postMessage({ fileChunkList });
  // this.worker.onmessage = (e) => {
  // const { percentage, hash } = e.data;
  // this.hashPercentage = percentage;
  // if (hash) {
  // resolve(hash);
  // }
  // };
  const spark = new SparkMD5.ArrayBuffer();
  const reader = new FileReader();
  const file = this.file;
  // File size
  const size = this.file.size;
  let offset = 2 * 1024 * 1024;
  let chunks = [file.slice(0, offset)];
  // The first 2 MB is kept in full
  let cur = offset;
  while (cur < size) {
    // Add the last piece
    if (cur + offset >= size) {
      chunks.push(file.slice(cur, cur + offset));
    } else {
      // Remove two bytes from the middle
      const mid = cur + offset / 2;
      const end = cur + offset;
      chunks.push(file.slice(cur, cur + 2));
      chunks.push(file.slice(mid, mid + 2));
      chunks.push(file.slice(end - 2, end));
    }
    // Move to the next window
    cur += offset;
  }
  // Concatenate the sampled pieces
  reader.readAsArrayBuffer(new Blob(chunks));
  reader.onload = (e) => {
    spark.append(e.target.result);
    this.hashPercentage = 100;
    resolve(spark.end());
  };
});
},

Too many file slices cause too many concurrent HTTP requests

With Promise.all(requestList), 100 chunks means the browser fires 100 network requests at the same moment, which makes it extremely sluggish.

So we need to limit concurrency: at any moment we send at most max requests. We use a loop that decrements max for each request in flight and only sends a new one while max is greater than 0. When a request comes back there are two cases: either there are still requests left to send, in which case we keep going, or everything has been sent and we can end the loop.

The modified code is as follows

// Control the number of concurrent requests
async sendRequest(forms, max = 4) {
  return new Promise((resolve) => {
    const len = forms.length;
    let idx = 0;
    let counter = 0;
    const start = async () => {
      // While there are requests left and a free channel
      while (idx < len && max > 0) {
        max--; // Occupy a channel
        console.log(idx, 'start');
        let { formData, index } = forms[idx];
        idx++;
        await this.request({
          url: 'http://localhost:8080/upload-chunk',
          method: 'post',
          data: formData,
          onProgress: this.createProgressHandler(this.chunkList[index]),
          requestList: this.requestList,
        }).then(() => {
          max++; // Release the channel
          counter++;
          if (counter === len) {
            resolve();
          } else {
            start();
          }
        });
      }
    };
    start();
  });
},
// Upload file slices
async uploadChunks(uploadedList = []) {
  // Construct the request list
  const requestList = this.chunkList
    .filter((chunk) => !uploadedList.includes(chunk.chunkHash))
    .map(({ chunk, chunkHash, index, fileHash }) => {
      const formData = new FormData();
      formData.append('chunk', chunk);
      formData.append('chunkHash', chunkHash);
      formData.append('fileHash', fileHash);
      return { formData, index };
    });
  // .map(async ({ formData, index }) =>
  // this.request({
  // url: 'http://localhost:8080/upload-chunk',
  // method: 'post',
  // data: formData,
  // onProgress: this.createProgressHandler(this.chunkList[index]),
  // requestList: this.requestList,
  // })
  // );
  // Wait for all sending to complete
  // await Promise.all(requestList); // Concurrent slicing
  // Control concurrency
  await this.sendRequest(requestList, 4);
  // When all chunks have been sent, notify the back end to merge the slices
  if (uploadedList.length + requestList.length === this.chunkList.length) {
    await this.mergeRequest();
  }
},

Handling failed slice uploads

Requirements:

  • A failure message must be displayed after the first sending error
  • After the first attempt fails, we retransmit the request 3 times -> a maximum of 4 times
  • A message is displayed when retransmission fails for three times
  • All requests need to be processed before the next step can be taken

First we define a status for each request, one of the four below. Initially every request is in the wait state; it changes to error when a request fails, to fail after three failed retries, and to done on success.

const Status = { wait: 1, error: 2, done: 3, fail: 4 };

Keep one invariant in mind: all requests = wait + error + done + fail, and the only terminal states are done and fail.

Now to the code. First we modify our request method: previously it gave no feedback when a request failed, so now we reject on error.

xhr.onreadystatechange = function () {
  if (xhr.readyState === 4) {
    if (xhr.status === 200) {
      // The response succeeded
    } else {
      // Reset the progress of this chunk
      onProgress({ loaded: 0, total: 100 });
      // Error handling
      reject(xhr.statusText);
    }
  }
};

retryArr records retries per chunk: retryArr[1] = 2 means the request for chunk 1 has been retried twice. We no longer iterate through the forms array by index; instead we look for a request whose status is wait or error and send it. When we can't find any, every request is now done or fail, so we reject if anything failed. Failures are caught in the catch block, where we increment the retry count and mark the request as fail after three retries.

async sendRequest(forms, max = 4) {
  return new Promise((resolve, reject) => {
    const len = forms.length;
    let counter = 0; // The request has been sent successfully
    const retryArr = []; // Record the number of errors
    // Start by setting all form states to wait
    forms.forEach((item) = > (item.status = Status.wait));
    const start = async () => {
      // There are still requests left and a free channel
      while (counter < len && max > 0) {
        max--; // Occupies the channel
        // If it is not finished, we will send it again
        let idx = forms.findIndex((v) => v.status == Status.wait || v.status == Status.error);
        if (idx == -1) {
          // Failed state and wait state not found
          return reject();
        }
        let { formData, index } = forms[idx];
        await this.request({
          url: 'http://localhost:8080/upload-chunk',
          method: 'post',
          data: formData,
          onProgress: this.createProgressHandler(this.chunkList[index]),
          requestList: this.requestList,
        })
          .then(() => {
            forms[idx].status = Status.done;
            max++; // Release the channel
            counter++;
            if (counter === len) {
              resolve();
            }
          })
          .catch(() => {
            forms[idx].status = Status.error;
            if (typeof retryArr[index] !== 'number') {
              this.$message.info(`Chunk ${index} failed to upload, preparing to retry`);
              retryArr[index] = 0;
            }
            // Add the number of times
            retryArr[index]++;
            // A request has three errors
            if (retryArr[index] > 3) {
              this.$message.error(`Chunk ${index} still failed after multiple retries, aborting upload`);
              forms[idx].status = Status.fail;
            }
            max++; // Release the channel
          });
      }
    };
    start();
  });
},

The full code is above, but I still strongly recommend writing it yourself: I stepped into quite a few pitfalls while writing it and it took a long time to get right.

Periodic chunk cleanup

You may ask: since chunks are deleted after merging, why clean them up periodically? After the project had been running for a while, we found that when someone tested a large upload, it failed, and they simply walked away, the file was never fully uploaded, so the merge never happened and the chunks were never cleaned up.

Hence the requirement: we watch the modification time of the chunk files, and if a chunk has not changed for a certain period we assume the user no longer needs it and proactively delete it.

Overall: scheduled tasks, file detection, file operations

Here is the code; to run it you only need to require this file with one line in app.js.

const fse = require('fs-extra');
const path = require('path');
const schedule = require('node-schedule');
// Large file storage directory
const UPLOAD_DIR = path.resolve(__dirname, '..', 'target');

// Delete expired chunk files
function remove(file, stats) {
  const now = new Date().getTime();
  const offset = now - stats.ctimeMs;
  if (offset > 1000 * 60) {
    // Fragments older than 60 seconds
    fse.unlinkSync(file);
    console.log(file, 'File expired, deleted');
  }
}

async function scan(dir, callback) {
  const files = fse.readdirSync(dir);
  files.forEach((filename) = > {
    const fileDir = path.resolve(dir, filename);
    const stats = fse.statSync(fileDir);
    if (stats.isDirectory()) {
      // Recurse into the directory and delete expired files
      scan(fileDir, remove);
      // Delete the empty folder
      if (fse.readdirSync(fileDir).length == 0) {
        fse.rmdirSync(fileDir);
      }
      return;
    }
    if (callback) {
      callback(fileDir, stats);
    }
  });
}

// *    *    *    *    *    *
// ┬    ┬    ┬    ┬    ┬    ┬
// │    │    │    │    │    │
// │    │    │    │    │    └── day of week (0 - 7) (0 or 7 is Sun)
// │    │    │    │    └─────── month (1 - 12)
// │    │    │    └──────────── day of month (1 - 31)
// │    │    └───────────────── hour (0 - 23)
// │    └────────────────────── minute (0 - 59)
// └─────────────────────────── second (0 - 59, OPTIONAL)
let start = function () {
  // Run every 5 seconds
  schedule.scheduleJob('*/5 * * * * *', function () {
    console.log('Start periodic cleanup of chunks');
    scan(UPLOAD_DIR);
  });
};

start();