Preface
I came across an article a few days ago that really struck a chord:
Bytedance Interviewer: Please implement large file upload and breakpoint resume
The author implements, from scratch, chunked upload of large files, resumable upload, instant upload, pause, and more, giving a thorough yet approachable analysis of the whole interview question.
Rather than just heaping on more praise, I decided to ride the momentum: I recorded a video of the full coding process and followed up with this article. Do read the article above before this one — below are the extensions I've made:
- Computing the hash is time-consuming. Besides web-worker, we can borrow from React's Fiber architecture and use requestIdleCallback to exploit the browser's idle time without blocking the main thread
- The file hash is computed only to decide whether the file already exists, which is what enables instant upload. So we can borrow the Bloom filter's idea of trading a little accuracy for time: hash a sample of the file instead of all of it
- The previous article used web-worker to keep the hash computation from blocking the main thread, but a large file produces so many slices that firing all the HTTP requests at the browser at once still hangs it (I tried a 4GB file and it froze outright). We can control the concurrency of the async requests, which I believe is itself a top interview question
- The upload progress of each slice doesn't need to be displayed in a table; a grid of square progress cells is more intuitive (as shown in the figure)
- How do we retry when a concurrent upload errors? For example, allow each slice two retries and terminate after the third failure
- Since file sizes vary, fixing every slice at the same size is a bit awkward. We can borrow the slow start strategy from the TCP protocol: set an initial size, then dynamically adjust the size of the next slice based on how the upload went, keeping slice size matched to the current network speed
- Small experience optimizations, such as a reminder if you leave the page while uploading
- File fragment cleanup on the server

In the demo, slices that already exist (instant upload) are green, the ones being uploaded are blue, and the concurrency is 4. Enough talk — let's get coding.
Computing the file hash in time slices
This is in fact the time-slicing idea at the core of React's Fiber architecture: use the browser's idle time to compute the large diff, and let any higher-priority task, such as animation or input, interrupt the diff work along the way. The total amount of computation isn't reduced at all, but the user's interactive experience improves greatly.
This is probably the most popular way to open React Fiber
requestIdleCallback
The window.requestIdleCallback() method queues a function to be called during the browser's idle periods. This lets developers perform background and low-priority work on the main event loop. The callback is passed a deadline parameter, from which it can learn the remaining time in the current frame, as follows
requestIdleCallback(myNonEssentialWork);

function myNonEssentialWork(deadline) {
  // deadline.timeRemaining() returns the time left in the current frame
  // Keep working while the frame has time and the task queue is not empty
  while (deadline.timeRemaining() > 0 && tasks.length > 0) {
    doWorkIfNeeded();
  }
  // If tasks remain, schedule another idle callback
  if (tasks.length > 0) {
    requestIdleCallback(myNonEssentialWork);
  }
}
The structure of deadline is as follows
interface Deadline {
  didTimeout: boolean // whether the callback ran past the agreed deadline
  timeRemaining(): DOMHighResTimeStamp // time remaining for the task to execute
}
In the two frames shown in the figure, the task and rendering take up only part of each frame instead of filling it; the stretch marked "idle period" is the idle time. How much idle time a frame has depends on the amount and complexity of work done in it, so idle time is not constant. Each call to deadline.timeRemaining() returns the time (in ms) left until the end of the frame in which the idle callback runs.
Time slice calculation
Let's pick up the code from the previous article and rework our calculateHash
async calculateHashIdle(chunks) {
  return new Promise(resolve => {
    const spark = new SparkMD5.ArrayBuffer();
    let count = 0;
    // Read one chunk and feed its content into the incremental md5
    const appendToSpark = async file => {
      return new Promise(resolve => {
        const reader = new FileReader();
        reader.readAsArrayBuffer(file);
        reader.onload = e => {
          spark.append(e.target.result);
          resolve();
        };
      });
    };
    const workLoop = async deadline => {
      // There are chunks left and the current frame still has time
      while (count < chunks.length && deadline.timeRemaining() > 1) {
        await appendToSpark(chunks[count].file);
        count++;
        if (count < chunks.length) {
          // Not finished yet: update the progress
          this.hashProgress = Number(((100 * count) / chunks.length).toFixed(2));
        } else {
          // Finished: resolve with the final hash
          this.hashProgress = 100;
          resolve(spark.end());
        }
      }
      // Out of frame budget: continue in the next idle period
      if (count < chunks.length) {
        window.requestIdleCallback(workLoop);
      }
    };
    window.requestIdleCallback(workLoop);
  });
},
While the hash is being computed, try typing into an input box on the page: no lag at all. That's the power of time slicing.
Sampling the hash
Computing the file's MD5 is only for deciding whether the file already exists, so we can design a sampling hash that trades a small loss of accuracy for speed. The idea is as follows
- Cut the file into 2MB slices
- Take the entire first and last slices, and 2 bytes each from the start, middle, and end of every other slice
- Merge the sampled contents and compute the md5 — call it the shadow hash
- If this hash says the file exists, there is a small chance of misjudgment; if it says the file does not exist, that is 100% accurate. This is the same idea as a Bloom filter, and the two hashes can be used together: rule files out with the sample hash first
- On my machine a 1.5GB file takes about 20 seconds for the full hash but only about 1 second for the sample, which is good enough for quickly determining that a file does not exist
- Clever, right?
sample md5: 1028.006103515625ms
full md5:   21745.13916015625ms
async calculateHashSample() {
  return new Promise(resolve => {
    const spark = new SparkMD5.ArrayBuffer();
    const reader = new FileReader();
    const file = this.container.file;
    // File size
    const size = this.container.file.size;
    let offset = 2 * 1024 * 1024;
    // The first 2MB slice is taken in full
    let chunks = [file.slice(0, offset)];
    let cur = offset;
    while (cur < size) {
      if (cur + offset >= size) {
        // Take the last slice in full as well
        chunks.push(file.slice(cur, cur + offset));
      } else {
        // For middle slices, take 2 bytes each from the start, middle and end
        const mid = cur + offset / 2;
        const end = cur + offset;
        chunks.push(file.slice(cur, cur + 2));
        chunks.push(file.slice(mid, mid + 2));
        chunks.push(file.slice(end - 2, end));
      }
      cur += offset;
    }
    // Concatenate the samples and hash the result
    reader.readAsArrayBuffer(new Blob(chunks));
    reader.onload = e => {
      spark.append(e.target.result);
      resolve(spark.end());
    };
  });
}
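A natural way to use the two hashes together, Bloom-filter style, is to let the cheap sample hash rule files out and only pay for the full hash when the server thinks it has seen the file before. A minimal sketch of that flow — checkFileExists is a hypothetical API (it assumes the server indexes uploads by both hashes), not something from the article's code:

// Hedged sketch of a two-stage existence check.
// checkFileExists(hash) is hypothetical: it asks the server whether
// a file with this hash has already been fully uploaded.
async handleVerify() {
  // Stage 1: the cheap sample hash. "Not found" is 100% reliable,
  // so we can start uploading right away.
  const sampleHash = await this.calculateHashSample();
  const { exists } = await checkFileExists(sampleHash);
  if (!exists) {
    return this.uploadChunks();
  }
  // Stage 2: "found" may be a false positive, so confirm with the
  // full, time-sliced hash before declaring an instant upload.
  const fullHash = await this.calculateHashIdle(this.chunks);
  const { exists: confirmed } = await checkFileExists(fullHash);
  if (!confirmed) {
    return this.uploadChunks();
  }
  // The file really is on the server already: instant upload, send nothing
  this.hashProgress = 100;
}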
Network request concurrency control
After the hash is computed, firing off hundreds of HTTP requests at once — each setting up its own TCP connection — kills the browser. And if I remember right, controlling the concurrency of async requests is itself a Bytedance interview question.
The idea isn't hard: put the async requests in a queue. With a concurrency of 3, say, start three requests at once, and whenever one finishes, launch the next. The idea is clear, so on to the code.
We track the free channels with max: starting a request does max--, and a finishing request does max++.
+async sendRequest(forms, max = 4) {
+  return new Promise(resolve => {
+    const len = forms.length;
+    let idx = 0;
+    let counter = 0;
+    const start = async () => {
+      // While there are pending requests and free channels
+      while (idx < len && max > 0) {
+        max--; // occupy a channel
+        console.log(idx, "start");
+        const form = forms[idx].form;
+        const index = forms[idx].index;
+        idx++;
+        request({
+          url: '/upload',
+          data: form,
+          onProgress: this.createProgresshandler(this.chunks[index]),
+          requestList: this.requestList
+        }).then(() => {
+          max++; // release the channel
+          counter++;
+          if (counter === len) {
+            resolve();
+          } else {
+            start();
+          }
+        });
+      }
+    };
+    start();
+  });
+}

async uploadChunks(uploadedList = []) {
  // Build a form for every chunk the server doesn't already have
  const list = this.chunks
    .filter(chunk => uploadedList.indexOf(chunk.hash) == -1)
    .map(({ chunk, hash, index }) => {
      const form = new FormData();
      form.append("chunk", chunk);
      form.append("hash", hash);
      form.append("filename", this.container.file.name);
      form.append("fileHash", this.container.hash);
      return { form, index };
    });
-    .map(({ form, index }) =>
-      request({
-        url: "/upload",
-        data: form,
-        onProgress: this.createProgresshandler(this.chunks[index]),
-        requestList: this.requestList
-      })
-    );
-  // Direct full concurrency
-  await Promise.all(list);
+  // Control the concurrency
+  const ret = await this.sendRequest(list, 4);
  if (uploadedList.length + list.length === this.chunks.length) {
    // All chunks are up; ask the server to merge them
    await this.mergeRequest();
  }
},
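Stripped of the upload specifics, the same queue idea fits in a dozen lines. Here is a standalone sketch (limitConcurrency is my name for it, not from the article) that runs an array of promise-returning functions with at most max in flight:

// Run tasks (functions returning promises) with limited concurrency.
// Resolves with all results once every task has finished.
function limitConcurrency(tasks, max = 4) {
  return new Promise((resolve, reject) => {
    const results = [];
    let idx = 0;  // next task to start
    let done = 0; // finished tasks
    const runNext = () => {
      if (done === tasks.length) return resolve(results);
      // Start tasks while there are pending ones and free channels
      while (idx < tasks.length && max > 0) {
        const cur = idx++;
        max--; // occupy a channel
        tasks[cur]().then(res => {
          results[cur] = res;
          max++; // release the channel
          done++;
          runNext();
        }, reject);
      }
    };
    runNext();
  });
}

Called as limitConcurrency(list.map(({ form }) => () => request({ url: '/upload', data: form })), 4), it behaves like sendRequest above.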
See? I've done another of Bytedance's interview questions along the way — though I'm still not sure I'd pass their interview.
Slow start policy implementation
Borrowing from TCP congestion control, the task here is to dynamically adjust the slice size to the current network conditions

- Each chunk carries a size value, so the number of progress cells is no longer known up front; createFileChunk needs modifying, and each request gets timed
- Say we want each slice to upload in about 30 seconds
- The initial size is 1M; if the upload takes 10 seconds, the next slice grows to 3M
- If the upload takes 60 seconds, the next slice shrinks to 500KB, and so on
- Combining concurrency with slow start gets a bit involved and I haven't worked it out yet, so to demonstrate the logic I upload only one slice at a time, in a new handleUpload1 function

async handleUpload1() {
  // @todo the size adjustment could be gentler
  // @todo combine concurrency with slow start
  // Slow-start upload logic
  const file = this.container.file;
  if (!file) return;
  this.status = Status.uploading;
  const fileSize = file.size;
  let offset = 1024 * 1024; // initial slice size: 1M
  let cur = 0;
  let count = 0;
  this.container.hash = await this.calculateHashSample();
  while (cur < fileSize) {
    // Cut off a slice of the current size
    const chunk = file.slice(cur, cur + offset);
    cur += offset;
    const chunkName = this.container.hash + "-" + count;
    const form = new FormData();
    form.append("chunk", chunk);
    form.append("hash", chunkName);
    form.append("filename", file.name);
    form.append("fileHash", this.container.hash);
    form.append("size", chunk.size);

    let start = new Date().getTime();
    await request({ url: '/upload', data: form });
    const now = new Date().getTime();
    const time = ((now - start) / 1000).toFixed(4);
    // Ratio of the actual time to the 30-second target
    let rate = time / 30;
    // Clamp the factor so the size never changes more than 2x at once
    if (rate < 0.5) rate = 0.5;
    if (rate > 2) rate = 2;
    console.log(`slice ${count}: size ${this.format(offset)}, took ${time}s, ${rate}x the 30s target, next size ${this.format(offset / rate)}`);
    // Adjust the size of the next slice
    offset = parseInt(offset / rate);
    count++;
  }
}
Throttle the network to Slow 3G in devtools to see the effect
slice 0: size 1024.00KB, took 13.2770s, 0.5x the 30s target, next size 2.00MB
slice 1: size 2.00MB, took 25.4130s, 0.8471x the 30s target, next size 2.36MB
slice 2: size 2.36MB, took 14.1260s, 0.5x the 30s target, next size 4.72MB
done
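On the first @todo above: one way to make the adjustment gentler is to replace the hard clamp with an inverse-trigonometric mapping, so the resize factor varies smoothly and stays inside (0.5, 1.5). A sketch of the idea, not code from the original article:

// Map (upload time / target time) through atan so the factor changes
// smoothly: time == target gives 1 (size unchanged), a very slow upload
// approaches 1.5 (shrink the next slice), a very fast one approaches 0.5
// (grow it). Use it as: offset = parseInt(offset / smoothRate(time))
function smoothRate(time, target = 30) {
  return 1 + Math.atan(time / target - 1) / Math.PI;
}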
Progress bar optimization
This is a small optimization that lets us see at a glance which chunks already exist and how many are uploading concurrently; it's inspired by the look of a disk-defragmenter scan
<div class="cube-container" :style="{width:cubeWidth+'px'}">
<div class="cube"
v-for="chunk in chunks"
:key="chunk.hash">
<div
:class="{ 'uploading':chunk.progress>0&&chunk.progress<100, 'success':chunk.progress==100 }"
:style="{height:chunk.progress+'%'}"
>
<i v-if="chunk.progress>0&&chunk.progress<100" class="el-icon-loading" style="color:#F56C6C;"></i>
</div>
</div>
</div>
.cube-container
  width 100px
  overflow hidden
.cube
  width 14px
  height 14px
  line-height 12px
  border 1px solid black
  background #eee
  float left
  >.success
    background #67C23A
  >.uploading
    background #409EFF
// Keep the grid roughly square: width = ceil(sqrt(slice count)) * 16px per cell
cubeWidth() {
  return Math.ceil(Math.sqrt(this.chunks.length)) * 16;
},
Here's the effect once more 🐶
Concurrent upload: errors and retries
- When a request errors, the catch handler puts the task back into the queue
- On error, set that slice's progress to -1 so its cell renders red
- An array retryArr records the error count per slice: [1, 0, 2] means slice 0 has errored once and slice 2 twice
- Once a slice has errored too many times (twice in the code below), reject outright
First, make the backend fail randomly to simulate errors
if (Math.random() < 0.5) {
  // Fail with 50% probability
  console.log('simulated random error');
  res.statusCode = 500;
  res.end();
  return;
}
async sendRequest(urls, max = 4) {
-  return new Promise(resolve => {
+  return new Promise((resolve, reject) => {
    const len = urls.length;
    let idx = 0;
    let counter = 0;
+   const retryArr = [];
    const start = async () => {
      // While there are pending tasks and free channels
-     while (idx < len && max > 0) {
+     while (counter < len && max > 0) {
+       // Don't pick tasks with a plain cursor; pick by status instead,
+       // so that both waiting and errored tasks can be sent (= retries)
+       const i = urls.findIndex(v => v.status == Status.wait || v.status == Status.error);
+       if (i === -1) break; // nothing sendable now; in-flight requests will call start() again
+       urls[i].status = Status.uploading;
        max--; // occupy a channel
-       const form = urls[idx].form;
-       const index = urls[idx].index;
-       idx++;
+       const form = urls[i].form;
+       const index = urls[i].index;
+       if (typeof retryArr[index] == 'number') {
+         console.log(index, 'start retry');
+       }
        request({
          url: '/upload',
          data: form,
          onProgress: this.createProgresshandler(this.chunks[index]),
          requestList: this.requestList
        }).then(() => {
+         urls[i].status = Status.done;
          max++; // release the channel
          counter++;
          if (counter === len) {
            resolve();
          } else {
            start();
          }
-       });
+       }).catch(() => {
+         urls[i].status = Status.error;
+         if (typeof retryArr[index] !== 'number') {
+           retryArr[index] = 0;
+         }
+         // count this error
+         retryArr[index]++;
+         // give up once the same slice has errored twice
+         if (retryArr[index] >= 2) {
+           return reject();
+         }
+         console.log(index, retryArr[index], 'error');
+         // mark the cell red; the error status keeps the task in the queue
+         this.chunks[index].progress = -1;
+         max++; // release the channel, but counter is not incremented
+         start();
+       });
      }
    };
    start();
  });
}
As shown in the figure, a cell turns red when its slice errors, and the slice is then retried
File fragment cleanup
If many users abandon uploads halfway, the leftover slices are useless, so consider cleaning them up periodically. We can use node-schedule to manage the scheduled task, e.g. scan the target directory once a day and delete fragments that haven't been modified for a month
// For this demo: scan every 5 seconds and delete fragments older than 1 minute
const fse = require('fs-extra')
const path = require('path')
const schedule = require('node-schedule')

// Delete a fragment file if it has expired
function remove(file, stats) {
  const now = new Date().getTime()
  const offset = now - stats.ctimeMs
  if (offset > 1000 * 60) {
    // The fragment is older than 60 seconds
    console.log(file, 'expired, wasting space, delete it')
    fse.unlinkSync(file)
  }
}

async function scan(dir, callback) {
  const files = fse.readdirSync(dir)
  files.forEach(filename => {
    const fileDir = path.resolve(dir, filename)
    const stats = fse.statSync(fileDir)
    if (stats.isDirectory()) {
      return scan(fileDir, remove)
    }
    if (callback) {
      callback(fileDir, stats)
    }
  })
}

// *    *    *    *    *    *
// ┬    ┬    ┬    ┬    ┬    ┬
// │    │    │    │    │    │
// │    │    │    │    │    └ day of week (0 - 7) (0 or 7 is Sun)
// │    │    │    │    └───── month (1 - 12)
// │    │    │    └────────── day of month (1 - 31)
// │    │    └─────────────── hour (0 - 23)
// │    └──────────────────── minute (0 - 59)
// └───────────────────────── second (0 - 59, OPTIONAL)
let start = function (UPLOAD_DIR) {
  // Run every 5 seconds
  schedule.scheduleJob("*/5 * * * * *", function () {
    console.log('start scanning')
    scan(UPLOAD_DIR)
  })
}
exports.start = start
start scanning
/upload/target/625c.../625c... expired, wasting space, delete it
/upload/target/625c.../625c... expired, wasting space, delete it
/upload/target/625c.../625c... expired, wasting space, delete it
/upload/target/625c.../625c... expired, wasting space, delete it
/upload/target/625c.../625c...-12 expired, wasting space, delete it
Follow-up expansion and thinking
I'll leave a few questions to think about here and implement them in a future article — it keeps the momentum going
- requestIdleCallback compatibility: how would you implement a fallback yourself? (a rough sketch follows this list)
- React doesn't rely on it either: it ships its own scheduling logic, its own requestIdleCallback equivalent; I'll write that up some other time
- Coordinating concurrency with slow start
- Combining the sampling hash, the full hash, and time slicing
- Sliced download of large files (a sketch also follows this list)
- The slicing logic is the same: fetch Content-Length with an axios.head request
- Then use the HTTP Range header to download each slice; beyond that, the logic mirrors uploading
- Small experience optimizations
- Tips such as a reminder before leaving the page, and so on
- The slow-start change should be smoother, e.g. use a trigonometric function to confine the rate of change smoothly between 0.5 and 1.5 (see the sketch in the slow start section)
- Push upload progress over a websocket
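For the first question, a rough fallback can be faked with setTimeout and a fixed time budget. This is the widely used shim — a sketch, not a real scheduler (React, for its part, builds its own scheduling on top of other primitives rather than relying on requestIdleCallback):

// Fallback for browsers without requestIdleCallback (e.g. Safari).
// It pretends every "frame" grants up to 50ms, losing real frame
// alignment but keeping the calling code working unchanged.
window.requestIdleCallback =
  window.requestIdleCallback ||
  function (handler) {
    const start = Date.now();
    return setTimeout(() => {
      handler({
        didTimeout: false,
        timeRemaining() {
          return Math.max(0, 50 - (Date.now() - start));
        }
      });
    }, 1);
  };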
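And for sliced download, the skeleton mirrors the upload. A hedged sketch (downloadInChunks is a made-up name): a HEAD request for the total size, then one ranged GET per slice; the sequential loop here could be swapped for the concurrency control from earlier:

// Download a file in byte-range slices and reassemble it as a Blob
const axios = require('axios');

async function downloadInChunks(url, chunkSize = 2 * 1024 * 1024) {
  // Content-Length tells us how many slices are needed
  const head = await axios.head(url);
  const size = Number(head.headers['content-length']);
  const chunks = [];
  for (let cur = 0; cur < size; cur += chunkSize) {
    const end = Math.min(cur + chunkSize, size) - 1;
    // Range is inclusive on both ends: bytes=start-end
    const { data } = await axios.get(url, {
      responseType: 'arraybuffer',
      headers: { Range: `bytes=${cur}-${end}` }
    });
    chunks.push(data);
  }
  return new Blob(chunks);
}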
Thinking and summarizing
- Any seemingly simple requirement becomes complex as the scale grows — and so does life
- File upload is simple; large file upload is complex. Single machine is simple; distributed is hard
- Even a simple leftPad function (left-padding a string) drew plenty of discussion once performance mattered — the doubling/binary approach beats array join
- On the implementation of the left-pad function
- Any knowledge point is worth digging into deeply
- Next time the product manager says a requirement is simple, I'll have something to say about it
- I'm going to follow up on the previous post and record a hands-on video walkthrough of large file upload — stay tuned
Code
The first half of the code is adapted from @yeyan1996; the later code mainly illustrates the ideas and the implementation is rough, so go easy on me: github.com/shengxinjin…
Welcome to follow me
References
Some of the images are taken directly from the links below
- A guide to all kinds of file uploads for front-end beginners
- Bytedance Interviewer: Please implement large file upload and breakpoint resume
- This is probably the most popular way to open React Fiber
- The Bloom filter in plain language
- requestIdleCallback
- TCP congestion control
- File slice download implemented in Python
- requestIdleCallback — background task scheduling