There are many scenarios for applying concurrency, and downloading files is a very common one.

Why would you want to write a multithreaded downloader? You may or may not have used IDM (Internet Download Manager), but when I first used it, I was attracted by the way it downloads files.

When IDM downloads a file, you can watch the process visually. It uses a fixed number of N threads. At the start, the file is divided into N segments, and each segment is downloaded by one thread. When a segment is finished, its thread becomes idle. What happens then? IDM takes the largest of the remaining N-1 segments and splits it in half, so that there are N segments again, and lets the idle thread download the newly created segment.

Each time a thread completes the download task, it continues to separate a section from the remaining parts until all parts of the file have been downloaded.

Of course, a file cannot be split indefinitely. IDM sets a minimum segment size: when the largest remaining segment is smaller than this threshold, no further splitting is done for idle threads, and IDM simply waits for the active threads to finish.
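The splitting rule described above can be sketched in a few lines of Java. This is only an illustration of the idea as observed from the outside, not IDM's actual code: the class SegmentSplitter, its Segment type, and the threshold parameter are all hypothetical names, and a real downloader would also need synchronization, because the owning thread is still advancing inside the segment that gets split.

```java
import java.util.List;

/** Hypothetical helper illustrating the split-the-largest-segment idea. */
final class SegmentSplitter {
    /** A half-open byte range [start, end) that still has to be downloaded. */
    static final class Segment {
        long start;
        long end;

        Segment(long start, long end) {
            this.start = start;
            this.end = end;
        }

        long size() {
            return end - start;
        }
    }

    /**
     * Finds the largest remaining segment and cuts it in half.
     * Returns the new second half (for the idle thread), or null when the
     * largest segment is already smaller than the threshold.
     */
    static Segment splitLargest(List<Segment> remaining, long threshold) {
        Segment largest = null;
        for (Segment s : remaining) {
            if (largest == null || s.size() > largest.size()) {
                largest = s;
            }
        }
        if (largest == null || largest.size() < threshold) {
            return null; // too small to split; the idle thread just stops
        }
        long mid = largest.start + largest.size() / 2;
        Segment tail = new Segment(mid, largest.end);
        largest.end = mid; // the original owner keeps the first half
        remaining.add(tail);
        return tail;
    }
}
```

For example, with remaining segments [0, 100), [100, 500) and [500, 600) and a threshold of 50, the call splits the middle segment into [100, 300) and [300, 500) and hands the second half to the idle thread.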

Now that we have gone through IDM's download strategy, we have a general idea of the goal. However, excellent software is not built overnight. Even though we want to write a downloader like IDM, we need to start from a simple implementation and optimize it iteratively.

So our initial strategy is to use a fixed number of N threads and divide the file into N segments, with each thread responsible for downloading one segment.

Implementation steps

Determine whether the server supports breakpoint continuation

The first step is to determine whether the target file supports breakpoint continuation (resumable download). This depends on the HTTP server that hosts the file.

HTTP requests have a Range header field that specifies the range of data to request. For example, to request the data from byte 10 to byte 20 (both ends inclusive), we set this field to Range: bytes=10-20.

Accordingly, if the HTTP server supports breakpoint continuation, it returns status code 206 (Partial Content) for requests that specify a Range field.

Let's put this to the test with curl:

curl -I --header "Range: bytes=0-" http://mirrors.163.com/debian/ls-lR.gz

The resulting response:

HTTP/1.1 206 Partial Content
Server: nginx
Date: Wed, 25 Apr 2018 02:57:56 GMT
Content-Type: application/octet-stream
Content-Length: 15316619
Connection: keep-alive
Last-Modified: Mon, 23 Apr 2018 14:38:44 GMT
ETag: "5addeff4-e9b68b"
Content-Range: bytes 0-15316618/15316619

We set Range: bytes=0- to request everything from byte 0 to the last byte, which is the whole file, so why specify this field at all? There are two reasons: to check whether the response status code is 206, and to get the size of the file, which is the number after the slash in Content-Range.

If the server supports breakpoint continuation, we use a multithreaded download; if not, we use a single-threaded download.
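The same check as the curl command can be sketched in Java with HttpURLConnection. This is a minimal sketch and not the code from the repository: the class name RangeProbe and both method names are made up for illustration, and returning -1 for "not resumable or size unknown" is a simplification chosen here.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

/** Hypothetical helper: checks for 206 support and reads the file size. */
final class RangeProbe {
    /** Extracts the total size from a value such as "bytes 0-15316618/15316619"; -1 if unknown. */
    static long parseTotalSize(String contentRange) {
        if (contentRange == null) {
            return -1;
        }
        int slash = contentRange.lastIndexOf('/');
        if (slash < 0) {
            return -1;
        }
        String total = contentRange.substring(slash + 1).trim();
        if (total.equals("*")) {
            return -1; // the server did not report the total size
        }
        try {
            return Long.parseLong(total);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** Sends "Range: bytes=0-" and returns the file size if the server answers 206, else -1. */
    static long probeResumableSize(URL url, int timeoutMs) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("HEAD"); // headers only, like curl -I
        conn.setRequestProperty("Range", "bytes=0-");
        conn.setConnectTimeout(timeoutMs);
        conn.setReadTimeout(timeoutMs);
        try {
            if (conn.getResponseCode() != HttpURLConnection.HTTP_PARTIAL) {
                return -1; // no 206, so fall back to a single-threaded download
            }
            return parseTotalSize(conn.getHeaderField("Content-Range"));
        } finally {
            conn.disconnect();
        }
    }
}
```

Applied to the response shown above, parseTotalSize("bytes 0-15316618/15316619") yields 15316619.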

File segmentation

Once we have the file size fileSize, we divide the file into N segments, so each segment has size fileSize/N (integer division). Since fileSize is usually not an exact multiple of N, the last segment takes whatever remains.

We use an array endPoint to store the boundaries of the segments. For example, if a 10-byte file is downloaded in three segments, then endPoint = {0, 3, 6, 10}, and each segment is a left-closed, right-open interval: [0, 3), [3, 6), and [6, 10).

For segment i (i starts at 0), the download starts at endPoint[i] and stops at endPoint[i + 1] - 1. Similarly, segment i + 1 starts at endPoint[i + 1] and stops at endPoint[i + 2] - 1.
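As a small sketch of this calculation, the boundaries and the matching Range values could be computed as follows. The class SegmentPlanner and its method names are hypothetical, and long is used instead of int here only so that files larger than 2 GB would not overflow.

```java
/** Hypothetical helper that computes segment boundaries as described in the text. */
final class SegmentPlanner {
    /** Returns n + 1 boundaries; segment i covers the bytes [endPoint[i], endPoint[i + 1]). */
    static long[] endPoints(long fileSize, int n) {
        if (n <= 0 || fileSize < 0) {
            throw new IllegalArgumentException("n must be positive and fileSize non-negative");
        }
        long[] endPoint = new long[n + 1];
        long segmentSize = fileSize / n;
        for (int i = 0; i < n; i++) {
            endPoint[i] = i * segmentSize;
        }
        endPoint[n] = fileSize; // the last segment absorbs the remainder
        return endPoint;
    }

    /** Range header value for segment i; HTTP ranges are inclusive, hence the "- 1". */
    static String rangeHeader(long[] endPoint, int i) {
        return "bytes=" + endPoint[i] + "-" + (endPoint[i + 1] - 1);
    }
}
```

With the example from the text, endPoints(10, 3) gives {0, 3, 6, 10}, and rangeHeader for segment 2 is "bytes=6-9".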

Creating a download thread

We create a download thread for each segment and store each segment in a separate temporary file.

What the download thread needs to do can be summarized as follows:

  • Set the Range request header to specify the requested range
  • Set the timeouts
  • Connect to the HTTP server
  • Create the temporary file (only when this segment is downloaded for the first time)
  • Read the data returned by the server and write it to the temporary file until the number of bytes read equals the size of the segment
  • Close the temporary file
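The steps above can be sketched as one download pass. This is an illustrative version, not the code from the repository: SegmentFetcher, copyUpTo, and fetch are hypothetical names, and the choice to open the temporary file with CREATE and APPEND is an assumption made here so that the file is created on the first pass and extended on later ones.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: one pass over the download steps for a single segment. */
final class SegmentFetcher {
    /** Copies up to count bytes from in to out and returns how many were actually copied. */
    static long copyUpTo(InputStream in, OutputStream out, long count) throws IOException {
        byte[] buf = new byte[8192];
        long copied = 0;
        while (copied < count) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, count - copied));
            if (n == -1) {
                break; // the server closed the stream early
            }
            out.write(buf, 0, n);
            copied += n;
        }
        return copied;
    }

    /** Requests bytes start..end (inclusive); returns true if the whole range arrived. */
    static boolean fetch(URL url, long start, long end, Path tempFile, int timeoutMs)
            throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Range", "bytes=" + start + "-" + end);
        conn.setConnectTimeout(timeoutMs);
        conn.setReadTimeout(timeoutMs);
        try {
            if (conn.getResponseCode() != HttpURLConnection.HTTP_PARTIAL) {
                return false; // the server ignored the Range header
            }
            long wanted = end - start + 1;
            try (InputStream in = conn.getInputStream();
                 OutputStream out = Files.newOutputStream(tempFile,
                         StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                return copyUpTo(in, out, wanted) == wanted;
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

Note that appending is only correct if a later pass asks for the bytes that are still missing rather than for the whole segment again.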

The steps above describe a successful download. We also need to retry when one of the following problems occurs:

  • The connection or a read times out
  • Reading or writing the temporary file fails

This raises another question: when a thread retries, should it download the whole segment again, or continue with the part that is still missing? Clearly the better choice is to download only the part that has not finished yet, so how do we do that?

We can do it like this: each thread stores the start and end positions of the part it is responsible for, which are initially the start and end positions of its segment, and creates a temporary file to which the downloaded data is written. As data arrives, the thread keeps updating its start position to the byte right after the last downloaded byte. When an error occurs and a retry is needed, the thread requests data from that position and keeps writing to the temporary file it created earlier.
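The bookkeeping described here can be sketched independently of HTTP. In the sketch below, RangeSource is a hypothetical interface standing in for "open a connection with Range: bytes=from-to", ResumableSegment is a made-up class name, and the maxAttempts limit is an assumption added for the example; the article's own thread simply retries until it succeeds.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical abstraction: opens a stream of the bytes from..to (both inclusive). */
interface RangeSource {
    InputStream open(long from, long to) throws IOException;
}

/** Tracks how far a segment has been downloaded so that a retry can resume. */
final class ResumableSegment {
    private long next;      // first byte that has not been written yet
    private final long end; // last byte of the segment (inclusive)

    ResumableSegment(long start, long end) {
        this.next = start;
        this.end = end;
    }

    /** One attempt; returns true once the whole segment has been written to out. */
    boolean attempt(RangeSource source, OutputStream out) {
        if (next > end) {
            return true;
        }
        try (InputStream in = source.open(next, end)) {
            byte[] buf = new byte[8192];
            while (next <= end) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, end - next + 1));
                if (n == -1) {
                    break; // stream ended early; the caller will retry
                }
                out.write(buf, 0, n);
                next += n; // advance only after the bytes are written
            }
            return next > end;
        } catch (IOException e) {
            return false; // next already points at the first missing byte
        }
    }

    /** Retries until the segment is complete or maxAttempts is reached. */
    void downloadWithRetry(RangeSource source, OutputStream out, int maxAttempts)
            throws IOException {
        for (int i = 0; i < maxAttempts; i++) {
            if (attempt(source, out)) {
                return;
            }
        }
        throw new IOException("segment incomplete after " + maxAttempts + " attempts");
    }
}
```

Because next only moves forward after a successful write, every retry asks the source for exactly the bytes that are still missing, and the output never contains duplicated data.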

Create a monitor thread

We create a daemon thread that monitors the download progress, the download speed, and the number of active threads. Once every download thread has finished, it notifies the main thread so that it can continue with the next step.
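A monitor of this kind could look roughly like the following sketch. The class DownloadMonitor, its method names, and the sampling interval parameter are assumptions made for illustration; the counters mirror the downloadedBytes and aliveThreads fields of the outline, with AtomicLong used here instead of AtomicInteger to allow large files.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical monitor: samples shared counters and wakes the main thread at the end. */
final class DownloadMonitor {
    final AtomicLong downloadedBytes = new AtomicLong();   // updated by download threads
    final AtomicInteger aliveThreads = new AtomicInteger(); // incremented before start()
    private final Object waiting = new Object();
    private boolean finished = false;

    /** Starts the daemon thread; call it only after aliveThreads has been set. */
    void start(long fileSize, long intervalMs) {
        Thread monitor = new Thread(() -> {
            long previous = 0;
            try {
                while (aliveThreads.get() > 0) {
                    Thread.sleep(intervalMs);
                    long now = downloadedBytes.get();
                    double kbPerSec = (now - previous) / 1024.0 / (intervalMs / 1000.0);
                    previous = now;
                    System.out.printf("%d/%d bytes, %.1f KB/s, %d active threads%n",
                            now, fileSize, kbPerSec, aliveThreads.get());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            synchronized (waiting) {
                finished = true;
                waiting.notifyAll(); // wake the main thread
            }
        });
        monitor.setDaemon(true); // does not keep the JVM alive on its own
        monitor.start();
    }

    /** Blocks until the monitor reports completion; returns false if interrupted. */
    boolean awaitCompletion() {
        synchronized (waiting) {
            try {
                while (!finished) {
                    waiting.wait();
                }
                return true;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

The finished flag is checked in a loop under the same lock that guards notifyAll(), so the main thread cannot miss the notification even if the monitor finishes before awaitCompletion() is called.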

Processing temporary files

When the main thread is notified that all parts have been downloaded, the temporary files need to be cleaned up.

If the file was downloaded by multiple threads, the temporary files need to be merged into one.

If the file was downloaded by a single thread, the temporary file is simply renamed.
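Both cases can be sketched with java.nio.file. The class TempFiles and its method names are hypothetical, and deleting the parts right after merging is an assumption of this sketch rather than something the article specifies.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

/** Hypothetical helper for the clean-up step. */
final class TempFiles {
    /** Concatenates the parts in list order into target, then deletes the parts. */
    static void merge(List<Path> parts, Path target) throws IOException {
        // With no options, newOutputStream creates the file or truncates an existing one.
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out); // appends the whole part to the stream
            }
        }
        for (Path part : parts) {
            Files.delete(part);
        }
    }

    /** Single-threaded case: the only temporary file becomes the final file. */
    static void rename(Path tempFile, Path target) throws IOException {
        Files.move(tempFile, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

The parts must be passed in segment order, since the merged file is just their concatenation.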

Implementation

Here is an outline of the program to give an overall picture. The complete source can be viewed on GitHub: github.com/wrayzheng/j…

import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.concurrent.atomic.AtomicInteger;

public class HttpDownloader {
    private boolean resumable;
    private URL url;
    private File localFile;
    private int[] endPoint;
    private Object waiting = new Object();
    private AtomicInteger downloadedBytes = new AtomicInteger(0);
    private AtomicInteger aliveThreads = new AtomicInteger(0);
    private boolean multithreaded = true;
    private int fileSize = 0;
    private int THREAD_NUM = 5;
    private int TIME_OUT = 5000;
    private final int MIN_SIZE = 2 << 20;  // 2 MiB

    public HttpDownloader(String Url, String localPath) throws MalformedURLException {...}

    public HttpDownloader(String Url, String localPath, int threadNum, int timeout) throws MalformedURLException {...}

    // Start downloading files
    public void get() throws IOException {...}

    // Check whether the target file supports breakpoint continuation
    public boolean supportResumeDownload() throws IOException {...}

    // Monitor the download speed and status and notify the main thread when the download is complete
    public void startDownloadMonitor() {...}

    // Merge or rename temporary files
    public void cleanTempFile() throws IOException {...}

    // Merge the temporary files produced by a multithreaded download
    public void merge() {...}

    // A download thread downloads one part of the file and retries on failure
    class DownloadThread extends Thread {
        private int id;
        private int start;
        private int end;
        private OutputStream out;

        public DownloadThread(int id, int start, int end) {...}

        // Keep calling download() until this part is complete
        @Override
        public void run() {...}

        // Download the specified part of the data
        public boolean download() {...}
    }
}

Here, DownloadThread is defined as an inner class of HttpDownloader. This is because one HttpDownloader instance corresponds to one file download task. The instance stores the data of the task, and each DownloadThread belongs to that task and needs this data. As an inner class it can access the data directly, which avoids passing and storing a large number of parameters.

To download a file, first create an instance of HttpDownloader. The required parameters are the URL of the target file and the local storage location; the optional parameters are the number of threads and the timeout.

The HttpDownloader entry method is get(), which does the following:

  • Call supportResumeDownload() to check whether the target file supports breakpoint continuation, and check whether the file is larger than the configured minimum size, in order to decide whether to use multithreaded download mode;
  • Calculate the start and end positions of each segment and store them in endPoint;
  • Create the DownloadThread instances and start downloading;
  • Call startDownloadMonitor() to start the monitor thread;
  • Wait for the download to finish;
  • Call cleanTempFile() to process the temporary files;
  • Print the final summary.
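The decision made in the first step can be written as a small function. This is only a sketch of the rule stated in the list: DownloadMode and threadsFor are hypothetical names, and whether the real code compares with > or >= is not shown in the outline, so the strict comparison is an assumption.

```java
/** Hypothetical helper: decides how many download threads to start. */
final class DownloadMode {
    static final long MIN_SIZE = 2L << 20; // 2 MiB, the same value as MIN_SIZE in the outline

    /** Multithreaded mode needs both Range support and a file larger than MIN_SIZE. */
    static int threadsFor(boolean resumable, long fileSize, int requestedThreads) {
        boolean multithreaded = resumable && fileSize > MIN_SIZE;
        return multithreaded ? requestedThreads : 1;
    }
}
```

A small file or a server without Range support therefore falls back to a single thread that downloads the whole file.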

The entry method of DownloadThread is run(), which works like this:

  • Call the download() method to download the specified part of the data;
  • If it succeeds, the thread terminates; if it fails, go back to the previous step and try again.

Download test

Multithreaded downloading has an advantage when the server limits the speed of each connection. If the server does not limit the per-connection speed, a single connection can already approach the bandwidth limit, and extra threads help little.

Let's compare single-threaded and multithreaded downloads in a case where the speed of a single connection is much lower than the available bandwidth.

First, a single-threaded download:

It took 54.133 seconds with an average download speed of 42 KB/s.

Then a download with 10 threads:

It took 10.144 seconds with an average download speed of 228 KB/s.

As you can see, with 10 threads the download finished about 5.3 times faster than with a single thread (10.144 seconds versus 54.133 seconds).

In practice, the timeout should be chosen according to the network conditions, because it has a great impact on the download speed. If the timeout is too short, threads time out and reconnect frequently, and since establishing a connection is a time-consuming operation, download efficiency drops. If the timeout is too long, the client may keep waiting on a connection that has already failed, wasting time unnecessarily.

Conclusion

This is a basic implementation of a multithreaded downloader that divides the file into fixed N segments and assigns them to N threads for downloading. When one thread completes downloading, the thread ends and is not reused.

Later, I will optimize the program further. On the one hand, I will adopt a download strategy similar to IDM's to improve download efficiency. On the other hand, I will add features and improve robustness, especially exception handling.

Interested readers can look into more efficient download methods; any ideas are welcome in the comments.

