Background
Recently I was working on a data-export feature. Since the exported files run to about 3GB each and are produced in bulk, I decided to compress the final results.
The first day
Compression in Java? Emmm… the first thing that comes to mind is the various APIs under java.util.zip.
```java
/**
 * @param fileNames  list of file names to be compressed (including relative paths)
 * @param zipOutName name of the output zip file
 */
public static void batchZipFiles(List<String> fileNames, String zipOutName) {
    byte[] buffer = new byte[4096];
    ZipOutputStream zipOut = null;
    try {
        zipOut = new ZipOutputStream(new FileOutputStream(zipOutName));
        for (String fileName : fileNames) {
            File inputFile = new File(fileName);
            if (inputFile.exists()) {
                BufferedInputStream bis = new BufferedInputStream(new FileInputStream(inputFile));
                zipOut.putNextEntry(new ZipEntry(inputFile.getName()));
                int size = 0;
                while ((size = bis.read(buffer)) >= 0) {
                    zipOut.write(buffer, 0, size);
                }
                // close the current entry and the input stream
                zipOut.closeEntry();
                bis.close();
            }
        }
    } catch (Exception e) {
        log.error("batchZipFiles error:sourceFileNames:" + JSONObject.toJSONString(fileNames), e);
    } finally {
        if (null != zipOut) {
            try {
                zipOut.close();
            } catch (Exception e) {
                log.error("batchZipFiles error:sourceFileNames:" + JSONObject.toJSONString(fileNames), e);
            }
        }
    }
}
```
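For completeness, a hypothetical call site might look like this (the file paths are made up for illustration):

```java
// Hypothetical usage — the file paths are examples only
List<String> files = Arrays.asList(
        "/data/export/result-part1.txt",
        "/data/export/result-part2.txt",
        "/data/export/result-part3.txt");
batchZipFiles(files, "/data/export/result.zip");
```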
BufferedInputStream reads the file contents, ZipOutputStream's putNextEntry method opens a compressed entry for each file, and everything ends up in the final zipOutName file. Because BufferedInputStream buffers the input stream, reads go through an in-memory buffer (the byte array in the code) instead of hitting the disk on every call, which is much more efficient than a plain FileInputStream. But not fast enough! The time is as follows:
[Figure: time to compress three 3.5GB files]
The second day
Then I thought of NIO. Traditional IO, i.e. BIO (the code above), is synchronous and blocking: reading and writing happen on one thread. NIO is synchronous but non-blocking, and its core pieces are channels, buffers, and selectors. The reason NIO beats BIO in IO-intensive work is multiplexing: fewer threads get more done, which greatly reduces the resource cost of thread switching and contention. Enough talk, here is the code:
```java
/**
 * @param fileNames  list of file names to be compressed (including relative paths)
 * @param zipOutName name of the output zip file
 */
public static void batchZipFiles(List<String> fileNames, String zipOutName) throws Exception {
    ZipOutputStream zipOutputStream = null;
    WritableByteChannel writableByteChannel = null;
    ByteBuffer buffer = ByteBuffer.allocate(2048);
    try {
        zipOutputStream = new ZipOutputStream(new FileOutputStream(zipOutName));
        writableByteChannel = Channels.newChannel(zipOutputStream);
        for (String sourceFile : fileNames) {
            File source = new File(sourceFile);
            zipOutputStream.putNextEntry(new ZipEntry(source.getName()));
            FileChannel fileChannel = new FileInputStream(sourceFile).getChannel();
            while (fileChannel.read(buffer) != -1) {
                // flip the buffer from write mode to read mode before draining it
                buffer.flip();
                while (buffer.hasRemaining()) {
                    writableByteChannel.write(buffer);
                }
                // rewind so the next read refills the buffer
                buffer.rewind();
            }
            fileChannel.close();
        }
    } catch (Exception e) {
        log.error("batchZipFiles error fileNames:" + JSONObject.toJSONString(fileNames), e);
    } finally {
        zipOutputStream.close();
        writableByteChannel.close();
        buffer.clear();
    }
}
```
This version also uses the APIs in the java.nio package: Channels.newChannel() first wraps the zipOutputStream in a writable channel, then each file's content is read through FileInputStream.getChannel() and written straight to the writableByteChannel. Be sure to call buffer.flip() before writing; otherwise the buffer is still in write mode (position at the end of the data), so instead of the file content you end up writing the untouched zero bytes. A quick sketch of flip()'s effect on the buffer follows; after that, the timing compared with the previous version:
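To make the flip() behavior concrete, here is a minimal, self-contained sketch of how it moves the buffer's position and limit:

```java
import java.nio.ByteBuffer;

public class FlipDemo {
    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(8);
        buffer.put(new byte[]{1, 2, 3});      // write mode: pos=3, lim=8
        System.out.println(buffer);           // java.nio.HeapByteBuffer[pos=3 lim=8 cap=8]

        buffer.flip();                        // read mode: lim = old pos, pos = 0
        System.out.println(buffer);           // java.nio.HeapByteBuffer[pos=0 lim=3 cap=8]

        while (buffer.hasRemaining()) {
            System.out.println(buffer.get()); // prints 1, 2, 3 — the bytes just written
        }
        // Without flip(), pos would still be 3 and lim 8, so a write(buffer)
        // would emit the 5 untouched (zero) bytes instead of the data.
    }
}
```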
[Figure: time to compress three 3.5GB files]
The third day
Keep optimizing! I had heard that memory-mapped files are even faster. What was I waiting for? Time to try it. Code:
```java
/**
 * v3.0
 * @param fileNames  list of file names to be compressed (including relative paths)
 * @param zipOutName name of the output zip file
 */
public static void batchZipFiles(List<String> fileNames, String zipOutName) {
    ZipOutputStream zipOutputStream = null;
    WritableByteChannel writableByteChannel = null;
    MappedByteBuffer mappedByteBuffer = null;
    try {
        zipOutputStream = new ZipOutputStream(new FileOutputStream(zipOutName));
        writableByteChannel = Channels.newChannel(zipOutputStream);
        for (String sourceFile : fileNames) {
            File source = new File(sourceFile);
            long fileSize = source.length();
            zipOutputStream.putNextEntry(new ZipEntry(source.getName()));
            // map the file in chunks of at most Integer.MAX_VALUE bytes
            int count = (int) Math.ceil((double) fileSize / Integer.MAX_VALUE);
            long pre = 0;
            long read = Integer.MAX_VALUE;
            for (int i = 0; i < count; i++) {
                if (fileSize - pre < Integer.MAX_VALUE) {
                    read = fileSize - pre;
                }
                mappedByteBuffer = new RandomAccessFile(source, "r").getChannel()
                        .map(FileChannel.MapMode.READ_ONLY, pre, read);
                writableByteChannel.write(mappedByteBuffer);
                pre += read;
            }
            // release the mapped memory manually
            Method m = FileChannelImpl.class.getDeclaredMethod("unmap", MappedByteBuffer.class);
            m.setAccessible(true);
            m.invoke(FileChannelImpl.class, mappedByteBuffer);
            mappedByteBuffer.clear();
        }
    } catch (Exception e) {
        log.error("zipMoreFile error fileNames:" + JSONObject.toJSONString(fileNames), e);
    } finally {
        try {
            if (null != zipOutputStream) {
                zipOutputStream.close();
            }
            if (null != writableByteChannel) {
                writableByteChannel.close();
            }
            if (null != mappedByteBuffer) {
                mappedByteBuffer.clear();
            }
        } catch (Exception e) {
            log.error("zipMoreFile error fileNames:" + JSONObject.toJSONString(fileNames), e);
        }
    }
}
```
There are two pitfalls here:

1. FileChannel.map() cannot map a region larger than Integer.MAX_VALUE bytes (roughly 2GB) at a time, so a large file has to be mapped into memory in chunks, which is what the inner loop above does.

2. Once a file has been mapped, the mapped memory is not released even after calling clear() on the MappedByteBuffer — a long-standing JDK bug — so it has to be released manually, here by reflectively invoking the private FileChannelImpl.unmap method.
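As a side note, the FileChannelImpl.unmap trick above relies on a private JDK 8-era method. On JDK 9+ the same manual release is usually done through sun.misc.Unsafe.invokeCleaner — a sketch, assuming sun.misc.Unsafe is still accessible in your runtime:

```java
import java.lang.reflect.Field;
import java.nio.MappedByteBuffer;
import sun.misc.Unsafe;

// JDK 9+ sketch: release a MappedByteBuffer without FileChannelImpl.unmap.
public final class MappedBufferUtil {

    private static final Unsafe UNSAFE;

    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static void unmap(MappedByteBuffer buffer) {
        // invokeCleaner unmaps immediately instead of waiting for GC
        UNSAFE.invokeCleaner(buffer);
    }
}
```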
The speed!
[Figure: time to compress three 3.5GB files]
I must be doing something wrong — why is this the slowest of the three? Are the files too big? Does my machine have too little memory? I need to think it over; feel free to discuss it with me in the comments.
The fourth day
I wondered whether batch compression was slow because it was serial — would multi-threaded, parallel compression be faster? I was going to write it myself, but a search turned up Apache Commons Compress, so there was no need to reinvent the wheel. On to the code:
```java
/**
 * @param zipOutName   name of the output zip file
 * @param fileNameList list of file names to be compressed (including relative paths)
 */
public static void compressFileList(String zipOutName, List<String> fileNameList)
        throws IOException, ExecutionException, InterruptedException {
    ThreadFactory factory = new ThreadFactoryBuilder().setNameFormat("compressFileList-pool-").build();
    ExecutorService executor = new ThreadPoolExecutor(5, 10, 60, TimeUnit.SECONDS,
            new LinkedBlockingQueue<>(20), factory);
    ParallelScatterZipCreator parallelScatterZipCreator = new ParallelScatterZipCreator(executor);
    OutputStream outputStream = new FileOutputStream(zipOutName);
    ZipArchiveOutputStream zipArchiveOutputStream = new ZipArchiveOutputStream(outputStream);
    zipArchiveOutputStream.setEncoding("UTF-8");
    for (String fileName : fileNameList) {
        File inFile = new File(fileName);
        final InputStreamSupplier inputStreamSupplier = () -> {
            try {
                return new FileInputStream(inFile);
            } catch (FileNotFoundException e) {
                e.printStackTrace();
                return new NullInputStream(0);
            }
        };
        ZipArchiveEntry zipArchiveEntry = new ZipArchiveEntry(inFile.getName());
        zipArchiveEntry.setMethod(ZipArchiveEntry.DEFLATED);
        zipArchiveEntry.setSize(inFile.length());
        zipArchiveEntry.setUnixMode(UnixStat.FILE_FLAG | 436);
        parallelScatterZipCreator.addArchiveEntry(zipArchiveEntry, inputStreamSupplier);
    }
    parallelScatterZipCreator.writeTo(zipArchiveOutputStream);
    zipArchiveOutputStream.close();
    outputStream.close();
    log.info("ParallelCompressUtil->ParallelCompressUtil-> info:{}",
            JSONObject.toJSONString(parallelScatterZipCreator.getStatisticsMessage()));
}
```
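Two details worth knowing about ParallelScatterZipCreator (both from the commons-compress javadoc, as I understand it): the no-argument constructor manages its own thread pool, one thread per available processor, and every entry must have its compression method set before it is added, otherwise addArchiveEntry rejects it:

```java
// No custom pool needed? The no-arg constructor creates one thread
// per available processor (per the commons-compress javadoc).
ParallelScatterZipCreator creator = new ParallelScatterZipCreator();

// The method is mandatory — addArchiveEntry throws if it is unset.
ZipArchiveEntry entry = new ZipArchiveEntry("report.csv");
entry.setMethod(ZipArchiveEntry.DEFLATED);
```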
First, the results:
[Figure: time to compress three 3.5GB files]
Parallelism is fast!
As for how it works under the hood, I plan to write that up in a dedicated post later, to deepen the understanding!