This is the 8th day of my participation in the Gwen Challenge in November. See details: The Last Gwen Challenge in 2021.
Requirements
Last week I was unexpectedly handed a task: export the data for the years XX to XX from the XX website. Each exported file is named after its corresponding date. After exporting, I noticed that some files had the same size but different names. When I opened a few of them, I found that two files with different names had exactly the same content. I am not yet sure why; my guess is that the website is at fault. It turned out that only about 30% of the data was not duplicated. I was baffled!
Without further ado, the first job is to pick out the files that are not duplicated, or equivalently, to delete the duplicate files. Python has a built-in module, filecmp, that can be used to compare files.
Writing the code
The exported files are all saved in the same folder in the same format. First, a look at how filecmp.cmp() is used according to the official documentation. In summary:
filecmp.cmp(f1, f2, shallow=True)
f1/f2: the paths of the two files to be compared.
shallow: defaults to True, which means only the os.stat() signatures of the two files (file type, size, and modification time) are compared; if the signatures are identical, the files are treated as equal without being read. When set to False, the contents of the files are compared as well.
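To make the difference between the two shallow settings concrete, here is a minimal sketch. The helper name compare_both_ways and the file names a.txt and b.txt are made up for illustration. It assumes two files of the same size whose modification times are forced to be equal with os.utime(), so that their os.stat() signatures match even though their contents differ.

```python
import filecmp
import os
import tempfile
from pathlib import Path


def compare_both_ways(dir_path):
    """Hypothetical helper: compare two same-sized files with both settings."""
    a = Path(dir_path) / "a.txt"  # illustrative file names
    b = Path(dir_path) / "b.txt"
    a.write_text("12345")
    b.write_text("54321")  # same size, different content

    # Force identical access/modification times so the signatures match
    os.utime(a, (1_600_000_000, 1_600_000_000))
    os.utime(b, (1_600_000_000, 1_600_000_000))

    shallow_result = filecmp.cmp(a, b, shallow=True)   # signature only
    deep_result = filecmp.cmp(a, b, shallow=False)     # reads the contents
    return shallow_result, deep_result


with tempfile.TemporaryDirectory() as tmp:
    shallow_result, deep_result = compare_both_ways(tmp)
    print(shallow_result, deep_result)
```

In this setup the shallow comparison reports the files as equal while the content comparison does not, which is why shallow=False is the safer choice when deciding what to delete.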
Just in case there were any problems with the code, I created a test folder and manually created six files in it: files 1 to 5 contain only the digits 1, 2, 3, 4, and 5 respectively, and the sixth is an empty file. I then made a copy of every file.
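The same test setup could also be scripted rather than created by hand. The sketch below is only an illustration: the function name build_test_folder, the .txt extension, and the " - copy" suffix are assumptions, and it writes into a temporary directory instead of the Desktop folder.

```python
import shutil
import tempfile
from pathlib import Path


def build_test_folder(folder):
    """Hypothetical helper: create six small files plus one copy of each."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)

    # Files 1 to 5 contain only the matching digit
    for n in range(1, 6):
        (folder / f"{n}.txt").write_text(str(n))
    # The sixth file is empty
    (folder / "6.txt").write_text("")

    # Snapshot the listing first so the copies are not copied again
    for original in list(folder.iterdir()):
        copy_name = f"{original.stem} - copy{original.suffix}"
        shutil.copy2(original, original.with_name(copy_name))

    return sorted(p.name for p in folder.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    names = build_test_folder(Path(tmp) / "test")
    print(len(names))
```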
Test code
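from pathlib import Path
import filecmp

# Collect only the files directly inside the test folder (subfolders are skipped)
path_list = [path for path in Path(r'C:\Users\pc\Desktop\test').iterdir() if path.is_file()]

for front in range(len(path_list) - 1):
    for later in range(front + 1, len(path_list)):
        # shallow=False also compares the file contents, not just the metadata
        if filecmp.cmp(path_list[front], path_list[later], shallow=False):
            path_list[front].unlink()  # delete the earlier of the two identical files
            break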
Running effect
The overall logic of the code is very simple. First it gets "all files" in the target folder, where "all files" means the paths of the first-level files in the test directory: if the test folder contains subfolders, the files inside them are not collected. Because path.is_file() is specified, path_list contains only the paths of files such as TXT, XLSX, CSV, and zip, not the paths of folders. Then a double loop compares the contents of each pair of files; when a file has the same content as a later one, the earlier file is deleted.
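Since deleting files is hard to undo, one possible variation is to wrap the same double loop in a function with a dry-run switch, so the duplicates can be inspected first. This is only a sketch: the function name remove_duplicates, the dry_run parameter, and the sample file names are assumptions and not part of the original script.

```python
import filecmp
import tempfile
from pathlib import Path


def remove_duplicates(folder, dry_run=True):
    """Return the duplicate files found; delete them unless dry_run is True."""
    # Only first-level regular files, sorted so the order is predictable
    files = sorted(p for p in Path(folder).iterdir() if p.is_file())
    duplicates = []
    for front in range(len(files) - 1):
        for later in range(front + 1, len(files)):
            # shallow=False compares contents, not just os.stat() metadata
            if filecmp.cmp(files[front], files[later], shallow=False):
                duplicates.append(files[front])
                if not dry_run:
                    files[front].unlink()
                break  # one later match is enough to mark this file
    return duplicates


with tempfile.TemporaryDirectory() as tmp:
    # Illustrative sample data: a.txt and b.txt have the same content
    for name, text in [("a.txt", "1"), ("b.txt", "1"), ("c.txt", "2"), ("d.txt", "")]:
        (Path(tmp) / name).write_text(text)

    found = [p.name for p in remove_duplicates(tmp, dry_run=True)]
    remove_duplicates(tmp, dry_run=False)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
    print(found, remaining)
```

Because the earlier file of each matching pair is the one removed, the last copy of every distinct content is what remains in the folder.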
Although it is only a few lines of code, it saves a lot of manual processing time. OK, that's a wrap~
This is what I wanted to share today. Search for Python New Horizons on WeChat, bringing you more useful knowledge every day.