This is day 2 of my participation in the November Writing Challenge. Event details: The Last Writing Challenge of 2021

0x1. Background

Yesterday afternoon I had wrapped up my work with an hour still to go before clocking out. While slacking off, I saw a member of one of my chat groups ask a technical question.

Ever eager to help, Brother Jie smelled a chance to practice. I was only waiting to clock out anyway, the task looked simple, and I might end up needing it myself, so I left the message "I'll get it to you before the end of the day" and started tinkering.


0x2. Script idea

① Basic idea

  • A project contains multiple modules, usually kept under the same directory. Recursively traverse that directory and filter the entries to build a list of all resource files;
  • Traverse the resource file list, exclude certain directories or files (e.g. /build, AndroidManifest.xml), and group files that share the same file name;
  • Format the results and write them to a file.

② Implementation details

  • 1. Retrieve all resource files

Use os.listdir() to list everything in the current directory; os.chdir() switches the current directory. Check each entry while traversing: directory → recurse, resource file → save, non-resource file → skip. To make the retrieval process easier to follow, the script also prints each entry's type. This step produces the resource file list.
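
As a point of comparison, the same retrieval step can be sketched with os.walk(), which handles the recursion without changing the working directory. This is an alternative illustration, not the approach the script itself takes; the names collect_res_files and RES_SUFFIXES are hypothetical.

```python
import os

# Hypothetical defaults for illustration; keep them lowercase and adjust as needed.
RES_SUFFIXES = ('.xml', '.png', '.jpg', '.webp', '.svg')


def collect_res_files(root, suffixes=RES_SUFFIXES):
    """Return absolute paths of all files under root whose name ends with one of suffixes."""
    found = []
    for dir_path, _dir_names, file_names in os.walk(root):
        for file_name in file_names:
            # Lowercase the name so that '.PNG' and '.png' are both matched.
            if file_name.lower().endswith(suffixes):
                found.append(os.path.abspath(os.path.join(dir_path, file_name)))
    return found
```

Calling collect_res_files(os.getcwd()) would then return the resource list for the current directory tree.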

  • 2. Analyze resource files with duplicate names

Use a dictionary keyed by file name, whose value is a list of 'md5 → absolute file path' strings.
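
A minimal sketch of this grouping step, assuming the resource list is a plain list of absolute paths. The helper names file_md5 and group_by_name are made up for illustration; the full script in 0x3 uses an ordinary dict and its own get_file_md5().

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=4096):
    """Hash a file in chunks so large files are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_name(paths):
    """Map file name -> list of 'md5 -> path' strings, keeping only names seen more than once."""
    groups = defaultdict(list)
    for path in paths:
        groups[os.path.basename(path)].append("{} -> {}".format(file_md5(path), path))
    return {name: entries for name, entries in groups.items() if len(entries) > 1}
```

Keeping the MD5 next to each path makes it easy to see at a glance whether two same-named files are identical copies or genuinely different content.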

  • 3. Format output

The dictionary of same-name files produced in the previous step is checked for emptiness, then traversed to concatenate the output text, which is finally written to the result file.
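
The formatting step might look roughly like the sketch below. It assumes the input is a dict mapping a key (file name or MD5) to a collection of entries; build_report is a hypothetical name, and the exact wording of the summary line is an assumption rather than the script's real output.

```python
def build_report(groups, hint, total_files):
    """Render only the groups that hold more than one entry as a plain-text report."""
    repeated = {key: values for key, values in groups.items() if len(values) > 1}
    if not repeated:
        return "No {} resources detected.\n".format(hint)
    lines = ["Resource files scanned: [{}], {} groups: [{}]".format(
        total_files, hint, len(repeated)), ""]
    for key, values in repeated.items():
        # One banner per duplicated key, followed by its entries in sorted order.
        lines.append("{} {} {}".format("=" * 18, key, "=" * 18))
        lines.extend(sorted(values))
        lines.append("")
    return "\n".join(lines)
```

Collecting lines in a list and joining them once avoids running str.format() over text that already contains file paths, which could fail if a path happened to contain braces.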

The overall flow is fairly simple; most of the work is in the details.

③ Usage and demo

To use this script, you need Python installed.

Either double-click the script to run it, or type python repeat_res_check.py on the command line.

Two files will then be generated under the directory:

  • name_repeat_result.txt → files that share the same file name; useful for finding same-named files whose contents differ (different MD5)

  • md5_repeat_result.txt → files with the same MD5 stored at different paths; useful for slimming down resources by removing duplicate files ~

Readers can also customize the resource types and the excluded directories/files as needed, so the script is not limited to searching Android projects.
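
For instance, the two settings could be adapted to a web front-end project along these lines. The values are hypothetical examples, and is_excluded is an illustrative helper that mirrors the substring check used in the script.

```python
import os

# Hypothetical values for a web front-end project; replace them with your own.
res_suffix_tuple = ('.js', '.css', '.svg', '.png')
exclude_dir_list = ["node_modules{}".format(os.path.sep),
                    "dist{}".format(os.path.sep),
                    "package-lock.json"]


def is_excluded(path):
    """A path is skipped if it contains any of the excluded fragments."""
    return any(name in path for name in exclude_dir_list)
```

Because this is a plain substring match, an entry such as "build/" also matches a directory named "rebuild/", so choose fragments that are specific enough for your project.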

If you know some Python, you can extend it yourself, for example by indexing where resources are referenced or batch-replacing multiple files.
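
As one possible starting point for the reference-indexing idea, the sketch below scans source files for the base name of a resource. It is not part of the script; find_references and its default suffix list are assumptions made for illustration.

```python
import os
import re


def find_references(root, res_name, source_suffixes=('.java', '.kt', '.xml')):
    """Return (path, line_number) pairs for lines that mention the resource's base name."""
    stem = os.path.splitext(res_name)[0]
    # Word boundaries keep 'ic_back' from matching 'ic_back_dark'.
    pattern = re.compile(r'\b{}\b'.format(re.escape(stem)))
    hits = []
    for dir_path, _dir_names, file_names in os.walk(root):
        for file_name in file_names:
            if not file_name.endswith(source_suffixes):
                continue
            path = os.path.join(dir_path, file_name)
            with open(path, encoding='utf-8', errors='ignore') as f:
                for line_number, line in enumerate(f, start=1):
                    if pattern.search(line):
                        hits.append((path, line_number))
    return hits
```

A call such as find_references(os.getcwd(), "ic_back.png") would list every line that mentions ic_back, which is a rough but quick way to check whether a duplicate can be removed safely.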


0x3. Complete code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
   File     : repeat_res_check.py
   Author   : CoderPig
   date     : 2021-11-04 17:28
   Desc     : Duplicate resource retrieval script
-------------------------------------------------
"""
import os
import threading as t
import hashlib

search_path = os.getcwd()  # directory to analyze; defaults to the current working directory, or hard-code a path

# output file, in order: name repeat, MD5 repeat
name_repeat_result_file = os.path.join(os.getcwd(), "name_repeat_result.txt")
md5_repeat_result_file = os.path.join(os.getcwd(), "md5_repeat_result.txt")

# temporary results
res_list = []  # list of resource files
name_repeat_dict = {}
md5_repeat_dict = {}

# resource suffix tuples (add them as needed)
res_suffix_tuple = ('.xml', '.jpg', '.png', '.webp', '.JPG', '.PNG', '.svg', '.SVG', '.py')

# Exclude directory list or files (add your own as needed)
exclude_dir_list = ["build{}".format(os.path.sep),
                    ".idea{}".format(os.path.sep),
                    "AndroidManifest.xml"]

# File read/write lock
lock = t.RLock()


# Retrieve all resource files
def search_all_res_files(path):
    os.chdir(path)
    items = os.listdir(os.curdir)
    for item in items:
        path = os.path.join(item)
        # Take the last component of the path, i.e. the file name
        file_name = path.split(os.path.sep)[-1]
        absolute_path = os.path.join(os.getcwd(), path)
        # determine if it is a directory
        if os.path.isdir(path):
            print("[-]", absolute_path)
            search_all_res_files(path)
            os.chdir(os.pardir)
        # Only collect resource files
        elif file_name.endswith(res_suffix_tuple):
            print("[!] ", absolute_path)
            res_list.append(absolute_path)
        else:
            print("[+]", absolute_path)


# Analyze duplicate name resource files
def analysis_repeat_name_files():
    print("Analyze duplicate name files...")
    for res in res_list:
        if not any(name in res for name in exclude_dir_list):
            res_file_name = res.split(os.path.sep)[-1]
            file_md5 = get_file_md5(res)
            if name_repeat_dict.get(res_file_name) is None:
                name_repeat_dict[res_file_name] = ["{} -> {}".format(file_md5, res)]
            else:
                name_repeat_dict[res_file_name].append("{} -> {}".format(file_md5, res))
    if len(name_repeat_dict.keys()) == 0:
        print("Duplicate name resource not detected...")
    else:
        format_output(name_repeat_dict, "Duplicate names", name_repeat_result_file)


# Analyze MD5-duplicate resource files
def analysis_repeat_md5_files():
    print("Parse MD5 duplicate files...")
    for res in res_list:
        if not any(name in res for name in exclude_dir_list):
            file_md5 = get_file_md5(res)
            if md5_repeat_dict.get(file_md5) is None:
                md5_repeat_dict[file_md5] = {res}
            else:
                md5_repeat_dict[file_md5].add(res)
    if len(md5_repeat_dict.keys()) == 0:
        print("Md5 duplicate resource not detected...")
    else:
        format_output(md5_repeat_dict, "Md5 repeat", md5_repeat_result_file)


# format output
def format_output(origin_dict, hint, result_file):
    print("Generating {} analysis results...".format(hint))
    if len(origin_dict.keys()) == 0:
        output_content = "No {} resources detected...".format(hint)
    else:
        body = ''
        repeat_file_count = 0
        for (k, v) in origin_dict.items():
            if len(v) > 1:
                repeat_file_count += 1
                body += "{} {} {}\n".format('=' * 18, k, '=' * 18)
                for value in v:
                    body += "{}\n".format(value)
                body += "\n\n"
        # Build the summary line last, once the count is known, so that file
        # paths in the body are never passed through str.format()
        header = "Resource files retrieved: [{}], {} resource files: [{}]\n\n".format(
            len(res_list), hint, repeat_file_count)
        output_content = header + body
    write_str_data(output_content, result_file)


# get file MD5
def get_file_md5(file_name):
    m = hashlib.md5()
    try:
        with lock:
            with open(file_name, 'rb') as f:
                while True:
                    data = f.read(4096)
                    if not data:
                        break
                    m.update(data)
            return m.hexdigest()
    except OSError as reason:
        print(str(reason))


# Write the contents to a file
def write_str_data(content, file_path):
    with lock:
        try:
            with open(file_path, "w+", encoding='utf-8') as f:
                f.write(content + "\n")
            print("Output result file: {}\n".format(file_path))
        except OSError as reason:
            print(str(reason))


if __name__ == '__main__':
    print("Current search directory: {}".format(search_path))
    search_all_res_files(search_path)
    print("\nResource files retrieved, starting analysis...\n")
    if len(res_list) == 0:
        print("No resource file detected")
    else:
        analysis_repeat_name_files()
        analysis_repeat_md5_files()

That is all for this section. The script has been tested on several of our company's projects and runs fine. Readers are welcome to try it out, send feedback, and suggest improvements. Thank you ~

Life is short, I use Python🐍~