Similar-image detection is a common problem when preparing data for model training. This article introduces four commonly used image hashing algorithms and walks through the whole detection process with Colab code.

Xiao Wang, a novice "alchemist" (model trainer), recently ran into a problem that cost him several strands of hair.

When I asked, I found that his mentor had assigned him a deep learning model training task, but the image dataset for training was somewhat complicated:

In addition to the existing public data sets, it also contains images crawled from Google, Bing and other sites.

Duplicate images make the model's performance unreliable. After all:

  • Duplicate images introduce bias into the dataset, so the deep learning model ends up learning patterns specific to the repeated images.

  • Having learned those specific patterns, the model generalizes less well to new images.

Deleting duplicate images by hand is not a realistic option. Datasets often contain millions of images, so manual inspection and deletion is a non-trivial process that consumes a lot of time.

Hashing algorithms became Wang's first choice.

Image similarity retrieval: just "hash" it

A hash algorithm is a handy tool for image similarity retrieval. It takes input data of any size and computes a fixed-length output digest (a string).

The digests can then be compared: the closer two digests are, the more similar the images.
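"Closer" is usually measured as the Hamming distance, that is, the number of bit positions in which two hashes differ. Here is a minimal sketch of that comparison; the three hex values are made up for illustration and do not come from real images.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bit positions in which two equal-length hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


# Three hypothetical 64-bit image hashes (made-up values).
original = int("ffd8b0a090808000", 16)
resized = int("ffd8b0a090808001", 16)    # almost the same bits as `original`
unrelated = int("0123456789abcdef", 16)

print(hamming_distance(original, resized))    # small distance: likely duplicates
print(hamming_distance(original, unrelated))  # large distance: different images
```

In practice a small threshold on this distance decides whether two images count as duplicates; the right threshold depends on the dataset and is something to tune.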

The hashing algorithm has the following characteristics:

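  • The same input always yields the same output.

  • Different inputs are very likely to produce different outputs.

Note: With a general-purpose (cryptographic) hash, even a one-byte difference between two inputs can produce a completely different hash value. The image hashes described below are designed the other way around: visually similar images yield similar hash values, which is why their outputs can be compared.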
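These properties can be checked with a general-purpose hash from Python's standard library. The sketch below uses SHA-256, which is my choice for illustration and not one of the image hashes discussed in this article; the two byte strings are stand-ins for image files.

```python
import hashlib

data_a = b"pretend these are the bytes of an image"
data_b = b"pretend these are the bytes of an imagf"  # only the last byte differs

digest_a = hashlib.sha256(data_a).hexdigest()
digest_b = hashlib.sha256(data_b).hexdigest()

# Same input, same output.
same_again = hashlib.sha256(data_a).hexdigest() == digest_a

# Count how many of the 256 output bits differ between the two digests.
diff_bits = bin(int(digest_a, 16) ^ int(digest_b, 16)).count("1")

print(digest_a)
print(digest_b)
print(same_again, f"{diff_bits} of 256 bits differ")
```

A one-byte change scrambles the digest, so this kind of hash only finds byte-identical files. Finding images that merely look alike needs the image hashes introduced next.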

Common hash algorithms in the ImageHash Python library include aHash, pHash, dHash, and wHash.

Average Hash (aHash): Shrinks the image to an 8×8 grayscale image, then sets each of the 64 bits of the hash according to whether the corresponding pixel value is greater than the average of all the pixel values.

aHash is fast to compute and is not affected by image size, but it is sensitive to the mean value. For example, gamma correction or histogram equalization changes the mean, which can lead to incorrect matches, so accuracy cannot be guaranteed.

aHash image processing result
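The aHash steps can be sketched in a few lines of plain Python. This is a simplified illustration, not the ImageHash implementation: it assumes the image has already been shrunk to an 8×8 grayscale grid, and the helper `gamma_correct` and the sample pixel values are made up for the demo.

```python
def average_hash(gray):
    """aHash sketch for an 8x8 grayscale grid (8 rows of 8 values, 0-255)."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        # Append one bit per pixel: 1 if brighter than the mean, else 0.
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def gamma_correct(gray, gamma):
    """Apply gamma correction to every pixel (values stay within 0-255)."""
    return [[255 * (p / 255) ** gamma for p in row] for row in gray]


# Hypothetical 8x8 image: dark, mid-tone and bright vertical bands.
image = [[10, 10, 10, 10, 120, 120, 250, 250] for _ in range(8)]
darker = gamma_correct(image, 2.2)

hash_original = average_hash(image)
hash_gamma = average_hash(darker)
print(f"{hash_original:016x}")
print(f"{hash_gamma:016x}")
print(bin(hash_original ^ hash_gamma).count("1"), "bits differ")
```

Gamma correction moves the mid-tone band from above the mean to below it, so those bits flip even though the picture shows the same content. This is the sensitivity to the mean described above.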

Perceptual Hash (pHash): Similar to aHash, except that pHash does not rely on the average color. It applies a discrete cosine transform (DCT) and compares images by frequency rather than by color value.

pHash avoids the effects of gamma correction or color histogram adjustment. It has high accuracy and few false positives, but is slow to compute.

pHash image processing result
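A rough pure-Python sketch of the pHash idea follows. The assumptions are mine, not taken from the library source: the image is already a 32×32 grayscale grid, the 8×8 block of lowest DCT frequencies is kept, and each bit records whether a coefficient is above the median. The DCT is unnormalized, since a constant overall factor does not change the comparison with the median.

```python
import math


def dct_low_freq(gray, keep=8):
    """Top-left keep x keep block of the unnormalized 2D DCT-II of a square grid."""
    n = len(gray)
    basis = [[math.cos(math.pi * (2 * x + 1) * u / (2 * n)) for x in range(n)]
             for u in range(keep)]
    # Transform along rows first, then along columns (the DCT is separable).
    rows_t = [[sum(row[x] * basis[v][x] for x in range(n)) for v in range(keep)]
              for row in gray]
    return [[sum(rows_t[y][v] * basis[u][y] for y in range(n)) for v in range(keep)]
            for u in range(keep)]


def perceptual_hash(gray):
    """pHash sketch: one bit per low-frequency coefficient, 1 if above the median."""
    coeffs = [c for row in dct_low_freq(gray) for c in row]
    ordered = sorted(coeffs)
    mid = len(ordered) // 2
    median = (ordered[mid - 1] + ordered[mid]) / 2
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | (1 if c > median else 0)
    return bits


# Hypothetical 32x32 test pattern and a copy with half the contrast.
image = [[(x * x + 3 * x * y + 7 * y) % 256 for x in range(32)] for y in range(32)]
dimmed = [[p * 0.5 for p in row] for row in image]

hash_image = perceptual_hash(image)
hash_dimmed = perceptual_hash(dimmed)
print(f"{hash_image:016x}")
print(bin(hash_image ^ hash_dimmed).count("1"), "bits differ after dimming")
```

Scaling every pixel scales every coefficient and the median by the same factor, so the bits stay put. That is the intuition behind pHash being less sensitive to tonal adjustments than aHash.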

Difference Hash (dHash): Works on a principle similar to aHash, except that instead of using the average color value it uses the gradient (the difference between adjacent pixels).

dHash runs about as fast as aHash but has a much lower false positive rate.

dHash image processing result
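The gradient idea can be sketched as follows. This is an illustration under my own assumptions: the image has already been shrunk to a grayscale grid of 8 rows by 9 columns, so comparing each pixel with its left neighbour gives 8 bits per row. The helper `gamma_correct` and the sample values are invented for the demo.

```python
def difference_hash(gray):
    """dHash sketch for a grayscale grid with 8 rows and 9 columns."""
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            # 1 if the pixel is brighter than its left neighbour, else 0.
            bits = (bits << 1) | (1 if right > left else 0)
    return bits


def gamma_correct(gray, gamma):
    """Apply gamma correction to every pixel (values stay within 0-255)."""
    return [[255 * (p / 255) ** gamma for p in row] for row in gray]


# Hypothetical image: brightness rises to the middle column, then falls.
image = [[0, 10, 20, 30, 40, 30, 20, 10, 0] for _ in range(8)]
darker = gamma_correct(image, 2.2)

hash_original = difference_hash(image)
hash_gamma = difference_hash(darker)
print(f"{hash_original:016x}")
print(hash_original == hash_gamma)
```

Because gamma correction keeps brighter pixels brighter than their neighbours, the direction of every gradient survives and the hash does not change in this example.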

Wavelet Hash (wHash): Very similar to pHash, but wHash uses a discrete wavelet transform (DWT) instead of the DCT.

wHash is faster, more accurate and has fewer false positives than pHash.

wHash image processing result

See "Testing different image hash functions" in the references for a comparison of how the different hash algorithms perform.
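As a very simplified sketch of the wavelet idea, the code below repeatedly applies the low-pass half of a 2D Haar transform until an 8×8 grid remains, then thresholds at the median. The Haar wavelet, the square input whose side is 8 times a power of two, and the median threshold are all assumptions of this sketch; the library's own implementation relies on PyWavelets and does more than this.

```python
def haar_lowpass(gray):
    """One 2D Haar step that keeps only the approximation (LL) band.

    Each output value is the sum of a 2x2 block divided by 2 (orthonormal
    Haar scaling); the detail bands are discarded.
    """
    n = len(gray) // 2
    return [[(gray[2 * y][2 * x] + gray[2 * y][2 * x + 1]
              + gray[2 * y + 1][2 * x] + gray[2 * y + 1][2 * x + 1]) / 2
             for x in range(n)] for y in range(n)]


def wavelet_hash(gray):
    """Simplified wHash sketch for a square grid whose side is 8 * 2**k."""
    while len(gray) > 8:
        gray = haar_lowpass(gray)
    coeffs = [c for row in gray for c in row]
    ordered = sorted(coeffs)
    mid = len(ordered) // 2
    median = (ordered[mid - 1] + ordered[mid]) / 2
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | (1 if c > median else 0)
    return bits


# Hypothetical 32x32 image: left half dark, right half bright.
image = [[20] * 16 + [200] * 16 for _ in range(32)]
dimmed = [[p * 0.5 for p in row] for row in image]

hash_image = wavelet_hash(image)
print(f"{hash_image:016x}")
print(hash_image == wavelet_hash(dimmed))
```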

Don't reinvent the wheel: similar image detection with off-the-shelf modules

As a qualified engineer, Wang always tries to avoid reinventing the wheel and to keep development efficient.

After a search, Wang found Jina Hub’s ImageHasher Executor.

After reviewing the relevant documentation, Wang found that an Executor corresponds to one module of a neural search system. Each Executor implements a core data processing function and can be used directly.

A Flow corresponds to the whole neural search system. It connects multiple Executors into a complete search pipeline, which makes it easy to detect similar images.

Straight to the code:

! gdown --id 1wPg_Yx2ydcgsDA3BYO-Lw8ym5vjT0oQ3
! unzip data.zip -d images
! mkdir index
! mv images/*1.* index/
! mkdir query
! mv images/*.* query/
! pip install jina imagehash

Create a Flow to index the Documents:

from jina import Flow
from docarray import Document, DocumentArray
import matplotlib.pyplot as plt

! rm -rf workspace

Each Document containing an image is encoded into a hash value using any of the four hash algorithms above, and the result is then stored with SimpleIndexer:

# Creating a DocumentArray object
docs_index = DocumentArray.from_files('index/*')
docs_index = [doc.load_uri_to_image_tensor() for doc in docs_index]

# Creating the indexing flow with ImageHasher and SimpleIndexer
flow = (
    Flow()
    .add(uses='jinahub://ImageHasher/v0.2', uses_metas={'hash_type': 'dhash'})
    .add(
        uses='jinahub://SimpleIndexer',
        uses_metas={'workspace': 'workspace'},
        uses_with={
            'match_args': {'limit': 1, 'metric': 'euclidean', 'use_scipy': True}
        },
    )
)

# Indexing the Documents using the flow
with flow:
    flow.post(on='/index', inputs=docs_index)

# Display each query image followed by its matches
def print_matches(resp):
    for idx, doc in enumerate(resp.docs):
        print('-'*50)
        print(f'Query {idx + 1}')
        plt.imshow(doc.tensor)
        plt.show()
        for match in doc.matches:
            print('Matching query -->')
            plt.imshow(match.tensor)
            plt.show()
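The Flow above asks SimpleIndexer for the nearest match under the 'euclidean' metric. Assuming the hash is stored as a vector of 0/1 bits (an assumption about the Executor's output that is not confirmed here), this ranks candidates the same way as the Hamming distance, because the squared Euclidean distance between two bit vectors equals the number of differing bits. A small standalone check with made-up bit vectors:

```python
import math


def hamming(a, b):
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def euclidean(a, b):
    """Ordinary Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two hypothetical 8-bit hash vectors (real hashes here would have 64 bits).
bits_a = [1, 0, 1, 1, 0, 0, 1, 0]
bits_b = [1, 1, 1, 0, 0, 0, 1, 1]

print(hamming(bits_a, bits_b))
print(euclidean(bits_a, bits_b) ** 2)
```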

Query any new Document and find the matching Document in the index data:

docs_query = DocumentArray.from_files('query/*')
docs_query = [doc.load_uri_to_image_tensor() for doc in docs_query]

# Using the same flow to find matches
# Opening the flow for incoming queries
with flow:
    flow.post(
        on='/search',
        inputs=docs_query,
        on_done=print_matches,
    )

With the whole pipeline run, let's take a look at Wang's similar image detection results!

Even when two images differ only by a few pixels, a filter or their size, the hashing algorithm still detects them as similar.

Using Jina Hub's ImageHasher Executor, newcomer Wang finally solved the problem of similar images in his dataset. He has put the code on Colab and hopes to join more discussions about deep learning and hashing algorithms → Find the community.

Enthusiastic Xiao Wang is looking forward to your participation!


References:

Imagehashing–Find duplicates complete Colab

ImageHasher Executor

Liao Xuefeng’s official website – hash algorithm

Similar image detection method

Testing different image hash functions

An image hashing library written in Python