There are many ways to recognize verification codes (captchas), such as Tesseract and SVM. The previous few articles introduced the KNN algorithm; this one looks at how to use KNN to recognize verification codes.

Data preparation

This experiment uses CSDN's verification code for practice. The interface is: https://download.csdn.net/index.php/rest/tools/validcode/source_ip_validate/10.5711163911089325

Currently, the interface returns two types of verification codes:

  • Digits only, with little interference. After simple background removal, binarization and thresholding, these can be recognized with the kNN algorithm.
  • Letters plus digits, with background interference and characters that are slightly shifted and deformed. These can also be recognized with the kNN algorithm after background removal, binarization and thresholding.

The second type is used here. Because the two types of captcha have different image sizes, the image size tells you which type the interface returned.

Download verification code

import os
import uuid

import requests
from PIL import Image

url = "http://download.csdn.net/index.php/rest/tools/validcode/source_ip_validate/10.5711163911089325"

# Make sure the output folder exists before writing into it
os.makedirs("./captchas", exist_ok=True)

for i in range(1000):
    resp = requests.get(url)
    filename = "./captchas/" + str(uuid.uuid4()) + ".png"
    with open(filename, 'wb') as f:
        for chunk in resp.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)

    # Keep only images of size 70x25; delete everything else
    im = Image.open(filename)
    if im.size != (70, 25):
        im.close()
        os.remove(filename)
    else:
        print(filename)

Character segmentation

Once downloaded, you need to split the letters. Splitting characters is a rather troublesome task.

Grayscale conversion

Convert the color image to a grayscale image in preparation for binarization. Example code:

from PIL import Image
 
file = ".\\captchas\\0a4a22cd-f16b-4ae4-bc52-cdf4c081301d.png"
im = Image.open(file)
im_gray = im.convert('L')
im_gray.show()

Before processing:

After processing:

Binarization

After grayscale conversion, every pixel has a value between 0 and 255. Binarization sets every pixel above a chosen threshold to 255 and every other pixel to 0. Example code:

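from PIL import Image
import numpy as np

file = ".\\captchas\\0a4a22cd-f16b-4ae4-bc52-cdf4c081301d.png"
im = Image.open(file)
im_gray = im.convert('L')
# im_gray.show()

pix = np.array(im_gray)
print(pix.shape)
print(pix)

threshold = 100  # binarization threshold, the same value as in the overall code
# Pixels above the threshold become 255, all others 0;
# uint8 keeps the array in a type that Image.fromarray accepts
pix = ((pix > threshold) * 255).astype(np.uint8)
print(pix)

out = Image.fromarray(pix)
out.show()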

Binarization output results:

Remove the border

From the binarization output, you can see that in addition to characters, there is also a border, which needs to be removed before cutting characters.

border_width = 1
new_pix = pix[border_width:-border_width,border_width:-border_width]
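
To make the thresholding and border-removal steps concrete, here is a minimal sketch that runs them on a small hand-made array instead of a real captcha. The helper names binarize and strip_border and the sample values are assumptions for illustration; the threshold of 100 and the 1-pixel border follow the values used in this article.

```python
import numpy as np


def binarize(gray, threshold=100):
    """Set pixels above the threshold to 255 and all others to 0."""
    return np.where(gray > threshold, 255, 0).astype(np.uint8)


def strip_border(pix, border_width=1):
    """Drop border_width pixels from every side of the image."""
    return pix[border_width:-border_width, border_width:-border_width]


# Hypothetical 4x5 grayscale image with a 1-pixel dark frame around it
gray = np.array([[0,   0,   0,   0, 0],
                 [0, 200,  90, 210, 0],
                 [0, 101, 100,  30, 0],
                 [0,   0,   0,   0, 0]], dtype=np.uint8)

binary = binarize(gray)
inner = strip_border(binary)
print(inner)
```

Note that a pixel exactly equal to the threshold (100 here) is mapped to 0, because the comparison is strictly greater-than.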

Character cutting

Because the characters do not touch one another, a simple projection method is enough to cut them apart. The binarized image is first projected vertically, that is, the pixels in each column are counted, and the segmentation boundaries are placed where the projection reaches its extreme value (the columns that contain no character pixels). Each cut-out piece is then projected horizontally in the same way to trim the empty rows above and below the character.

Code implementation:

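import cv2
import numpy as np


def vertical_image(image):
    """Display the vertical projection of a binarized image."""
    height, width = image.shape
    # h[x] = number of white (255) pixels in column x
    h = [0] * width
    for x in range(width):
        for y in range(height):
            s = image[y, x]
            if s == 255:
                h[x] += 1
    # Draw one vertical bar per column whose length equals the count
    new_image = np.zeros(image.shape, np.uint8)
    for x in range(width):
        cv2.line(new_image, (x, 0), (x, h[x]), 255, 1)
    cv2.imshow('vert_image', new_image)
    cv2.waitKey()
    cv2.destroyAllWindows()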
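
The projection idea can also be written compactly with NumPy. The sketch below is an illustration, not the code used in this article: column_runs is a hypothetical helper that counts character pixels per column and returns the column ranges separated by empty columns. It assumes character pixels are black (0) on a white (255) background.

```python
import numpy as np


def column_runs(binary, ink=0):
    """Return per-column ink counts and the (start, end) column range of each run."""
    counts = (binary == ink).sum(axis=0)  # vertical projection
    runs = []
    start = None
    for x, count in enumerate(counts):
        if count > 0 and start is None:
            start = x                      # a character begins
        elif count == 0 and start is not None:
            runs.append((start, x))        # an empty column ends the character
            start = None
    if start is not None:                  # character touching the right edge
        runs.append((start, len(counts)))
    return counts, runs


# Hypothetical 4x8 image with two separate black blobs
img = np.full((4, 8), 255, dtype=np.uint8)
img[1:3, 1:3] = 0
img[0:4, 5:7] = 0

counts, runs = column_runs(img)
print(counts.tolist(), runs)
```

Each (start, end) pair can be used directly as a slice, for example img[:, start:end]. Closing a run at the right edge is a design choice of this sketch.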

The overall code

from PIL import Image
import cv2
import numpy as np
import os
import uuid


def clean_bg(filename):
    """Convert to grayscale, binarize, and strip the border."""
    im = Image.open(filename)
    im_gray = im.convert('L')
    image = np.array(im_gray)

    threshold = 100  # binarization threshold
    # uint8 keeps the array in a type that cv2.imwrite accepts
    pix = ((image > threshold) * 255).astype(np.uint8)

    border_width = 2  # pixels cropped from each side
    new_image = pix[border_width:-border_width, border_width:-border_width]
    return new_image


def get_col_rect(image):
    """Return the (start, end) column range of every character."""
    height, width = image.shape
    # h[x] = number of black (0) pixels in column x
    h = [0] * width
    for x in range(width):
        for y in range(height):
            s = image[y, x]
            if s == 0:
                h[x] += 1

    col_rect = []
    in_line = False
    start_line = 0
    blank_distance = 1
    for i in range(len(h)):
        if not in_line and h[i] >= blank_distance:
            in_line = True
            start_line = i
        elif in_line and h[i] < blank_distance:
            rect = (start_line, i)
            col_rect.append(rect)
            in_line = False
            start_line = 0
    return col_rect


def get_row_rect(image):
    """Return the (start, end) row range of the character in one column slice."""
    height, width = image.shape
    # h[y] = number of black (0) pixels in row y
    h = [0] * height
    for y in range(height):
        for x in range(width):
            s = image[y, x]
            if s == 0:
                h[y] += 1

    in_line = False
    start_line = 0
    blank_distance = 1
    row_rect = (0, 0)
    for i in range(len(h)):
        if not in_line and h[i] >= blank_distance:
            in_line = True
            start_line = i
        elif in_line and i == len(h) - 1:
            row_rect = (start_line, i)
        elif in_line and h[i] < blank_distance:
            row_rect = (start_line, i)
            break
    return row_rect


def get_block_image(image, col_rect):
    """Cut one character out of the image, or return None if nothing was found."""
    col_image = image[0:image.shape[0], col_rect[0]:col_rect[1]]
    row_rect = get_row_rect(col_image)
    if row_rect[1] != 0:
        block_image = image[row_rect[0]:row_rect[1], col_rect[0]:col_rect[1]]
    else:
        block_image = None
    return block_image


def split(filename):
    """Split one captcha into characters and save each under letters/."""
    image = clean_bg(filename)
    col_rect = get_col_rect(image)
    for cols in col_rect:
        block_image = get_block_image(image, cols)
        if block_image is not None:
            new_image_filename = 'letters/' + str(uuid.uuid4()) + '.png'
            cv2.imwrite(new_image_filename, block_image)


if __name__ == '__main__':
    os.makedirs('letters', exist_ok=True)  # output folder for the cut characters
    for filename in os.listdir('captchas'):
        current_file = 'captchas/' + filename
        split(current_file)
        print('split file:%s' % current_file)

Data set preparation

After the images are cut, the resulting character images need to be labeled to build the sample set, that is, each character image has to be sorted into the folder of its correct class. The usual way is to sort them by hand.

Because there are so many images, Tesseract-OCR is tried first to recognize them.

Official project address: github.com/tesseract-o…

Windows installation package address: github.com/UB-Mannheim…

Installing Tesseract-OCR

After downloading the installer, run it directly. The important part is setting the environment variables:

  • Add the installation directory (D:\Program Files (x86)\Tesseract-OCR) to PATH
  • Create a new system variable TESSDATA_PREFIX whose value is the path of the tessdata folder (D:\Program Files (x86)\Tesseract-OCR\tessdata)
  • Install pytesseract: pip install pytesseract

Using Tesseract-OCR

It is very simple to use. The code is as follows:

from PIL import Image
import pytesseract
import os
 
 
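def copy_to_dir(filename):
    image = Image.open(filename)
    code = pytesseract.image_to_string(image, config="-c tessedit"
                                                     "_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
                                                     " --psm 10"
                                                     " -l osd"
                                                     " ")
    # Strip the trailing whitespace that Tesseract appends to its output,
    # otherwise it ends up in the folder name
    code = code.strip()
    # makedirs also creates the parent "dataset" folder if it is missing
    os.makedirs("dataset/" + code, exist_ok=True)
    image.save("dataset/" + code + filename.replace("letters", ""))
    image.close()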
 
 
if __name__ == "__main__":
    for filename in os.listdir('letters'):
        current_file = 'letters/' + filename
        copy_to_dir(current_file)
        print(current_file)

Tesseract-OCR's recognition accuracy on these images turned out to be too low to be usable, so it was abandoned and the images still had to be sorted by hand.
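
After sorting by hand, it can help to check how many samples each class ended up with. The following is a minimal sketch under the assumption that the samples are stored as dataset/<label>/<file>.png, the layout used by the code in this article; count_samples is a hypothetical helper name.

```python
import os
from collections import Counter


def count_samples(dataset_dir):
    """Count the files in every label folder under dataset_dir."""
    counts = Counter()
    for label in sorted(os.listdir(dataset_dir)):
        label_dir = os.path.join(dataset_dir, label)
        if os.path.isdir(label_dir):
            counts[label] = len(os.listdir(label_dir))
    return counts


# Only runs when a dataset folder is present in the working directory
if os.path.isdir("dataset"):
    for label, n in count_samples("dataset").items():
        print(label, n)
```

Classes with very few samples are worth topping up, since a nearest-neighbor classifier can only return labels it has seen.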

Unifying the image size

After the manual sorting, the cut images turn out to have different sizes, so they need to be resized to a common size before character recognition.

Implementation:

import cv2


def image_resize(filename):
    img = cv2.imread(filename, cv2.IMREAD_GRAYSCALE)  # read as single-channel grayscale
    print(img)
    # img.shape is (height, width); resize only if it is not already 10 x 6
    if img.shape[0] != 10 or img.shape[1] != 6:
        img = cv2.resize(img, (6, 10), interpolation=cv2.INTER_CUBIC)
        print(img)
    cv2.imwrite(filename, img)

Note that cv2.resize takes the target size as (width, height), whereas img.shape is ordered (height, width, channels); the images here are single-channel, so the shape has only two entries. The interpolation parameter accepts the following options:

  • INTER_NEAREST: nearest-neighbor interpolation
  • INTER_LINEAR: bilinear interpolation (default)
  • INTER_AREA: resampling using the pixel area relation. It may be the preferred method for shrinking an image (decimation) because it gives moiré-free results. When the image is enlarged, it behaves like INTER_NEAREST.
  • INTER_CUBIC: bicubic interpolation over a 4×4 pixel neighborhood
  • INTER_LANCZOS4: Lanczos interpolation over an 8×8 pixel neighborhood

In addition, to make the data more convenient to use, the image can be normalized so that its pixel values are binary (0 or 1). The code is as follows:

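import cv2
import numpy as np


def image_normalize(filename):
    img = cv2.imread(filename, cv2.IMREAD_GRAYSCALE)  # read as single-channel grayscale
    if img.shape[0] != 10 or img.shape[1] != 6:
        img = cv2.resize(img, (6, 10), interpolation=cv2.INTER_CUBIC)
    # Output buffer for the normalized image, same shape and type as the input
    normalized_img = np.zeros(img.shape, img.dtype)
    # Linearly scale the pixel values into the range [0, 1]
    normalized_img = cv2.normalize(img, normalized_img, 0, 1, cv2.NORM_MINMAX)
    cv2.imwrite(filename, normalized_img)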

The norm_type parameter can take the following values:

  • NORM_MINMAX: linearly shifts and scales the array values into a specified range. This is the one generally used.
  • NORM_INF: scales the array so that its C-norm (infinity norm, the maximum absolute value) equals the given value
  • NORM_L1: scales the array so that its L1-norm (sum of absolute values) equals the given value
  • NORM_L2: scales the array so that its L2-norm (Euclidean norm) equals the given value
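
The arithmetic behind NORM_MINMAX can be written out in a few lines of NumPy. This is a sketch of the formula only, not OpenCV's implementation; minmax_normalize and its handling of a constant array are assumptions made here.

```python
import numpy as np


def minmax_normalize(arr, alpha=0.0, beta=1.0):
    """Linearly map the smallest value to alpha and the largest to beta."""
    arr = np.asarray(arr, dtype=float)
    lo, hi = arr.min(), arr.max()
    if hi == lo:
        # Constant input: there is no range to stretch, return alpha everywhere
        return np.full_like(arr, alpha)
    return (arr - lo) / (hi - lo) * (beta - alpha) + alpha


# A binarized 2x2 patch: 0 stays 0 and 255 becomes 1
pixels = np.array([[0, 255], [255, 0]])
normalized = minmax_normalize(pixels)
print(normalized)
```

For an image that contains only the values 0 and 255, the result contains only 0 and 1, which is why the later feature code can test for pixels equal to 0.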

Character recognition

A character image is 6 pixels wide and 10 pixels high, so in the simplest approach one could define 60 features: the value of each of the 60 pixels. Such a high dimension increases the amount of computation, however, so the number of features can be reduced, for example:

  • The number of black pixels in each row gives 10 features
  • The number of black pixels in each column gives 6 features

Together these give 16 features per character. Code implementation:

from sklearn.neighbors import KNeighborsClassifier
import os
from sklearn import preprocessing
import cv2
import numpy as np
import warnings

warnings.filterwarnings(module='sklearn*', action='ignore', category=DeprecationWarning)


def get_feature(file_name):
    img = cv2.imread(file_name, cv2.IMREAD_GRAYSCALE)  # read as single-channel grayscale
    height, width = img.shape

    pixel_cnt_list = []
    # Number of black pixels in each row
    for y in range(height):
        pix_cnt_x = 0
        for x in range(width):
            if img[y, x] == 0:  # black pixel
                pix_cnt_x += 1
        pixel_cnt_list.append(pix_cnt_x)

    # Number of black pixels in each column
    for x in range(width):
        pix_cnt_y = 0
        for y in range(height):
            if img[y, x] == 0:  # black pixel
                pix_cnt_y += 1
        pixel_cnt_list.append(pix_cnt_y)

    return pixel_cnt_list


if __name__ == "__main__":
    test = get_feature("dataset/K/04a0844c-12f2-4344-9b78-ac1d28d746c0.png")
    category = []
    features = []
    # Each sub-folder of dataset/ is one class; its name is the label
    for dir_name in os.listdir('dataset'):
        for filename in os.listdir('dataset/' + dir_name):
            category.append(dir_name)
            current_file = 'dataset/' + dir_name + '/' + filename
            feature = get_feature(current_file)
            features.append(feature)
            # print(current_file)

    # Encode the string labels as integers for the classifier
    le = preprocessing.LabelEncoder()
    label = le.fit_transform(category)

    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(features, label)
    predicted = model.predict(np.array(test).reshape(1, -1))
    print(predicted)
    print(le.inverse_transform(predicted))
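
The row and column counts can also be computed without explicit loops. Below is a minimal NumPy sketch, run on a hand-made 10×6 array rather than a real character image; projection_features is a hypothetical name, and black pixels are assumed to have the value 0.

```python
import numpy as np


def projection_features(img, ink=0):
    """Return the ink-pixel count of every row followed by every column."""
    mask = (img == ink)
    row_counts = mask.sum(axis=1)  # one value per row
    col_counts = mask.sum(axis=0)  # one value per column
    return np.concatenate([row_counts, col_counts]).tolist()


# Hypothetical 10x6 glyph: a vertical stroke in column 2 and a bar across the top
glyph = np.ones((10, 6), dtype=np.uint8)
glyph[:, 2] = 0
glyph[0, 1:5] = 0

features = projection_features(glyph)
print(len(features), features)
```

The first 10 values are the row counts and the last 6 are the column counts, in the same order as the looped version.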

The KNN implementation of scikit-learn is used directly here. For more information, see "scikit-learn for KNN classification".
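
With n_neighbors=1 the prediction is simply the label of the closest training sample. The sketch below shows that idea with plain NumPy and Euclidean distance; it is an illustration, not scikit-learn's implementation, and predict_1nn as well as the short made-up feature vectors are assumptions.

```python
import numpy as np


def predict_1nn(train_features, train_labels, query):
    """Return the label of the training sample closest to query."""
    train = np.asarray(train_features, dtype=float)
    q = np.asarray(query, dtype=float)
    # Euclidean distance from the query to every training sample
    distances = np.sqrt(((train - q) ** 2).sum(axis=1))
    return train_labels[int(np.argmin(distances))]


# Made-up 4-dimensional feature vectors for three classes
train_features = [[4, 1, 1, 0], [1, 1, 4, 4], [2, 2, 2, 2]]
train_labels = ["T", "L", "O"]

prediction = predict_1nn(train_features, train_labels, [4, 1, 0, 0])
print(prediction)
```

In real use the vectors would be the 16 row and column counts of each character image.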