
Preface

K-Nearest Neighbours (KNN) is one of the simplest algorithms in supervised learning. KNN can be used for both classification and regression problems, and the algorithm can be implemented and trained in OpenCV. In this article, we will learn how to perform handwritten digit recognition using a KNN classifier, starting from a basic program and improving it step by step to achieve better performance.
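
Before working with MNIST, it helps to see the OpenCV KNN API in isolation. The following is a minimal sketch on randomly generated 2D points (toy data used purely for illustration, not part of the digit-recognition program):

# A minimal sketch of the OpenCV KNN API on toy 2D data
import cv2
import numpy as np

# 25 random 2D points with random class labels (0 or 1)
train_data = np.random.randint(0, 100, (25, 2)).astype(np.float32)
labels = np.random.randint(0, 2, (25, 1)).astype(np.int32)
# Create and train the classifier
knn = cv2.ml.KNearest_create()
knn.train(train_data, cv2.ml.ROW_SAMPLE, labels)
# Classify a new point using its 3 nearest neighbours
sample = np.random.randint(0, 100, (1, 2)).astype(np.float32)
ret, result, neighbours, dist = knn.findNearest(sample, 3)
print(result)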

Introduction to the MNIST handwritten digit dataset

For completeness: the training data used by the algorithm comes from the MNIST handwritten digit dataset. MNIST originates from the National Institute of Standards and Technology (NIST) in the United States and consists of handwritten digits collected from 250 different writers. The training set contains 60,000 images and the test set contains 10,000 images; each image has its own label, and the image size is 28 × 28 pixels. Many machine learning libraries provide methods for loading the MNIST dataset; here we use the Keras library:

# Import the required libraries
import cv2
import keras
import numpy as np
import matplotlib.pyplot as plt
# Load data
(train_dataset, train_labels), (test_dataset, test_labels) = keras.datasets.mnist.load_data()
train_labels = np.array(train_labels, dtype=np.int32)
# Print the dataset shapes
print(train_dataset.shape, test_dataset.shape)
# Image preview
for i in range(40):
    plt.subplot(4, 10, i+1)
    plt.imshow(train_dataset[i], cmap='gray')
    plt.title(train_labels[i], fontsize=10)
    plt.axis('off')
plt.show()

Using the KNN algorithm to recognize handwritten digits

After loading the dataset, we try to recognize the digits with a KNN classifier. In this initial approach, we use the raw pixel values as features, so the size of each image descriptor is 28 × 28 = 784.

First, Keras is used to load all the digit images. To keep the whole training process in view, we split the loaded training dataset into a training set and a test set, each accounting for 50%:

# Load the dataset
(train_dataset, train_labels), (test_dataset, test_labels) = keras.datasets.mnist.load_data()
train_labels = np.array(train_labels, dtype=np.int32)
# Use the raw image pixels as the descriptor
def raw_pixels(img):
    return img.flatten()
# Shuffle the data
shuffle = np.random.permutation(len(train_dataset))
train_dataset, train_labels = train_dataset[shuffle], train_labels[shuffle]
# Compute the descriptor for each image; here the feature descriptor is the raw pixels
raw_descriptors = []
for img in train_dataset:
    raw_descriptors.append(np.float32(raw_pixels(img)))
raw_descriptors = np.squeeze(raw_descriptors)
# Split the data into training and test sets (50% each)
# Therefore, 30,000 digits are used to train the classifier and 30,000 to test it
partition = int(0.5 * len(raw_descriptors))
raw_descriptors_train, raw_descriptors_test = np.split(raw_descriptors, [partition])
labels_train, labels_test = np.split(train_labels, [partition])

Now we can train the KNN model with the knn.train() method and test it with a get_accuracy() function:

# Train the KNN model
knn = cv2.ml.KNearest_create()
knn.train(raw_descriptors_train, cv2.ml.ROW_SAMPLE, labels_train)
# Test the KNN model
k = 5
ret, result, neighbours, dist = knn.findNearest(raw_descriptors_test, k)
# Compute the accuracy from the ground-truth labels and the predictions
def get_accuracy(predictions, labels):
    acc = (np.squeeze(predictions) == labels).mean()
    return acc * 100
acc = get_accuracy(result, labels_test)
print("Accuracy: {}".format(acc))

We can see that with K = 5, the KNN model achieves 96.48% accuracy, but we can still improve it to obtain higher performance.

The influence of parameter K on the accuracy of handwritten digit recognition

We already know that K is an important parameter affecting the performance of the KNN algorithm. Therefore, we can first try different K values and check their influence on handwritten digit recognition accuracy. To compare the accuracy of models with different K values, we first need to create a dictionary to store the test accuracy for each K value:

from collections import defaultdict
results = defaultdict(list)

Next, call the knn.findNearest() method with varying values of the K parameter and store the results in the dictionary:

# K ranges from 1 to 9
for k in range(1, 10):
    ret, result, neighbours, dist = knn.findNearest(raw_descriptors_test, k)
    acc = get_accuracy(result, labels_test)
    print("{}".format("%.2f" % acc))
    # The '50' key records that 50% of the data was used for training
    results['50'].append(acc)

Finally, plot the results:

ax = plt.subplot(1, 1, 1)
ax.set_xlim(0, 10)
dim = np.arange(1, 10)
for key in results:
    ax.plot(dim, results[key], linestyle='-', marker='o', label="50%")

plt.legend(loc='upper left', title="% training")
plt.title('Accuracy of the K-NN model varying k')
plt.xlabel("number of k")
plt.ylabel("accuracy")
plt.show()

The running results of the program are as follows:

As shown in the figure above, the accuracy changes with the K parameter, so the best performance in an application can be obtained by tuning K.
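
For example, once all accuracies are stored, the best K can be read directly from the results dictionary (a small sketch reusing the variables defined above, not part of the original program):

# Pick the K with the highest accuracy (K was varied from 1 to 9)
best_index = int(np.argmax(results['50']))
print("Best k: {} ({:.2f}% accuracy)".format(best_index + 1, results['50'][best_index]))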

The influence of training data amount on the accuracy of handwritten digit recognition

In machine learning, training a classifier with more data generally improves model performance, because the classifier can better learn the structure of the features. For a KNN classifier, increasing the number of training samples also increases the probability of finding a correct match for the test data in the feature space.

Next, we vary the percentage of images used for training and testing the model to observe the influence of the amount of training data on handwritten digit recognition accuracy:

# Percentages used to split the training dataset
split_values = np.arange(0.1, 1, 0.1)
# Store the resulting accuracies
results = defaultdict(list)
# Create the model
knn = cv2.ml.KNearest_create()
# Influence of different amounts of training data on recognition accuracy
for split_value in split_values:
    # Divide the dataset into training and test sets
    partition = int(split_value * len(raw_descriptors))
    raw_descriptors_train, raw_descriptors_test = np.split(raw_descriptors, [partition])
    labels_train, labels_test = np.split(train_labels, [partition])
    # Train the KNN model
    print('Training KNN model - raw pixels as features')
    knn.train(raw_descriptors_train, cv2.ml.ROW_SAMPLE, labels_train)
    # For each partition, also test the effect of different K values
    for k in range(1, 10):
        ret, result, neighbours, dist = knn.findNearest(raw_descriptors_test, k)
        acc = get_accuracy(result, labels_test)
        print("{}".format("%.2f" % acc))
        results[int(split_value * 100)].append(acc)

The percentage of digit images used to train the algorithm is 10%, 20%, …, 90%, and the percentage used to test it is, correspondingly, 90%, 80%, …, 10%. Finally, plot the results:

ax = plt.subplot(1, 1, 1)
ax.set_xlim(0, 10)
dim = np.arange(1, 10)
for key in results:
    ax.plot(dim, results[key], linestyle='-', marker='o', label=str(key) + "%")

plt.legend(loc='upper left', title="% training")
plt.title('Accuracy of the KNN model varying both k and the percentage of images to train/test')
plt.xlabel("number of k")
plt.ylabel("accuracy")
plt.show()

As can be seen from the figure above, accuracy increases with the number of training images. Therefore, when conditions permit, model performance can be improved by increasing the amount of training data.
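
Likewise, the best combination of training percentage and K can be read from the results dictionary built above (again a sketch, not part of the original program):

# Report the best (training percentage, k) pair found in the experiments
best_split = max(results, key=lambda s: max(results[s]))
best_k = int(np.argmax(results[best_split])) + 1
print("Best: {}% training data, k={} ({:.2f}% accuracy)".format(best_split, best_k, max(results[best_split])))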

Although accuracy has already exceeded 97%, we should not stop there.

The influence of preprocessing on the accuracy of handwritten digit recognition

In all of the examples above, we used raw pixel values as features to train the classifier. In machine learning, the input data can usually be preprocessed before training the classifier in order to improve training performance. Therefore, we next apply preprocessing and observe its effect on handwritten digit recognition accuracy. The preprocessing function deskew() is as follows:

# Side length of each (square) digit image: 28
SIZE_IMAGE = train_dataset.shape[1]

def deskew(img):
    # Compute the image moments
    m = cv2.moments(img)
    if abs(m['mu02']) < 1e-2:
        # The image is already essentially upright
        return img.copy()
    # Measure of skew: ratio of two central moments
    skew = m['mu11'] / m['mu02']
    # Affine transformation that removes the skew
    M = np.float32([[1, skew, -0.5 * SIZE_IMAGE * skew], [0, 1, 0]])
    img = cv2.warpAffine(img, M, (SIZE_IMAGE, SIZE_IMAGE), flags=cv2.WARP_INVERSE_MAP | cv2.INTER_LINEAR)

    return img

The deskew() function straightens the digit using its second-order moments. More specifically, a measure of skew is calculated from the ratio of two central moments (mu11 / mu02), and the calculated skew is used to build an affine transformation that removes the skew from the digit. Next, compare the images before and after preprocessing:

for i in range(10):
    plt.subplot(2, 10, i+1)
    plt.imshow(train_dataset[i], cmap='gray')
    plt.title(train_labels[i], fontsize=10)
    plt.axis('off')
    plt.subplot(2, 10, i+11)
    plt.imshow(deskew(train_dataset[i]), cmap='gray')
    plt.axis('off')
plt.show()

In the figure below, the first row shows the original digit images and the second row shows the preprocessed ones:

By applying this preprocessing, the recognition accuracy is improved, as the following accuracy curve shows:

It can be seen that the accuracy of the classifier with preprocessing comes close to 98%. Considering that we are only using a simple KNN model, this is already a very good result, but we can improve model performance further.
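
The code that produced this curve is not shown above; a minimal sketch of plugging deskew() into the earlier 50/50 pipeline (reusing raw_pixels() and get_accuracy(), with desc_train/desc_test as illustrative names) might look like this:

# Compute descriptors from deskewed images instead of the raw ones
deskewed_descriptors = []
for img in train_dataset:
    deskewed_descriptors.append(np.float32(raw_pixels(deskew(img))))
deskewed_descriptors = np.squeeze(deskewed_descriptors)
# Split, train, and evaluate exactly as before (sketch)
partition = int(0.5 * len(deskewed_descriptors))
desc_train, desc_test = np.split(deskewed_descriptors, [partition])
labels_train, labels_test = np.split(train_labels, [partition])
knn = cv2.ml.KNearest_create()
knn.train(desc_train, cv2.ml.ROW_SAMPLE, labels_train)
ret, result, neighbours, dist = knn.findNearest(desc_test, 5)
print("Accuracy: {:.2f}".format(get_accuracy(result, labels_test)))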

Using advanced descriptors as image features to improve the accuracy of the KNN algorithm

In the examples above, we have always used the raw pixel values as the feature descriptor. In machine learning, a common approach is to use more advanced descriptors. Next, the Histogram of Oriented Gradients (HOG) will be used as the image feature to improve the accuracy of the KNN algorithm. A feature descriptor is a representation of an image that simplifies it by extracting the useful information describing basic characteristics such as shape, color, or texture. Typically, a feature descriptor converts an image into a feature vector of length N; HOG is a popular feature descriptor in computer vision. Next, define the get_hog() function to obtain the HOG descriptor:

(train_dataset, train_labels), (test_dataset, test_labels) = keras.datasets.mnist.load_data()
SIZE_IMAGE = train_dataset.shape[1]
train_labels = np.array(train_labels, dtype=np.int32)

def get_hog():
    # Parameters: winSize, blockSize, blockStride, cellSize, nbins, derivAperture,
    # winSigma, histogramNormType, L2HysThreshold, gammaCorrection, nlevels, signedGradient
    hog = cv2.HOGDescriptor((SIZE_IMAGE, SIZE_IMAGE), (8, 8), (4, 4), (8, 8), 9, 1, -1, 0, 0.2, 1, 64, True)
    print("hog descriptor size: {}".format(hog.getDescriptorSize()))
    return hog

Then, the HOG features are used to train the KNN model:

hog = get_hog()

hog_descriptors = []
for img in train_dataset:
    hog_descriptors.append(hog.compute(deskew(img)))
hog_descriptors = np.squeeze(hog_descriptors)
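
The training and evaluation step itself is not shown in the original; under the same 50/50 split used earlier, a sketch might look like this:

# Split the HOG descriptors and train/evaluate the KNN model as before (sketch)
partition = int(0.5 * len(hog_descriptors))
hog_train, hog_test = np.split(hog_descriptors, [partition])
labels_train, labels_test = np.split(train_labels, [partition])
knn = cv2.ml.KNearest_create()
knn.train(hog_train, cv2.ml.ROW_SAMPLE, labels_train)
ret, result, neighbours, dist = knn.findNearest(hog_test, 5)
print("Accuracy: {:.2f}".format(get_accuracy(result, labels_test)))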

The accuracy of the model after training is shown in the figure below:

Through the improvement process described above, you can see that a good way to build a machine learning model is to start with a basic baseline model that solves the problem, and then improve it iteratively by adding better preprocessing, more advanced feature descriptors, or other machine learning techniques. Finally, if conditions permit, more data can be collected for training and testing the model.