Preface

There are plenty of tutorials for getting started with Kaggle, but many have drawbacks. Some are old and the code no longer runs on current library versions (some are even Python 2), which is rough on beginners. Some explain the principles without implementing them; others give implementations without explaining the principles. In short, good introductory tutorials are scarce. At the end of this article I attach a link to another tutorial that I think is well written; interested readers can consult it, though reading this post alone is fine. I have tried to cover the same content here in updated form, because some of that tutorial's code can no longer be run directly due to version changes.

As for what Kaggle is, blah, blah — I will skip the usual introduction.

At first, Kaggle felt intimidating; I was afraid I could not handle it, so I kept putting it off. That was unnecessary (it is difficult, but not impossible).

First up is the Digit Recognizer competition: www.kaggle.com/c/digit-rec…

There is not much to optimize in KNN; apart from changing the value of K, there seems to be little else to adjust. Readers can first look at the reference link attached at the end of this article, and compare it with my code. At present my KNN achieves only 96.3% accuracy, which is not ideal. The source code is attached below:

import pandas as pd
from numpy import mat,tile,array
import numpy as np
import time
import csv,operator

def load_data(filename):
    return pd.read_csv(filename, sep=',', header='infer', names=None, index_col=None, usecols=None)

# binarization: map nonzero pixel values (1-255) to 1, keep 0 as 0
def to_binary(df):
    dv = df.values
    width = len(dv[0])
    for i in range(len(dv)):
        for j in range(width):
            if dv[i][j] != 0:
                dv[i][j] = 1
    return dv
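As an aside, the double loop above is slow in pure Python on a 42,000 × 784 pixel array; the same binarization can be done in a single vectorized NumPy step. A minimal standalone sketch (the name `to_binary_fast` and the demo values are mine):

```python
import numpy as np
import pandas as pd

def to_binary_fast(df):
    # nonzero pixels become 1, zeros stay 0 -- same result as the loops above
    return (df.values != 0).astype(np.uint8)

demo = pd.DataFrame({"p0": [0, 128, 255], "p1": [3, 0, 9]})
print(to_binary_fast(demo))
```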

# inX is the vector to be classified
# dataSet is the training sample set, one row per sample; labels is the label vector for dataSet
# k is the number of nearest neighbors to use
def classify(inX, dataSet, labels, k):
    # mat method create a matrix
    inX=mat(inX) # test_data, just one sample
    dataSet=mat(dataSet) # train_data
    labels=mat(labels) # the labels of the train_data
    dataSetSize = dataSet.shape[0]
    # tile inX into dataSetSize identical rows so it can be subtracted from dataSet elementwise
    diffMat = tile(inX, (dataSetSize,1)) - dataSet
    sqDiffMat = array(diffMat)**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort() #array.argsort() to get the sort number of each element
    classCount={}                           #sortedDistIndicies[0] indicates the index of the first number in the sorted array
    for i in range(k):
        voteIlabel = labels[0,sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
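To see what the `tile`-based distance computation inside `classify` does, here is a standalone sketch with two toy training samples (the toy values are mine, not from the competition data):

```python
import numpy as np
from numpy import mat, tile

inX = mat([[1, 2]])              # one test sample
dataSet = mat([[1, 2], [4, 6]])  # two training samples
# tile repeats inX so it lines up row-for-row with dataSet
diffMat = tile(inX, (dataSet.shape[0], 1)) - dataSet
distances = (np.array(diffMat) ** 2).sum(axis=1) ** 0.5
print(distances)  # Euclidean distance from inX to each training row
```

The nearest row has distance 0 because it equals `inX` exactly; `argsort` in `classify` then picks such rows first.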

def saveResult(result):
    with open('result.csv', 'w', newline='') as myFile:
        myWriter = csv.writer(myFile)
        myWriter.writerow(['ImageId', 'Label'])  # header expected by the competition
        for i in range(len(result)):
            myWriter.writerow([i + 1, str(result[i])])
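The csv module works fine, but the submission file can also be written with pandas; a sketch with the same file layout (the name `save_result_pd` and the default path are mine):

```python
import pandas as pd

def save_result_pd(result, path="result.csv"):
    # Digit Recognizer submissions use the columns ImageId,Label
    sub = pd.DataFrame({"ImageId": range(1, len(result) + 1),
                        "Label": result})
    sub.to_csv(path, index=False)
```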

if __name__ == '__main__':
    print("loading train data...")
    df = load_data("train.csv")  # load the training data
    label = df['label'].values
    df = df.drop(['label'], axis=1)  # remove the label column
    print("loading test data...")
    dt = load_data("test.csv")  # load the test set; it has no label column, so its values can be used directly
    print("normalizing...")
    start = time.time()
    db = to_binary(df)  # binarization
    dt = to_binary(dt) # Binarize the test set as well
    print("training...")
    # print(len(dt))
    dt_len = len(dt)
    resultList = []  # preallocating with np.empty(dt_len) would be more efficient than append
    for i in range(dt_len):
        classifierResult = classify(dt[i], db, label, 5)
        # resultList[i] = classifierResult
        resultList.append(classifierResult)
        print(str(i+1) +","+str(classifierResult))
        # print("the classifier came back with: %d" % (classifierResult))
    saveResult(resultList)
    end = time.time()
    print("using time:",end-start)
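Since changing K is essentially the only knob mentioned above, it can be chosen on a held-out split before committing to the full test run. A minimal standalone sketch on synthetic 2-D data (the toy clusters and the `knn_predict` helper are mine, not part of the script above):

```python
import numpy as np

def knn_predict(x, data, labels, k):
    # squared Euclidean distance from x to every training row
    d = ((data - x) ** 2).sum(axis=1)
    nearest = labels[np.argsort(d)[:k]]
    # majority vote among the k nearest labels
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[np.argmax(counts)]

# two well-separated synthetic clusters stand in for the digit data
rng = np.random.default_rng(0)
train = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
train_labels = np.array([0] * 20 + [1] * 20)
val = np.vstack([rng.normal(0, 0.5, (5, 2)), rng.normal(5, 0.5, (5, 2))])
val_labels = np.array([0] * 5 + [1] * 5)

accs = {}
for k in (1, 3, 5, 7):
    preds = [knn_predict(x, train, train_labels, k) for x in val]
    accs[k] = float(np.mean(np.array(preds) == val_labels))
    print(k, accs[k])
```

On the real 42,000-row training set the same idea applies: hold out a slice of the labeled data, score each candidate K on it, and keep the best.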

Reference link: blog.csdn.net/l297969586/…