Algorithm description
The k-nearest neighbor (kNN) algorithm is a classification method based on measuring distances between samples in feature space. The idea is that a sample belongs to a category if most of the k most similar (i.e., nearest) samples in the feature space belong to that category. In kNN, all of the selected neighbors are objects that have already been correctly classified; the category of the sample to be classified is decided solely by the categories of its k (usually no more than 20) nearest neighbors.
Algorithm process
The general process of the kNN algorithm is as follows (a minimal code sketch follows the list):
- Collect the data
- Compute the distance between the sample to be classified and each training sample (usually the Euclidean distance)
- Sort the computed distances
- Select the k training samples with the smallest distances
- Count the frequency of each category among those k samples
- Return the category with the highest frequency as the prediction
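A minimal sketch of this procedure in Python, assuming NumPy arrays of numeric features; the function name `knn_classify` and the toy data are illustrative only, not from the original post:

```python
import numpy as np
from collections import Counter

def knn_classify(x, train_X, train_y, k=5):
    """Classify one sample x given training features train_X and labels train_y."""
    # Euclidean distance from x to every training sample
    distances = np.sqrt(((train_X - x) ** 2).sum(axis=1))
    # Indices of the k training samples with the smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those k samples
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Tiny usage example
X = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
y = ['A', 'A', 'B', 'B']
print(knn_classify(np.array([0.1, 0.2]), X, y, k=3))  # -> 'B'
```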
Algorithm characteristics
Advantages:
- The principle is simple and easy to understand; no advanced mathematical theory is required;
- It makes no assumptions about the data distribution, so it is widely applicable;
- It is relatively insensitive to outliers.
Disadvantages:
- It is computationally expensive and requires a lot of storage, especially when the amount of data is large.
- It is sensitive to class imbalance: when one category has far more samples than the others, the majority vote is biased toward it, while in practice data for some categories is hard to obtain and genuinely scarce.
The choice of k
The value of k directly affects the prediction quality. There is no fixed formula for choosing the right k; it is chosen empirically. For example, cross-validation can be used to find the best value of k.
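A possible sketch of choosing k by 5-fold cross-validation, assuming scikit-learn is available; the helper `choose_k` and the candidate range are illustrative:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def choose_k(X, y, candidates=range(1, 21)):
    """Return the candidate k with the best mean 5-fold cross-validation accuracy."""
    scores = {}
    for k in candidates:
        model = KNeighborsClassifier(n_neighbors=k)
        scores[k] = cross_val_score(model, X, y, cv=5).mean()
    return max(scores, key=scores.get)
```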
Distance calculation
There are many ways to measure distance, including the Manhattan distance, Mahalanobis distance, Chebyshev distance, and so on. In general the classical Euclidean distance is used: it is cheap to compute, easy to interpret, and accurate enough.
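For reference, a small sketch of the Euclidean and Manhattan distances, assuming NumPy arrays `x` and `y` of equal length (the function names are illustrative):

```python
import numpy as np

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # Sum of absolute coordinate differences
    return np.sum(np.abs(x - y))
```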
Possible improvements
(1) Weighting distance
After the k nearest neighbors are found, a direct majority vote may not be accurate, because it treats every neighbor as having the same influence on the sample to be classified. In general, however, the sample should be more “like” its nearest neighbors and less “like” the farther ones. This calls for further use of the distance values: give nearby points more weight and reduce the influence of distant points on the decision. Two common weighting functions are listed below (a sketch follows the list):
- Inverse function
- Gaussian function
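A sketch of these two ways to turn a distance into a vote weight; the constants `const` and `sigma` are illustrative defaults, not values from the original post:

```python
import numpy as np

def inverse_weight(dist, const=1.0):
    # Closer points get larger weights; const avoids division by zero when dist is 0
    return 1.0 / (dist + const)

def gaussian_weight(dist, sigma=1.0):
    # Weight decays smoothly with distance; sigma controls how quickly it falls off
    return np.exp(-dist ** 2 / (2 * sigma ** 2))
```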
(2) KD tree
A KD tree partitions the training data along the feature dimensions so that nearest neighbors can be found without computing the distance to every training sample, which addresses the computational cost mentioned above.
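A minimal sketch of KD-tree-based neighbor search, assuming SciPy is available (the data here is random and purely illustrative):

```python
import numpy as np
from scipy.spatial import KDTree

X = np.random.rand(1000, 2)               # training features
tree = KDTree(X)                          # build the tree once
dist, idx = tree.query([0.5, 0.5], k=5)   # distances and indices of the 5 nearest points
```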