There are two cases in which a target variable can only take a nominal value: true or false; Reptiles, fish, mammals, amphibians, plants, fungi. The second is when the target variable can accept an unlimited number of values, such as 0.100, 42.001, 1000.743,
We can get insights from data through machine learning, which is making meaningful things out of data
Beer and diapers
Machine learning is used in many fields, machine learning uses statistics, and what is statistics in machine learning, is a problem that we don’t have enough outcome capability, we can’t simulate problems and come up with solutions by ourselves, but we can model problems with statistics and come up with conclusions by computer simulation of problems, What is machine learning by using it to identify birds
Question: Why can’t you use one algorithm to solve all problems
How to choose an algorithm: If you want to predict a value, choose supervised learning, and if the target value is multiple, study classification and regression. If you don’t always have to choose unsupervised learning, then you need to study clusters. Do you need to make some numerical estimates of fitness for each group? If your answer is yes, then you should probably investigate a density estimation algorithm. But this is not set in stone.
You should take the time to understand your data, and the more you know about it, the better able you will be to build a successful application. What you need to know about the data: Are the features nominal or continuous? Are there missing values in the functionality? If the value is missing, why is the value missing? Are there outliers in the data? All of these features with data can help you narrow down the algorithm selection process. There is no single answer to what is the best algorithm or what will give you the best results. You have to try different algorithms and see how they perform. You can also use other machine learning techniques to improve the performance of machine learning algorithms.
Steps for developing machine learning applications
1. Collect data: To save some time and effort, you can use publicly available data. 2. Prepare input data: Once you have this data, you need to make sure it’s in a usable format. You may need to do some algorithm-specific formatting here. Some algorithms require features of a special format, some can treat target variables and features as strings, and some require integers. 3. Analyze the input data: This is looking at the data from the previous task. This could be as simple as looking at parsed data in a text editor to make sure steps 1 and 2 actually work and don’t have a bunch of empty values of 4. Training algorithms: This is where machine learning happens. This and the next step is where the “core” algorithm resides, depending on the algorithm. Good cleaning data can be obtained through the first few steps 6. Test the algorithm 7. Use it
Make sure you have the Numpy module installed for the Python used in this book
K proximity algorithm 9.21
K proximity algorithm is to calculate the distance between unknown data and all known data types. The first K points with the smallest distance are selected, and the type with the highest frequency is the type of unknown data. The following content is copied from: Jack-Cui: blog.csdn.net/c406495762
K-nearest neighbor (K-NN) is a basic classification and regression method proposed by Cover T and Hart P in 1967. Its working principle is: there is a sample data set, also known as the training sample set, and each data in the sample set has a label, that is, we know the corresponding relationship between each data in the sample set and its classification. After inputting the new data without labels, each feature of the new data is compared with the corresponding feature of the data in the sample set, and then the algorithm extracts the classification label of the most similar data (the nearest neighbor) of the sample. In general, we select only the first k most similar data in the sample data set, which is where k comes from in the K-nearest neighbor algorithm, which is usually an integer not greater than 20. Finally, the classification with the most frequent occurrence among the k most similar data was selected as the classification of the new data.
For a simple example, we can use the K-nearest neighbor algorithm to classify a movie as a romance or an action movie.Copy the code
Table 1.1 Number of fighting scenes, number of kissing scenes and type of film in each film
Table 1.1 is the existing data set, that is, the training sample set. This dataset has two characteristics, namely the number of fighting scenes and the number of kissing scenes. In addition, we also know each movie belongs to the genre, namely the classification tag. With the naked eye roughly observed, kissing scenes, is love. There's a lot of action. It's an action movie. Based on our years of experience, this classification is reasonable. If you give me a movie right now, you tell me how many fights there are and how many kisses there are. Don't tell me the type of movie, I can decide based on the information you give me whether the movie is a romance or an action movie. The k-neighbor algorithm can do this just like us, but we have more experience, whereas the K-neighbor algorithm relies on existing data. For example, if you tell me that the movie has two fights and 102 kisses, my experience will tell you that it's a love story, and k-Nearest Neighbor will tell you that it's a love story. You tell me about another movie that has 49 fights and 51 kisses. My "evil" experience would probably tell you that it could be a "love action movie" that is too beautiful to imagine. (If you don't know what a romantic action movie is? Please contact me in the comments, I need a pure friend like you. But the K-Neighbor algorithm doesn't tell you that, because it sees only romance and action movies in the movie genre. It will extract the classification labels of the data with the most similar characteristics in the sample set (the nearest neighbor), and the result will be either romance or action, but never "romantic action." Of course, this depends on factors such as the size of the dataset and the criteria for nearest neighbors.Copy the code