This is the first in a series of articles on the Python machine learning library scikit-learn. In this article we introduce some basic concepts and show how to use scikit-learn for linear regression analysis.

Introduction to scikit-learn

Scikit-learn is a Python machine learning software library. Its official website is here: Scikit-learn.

Scikit-learn has the following features:

  • It contains simple and efficient data mining and data analysis tools
  • It is available to all and can be used in a variety of environments
  • Based on NumPy, SciPy and Matplotlib
  • Released under the BSD license, so it can be used in both open source and commercial projects

Scikit-learn is commonly used together with libraries such as NumPy, pandas, and Matplotlib, which I have written about before:

  • Python machine learning library NumPy tutorial
  • Python Data Processing Library pandas
  • Python data Processing Library pandas
  • Python drawing library Matplotlib tutorial

Introduction to Machine Learning

Typically, machine learning involves a collection of sample data. The purpose of learning is to predict unknown data by analyzing known data.

If each sample contains more than one number, the data is said to have multiple attributes or characteristics.

Learning problems can be classified as follows:

  • Supervised learning: the data comes with additional attributes that we want to predict (a short code sketch of both cases follows this list). Supervised learning can be further divided into:
    • Classification: samples belong to two or more categories, and we want to learn from labeled data how to predict the category of unlabeled data. A classic example is handwritten digit recognition, where the goal is to assign each input vector to one of a finite number of discrete categories.
    • Regression: if the desired output consists of one or more continuous variables, the task is called regression. An example of a regression problem is predicting the length of an animal from its age and weight.
  • Unsupervised learning: the training data consists of a set of input vectors x without any corresponding target values. The goal may be to discover groups of similar examples in the data (clustering), to determine the distribution of the data within the input space (density estimation), or to project the data from a high-dimensional space down to two or three dimensions.
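
In scikit-learn, both kinds of supervised learning share the same estimator interface: construct a model, call fit() on training data, then call predict() on new data. Below is a minimal sketch of that pattern; the tiny data sets are made up purely for illustration:

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Classification: predict a discrete category from labeled examples.
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit([[0, 0], [1, 1], [2, 2], [9, 9]], ["small", "small", "small", "large"])
print(clf.predict([[8, 8]]))   # -> ['large']

# Regression: predict a continuous value.
reg = LinearRegression()
reg.fit([[1], [2], [3], [4]], [2.0, 4.0, 6.0, 8.0])
print(reg.predict([[5]]))      # -> approximately [10.]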

Experimental environment

The code in this article was tested in the following environment:

  • Apple OS X 10.13
  • Python 3.6.3
  • scikit-learn 0.19.1
  • Matplotlib 2.1.1
  • Numpy 1.13.3

The source code and test data for this article can be obtained here: Sklearn_tutorial

Overview of the experiment

In this article, we will build a model to predict housing prices.

The data we use is a CSV file containing housing price information. The file can be obtained here: housing.csv.

Note: this data file is also included in the sklearn_tutorial project source code.

We can read this file with the following code:

import pandas as pd
input_data = pd.read_csv("./housing.csv")

Note: this article assumes a basic understanding of NumPy, pandas, and Matplotlib.

Understand the data

Once we have obtained the data, we usually start by getting a general sense of it.

For example:

print("Describe Data:")
print(input_data.describe())

print("\nFirst 10 rows:")
print(input_data.head(10))

describe() is an API provided by pandas that outputs an overall statistical summary of the data, like the following:

Describe Data:
          longitude      latitude  housing_median_age   total_rooms
count  20640.000000  20640.000000        20640.000000  20640.000000
mean    -119.569704     35.631861           28.639486   2635.763081
std        2.003532      2.135952           12.585558   2181.615252
min     -124.350000     32.540000            1.000000      2.000000
25%     -121.800000     33.930000           18.000000   1447.750000
50%     -118.490000     34.260000           29.000000   2127.000000
75%     -118.010000     37.710000           37.000000   3148.000000
max     -114.310000     41.950000           52.000000  39320.000000

       total_bedrooms    population    households  median_income
count    20433.000000  20640.000000  20640.000000   20640.000000
mean       537.870553   1425.476744    499.539680       3.870671
std        421.385070   1132.462122    382.329753       1.899822
min          1.000000      3.000000      1.000000       0.499900
25%        296.000000    787.000000    280.000000       2.563400
50%        435.000000   1166.000000    409.000000       3.534800
75%        647.000000   1725.000000    605.000000       4.743250
max       6445.000000  35682.000000   6082.000000      15.000100

       median_house_value
count        20640.000000
mean        206855.816909
std         115395.615874
min          14999.000000
25%         119600.000000
50%         179700.000000
75%         264725.000000
max         500001.000000

input_data.head(10) prints the first 10 rows of the data.

First 10 rows:
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms
0    -122.23     37.88                41.0        880.0           129.0
1    -122.22     37.86                21.0       7099.0          1106.0
2    -122.24     37.85                52.0       1467.0           190.0
3    -122.25     37.85                52.0       1274.0           235.0
4    -122.25     37.85                52.0       1627.0           280.0
5    -122.25     37.85                52.0        919.0           213.0
6    -122.25     37.84                52.0       2535.0           489.0
7    -122.25     37.84                52.0       3104.0           687.0
8    -122.26     37.84                42.0       2555.0           665.0
9    -122.25     37.84                52.0       3549.0           707.0

   population  households  median_income  median_house_value ocean_proximity
0       322.0       126.0         8.3252            452600.0        NEAR BAY
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY
2       496.0       177.0         7.2574            352100.0        NEAR BAY
3       558.0       219.0         5.6431            341300.0        NEAR BAY
4       565.0       259.0         3.8462            342200.0        NEAR BAY
5       413.0       193.0         4.0368            269700.0        NEAR BAY
6      1094.0       514.0         3.6591            299200.0        NEAR BAY
7      1157.0       647.0         3.1200            241400.0        NEAR BAY
8      1206.0       595.0         2.0804            226700.0        NEAR BAY
9      1551.0       714.0         3.6912            261100.0        NEAR BAY

From the output above, we learn a lot about the data:

  • In total, the data set contains 20,640 rows.
  • The data has 10 columns: longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, and ocean_proximity.
  • Some rows are missing a value for the total_bedrooms column, so that column has only 20,433 non-null values (the snippet below shows how to verify this).
  • The data in ocean_proximity is a string, while the other columns are numeric.
  • median_house_value represents the value of the house, and it is what we want to predict.
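
These points can also be checked directly in code instead of being read off the describe() output. A small sketch using only pandas calls on the input_data frame loaded above:

input_data.info()                                    # column types and non-null counts
print(input_data["total_bedrooms"].isnull().sum())   # missing values: 20640 - 20433 = 207
print(input_data["ocean_proximity"].value_counts())  # the distinct string categories and their counts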

We can also use Matplotlib to visualize the data.

For example:

import matplotlib.pyplot as plt

input_data.hist(bins=100, figsize=(20, 12))
plt.show()

This generates a histogram for each column, as shown below. From this figure we can see how frequently different values occur.

Or this:

input_data.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
plt.show()

This generates a scatter plot based on longitude and latitude, and we set the transparency to 0.1 so that the density of the data at each location is visible.

Preprocessing

In reality, the data we obtain is rarely perfect; it usually contains some invalid or missing values. That is exactly the case here.

So before analyzing the data, the first step is to preprocess it.

Fit and transform

Scikit-learn comes with a range of transformers. They are used to clean, reduce, expand, and generate feature representations. These classes provide three methods (a short sketch follows this list):

  • fit: learns model parameters from the training set (for example, the mean and standard deviation used for normalization).
  • transform: applies the learned transformation to data.
  • fit_transform: combines the previous two methods; it is more convenient and often more efficient.
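
Why the distinction matters: parameters should be learned (fit) from the training data only and then applied (transform) to any other data, such as a test set. A minimal sketch with StandardScaler and made-up numbers:

import numpy as np
from sklearn.preprocessing import StandardScaler

train_part = np.array([[1.0], [2.0], [3.0], [4.0]])   # made-up training data
test_part = np.array([[2.5], [10.0]])                 # made-up test data

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_part)  # fit() learns the mean and std, transform() applies them
test_scaled = scaler.transform(test_part)        # reuse the parameters learned from the training data
print(scaler.mean_, scaler.scale_)               # the learned parameters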

The sklearn.preprocessing module contains a series of tools for data preprocessing.

Let's use these APIs to preprocess the housing price data above.

Handling strings

Data containing strings is inconvenient for our analysis, so we first need to convert string data to numeric data.

We can do this using LabelEncoder:

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
data["ocean_proximity"] = encoder.fit_transform(data["ocean_proximity"])

This converts the column from strings to integers, so the column now contains values like [0, 1, 2, 3, ...].
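
If you want to know which integer corresponds to which category, the fitted encoder keeps the original labels in its classes_ attribute (a label's code is its index in that array). A small sketch, reusing the encoder fitted above:

print(encoder.classes_)                      # the original string labels, sorted; their codes are 0, 1, 2, ...
print(encoder.inverse_transform([0, 1, 2]))  # map integer codes back to the original strings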

We can check the result of the transformation with describe():

       ocean_proximity
count     20640.000000
mean          1.165843
std           1.420662
min           0.000000
25%           0.000000
50%           1.000000
75%           1.000000
max           4.000000

Handling invalid values

Earlier we saw that the total_bedrooms column contains missing data: it has only 20,433 valid values.

For rows that contain invalid data, we can either discard them or fill in the missing cells with plausible values.

sklearn.preprocessing.Imputer is used to fill in missing numeric values. It offers the following three strategies:

  • mean: fill with the mean of the valid values
  • median: fill with the median of the valid values
  • most_frequent: fill with the most frequent value among the valid values

The mean and most_frequent strategies are easy to understand, but some readers may not be familiar with the median.

The median is a statistical term: it is the value in a sample, population, or probability distribution that divides the set of values into an upper and a lower half.

Here is an example of code that uses numpy to get the mean and median:

import numpy as np

data = [1, 2, 3, 4, 5, 6, 7, 8, 95, 96, 97, 98, 100]

print("mean: {}".format(np.mean(data)))
print("median: {}".format(np.median(data)))

As can be seen from the output, the mean of this data is 40.153, and the median is 7.

With the following code we can fill the cells that originally contained invalid data with the column median.

from sklearn.preprocessing import Imputer

imputer = Imputer(strategy="median")
X = imputer.fit_transform(input_data)
input_data = pd.DataFrame(X, columns=input_data.columns)

After processing, we can look at the processed data to confirm:

       total_bedrooms
count    20640.000000
mean       536.838857
std        419.391878
min          1.000000
25%        297.000000
50%        435.000000
75%        643.250000
max       6445.000000

Feature scaling

The input values can span very different ranges. For example, the maximum value of median_house_value is 500001, while housing_median_age fits in just two digits.

Some columns may contain very large values, which can cause numerical problems during computation, and features on very different scales can also hurt many learning algorithms. In such cases we usually scale the data. For example, data in the range [100, 10000] can be scaled to [1, 100]; as long as the proportions are preserved, this does not affect the analysis results.
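
Min-max scaling is just a linear mapping: x_scaled = (x - min) / (max - min) * (new_max - new_min) + new_min. A quick sanity check of the [100, 10000] -> [1, 100] example with plain NumPy (the numbers are made up):

import numpy as np

x = np.array([100.0, 5050.0, 10000.0])
x_scaled = (x - x.min()) / (x.max() - x.min()) * (100 - 1) + 1
print(x_scaled)   # [  1.   50.5 100. ]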

MinMaxScaler scales data into a specified range. For example:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 100), copy=False)
scaler.fit_transform(input_data)

All data is scaled within the range of [0, 100].

We can use Matplotlib to compare the data before and after scaling:

# origin: a copy of the data saved before scaling; scaled: the data after MinMaxScaler
plt.subplot(2, 1, 1)
plt.scatter(x=origin["longitude"], y=origin["latitude"],
            c=origin["median_house_value"], cmap="viridis", alpha=0.1)
plt.subplot(2, 1, 2)
plt.scatter(x=scaled["longitude"], y=scaled["latitude"],
            c=origin["median_house_value"], cmap="viridis", alpha=0.1)
plt.show()

As can be seen from this figure, we have only adjusted the scale of the values; the distribution of the data is unchanged.

Data partitioning

In order to get the prediction model, we need to divide the obtained data into two parts: training set and test set.

The former is used to train the model, while the latter is used to test and validate it. Usually we use most of the data for training and a small portion for testing.

When partitioning the data, we should be as "random" as possible, so that both the training set and the test set reflect the real data distribution as well as possible.

permutation

permutation is a function provided by NumPy. It returns a randomly shuffled sequence of the integers in a specified range. Here is a code example:

import numpy as np

data = np.arange(0, 100)
np.random.seed(59)
print(data[np.random.permutation(100)[90:]])

Here we shuffle the 100 integers in the range [0, 100). Before shuffling, the random seed is set with np.random.seed(59) so that the generated sequence is reproducible.

Finally, we printed the last 10 of the scrambled data:

[64 67  0 57 53 79 23 77 44 49]

Using these shuffled integers as indices, we can draw a training set and a test set from an existing data set. For example, the first 90 indices ([:90]) select the training set and the remaining indices ([90:]) select the test set.
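
Putting this together, a hand-rolled 90/10 split based on permutation might look like the following sketch (using 100 made-up rows):

import numpy as np

data = np.arange(0, 100)               # stand-in for a real data set
np.random.seed(59)
shuffled = np.random.permutation(len(data))
train_set = data[shuffled[:90]]        # first 90 shuffled indices -> training set
test_set = data[shuffled[90:]]         # remaining 10 indices -> test set
print(len(train_set), len(test_set))   # 90 10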

train_test_split

scikit-learn provides similar functionality in the train_test_split function. A code example is as follows:

import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(0, 100)
train_set, test_set = train_test_split(data, test_size=0.1, random_state=59)
print("test_set: \n {} \n".format(test_set))

Here test_size specifies the proportion of data used for the test set, and random_state specifies the random seed.

The function returns train_set and test_set; the former contains 90% of the data and the latter 10%. We print the contents of the test set:

test_set: 
 [38 46 24 87 30 85 16 96 18 99] 

For the housing price data, we can split it into a training set and a test set as follows:

train_set, test_set = train_test_split(input_data, test_size=0.1, random_state=59)
show_data_summary(test_set)   # helper from the project source code; it prints the summary shown below

The last line prints out the data for the test set, so we can see if the results match our expectations:

Describe Data:
         longitude     latitude  housing_median_age  total_rooms
count  2064.000000  2064.000000         2064.000000  2064.000000
mean     48.415061    32.276916           53.952918     6.692606
std      19.640685    22.608316           24.649983     5.817431
min       0.896414     0.000000            1.960784     0.022890
25%      27.863546    14.665250           33.333333     3.638919
50%      58.764940    17.959617           54.901961     5.348695
75%      63.346614    54.835282           70.588235     7.932118
max      96.613546    98.618491          100.000000    82.977262

       total_bedrooms   population   households  median_income
count     2064.000000  2064.000000  2064.000000    2064.000000
mean         8.309936     4.013810     8.192199      22.725069
std          6.794665     3.581933     6.626166      12.800111
min          0.062073     0.028028     0.049334       0.000000
25%          4.527467     2.149724     4.534616      13.727052
50%          6.719429     3.188150     6.610755      20.242479
75%          9.970515     4.904846     9.965466      28.686501
max        100.000000    80.055495   100.000000     100.000000

       median_house_value  ocean_proximity
count         2064.000000      2064.000000
mean            38.343739        28.234012
std             23.628904        34.874180
min              1.546592         0.000000
25%             20.350638         0.000000
50%             32.412444        25.000000
75%             49.432992        25.000000
max            100.000000       100.000000

First 10 rows:
       longitude   latitude  housing_median_age  total_rooms  total_bedrooms
9878   24.800797  43.464400           70.588235     0.854570        1.675978
4356   59.661355  16.471838           72.549020     5.483494        9.016139
1434   23.107570  57.810840           84.313725     3.184292        3.895096
6422   63.247012  16.896918           45.098039     6.566967        8.054004
1624   22.011952  56.323061           45.098039     5.414823        5.307263
1423   22.908367  58.023379           29.411765     2.754464        3.351955
4853   60.258964  15.834219           70.588235     7.068010       11.871508
16893  19.621514  53.560043          100.000000     4.328806        3.491620
4232   60.258964  16.578108           70.588235    13.487461       30.710739
19083  18.625498  61.424017           80.392157     5.351239        8.562

       population  households  median_income  median_house_value  ocean_proximity
9878     0.639031    1.628022      14.009462           19.215562              0.0
4356     2.965330    9.472126      16.995628           70.164865              0.0
1434     1.387371    3.552047      20.366616           27.608340             75.0
6422     5.229967    8.255221      19.201114           31.340283             25.0
1624     2.441212    6.117415      35.413305           70.226721             75.0
1423     0.989378    3.798717      12.732928           12.371289             75.0
4853     7.239553   11.971715       9.355043           35.567070              0.0
16893    1.872250    3.979609      54.968207          100.000000            100.0
4232    10.908377   30.800855      10.808816           55.319566              0.0
19083    3.189551    7.301431      18.020441           27.690814              0.0

Linear model

To help us understand linear models, we can start with a data set that contains only one attribute value.

Now suppose we want to predict the price of a home in a given area. All other things being equal, we assume that the price depends only on the year.

Suppose the following is the existing housing price data:

  • In 1999, 3800
  • In 2000, 3900
  • In 2001, 4000
  • In 2002, 4200
  • In 2003, 4500
  • In 2004, 5500
  • In 2005, 6500
  • In 2006, 7000
  • In 2007, 8000
  • In 2008, 8200
  • In 2009, 10000
  • In 2010, 14000
  • In 2011, 13850
  • In 2012, 13000
  • In 2013, 16000
  • In 2014, 18500

We can display this data using matplotlib:

# house_price.py
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Year": [1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006,
             2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014],
    "Price": [3800, 3900, 4000, 4200, 4500, 5500, 6500, 7000,
              8000, 8200, 10000, 14000, 13850, 13000, 16000, 18500]})
data.plot(kind="scatter", x="Year", y="Price", c="b", s=100)
plt.show()

This code results in something like this:

If we want to predict house prices for 2015 and beyond, what we need to do is find a line that all the existing points lie as close to as possible; based on that line, we can then predict future prices. For example, the following three lines could all be candidates.

We know that the equation of a line is:

f(x) = wx + b

Once we determine w and b, the line is determined.
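
As a small sketch of this idea, we can already let scikit-learn fit such a line to the toy year/price data above and extrapolate to 2015 (this is the same LinearRegression estimator we apply to the real housing data later); the exact numbers depend on the fit:

import numpy as np
from sklearn.linear_model import LinearRegression

years = np.arange(1999, 2015).reshape(-1, 1)   # 1999..2014 as a column vector
prices = [3800, 3900, 4000, 4200, 4500, 5500, 6500, 7000,
          8000, 8200, 10000, 14000, 13850, 13000, 16000, 18500]

line = LinearRegression()
line.fit(years, prices)
print(line.coef_, line.intercept_)   # the learned w and b
print(line.predict([[2015]]))        # extrapolated price for 2015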

In reality, many factors determine house prices, not just the year. In a real project there may even be thousands of attributes, so we need to extend the model above to multiple attributes.

Suppose a sample is described by n attributes, x = (x1; x2; ...; xn), where xi is the value of x on the i-th attribute.

A linear model tries to learn a function that makes predictions through a linear combination of the attributes, namely:

f(x) = w1*x1 + w2*x2 + ... + wn*xn + b

Writing the weights as a vector w = (w1; w2; ...; wn), the function above can be written as:

f(x) = w^T x + b

The purpose of learning is to determine w and b; once they are known, the model is determined.
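
In NumPy terms the vector form is simply a dot product plus a scalar; for example, with made-up weights and attribute values:

import numpy as np

w = np.array([0.2, 0.5, 1.5])    # made-up weights
x = np.array([1.0, 2.0, 3.0])    # made-up attribute values of one sample
b = 0.7
print(np.dot(w, x) + b)          # f(x) = w^T x + b = 6.4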

Going back to the concrete example, each piece of data contains ten values, where median_house_value is the result we want to predict. The other nine are characteristics of the data. Our goal is to find a model that can predict the value of median_house_value from the other nine attributes.

It is not easy to write algorithms to determine the values of w and b. The good news is that SciKit-Learn already includes algorithms to do this for us, so we can just use them.

First, we define a function to separate the target value that needs to be predicted from the other nine attributes:

def split_house_value(data):
    value = data["median_house_value"].copy()
    return data.drop(["median_house_value"], axis=1), value

We then use this function to split both the training set and the test set into features and target values:

train_set, test_set = train_test_split(input_data, test_size=0.1, random_state=59)
train_data, train_value = split_house_value(train_set)
test_data, test_value = split_house_value(test_set)

Next we can train the model through the training set:

from sklearn.linear_model import LinearRegression

linear_reg = LinearRegression()
linear_reg.fit(train_data, train_value)

Yes, training is done in just three lines of code. With the trained model we can then make predictions on new data, which takes only one line of code:

predict_value = linear_reg.predict(test_data)

Finally, we can compare the predicted results with the actual results to see how accurate our predictions are:

def show_predict_result(test_data, test_value, predict_value):
    plt.subplot(221)
    plt.scatter(x=test_data["longitude"], y=test_data["latitude"],
                s=test_value, c="dodgerblue", alpha=0.5)
    plt.subplot(222)
    plt.hist(test_value, color="dodgerblue")
    plt.subplot(223)
    plt.scatter(x=test_data["longitude"], y=test_data["latitude"],
                s=predict_value, c="lightseagreen", alpha=0.5)
    plt.subplot(224)
    plt.hist(predict_value, color="lightseagreen")
    plt.show()

In this function we draw a scatter plot and a histogram for both the actual values and the predicted values. The actual results are blue and the predictions are green. The size of each point in the scatter plots represents the magnitude of the value, while the histograms show the frequency of the values.

The result is shown below:

Performance measurement

We have just compared the predicted results with the actual results graphically. This can only serve as an aid during development, because the comparison has to be "eyeballed" and the resulting judgment ("looks OK", "fairly poor") is highly subjective. A real production or experimental environment requires more precise measurements.

Mean squared error (MSE)

The most commonly used performance measure for regression tasks is the mean squared error (MSE), which is calculated as follows:

MSE = (1/m) * sum((y_pred_i - y_i)^2), for i = 1..m

That is, we take the squared difference between each predicted value and the corresponding actual value, and then average all of the squared differences.
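
Spelled out in NumPy, the same computation looks like this (the numbers are made up):

import numpy as np

actual = np.array([3.0, 5.0, 2.5])
predicted = np.array([2.5, 5.0, 4.0])
mse = np.mean((predicted - actual) ** 2)
print(mse)   # (0.25 + 0.0 + 2.25) / 3 = 0.8333...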

With scikit-learn we don’t need to implement the algorithm ourselves, we can simply call the mean_squared_error function:

from sklearn.metrics import mean_squared_error

mse = mean_squared_error(test_value, predict_value)
print("Diff: {}".format(np.sqrt(mse)))

The first argument to this function is the actual values and the second is the predicted values. Taking the square root of the returned value (i.e. the RMSE) gives the following result:

Diff: 14.127834130789983

This gives us a precise measure of the model: on data scaled to the interval [0, 100], the prediction error is about 14.

Cross validation

Scikit-learn also provides cross-validation algorithms.

Cross validation randomly divides the training set into n pieces (called folds), then repeatedly uses one fold for evaluation and the remaining n-1 folds for training, computing a score each time. The algorithm therefore returns n results, each of which is a numeric score.

A code example is as follows:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(linear_reg, train_data, train_value, cv=10)
print("cross_val_score: {}".format(scores))

The results are as follows:

cross_val_score: [0.65629302 0.67609342 0.60996658 0.64629956 0.64577402 0.64565816
 0.62451489 0.57967974 0.62916854 0.60401622]

If we have multiple models, we can easily compare them by comparing their scores, for example as in the sketch below.
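
For example, we could score a second model the same way and compare the mean scores. A sketch using a decision tree as the second model (the decision tree is only an illustration, not something used elsewhere in this article; it reuses train_data, train_value, and scores from above):

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

tree_reg = DecisionTreeRegressor(random_state=59)
tree_scores = cross_val_score(tree_reg, train_data, train_value, cv=10)

print("LinearRegression mean score: {:.4f}".format(scores.mean()))
print("DecisionTree mean score:     {:.4f}".format(tree_scores.mean()))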

Note: This function is explained here:
cross_val_score

Conclusion

That concludes our first article on linear regression.

This article does not cover a great deal of material, but it may still take some time to digest.

In the next article, we'll cover using scikit-learn for classification.

Stay tuned.

Resources and recommended readings

  • Andrew Ng CS229 Lecture notes
  • Machine Learning by Zhou Zhihua
  • Hands-on Machine Learning with Scikit-Learn and TensorFlow
  • Documentation of scikit-learn