Machine learning 007: Using a random forest to build a demand prediction model for shared bikes
(Python libraries and versions used in this article: Python 3.5, Numpy 1.14, Scikit-learn 0.19, matplotlib 2.2)
Bike-sharing, a convenient means of transportation that has taken off in recent years, has become a staple for ordinary people commuting to and from work, meeting friends, and going on dates. This project uses random forest regression to build a demand prediction model for shared bikes, so that we can estimate the demand for shared bikes under various conditions.
1. Prepare data sets
The data set used in this study is a public data set from the University of California, Irvine (UCI): https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset. The website describes the data set in detail, and the data can be downloaded directly from it. The bike-sharing data set contains two files: one holds usage data by day, the other usage data by hour.
As an aside, this bike-sharing data set was collected between 2011 and 2012, and the bikes are docked (fixed-station) bikes, similar to China's Yongan Xing, not the yellow bikes, blue bikes and Mobikes that we see on the streets today.
Once downloaded, unpack the data set into D:\PyProjects\DataSet\SharingBikes. There are a total of 17,389 samples in this data set, and each sample has 16 columns. The first two columns are the sample serial number and the date, which can be ignored. The last three columns are outputs of different types: the last column is the sum of the 14th and 15th columns, so the 14th and 15th columns are not used in this model.
The 16 columns in this data set are, in order: instant, dteday, season, yr, mnth, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed, casual, registered, cnt.
The following is the main code for analyzing the data set. Here, I did not study the relationship between the feature columns of the data set in depth.
# Analyze the data set first
import pandas as pd
dataset_path = r'D:\PyProjects\DataSet\SharingBikes/day.csv'  # analyze only the daily data first
# Load the data set; the serial-number column becomes the index
raw_df = pd.read_csv(dataset_path, index_col=0)
# print(raw_df.shape)  # (731, 15)
# print(raw_df.head())
# print(raw_df.columns)
# Drop the dteday, casual and registered columns
df = raw_df.drop(['dteday', 'casual', 'registered'], axis=1)
# print(df.shape)  # (731, 12)
# print(df.head())
print(df.info())  # check dtypes and missing values; any object column would need converting
# print(df.columns)
# Separate the data set
dataset = df.values  # convert the pandas DataFrame to an np.ndarray (as_matrix() is removed in newer pandas)
# Split the entire data set into a train set and a test set
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(dataset, test_size=0.1, random_state=37)
# print(train_set.shape)  # (657, 12)
# print(test_set.shape)  # (74, 12)
# print(dataset[:3])
-------------------- Output --------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 731 entries, 1 to 731
Data columns (total 12 columns):
season        731 non-null int64
yr            731 non-null int64
mnth          731 non-null int64
holiday       731 non-null int64
weekday       731 non-null int64
workingday    731 non-null int64
weathersit    731 non-null int64
temp          731 non-null float64
atemp         731 non-null float64
hum           731 non-null float64
windspeed     731 non-null float64
cnt           731 non-null int64
dtypes: float64(4), int64(8)
memory usage: 74.2 KB
None
------------------------------------------------
#################### Summary ####################
1. The printed results show that there are no missing values in this data set and that the data type within each column is consistent, so no additional cleaning is needed.
2. Seven feature columns (season, yr, and so on) are of type int64. These are categorical codes, which means they should be converted to one-hot encoding. For example, in season, 1 = spring, 2 = summer, 3 = autumn and 4 = winter should be converted into the sparse matrix produced by one-hot encoding.
#################################################
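The one-hot conversion mentioned in the summary could be sketched as follows. This is a minimal illustration on a made-up toy table, not the article's own code: the values in `toy_df` are invented, and `pd.get_dummies` is only one of several ways to do this (scikit-learn's `OneHotEncoder` is another).

```python
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration).
# 'season' uses the same 1-4 integer coding as the bike-sharing data set.
toy_df = pd.DataFrame({
    'season': [1, 2, 3, 4, 1],
    'temp': [0.30, 0.50, 0.70, 0.40, 0.25],
    'cnt': [100, 200, 300, 150, 120],
})

# Replace the integer-coded column with one indicator column per category.
encoded_df = pd.get_dummies(toy_df, columns=['season'], prefix='season')

print(encoded_df.columns.tolist())
print(encoded_df.head())
```

Each row ends up with exactly one of `season_1` to `season_4` set. Tree-based models split on thresholds and can often work with the integer codes directly, which may be why the first attempt below skips this step.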
2. Build a random forest regression model
In this first attempt, I did not perform any feature analysis on the original data or modify the data set in any way. I simply fitted a random forest regression model directly to see what the result would be.
# Next, build the random forest regression model
from sklearn.ensemble import RandomForestRegressor
rf_regressor = RandomForestRegressor()
# rf_regressor = RandomForestRegressor(n_estimators=1000, max_depth=10, min_samples_split=0.5)
rf_regressor.fit(train_set[:, :-1], train_set[:, -1])  # train the model
# Use the test set to evaluate the regression model
predict_test_y = rf_regressor.predict(test_set[:, :-1])
import sklearn.metrics as metrics
print('Evaluation results of random forest regression model ----->>>')
# scikit-learn metrics take the arguments in the order (y_true, y_pred)
print('Mean squared error (MSE): {}'.format(
    round(metrics.mean_squared_error(test_set[:, -1], predict_test_y), 2)))
print('Explained variance score: {}'.format(
    round(metrics.explained_variance_score(test_set[:, -1], predict_test_y), 2)))
print('R2 score: {}'.format(
    round(metrics.r2_score(test_set[:, -1], predict_test_y), 2)))
-------------------- Output --------------------
Evaluation results of random forest regression model ----->>>
Mean squared error (MSE): 291769.31
Explained variance score: 0.92
R2 score: 0.92
------------------------------------------------
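One detail worth keeping in mind when calling these metrics: scikit-learn expects the true values first and the predictions second. MSE is symmetric, so the order does not matter there, but R2 (and the explained variance score) is normalised by the variance of the first argument, so swapping the arguments can change the result. The sketch below uses made-up numbers purely to show the effect.

```python
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values, chosen only to illustrate the argument order.
y_true = [1, 2, 3, 4]
y_pred = [2, 2, 3, 3]

mse_forward = mean_squared_error(y_true, y_pred)
mse_swapped = mean_squared_error(y_pred, y_true)
r2_forward = r2_score(y_true, y_pred)   # correct order: (y_true, y_pred)
r2_swapped = r2_score(y_pred, y_true)   # arguments swapped

print(mse_forward, mse_swapped)  # the two MSE values are identical
print(r2_forward, r2_swapped)    # the two R2 values differ
```

When predictions are close to the true values, as in this article's model, the difference between the two orders is usually small, but passing `(y_true, y_pred)` avoids the ambiguity.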
Next, I plot the relative feature importance histogram using the plotting function from the previous article, "Machine learning 006: Building a housing price evaluation model with a decision tree regressor". The results are as follows:
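The plotting function itself is not reproduced in this article. A minimal sketch of such a function is given below, under several assumptions: the names `relative_importances` and `plot_feature_importances` are hypothetical, scaling so that the largest importance equals 100 is one common convention rather than necessarily the one used in article 006, and the model is trained on synthetic data instead of the bike-sharing set.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor


def relative_importances(model):
    """Scale feature_importances_ so that the largest value is 100 (assumed convention)."""
    importances = np.asarray(model.feature_importances_, dtype=float)
    return 100.0 * importances / importances.max()


def plot_feature_importances(model, feature_names, title):
    """Draw a bar chart of relative importances, most important feature first."""
    rel = relative_importances(model)
    order = np.argsort(rel)[::-1]          # indices sorted by descending importance
    positions = np.arange(len(order))
    fig, ax = plt.subplots()
    ax.bar(positions, rel[order], align='center')
    ax.set_xticks(positions)
    ax.set_xticklabels(np.asarray(feature_names)[order], rotation=45)
    ax.set_ylabel('Relative importance')
    ax.set_title(title)
    fig.tight_layout()
    return fig


# Synthetic demo data: the target depends almost entirely on the first feature.
rng = np.random.RandomState(37)
X = rng.rand(200, 3)
y = 10.0 * X[:, 0] + 0.1 * rng.rand(200)
demo_model = RandomForestRegressor(n_estimators=20, random_state=37).fit(X, y)

demo_rel = relative_importances(demo_model)
demo_fig = plot_feature_importances(demo_model, ['f0', 'f1', 'f2'],
                                    'Random forest regressor')
```

For the model in this article, the call would presumably look like `plot_feature_importances(rf_regressor, df.columns[:-1], 'Random forest regressor')`, since the last column of `df` is the target `cnt`.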
#################### Summary ####################
1. Without any processing of the data set, the default random forest regressor gives a very large MSE on the test set, but the explained variance score and the R2 score are both 0.92, indicating that the fit is acceptable.
2. The relative importance chart shows that temperature has the greatest impact on the use of shared bikes, which is understandable. For example, when it is too cold in winter or too hot in summer, the number of people riding shared bikes drops significantly. However, the chart also shows yr as the second most important factor. This is probably because the data cover only 2011 and 2012, and more years would be needed for a more reliable result.
#################################################
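The MSE looks very large mainly because it is measured in squared units of the target. Taking the square root gives the root mean squared error (RMSE), which is in the same units as cnt and is easier to judge. The short sketch below only uses the MSE value printed earlier; comparing the result with something like `df['cnt'].mean()` would be a natural next step, though that comparison was not run here.

```python
import math

mse = 291769.31        # test-set MSE reported in the evaluation output
rmse = math.sqrt(mse)  # same units as cnt (rentals per day)

print(round(rmse, 2))
```

An RMSE of a few hundred rentals per day is a more interpretable figure than the raw MSE, and it is consistent with the high R2 score if the daily counts themselves are much larger.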
Note: The code for this part has been uploaded to my GitHub; you are welcome to download it.
References:
1. Classic Examples of Python Machine Learning, by Prateek Joshi, translated by Tao Junjie and Chen Xiaoli