Predicting check-in locations on the Facebook dataset with scikit-learn
The goal of this competition is to predict where a person will check in. For the contest, Facebook created an artificial world containing about 100,000 places on a 10 kilometer by 10 kilometer grid, roughly 100 square kilometers.
Given a set of coordinates, the task is to predict the user's next check-in location from the location, accuracy, and timestamp. The data is generated to resemble location signals from mobile devices.
Feature values: "x", "y", "accuracy", "day", "hour", "weekday"
Target value: place_id
This example trains a model on the simulated Facebook check-in data, using features such as the location coordinates and the check-in time, and finally predicts the ID of the target place. The training-to-test ratio is about 8:2 (the code below relies on train_test_split's default 75:25 split; pass test_size=0.2 for an exact 8:2 split, as in the sketch below).
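As a quick illustration of that split, here is a minimal sketch on made-up data (not the real FBlocation file); the column names mirror the feature list above, and test_size controls the ratio:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data with the same column names as the feature list above
df = pd.DataFrame({
    "x": np.random.uniform(0, 10, 1000),
    "y": np.random.uniform(0, 10, 1000),
    "accuracy": np.random.randint(1, 200, 1000),
    "place_id": np.random.randint(0, 5, 1000),
})
x = df[["x", "y", "accuracy"]]
y = df["place_id"]

# test_size=0.2 gives the 8:2 split described above
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=22)
print(x_train.shape, x_test.shape)  # (800, 3) (200, 3)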
Data preprocessing is the first step in training the model:
Narrow the data range: the full data set has more than 20 million rows, so the program runs very slowly on all of it; narrow the range to a small coordinate window (if your machine is powerful enough, or you have rented a server, feel free to skip this step).
Select time features: extract day, hour and weekday from the time column.
Remove places with few check-ins: dropping rarely visited, low-signal places reduces overfitting (a toy sketch of this filtering step follows this list).
Determine the feature values and the target value.
Split the data set into a training set and a test set.
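Here is that filtering step on a tiny made-up DataFrame (the column names mirror the real data, the values are invented):

import pandas as pd

toy = pd.DataFrame({
    "row_id": range(8),
    "place_id": [1, 1, 1, 1, 2, 2, 3, 3],
})

# Count check-ins per place
place_count = toy.groupby("place_id").count()

# Keep only places with more than 3 check-ins
popular = place_count[place_count["row_id"] > 3]

# Filter the original rows down to those popular places
toy = toy[toy["place_id"].isin(popular.index)]
print(toy)  # only the rows with place_id == 1 remain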
Cross validation: divide the training data into a training set and a validation set. For example, split the data into four parts and use one part as the validation set; then run four rounds of training, each time holding out a different part for validation. This yields four model scores, whose average is taken as the final result, and is called 4-fold cross validation. The code below passes cv=3 to GridSearchCV, i.e. 3-fold cross validation. A minimal standalone sketch of k-fold cross validation follows.
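To make the idea concrete, here is a minimal sketch of k-fold cross validation on synthetic data, using scikit-learn's cross_val_score (illustrative only; in the full example below GridSearchCV performs the cross validation internally):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: 200 points in 2D with 3 made-up classes
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = rng.integers(0, 3, size=200)

# cv=5: split into 5 folds, use each fold once as the validation set,
# and average the 5 accuracy scores
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print(scores, scores.mean())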
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier


def facebook_demo():
    """sk-learn Facebook dataset: predict the check-in location.
    :return: None
    """
    # 1. Get the data set
    facebook = pd.read_csv("/Users/maxinze/Downloads/machine learning day2 information/02-code/FBlocation/train.csv")
    # 2. Basic data processing
    # 2.1 Narrow the data range
    # Select data with x and y in the range (5.0, 6.0) using query
    facebook_data = facebook.query("x > 5.0 & x < 6.0 & y > 5.0 & y < 6.0")
    # 2.2 Select time features
    # Extract the time
    time = pd.to_datetime(facebook_data["time"], unit="s")
    time = pd.DatetimeIndex(time)
    # Add a "day" column
    facebook_data["day"] = time.day
    # Add an "hour" column
    facebook_data["hour"] = time.hour
    # Add a "weekday" column
    facebook_data["weekday"] = time.weekday
    # 2.3 Remove places that are checked in rarely
    # Group by place_id and count the check-ins per place
    place_count = facebook_data.groupby("place_id").count()
    # Select places with more than 3 check-ins
    place_count = place_count[place_count["row_id"] > 3]
    # Keep only the rows whose place_id survived the filter
    facebook_data = facebook_data[facebook_data["place_id"].isin(place_count.index)]
    # facebook_data.shape
    # 2.4 Filter the feature values and the target value
    # Feature values
    x = facebook_data[["x", "y", "accuracy", "day", "hour", "weekday"]]
    # Target value
    y = facebook_data["place_id"]
    # 2.5 Split the data set into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)
    # 3. Feature engineering -- feature preprocessing (standardization)
    # 3.1 Instantiate a transformer
    transfer = StandardScaler()
    # 3.2 Call fit_transform on the training set,
    # then only transform on the test set so that the test data is scaled
    # with the statistics learned from the training set
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. Machine learning -- KNN + CV
    # 4.1 Instantiate an estimator
    estimator = KNeighborsClassifier()
    # 4.2 Wrap it in GridSearchCV to tune n_neighbors with cross validation
    # param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}
    param_grid = {"n_neighbors": [5, 7, 9]}
    estimator = GridSearchCV(estimator, param_grid=param_grid, cv=3)
    # 4.3 Model training
    estimator.fit(x_train, y_train)
    # 5. Model evaluation
    # 5.1 Basic evaluation methods
    score = estimator.score(x_test, y_test)
    print("The final prediction accuracy is:\n", score)
    y_predict = estimator.predict(x_test)
    print("The final predicted values are:\n", y_predict)
    print("Comparison of predicted values and true values:\n", y_predict == y_test)
    # 5.2 Cross-validated evaluation
    print("Best validation score in cross validation:\n", estimator.best_score_)
    print("Best parameter model:\n", estimator.best_estimator_)
    print("Validation and training accuracy for each cross-validation round:\n", estimator.cv_results_)
    return None
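To run the example end to end (assuming the script is saved locally and the CSV path above points at your copy of the FBlocation data):

if __name__ == "__main__":
    facebook_demo()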
Because only a small slice of the data is used here, the accuracy on the test set is not very high; it should improve if the model is trained on the full data set, at the cost of a much longer run time.
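If you want to use more of the data while keeping the run time manageable, one option (not part of the original code, just a suggestion) is to randomly sample a fraction of the rows instead of narrowing the coordinate range:

# Replace the coordinate-range filter with a random sample of the full file;
# frac here is an arbitrary illustrative value, not a recommendation
facebook_data = facebook.sample(frac=0.05, random_state=22)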