Predicting check-in locations on the Facebook dataset with scikit-learn

The goal of this competition is to predict where a person will check in. For the contest, Facebook created a virtual world containing about 100,000 places in a 100-square-kilometer area (10 kilometers by 10 kilometers).

Given a set of coordinates, the task is to predict the user's next check-in location from the user's position, its accuracy, and the timestamp. The data is designed to resemble location signals from mobile devices.

Feature values: “x”, “y”, “accuracy”, “day”, “hour”, “weekday”

Target value: place_id
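To make the feature/target layout concrete, here is a hypothetical couple of rows (all values made up for illustration; only the column names come from the dataset):

```python
import pandas as pd

# Made-up sample rows with the dataset's feature columns and target column
sample = pd.DataFrame({
    "x": [1.2, 5.6],
    "y": [3.4, 5.9],
    "accuracy": [65, 12],
    "day": [3, 7],
    "hour": [14, 9],
    "weekday": [2, 5],
    "place_id": [8523065625, 1757726713],
})

# Features are the six measurement columns; the target is place_id
features = sample[["x", "y", "accuracy", "day", "hour", "weekday"]]
target = sample["place_id"]
```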

This example trains a model on the Facebook data using features such as the location coordinates and check-in time, and finally outputs the ID of the predicted place. The training-set-to-test-set ratio is 8:2.
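The 8:2 split mentioned above corresponds to passing `test_size=0.2` to scikit-learn's `train_test_split`; a minimal sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)

# test_size=0.2 reserves 20% of the samples for testing (an 8:2 ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=22)

print(len(X_train), len(X_test))  # 8 2
```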

Data preprocessing is the first step of model training:

- Narrow the data range: the dataset has more than 20 million rows, so the program runs very slowly on the full data; narrow the range appropriately. If your machine is powerful enough (or you rent a server), feel free to skip this step.
- Select time features: split the time column into day, hour, and weekday.
- Remove places with few check-ins: dropping rarely visited, uninformative places reduces overfitting.
- Determine the feature values and the target value.
- Split the dataset into training and test sets.
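The time-feature step above can be sketched on a toy Series of Unix timestamps (values made up; the column names match the ones added later in the code):

```python
import pandas as pd

# Unix timestamps in seconds, like the dataset's "time" column
ts = pd.Series([65380, 186555])

# Convert to datetimes, then use DatetimeIndex to pull out components
t = pd.DatetimeIndex(pd.to_datetime(ts, unit="s"))

print(t.day)      # day of month
print(t.hour)     # hour of day
print(t.weekday)  # Monday=0 ... Sunday=6
```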

Cross-validation: divide the training data into training and validation sets. As shown in the figure below, the data is split into four parts, one of which serves as the validation set. Four rounds of training are then performed, each with a different validation fold, giving four model results whose average is taken as the final score. This is known as 4-fold cross-validation. In this example cv=3, i.e. 3-fold cross-validation.
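The rotating-fold scheme described above can be sketched with scikit-learn's `KFold` on toy data (4 folds, matching the description):

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(8)

# 4-fold CV: the data is cut into 4 parts; each part serves as the
# validation set exactly once while the other 3 train the model
kf = KFold(n_splits=4)
folds = list(kf.split(data))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {i}: train={train_idx} validation={val_idx}")
```

Averaging the four per-fold scores gives the cross-validated estimate; `GridSearchCV` below does this internally for every candidate parameter setting.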

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier


def facebook_demo():
    """sk-learn Facebook dataset: predict check-in locations."""
    # 1. Get the dataset (adjust the path to where train.csv lives locally)
    facebook = pd.read_csv("./FBlocation/train.csv")

    # 2. Basic data processing
    # 2.1 Narrow the data range
    # Select data where x and y are both in (5.0, 6.0) using query;
    # .copy() avoids SettingWithCopyWarning when adding columns below
    facebook_data = facebook.query("x > 5.0 & x < 6.0 & y > 5.0 & y < 6.0").copy()

    # 2.2 Select time features
    # Convert Unix timestamps (seconds) to datetimes
    time = pd.to_datetime(facebook_data["time"], unit="s")
    time = pd.DatetimeIndex(time)
    # add a column of day
    facebook_data["day"] = time.day
    # add a column of hour
    facebook_data["hour"] = time.hour
    # add a column of weekdays
    facebook_data["weekday"] = time.weekday

    # 2.3 Remove places that are rarely checked in
    # Group by place_id and count the check-ins per place
    place_count = facebook_data.groupby("place_id").count()
    # Keep places with more than 3 check-ins
    place_count = place_count[place_count["row_id"] > 3]
    # Keep only rows whose place_id survived the filter
    facebook_data = facebook_data[facebook_data["place_id"].isin(place_count.index)]
    # facebook_data.shape

    # 2.4 Select the feature values and the target value
    # feature values
    x = facebook_data[["x", "y", "accuracy", "day", "hour", "weekday"]]
    # target value
    y = facebook_data["place_id"]

    # 2.5 Split the dataset into training and test sets (8:2)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=22)

    # 3. Feature Engineering -- Feature Preprocessing (standardization)
    # 3.1 Instantiate a converter
    transfer = StandardScaler()
    # 3.2 Call fit_transform on the training set, then transform the test set
    # with the same fitted scaler (never refit on the test set)
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    # 4. Machine learning -- KNN + CV
    # 4.1 Instantiate an estimator
    estimator = KNeighborsClassifier()
    # 4.2 Wrap it in GridSearchCV
    # param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}
    param_grid = {"n_neighbors": [5, 7, 9]}
    estimator = GridSearchCV(estimator, param_grid=param_grid, cv=3)
    # 4.3 Model training
    estimator.fit(x_train, y_train)

    # 5. Model evaluation
    # 5.1 Basic evaluation methods
    score = estimator.score(x_test, y_test)
    print("The final prediction accuracy is:\n", score)

    y_predict = estimator.predict(x_test)
    print("The final predicted values are:\n", y_predict)
    print("Comparison of predicted and true values:\n", y_predict == y_test)

    # 5.2 Evaluation with cross-validation results
    print("Best cross-validation score:\n", estimator.best_score_)
    print("Best parameter model:\n", estimator.best_estimator_)
    print("Validation and training results for each cross-validation candidate:\n", estimator.cv_results_)

    return None

Since only part of the data was selected when running the code, the test accuracy after training is not very high; running on the full dataset would improve it.