94% Accuracy! Identifying Twitter or Weibo Bots with Python Machine Learning

Bots on Twitter or Weibo can be dangerous: they can generate fake traffic, spread rumours, and even carry out malicious operations. Here we use the Twitter bot classification data from a 2017 Kaggle competition held at New York University for our recognition experiment. The experimental data set is available here: Download the Python Twitter bot classification dataset.

To start, we need to install the following Python packages (libraries). Open Command Prompt (Windows) or Terminal (macOS) and enter the following commands:

pip install numpy
pip install seaborn
pip install pandas
pip install matplotlib
pip install scikit-learn

NumPy is largely written in C and is designed to be faster than the equivalent pure-Python code. pandas is used to load and manipulate tabular data. Seaborn and Matplotlib are used for data visualization. Scikit-learn ships with many commonly used machine learning models, which makes it very simple to use.

1. Loading data with Python

Without further ado, let's use pandas to load the data and separate the bot and non-bot rows:

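import pandas as pd
import numpy as np
import seaborn
import matplotlib.pyplot  # importing the pyplot submodule makes matplotlib.pyplot usable below

# Load the training data, then separate bot rows from non-bot rows
data = pd.read_csv('training_data.csv')
Bots = data[data.bot==1]
NonBots = data[data.bot==0]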

Use a heatmap to identify missing data in the training/test sets:

# Heatmap of missing values: cells are marked yellow where the data is null
seaborn.heatmap(data.isnull(), yticklabels=False, cbar=False, cmap='viridis')
matplotlib.pyplot.tight_layout()
matplotlib.pyplot.show()

2. Feature selection in Python

What is feature selection? It is actually very simple. How do we tell a watermelon from a durian in daily life? By their appearance: a durian is thorny and yellow, while a watermelon is round and green. A machine learning model works the same way: we need to pick out features, like the appearance of a watermelon, that distinguish the two classes. For example, we can apply the heatmap from the previous section to check for missing data in bots and non-bots separately:

Missing-data heatmap for bots

Missing-data heatmap for non-bots

We can clearly see that bots are missing location and url values more often than non-bots, so we can add these two columns to our features. Since the data set is small, we skip string encoding and use a simple presence flag instead. Taking the location column as an example: False if location is missing, True if it exists.
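The visual comparison above can also be checked numerically. The following is a minimal sketch, assuming a tiny made-up table in place of training_data.csv (the variable name toy and all of its values are hypothetical); it computes the fraction of missing location and url values per class and then builds the presence flags described above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for training_data.csv (all values are made up for illustration)
toy = pd.DataFrame({
    "location": ["NYC", np.nan, np.nan, "Paris", np.nan, "Tokyo"],
    "url": [np.nan, np.nan, "http://example.com",
            "http://example.org", np.nan, "http://example.net"],
    "bot": [1, 1, 1, 0, 0, 0],
})

# Fraction of missing values per column, computed separately for bots (1) and non-bots (0)
missing_by_class = (
    toy[["location", "url"]].isnull().astype(int).groupby(toy["bot"]).mean()
)
print(missing_by_class)

# Presence flags: False where the value is missing, True where it exists
toy["location_binary"] = ~toy["location"].isnull()
toy["url_binary"] = ~toy["url"].isnull()
print(toy[["location_binary", "url_binary"]])
```

On the real data set, a clearly higher missing fraction in the bot row would support keeping the column as a feature.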

Other features include required profile information such as name and description. We can also characterize Twitter bots by picking out suspicious words they often use, setting the feature to True if their text contains any of those words. Here is an example word list for bots; you can add more words:

# Adjacent string literals are concatenated, so each line ends with '|'
# to keep its last word a separate alternative in the regex
bag_of_words_bot = (
    r'bot|b0t|cannabis|tweet me|mishear|follow me|updates every|gorilla|yes_ofc|forget|'
    r'expos|kill|bbb|truthe|fake|anony|free|virus|funky|RNA|jargon|'
    r'nerd|swag|jack|chick|prison|paper|pokem|xx|freak|ffd|dunia|clone|genie|bbb|'
    r'ffd|onlyman|emoji|joke|troll|droop|free|every|wow|cheese|yeah|bio|magic|wizard|face'
)

Encode our features in numeric form:

# For each value in the column: True if it contains any of the listed words, False otherwise
data['screen_name_binary'] = data.screen_name.str.contains(bag_of_words_bot, case=False, na=False)
data['name_binary'] = data.name.str.contains(bag_of_words_bot, case=False, na=False)
data['description_binary'] = data.description.str.contains(bag_of_words_bot, case=False, na=False)
data['status_binary'] = data.status.str.contains(bag_of_words_bot, case=False, na=False)

# For each value in the column: False if listedcount is greater than 20000, True otherwise
data['listed_count_binary'] = ~(data.listedcount > 20000)

# For each value in the column: False if it is missing, True otherwise
data['location_binary'] = ~data.location.isnull()
data['url_binary'] = ~data.url.isnull()

# Select our features; 'bot' is the label column and is kept last
features = ['screen_name_binary', 'name_binary', 'description_binary', 'status_binary', 'verified', 'followers_count', 'friends_count', 'statuses_count', 'listed_count_binary', 'url_binary', 'location_binary', 'default_profile', 'default_profile_image', 'bot']

Note that every text column is encoded as a binary value (True/False, that is, 1/0) indicating whether it contains any of the listed words.
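To make the encoding concrete, here is a small sketch on made-up data. The shortened pattern demo_words and the sample descriptions are assumptions for illustration only; the article's own word list is much longer.

```python
import numpy as np
import pandas as pd

# Hypothetical shortened word list, for illustration only
demo_words = r"bot|follow me|free"

# Made-up profile descriptions, including a missing one
descriptions = pd.Series(
    ["Weather BOT, updates hourly", "I love hiking", np.nan, "Get FREE followers"]
)

# case=False ignores capitalization; na=False makes missing text count as "no match"
flags = descriptions.str.contains(demo_words, case=False, na=False)

# Booleans convert directly to 0/1 if an explicit numeric column is wanted
numeric_flags = flags.astype(int)
print(numeric_flags.tolist())  # prints [1, 0, 0, 1]
```

Without na=False, the missing description would produce a missing value instead of False, which scikit-learn's decision tree may not accept as input.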

3. Training and testing with Python scikit-learn

Now let's classify the accounts using the decision tree model from the Python scikit-learn package.

First, we import the three names we will use:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split 

DecisionTreeClassifier, imported from sklearn.tree, is the decision tree classifier that we will train later. accuracy_score, imported from sklearn.metrics, makes it easy to calculate accuracy. train_test_split, imported from sklearn.model_selection, makes it easy to split the data into a training set and a test set.

Splitting into training and test sets

# X: all selected feature columns, without the label column 'bot'
X = data[features].drop(columns='bot')
# y: the label column (1: bot, 0: non-bot)
y = data['bot']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

Here we use the train_test_split() function: test_size=0.3 means 30% of the data is held out for testing, and random_state=101 is the random seed, so the split is the same on every run as long as the data set is unchanged.
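The effect of the two parameters can be seen in a short sketch on made-up data (the arrays X_demo and y_demo are hypothetical and unrelated to the bot data set): two calls with the same seed return the same split, and 30% of the rows end up in the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each, and alternating labels
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# Two calls with the same random_state produce the same split
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)

print(len(a_train), len(a_test))        # sizes of the training and test parts
print(np.array_equal(a_test, b_test))   # whether both calls chose the same test rows
```

Changing random_state to another value would generally select different rows, which is why fixing it makes results comparable between runs.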

Training and testing

clf = DecisionTreeClassifier(criterion='entropy', min_samples_leaf=50, min_samples_split=10)
clf.fit(X_train, y_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)  
print("Training Accuracy: %.5f" %accuracy_score(y_train, y_pred_train))
print("Test Accuracy: %.5f" %accuracy_score(y_test, y_pred_test)) 

A decision tree model, clf, is initialized. clf.fit is used for training and clf.predict is used for testing.
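Once a tree has been trained, it can also report how much it relied on each feature, which may help when deciding which features to keep. The sketch below is only an illustration on synthetic data from make_classification; the column names f0 to f3 are hypothetical, and fixing random_state on the classifier is an added assumption that the article's code does not make.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 2 informative features and 2 noise features
X_arr, y_arr = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=101
)
X_syn = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

# Same hyperparameters as in the article, plus a fixed seed for repeatability
tree = DecisionTreeClassifier(
    criterion="entropy", min_samples_leaf=50, min_samples_split=10, random_state=101
)
tree.fit(X_syn, y_arr)

# feature_importances_ holds one non-negative weight per feature, normalized to sum to 1
importances = pd.Series(tree.feature_importances_, index=X_syn.columns)
print(importances.sort_values(ascending=False))
```

Applied to the bot data, the same attribute on clf would be indexed by X_train.columns instead.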

4. Model results

My final result is as follows. The test accuracy reaches 94.4%, which is quite satisfactory and would have ranked around 27th in the Kaggle competition at the time. You can also try models other than decision trees, such as SVM and logistic regression (LR).

Python machine learning for identifying Twitter or Weibo bots (accuracy: 94.4%)
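Trying the other models mentioned above might look like the following sketch. It runs on synthetic data from make_classification rather than the bot data set, so its scores say nothing about the 94.4% result; the StandardScaler step and max_iter=1000 are assumptions added here because SVM and logistic regression usually behave better on scaled inputs.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the feature matrix and labels
X_syn, y_syn = make_classification(n_samples=600, n_features=8, random_state=101)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=101)

# Each model is wrapped in a pipeline so that scaling is fitted on the training part only
models = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name} test accuracy: {scores[name]:.5f}")
```

To run it on the real data, X_tr, X_te, y_tr and y_te would be replaced by the X_train, X_test, y_train and y_test produced earlier.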

That's the end of this article. If you enjoyed today's Python tutorial, please keep following us, and give us a like if it helped you.

