1. Data sets
Introduction to the dataset
The Adult dataset is a classic data-mining dataset extracted from the 1994 U.S. Census database, hence its other name, the "Census Income" dataset. It contains 48,842 records, split into 32,561 training records and 16,281 test records. The class variable is whether annual income exceeds 50K.
The attribute variables cover 14 kinds of important personal information, including age, work class, education background and occupation; 8 of them are discrete categorical variables and the other 6 are continuous numerical variables. It is a classification dataset used to predict whether annual income exceeds $50K. The dataset can be downloaded from the UCI Machine Learning Repository.
Data set preprocessing and analysis
The pandas and NumPy libraries are used to read the data and check for missing values.
import pandas as pd
import numpy as np
df = pd.read_csv('adult.csv', header = None, names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'])
df.head()
df.info()
Although the output above shows no missing values, missing values are actually encoded as " ?" (note the space before the question mark); info() only detects NaN or NaT, so it misses them.
df.apply(lambda x : np.sum(x == " ?"))
workclass (discrete) is missing 1,836 values, occupation (discrete) 1,843, and native-country 583. Discrete attributes are usually filled with the mode, but the missing markers must first be converted to NaN or NaT. Also, since income takes only two values, " >50K" is replaced by 1 and " <=50K" by 0.
df.replace(" ?", pd.NaT, inplace = True)
df.replace(" >50K", 1, inplace = True)
df.replace(" <=50K", 0, inplace = True)
trans = {'workclass' : df['workclass'].mode()[0], 'occupation' : df['occupation'].mode()[0], 'native-country' : df['native-country'].mode()[0]}
df.fillna(trans, inplace = True)
df.describe()
As the describe() output shows, more than 75% of people have no capital gains and no capital losses, so these two columns carry little information; the fnlwgt (sampling weight) column is also irrelevant to income. These three columns are therefore dropped, and we focus on the remaining data.
df.drop('fnlwgt', axis = 1, inplace = True)
df.drop('capital-gain', axis = 1, inplace = True)
df.drop('capital-loss', axis = 1, inplace = True)
df.head()
import matplotlib.pyplot as plt
plt.scatter(df["income"], df["age"])
plt.grid(b = True, which = "major", axis = 'y')
plt.title("Income distribution by age (1 is >50K)")
plt.show()
The scatter plot shows that, among middle-aged and older people, fewer have income >50K than income <=50K.
df["workclass"].value_counts()
income_0 = df["workclass"][df["income"] == 0].value_counts()
income_1 = df["workclass"][df["income"] == 1].value_counts()
df1 = pd.DataFrame({" >50K" : income_1, " <=50K" : income_0})
df1.plot(kind = 'bar', stacked = True)
plt.title("income distribution by Workclass")
plt.xlabel("workclass")
plt.ylabel("number of person")
plt.show()
Observe the effect of work class on annual income. Private workers are the most numerous in both income categories, but Self-emp-inc has the highest ratio of >50K to <=50K.
df1 = df["hours-per-week"].groupby(df["workclass"]).agg(['mean', 'max', 'min'])
df1 = df1.sort_values(by = 'mean', ascending = False)
df1
Weekly working hours are grouped by work class; each group's mean, maximum and minimum are computed and the result is sorted by the mean. Federal-gov workers work the most hours on average, but do not have the highest share of high earners.
income_0 = df["education"][df["income"] == 0].value_counts()
income_1 = df["education"][df["income"] == 1].value_counts()
df1 = pd.DataFrame({" >50K" : income_1, " <=50K" : income_0})
df1.plot(kind = 'bar', stacked = True)
plt.title("income distribution by Education")
plt.xlabel("education")
plt.ylabel("number of person")
plt.show()
Looking at the effect of education on annual income: for Bachelors the two income groups are closest in size, i.e. the proportion of high earners is among the largest.
income_0 = df["education-num"][df["income"] == 0]
income_1 = df["education-num"][df["income"] == 1]
df1 = pd.DataFrame({' >50K' : income_1, ' <=50K' : income_0})
df1.plot(kind = 'kde')
plt.title("education of income")
plt.xlabel("education-num")
The kernel density plot shows the effect of years of education on income. Around the median education length, the density for income >50K is lower than for <=50K; to the right of the median the opposite holds; elsewhere the two are roughly balanced.
# fig, ([[ax1, ax2, ax3], [ax4, ax5, ax6]]) = plt.subplots(2, 3, figsize=(15, 10))
fig = plt.figure(figsize = (15, 10))
ax1 = fig.add_subplot(231)
income_0 = df[df["race"] == ' White']["relationship"][df["income"] == 0].value_counts()
income_1 = df[df["race"] == ' White']["relationship"][df["income"] == 1].value_counts()
df1 = pd.DataFrame({' >50K' : income_1, ' <=50K' : income_0})
df1.plot(kind = 'bar', ax = ax1)
ax1.set_ylabel('number of person')
ax1.set_title('income of relationship by race_White')
ax2 = fig.add_subplot(232)
income_0 = df[df["race"] == ' Black']["relationship"][df["income"] == 0].value_counts()
income_1 = df[df["race"] == ' Black']["relationship"][df["income"] == 1].value_counts()
df2 = pd.DataFrame({' >50K' : income_1, ' <=50K' : income_0})
df2.plot(kind = 'bar', ax = ax2)
ax2.set_ylabel('number of person')
ax2.set_title('income of relationship by race_Black')
ax3 = fig.add_subplot(233)
income_0 = df[df["race"] == ' Asian-Pac-Islander']["relationship"][df["income"] == 0].value_counts()
income_1 = df[df["race"] == ' Asian-Pac-Islander']["relationship"][df["income"] == 1].value_counts()
df3 = pd.DataFrame({' >50K' : income_1, ' <=50K' : income_0})
df3.plot(kind = 'bar', ax = ax3)
ax3.set_ylabel('number of person')
ax3.set_title('income of relationship by race_Asian-Pac-Islander')
ax4 = fig.add_subplot(234)
income_0 = df[df["race"] == ' Amer-Indian-Eskimo']["relationship"][df["income"] == 0].value_counts()
income_1 = df[df["race"] == ' Amer-Indian-Eskimo']["relationship"][df["income"] == 1].value_counts()
df4 = pd.DataFrame({' >50K' : income_1, ' <=50K' : income_0})
df4.plot(kind = 'bar', ax = ax4)
ax4.set_ylabel('number of person')
ax4.set_title('income of relationship by race_Amer-Indian-Eskimo')
ax5 = fig.add_subplot(235)
income_0 = df[df["race"] == ' Other']["relationship"][df["income"] == 0].value_counts()
income_1 = df[df["race"] == ' Other']["relationship"][df["income"] == 1].value_counts()
df5 = pd.DataFrame({' >50K' : income_1, ' <=50K' : income_0})
df5.plot(kind = 'bar', ax = ax5)
ax5.set_ylabel('number of person')
ax5.set_title('income of relationship by race_Other')
plt.tight_layout()
This figure shows the income distribution over family relationships for each race.
# fig, ([[ax1, ax2, ax3], [ax4, ax5, ax6]]) = plt.subplots(2, 3, figsize=(10, 5))
fig = plt.figure()
ax1 = fig.add_subplot(121)
income_0 = df[df["sex"] == ' Male']["occupation"][df["income"] == 0].value_counts()
income_1 = df[df["sex"] == ' Male']["occupation"][df["income"] == 1].value_counts()
df1 = pd.DataFrame({' >50K' : income_1, ' <=50K' : income_0})
df1.plot(kind = 'bar', ax = ax1)
ax1.set_ylabel('number of person')
ax1.set_title('income of occupation by sex_Male')
ax2 = fig.add_subplot(122)
income_0 = df[df["sex"] == ' Female']["occupation"][df["income"] == 0].value_counts()
income_1 = df[df["sex"] == ' Female']["occupation"][df["income"] == 1].value_counts()
df2 = pd.DataFrame({' >50K' : income_1, ' <=50K' : income_0})
df2.plot(kind = 'bar', ax = ax2)
ax2.set_ylabel('number of person')
ax2.set_title('income of occupation by sex_Female')
plt.tight_layout()
This shows the income distribution of occupations by gender. More Exec-managerial men earned >50K than <=50K, while for women the reverse was true.
df_object_col = [col for col in df.columns if df[col].dtype.name == 'object']
df_int_col = [col for col in df.columns if df[col].dtype.name != 'object' and col != 'income']
target = df["income"]
dataset = pd.concat([df[df_int_col], pd.get_dummies(df[df_object_col])], axis = 1)
dataset.head()
First the column data types are identified; the non-numerical columns are one-hot encoded and concatenated with the numerical columns. Finally, income is separated out as the label, with the remaining columns serving as the training or test set.
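One subtlety with one-hot encoding: the training and test sets can end up with different dummy columns when a category is absent from one of them. A minimal sketch of aligning them with pandas' reindex (the category values here are illustrative, not from the actual dataset):

```python
import pandas as pd

train = pd.DataFrame({'country': [' United-States', ' Mexico']})
test = pd.DataFrame({'country': [' United-States', ' Holand-Netherlands']})

train_oh = pd.get_dummies(train)
test_oh = pd.get_dummies(test)

# Align test columns to the training columns: missing dummy columns
# are added and filled with 0, extra ones are dropped.
test_oh = test_oh.reindex(columns = train_oh.columns, fill_value = 0)
print(list(test_oh.columns) == list(train_oh.columns))  # True
```

The helper functions used later in this article achieve the same alignment by adding the missing columns manually.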
2. Four models are used to predict on the above dataset
Deep learning
Importing related packages
import pandas as pd
import numpy as np
import os
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import csv
from torch.utils.tensorboard import SummaryWriter
from torch.utils.data import Dataset, DataLoader
For data preprocessing, note that the training and test sets may have different shapes after one-hot encoding, so their columns must be aligned. Because all-zero columns are added for the missing categories, an oddity appears: after standardization, those all-zero columns become NaN (their standard deviation is 0), so the NaN columns must be reset to zero, otherwise the network's predictions will all be NaN. In this experiment the training data is split 8:2, with the latter 20% used as the validation set. Normalizing the dataset also improves results considerably.
def add_missing_columns(d, columns) :
    missing_col = set(columns) - set(d.columns)
    for col in missing_col :
        d[col] = 0

def fix_columns(d, columns) :
    add_missing_columns(d, columns)
    assert(set(columns) - set(d.columns) == set())
    d = d[columns]
    return d

def data_process(df, model) :
    df.replace(" ?", pd.NaT, inplace = True)
    if model == 'train' :
        df.replace(" >50K", 1, inplace = True)
        df.replace(" <=50K", 0, inplace = True)
    if model == 'test' :
        # In the test file the labels carry a trailing dot
        df.replace(" >50K.", 1, inplace = True)
        df.replace(" <=50K.", 0, inplace = True)
    trans = {'workclass' : df['workclass'].mode()[0], 'occupation' : df['occupation'].mode()[0], 'native-country' : df['native-country'].mode()[0]}
    df.fillna(trans, inplace = True)
    df.drop('fnlwgt', axis = 1, inplace = True)
    df.drop('capital-gain', axis = 1, inplace = True)
    df.drop('capital-loss', axis = 1, inplace = True)
    df_object_col = [col for col in df.columns if df[col].dtype.name == 'object']
    df_int_col = [col for col in df.columns if df[col].dtype.name != 'object' and col != 'income']
    target = df["income"]
    dataset = pd.concat([df[df_int_col], pd.get_dummies(df[df_object_col])], axis = 1)
    return target, dataset
class Adult_data(Dataset) :
    def __init__(self, model) :
        super(Adult_data, self).__init__()
        self.model = model
        df_train = pd.read_csv('adult.csv', header = None, names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'])
        df_test = pd.read_csv('data.test', header = None, skiprows = 1, names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'])
        train_target, train_dataset = data_process(df_train, 'train')
        test_target, test_dataset = data_process(df_test, 'test')
        # One-hot column alignment
        test_dataset = fix_columns(test_dataset, train_dataset.columns)
        train_dataset = train_dataset.apply(lambda x : (x - x.mean()) / x.std())
        test_dataset = test_dataset.apply(lambda x : (x - x.mean()) / x.std())
        train_target, test_target = np.array(train_target), np.array(test_target)
        train_dataset, test_dataset = np.array(train_dataset, dtype = np.float32), np.array(test_dataset, dtype = np.float32)
        if model == 'test' :
            # All-zero columns have std 0, so standardization turned them into NaN
            isnan = np.isnan(test_dataset)
            test_dataset[np.where(isnan)] = 0.0
        if model == 'test' :
            self.target = torch.tensor(test_target, dtype = torch.int64)
            self.dataset = torch.FloatTensor(test_dataset)
        else :
            # The first eighty percent of the data is the training set, the rest the validation set
            if model == 'train' :
                self.target = torch.tensor(train_target, dtype = torch.int64)[ : int(len(train_target) * 0.8)]
                self.dataset = torch.FloatTensor(train_dataset)[ : int(len(train_dataset) * 0.8)]
            else :
                self.target = torch.tensor(train_target, dtype = torch.int64)[int(len(train_target) * 0.8) : ]
                self.dataset = torch.FloatTensor(train_dataset)[int(len(train_dataset) * 0.8) : ]
        print(self.dataset.shape, self.target.dtype)

    def __getitem__(self, item) :
        return self.dataset[item], self.target[item]

    def __len__(self) :
        return len(self.dataset)

train_dataset = Adult_data(model = 'train')
val_dataset = Adult_data(model = 'val')
test_dataset = Adult_data(model = 'test')
train_loader = DataLoader(train_dataset, batch_size = 64, shuffle = True, drop_last = False)
val_loader = DataLoader(val_dataset, batch_size = 64, shuffle = False, drop_last = False)
test_loader = DataLoader(test_dataset, batch_size = 64, shuffle = False, drop_last = False)
To build the network: since this is a simple binary classification task, a small multi-layer perceptron is used, and softmax normalization is applied to the output afterwards.
class Adult_Model(nn.Module) :
    def __init__(self) :
        super(Adult_Model, self).__init__()
        self.net = nn.Sequential(nn.Linear(102, 64),
                                 nn.ReLU(),
                                 nn.Linear(64, 32),
                                 nn.ReLU(),
                                 nn.Linear(32, 2))

    def forward(self, x) :
        out = self.net(x)
        # Note: nn.CrossEntropyLoss applies log-softmax internally,
        # so returning the raw logits would also work here
        return F.softmax(out, dim = 1)
Training and validation. After each epoch the validation loss is compared with the best value so far; whenever val_loss is lower, the model is saved, so the best model is available at the end of training.
device = torch.device('cuda' if torch.cuda.is_available() else "cpu")
model = Adult_Model().to(device)
optimizer = optim.SGD(model.parameters(), lr = 0.001, momentum = 0.9)
criterion = nn.CrossEntropyLoss()
max_epoch = 30
classes = [' <=50K', ' >50K']
best_loss = 1000000
os.makedirs('MyModels', exist_ok = True)
writer = SummaryWriter(log_dir = 'logs')
for epoch in range(max_epoch) :
    train_loss = 0.0
    train_acc = 0.0
    model.train()
    for x, label in train_loader :
        x, label = x.to(device), label.to(device)
        optimizer.zero_grad()
        out = model(x)
        loss = criterion(out, label)
        train_loss += loss.item()
        loss.backward()
        _, pred = torch.max(out, 1)
        num_correct = (pred == label).sum().item()
        acc = num_correct / x.shape[0]
        train_acc += acc
        optimizer.step()
    print(f'epoch : {epoch + 1}, train_loss : {train_loss / len(train_loader.dataset)}, train_acc : {train_acc / len(train_loader)}')
    writer.add_scalar('train_loss', train_loss / len(train_loader.dataset), epoch)
    with torch.no_grad() :
        total_loss = []
        model.eval()
        for x, label in val_loader :
            x, label = x.to(device), label.to(device)
            out = model(x)
            loss = criterion(out, label)
            total_loss.append(loss.item())
        val_loss = sum(total_loss) / len(total_loss)
        if val_loss < best_loss :
            best_loss = val_loss
            torch.save(model.state_dict(), 'MyModels/Deeplearning_Model.pth')
del model
Load the best model saved during training, run prediction on the test set, and save the results.
best_model = Adult_Model().to(device)
ckpt = torch.load('MyModels/Deeplearning_Model.pth', map_location = 'cpu')
best_model.load_state_dict(ckpt)
test_loss = 0.0
test_acc = 0.0
best_model.eval()
result = []
for x, label in test_loader :
    x, label = x.to(device), label.to(device)
    out = best_model(x)
    loss = criterion(out, label)
    test_loss += loss.item()
    _, pred = torch.max(out, dim = 1)
    result.append(pred.detach())
    num_correct = (pred == label).sum().item()
    acc = num_correct / x.shape[0]
    test_acc += acc
print(f'test_loss : {test_loss / len(test_loader.dataset)}, test_acc : {test_acc / len(test_loader)}')
result = torch.cat(result, dim = 0).cpu().numpy()
os.makedirs('Predict', exist_ok = True)
with open('Predict/Deeplearning.csv', 'w', newline = '') as file :
    writer = csv.writer(file)
    writer.writerow(['id', 'pred_result'])
    for i, pred in enumerate(result) :
        writer.writerow([i, classes[pred]])
The accuracy of 0.834 is reasonably good.
The decision tree
Data processing is basically the same as for deep learning; only the return values differ.
import pandas as pd
import numpy as np
import csv
import graphviz
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier, export_graphviz
def add_missing_columns(d, columns) :
    missing_col = set(columns) - set(d.columns)
    for col in missing_col :
        d[col] = 0

def fix_columns(d, columns) :
    add_missing_columns(d, columns)
    assert(set(columns) - set(d.columns) == set())
    d = d[columns]
    return d

def data_process(df, model) :
    df.replace(" ?", pd.NaT, inplace = True)
    if model == 'train' :
        df.replace(" >50K", 1, inplace = True)
        df.replace(" <=50K", 0, inplace = True)
    if model == 'test' :
        df.replace(" >50K.", 1, inplace = True)
        df.replace(" <=50K.", 0, inplace = True)
    trans = {'workclass' : df['workclass'].mode()[0], 'occupation' : df['occupation'].mode()[0], 'native-country' : df['native-country'].mode()[0]}
    df.fillna(trans, inplace = True)
    df.drop('fnlwgt', axis = 1, inplace = True)
    df.drop('capital-gain', axis = 1, inplace = True)
    df.drop('capital-loss', axis = 1, inplace = True)
    df_object_col = [col for col in df.columns if df[col].dtype.name == 'object']
    df_int_col = [col for col in df.columns if df[col].dtype.name != 'object' and col != 'income']
    target = df["income"]
    dataset = pd.concat([df[df_int_col], pd.get_dummies(df[df_object_col])], axis = 1)
    return target, dataset

def Adult_data() :
    df_train = pd.read_csv('adult.csv', header = None, names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'])
    df_test = pd.read_csv('data.test', header = None, skiprows = 1, names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'])
    train_target, train_dataset = data_process(df_train, 'train')
    test_target, test_dataset = data_process(df_test, 'test')
    # One-hot column alignment
    test_dataset = fix_columns(test_dataset, train_dataset.columns)
    columns = train_dataset.columns
    train_target, test_target = np.array(train_target), np.array(test_target)
    train_dataset, test_dataset = np.array(train_dataset), np.array(test_dataset)
    return train_dataset, train_target, test_dataset, test_target, columns

train_dataset, train_target, test_dataset, test_target, columns = Adult_data()
print(train_dataset.shape, test_dataset.shape, train_target.shape, test_target.shape)
The GridSearchCV class performs an exhaustive search over specified parameter values for an estimator; here it is used to search for the optimal depth of the decision tree.
# params = {'max_depth' : range(1, 20)}
# best_clf = GridSearchCV(DecisionTreeClassifier(criterion = 'entropy', random_state = 20), param_grid = params)
# best_clf = best_clf.fit(train_dataset, train_target)
# print(best_clf.best_params_)
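The search above is commented out because it can take a while on the full data. As a minimal runnable sketch of the same idea, here it is on synthetic data (the feature shapes and the depth range are illustrative):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((500, 10))
# A simple learnable rule so the search has something to find
y = (X[:, 0] + X[:, 1] > 1).astype(int)

params = {'max_depth' : range(1, 10)}
search = GridSearchCV(DecisionTreeClassifier(criterion = 'entropy', random_state = 20),
                      param_grid = params)
search.fit(X, y)
print(search.best_params_)
```

best_params_ then reports the depth with the best cross-validated score, which is how the depth of 8 used below was obtained on the real data.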
A decision tree is used for classification, with entropy as the split criterion, a maximum depth of 8 (found in the previous step), and a minimum of 5 samples required to split a node; the predictions are then saved.
# clf = DecisionTreeClassifier() score : 0.7836742214851667
classes = [' <=50K', ' >50K']
clf = DecisionTreeClassifier(criterion = 'entropy', max_depth = 8, min_samples_split = 5)
clf = clf.fit(train_dataset, train_target)
pred = clf.predict(test_dataset)
print(pred)
score = clf.score(test_dataset, test_target)
print(score)
with open('Predict/DecisionTree.csv', 'w', newline = '') as file :
    writer = csv.writer(file)
    writer.writerow(['id', 'result_pred'])
    for i, result in enumerate(pred) :
        writer.writerow([i, classes[result]])
The result is 0.835, similar to deep learning.
Visualize the decision tree structure:
dot_data = export_graphviz(clf, out_file = None, feature_names = columns, class_names = classes, filled = True, rounded = True)
graph = graphviz.Source(dot_data)
graph
Support vector machine
Since the data processing is the same as for the decision tree, only the model part is shown here.
from sklearn import svm

classes = [' <=50K', ' >50K']
clf = svm.SVC(kernel = 'linear')
clf = clf.fit(train_dataset, train_target)
pred = clf.predict(test_dataset)
score = clf.score(test_dataset, test_target)
print(score)
print(pred)
with open('Predict/SupportVectorMachine.csv', 'w', newline = '') as file :
    writer = csv.writer(file)
    writer.writerow(['id', 'result_pred'])
    for i, result in enumerate(pred) :
        writer.writerow([i, classes[result]])
Random forests
from sklearn.ensemble import RandomForestClassifier

classes = [' <=50K', ' >50K']
rf = RandomForestClassifier(n_estimators = 100, random_state = 0)
rf = rf.fit(train_dataset, train_target)
score = rf.score(test_dataset, test_target)
print(score)
pred = rf.predict(test_dataset)
print(pred)
with open('Predict/RandomForest.csv', 'w', newline = '') as file :
    writer = csv.writer(file)
    writer.writerow(['id', 'result_pred'])
    for i, result in enumerate(pred) :
        writer.writerow([i, classes[result]])
3. Result analysis
According to the predictions on the Adult test set, the accuracies of the deep learning model, the decision tree, the support vector machine and the random forest are 0.834, 0.834, 0.834 and 0.817 respectively; the four models perform almost identically. Possible reasons why the accuracy is not higher include:
1. The models are not robust enough.
2. The dataset contains many discrete attributes, and the data becomes highly sparse after one-hot encoding.
Solutions:
1. Search the hyperparameters again and increase model complexity, while watching out for overfitting.
2. Consider dimensionality reduction, or an encoding other than one-hot.
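As a sketch of the second idea, a sparse one-hot matrix can be compressed with truncated SVD before training (the matrix here is a random stand-in for the real features, and the number of components is illustrative):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Stand-in for a sparse one-hot feature matrix: 1000 samples, 102 mostly-zero features
X = (rng.random((1000, 102)) > 0.9).astype(np.float64)

# Project onto the top 30 components to obtain a dense, lower-dimensional representation
svd = TruncatedSVD(n_components = 30, random_state = 0)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (1000, 30)
```

The reduced matrix can then be fed to any of the four models in place of the raw one-hot features.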
All the code is available from my code repository; stars are welcome.
Finally, if this walkthrough of processing the Adult dataset and implementing the models helped you, please give it a thumbs up.