1. Data sets

Introduction to the dataset

The Adult dataset is a classic data mining dataset extracted from the 1994 U.S. Census database, which is why it is also called the "Census Income" dataset. It contains 48,842 records, of which 23.93% have an annual income greater than 50K. The data are split into 32,561 training records and 16,281 test records, and the class variable is whether annual income exceeds 50K.

Attribute variables cover 14 pieces of information, including age, work class, education background and occupation; 8 of them are discrete categorical variables and the other 6 are continuous numerical variables. This is a classification dataset used to predict whether annual income exceeds $50K. The dataset can be downloaded from the UCI Machine Learning Repository.

Data set preprocessing and analysis

The pandas and NumPy libraries are used to read the data and check for missing values.

import pandas as pd 
import numpy as np

df = pd.read_csv('adult.csv', header = None, names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'])
df.head()
df.info()

Although the output above reports no missing values, the missing values are actually recorded as "?"; info() only detects NaT or NaN as missing. Note that there is a space before the question mark.

df.apply(lambda x : np.sum(x == " ?"))

workclass (discrete) has 1,836 missing values, occupation (discrete) has 1,843, and native-country has 583. Discrete attributes are usually filled with the mode, but the missing markers must first be converted to NaN or NaT. At the same time, since income takes only two values, income >50K is replaced by 1 and income <=50K by 0.

df.replace("?", pd.NaT, inplace = True)
df.replace(" >50K".1, inplace = True)
df.replace(" <=50K".0, inplace = True)
trans = {'workclass' : df['workclass'].mode()[0].'occupation' : df['occupation'].mode()[0].'native-country' : df['native-country'].mode()[0]}
df.fillna(trans, inplace = True)
df.describe()

As the table above shows, more than 75% of people have neither capital gains nor capital losses, so these two columns are treated as irrelevant attributes. The fnlwgt column (a census sampling weight) is also not useful here, so these three columns are dropped and we focus on the remaining data.

df.drop('fnlwgt', axis = 1, inplace = True)
df.drop('capital-gain', axis = 1, inplace = True)
df.drop('capital-loss', axis = 1, inplace = True)
df.head()

import matplotlib.pyplot as plt

plt.scatter(df["income"], df["age"])
plt.grid(True, which = "major", axis = 'y')
plt.title("Income distribution by age (1 is >50K)")
plt.show()

It can be seen that, among middle-aged and older people, fewer earn >50K than earn <=50K.
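
A quick numerical check of the same relationship (just a sketch, assuming income has already been mapped to 0/1 as above):

# summary statistics of age for each income class (0 means <=50K, 1 means >50K)
print(df.groupby("income")["age"].describe())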

df["workclass"].value_counts()

income_0 = df["workclass"][df["income"] = =0].value_counts()
income_1 = df["workclass"][df["income"] = =1].value_counts()
df1 = pd.DataFrame({" >50K" : income_1, " <=50K" : income_0})
df1.plot(kind = 'bar', stacked = True)
plt.title("income distribution by Workclass")
plt.xlabel("workclass")
plt.ylabel("number of person")
plt.show()

Observe the effect of work class on annual income. People whose work class is Private are the most numerous in both income categories, but Self-emp-inc has the highest proportion of >50K earners relative to <=50K.
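
The proportion claim can be checked directly; a sketch (again assuming income is coded 0/1):

# mean of the 0/1 income column per workclass = share of >50K earners in that class
print(df.groupby("workclass")["income"].mean().sort_values(ascending = False))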

df1 = df["hours-per-week"].groupby(df["workclass"]).agg(['mean'.'max'.'min'])
df1.sort_values(by = 'mean', ascending = False)
df1

Group the hours worked per week by work class, compute each group's mean, maximum and minimum, and sort by the mean. Federal-gov workers work the most hours on average, but do not have the highest share of high earners.

income_0 = df["education"][df["income"] = =0].value_counts()
income_1 = df["education"][df["income"] = =1].value_counts()
df1 = pd.DataFrame({" >50K" : income_1, " <=50K" : income_0})
df1.plot(kind = 'bar', stacked = True)
plt.title("income distribution by Workclass")
plt.xlabel("education")
plt.ylabel("number of person")
plt.show()

Looking at the effect of education on annual income, for Bachelors the counts of the two income classes are relatively close, giving one of the largest ratios of >50K to <=50K.

income_0 = df["education-num"][df["income"] = =0]
income_1 = df["education-num"][df["income"] = =1]
df1 = pd.DataFrame({' >50K' : income_1, ' <=50K' : income_0})
df1.plot(kind = 'kde')
plt.title("education of income")
plt.xlabel("education-num")

The kernel density plot shows how years of education relate to income. Around the median number of education years, the density for income >50K is lower than for <=50K; to the right of the median it is the opposite; elsewhere the two are roughly balanced.

# fig, ([[ax1, ax2, ax3], [ax4, ax5, ax6]]) = plt.subplots(2, 3, figsize=(15, 10))
fig = plt.figure(figsize = (15, 10))

ax1 = fig.add_subplot(231) 
income_0 = df[df["race"] = =' White'] ["relationship"][df["income"] = =0].value_counts()
income_1 = df[df["race"] = =' White'] ["relationship"][df["income"] = =1].value_counts()
df1 = pd.DataFrame({' >50K' : income_1, ' <=50K' : income_0})
df1.plot(kind = 'bar', ax = ax1)
ax1.set_ylabel('number of person')
ax1.set_title('income of relationship by race_White')

ax2 = fig.add_subplot(232) 
income_0 = df[df["race"] = =' Black'] ["relationship"][df["income"] = =0].value_counts()
income_1 = df[df["race"] = =' Black'] ["relationship"][df["income"] = =1].value_counts()
df2 = pd.DataFrame({' >50K' : income_1, ' <=50K' : income_0})
df2.plot(kind = 'bar', ax = ax2)
ax2.set_ylabel('number of person')
ax2.set_title('income of relationship by race_Black')

ax3 = fig.add_subplot(233) 
income_0 = df[df["race"] = =' Asian-Pac-Islander'] ["relationship"][df["income"] = =0].value_counts()
income_1 = df[df["race"] = =' Asian-Pac-Islander'] ["relationship"][df["income"] = =1].value_counts()
df3 = pd.DataFrame({' >50K' : income_1, ' <=50K' : income_0})
df3.plot(kind = 'bar', ax = ax3)
ax3.set_ylabel('number of person')
ax3.set_title('income of relationship by race_Asian-Pac-Islander')

ax4 = fig.add_subplot(234) 
income_0 = df[df["race"] = =' Amer-Indian-Eskimo'] ["relationship"][df["income"] = =0].value_counts()
income_1 = df[df["race"] = =' Amer-Indian-Eskimo'] ["relationship"][df["income"] = =1].value_counts()
df4 = pd.DataFrame({' >50K' : income_1, ' <=50K' : income_0})
df4.plot(kind = 'bar', ax = ax4)
ax4.set_ylabel('number of person')
ax4.set_title('income of relationship by race_Amer-Indian-Eskimo')

ax5 = fig.add_subplot(235) 
income_0 = df[df["race"] = =' Other'] ["relationship"][df["income"] = =0].value_counts()
income_1 = df[df["race"] = =' Other'] ["relationship"][df["income"] = =1].value_counts()
df5 = pd.DataFrame({' >50K' : income_1, ' <=50K' : income_0})
df5.plot(kind = 'bar', ax = ax5)
ax5.set_ylabel('number of person')
ax5.set_title('income of relationship by race_Other')

plt.tight_layout()

This mainly looks at the income distribution over the household relationship roles within each race.
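
The five nearly identical blocks above could also be written as a loop. A sketch of that refactor (it assumes the same leading-space category strings used in the original code):

races = [' White', ' Black', ' Asian-Pac-Islander', ' Amer-Indian-Eskimo', ' Other']
fig, axes = plt.subplots(2, 3, figsize = (15, 10))

for ax, race in zip(axes.flatten(), races) :
    income_0 = df[df["race"] == race]["relationship"][df["income"] == 0].value_counts()
    income_1 = df[df["race"] == race]["relationship"][df["income"] == 1].value_counts()
    pd.DataFrame({' >50K' : income_1, ' <=50K' : income_0}).plot(kind = 'bar', ax = ax)
    ax.set_ylabel('number of person')
    ax.set_title('income of relationship by race_' + race.strip())

axes.flatten()[-1].set_visible(False)  # only five races, so hide the unused sixth subplot
plt.tight_layout()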

# fig, ([[ax1, ax2, ax3], [ax4, ax5, ax6]]) = plt.subplots(2, 3, figsize=(10, 5))
fig = plt.figure()

ax1 = fig.add_subplot(121) 
income_0 = df[df["sex"] = =' Male'] ["occupation"][df["income"] = =0].value_counts()
income_1 = df[df["sex"] = =' Male'] ["occupation"][df["income"] = =1].value_counts()
df1 = pd.DataFrame({' >50K' : income_1, ' <=50K' : income_0})
df1.plot(kind = 'bar', ax = ax1)
ax1.set_ylabel('number of person')
ax1.set_title('income of occupation by sex_Male')

ax2 = fig.add_subplot(122) 
income_0 = df[df["sex"] = =' Female'] ["occupation"][df["income"] = =0].value_counts()
income_1 = df[df["sex"] = =' Female'] ["occupation"][df["income"] = =1].value_counts()
df2 = pd.DataFrame({' >50K' : income_1, ' <=50K' : income_0})
df2.plot(kind = 'bar', ax = ax2)
ax2.set_ylabel('number of person')
ax2.set_title('income of occupation by sex_Female')

plt.tight_layout()

This shows the income distribution over occupations by gender. More Exec-managerial men earn >50K than <=50K, while for women the reverse is true.

df_object_col = [col for col in df.columns if df[col].dtype.name == 'object']
df_int_col = [col for col in df.columns if df[col].dtype.name != 'object' and col != 'income']
target = df["income"]
dataset = pd.concat([df[df_int_col], pd.get_dummies(df[df_object_col])], axis = 1)
dataset.head()

The column data types are identified first, the non-numerical columns are one-hot encoded, and the two parts are then concatenated. Finally, income is separated from the other columns to serve as the label for the training and test sets.
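
In this post the test set comes from a separate file, but if everything lived in a single file, the 8:2 split used later could also be produced with scikit-learn's train_test_split. A sketch of that alternative (not the exact slicing used below):

from sklearn.model_selection import train_test_split

# hold out 20% for validation; stratify keeps the 0/1 income proportions similar in both parts
X_train, X_val, y_train, y_val = train_test_split(dataset, target, test_size = 0.2, random_state = 0, stratify = target)
print(X_train.shape, X_val.shape)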

2. Four models are used to predict on the above dataset

Deep learning

Importing related packages

import pandas as pd 
import numpy as np
import os 
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import csv
from torch.utils.tensorboard import SummaryWriter
from torch.utils.data import Dataset, DataLoader

For data preprocessing, note that the training and test sets may have different shapes after one-hot encoding, so their columns must be aligned: categories missing from the test set are added as all-zero columns. Because an all-zero column has zero standard deviation, the normalization step turns it into NaN, so NaN columns must be reset to zero; otherwise the network outputs nothing but NaN at prediction time. In this experiment the training data is split 8:2, with the last 20% used as a validation set, and normalizing the data noticeably improves the results.

def add_missing_columns(d, columns) :
    missing_col = set(columns) - set(d.columns)
    for col in missing_col :
        d[col] = 0
        
def fix_columns(d, columns) :  
    add_missing_columns(d, columns)
    assert(set(columns) - set(d.columns) == set())
    d = d[columns]
    return d

def data_process(df, model) :
    df.replace("?", pd.NaT, inplace = True)
    if model == 'train' :
        df.replace(" >50K".1, inplace = True)
        df.replace(" <=50K".0, inplace = True)
    if model == 'test':
        df.replace(" >50K.".1, inplace = True)
        df.replace(" <=50K.".0, inplace = True)
        
    trans = {'workclass' : df['workclass'].mode()[0], 'occupation' : df['occupation'].mode()[0], 'native-country' : df['native-country'].mode()[0]}
    df.fillna(trans, inplace = True)
    df.drop('fnlwgt', axis = 1, inplace = True)
    df.drop('capital-gain', axis = 1, inplace = True)
    df.drop('capital-loss', axis = 1, inplace = True)

    df_object_col = [col for col in df.columns if df[col].dtype.name == 'object']
    df_int_col = [col for col in df.columns if df[col].dtype.name != 'object' and col != 'income']
    target = df["income"]
    dataset = pd.concat([df[df_int_col], pd.get_dummies(df[df_object_col])], axis = 1)

    return target, dataset
        

class Adult_data(Dataset) :
    def __init__(self, model) :
        super(Adult_data, self).__init__()
        self.model = model
        
        df_train = pd.read_csv('adult.csv', header = None, names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'])
        df_test = pd.read_csv('data.test', header = None, skiprows = 1, names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'])
        
        train_target, train_dataset = data_process(df_train, 'train')
        test_target, test_dataset = data_process(df_test, 'test')
        
        # align the one-hot encoded columns of the test set with the training set
        test_dataset = fix_columns(test_dataset, train_dataset.columns)
# print(df["income"])
        train_dataset = train_dataset.apply(lambda x : (x - x.mean()) / x.std())
        test_dataset = test_dataset.apply(lambda x : (x - x.mean()) / x.std())
# print(train_dataset['native-country_ Holand-Netherlands'])
        
        train_target, test_target = np.array(train_target), np.array(test_target)
        train_dataset, test_dataset = np.array(train_dataset, dtype = np.float32), np.array(test_dataset, dtype = np.float32)
        if model == 'test' :
            isnan = np.isnan(test_dataset)
            test_dataset[np.where(isnan)] = 0.0
# print(test_dataset[ : , 75])
        
        if model == 'test':
            self.target = torch.tensor(test_target, dtype = torch.int64)
            self.dataset = torch.FloatTensor(test_dataset)
        else :
            # the first eighty percent of the data is used as the training set, the rest as the validation set
            if model == 'train' : 
                self.target = torch.tensor(train_target, dtype = torch.int64)[ : int(len(train_target) * 0.8)]
                self.dataset = torch.FloatTensor(train_dataset)[ : int(len(train_dataset) * 0.8)]
            else :
                self.target = torch.tensor(train_target, dtype = torch.int64)[int(len(train_target) * 0.8) : ] 
                self.dataset = torch.FloatTensor(train_dataset)[int(len(train_dataset) * 0.8) : ]
        print(self.dataset.shape, self.target.dtype)
        
    def __getitem__(self, item) :
        return self.dataset[item], self.target[item]
    
    def __len__(self) :
        return len(self.dataset)
    
train_dataset = Adult_data(model = 'train')
val_dataset = Adult_data(model = 'val')
test_dataset = Adult_data(model = 'test')

train_loader = DataLoader(train_dataset, batch_size = 64, shuffle = True, drop_last = False)
val_loader = DataLoader(val_dataset, batch_size = 64, shuffle = False, drop_last = False)
test_loader = DataLoader(test_dataset, batch_size = 64, shuffle = False, drop_last = False)

Build the network. Since this is a simple binary classification task, a small multi-layer perceptron with two hidden layers is used, and a softmax is applied to the outputs at the end.

class Adult_Model(nn.Module) :
    def __init__(self) :
        super(Adult_Model, self).__init__()
        self.net = nn.Sequential(nn.Linear(102, 64), 
                                nn.ReLU(), 
                                nn.Linear(64, 32), 
                                nn.ReLU(),
                                nn.Linear(32, 2))

    def forward(self, x) :
        out = self.net(x) 
        # print(out)
        return F.softmax(out, dim = 1)

Training and validation. After each epoch the validation loss is compared with the best value so far; whenever val_loss improves, the model is saved, and this continues until the end of training.

device = torch.device('cuda' if torch.cuda.is_available() else "cpu")
model = Adult_Model().to(device)
optimizer = optim.SGD(model.parameters(), lr = 0.001, momentum = 0.9)
criterion = nn.CrossEntropyLoss()
max_epoch = 30
classes = [' <=50K', ' >50K']
mse_loss = 1000000
os.makedirs('MyModels', exist_ok = True)
writer = SummaryWriter(log_dir = 'logs')


for epoch in range(max_epoch) :
    
    train_loss = 0.0
    train_acc = 0.0
    model.train()
    for x, label in train_loader :
        x, label = x.to(device), label.to(device)
        optimizer.zero_grad()
        
        out = model(x)
        loss = criterion(out, label)
        train_loss += loss.item()
        loss.backward()
        
        _, pred = torch.max(out, 1)
# print(pred)
        num_correct = (pred == label).sum().item()
        acc = num_correct / x.shape[0]
        train_acc += acc
        optimizer.step()
        
    print(f'epoch : {epoch + 1}, train_loss : {train_loss / len(train_loader.dataset)}, train_acc : {train_acc / len(train_loader)}')
    writer.add_scalar('train_loss', train_loss / len(train_loader.dataset), epoch)
    
    with torch.no_grad() :
        total_loss = []
        model.eval()
        for x, label in val_loader :
            x, label = x.to(device), label.to(device)
            out = model(x)
            loss = criterion(out, label)
            total_loss.append(loss.item())
        
        val_loss = sum(total_loss) / len(total_loss)
    
    if val_loss < mse_loss :
        mse_loss = val_loss 
        torch.save(model.state_dict(), 'MyModels/Deeplearning_Model.pth')

del model

Load the best model saved during training, run prediction on the test set, and save the results.

best_model = Adult_Model().to(device)
ckpt = torch.load('MyModels/Deeplearning_Model.pth', map_location='cpu')
best_model.load_state_dict(ckpt)

test_loss = 0.0
test_acc = 0.0
best_model.eval()
result = []

for x, label in test_loader :
    x, label = x.to(device), label.to(device)
    
    out = best_model(x)
    loss = criterion(out, label)
    test_loss += loss.item()
    _, pred = torch.max(out, dim = 1)
    result.append(pred.detach())
    num_correct = (pred == label).sum().item()
    acc = num_correct / x.shape[0]
    test_acc += acc

print(f'test_loss : {test_loss / len(test_loader.dataset)}, test_acc : {test_acc / len(test_loader)}')

result = torch.cat(result, dim = 0).cpu().numpy()
with open('Predict/Deeplearning.csv', 'w', newline = '') as file :
    writer = csv.writer(file)
    writer.writerow(['id', 'pred_result'])
    for i, pred in enumerate(result) :
        writer.writerow([i, classes[pred]])

The accuracy rate of 0.834 is pretty good.

Decision tree

Data processing is basically the same as for deep learning, but the return values are different.

import pandas as pd 
import numpy as np
import csv
import graphviz
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier, export_graphviz

def add_missing_columns(d, columns) :
    missing_col = set(columns) - set(d.columns)
    for col in missing_col :
        d[col] = 0
        
def fix_columns(d, columns) :  
    add_missing_columns(d, columns)
    assert(set(columns) - set(d.columns) == set())
    d = d[columns]
    
    return d

def data_process(df, model) :
    df.replace("?", pd.NaT, inplace = True)
    if model == 'train' :
        df.replace(" >50K".1, inplace = True)
        df.replace(" <=50K".0, inplace = True)
    if model == 'test':
        df.replace(" >50K.".1, inplace = True)
        df.replace(" <=50K.".0, inplace = True)
    trans = {'workclass' : df['workclass'].mode()[0], 'occupation' : df['occupation'].mode()[0], 'native-country' : df['native-country'].mode()[0]}
    df.fillna(trans, inplace = True)

    df.drop('fnlwgt', axis = 1, inplace = True)
    df.drop('capital-gain', axis = 1, inplace = True)
    df.drop('capital-loss', axis = 1, inplace = True)
# print(df)

    df_object_col = [col for col in df.columns if df[col].dtype.name == 'object']
    df_int_col = [col for col in df.columns if df[col].dtype.name != 'object' and col != 'income']
    target = df["income"]
    dataset = pd.concat([df[df_int_col], pd.get_dummies(df[df_object_col])], axis = 1)

    return target, dataset
        

def Adult_data() :

    df_train = pd.read_csv('adult.csv', header = None, names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'])
    df_test = pd.read_csv('data.test', header = None, skiprows = 1, names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'])

    train_target, train_dataset = data_process(df_train, 'train')
    test_target, test_dataset = data_process(df_test, 'test')
    # align the one-hot encoded columns of the test set with the training set
    test_dataset = fix_columns(test_dataset, train_dataset.columns)
    columns = train_dataset.columns
# print(df["income"])

    train_target, test_target = np.array(train_target), np.array(test_target)
    train_dataset, test_dataset = np.array(train_dataset), np.array(test_dataset)
        
    return train_dataset, train_target, test_dataset, test_target, columns

train_dataset, train_target, test_dataset, test_target, columns = Adult_data()
print(train_dataset.shape, test_dataset.shape, train_target.shape, test_target.shape)

The GridSearchCV class can be used to perform an exhaustive search over specified parameter values of a classifier; here it is used to search for the optimal depth of the decision tree.

# params = {'max_depth' : range(1, 20)}
# best_clf = GridSearchCV(DecisionTreeClassifier(criterion = 'entropy', random_state = 20), param_grid = params)
# best_clf = best_clf.fit(train_dataset, train_target)
# print(best_clf.best_params_)

A decision tree is used for classification, with entropy as the splitting criterion, a maximum depth of 8 from the previous step, and a minimum of 5 samples required to split a node; the prediction results are saved.

# clf = DecisionTreeClassifier() score:0.7836742214851667
classes = [' <=50K', ' >50K']
clf = DecisionTreeClassifier(criterion = 'entropy', max_depth = 8, min_samples_split = 5)
clf = clf.fit(train_dataset, train_target)
pred = clf.predict(test_dataset)
print(pred)
score = clf.score(test_dataset, test_target)
# pred = clf.predict_proba(test_dataset)
print(score)
# print(np.argmax(pred, axis = 1))

with open('Predict/DecisionTree.csv', 'w', newline = '') as file :
    writer = csv.writer(file)
    writer.writerow(['id', 'result_pred'])
    for i, result in enumerate(pred) :
        writer.writerow([i, classes[result]])

The result is 0.835, similar to the deep learning model. Next, visualize the decision tree structure.

dot_data = export_graphviz(clf, out_file = None, feature_names = columns, class_names = classes, filled = True, rounded = True)
graph = graphviz.Source(dot_data)
graph

Support vector machine

Since the data processing is the same as for the decision tree, only the model part is shown here.

from sklearn import svm
classes = [' <=50K', ' >50K']
clf = svm.SVC(kernel = 'linear')
clf = clf.fit(train_dataset, train_target)
pred = clf.predict(test_dataset)
score = clf.score(test_dataset, test_target)
print(score)
print(pred)

with open('Predict/SupportVectorMachine.csv', 'w', newline = '') as file :
    writer = csv.writer(file)
    writer.writerow(['id', 'result_pred'])
    for i, result in enumerate(pred) :
        writer.writerow([i, classes[result]])

Random forests

from sklearn.ensemble import RandomForestClassifier

classes = [' <=50K', ' >50K']
rf = RandomForestClassifier(n_estimators = 100, random_state = 0)
rf = rf.fit(train_dataset, train_target)
score = rf.score(test_dataset, test_target)
print(score)

pred = rf.predict(test_dataset)
print(pred)

with open('Predict/RandomForest.csv', 'w', newline = '') as file :
    writer = csv.writer(file)
    writer.writerow(['id', 'result_pred'])
    for i, result in enumerate(pred) :
        writer.writerow([i, classes[result]])

3. Result analysis

According to the prediction results on the Adult test set, the accuracies of the deep learning model, decision tree, support vector machine and random forest are 0.834, 0.834, 0.834 and 0.817 respectively, so the four models perform almost the same. Possible reasons why the accuracy is not very high include:

1. The model is not robust enough.

2. The dataset contains many discrete attributes, and the data become highly sparse after one-hot encoding.

Solutions:

1. A further hyperparameter search can be done to increase model complexity, paying attention to overfitting in the process.

2. Dimensionality reduction can be considered instead of relying on one-hot encoding alone (a sketch follows below).
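
As an illustration of point 2, the sparse one-hot features could be compressed with PCA before fitting a model. This is only a sketch (the 30 components are an arbitrary choice, not something tuned or evaluated in this post), reusing the train_dataset/test_dataset arrays from the decision tree section:

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# project the one-hot encoded features down to 30 components
pca = PCA(n_components = 30)
train_reduced = pca.fit_transform(train_dataset)
test_reduced = pca.transform(test_dataset)

rf = RandomForestClassifier(n_estimators = 100, random_state = 0)
rf.fit(train_reduced, train_target)
print(rf.score(test_reduced, test_target))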

All the code is available from my GitHub repository; a star is welcome.

Finally, if this walkthrough of the Adult dataset processing and the model implementations taught you something, please give it a thumbs up.