The source code
https://gitee.com/pingfanrenbiji/Credit-Card-Score
Open the project in Jupyter
Import libraries
# NumPy is an array-based library for numerical computing
import numpy as np
# pandas is a third-party library that provides high-performance, easy-to-use data structures and analysis tools
import pandas as pd
# Matplotlib is a plotting library for Python; it can draw all kinds of figures, from simple scatter plots and sine curves to three-dimensional graphics
import matplotlib.pyplot as plt
# Seaborn is a visualization library built on Matplotlib; it is easier to use and its default styles look more modern
import seaborn as sns
# Render Matplotlib figures inline in the Jupyter notebook
%matplotlib inline
Read data dictionary
What each variable means:
SeriousDlqin2yrs: target label, good or bad customer (whether the borrower became 90 days or more past due within two years)
RevolvingUtilizationOfUnsecuredLines: total balance on credit cards and personal lines of credit, excluding real estate and installment debt such as car loans, divided by the sum of credit limits
age: age of the borrower in years
NumberOfTime30-59DaysPastDueNotWorse: number of times in the past two years that the borrower was 30 to 59 days past due, but no worse
DebtRatio: monthly debt payments, alimony, and living costs divided by gross monthly income
MonthlyIncome: monthly income
NumberOfOpenCreditLinesAndLoans: number of open loans (installment loans such as a car loan or mortgage) and lines of credit (such as credit cards)
NumberOfTimes90DaysLate: number of times the borrower was 90 days or more past due
NumberRealEstateLoansOrLines: number of mortgage and real estate loans, including home equity lines of credit
NumberOfTime60-89DaysPastDueNotWorse: number of times in the past two years that the borrower was 60 to 89 days past due, but no worse
NumberOfDependents: number of dependents in the family, excluding the borrower
REAL data type description
The REAL data type holds single-precision floating point numbers
The REAL value takes four bytes of storage
Values saved as REAL are accurate to seven significant digits
Reading training data
df = pd.read_csv("./GiveMeSomeCredit/cs-training.csv").drop("Unnamed: 0", axis=1)
axis=1 means that a column is dropped; here the "Unnamed: 0" index column is removed
Look at the first five rows of data
View missing values and outliers
df.info() shows that there are 150,000 records in total. MonthlyIncome and NumberOfDependents have missing values: MonthlyIncome is missing 29,731 values and NumberOfDependents is missing 3,924.
Summary statistics of the data set
df.describe().T.assign(missing_rate = df.apply(lambda x : (len(x)-x.count())/float(len(x))))
describe() reports the count, mean, max/min, standard deviation, and the first, second, and third quartiles of each column; assign() then adds each column's missing rate
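To make the missing-rate column concrete, here is a minimal sketch on a made-up toy frame (the values are invented for illustration and are not taken from cs-training.csv). It uses isna().mean() as a shorter equivalent of the lambda above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the credit data (values are made up for illustration)
toy = pd.DataFrame({
    "age": [45, 40, 38, 30],
    "MonthlyIncome": [9120.0, np.nan, 3042.0, np.nan],
    "NumberOfDependents": [2.0, 1.0, np.nan, 0.0],
})

# isna().mean() is the fraction of missing entries in each column
missing_rate = toy.isna().mean()

# Same layout as in the article: one row per variable plus a missing_rate column
summary = toy.describe().T.assign(missing_rate=missing_rate)
print(summary[["count", "missing_rate"]])
```

assign() aligns the Series on the row index, which after the transpose is the column names, so each variable gets its own missing rate.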
Missing value handling:
1. MonthlyIncome has many missing values, so deleting those rows outright is not appropriate. The missing values are instead filled based on the relationship with the other variables, using a random forest.
2. NumberOfDependents has few missing values, with limited effect on the overall sample, so those rows can simply be deleted here; a fill strategy could be tried later.
Random forest function
- Principle analysis:
- Parameter description:
random_state: random seed
n_estimators: the number of trees (weak learners) in the forest. If n_estimators is too small the model tends to underfit; if it is too large the computation becomes expensive, and beyond a certain number adding more trees brings little improvement, so a moderate value is generally chosen. The default is 100 (scikit-learn 0.22 and later).
max_depth: the maximum depth of each decision tree
n_jobs: the number of CPU cores to use; -1 means use all cores
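The article does not show the fill function itself, so the following is only a sketch of how a random forest fill could look. It assumes scikit-learn's RandomForestRegressor; the helper name fill_income_with_rf, the parameter values, the choice to leave the label column out of the predictors, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def fill_income_with_rf(df, target="MonthlyIncome"):
    """Hypothetical helper: fill missing `target` values with random forest predictions."""
    # Predictors: fully observed columns only; the label column is left out (an assumption)
    features = [c for c in df.columns
                if c not in (target, "SeriousDlqin2yrs") and df[c].notna().all()]
    known = df[df[target].notna()]
    unknown = df[df[target].isna()]
    if unknown.empty:
        return df.copy()
    # Parameter values are illustrative, not the article's
    model = RandomForestRegressor(random_state=0, n_estimators=200,
                                  max_depth=3, n_jobs=-1)
    model.fit(known[features], known[target])
    filled = df.copy()
    filled.loc[unknown.index, target] = model.predict(unknown[features])
    return filled

# Synthetic data: income depends loosely on age, first 20 incomes are removed
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(21, 80, size=200),
    "DebtRatio": rng.random(200),
})
toy["MonthlyIncome"] = 100.0 * toy["age"] + rng.normal(0, 50, size=200)
toy.loc[toy.index[:20], "MonthlyIncome"] = np.nan

filled = fill_income_with_rf(toy)
print(filled["MonthlyIncome"].isna().sum())
```

Rows that already had an income are left untouched; only the missing entries receive predictions.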
Delete missing values and duplicate values
# dropna() and drop_duplicates() return new DataFrames, so assign the results back
df = df.dropna()
df = df.drop_duplicates()
Outlier handling
An outlier is a value that deviates from the majority of the sampled data, usually a measurement more than two standard deviations away from the mean
Outlier detection methods are used to find such values
- Draw a box plot
df["RevolvingUtilizationOfUnsecuredLines"].plot(kind="box", grid=True)
grid=True displays the grid
- Replace values greater than 2 with 2
revNew = []
for val in df.RevolvingUtilizationOfUnsecuredLines:
    if val <= 2:
        revNew.append(val)
    else:
        revNew.append(2.)
# Write the capped values back to the column
df.RevolvingUtilizationOfUnsecuredLines = revNew
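The same capping can also be written without an explicit loop. This is a small sketch using pandas' Series.clip on made-up values; it is an alternative to the loop above, not the article's own code.

```python
import pandas as pd

# Made-up utilization values for illustration
util = pd.Series([0.1, 0.95, 2.0, 3.7, 50708.0])

# clip(upper=2) leaves values <= 2 untouched and replaces larger ones with 2
capped = util.clip(upper=2)
print(capped.tolist())  # [0.1, 0.95, 2.0, 2.0, 2.0]
```

On the real data this would read df["RevolvingUtilizationOfUnsecuredLines"] = df["RevolvingUtilizationOfUnsecuredLines"].clip(upper=2).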
Plot again
Box plot of age
df.age.plot.box(grid=True)
The age column contains a value of 0, which is clearly an outlier, so those rows are dropped
df = df[df["age"] > 0]
Box plots of the three past-due columns
df.boxplot(column=["NumberOfTime30-59DaysPastDueNotWorse", "NumberOfTime60-89DaysPastDueNotWorse", "NumberOfTimes90DaysLate"], rot=30)
rot: int or float, default 0. The rotation angle (in degrees) of the tick labels
The box plots above show that NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse, and NumberOfTimes90DaysLate each have two outliers
View specific outliers
df["NumberOfTime30-59DaysPastDueNotWorse"].unique()
df["NumberOfTime60-89DaysPastDueNotWorse"].unique()
df["NumberOfTimes90DaysLate"].unique()
You can see that 96 and 98 are the outliers
- Replace the outliers with the median of the given column
def replaceOutlier(data):
    New = []
    med = data.median()
    for val in data:
        if val == 98 or val == 96:
            New.append(med)
        else:
            New.append(val)
    return New
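As a usage illustration, the replacement can also be done in a vectorized way. The sketch below uses pandas' isin and mask on an invented column; like replaceOutlier, it computes the median before the 96 and 98 values are replaced.

```python
import pandas as pd

# Made-up counts for illustration; 96 and 98 play the role of the outliers
toy = pd.DataFrame({"NumberOfTimes90DaysLate": [0, 1, 0, 98, 2, 96, 0]})
col = "NumberOfTimes90DaysLate"

# Median of the column as it is, outliers included
med = toy[col].median()

# mask() replaces the entries where the condition is True
toy[col] = toy[col].mask(toy[col].isin([96, 98]), med)
print(toy[col].tolist())
```

With only a handful of 96 and 98 entries among many rows, including them in the median barely changes it, which is presumably why the article does not filter them out first.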
After removing the outliers, look at the boxplot of the three columns
View the boxplot of DebtRatio
Use the median absolute deviation (MAD) for outlier detection
The first parameter is the column data
The second parameter is the threshold
If the column data is a list, it is converted to a NumPy array
shape returns the number of rows and columns of the data; len(x) also returns the number of rows
Take the median of the data
Subtract the median from each value
Then take the median of those absolute differences
The outlier score of each value is norm.ppf(0.75) * (absolute difference between the value and the median) / (median of the absolute differences)
Return True if the score is greater than the threshold
Return False otherwise
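The function body is not shown in the article, so here is a minimal one-dimensional sketch of the steps just described. Several details are assumptions: the default threshold of 3.5, the guard for a zero MAD, and the use of the standard library's NormalDist().inv_cdf(0.75) in place of scipy's norm.ppf(0.75) (both give roughly 0.6745).

```python
import numpy as np
from statistics import NormalDist

def mad_based_outlier(points, threshold=3.5):
    """Flag values whose MAD-based score exceeds the threshold (threshold default is an assumption)."""
    points = np.asarray(points, dtype=float)   # lists and Series become arrays
    median = np.median(points)
    abs_dev = np.abs(points - median)           # distance of each value from the median
    mad = np.median(abs_dev)                    # median absolute deviation
    if mad == 0:
        # Guard added for this sketch: no spread, so nothing is flagged
        return np.zeros(points.shape, dtype=bool)
    scale = NormalDist().inv_cdf(0.75)          # same constant as norm.ppf(0.75)
    score = scale * abs_dev / mad
    return score > threshold

# Made-up debt ratios: five ordinary values and one extreme one
values = [0.2, 0.3, 0.35, 0.4, 0.5, 60.0]
flags = mad_based_outlier(values)
print(flags.tolist())  # [False, False, False, False, False, True]
```

Because the score is built from medians rather than the mean and standard deviation, a single extreme value cannot mask itself by inflating the spread.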
The smallest detected outlier is used to replace the outliers
minUpperBound = min([val for (val, out) in zip(df.DebtRatio, mad_based_outlier(df.DebtRatio)) if out == True])
mad_based_outlier(df.DebtRatio) returns True for every value whose score exceeds the threshold, that is, for every outlier
Collect all the outliers and take their minimum to get the smallest outlier
- Replace the outliers
Every value greater than the smallest outlier is replaced by the smallest outlier
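The replacement code is not shown in the article; a possible sketch, using invented values and a hand-written outlier mask in place of the output of mad_based_outlier, is:

```python
import pandas as pd

# Made-up debt ratios and a hand-made mask standing in for mad_based_outlier(...)
debt = pd.Series([0.2, 0.3, 0.35, 0.4, 0.5, 60.0, 5000.0])
is_outlier = pd.Series([False, False, False, False, False, True, True])

# Smallest value among the flagged outliers
min_upper_bound = debt[is_outlier].min()

# Everything above that bound is pulled down to it
capped = debt.clip(upper=min_upper_bound)
print(capped.tolist())  # [0.2, 0.3, 0.35, 0.4, 0.5, 60.0, 60.0]
```

This assumes all flagged outliers lie on the high side, which fits a non-negative column such as DebtRatio.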
View the data in this column
Draw a box plot
Check the monthly income data
Draw a box plot of monthly income
In the same way, find the smallest outlier and replace every larger value with it
The remaining columns are handled similarly and are not repeated here
Data splitting
The data is divided into a training set and a test set
- Import libraries
from sklearn.model_selection import train_test_split
- Split into training data and test data
Y = df["SeriousDlqin2yrs"]
X = df.iloc[:, 1:]
# Test and training data are split at a 3:7 ratio; a fixed random_state makes the split identical on every run
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=123)
train = pd.concat([Y_train, X_train], axis=1)
test = pd.concat([Y_test, X_test], axis=1)
train.to_csv('TrainData.csv',index=False)
test.to_csv('TestData.csv',index=False)
Exploratory analysis
Histograms, scatter plots, and box plots are generally used for the analysis
Histograms and kernel density estimates show that age, MonthlyIncome, and NumberOfOpenCreditLinesAndLoans are roughly normally distributed, which fits the needs of the statistical analysis
fig = plt.figure()
# alpha sets the transparency of the figure
fig.set(alpha=0.2)
# subplot2grid places several smaller plots inside one larger figure:
# it creates an axes object at a specific location in a grid
# and allows that axes to span more than one row or column
# Grid of 2 rows by 3 columns; this plot goes in the first cell (row 0, column 0)
plt.subplot2grid((2, 3), (0, 0))
# kind="hist" draws a histogram
# bins sets the number of bins in the histogram
# figsize is a (width, height) tuple in inches
train["age"].plot(kind="hist", bins=30, figsize=(12, 6), grid=True)
plt.title("Hist of Age")
# Use a font that can display Chinese characters
plt.rcParams["font.sans-serif"] = ["SimHei"]
# Fix the minus sign '-' being rendered as a box in saved images
plt.rcParams["axes.unicode_minus"] = False
# Adjust the spacing between subplots for a compact layout
plt.tight_layout()
plt.show()
Feature selection
1. Variable binning: discretize continuous variables, and merge multi-state discrete variables into fewer states
2. Why variable binning matters
A Stability: avoids meaningless fluctuations in a feature causing fluctuations in the score
B Robustness: avoids the influence of extreme values
3. Advantages of variable binning
A Missing values can enter the model as a separate bin
B All variables are transformed to similar scales
Disadvantages of variable binning
A Large amount of computation
B Encoding is required after binning
4. Common methods of variable binning
A Supervised
A-1 Best-KS
A-2 ChiMerge (chi-square binning)
B Unsupervised
B-1 Equal-width
B-2 Equal-frequency
B-3 Clustering
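The two simplest unsupervised methods in the list above, equal-width and equal-frequency binning, can be sketched with pandas' cut and qcut. The ages and the bin counts below are made up for illustration and are not the article's settings.

```python
import pandas as pd

# Made-up ages for illustration
ages = pd.Series([21, 25, 30, 35, 40, 45, 50, 60, 70, 90])

# Equal-width: 3 bins that each cover the same range of values
equal_width = pd.cut(ages, bins=3)

# Equal-frequency: 5 bins that each hold the same number of samples
equal_freq = pd.qcut(ages, q=5)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

Equal-width bins can end up very unevenly filled when the data is skewed (here the last bin holds only two ages), while equal-frequency bins keep the sample counts balanced at the cost of uneven interval lengths.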
What's next
In the next chapter, I will write up the implementation logic of feature binning and the steps that follow