How a credit scoring system works, part I

The source code

https://gitee.com/pingfanrenbiji/Credit-Card-Score


Open the project in Jupyter

Import the libraries

# NumPy is an array- and matrix-based numerical computing library

import numpy as np

# Pandas is a third-party library that provides high-performance, easy-to-use data types and analysis tools

import pandas as pd

# pyplot is Matplotlib's plotting interface

import matplotlib.pyplot as plt

# Seaborn is a visual library based on Matplotlib. It is easier to use than Matplotlib, and the legend style is more modern

import seaborn as sns

# Matplotlib is a plotting library for Python that can create many kinds of figures, from simple scatter plots and sine curves to three-dimensional graphics. The magic command below renders its figures inline in the notebook

%matplotlib inline


Read data dictionary

What each variable means

SeriousDlqin2yrs: good or bad customer (whether the borrower became 90 days or more past due within two years)

RevolvingUtilizationOfUnsecuredLines: total balance on credit cards and personal lines of credit, excluding real estate and installment debt such as car loans, divided by the sum of the credit limits

age: the borrower's age

NumberOfTime30-59DaysPastDueNotWorse: the number of times in the past two years that the borrower was 30 to 59 days past due, but no worse

DebtRatio: monthly debt payments, alimony, and living expenses divided by monthly gross income

MonthlyIncome: monthly income

NumberOfOpenCreditLinesAndLoans: the number of open loans (installment loans such as a car loan or mortgage) and lines of credit (such as credit cards)

NumberOfTimes90DaysLate: the number of times the borrower was 90 days or more past due

NumberRealEstateLoansOrLines: the number of mortgages and real estate loans, including home equity lines of credit

NumberOfTime60-89DaysPastDueNotWorse: the number of times in the past two years that the borrower was 60 to 89 days past due, but no worse

NumberOfDependents: the number of dependents in the family, not including the borrower


REAL data type description

The REAL data type holds single-precision floating point numbers

The REAL value takes four bytes of storage

Values saved as REAL are accurate to seven significant digits


Reading training data

df = pd.read_csv("./GiveMeSomeCredit/cs-training.csv").drop("Unnamed: 0", axis=1)



axis=1 means the drop applies to columns, so this removes the "Unnamed: 0" column




Look at the first five rows of data

View missing values and outliers

df.info() shows that there are 150,000 records in total. MonthlyIncome and NumberOfDependents have missing values: MonthlyIncome is missing 29,731 values and NumberOfDependents is missing 3,924.
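
As a complement to reading the info() output, the missing counts can also be computed directly. This is a minimal sketch on a made-up four-row frame (the values are illustrative, not taken from cs-training.csv), using isnull() together with sum() and mean():

```python
import numpy as np
import pandas as pd

# Hypothetical mini data set; only the column names come from the article.
demo = pd.DataFrame({
    "age": [45, 38, 52, 29],
    "MonthlyIncome": [5000.0, np.nan, 3200.0, np.nan],
    "NumberOfDependents": [0.0, 2.0, np.nan, 1.0],
})

# Number of missing values per column.
missing_counts = demo.isnull().sum()

# Fraction of missing values per column.
missing_rate = demo.isnull().mean()

print(missing_counts)
print(missing_rate)
```

On the real data, the same two calls on df should reproduce the counts that info() implies.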


Summary statistics of the data set

df.describe().T.assign(missing_rate = df.apply(lambda x : (len(x)-x.count())/float(len(x))))


describe() reports the count, mean, standard deviation, minimum, maximum, and the first, second, and third quartiles of each column; assign() then adds a missing_rate column with the fraction of missing values per variable


Missing value handling:

1. MonthlyIncome has many missing values, so deleting those rows directly is not appropriate. The missing values are filled in based on the relationships between variables, using a random forest.

    

2. NumberOfDependents has few missing values, and they have limited influence on the overall sample, so those rows can simply be deleted here; other filling strategies can be tried later.


Random forest function

Principle analysis:

Parameter Description:

random_state: the random seed



n_estimators: the number of trees (weak learners) in the forest. Generally speaking, too few trees tends to underfit, while too many makes training expensive; beyond a certain number, adding more trees yields little improvement, so a moderate value is usually chosen. The default is 100.



max_depth: the maximum depth of each decision tree



n_jobs: the number of CPU cores used in parallel; -1 means use all available cores
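
The function itself is not reproduced in the text, so the following is only a sketch of how random-forest imputation of MonthlyIncome could look with scikit-learn's RandomForestRegressor and the four parameters described above. The function name fill_income_with_forest, the parameter values, and the synthetic demo frame are all assumptions made for illustration, not the original code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def fill_income_with_forest(frame, target="MonthlyIncome"):
    """Fill missing `target` values with random-forest predictions (hypothetical helper)."""
    # Use only predictor columns that have no missing values themselves.
    features = [c for c in frame.columns
                if c != target and frame[c].notna().all()]
    known = frame[frame[target].notna()]
    unknown = frame[frame[target].isna()]
    if unknown.empty:
        return frame.copy()

    # Parameter values here are illustrative choices, not tuned settings.
    model = RandomForestRegressor(random_state=0, n_estimators=200,
                                  max_depth=3, n_jobs=-1)
    model.fit(known[features], known[target])

    filled = frame.copy()
    filled.loc[unknown.index, target] = model.predict(unknown[features])
    return filled


# Tiny synthetic frame (made-up values) just to exercise the function.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.integers(21, 80, size=50),
    "DebtRatio": rng.random(50),
    "MonthlyIncome": rng.normal(5000, 1500, size=50),
})
demo.loc[[3, 7, 11], "MonthlyIncome"] = np.nan

demo_filled = fill_income_with_forest(demo)
print(demo_filled["MonthlyIncome"].isna().sum())
```

Rows that already have an income keep their original value; only the missing entries are replaced by predictions.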


Delete missing values and duplicate values

# dropna() and drop_duplicates() return new DataFrames, so assign the results back

df = df.dropna()

df = df.drop_duplicates()


Outlier handling

An outlier is a value that deviates markedly from the bulk of the sampled data, commonly defined as a measurement more than two standard deviations from the mean



Outlier detection methods are used to find such values; this section uses box plots and the median absolute deviation
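
To make the two-standard-deviation rule concrete, here is a small sketch on made-up numbers. It is only an illustration of the definition above; the article itself relies on box plots and a median-based score rather than this rule.

```python
import numpy as np

# Hypothetical measurements; 50 is the suspicious one.
values = np.array([10, 12, 11, 13, 12, 11, 50], dtype=float)

mean = values.mean()
std = values.std()  # population standard deviation (ddof=0)

# Flag values more than two standard deviations away from the mean.
is_outlier = np.abs(values - mean) > 2 * std

print(values[is_outlier])
```

Note that the mean and standard deviation are themselves pulled toward extreme values, which is one reason a median-based score can be preferable on heavily skewed columns.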


Draw a box plot

df["RevolvingUtilizationOfUnsecuredLines"].plot(kind="box", grid=True)



grid=True displays the grid


Replace values greater than 2 with 2

revNew = []

for val in df.RevolvingUtilizationOfUnsecuredLines:

    if val <= 2:

        revNew.append(val)

    else:

        revNew.append(2.)

# Write the capped values back to the column

df.RevolvingUtilizationOfUnsecuredLines = revNew
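
The same capping can be written without an explicit loop. A minimal sketch using pandas' Series.clip on made-up values (the numbers are hypothetical):

```python
import pandas as pd

# Hypothetical utilization values, including some far above 2.
utilization = pd.Series([0.1, 0.95, 2.0, 3.7, 50708.0])

# Every value greater than 2 becomes 2; smaller values are unchanged.
capped = utilization.clip(upper=2)

print(capped.tolist())
```

On the real data this would correspond to applying clip(upper=2) to df["RevolvingUtilizationOfUnsecuredLines"] and assigning the result back to the column.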


Draw the box plot again

Box plot of age

df.age.plot.box(grid=True)


The age attribute contains a 0 value, which is clearly an outlier, so those rows are filtered out



df = df[df["age"] > 0]


Box plots of the three past-due attributes

df.boxplot(column=["NumberOfTime30-59DaysPastDueNotWorse", "NumberOfTime60-89DaysPastDueNotWorse", "NumberOfTimes90DaysLate"], rot=30)



rot: int or float, default 0. The rotation angle (in degrees) of the labels relative to screen coordinates


The box plot above shows that NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse, and NumberOfTimes90DaysLate each contain two outlier values


View specific outliers

df["NumberOfTime30-59DaysPastDueNotWorse"].unique()

df["NumberOfTime60-89DaysPastDueNotWorse"].unique()

df["NumberOfTimes90DaysLate"].unique()


You can see that 96 and 98 are outlier data

Replace the outliers with the median of the given column

def replaceOutlier(data):

    New = []

    med = data.median()

    for val in data:

        if val in (96, 98):

            New.append(med)

        else:

            New.append(val)

    return New
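
For comparison, here is a vectorized sketch of the same idea using Series.isin and Series.mask. The function name replace_sentinels_with_median and the sample values are hypothetical. As in the loop version above, the median is taken over the whole column, including the 96 and 98 entries.

```python
import pandas as pd


def replace_sentinels_with_median(series, sentinels=(96, 98)):
    """Return a copy of `series` with the sentinel values replaced by the column median."""
    med = series.median()
    # mask() replaces entries where the condition is True.
    return series.mask(series.isin(sentinels), med)


# Hypothetical past-due counts containing the two outlier codes.
late = pd.Series([0, 1, 0, 98, 2, 96, 0])
cleaned = replace_sentinels_with_median(late)

print(cleaned.tolist())
```

Applied to the real data, the result would be assigned back to each of the three past-due columns in turn.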


After removing the outliers, look at the boxplot of the three columns

View the boxplot of DebtRatio

Use the median absolute deviation (MAD) for outlier detection

The first parameter of the detection function is the column data

The second parameter is the threshold



If the column data is a list, it is converted to a NumPy array



shape returns the dimensions (number of rows and columns) of the array



len(x) also returns the length of the array, that is, its number of rows



Take the median of the column



Subtract the median from each value and take the absolute value of the difference



Then take the median of those absolute differences; this is the MAD



The outlier score of each value is then norm.ppf(0.75) * |value - median| / MAD



Returns True where the score is greater than the threshold



Returns False where the score is not greater than the threshold
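
The steps above can be sketched as follows. The function body is not shown in the text, so this is a reconstruction under assumptions: the name mad_based_outlier matches the call used later in the article, but the default threshold of 3.5 and the flattening to a one-dimensional array are choices made here for illustration.

```python
import numpy as np
from scipy.stats import norm


def mad_based_outlier(points, thresh=3.5):
    """Flag values whose MAD-based score exceeds `thresh` (assumed default)."""
    # Accept lists, Series, or arrays and work on a flat float array.
    points = np.asarray(points, dtype=float).ravel()

    med = np.median(points)              # median of the column
    abs_dev = np.abs(points - med)       # distance of each value from the median
    med_abs_dev = np.median(abs_dev)     # the MAD itself

    # norm.ppf(0.75) is about 0.6745 and rescales the score to be
    # comparable to a z-score. Note: a MAD of 0 would divide by zero here.
    score = norm.ppf(0.75) * abs_dev / med_abs_dev
    return score > thresh


# Hypothetical values: 100 sits far from the rest.
flags = mad_based_outlier([1, 2, 3, 4, 100])
print(flags.tolist())
```

For this sample the median is 3 and the MAD is 1, so only the value 100 receives a score above the threshold.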

  


Use the smallest detected outlier as the replacement value

minUpperBound = min([val for (val, out) in zip(df.DebtRatio, mad_based_outlier(df.DebtRatio)) if out == True])



mad_based_outlier(df.DebtRatio) returns True for each value whose score exceeds the threshold, meaning that value is an outlier



Collect all the outliers and take their minimum to get the smallest outlier




Replace the outliers

Any value greater than the smallest outlier is replaced by the smallest outlier
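
The replacement code is not shown, so here is a minimal sketch of the capping step. It assumes a boolean outlier mask is already available from the detector; both the values and the mask below are made up, and the approach only makes sense when all flagged values lie on the high side.

```python
import numpy as np
import pandas as pd

# Hypothetical DebtRatio values and an assumed detector output.
debt_ratio = pd.Series([0.1, 0.3, 0.5, 0.4, 12.0, 300.0])
is_outlier = np.array([False, False, False, False, True, True])

# The smallest flagged value becomes the upper bound.
min_upper_bound = debt_ratio[is_outlier].min()

# Values above the bound are pulled down to it; others are unchanged.
capped = debt_ratio.clip(upper=min_upper_bound)

print(min_upper_bound, capped.tolist())
```

On the real data, min_upper_bound would play the role of minUpperBound and the capped Series would be assigned back to df["DebtRatio"].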


View the data in this column

Draw a boxplot

Check the monthly income data

Box plot monthly income

In the same way, find the smallest outlier and use it to replace every value greater than it

The other variables are handled similarly, so the steps are not repeated here

Data segmentation

The data is divided into training sets and test sets

Import libraries

from sklearn.model_selection import train_test_split


Training data and test data segmentation

Y = df["SeriousDlqin2yrs"]

X = df.iloc[:, 1:]

# Split test and training data at a 3:7 ratio; fixing random_state makes the split identical on every run

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=123)



train = pd.concat([Y_train, X_train], axis=1)

test = pd.concat([Y_test, X_test], axis=1)



train.to_csv('TrainData.csv',index=False)

test.to_csv('TestData.csv',index=False)


Exploratory analysis

Histograms, scatter plots, and box plots are generally used for analysis



Histograms and kernel density estimates show that age, MonthlyIncome, and NumberOfOpenCreditLinesAndLoans are roughly normally distributed, which is consistent with the requirements of statistical analysis


fig = plt.figure()

# alpha: set the transparency of the figure

fig.set(alpha=0.2)

# subplot2grid lays out several smaller plots inside one larger figure

# It creates an axes object at a specific location in the grid

# and allows the axes to span more than one row or column

# The grid has 2 rows and 3 columns; this axes sits at the first position (row 0, column 0)

plt.subplot2grid((2, 3), (0, 0))

# hist histogram

# bins sets the number of groups in the histogram

# figsize is a tuple that specifies the width and height of the figure in inches

train["age"].plot(kind="hist", bins=30, figsize=(12, 6), grid=True)

plt.title("Hist of Age")



# Solve the display problem of Chinese

plt.rcParams["font.sans-serif"] = ["SimHei"]

# Fix save image where negative sign '-' is displayed as a box

plt.rcParams["axes.unicode_minus"] = False



# Adjust the spacing between subplots so the figure is displayed compactly

plt.tight_layout()

plt.show()


Feature selection

1. Variable binning: discretize continuous variables, and merge multi-state discrete variables into fewer states



2. Why binning matters



A Stability: avoids meaningless fluctuations in a feature causing fluctuations in the score



B Robustness: avoids the influence of extreme values



3. Advantages of binning



A Missing values can be carried into the model as a separate bin



B All variables are transformed to similar scales



Disadvantages of binning



A Large amount of computation



B Encoding is required after binning



4. Common binning methods



A Supervised



A-1 Best-KS



A-2 ChiMerge (chi-square binning)



B Unsupervised



B-1 Equal-width



B-2 Equal-frequency



B-3 Clustering
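
As a small preview of the unsupervised equal-width (equidistant) and equal-frequency methods listed above, the following sketch uses pandas' cut and qcut on made-up ages. The values and the choice of three bins are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical ages.
ages = pd.Series([21, 25, 30, 35, 40, 45, 50, 60, 80])

# Equal-width: 3 bins that each span the same range of values.
equal_width = pd.cut(ages, bins=3)

# Equal-frequency: 3 bins that each hold about the same number of rows.
equal_freq = pd.qcut(ages, q=3)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

Equal-width bins can end up very unevenly populated on skewed data, while equal-frequency bins keep the counts balanced at the cost of uneven bin widths.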




Next steps

In the next part, I will write up the implementation logic of feature binning and the steps that follow
