Original link: tecdat.cn/?p=2783
Original source: Tuoduan Data Tribe public account
Introduction
With the rapid development of China’s economy, personal credit business has also grown quickly and has helped to raise domestic demand and promote consumption. As the scale of personal credit continues to expand, credit default and other risks have become increasingly prominent, which to some extent restricts the healthy development of China’s credit market.
Challenge
In recent years, personal consumption loans have diversified from a single loan type into many: car mortgage loans, education loans, consumer durables loans (home appliances, computers, kitchenware, etc.), and marriage loans are all now offered in China. Default risk is the risk that a debtor fails to repay a loan on time for whatever reason. For commercial banks, default risk mainly refers to default caused by a decline in the borrower’s repayment ability or credit standing.
Overview of theory
Decision trees
A decision tree is one of the main techniques used for classification and prediction. It infers classification rules, represented as a tree, from a set of unordered cases. Working top-down and recursively, it compares attribute values at each internal node and branches downward according to the result, until a conclusion is reached at a leaf. Each path from the root to a leaf therefore corresponds to one rule, and the whole tree corresponds to a set of rules. Decision trees are widely used in data analysis, both to describe data and to make predictions. One of their biggest advantages is that the learning process does not require the user to have much background knowledge: as long as the training cases can be expressed as attributes and a conclusion, the algorithm can learn from them.
A classification model based on a decision tree has the following characteristics: (1) its structure is simple and easy to understand; (2) it is efficient and suits training sets with large amounts of data; (3) it usually needs no knowledge beyond the training data; (4) it can achieve high classification accuracy.
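To make the idea of root-to-leaf rules concrete, here is a minimal sketch in Python. The tree, its attribute names (`checking_balance`, `loan_term`), and its leaf labels are hypothetical and chosen only for illustration; they are not the tree learned in this study.

```python
# Hypothetical toy tree: attributes, values and labels are illustrative only.
# A leaf is {"label": ...}; an internal node tests one attribute.
TREE = {
    "attribute": "checking_balance",
    "branches": {
        "unknown": {"label": "no default"},
        "low": {
            "attribute": "loan_term",
            "branches": {
                "short": {"label": "no default"},
                "long": {"label": "default"},
            },
        },
    },
}


def classify(tree, case):
    """Walk from the root to a leaf, following the branch for each attribute value."""
    node = tree
    while "label" not in node:
        value = case[node["attribute"]]
        node = node["branches"][value]
    return node["label"]


def extract_rules(tree, conditions=()):
    """Return one (conditions, label) rule per root-to-leaf path."""
    if "label" in tree:
        return [(list(conditions), tree["label"])]
    rules = []
    for value, subtree in tree["branches"].items():
        step = (tree["attribute"], value)
        rules.extend(extract_rules(subtree, conditions + (step,)))
    return rules


for conditions, label in extract_rules(TREE):
    print(" and ".join(f"{a} = {v}" for a, v in conditions), "->", label)
```

Each printed line is one rule, so the number of rules equals the number of leaves in the tree.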
Early warning scheme design
The data work was divided into four steps: analyzing the data and splitting it into data sets, building a decision tree on the training data set, evaluating model performance, and improving model performance.
Data analysis and separation of data sets
To split the data, we divided it into two parts: a training data set used to build the decision tree and a test data set used to evaluate the model. Samples were assigned 80% to the training set and 20% to the test set. The proportions of default and non-default cases in the two sets are of similar magnitude (Table 1), so we consider the split reasonable.
Data set | Training: default | Training: no default | Test: default | Test: no default |
---|---|---|---|---|
Proportion | 0.31625 | 0.68375 | 0.235 | 0.765 |
Count | 25300 | 54700 | 4700 | 15300 |
Table 1
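The 80/20 split described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `split_dataset` and `default_rate`, the random seed, and the synthetic records are all hypothetical, and the article does not say which tool or sampling method the authors used. The last lines recompute the proportions in Table 1 from its counts.

```python
import random


def split_dataset(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of the records and cut it into training and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def default_rate(records):
    """Share of records flagged as default."""
    return sum(1 for r in records if r["default"]) / len(records)


# Synthetic stand-in data (not the authors' data): 1000 records, 30% defaults.
records = [{"id": i, "default": i % 10 < 3} for i in range(1000)]
train, test = split_dataset(records)
print(len(train), len(test))
print(round(default_rate(train), 3), round(default_rate(test), 3))

# Proportions recomputed from the counts reported in Table 1.
train_default_share = 25300 / (25300 + 54700)
test_default_share = 4700 / (4700 + 15300)
print(train_default_share, test_default_share)
```

A plain random split like this one does not guarantee equal default rates in the two parts; a stratified split would be one way to enforce that.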
Building the decision tree on the training data set
Figure 1
Figure 1 shows the basic situation of the decision tree of the training data set.
Figure 2
Figure 2 is a partial decision tree of the training data set.
Because our data set is large, the generated decision tree is very large. The output in the figure above shows only some of its branches. In plain language, the first five lines read:
(1) If the checking account balance is unknown, the applicant is classified as unlikely to default.
(2) Otherwise, if the checking account balance is less than 0 or between 1 and 200,
(3) and the loan term is less than or equal to 11 months,
(4) and the credit history is critical, good, excellent, or poor, the applicant is classified as unlikely to default.
(5) If the credit history is excellent, the applicant is classified as very likely to default.
The numbers in brackets give the number of cases that meet the decision criteria and the number of those cases that the decision classifies incorrectly.
Looking at the decision tree, it is hard to see why an applicant with an excellent credit history is judged highly likely to default, while an applicant whose checking balance is unknown is judged unlikely to default. These decisions seem to make no logical sense, but they may reflect a real pattern in the data, or they may be statistical anomalies.
After the decision tree is generated, a confusion matrix is output. It is a cross-tabulation showing how many training records the model classifies correctly and incorrectly.
Decision trees are well known for their tendency to overfit the training data. For this reason, the error rate reported on the training data may be too optimistic, so it is important to evaluate the decision tree model on the test data set.
Evaluating model performance
In this step the model makes predictions on the test data set; the results are shown in Table 2.
Actual value | Predicted: no default | Predicted: default | Row total |
---|---|---|---|
No default | 12500 (0.625) | 2800 (0.140) | 15300 |
Default | 2300 (0.115) | 2400 (0.120) | 4700 |
Column total | 14800 | 5200 | 20000 |
Table 2
Table 2 shows that, in the test sample, actual non-defaults judged as non-defaults account for 0.625 of all cases; actual non-defaults judged as defaults account for 0.140; actual defaults judged as non-defaults account for 0.115; and actual defaults judged as defaults account for 0.120.
From the bank’s perspective, the impact of an actual non-default being judged as a default is far smaller than that of an actual default being judged as a non-default. The reasons are as follows. First, if an applicant who would not actually default is judged as a default, the bank may reject the application and not issue the loan. The bank then runs no risk of lending money that is never repaid; it merely earns less loan interest. Second, if an applicant who will actually default is judged as a non-default, the bank may approve the application and issue the loan. After the loan is granted, the applicant may, for lack of good faith, fail to repay on time as the contract requires, so the bank loses not only the interest income but possibly the principal as well. Third, in the test data set, 0.183 of the actual non-defaults were judged as defaults, while 0.489 of the actual defaults were judged as non-defaults.
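The proportions quoted above can be recomputed from the counts in Table 2, where each cell gives a count followed by its share of the 20000 test cases. The short Python sketch below does this arithmetic; the variable names are ours, not the authors'.

```python
# Counts read from Table 2; keys are (actual, predicted).
counts = {
    ("no default", "no default"): 12500,
    ("no default", "default"): 2800,
    ("default", "no default"): 2300,
    ("default", "default"): 2400,
}

total = sum(counts.values())
actual_no_default = counts[("no default", "no default")] + counts[("no default", "default")]
actual_default = counts[("default", "no default")] + counts[("default", "default")]

# Share of each cell in the whole test set (the figures printed in Table 2).
cell_share = {key: n / total for key, n in counts.items()}

# Rates conditional on the actual class (the 0.183 and 0.489 in the text).
false_alarm_rate = counts[("no default", "default")] / actual_no_default
missed_default_rate = counts[("default", "no default")] / actual_default
accuracy = (counts[("no default", "no default")] + counts[("default", "default")]) / total

print(round(false_alarm_rate, 3), round(missed_default_rate, 3), round(accuracy, 3))
```

The two kinds of figures answer different questions: 0.115 is the share of all test cases that are missed defaults, while 0.489 is the share of actual defaulters that the model misses.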
From these three points we conclude that the model built on the training data set does not perform well on the test data set. From the bank’s perspective, if the model were used in practice, the probability of an actual default being misjudged as a non-default would be too high, and the bank would make wrong decisions and suffer losses.
Model optimization scheme: increasing the number of iterations and adding a cost matrix
As the evaluation above shows, the model built on the training data set is not ideal, so we next try to improve its performance.
1. Iterate 10 times
First, we try to improve the model by running 10 iterations.
Actual value | Predicted: no default | Predicted: default | Row total |
---|---|---|---|
No default | 13300 (0.665) | 2000 (0.100) | 15300 |
Default | 2300 (0.115) | 2400 (0.120) | 4700 |
Column total | 15600 | 4400 | 20000 |
Table 3
Table 3 shows that after 10 iterations the proportion of actual defaults judged as non-defaults is 0.115, unchanged from the initial model, while the proportion of actual non-defaults judged as defaults falls to 0.100.
From the bank’s perspective, this improvement has little practical significance. The main factor that determines the bank’s losses is the proportion of actual defaults judged as non-defaults, and this improvement did not reduce it. We therefore continue to improve the model.
2. Iterate 100 times
Since the model produced with 10 iterations is still not good enough, we run 100 iterations in this step.
Actual value | Predicted: no default | Predicted: default | Row total |
---|---|---|---|
No default | 12900 (0.645) | 2400 (0.120) | 15300 |
Default | 2400 (0.120) | 2300 (0.115) | 4700 |
Column total | 15300 | 4700 | 20000 |
Table 4
The results after 100 iterations are shown in Table 4. Compared with the results of the initial model, there is no significant improvement in model performance.
3. Cost matrix
Since neither of the two approaches above greatly improved the model, we adopt a cost matrix in this step.
Here we assume that misclassifying a defaulting user as non-defaulting costs the lender 4 times as much as misclassifying a non-defaulting user as defaulting, so the cost matrix is:
Actual value | Predicted: no default | Predicted: default |
---|---|---|
No default | 0 | 1 |
Default | 4 | 0 |
Rows represent actual values and columns represent predicted values; the first row and first column correspond to non-default, and the second row and second column to default. A correct classification incurs no cost. Table 5 summarizes the classification results of the model with the cost matrix added.
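One common way to use such a cost matrix is to pick, for each applicant, the prediction with the lowest expected cost. The Python sketch below illustrates this with the 4:1 assumption stated above. It is an illustration only: the article does not say how its software applies the costs (some tree learners use them while growing the tree instead), and the function names and the estimated probability `p_default` are hypothetical.

```python
# Cost of each (actual, predicted) pair under the 4:1 assumption.
COST = {
    ("no default", "no default"): 0,
    ("no default", "default"): 1,
    ("default", "no default"): 4,
    ("default", "default"): 0,
}
CLASSES = ("no default", "default")


def expected_cost(predicted, p_default):
    """Expected cost of a prediction, given an estimated probability of default."""
    probs = {"no default": 1.0 - p_default, "default": p_default}
    return sum(probs[actual] * COST[(actual, predicted)] for actual in CLASSES)


def min_cost_prediction(p_default):
    """Choose the prediction whose expected cost is lowest."""
    return min(CLASSES, key=lambda predicted: expected_cost(predicted, p_default))


for p in (0.1, 0.3, 0.5):
    print(p, min_cost_prediction(p))
```

With these costs, predicting "default" is cheaper whenever 1 - p < 4p, that is, whenever the estimated default probability exceeds 0.2 rather than 0.5. This is why a cost-sensitive model flags many more applicants as likely defaulters.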
Actual value | Predicted: no default | Predicted: default | Row total |
---|---|---|---|
No default | 7600 (0.380) | 7700 (0.385) | 15300 |
Default | 1000 (0.050) | 3700 (0.185) | 4700 |
Column total | 8600 | 11400 | 20000 |
Table 5
Compared with the previous results, the model with the cost matrix performs better on the error that matters most to the bank: the proportion of actual defaults judged as non-defaults falls sharply, from 0.115 to 0.050.
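To compare the four models side by side, the sketch below recomputes a few rates from the counts in Tables 2 to 5 (each cell gives a count followed by its share of the test set). The model labels and the function name `summarize` are ours, chosen for illustration.

```python
# Counts read from Tables 2 to 5, in the order:
# (no default -> no default, no default -> default,
#  default -> no default, default -> default), as actual -> predicted.
results = {
    "initial tree": (12500, 2800, 2300, 2400),
    "10 iterations": (13300, 2000, 2300, 2400),
    "100 iterations": (12900, 2400, 2400, 2300),
    "cost matrix": (7600, 7700, 1000, 3700),
}


def summarize(tn, fp, fn, tp):
    """Accuracy plus the two error rates conditional on the actual class."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "missed_default_rate": fn / (fn + tp),
        "false_alarm_rate": fp / (tn + fp),
    }


summary = {name: summarize(*cells) for name, cells in results.items()}
for name, rates in summary.items():
    print(name, {key: round(value, 3) for key, value in rates.items()})
```

The comparison shows the trade-off behind the cost matrix: the share of actual defaulters that the model misses drops markedly, while more non-defaulting applicants are flagged as defaults and overall accuracy falls.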
Figure 3
Figure 3 is a partial decision tree of the test dataset.
Recommendations
As living standards in China gradually improve, personal consumption rises with them. For many people, however, wage growth cannot keep up with the growth in consumption, so they rely on loans from commercial banks to sustain their economic life and raise their standard of living. Vehicle mortgage loans, housing mortgage loans, education loans, consumer durables loans, marriage loans, and others are all offered in China, and their number and scale keep growing. To profit from lending, commercial banks must strengthen loan risk management and, in addition to assessing each loan individually, draw on rules learned from large numbers of cases. Data on this scale cannot be understood by people unaided, so analysis is needed to extract useful rules and help commercial banks and other financial institutions make decisions. Decision trees are a good decision-support method for banks and financial institutions.
The nodes of the decision tree show which independent variables have a large influence on loan default, so commercial banks can pay closer attention to these aspects of their customers and control them strictly. The algorithm can assign a misclassification cost to the kind of error that matters most, so that the model pays more attention to that error and reduces its probability. If a bank wrongly judges a non-defaulting customer as a defaulter, it merely earns less interest on a few loans and does not lose the whole loan. If the bank uses this algorithm to judge customers’ likelihood of default more accurately, it can reduce the number of actual defaulters wrongly judged as non-defaulters, and thus the number of loans it cannot recover.