Have you seen the wonders of machine learning and decided to give it a try yourself? In this article, we introduce the basics of machine learning in plain English, using a loan risk assessment example to walk you through your first machine learning project in Python. Once you try it, you'll see machine learning really isn't hard.

Task

Congratulations on landing an internship at a financial company.

It's your first day and you're still buzzing with excitement. Then your supervisor calls you over and shows you a file.

This is what the file looks like:

The supervisor says it is one of the company's valuable data assets, and urges you to read it carefully, find the patterns in the numbers, and use them to make informed lending decisions.

Each row records a past loan. After some thought, you finally figure out what each column means:

  • grade: loan grade
  • sub_grade: loan sub-grade
  • short_emp: whether employed for one year or less
  • emp_length_num: years of employment
  • home_ownership: housing status (own, mortgage, or rent)
  • dti: debt-to-income ratio
  • purpose: purpose of the loan
  • term: loan term
  • last_delinq_none: whether the applicant has no delinquency record
  • last_major_derog_none: whether the applicant has no record of being more than 90 days overdue
  • revol_util: revolving credit utilization (how much of the available credit is in use)
  • total_rec_late_fee: total late fees received
  • safe_loans: whether the loan is safe

The last column records whether the loan was repaid on schedule. Drawing on these lessons from past cases, your supervisor wants you to work out the rules for how safe a loan is, so you can respond correctly to new loan applications.

The pattern your supervisor wants you to find can be expressed as a decision tree.

Decision trees

Let’s talk about what a decision tree is.

The decision tree looks something like this:

When making a decision, you start from the top node. Each branch carries a judgment condition: if the condition is met, go left; if not, go right. Once you reach the edge of the tree, a decision has been made.

For example, you're walking down the street when you run into your neighbor Lao Zhang. You greet him warmly:

“Zhang, have you eaten?”

All right, so this is a branch. Lao Zhang's answer determines your decision, that is, what you say next.

Case one.

Lao Zhang: Yes, I have.

You: Why don’t you come to my house for some more?

Case two.

Lao Zhang: Not yet.

You: Then go home and eat. Goodbye!

In the loan scenario, you examine each of the applicant's indicators in turn, judge whether the application is safe, and then decide whether or not to lend. Written down, this process is a decision tree.
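The walk-through above can be sketched in a few lines of Python. The features, conditions, and thresholds below are invented for illustration only; they are not the rules we will later learn from the data.

```python
# A toy decision tree as nested if/else: each condition is one internal
# node ("go left" if satisfied, otherwise "go right"); each return is a
# leaf, i.e., a final decision. Thresholds here are made up.
def decide(applicant):
    if applicant["total_rec_late_fee"] == 0:   # node 1: no late fees so far?
        if applicant["dti"] < 20:              # node 2: modest debt-to-income?
            return "safe"
        return "not safe"
    return "not safe"

print(decide({"total_rec_late_fee": 0, "dti": 15}))  # safe
```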

As a finance rookie, you are eager and open-minded, and want to give it a try. But when you scroll the table down to the last row, there are 46,508 records!

You assess your reading speed, patience, and cognitive load, decide the task is Mission Impossible, and quietly start packing up your things, planning to find your supervisor and quit.

Wait, there's no need to be so upset. Technology has put a piece of dark magic at your disposal. It's called machine learning.

Learning

What is machine learning?

Once upon a time, people "operated" computers. There was no doubt in anyone's mind how a task would be accomplished: people gave the computer instructions, the computer dutifully carried them out, and that was that.

It turns out, though, that there are some tasks people simply don't know how to instruct a computer to do.

From the news a few days ago, you know that AlphaGo played Go against Ke Jie. Ke Jie not only lost the match, but also cried.

But do the people who made AlphaGo actually know how to beat Ke Jie at Go? If you told them to set sportsmanship aside and sit down across the board from Ke Jie themselves... who do you think would be crying?

How did a group of people who couldn't beat Ke Jie themselves create computer software that defeated the "strongest brain" of the human Go world?

The answer is machine learning.

You can't tell a machine "do this in step one, do that in step two," or "if A happens, open the first bag; if B happens, open the second bag."

The key to machine learning is not human experience and wisdom, but data.

In this article, we deal with the most basic kind: supervised learning. Supervised learning uses the kind of data machines like best. Its defining characteristic is that every record is labeled.

The loan data set your supervisor gave you is labeled. Each loan record carries a "safe or not" label: +1 means safe, -1 means unsafe.

The machine looks at a piece of data, sees the label attached to it, and forms a hypothesis.

Then you show it another piece of data, and it reinforces or revises that hypothesis.

This is the learning process: build a hypothesis, receive feedback, revise the hypothesis. Through iteration, the machine keeps refreshing its own understanding.
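That build-feedback-revise loop can be sketched as a toy in a few lines. This is purely illustrative and not the algorithm scikit-learn uses later: the "hypothesis" is a single cutoff on one invented feature, nudged whenever a labeled example contradicts it.

```python
# Labeled examples arrive one by one as (dti, label) pairs; the hypothesis
# is a single cutoff: predict "safe" whenever dti < cutoff.
examples = [(5, "safe"), (40, "not safe"), (10, "safe"), (35, "not safe")]

cutoff = 0  # initial hypothesis: nothing is safe
for dti, label in examples:
    predicted = "safe" if dti < cutoff else "not safe"
    if predicted != label:  # feedback: the hypothesis failed on this example
        # revise: move the cutoff just past the misclassified example
        cutoff = dti + (1 if label == "safe" else -1)

print(cutoff)  # 11: safe examples fall below it, unsafe ones above
```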

This reminds me of a dialogue from the classic crosstalk piece "Toad Drum".

A: Let me ask you something. Have you seen Toad?

B: Who hasn’t seen Toad?

A: Why do you think such a small animal makes such a loud noise?

B: That's because it has a big mouth, a big belly and a thick neck, so of course it croaks loudly. Everything works that way.

A: My paper basket has a big mouth and a big neck. Why doesn’t it cry?

B: The wastepaper basket is dead. It’s made of bamboo. It doesn’t even ring.

A: The sheng is also made of bamboo. How can it sound?

B: It may be made of bamboo, but it has holes, and things with holes make sound.

A: My rice sieve is full of holes. Why doesn't it make a sound?

Here, comedian B keeps trying to form a generalizable hypothesis. Unfortunately, A keeps shattering it with new counterexamples.

After hitting walls everywhere, the poor machine stumbles along and grows. Having seen lots and lots of data, the computer gradually forms its own way of judging things. We call this way of judging a model.

After that, you can use the model to help you make informed judgments.

So let's get started: use Python to build a decision tree that helps us judge loan risk.

Preparation

To use Python and related packages, you need to install the Anaconda package first. See “How to Make a Word Cloud in Python” for detailed steps.

The loan data file shown by your supervisor can be downloaded here.

The file extension is .csv; you can open it in Excel to check that it downloaded correctly.

If everything works, please move it to our working directory, Demo.

Go to your system “terminal” (macOS, Linux) or “command prompt” (Windows), go to our working directory, Demo, and execute the following command.

pip install -U pillow

The runtime environment is now configured.

At a terminal or command prompt type:

jupyter notebook

The Jupyter Notebook is running correctly. Now we are ready to write the code.

Code

First, let’s create a new Python 2 notebook called loans-Tree.

To let Python handle tabular data efficiently, we use an excellent data-processing framework called Pandas.

import pandas as pd

Then we read all the contents of loans.csv into a variable called df.

df = pd.read_csv('loans.csv')

Let's look at the first few rows of the df DataFrame to make sure the data was read correctly.

df.head()

Since the table has many columns and doesn't fit on the screen, drag it to the right to check that the rightmost columns were also read in correctly.

This confirms that all columns of the data have been read in.

Next, count the total number of rows to check that all rows were fully read in as well.

df.shape

The running results are as follows:

(46508, 13)

The number of rows and columns is correct and the data is read correctly.

As you may recall, the last column of each record, safe_loans, is a flag telling us whether the loan turned out to be safe. We call this label the target, and all the preceding columns the features. Don't worry if you can't remember these terms now; you'll come across them again and again, and the memory will reinforce itself naturally.

Now let's extract the features and the target separately. Following machine-learning convention, we call the features X and the target y.

X = df.drop('safe_loans', axis=1)
y = df.safe_loans

Let's take a look at the shape of the feature data X:

X.shape

The running results are as follows:

(46508, 12)

All columns except the last are there, in line with our expectations. Now let's look at the target column.

y.shape

The following information is displayed:

(46508,)

There is no number after the comma, which means y is one-dimensional: a single column.

Let's look at the first few rows of X.

X.head()

The running results are as follows:

Notice there's a problem here. When building decision trees with scikit-learn, every feature must be numeric (integer or real). But we can see at a glance that the values of grade, sub_grade and home_ownership are categorical. So before we can proceed, we must apply a transformation that maps each category to a number.

So let’s start mapping:

from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
d = defaultdict(LabelEncoder)
X_trans = X.apply(lambda x: d[x.name].fit_transform(x))
X_trans.head()

The result looks like this:

Here we used LabelEncoder to successfully convert each category into a numeric value. Quiz: in the grade column, which number is B mapped to?

Compare the two tables and think for 10 seconds.

The answer is 1. Did you get that right?
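Instead of comparing the two tables by eye, you can ask each fitted LabelEncoder directly: after fitting, classes_[i] is the category that was mapped to integer i. A self-contained sketch on toy data that stands in for the grade column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

grade = pd.Series(["B", "A", "C", "B"])  # toy stand-in for the grade column
enc = LabelEncoder()
codes = enc.fit_transform(grade)

# classes_ lists the categories in code order, so index = numeric code
print(list(enc.classes_))  # ['A', 'B', 'C']
print(list(codes))         # [1, 0, 2, 1]: B was mapped to 1
```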

Now what we need to do is divide the data into two parts, called the training set and the test set.

Why all this fuss?

Because it makes sense.

Imagine that before a final exam, your teacher gives you a set of questions and answers, and you memorize them. Then the exam turns out to be drawn entirely from that set. Thanks to your superhuman memory, you score 100. But have you actually learned the subject? Could you solve a brand-new problem? Nobody knows.

So exam questions need to differ from review questions. In the same way, the decision tree we generate will fit the data it has already seen almost perfectly; what we really care about is whether it generalizes to new data. In our case, the company doesn't care about loans already issued. What matters is how to handle the new loan applications that arrive in the future.

To randomly split the data into a training set and a test set, Python needs only two statements.

# in older scikit-learn this lived in sklearn.cross_validation
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_trans, y, random_state=1)

Let’s look at the shape of the training dataset:

X_train.shape

The running results are as follows:

(34881, 12)

What about test sets?

X_test.shape

Here is the result:

(11627, 12)

At this point, all the data preparation is done. Now we call scikit-learn, which has the decision tree model built in. It takes only three statements to use, which is very convenient.

from sklearn import tree
clf = tree.DecisionTreeClassifier(max_depth=3)
clf = clf.fit(X_train, y_train)

OK, the decision tree you wanted has been generated.

Yes, it's that simple.

But how do I know what the resulting decision tree looks like? Seeing is believing!

All right, let's draw the decision tree. Note that this statement packs in a lot; we'll go over the details when we get a chance. For now, just copy it in and execute it.

with open("safe-loans.dot", 'w') as f:
     f = tree.export_graphviz(clf,
                              out_file=f,
                              max_depth = 3,
                              impurity = True,
                              feature_names = list(X_train),
                              class_names = ['not safe', 'safe'],
                              rounded = True,
                              filled = True )

from subprocess import check_call
check_call(['dot', '-Tpng', 'safe-loans.dot', '-o', 'safe-loans.png'])

from IPython.display import Image as PImage
from PIL import Image, ImageDraw, ImageFont
img = Image.open("safe-loans.png")
draw = ImageDraw.Draw(img)
img.save('output.png')
PImage("output.png")

The time has come for a miracle:

Are you as surprised as I was when I first saw the visualization of the decision tree?

We only had Python generate a simple decision tree (just three layers deep), yet it faithfully accounted for the influence of the various variables on the final decision.
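If the Graphviz dot command is not installed, there is a simpler way to inspect the tree: scikit-learn's export_text prints the same rules as plain text (available in newer scikit-learn versions). A self-contained sketch on toy data, since the article's loan DataFrame is not reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data standing in for the loan features
X, y = make_classification(n_samples=200, n_features=4, random_state=1)
clf = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)

rules = export_text(clf, feature_names=["f0", "f1", "f2", "f3"])
print(rules)  # indented if/else rules, one line per node
```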

Test

Ecstatic, you start quietly reciting... wait, what? You want to memorize the conditions of this decision tree and judge loan risk yourself?

Cut it out. What era is this, that you're still so keen on rote memorization?

From now on, the computer can automate this decision for you.

You don't believe it?

Let's take a piece of data from the test set and let the computer use the decision tree to help us decide.

test_rec = X_test.iloc[1,:]
clf.predict([test_rec])

The computer tells us it has assessed the risk, and the result is:

array([1])

As mentioned before, 1 means the loan is safe. What happened in reality? Let's verify by retrieving the corresponding label from the test-set target:

y_test.iloc[1]

The result is:

1

As it turns out, the computer correctly judged the risk of this new loan application through the decision tree.

But a single case proves nothing. Let's measure how accurately the trained decision tree model classifies loan risk across the entire test set.

from sklearn.metrics import accuracy_score
accuracy_score(y_test, clf.predict(X_test))

Although the test set contains more than ten thousand records, the computer finishes instantly:

0.61615205986066912

You might be a little disappointed: after all that work, the accuracy is only a bit over 60 percent? That barely passes.

Don't lose heart. Throughout this whole process you used default settings, and haven't yet done the important work of optimization.

Think about it: even a new phone needs to be set up before it works well, right? Here you applied an un-optimized default model to corporate lending, and even so the accuracy already cleared the pass line.

As for the optimization problem, we will talk about it in detail when we have the opportunity.
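To give a small taste of what that optimization involves, here is a minimal sketch of the simplest tuning step: try several max_depth values and keep the one with the best held-out accuracy. Toy data stands in for the loan set, so the numbers will differ from the article's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy stand-in for the loan data
X, y = make_classification(n_samples=500, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

scores = {}
for depth in (2, 3, 5, 8, None):  # None lets the tree grow until leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    scores[depth] = accuracy_score(y_te, clf.predict(X_te))

best = max(scores, key=scores.get)  # depth with the highest test accuracy
print(best, round(scores[best], 3))
```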

You finally made it through the first day of your internship. I can see a future Wall Street star in the making.

And when you strike it rich, don't forget me!

Discussion

Beyond judging loan risk, what else do you think decision trees could be used for? And besides decision trees, what other machine learning algorithms for classification do you know? Leave a comment to share with everyone, so we can exchange ideas and discuss.

If you like this article, please give it a thumbs-up. You can also follow and pin my WeChat official account "Nkwangshuyi".

If you're interested in data science, check out my series index post "How to Get Started in Data Science Effectively" for more interesting problems and solutions.