This article originally appeared on my personal public account, TechFlow. Original writing is not easy, so a follow is much appreciated.


Today, in the 26th article of our machine learning series, we'll talk about another ensemble learning model, the famous random forest.

Random forest is well known, widely used in industry, and has won many algorithm competitions. It is also a classic example of building a strong classifier by combining several weak classifiers, which is a big part of why it is so popular.

This article builds on the earlier decision tree articles. If you have not read them, you can find them in the album at the top.

Algorithm principle

The last article introduced AdaBoost and gave an overview of the main approaches to ensemble learning: Bagging, Boosting, and Stacking. AdaBoost is a Boosting method, while random forest is a Bagging method.

The main feature of Bagging is "democracy". The idea is easy to understand: we train multiple weak classifiers on the same training data. When faced with a new sample, these classifiers cast a democratic vote in which each of them is equal and carries the same weight, and the final result is aggregated from the outputs of all the weak classifiers. For example, suppose we have trained 50 weak classifiers in total. If, for a given sample, 35 of them output category 1 and 15 output category 0, then the result of the whole model is category 1.
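A tiny sketch of this voting step, with made-up votes just to illustrate the idea:

from collections import Counter

# 50 hypothetical weak classifiers have voted on one sample:
# 35 of them said category 1 and 15 said category 0
votes = [1] * 35 + [0] * 15
final_prediction = Counter(votes).most_common(1)[0][0]
print(final_prediction)  # 1, because category 1 received the majority of the votes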

Another point is that Bagging does not train every classifier on exactly the same fixed dataset; instead, each classifier is trained on a sample drawn with replacement. As a result, the distribution of samples differs across training sets: some samples may appear several times for the same classifier, while others do not appear at all.

Sample randomness alone is not enough, though, because the distributions of these training sets are still roughly the same: even with random sampling, the individual training sets will not differ much from one another. So, to make sure each classifier has its own emphasis and stronger randomness, we also work on the features and restrict each classifier to a random subset of them. This way the features of each classifier differ, its emphasis differs, and the effect of the randomness is strengthened as much as possible.
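Here is a small standalone illustration, with made-up sizes, of both kinds of randomness (it is not part of the model code that follows):

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 10

# Bootstrap: draw rows with replacement, so some rows repeat and some never appear
row_idx = rng.choice(n_samples, size=n_samples, replace=True)
# Feature subset: draw a few column indices without replacement
col_idx = rng.choice(n_features, size=4, replace=False)

print(len(set(row_idx)))  # typically around 63, since roughly 1/e of the rows are never drawn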

Two key points are hidden in the name of the random forest model: random and forest. We have already explained the randomness: on the one hand, the samples each classifier sees are random, and on the other hand, the features each classifier can use are random. Forest is also easy to understand: since the classifier we use is a decision tree, a model composed of many decision "trees" is naturally a forest.

With these two features, random forests are easy to understand and easy to implement, since we’ve implemented decision tree models several times before.

Code implementation

We choose CART, the most classic of the decision tree algorithms, to implement the decision trees, and we again use the breast cancer prediction data from the AdaBoost article last time.

First, as usual, we read in the data:

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

breast = load_breast_cancer()
X, y = breast.data, breast.target
y = y.reshape((-1, 1))

Then we use the train_test_split tool from the sklearn library to split the data into a training set and a test set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)

The core structure of the random forest is still the decision tree, and we can reuse the previous CART decision tree code with only a few minor changes. First come the function that calculates the Gini index and the function that evaluates the Gini index after splitting the data on a selected feature and threshold. These two functions do not require any changes and are copied from the previous article.
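For reference, if the samples at a node belong to classes with proportions $p_k$, the Gini index that gini_index computes is

$$\text{Gini} = 1 - \sum_k p_k^2,$$

which corresponds to the 1.0 minus the sum of (v / n) ** 2 terms in the code below.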

from collections import Counter

def gini_index(dataset):
    dataset = np.array(dataset)
    n = dataset.shape[0]
    if n == 0:
        return 0
    counter = Counter(dataset[:, -1])
    ret = 1.0
    for k, v in counter.items():
        ret -= (v / n) ** 2
    return ret


def split_gini(dataset, idx, threshold):
    left, right = [], []
    n = dataset.shape[0]
    # The new Gini index is calculated after the split
    for data in dataset:
        if data[idx] <= threshold:
            left.append(data)
        else:
            right.append(data)
    left, right = np.array(left), np.array(right)
    # Weight each half by its proportion of the samples
    return left.shape[0] / n * gini_index(left) + right.shape[0] / n * gini_index(right)

Then comes the splitting of the dataset, where we divide the data into two parts based on a feature and a threshold. The split_dataset function is similar to split_gini above, except that split_gini only computes the Gini index after the split, while split_dataset actually returns the split data.

from collections import defaultdict

def split_dataset(dataset, idx, thread):
    splitData = defaultdict(list)
    # Split into two groups according to the threshold: above it, or at or below it
    for data in dataset:
        splitData[data[idx] <= thread].append(np.delete(data, idx))
    return list(splitData.values()), list(splitData.keys())

Next are the function that gathers all candidate thresholds and the function that chooses the best feature and split threshold. The logic of these two functions is exactly the same as before, so let's copy them as well.

def get_thresholds(X, idx):
    # numpy multidimensional indexing
    new_data = X[:, [idx, -1]].tolist()
    # Sort by feature value
    new_data = sorted(new_data, key=lambda x: x[0], reverse=True)
    base = new_data[0][1]
    threads = []

    for i in range(1, len(new_data)):
        f, l = new_data[i]
        # Record the feature value whenever the label changes
        if l != base:
            base = l
            threads.append(f)

    return threads


def choose_feature_to_split(dataset):
    n = len(dataset[0]) - 1
    m = len(dataset)
    # Record the best Gini index, the chosen feature and its threshold
    bestGini = float('inf')
    feature = -1
    thred = None
    for i in range(n):
        threds = get_thresholds(dataset, i)

        for t in threds:
            # Go through all thresholds and calculate the Gini index for each
            gini = split_gini(dataset, i, t)
            if gini < bestGini:
                bestGini, feature, thred = gini, i, t
    return feature, thred

Then comes the tree-building part, where we make a small adjustment. Because each decision tree is trained on only part of the data and part of the features, it generally will not overfit, so I removed the pre-pruning logic that guarded against overfitting. The other change is the addition of a backup value: because the features and data are sparse, a node may end up with only one branch. To handle this, every node stores a backup, which is the most frequent category among the data at that node.

def create_decision_tree(dataset):
    dataset = np.array(dataset)
    # If all samples belong to the same class, return that class directly
    counter = Counter(dataset[:, -1])
    if len(counter) == 1:
        return dataset[0, -1]

    # If we have run out of features
    if dataset.shape[1] == 1:
        return counter.most_common(1)[0][0]

    # Record the feature and threshold of the best split
    fidx, th = choose_feature_to_split(dataset)

    node = {'threshold': th, 'feature': fidx, 'backup': counter.most_common(1)[0][0]}

    split_data, vals = split_dataset(dataset, fidx, th)
    for data, val in zip(split_data, vals):
        # val is the boolean key returned by split_dataset (True means data[fidx] <= th)
        node[val] = create_decision_tree(data)
    return node

Next come two utility functions: the first predicts a single sample with a single subtree, and the second predicts a batch of samples with a single subtree. The overall logic is the same as before, so let's just look at the code:

def subtree_classify(node, data):
    key = node['feature']
    pred = None
    thred = node['threshold']
    
    # If the corresponding branch does not exist, return the backup
    if (data[key] <= thred) not in node:
        return node['backup']

    if isinstance(node[data[key] <= thred], dict):
        pred = subtree_classify(node[data[key] <= thred], data)
    else:
        pred = node[data[key] <= thred]

    # To prevent pred from being empty, fall back to a leaf value in this node
    if pred is None:
        for key in node:
            if not isinstance(node[key], dict):
                pred = node[key]
                break
    return pred


def subtree_predict(node, X):
    y_pred = []
    # Traverse the data and predict sample by sample
    for x in X:
        y = subtree_classify(node, x)
        y_pred.append(y)
    return np.array(y_pred)

That is everything on the decision tree side, and once the decision tree is implemented, building the forest is very easy. It does just one thing: randomly sample the data and the features, build a new decision tree on that random subset, and record it.

First let's look at the function that draws the random samples and features:

def get_random_sample(X, y):
    rnd_idx = np.random.choice(range(X.shape[0]), 350, replace=True)
    ft_idx = np.random.choice(range(X.shape[1]), 15, replace=False)
    # Sort the sampled feature indices first so that the column order here
    # matches the order used when selecting features at prediction time
    ft_idx.sort()
    x_ret = X[rnd_idx][:, ft_idx]
    # Concatenate the labels onto X as the last column
    x_ret = np.hstack((x_ret, y[rnd_idx]))
    return x_ret, np.array(ft_idx)

In this function, we set the number of samples to be 350 and the number of features to be 15 each time we train a new decision tree. In this way, the randomness of the constructed decision tree is guaranteed.
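A quick sanity check of the shapes, assuming the X_train and y_train produced by the split above:

X_rnd, features = get_random_sample(X_train, y_train)
print(X_rnd.shape)     # (350, 16): 15 sampled feature columns plus the label column at the end
print(features.shape)  # (15,): the sorted original indices of the sampled features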

With the random sample out of the way, the code to build the forest is simple, creating the tree through a loop and recording it:

trees_num = 20
subtrees = []

for i in range(trees_num):
    X_rnd, features = get_random_sample(X_train, y_train)
    node = create_decision_tree(X_rnd)
    # Record the decision tree we built together with the features it uses
    subtrees.append((node, features))

We used random features when building each tree, and we need to use the same features when making predictions later. So we record which features each tree uses; since the decision tree stores features only by their index within the sampled subset, we sort the feature indices before recording them so that the columns line up again at prediction time.

Once these functions are implemented, all that is left is the predict method. Following Bagging's definition, for each sample we collect the prediction of every decision tree and then choose the most frequent one as the final result. The logic is not complicated, and I believe you can understand it just by reading the code.

def predict(X, trees):
    y_pred = []
    for x in X:
        ret = []
        for tree, features in trees:
            ret.append(subtree_classify(tree, x[features]))
        ret = Counter(ret)
        y_pred.append(ret.most_common(1)[0][0])
    return np.array(y_pred)

Finally, we verify the effect of the model on the test set:
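A minimal sketch of this evaluation using the functions defined above (the exact number will vary from run to run because of the random sampling):

y_pred = predict(X_test, subtrees)
accuracy = (y_pred == y_test.reshape(-1)).mean()
print('forest accuracy:', accuracy)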


The accuracy comes out to 65.7%, which does not look high and falls a bit short of what we expected. Let's take a look at how the individual trees in the forest perform:
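The per-tree accuracy can be computed roughly like this (again, the values will differ between runs):

for node, features in subtrees:
    y_tree = subtree_predict(node, X_test[:, features])
    print((y_tree == y_test.reshape(-1)).mean())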


You'll find that some trees are only a little over 30% accurate, not even reaching the theoretical value of 50%. There are several reasons for this. On the one hand, the randomness of our feature selection means we may pick features that are not very informative. On the other hand, it is also related to the amount of training data: in this example the number of usable samples is small, so the randomness is large and so is the deviation.

In addition, we can look at the result of calling the random forest from sklearn. We set the number of decision trees in the forest to 40 and choose the Gini index as the splitting criterion.
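A minimal sketch of that comparison under these settings (the random_state here is an arbitrary choice):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=40, criterion='gini', random_state=23)
clf.fit(X_train, y_train.reshape(-1))
print('sklearn accuracy:', clf.score(X_test, y_test.reshape(-1)))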


We can see that its accuracy is even lower than that of our own implementation, which is also quite normal. After all, no model is universal; there will always be scenarios where a given model does not apply or does not perform well.

Conclusion

In fact, the principle behind the random forest model is very close to AdaBoost's. We guarantee the model's randomness through random samples and random features, which ensures that each model trained this way is a weak classifier. Besides being weak, these classifiers have another important property: they are underfitted models by nature and essentially cannot overfit. By combining a large number of such weak classifiers, we achieve a powerful classification result.

Compared with AdaBoost, random forest has stronger randomness and depends more heavily on its parameters: the number of decision trees in the forest, the number of features each tree uses, the pruning strategy, and so on. All of these affect the final model. But it also has an advantage that AdaBoost lacks: its randomness makes it possible to discover feature combinations or peculiarities of the data distribution that conventional methods cannot find. This is why, with proper tuning, random forest can often achieve very good results.

This concludes our introduction to random forest. If you liked this article, please give it a like below; it gives me a little encouragement and also makes it easier for you to find more articles.

This article is formatted using MDNICE