The SkLearn project can be regarded as a big tree: the various estimators are the fruit, and the backbone supporting them is a handful of base classes. Common ones include BaseEstimator, BaseSGD, ClassifierMixin, RegressorMixin, and so on.

The API reference page of the official documentation lists the main API interfaces, so let's start with the Base classes.

In this installment, we only study BaseEstimator, ClassifierMixin, RegressorMixin, and TransformerMixin. BaseSGD is a big topic that needs to be studied in a separate issue.

BaseEstimator

At the bottom of the hierarchy is the BaseEstimator class. It exposes two main methods: set_params and get_params.

get_params

This method gets the parameters of an object and returns them as a dictionary of the form {parameter: parameter value}. If the get_params argument deep is set to True, the parameters of child objects (which are themselves estimators) are also returned, if any. Let's take a closer look at the implementation details of this method:

To save space, I will omit the unimportant comments; the same will be done throughout the rest of the article unless otherwise noted.

getattr(object, name[, default]) returns the named attribute of object, or default if the attribute does not exist. The task of L200~208 is to determine whether self (typically an estimator instance) has the attribute key; if so, its value is returned, otherwise the value is set to None.

Why is it so complicated? The whole dance is essentially equivalent to value = getattr(self, key, None).

If the user sets deep=True and the value object implements get_params (i.e. it is itself an estimator), its get_params is called recursively and the resulting key-value pairs are written into the dictionary, with nested keys in the 'component__parameter' form. The whole function returns a dictionary.
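To make the flow concrete, here is a simplified sketch of get_params along these lines (trimmed for clarity; the real code in sklearn/base.py has a few more details):

def get_params(self, deep=True):
    out = dict()
    for key in self._get_param_names():
        # getattr(object, name[, default]) -> None if the attribute is absent
        value = getattr(self, key, None)
        if deep and hasattr(value, 'get_params'):
            # recurse into sub-estimators; nested keys are written
            # in the 'component__parameter' form
            deep_items = value.get_params().items()
            out.update((key + '__' + k, val) for k, val in deep_items)
        out[key] = value
    return out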

(3) Let’s take a quick look at how this method is used, and then continue to track the implementation of the source code.

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=0)
X = [[ 1,  2,  3],   # 2 samples, 3 features
     [11, 12, 13]]
y = [0, 1]  # classes of each sample
clf.fit(X, y)

We simply instantiated a random forest classifier object. Let's see what get_params returns:

clf.get_params()

{'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 10, 'n_jobs': None, 'oob_score': False, 'random_state': 0, 'verbose': 0, 'warm_start': False}

Obviously, this is the default parameter scheme for the random forest classifier.

Notice the loop header: for key in self._get_param_names():

Again, in a large Python project like Sklearn, many of the exposed methods are essentially shells: they repackage work done elsewhere and hand it to the caller. get_params, for example, doesn't really collect the estimator's parameter names itself, because _get_param_names does that work for it.

The @classmethod decorator tells us directly that this method operates on the class itself, not on an instance object.

This function performs a number of checks; the actual parameters come from inspect.signature(init).parameters.values(), and finally the name attribute of each parameter object is collected into a sorted list.
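A minimal sketch of the idea (the real _get_param_names adds more validation, e.g. rejecting *args in __init__; RandomForestClassifier is just my example class):

import inspect
from sklearn.ensemble import RandomForestClassifier

# Read the signature of __init__ and keep the name of every
# explicit keyword parameter, excluding self and **kwargs.
init_signature = inspect.signature(RandomForestClassifier.__init__)
names = sorted(
    p.name
    for p in init_signature.parameters.values()
    if p.name != 'self' and p.kind != p.VAR_KEYWORD
)
print(names)  # ['bootstrap', 'class_weight', 'criterion', ...]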

set_params

This method sets parameters. Normally, we fix the parameters when initializing the estimator, but sometimes there is a need to change them afterwards, which can be done manually by calling set_params. More often, though, this method is called by other classes that build on BaseEstimator.

Specifically, let’s look at the implementation details:

The implementation supports nested parameters (the component__parameter form), but let's not get bogged down; look straight at L251, setattr(self, key, value), which assigns the new value to the key attribute of the estimator.

Examples of application:
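For example (the Pipeline half is my own addition, to show the nested component__parameter form mentioned above):

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

clf = RandomForestClassifier(random_state=0)
clf.set_params(n_estimators=50)          # temporarily change one parameter
print(clf.get_params()['n_estimators'])  # 50

# Nested form: <component>__<parameter>
pipe = Pipeline([('scale', StandardScaler()),
                 ('forest', RandomForestClassifier(random_state=0))])
pipe.set_params(forest__n_estimators=20)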

ClassifierMixin

A mixin is a class that adds extra methods to other classes. Sklearn's classification and regression mixins implement only the score method; any class that inherits from them must implement fit, predict, and the other methods itself.

A mixin class is essentially just a parent class, though a little different from an ordinary one: sklearn's mixins additionally define a class attribute, _estimator_type. There is no room to discuss mixins in depth here; if you are interested, read this discussion: What is a mixin, and why are they useful?

As you can see, the implementation of this mixin class is very simple: it evaluates the accuracy of the predicted values against the true values and returns a float. Note that the predicted values come from self.predict(), so classes that inherit from the mixin must implement a predict method themselves, or an error will be raised. This detail will not be repeated later.
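The whole class boils down to something like this sketch (simplified from sklearn/base.py):

from sklearn.metrics import accuracy_score

class ClassifierMixin:
    _estimator_type = "classifier"

    def score(self, X, y, sample_weight=None):
        # predict() must be provided by the inheriting class
        return accuracy_score(y, self.predict(X),
                              sample_weight=sample_weight)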

Again, the classification mixin is carrying the fruits of another function's labor, so let's study the implementation details of accuracy_score.

For the sake of brevity, we'll skip the code between L185 and L189 and return to it in a later article devoted to classification metrics. Look directly at L191, y_true == y_pred. This simple vectorized notation avoids a for loop: it quickly checks whether each pair of elements in the two arrays is equal and returns an array of True/False. L193 wraps the comparison result up as the score.

  • L116: If the normalize parameter is set to True, the mean of the score list is taken, i.e. number of correctly predicted samples / total number of samples = prediction accuracy
  • L118: If a sample_weight is given, each sample's score is weighted accordingly to produce the final prediction accuracy
  • L121: If neither of the above settings is present, the number of correctly predicted samples is returned directly. Note: sklearn's score method returns the prediction accuracy by default, not the number of correctly predicted samples. See the small example after this list.
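A small example of the three branches (values worked out by hand):

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])

accuracy_score(y_true, y_pred)                   # 0.75 (normalize=True is the default)
accuracy_score(y_true, y_pred, normalize=False)  # 3, the count of correct samples
accuracy_score(y_true, y_pred,
               sample_weight=[1, 1, 10, 1])      # (1+1+0+1)/13 ≈ 0.231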

RegressorMixin

Not surprisingly, the regression mixin also implements only the score method. The core mathematical principle is the R² value (the coefficient of determination). The formula is R² = 1 - sum((y_true - y_pred)**2) / sum((y_true - y_true_mean)**2). Intuitively, this is the ratio of the deviation of the predictions from the true values to the deviation of the true values from their own mean. The maximum value is 1, indicating a perfect prediction; when the value is 0, the model has no predictive ability (and it can even go negative for a model worse than always predicting the mean).

The score method calls the r2_score function of the metrics module and returns a float. Let's look at r2_score, which is by far the most complex function we've met, piece by piece.
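Before diving in, a quick sanity check that the formula above and r2_score agree (the numbers are the classic example from the sklearn documentation):

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

# R^2 = 1 - sum((y_true - y_pred)**2) / sum((y_true - y_true.mean())**2)
numerator = ((y_true - y_pred) ** 2).sum()
denominator = ((y_true - y_true.mean()) ** 2).sum()
print(1 - numerator / denominator)  # 0.9486...
print(r2_score(y_true, y_pred))     # same value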

Check the incoming object

L577 calls check_consistent_length to check that the input labels, output labels, and weights all have the same length. The check itself is very simple: compute the length of each object, then count the number of distinct length values; if it is greater than 1, the objects differ in length, and an error is raised as a warning.
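The idea, in sketch form (the real helper measures length with a dedicated _num_samples function rather than bare len):

import numpy as np

def check_consistent_length(*arrays):
    # collect the length of every non-None input and make sure
    # there is only one distinct value among them
    lengths = [len(X) for X in arrays if X is not None]
    if len(np.unique(lengths)) > 1:
        raise ValueError("Found input variables with inconsistent "
                         "numbers of samples: %r" % lengths)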

L575 calls the _check_reg_targets method to check whether the passed parameters are valid.

This function is a bit longer, but it does the following things:

  • L83~95 perform checking and format conversion.
  • L97~114 check whether the input multioutput matches y_true: if the true label array is one-dimensional, any multioutput setting other than None is clearly illegal; and when the true label array has more than one dimension, an error or warning is raised if it does not match multioutput.
  • L115 determines the label type according to the dimension of y_true: either 'continuous' or 'continuous-multioutput'. Note: multioutput can be a string, an array, or (for backward compatibility) None, so this argument is very flexible. It will come up again when we study specific algorithms later, so we won't get too entangled here.

Check sample size and weight coefficient

Moving on to the r2_score implementation:

(3) L579~582 check the number of samples of the predicted values. If there are fewer than 2 samples, an error is raised, because the coefficient of determination (i.e. R²) requires at least 2 samples.

(4) L584~588 handle the weight coefficients

  • L585 calls np.ravel() to flatten the weight array to one dimension
  • L586 expands the dimensions of sample_weight: one dimension becomes two, two become three, and so on. It's worth noting that np.newaxis expands along a different axis depending on where it is placed; see the small example after this list
  • L588: if no weight coefficients are passed, the default weight is set to 1
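Here is the small np.newaxis example promised above:

import numpy as np

w = np.array([1.0, 2.0, 3.0])  # shape (3,)

w[:, np.newaxis]  # shape (3, 1): a column vector
# array([[1.],
#        [2.],
#        [3.]])

w[np.newaxis, :]  # shape (1, 3): a row vector
# array([[1., 2., 3.]])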

Calculation details of R²

(5) Construct the numerator and denominator

(6) Calculate the score of each sample

  • L595~596 record the indices of the nonzero elements of the denominator and numerator arrays.
  • L597 records the indices of the samples whose numerator and denominator are both nonzero. If you're not familiar with this, the quick example after this list should help.
  • L598~599 create an array of ones with the same length as the true labels, then compute the actual R² value at the valid index positions.
  • L603 sets the score to 0 at the index positions where the denominator is 0 (but the numerator is not). Any other constant would also do here; it has no impact when comparing models on the same regression task.
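And the promised quick example of the masking logic (my own toy numbers, mimicking the L595~603 steps):

import numpy as np

numerator = np.array([0.0, 1.5, 0.0, 2.0])
denominator = np.array([0.0, 3.0, 4.0, 0.0])

nonzero_numerator = numerator != 0      # [False,  True, False,  True]
nonzero_denominator = denominator != 0  # [False,  True,  True, False]
valid_score = nonzero_numerator & nonzero_denominator  # only index 1

output_scores = np.ones(4)  # default score of 1
output_scores[valid_score] = 1 - (numerator[valid_score]
                                  / denominator[valid_score])
# numerator != 0 but denominator == 0 -> score forced to 0
output_scores[nonzero_numerator & ~nonzero_denominator] = 0.0
print(output_scores)  # [1.  0.5 1.  0. ]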

(7) Determine the weight given to each output's score according to the multioutput parameter (see the example after this list)

  • L605~607: if raw_values is specified, the score of each output is returned as-is
  • L608~610: if uniform_average is specified, avg_weights is set to None, which distributes the weights evenly
  • L611~612: if variance_weighted is specified, the denominator is used directly as the weight
  • L614~618 handle the case of constant y values or a one-dimensional array: if the denominator is all zeros, then return 1 directly if the numerator is also all zeros, otherwise return 0
  • L620: if multioutput is not a string, it is used directly as the final weighting coefficients
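The four branches in code form (the multioutput values are the ones accepted by the public r2_score API; the data is from the sklearn documentation):

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([[0.5, 1.0], [-1.0, 1.0], [7.0, -6.0]])
y_pred = np.array([[0.0, 2.0], [-1.0, 2.0], [8.0, -5.0]])

r2_score(y_true, y_pred, multioutput='raw_values')        # one score per output
r2_score(y_true, y_pred, multioutput='uniform_average')   # plain mean of the scores
r2_score(y_true, y_pred, multioutput='variance_weighted') # denominator as weight
r2_score(y_true, y_pred, multioutput=[0.3, 0.7])          # array used as weights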

(8) Return score

return np.average(output_scores, weights=avg_weights)

If uniform_average is specified, avg_weights is set to None; in numpy.average, when weights is None the computation reduces to a simple mean().

TransformerMixin

The implementation of this mixin class is relatively simple, relying entirely on the fit and transform methods implemented by the class that uses it. Its fit_transform decides between the supervised and unsupervised call depending on whether a label y is passed. We will discuss it in detail when we meet it later.
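For reference, fit_transform is roughly this (simplified sketch):

class TransformerMixin:

    def fit_transform(self, X, y=None, **fit_params):
        if y is None:
            # unsupervised task: fit on X alone
            return self.fit(X, **fit_params).transform(X)
        else:
            # supervised task: labels are passed through to fit
            return self.fit(X, y, **fit_params).transform(X)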

Supplement

When we studied the classification and regression mixin classes, we came across the variable _estimator_type. Its specific function is to judge whether an estimator is used for a classification task or a regression task (helpers like is_classifier check exactly this attribute).


If there are any mistakes, feel free to point them out in the comments.