Preface

This article introduces logistic regression, one of the most important foundations of machine learning, covering conditional probability, the logistic regression objective function, gradient descent, stochastic gradient descent, and a case study of predicting whether a bank customer will open an account.

I. Conditional probability in logistic regression

1. Application of logistic regression

Some classic binary classification problems:

  • Predicting loan defaults (will default/will not default)
  • Sentiment analysis (positive/negative)
  • Predicting click-through rates (yes/no)
  • Prediction of disease (positive/negative)

As follows:

All of these binary classification problems can be solved with logistic regression. Logistic regression is the "go-to tool" for binary classification: it is simple, practical, and the most widely used model in online systems.

2. Understanding the baseline

The baseline is very important in AI modeling. Models can be built from simple to complex: a baseline is a simple model used to verify the feasibility of the approach and to provide a point of reference for later models. Put simply, in the model design stage we first assemble the system quickly with simple methods, then gradually refine each module to obtain better and better solutions. For classification tasks, logistic regression can be regarded as the best and most reliable baseline, so when building any classification model, consider logistic regression first and then gradually try more complex models.

As follows:

It can be seen that a model's performance generally has an upper limit. From the baseline you can roughly estimate that upper limit, weigh the cost against the benefit of a more complex model, and decide both whether to adopt a more complex model and whether to proceed with the current project; different application scenarios have their own thresholds. With a baseline as a reference, when you test your model you can also locate errors more easily if the results deviate from your expectations. Therefore, when doing AI modeling, it is important to start from the baseline.

In practical work, it is not advisable to start with a complex model. In fact, newcomers to the AI industry often make this mistake: they want to start with deep learning right away, such as BERT or deep reinforcement learning. This is not the right approach.

3. Classification problems

Examples of classification problems are as follows:

As you can see, this problem is in fact a binary classification problem.

The classification rule for a binary classification problem is as follows:

The core issue here is: how to use the conditional probability p(y|x) to describe the relationship between x and y. Logistic regression is in fact built up step by step on top of the linear regression model, so the hope here is also to construct the conditional probability p(y|x) from the linear regression equation.

Properties required for a valid probability:

To transform the linear regression output $w^T x + b$ so that its range maps to the interval (0, 1), we need to map values from (-∞, +∞) into (0, 1). As long as there is a way to do that, the result can be interpreted as a probability. The answer is the logistic function.

4. The logistic function

The form and visualization of the logistic function are as follows:
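For reference, the logistic (sigmoid) function, implemented later in this article as the `sigmoid` function in code, has the standard form:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$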

As you can see, the logistic function maps the interval (-∞, +∞) to (0, 1), so that the relationship between y and x can be expressed as a probability.

The logistic function is used very widely, especially in neural networks, largely because of an indispensable property: it maps values from any interval into the interval (0, 1). Such values can be used either as probabilities or as weights. In addition, most models involve computing derivatives during training, and the derivative of the logistic function has a very simple form, which makes it very popular.

5. Conditional probability of a sample

Combining the linear regression expression with the logistic function gives the expression for the conditional probability:
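Written out (a standard formulation consistent with the sigmoid-based code later in this article), the two class probabilities are:

$$p(y=1 \mid x) = \sigma\left(w^{T} x + b\right), \qquad p(y=0 \mid x) = 1 - \sigma\left(w^{T} x + b\right)$$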

where y can take the value 0 or 1.

Logistic regression deals with binary classification, so a sample must belong to one of the two classes. This means that the conditional probabilities p(y=1|x) and p(y=0|x) must sum to 1.

The expressions for the two classes can be combined as follows:
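A standard way to write the combined form in this notation is:

$$p(y \mid x) = \sigma\left(w^{T} x + b\right)^{y}\left[1 - \sigma\left(w^{T} x + b\right)\right]^{1-y}, \quad y \in \{0, 1\}$$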

II. Objective function of logistic regression

1. Maximum Likelihood Estimation

The process of training a model is the process of optimizing an objective function, and one of the most important methods for constructing one is maximum likelihood estimation (MLE). MLE is the most commonly used way to construct objective functions in machine learning. Its core idea is to estimate unknown parameters from observed results; it guides us in constructing the model's objective function and in finding the parameter values that maximize or minimize it.

A more abstract interpretation of maximum likelihood estimation: if there is an unknown model (think of it as a black box) that produces many observed samples, we can work backwards to the optimal parameters of the model by maximizing the probability of those samples. This process is called maximum likelihood estimation.

An illustration is as follows:

Maximum likelihood estimation estimates the unknown parameter θ by working backwards from the observed samples. It is a methodology and can be understood as a framework for problem solving.

Examples are as follows:

Suppose there is a biased coin whose probabilities of landing heads and tails are different. Let the probability of heads be θ, and use H for heads and T for tails. After tossing the coin 6 times we obtain the results H, T, T, H, H, H, and we assume the tosses are independent of each other. What is θ?

The process of solving parameters through maximum likelihood estimation is as follows:

It is necessary to estimate the unknown parameter. Assuming that the observed samples were generated under the parameter θ, the value of the unknown parameter can be derived by maximizing the probability of the observed samples. This is the core idea of maximum likelihood estimation, which turns estimation into an optimization problem.
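For the coin example above (four heads and two tails in six independent tosses), the likelihood and its maximizer work out to:

$$L(\theta) = \theta^{4}(1-\theta)^{2}, \qquad \frac{d}{d\theta} \log L(\theta) = \frac{4}{\theta} - \frac{2}{1-\theta} = 0 \;\Rightarrow\; \hat{\theta} = \frac{4}{6} = \frac{2}{3}$$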

2. Likelihood function of logistic regression

The conditional probability of a single sample has been defined above, and this probability can be thought of as its likelihood. Taking all samples into account gives the likelihood of the whole data set.

The likelihood of all samples is as follows:
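Assuming the n samples are independent, the likelihood is the product of the per-sample conditional probabilities, in the notation used above:

$$L(w, b) = \prod_{i=1}^{n} p\left(y_{i} \mid x_{i}\right) = \prod_{i=1}^{n} \sigma\left(w^{T} x_{i} + b\right)^{y_{i}}\left[1 - \sigma\left(w^{T} x_{i} + b\right)\right]^{1-y_{i}}$$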

After obtaining the likelihood of all samples, the goal is to find the model parameters (for logistic regression, w and b) that maximize it; this process is called maximum likelihood estimation.

3. Maximum likelihood estimation of logistic regression

The objective function to be maximized is as follows:

We need to find w and b to maximize our objective function. This is an optimization problem.

After obtaining the objective function of logistic regression, the first simplification is to rewrite the product as a sum, and then turn maximization into a minimization problem:
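In the notation above (and matching the cross-entropy loss computed by the code later in this article), the resulting objective is:

$$\min_{w, b}\; -\sum_{i=1}^{n}\left\{y_{i} \log \sigma\left(w^{T} x_{i} + b\right) + \left(1 - y_{i}\right) \log \left[1 - \sigma\left(w^{T} x_{i} + b\right)\right]\right\}$$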

This helps greatly with subsequent computation, because the product of many probabilities easily becomes an extremely small number, causing underflow and similar problems. When we see inf, nan, and the like, we may need to check for underflow, overflow, or division by zero.

At this point, the expression for the conditional probability needs to be substituted in, and an optimization method is used to solve for the optimal w and b, as follows:

After a series of simplifications, no further simplification is possible, and all that remains is to find the parameters w and b that minimize the objective function.

III. Gradient descent method

1. Finding the minimum or maximum of a function

There are two main approaches to finding the maximum/minimum of a function and the corresponding optimal parameters (such as w and b):

  • Set the derivative to 0: take the derivative of the objective function with respect to the parameters to be solved, set it to 0, and solve for the optimal parameters. This is also called an analytic (closed-form) solution. Note that not all objective functions can be solved this way; a classic example is the objective function of logistic regression. You can try setting its derivative to 0, and you will find that it does not work.
  • The other, more general approach is an iterative optimization algorithm; the logistic regression objective function is solved with such a loop-based iterative algorithm. The most classic one is gradient descent.

The gradient descent method is an iterative algorithm: it cannot produce the optimal solution immediately, but keeps updating the parameters in a loop, approaching the optimal solution step by step.

The schematic diagram of gradient descent method is as follows:

2. Gradient descent

Suppose there is a function f(x) and we want to find the parameter x that minimizes f(x). Gradient descent solves for the optimal x as follows:
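The standard gradient descent update rule (the same form used for the parameter updates later in this article) is:

$$x_{t+1} = x_{t} - \eta\, f^{\prime}\left(x_{t}\right)$$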

Here, η is the learning rate, an important parameter of the gradient descent method. It controls the step size of each update and can be regarded as an adjustable parameter (also known as a hyperparameter). The learning rate has an important influence on convergence and on the final result.

Gradient descent is very practical; almost all models can be trained with it. This shows that the method is universal, no matter how complex the problem, and it plays an irreplaceable role especially in deep learning: the back-propagation algorithm is essentially gradient descent. It is important to note that gradient descent depends on taking derivatives.

Examples are as follows:

Here’s another example:
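As a small illustrative sketch (not one of the original examples; the function f(x) = (x - 3)^2 and the settings are chosen arbitrarily), gradient descent can be written in a few lines of Python:

def f(x):
    # a simple convex function whose minimum is at x = 3
    return (x - 3) ** 2

def grad_f(x):
    # derivative of f
    return 2 * (x - 3)

def gradient_descent(x0, eta, num_steps=50):
    # repeatedly step against the gradient with step size (learning rate) eta
    x = x0
    for _ in range(num_steps):
        x = x - eta * grad_f(x)
    return x

print(gradient_descent(x0=0.0, eta=0.1))   # converges close to 3
print(gradient_descent(x0=0.0, eta=1.1))   # learning rate too large: the iterates diverge

With η = 0.1 the iterates approach the minimum at x = 3, while with η = 1.1 each step overshoots and the iterates diverge, which previews the learning-rate analysis below.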

Now further analyze the influence of learning rate:

The learning rate is relatively small

The step size is small, the parameters update slowly, and convergence is slow:

The learning rate is relatively large

When the step size is large, the parameters update quickly; but if the learning rate is especially large, the algorithm becomes unstable and may fail to converge:

Therefore, it is necessary to find an appropriate learning rate to ensure both fast learning and convergence.

In the gradient descent process, there are two criteria for judging whether the iteration has converged:

  • The loss function does not change, or changes very little, between two adjacent iterations
  • The parameter values do not change, or change very little, between two adjacent iterations

3. Derivative of the logistic function

The core of gradient descent is differentiation. The objective function of logistic regression has a certain complexity and involves the derivative of the logistic function, so we first derive the derivative of the logistic function:
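Writing $\sigma(z) = \frac{1}{1+e^{-z}}$ and differentiating gives the well-known identity:

$$\sigma^{\prime}(z) = \frac{e^{-z}}{\left(1+e^{-z}\right)^{2}} = \sigma(z)\left[1 - \sigma(z)\right]$$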

4. Logistic regression gradient descent method

Now use gradient descent to solve logistic regression, that is, minimize $-\sum_{i=1}^{n}\left\{y_{i} \log \sigma\left(w^{T} x_{i}+b\right)+\left(1-y_{i}\right) \log \left[1-\sigma\left(w^{T} x_{i}+b\right)\right]\right\}$. Logistic regression has two sets of parameters, w and b, and gradient descent depends on differentiation, so the derivation revolves around these two sets of parameters and uses the chain rule for composite functions.

First, solve the updating formula of parameter W:

Then solve the updating formula of parameter B:

The final result is:
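The resulting gradients and update rules (the same ones implemented as grad_w and grad_b in the code below) take the standard form:

$$\frac{\partial L}{\partial w} = \sum_{i=1}^{n}\left[\sigma\left(w^{T} x_{i}+b\right)-y_{i}\right] x_{i}, \qquad \frac{\partial L}{\partial b} = \sum_{i=1}^{n}\left[\sigma\left(w^{T} x_{i}+b\right)-y_{i}\right]$$

$$w \leftarrow w - \eta \frac{\partial L}{\partial w}, \qquad b \leftarrow b - \eta \frac{\partial L}{\partial b}$$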

There are two cases of optimal solution, namely global optimal solution and local optimal solution, as follows:

For a convex function there is only one, global, optimal solution, and the solution found by gradient descent is the global optimum. For a non-convex function, however, there is one global optimum and possibly several local optima, so the solution found is not guaranteed to be the global optimum. Determining whether a function is convex is the domain of convex optimization.

The objective function of logistic regression is convex, so the solution found is the global optimum. Therefore, when using gradient descent to solve logistic regression, no matter how the parameters are initialized, the result is the same: because the objective function is convex, it has only a global optimum, and every initialization converges to the same point.

IV. Stochastic gradient descent method

1. Disadvantages of gradient descent method

It can be seen from the parameter update formula that the amount of computation is related to the amount of data n; to be precise, the computational complexity is linear in the number of samples. When the data set is large, the computation per iteration is obviously large and each iteration is expensive, so the efficiency of parameter updates drops. For example, with a million samples in the data set, each parameter update requires looping through all the samples and summing their gradients.

2. Stochastic gradient descent

Stochastic gradient descent (SGD) can be seen as the extreme case of gradient descent. In the gradient descent method, each parameter update depends on all samples. However, in stochastic gradient descent, each iteration no longer depends on the sum of the gradients of all samples, but on the gradient of only one of them.

The process of stochastic gradient descent is as follows:
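Each iteration picks one sample $i$ at random and updates the parameters with its gradient alone; for logistic regression this takes the form:

$$w \leftarrow w - \eta\left[\sigma\left(w^{T} x_{i}+b\right)-y_{i}\right] x_{i}, \qquad b \leftarrow b - \eta\left[\sigma\left(w^{T} x_{i}+b\right)-y_{i}\right]$$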

Compared with gradient descent, stochastic gradient descent randomly chooses only one sample per iteration instead of using all samples. It obtains a gradient in a very cheap way and updates the parameters frequently, so updates are faster, which helps obtain a converged result in a shorter time.

The convergence efficiency of SGD is usually higher, and sometimes its final solution is even better, with higher accuracy than that of GD. However, each update relies on only a single sample, that is, the gradient of one sample is used to estimate the sum of the gradients of all samples, so the gradient is noisy and not very stable, and the loss function fluctuates as it decreases. But as long as the overall trend is downward, the direction is right. To partially mitigate this problem, the learning rate is usually set to a small value, which effectively reduces the instability caused by the noisy gradient estimate.

3. Mini-batch gradient descent

In fact, gradient descent and stochastic gradient descent can be regarded as two extremes: the former considers all samples when computing the gradient, while the latter considers only one. Each therefore has its own advantages and disadvantages.

A compromise can be made between the two, absorbing the advantages of both while avoiding their disadvantages. This is mini-batch gradient descent, which randomly selects a subset of samples each time, computes the gradient on that subset, and updates the parameters. It ensures both speed and stability. The batch size m selected each time is a hyperparameter and can be set manually, for example to 32, 50, 64, or 100.

4. Implementing logistic regression from scratch with mini-batch gradient descent

The implementation is as follows:

# -*- coding: utf-8 -*-
"""
@author: Corley
@Time: 2022-02-21 13:25
@project: NLPDevilCamp-MgD mini-batch gradient descent
"""

# import the required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression


# implement the sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))


# compute the log likelihood
def log_likelihood(X, y, w, b):
    """Compute the negative log likelihood (cross-entropy loss) over all samples; the smaller, the better.
    X: training data (feature vectors), size N * D
    y: training labels, one-dimensional vector of length N
    w: model weights, one-dimensional vector of length D
    b: model bias, scalar
    """
    # first extract the indices of the positive and negative samples from the labels
    pos, neg = np.where(y == 1), np.where(y == 0)
    # loss of the positive samples; matrix operations are used here,
    # looping over every sample would be inefficient
    pos_sum = np.sum(np.log(sigmoid(np.dot(X[pos], w) + b)))
    # loss of the negative samples
    neg_sum = np.sum(np.log(1 - sigmoid(np.dot(X[neg], w) + b)))
    # return the cross-entropy loss
    return -(pos_sum + neg_sum)


# implement the logistic regression model
def logistic_regression_minibatch(X, y, num_steps, learning_rate):
    """Train logistic regression with mini-batch gradient descent.
    X: training data (feature vectors), size N * D
    y: training labels, one-dimensional vector of length N
    num_steps: number of gradient descent iterations
    learning_rate: step size
    """
    w, b = np.zeros(X.shape[1]), 0
    for step in range(num_steps):
        # randomly sample a mini-batch of size 100
        batch = np.random.choice(X.shape[0], 100)
        X_batch, y_batch = X[batch], y[batch]
        # compute the error between the predicted values and the true labels
        error = sigmoid(np.dot(X_batch, w) + b) - y_batch
        # compute the gradients of w and b
        grad_w = np.matmul(X_batch.T, error)
        grad_b = np.sum(error)
        # gradient update for w and b
        w = w - learning_rate * grad_w
        b = b - learning_rate * grad_b
        # every once in a while, compute the log likelihood to check the progress;
        # normally it gradually decreases and eventually converges
        if step % 10000 == 0:
            print(log_likelihood(X, y, w, b))
    return w, b


if __name__ == '__main__':
    # randomly generate sample data: a binary classification problem
    # with 5000 samples per class
    num_observations = 5000
    x1 = np.random.multivariate_normal([0, 0], [[1, .75], [.75, 1]], num_observations)
    x2 = np.random.multivariate_normal([1, 4], [[1, .75], [.75, 1]], num_observations)
    X = np.vstack((x1, x2)).astype(np.float32)
    y = np.hstack((np.zeros(num_observations), np.ones(num_observations)))
    print(X.shape, y.shape)
    # visualize the data
    plt.figure(figsize=(12, 8))
    plt.scatter(X[:, 0], X[:, 1], c=y, alpha=.4)
    plt.show()

    w, b = logistic_regression_minibatch(X, y, num_steps=500000, learning_rate=5e-4)
    print("The parameters w and b of the customized logistic regression are:", w, b)

    # call the sklearn implementation directly on the same data; if the results
    # are close to the hand-written version, the implementation is correct
    # C is set to a large value, which effectively disables regularization
    clf = LogisticRegression(fit_intercept=True, C=1e15)
    clf.fit(X, y)
    print("The parameters w and b of the sklearn logistic regression are:", clf.coef_[0], clf.intercept_[0])

Output:

(10000, 2) (10000,)
6456.079337492904
252.61723903340507
207.2599674752204
190.2482118365183
181.2258806322247
175.70250837511242
172.1469799496025
169.42402800140385
167.48050073469278
165.99127124664312
164.87442305886168
163.92801766677047
163.24327756952343
162.63901476606844
162.14005293987447
161.69212583923496
161.35096187895385
161.22873475722486
160.81951964316465
160.605551968888
160.44293658603038
160.30606827535223
160.1363754760282
160.01920664563855
159.94308983490592
159.96897449656237
159.80386842353911
159.709997638158
159.69918151092125
159.61997882719152
159.56199544781992
159.5750519330163
159.6000108282866
159.47238691041457
159.4387873972991
159.4373229495451
159.388153955921
159.4263532183712
159.34796773634417
159.33595853070108
159.32089313081423
159.36020474342533
159.3174337952185
159.2967384108
159.29075249806633
159.31540388014997
159.2748102227836
159.3116809830563
159.2738219648989
159.29578895200527
The parameters w and b of the customized logistic regression are: [-4.69480583  7.60232553] -12.91508509920719
The parameters w and b of the sklearn logistic regression are: [-4.77246524  7.72230433] -13.126019068675237


It can be seen that the logistic regression implemented from scratch with mini-batch gradient descent gives results close to those of the scikit-learn library.

5. Comparison between different algorithms

The comparison of the three gradient descent algorithms is as follows:

See zhuanlan.zhihu.com/p/25765735 for more details.

A summary of the various gradient descent methods:

  • In practical applications, the most commonly used method is mini-batch gradient descent

    • GPUs are good at parallel computation, and mini-batches pair very well with them: a small batch of samples can be processed by the GPU in parallel, which improves computational efficiency. This makes mini-batch gradient descent the most practical variant.
    • Mini-batch gradient descent is a compromise between gradient descent and stochastic gradient descent, so it largely mitigates the problem of gradient noise and its updates are more stable.
  • Stochastic gradient descent and mini-batch gradient descent also help with the saddle point problem. A saddle point looks like this:

    A saddle point may be mistaken for an optimum. Near a saddle point, stochastic gradient descent may perform better, because the noise introduced by its randomness can make the next update jump away from the saddle point, whereas the stability of plain gradient descent may leave it stuck near the saddle point.

V. Case: Predict whether a bank customer will open a time deposit account

1. Problem description

A concrete question: based on a customer's information (such as age, education level, marital status, etc.), predict whether the customer will need to open a time deposit account in the future. This kind of problem is very common in banking: banks hope to identify potential customers and offer them relevant services precisely. It is a classic binary classification problem (the prediction is yes or no) and is well suited to logistic regression.

The case will involve the following aspects:

  • Data understanding and analysis, in particular analyzing which features have a larger impact on the result
  • One-hot encoding: categorical features need to be converted into one-hot form
  • Use of logistic regression
  • Use of precision, recall, and F1-score

2. Data understanding

Understanding the data is the first step in solving the problem. First we examine the data from various angles, for example which features have better predictive power. Below are some snapshots of the data and descriptions of the fields.

The following is an example:

Each field and its meaning are as follows:

First check the basic data information, the code is as follows:

import pandas as pd

# load the raw data (the banking.csv file from the project's data directory)
data = pd.read_csv('data/banking.csv')

# drop rows with missing values
data = data.dropna()

# convert the label to 0/1
data['y'] = data['y'].apply(lambda x: 1 if x == 'yes' else 0)

display(data.shape, list(data.columns))

Output:

(41188, 21)
['age', 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed', 'y']

Check the class proportions:

# check the class ratio,
# i.e. the proportions of positive and negative samples
count_no_sub = len(data[data['y'] == 0])   # number of negative samples
pct_of_no_sub = count_no_sub / len(data)   # proportion of negative samples
pct_of_sub = 1 - pct_of_no_sub             # proportion of positive samples
print('Percentage of unopened accounts: %.2f%%' % (pct_of_no_sub * 100))
print('Percentage of opened accounts: %.2f%%' % (pct_of_sub * 100))

Output:

Percentage of unopened accounts: 88.73%
Percentage of opened accounts: 11.27%

As can be seen from the data, the proportion of opened accounts is much smaller than that of unopened accounts. The data distribution is imbalanced, but still within a normal range; if the imbalance were severe, it would need special handling.

For each feature, some analysis can also be done to see which features have a more important effect on the final prediction; this can be obtained through visualization.

Job field

Marital Status fields

Education field

Day of Week field

From the figures we can determine whether the different values of a field are related to the prediction of y: if different values of the field lead to different predicted outcomes, the field has an influence on y.

As can be seen, the distribution differs across values of the Job field. Retirees and students have a larger proportion of purchases, so the Job field has an influence on y.

Likewise, the purchase proportion for the single and unknown categories is larger, so marital status is also related to purchase intention.

Education also provides useful information about buying intentions;

However, the Day of Week fields are relatively evenly distributed, which is not very helpful for predicting purchases.

In this way, each field can be analyzed and its role understood more clearly. Fields that help prediction can be explored in more depth, and unimportant fields can be removed.

Feature selection:

Once you have a general understanding of the importance of the features, you can choose to exclude the weakly correlated ones from the data. For data with a large number of features, this is very effective. However, since the current case does not have very many features, all of them are kept here. As for feature selection, there are many methods, such as correlation-based selection, greedy selection, tree-model-based selection, and so on.

3. Data preprocessing

Categorical variables and one-hot encoding:

Fields of categorical type, such as gender, education background (e.g. bachelor's, master's, doctorate), or course, cannot be fed to the model directly, because the model's input must be numeric. They therefore need special processing before being put into the model; in this case, one-hot encoding.

Examples are as follows:

In one-hot encoding, a vector is created with the same length as the number of possible values; exactly one position is 1 and all other positions are 0, and the position of the 1 indicates the value being encoded. The order of the values within a field does not matter, as long as each value has a unique position.

The one-hot encoding transformation is as follows:
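As a small illustration (a made-up education column, not the case data), the pandas get_dummies function used later performs exactly this transformation:

import pandas as pd

# a toy categorical column with three possible values
edu = pd.Series(['bachelor', 'master', 'doctor', 'master'], name='education')

# one-hot encode it: one column per value, exactly one 1 per row
print(pd.get_dummies(edu, prefix='education'))

Each row of the result has a single 1 (shown as True in recent pandas versions) in the column corresponding to its original value; for example, the first row marks education_bachelor.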

After transformation, the new data can be input into the model for training.

Note that if a categorical variable has many possible values, the dimensionality of the transformed data also becomes large.

Now apply one-hot encoding to the raw data, as follows:

# categorical variables need to be converted to one-hot form; list all of them
cat_vars = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome']
for var in cat_vars:
    # call pandas.get_dummies for one-hot encoding and join the result
    cat_list = pd.get_dummies(data[var], prefix=var)
    data = data.join(cat_list)

data.head()

Output:

View data dimensions:

display(data.shape, list(data.columns))

Output:

(41188, 74)
['age', 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed', 'y', 'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid', 'job_management', 'job_retired', 'job_self-employed', 'job_services', 'job_student', 'job_technician', 'job_unemployed', 'job_unknown', 'marital_divorced', 'marital_married', 'marital_single', 'marital_unknown', 'education_basic.4y', 'education_basic.6y', 'education_basic.9y', 'education_high.school', 'education_illiterate', 'education_professional.course', 'education_university.degree', 'education_unknown', 'default_no', 'default_unknown', 'default_yes', 'housing_no', 'housing_unknown', 'housing_yes', 'loan_no', 'loan_unknown', 'loan_yes', 'contact_cellular', 'contact_telephone', 'month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep', 'day_of_week_fri', 'day_of_week_mon', 'day_of_week_thu', 'day_of_week_tue', 'day_of_week_wed', 'poutcome_failure', 'poutcome_nonexistent', 'poutcome_success']

As you can see, the dimensionality of the data expands after one-hot encoding. (Before training, the original categorical columns are dropped, which is why the feature matrix X below has 63 columns.)

One-hot encoding is the most common way to encode categorical variables. Generally speaking, it increases the feature dimensionality: for example, a categorical variable "city" may take more than 1000 values, and converting it to one-hot form produces a vector of more than 1000 dimensions.

4. Model training

The training code of the model is as follows:

(1) Separating features and labels

# separate the features and the label in the dataset
X = data.loc[:, data.columns != 'y']
y = data.loc[:, data.columns == 'y'].values.ravel()

display(X.shape, y.shape)

Output:

(41188, 63)
(41188,)

(2) Divide training data and test data

from sklearn.model_selection import train_test_split

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

display(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Output:

(28831, 63)
(12357, 63)
(28831,)
(12357,)

(3) Use logistic regression model for training

from sklearn.linear_model import LogisticRegression

# initialize the logistic regression model and train it on the training data
log_reg = LogisticRegression(max_iter=10000)

log_reg.fit(X_train, y_train)

5. Model evaluation

In this case, the proportions of the label classes are very unequal; this is called imbalanced data. For imbalanced data, extra attention must be paid to the evaluation criterion, because the wrong criterion can make the modeling itself meaningless.

When evaluating a model, there are multiple metrics.

Precision and Recall are as follows:
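In terms of true positives, false positives, and false negatives (TP, FP, FN), the standard definitions are:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$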

Precision and recall are very important metrics for binary classification, because many data sets are imbalanced and these two metrics evaluate models on such data better.

With imbalanced data, accuracy is not meaningful: even without training any model, simply assigning all samples to the majority class still gives high accuracy, which is in fact meaningless. In such cases another criterion, the F1-score, is usually used. The F1-score is as follows:
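The F1-score is the harmonic mean of precision and recall:

$$F_{1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$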

The F1-score can be computed for the positive and negative classes separately; to obtain an overall F1-score, one can take the mean of the per-class F1-scores (the macro average in the report below).

First, use the model to make predictions on the test set:

# Make a prediction
y_pred = log_reg.predict(X_test)

Then evaluate the predictions:

from sklearn.metrics import classification_report

# compute precision, recall and F1-score for each class
print(classification_report(y_test, y_pred))

Output:

              precision    recall  f1-score   support

           0       0.93      0.97      0.95     10969
           1       0.67      0.42      0.51      1388

    accuracy                           0.91     12357
   macro avg       0.80      0.70      0.73     12357
weighted avg       0.90      0.91      0.90     12357

6. Project structure

The overall case project structure is as follows:

Bank_market_predict:
│  data_preprocess.ipynb            # data preprocessing
│  data_preview.ipynb               # data check
│  model.ipynb                      # model training and evaluation
│
└─data
        banking.csv                 # raw data
        prerocessed_banking.csv     # preprocessed data

The complete project code and data can be downloaded from download.csdn.net/download/CU… for reference.

Conclusion

Logistic regression is a classical classification method in statistical learning and belongs to the family of log-linear models. It uses the idea of regression to solve classification problems. The logistic model is one of the most important basic models in machine learning, so mastering its principles is very important.

This article is reproduced from the Wenwoha community; the original links are www.wenwoha.com/19/course_a… and www.wenwoha.com/19/course_a… .