1. Background

Every time a trader (shareholder) buys or sells stock, they pay a commission to the securities company that holds their account. The commission on a single trade is small, but the stock market's daily turnover is huge, so the commissions add up to a considerable amount. For some securities companies this income is very important and even accounts for more than 50% of total operating income. Securities companies therefore care a great deal about the loyalty and activity of their customers (i.e. traders). If a customer stops trading through a securities company, that is, the customer churns, the company loses a source of income. Securities companies therefore build customer churn warning models that predict whether a customer will be lost, and take retention measures for customers with a high churn probability, because acquiring a new customer usually costs much more than retaining an existing one.

2. Read data

import pandas as pd
df = pd.read_excel()  # supply the path to the Excel data file here
df.head()

3. Split feature variables and target variable

X = df.drop(columns="xtype")  # features: every column except the churn label
y = df["xtype"]               # target: the churn label column

Line 1 removes the churn label column ("xtype" in the code above) with drop() and assigns the remaining columns to the variable X as the feature variables. Line 2 extracts that same column by DataFrame column selection and assigns it to the variable y as the target variable.
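
The same two-step split can be tried on a small table. The sketch below uses a hypothetical toy DataFrame; its column names and values are made up for illustration and are not the article's dataset.

```python
import pandas as pd

# Hypothetical toy table (illustrative columns, not the article's data).
toy = pd.DataFrame({
    "balance": [1200.0, 300.0, 56000.0, 80.0],
    "trades_last_month": [14, 0, 35, 1],
    "churned": [0, 1, 0, 1],  # target: 1 = customer lost
})

# Everything except the label becomes the feature matrix (a DataFrame).
X_toy = toy.drop(columns="churned")
# Selecting a single column returns the target as a Series.
y_toy = toy["churned"]

print(X_toy.shape)           # (4, 2)
print(type(y_toy).__name__)  # Series
```

Note that drop() returns a new DataFrame, so the original table still contains the label column afterwards.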

4. Split into training set and test set

Before a model is built and used, the data is split into a training set and a test set. As the names imply, the training set is used to fit the model, while the test set is used to check how well the fitted model performs. The point of the split is to evaluate the model, and to tune it, using the test set.

from sklearn.model_selection import train_test_split
# hold out 20% of the rows as the test set
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
x_train.head()

The split is random. The head() method shows the first 5 rows of the resulting training set or test set, as shown in the figure below.

The train_test_split() function randomly splits the data each time you run the program. If you want to keep the results consistent each time you split the data, you can set the random_state parameter as follows.

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

The random_state parameter is assigned a value of 1, which has no special meaning and can be replaced with another number. It acts as a seed parameter to make the results consistent each time the data is divided.
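
The effect of the seed can be checked directly. This is a minimal sketch on synthetic stand-in arrays (X_demo and y_demo are invented here, not the article's data): two calls with the same random_state should return the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 samples, 2 features (not the article's dataset).
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# Two calls with the same seed return the same partition.
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

print(np.array_equal(a_test, b_test))  # True
print(len(a_train), len(a_test))       # 8 2
```

Leaving random_state unset is still reasonable for a quick look at the data; fixing it mainly matters when results need to be compared between runs.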

5. Model building

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(x_train,y_train)

6. Predict results

y_pred = model.predict(x_test)
y_pred[0:100]

The model's predicted values y_pred and the actual test-set values y_test are then put side by side. y_pred is a one-dimensional array of type numpy.ndarray, and y_test is a one-dimensional pandas Series. Use list() to convert both to lists, then combine them into a DataFrame, as covered in Section 2.2.1.

a = pd.DataFrame()
a["forecast"] = list(y_pred)       # predicted labels
a["actual value"] = list(y_test)   # true labels, in test-set order
a.head()
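
One reason the conversion matters: after a random split, y_test keeps its original, shuffled row labels, and pandas aligns a Series on those labels when it is assigned as a column. The sketch below uses small invented stand-ins (y_pred_demo, y_test_demo) to show the difference between the position-based and the label-based assignment.

```python
import numpy as np
import pandas as pd

# Stand-ins for the article's objects: predictions as an ndarray, and
# true labels as a Series whose index was shuffled by the split.
y_pred_demo = np.array([1, 0, 1])
y_test_demo = pd.Series([1, 1, 0], index=[42, 7, 19])

# Position-based: both columns follow test-set order.
by_position = pd.DataFrame({
    "forecast": y_pred_demo,
    "actual value": y_test_demo.to_numpy(),
})

# Label-based pitfall: the frame gets index 0..2 from the list, then the
# Series is aligned on labels 42/7/19, none of which match.
by_label = pd.DataFrame()
by_label["forecast"] = list(y_pred_demo)
by_label["actual value"] = y_test_demo

print(by_position)
print(by_label["actual value"].isna().all())  # True
```

Converting with list() or to_numpy() discards the index, so the values are paired strictly by position.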

To see the prediction accuracy over the entire test set, use the following code.

model.score(x_test,y_test)
0.7977288857345636
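
For a scikit-learn classifier, score() reports mean accuracy, that is, the share of test rows whose predicted label equals the actual label. The sketch below checks this on synthetic data generated with make_classification (a stand-in chosen here, not the article's churn table), so the printed number will differ from the one above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the churn table (5 features, as in the case).
X_syn, y_syn = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=1)

clf = LogisticRegression().fit(X_tr, y_tr)
pred = clf.predict(X_te)

manual = np.mean(pred == y_te)  # share of matching labels
print(manual, accuracy_score(y_te, pred), clf.score(X_te, y_te))
```

All three printed values should agree.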

7. Predict probabilities

In essence, a logistic regression model predicts probabilities rather than predicting a class directly. The probability values can be obtained with the following code.

y_pred_proba = model.predict_proba(x_test)
# column 0: probability of class 0 (no churn); column 1: probability of class 1 (churn)
a = pd.DataFrame(y_pred_proba, columns=["no-churn probability", "churn probability"])
a.head()
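
The link between these probabilities and the labels returned by predict() can be sketched as follows, again on synthetic stand-in data rather than the article's table. The 0.7 cutoff at the end is an arbitrary illustration, not a value from the source.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (5 features), standing in for the churn table.
X_syn, y_syn = make_classification(n_samples=300, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_syn, y_syn)

proba = clf.predict_proba(X_syn)  # shape (n_samples, 2); columns follow clf.classes_
print(clf.classes_)
print(proba[:3].sum(axis=1))      # each row sums to 1

# For a binary model, predict() matches picking the more probable class.
by_argmax = clf.classes_[proba.argmax(axis=1)]
print(np.array_equal(by_argmax, clf.predict(X_syn)))

# A stricter cutoff (0.7 is an arbitrary illustration) flags fewer customers.
flagged = proba[:, 1] >= 0.7
print(flagged.sum(), (proba[:, 1] >= 0.5).sum())
```

Working from the probability column rather than the predicted label lets the cutoff be chosen to suit how many customers the retention team can actually contact.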

8. Obtain logistic regression coefficients

A logistic regression model is in essence a linear regression model passed through a nonlinear transformation, the Sigmoid function. There are 5 feature variables in this case, so the formula for the probability P of y = 1 (churn) is as follows.
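
Using the names from the paragraph that follows (intercept k0, coefficients k1 to k5) and writing the five feature variables as x1 to x5 (a labeling assumed here), the standard logistic form would be:

```latex
P(y = 1) = \frac{1}{1 + e^{-(k_0 + k_1 x_1 + k_2 x_2 + k_3 x_3 + k_4 x_4 + k_5 x_5)}}
```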

The coefficients and the intercept in the above formula can be obtained with the following code, where model is the name of the previously trained model. The coefficients ki of the feature variables come from the coef_ attribute, and the intercept k0 comes from the intercept_ attribute.

model.coef_
array([[ 2.41952469e-05,  8.16881491e-03,  1.04320950e-02, -2.54894468e-03, -1.10120609e-04]])
model.intercept_
array([1.43393291e-06])
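
To connect these attributes back to the predicted probabilities, the sketch below recomputes the churn probability by hand from coef_ and intercept_ and compares it with predict_proba(). It runs on synthetic stand-in data with 5 features, so its coefficients are not the ones printed above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (5 features), standing in for the churn table.
X_syn, y_syn = make_classification(n_samples=300, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_syn, y_syn)

k = clf.coef_[0]        # k1..k5, one coefficient per feature
k0 = clf.intercept_[0]  # intercept

z = k0 + X_syn @ k                   # linear part
p_manual = 1.0 / (1.0 + np.exp(-z))  # Sigmoid transformation

print(np.allclose(p_manual, clf.predict_proba(X_syn)[:, 1]))
```

coef_ has shape (1, 5) for a binary model, which is why the first row is taken before the dot product.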