Hello everyone! In 100 Days of Machine Learning | Day 63: Mastering LightGBM, I introduced the principles of LightGBM and a minimal LightGBM model. Recently I found that Huggingface works nicely with Streamlit, so I built a simple visual tuning tool for LightGBM to help you understand the algorithm more deeply.

Website:

huggingface.co/spaces/beih…

For now I have only exposed a handful of parameters; you can tweak them and watch the model's metrics change in real time. The code is included in this article, and if you have ideas for improving it, feel free to leave a comment. Here is a detailed walkthrough of the implementation:

LightGBM parameters

After the model is built, we need to evaluate how well it performs and keep adjusting its parameters, features, or algorithm based on the evaluation results until the results are satisfactory.

LightGBM's parameters fall into core parameters, learning control parameters, IO parameters, objective parameters, metric parameters, network parameters, GPU parameters, and model parameters. The ones I modify most often are the core parameters, learning control parameters, and metric parameters.

| Control parameters | Meaning | Usage |
| --- | --- | --- |
| max_depth | Maximum depth of a tree | When the model overfits, consider lowering max_depth first |
| min_data_in_leaf | Minimum number of records a leaf may have | Default 20; increase it to fight overfitting |
| feature_fraction | Fraction of features randomly selected to build each tree, e.g. 0.8 means 80% of the features are used in each iteration | Used when boosting is set to random forest |
| bagging_fraction | Fraction of data used in each iteration | Used to speed up training and reduce overfitting |
| early_stopping_round | Training stops if a validation metric has not improved in the last early_stopping_round rounds | Speeds up experiments and avoids excessive iterations |
| lambda | Regularization strength | Typical range 0 to 1 |
| min_gain_to_split | Minimum gain required to make a split | Controls the number of useful splits in a tree |
| max_cat_group | Finds split points on category-group boundaries | Makes finding split points easier when the number of categories is large |

| Core parameters | Meaning | Usage |
| --- | --- | --- |
| task | What to do with the data | train or predict |
| application | Purpose of the model | regression, binary, multiclass |
| boosting | Algorithm to use | gbdt, rf (random forest), dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling) |
| num_boost_round | Number of boosting iterations | Usually 100+ |
| learning_rate | Step size of each iteration, i.e. how much each new tree contributes to the final result | Commonly 0.1, 0.001, 0.003, … |
| num_leaves | Maximum number of leaves per tree | Default 31 |
| device | Hardware to train on | cpu or gpu |
| metric | Evaluation metric on the validation data | mae: mean absolute error, mse: mean squared error, binary_logloss: binary classification loss, multi_logloss: multiclass classification loss |
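To make the core parameters concrete, here is a minimal sketch of a binary classification run with the LightGBM Python API. It assumes the scikit-learn breast cancer dataset (the same data the demo uses later); the parameter values are illustrative defaults, not tuned settings.

```python
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Illustrative values only -- the core parameters from the table above.
params = {
    'application': 'binary',      # purpose of the model (alias of 'objective')
    'boosting': 'gbdt',           # algorithm: gbdt / rf / dart / goss
    'learning_rate': 0.1,         # commonly 0.1, 0.003, 0.001, ...
    'num_leaves': 31,             # default 31
    'metric': 'binary_logloss',   # loss for binary classification
    'device': 'cpu',
}

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

train_set = lgb.Dataset(X_train, y_train)
valid_set = lgb.Dataset(X_test, y_test, reference=train_set)

# num_boost_round is the number of iterations, usually 100+
gbm = lgb.train(params, train_set, num_boost_round=100, valid_sets=[valid_set])
probs = gbm.predict(X_test)
print('mean predicted probability:', probs.mean())
```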

| Faster speed | Better accuracy | Deal with over-fitting |
| --- | --- | --- |
| Use a smaller max_bin | Use a larger max_bin | Use a smaller max_bin |
|  | Use a larger num_leaves | Use a smaller num_leaves |
| Use feature_fraction for sub-sampling |  | Use feature_fraction |
| Use bagging_fraction and bagging_freq |  | Use bagging_fraction and bagging_freq |
|  | Use more training data | Use more training data |
| Use save_binary to speed up data loading | Use categorical features directly | Use min_data_in_leaf and min_sum_hessian_in_leaf |
| Use parallel learning | Use dart | Use lambda_l1, lambda_l2 and min_gain_to_split for regularization |
|  | Use a larger num_iterations with a smaller learning_rate | Use max_depth to limit tree depth |
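As a rough illustration of the over-fitting column above, here is one way these knobs could be combined into a params dict. The specific values are my own assumptions for demonstration, not recommendations from the article.

```python
# Hypothetical parameter set following the "over-fitting" column of the table above.
# The values are illustrative, not tuned for any particular dataset.
anti_overfit_params = {
    'objective': 'binary',
    'max_bin': 63,               # smaller max_bin
    'num_leaves': 15,            # smaller num_leaves
    'max_depth': 5,              # limit tree depth
    'min_data_in_leaf': 50,      # raise the minimum records per leaf
    'feature_fraction': 0.8,     # sub-sample 80% of features each iteration
    'bagging_fraction': 0.8,     # sub-sample 80% of rows...
    'bagging_freq': 5,           # ...every 5 iterations
    'lambda_l1': 0.1,            # L1 regularization
    'lambda_l2': 0.1,            # L2 regularization
    'min_gain_to_split': 0.01,   # minimum gain required to make a split
}
```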

Model evaluation metrics

Taking a classification model as an example, the common evaluation metrics are as follows:

The confusion matrix gives a comprehensive picture of model performance, and many other metrics can be derived from it.

The ROC curve (Receiver Operating Characteristic curve) plots the false positive rate (FPR) on the x-axis and the recall (true positive rate) on the y-axis, computed at different classification thresholds. It measures how well the model captures the minority class while falsely flagging the majority class.

AUC (Area Under the ROC Curve) is often used as the most important metric for evaluating a binary classifier. The larger the area under the ROC curve, the closer the curve is to the upper-left corner and the better the model.
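To show how these metrics are computed in practice, here is a small sketch using scikit-learn. The labels and probabilities below are placeholders; in the app they come from the trained LightGBM model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve, auc

# Placeholder ground truth and predicted probabilities -- replace with real model output.
y_test = np.array([0, 0, 1, 1, 1, 0, 1, 0])
probs = np.array([0.1, 0.4, 0.8, 0.65, 0.9, 0.3, 0.55, 0.2])

preds = np.where(probs > 0.5, 1, 0)          # threshold probabilities at 0.5
print(confusion_matrix(y_test, preds))       # rows: true class, columns: predicted class

fpr, tpr, thresholds = roc_curve(y_test, probs)  # FPR and recall at every threshold
print('AUC:', auc(fpr, tpr))                     # area under the ROC curve
```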

Streamlit implementation

I won't introduce Streamlit again; regular readers should already be familiar with it. Here are a few of the small tools built with it previously:

  • Building machine learning apps, it’s so easy
  • I created a website for this GIF
  • Stop it, Huaqiang! I’m using machine learning to help you pick out watermelons
  • It took a month to make a pure machine learning website

The core code is shown below; the complete code is on GitHub, and a Star would be much appreciated:

github.com/tjxj/visual…

```python
from definitions import *   # Streamlit widgets (num_leaves, max_depth, ...) and helpers live here

st.set_option('deprecation.showPyplotGlobalUse', False)

# Demo data: the scikit-learn breast cancer dataset
breast_cancer = load_breast_cancer()
data = breast_cancer.data
target = breast_cancer.target
X_train, X_test, y_train, y_test = train_test_split(data, target)

lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# The parameter values come from the Streamlit controls
params = {'num_leaves': num_leaves, 'max_depth': max_depth,
          'min_data_in_leaf': min_data_in_leaf, 'feature_fraction': feature_fraction,
          'min_data_per_group': min_data_per_group, 'max_cat_threshold': max_cat_threshold,
          'learning_rate': learning_rate, 'max_bin': max_bin,
          'num_iterations': num_iterations}

# Note: recent LightGBM versions expect callbacks=[lgb.early_stopping(500)]
# instead of the early_stopping_rounds argument.
gbm = lgb.train(params, lgb_train, num_boost_round=2000,
                valid_sets=lgb_eval, early_stopping_rounds=500)

probs = gbm.predict(X_test, num_iteration=gbm.best_iteration)
fpr, tpr, thresholds = roc_curve(y_test, probs)

st.write('------------------------------------')
st.write('Confusion Matrix:')
st.write(confusion_matrix(y_test, np.where(probs > 0.5, 1, 0)))

st.write('------------------------------------')
st.write('Classification Report:')
report = classification_report(y_test, np.where(probs > 0.5, 1, 0), output_dict=True)
report_matrix = pd.DataFrame(report).transpose()
st.dataframe(report_matrix)

st.write('------------------------------------')
st.write('ROC:')
plot_roc(fpr, tpr)
```
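The script above imports everything from `definitions` (the Streamlit widgets that set num_leaves, max_depth and so on, plus the `plot_roc` helper), which is not shown in the article. Below is a rough sketch of what such a module might contain; the widget ranges and the plotting helper are my own assumptions, not the actual code from the repo.

```python
# definitions.py -- hypothetical sketch; see the GitHub repo for the real version.
import streamlit as st
import numpy as np
import pandas as pd
import lightgbm as lgb
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc

# Sidebar widgets feeding the params dict; ranges and defaults are illustrative.
num_leaves = st.sidebar.slider('num_leaves', 2, 256, 31)
max_depth = st.sidebar.slider('max_depth', -1, 20, -1)
min_data_in_leaf = st.sidebar.slider('min_data_in_leaf', 1, 100, 20)
feature_fraction = st.sidebar.slider('feature_fraction', 0.1, 1.0, 1.0)
min_data_per_group = st.sidebar.slider('min_data_per_group', 1, 200, 100)
max_cat_threshold = st.sidebar.slider('max_cat_threshold', 1, 64, 32)
learning_rate = st.sidebar.slider('learning_rate', 0.001, 0.5, 0.1)
max_bin = st.sidebar.slider('max_bin', 8, 512, 255)
num_iterations = st.sidebar.slider('num_iterations', 10, 1000, 100)

def plot_roc(fpr, tpr):
    """Draw the ROC curve and render it in the Streamlit app."""
    fig, ax = plt.subplots()
    ax.plot(fpr, tpr, label=f'AUC = {auc(fpr, tpr):.3f}')
    ax.plot([0, 1], [0, 1], linestyle='--')   # chance line
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate (Recall)')
    ax.legend()
    st.pyplot(fig)
```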

Upload to Huggingface

I already covered Huggingface in a previous article (I put this Tencent algorithm online, go play with it!), so here I'll just go over the steps again.

Step 1: Sign up for a Huggingface account.

Step 2: Create a Space, and remember to select Streamlit as the SDK.

Step 3: Clone the newly created Space repository, then push your modified code.

```bash
git lfs install
git add .
git commit -m "commit from $beihai"
git push
```

Tip: if you don't want to type your username and password on every push, run `git config --global credential.helper store` once to cache your credentials.

Once the push finishes, go back to your Space page to see the result.