Hello everyone! In Day 63 of the 100 Days of Machine Learning series, "Thoroughly Mastering LightGBM", I introduced the principles of LightGBM and a minimalist LightGBM model. Recently I found that Hugging Face pairs nicely with Streamlit, so I built a simple visual tuning tool for LightGBM to help you understand LightGBM more deeply.
Website: huggingface.co/spaces/beih…
I exposed just a few parameters, which you can tweak to see how the model's metrics change in real time. The code is also included in this article; if you have ideas for improving it, feel free to leave a comment. Here is a detailed walkthrough of the implementation:
LightGBM parameters
After a model is built, we need to evaluate it and keep adjusting its parameters, features, or algorithm based on the evaluation results until the results are satisfactory.
LightGBM has core parameters, learning control parameters, IO parameters, objective parameters, metric parameters, network parameters, GPU parameters, and model parameters. The ones I modify most often are the core parameters, learning control parameters, and metric parameters.
| Control Parameters | Meaning | Usage |
|---|---|---|
| max_depth | Maximum depth of a tree | When the model overfits, consider lowering max_depth first |
| min_data_in_leaf | Minimum number of records a leaf may have | Default 20; increase it to combat overfitting |
| feature_fraction | For example, 0.8 means 80% of the features are randomly selected to build the tree in each iteration | Used when boosting is random forest (rf) |
| bagging_fraction | Fraction of the data used in each iteration | Speeds up training and reduces overfitting |
| early_stopping_round | Training stops if a validation metric has not improved in the last early_stopping_round rounds | Speeds up analysis and avoids wasted iterations |
| lambda | Regularization strength | Typically 0 to 1 |
| min_gain_to_split | Minimum gain required to make a split | Controls how many useful splits a tree makes |
| max_cat_group | Finds split points on group boundaries | Makes it easier to find split points when the number of categories is large |
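To make this concrete, here is a minimal sketch of how these control parameters can be collected into the plain Python dict that LightGBM's training API expects. The values are purely illustrative, not tuned recommendations:

```python
# Illustrative control-parameter values; tune them on your own validation data.
control_params = {
    'max_depth': 6,               # cap tree depth when the model overfits
    'min_data_in_leaf': 20,       # default 20; raise it to fight overfitting
    'feature_fraction': 0.8,      # use 80% of the features per iteration
    'bagging_fraction': 0.8,      # use 80% of the rows per iteration
    'bagging_freq': 5,            # re-sample the rows every 5 iterations
    'lambda_l1': 0.1,             # L1 regularization ("lambda" in the table)
    'lambda_l2': 0.1,             # L2 regularization
    'min_gain_to_split': 0.0,     # minimum gain required to split a node
}
# early_stopping_round is passed to lgb.train() separately,
# as in the training code later in this article.
```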
| Core Parameters | Meaning | Usage |
|---|---|---|
| task | What the data is used for | train or predict |
| application | Purpose of the model | regression, binary, multiclass |
| boosting | Algorithm to use | gbdt, rf (random forest), dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling) |
| num_boost_round | Number of boosting iterations | Usually 100+ |
| learning_rate | How much each tree contributes to the final result; the step size of each iteration | Commonly 0.1, 0.001, 0.003… |
| num_leaves | Maximum number of leaves in one tree | Default 31 |
| device | Hardware used for training | cpu or gpu |
| metric | Evaluation metric | mae: mean absolute error, mse: mean squared error, binary_logloss: binary classification loss, multi_logloss: multiclass classification loss |
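As a minimal, self-contained sketch of how the core parameters plug into `lgb.train` (the values are placeholders, and the dataset simply mirrors the breast-cancer data used by the app below):

```python
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Toy binary-classification setup
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

train_set = lgb.Dataset(X_train, y_train)
valid_set = lgb.Dataset(X_test, y_test, reference=train_set)

core_params = {
    'objective': 'binary',       # "application" above: regression / binary / multiclass
    'boosting': 'gbdt',          # gbdt, rf, dart or goss
    'learning_rate': 0.1,        # commonly 0.1, 0.01, 0.003, ...
    'num_leaves': 31,            # default 31
    'metric': 'binary_logloss',  # also mae, mse, multi_logloss, ...
}

booster = lgb.train(core_params, train_set,
                    num_boost_round=100,     # number of iterations, usually 100+
                    valid_sets=[valid_set])

preds = (booster.predict(X_test) > 0.5).astype(int)
print('test accuracy:', (preds == y_test).mean())
```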
| Faster Speed | Better Accuracy | Reduce Over-fitting |
|---|---|---|
| Use a smaller max_bin | Use a larger max_bin | Use a smaller max_bin |
| | Use a larger num_leaves | Use a smaller num_leaves |
| Use feature sub-sampling via feature_fraction | | Use feature sub-sampling via feature_fraction |
| Use bagging via bagging_fraction and bagging_freq | | Use bagging via bagging_fraction and bagging_freq |
| | Use more training data | Use more training data |
| Use save_binary to speed up data loading | Use categorical features directly | Use min_data_in_leaf and min_sum_hessian_in_leaf |
| Use parallel learning | Try dart | Use lambda_l1, lambda_l2 and min_gain_to_split for regularization |
| | Use a larger num_iterations with a smaller learning_rate | Use max_depth to limit tree depth |
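One hedged way to read the table above is as three sets of parameter overrides. The concrete numbers below are illustrative, not official recommendations:

```python
# Illustrative overrides corresponding to the three columns above.
faster_speed = {
    'max_bin': 63,              # smaller max_bin -> faster histogram building
    'bagging_fraction': 0.8,    # with bagging_freq, sub-samples the rows
    'bagging_freq': 5,
    'feature_fraction': 0.8,    # sub-samples the features
}

better_accuracy = {
    'max_bin': 511,             # larger max_bin
    'num_leaves': 127,          # larger num_leaves
    'num_iterations': 1000,     # more iterations with a smaller learning_rate
    'learning_rate': 0.01,
}

less_overfitting = {
    'max_bin': 63,
    'num_leaves': 15,
    'min_data_in_leaf': 50,
    'min_sum_hessian_in_leaf': 10.0,
    'lambda_l1': 0.1,
    'lambda_l2': 0.1,
    'min_gain_to_split': 0.1,
    'max_depth': 6,
}
```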
Model evaluation metrics
Taking a classification model as an example, the common evaluation metrics are as follows:
The confusion matrix reflects model performance comprehensively, and many other metrics can be derived from it.
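For instance, with scikit-learn a confusion matrix and some of the metrics derived from it can be computed like this (the labels and predictions are toy values, not real results):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # toy ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # toy model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print('TN, FP, FN, TP =', tn, fp, fn, tp)
print('precision =', precision_score(y_true, y_pred))  # TP / (TP + FP)
print('recall    =', recall_score(y_true, y_pred))     # TP / (TP + FN)
print('f1        =', f1_score(y_true, y_pred))
```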
The ROC curve (Receiver Operating Characteristic curve) plots the false positive rate (FPR) on the x-axis against the recall (true positive rate) on the y-axis at different classification thresholds. It measures how well the model captures the minority class while mistakenly flagging the majority class.
AUC (Area Under the ROC Curve) is often used as the most important metric for judging the stability of a binary classification model. The area under the ROC curve is the AUC: the larger the AUC, the closer the ROC curve is to the upper-left corner, and the better the model.
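Here is a short sketch of computing the ROC points and the AUC with scikit-learn, again with toy scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                    # toy labels
scores = np.array([0.1, 0.4, 0.8, 0.7, 0.3, 0.2, 0.9, 0.6])    # toy predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, scores)   # one (FPR, TPR) pair per threshold
auc = roc_auc_score(y_true, scores)
print('AUC =', auc)   # closer to 1 -> ROC curve closer to the top-left corner
```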
Streamlit implementation
I won't introduce Streamlit again; regular readers should be very familiar with it by now. Here are a few of the small projects I built with it before:
- Building machine learning apps, it’s so easy
- I created a website for this GIF
- Stop it, Huaqiang! I’m using machine learning to help you pick out watermelons
- It took a month to make a pure machine learning website
The core code is below; the complete code is on GitHub, and a Star is always welcome:
github.com/tjxj/visual…
```python
from definitions import *

st.set_option('deprecation.showPyplotGlobalUse', False)

# Load the breast-cancer dataset and split it
breast_cancer = load_breast_cancer()
data = breast_cancer.data
target = breast_cancer.target
X_train, X_test, y_train, y_test = train_test_split(data, target)

lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

params = {
    'num_leaves': num_leaves,
    'max_depth': max_depth,
    'min_data_in_leaf': min_data_in_leaf,
    'feature_fraction': feature_fraction,
    'min_data_per_group': min_data_per_group,
    'max_cat_threshold': max_cat_threshold,
    'learning_rate': learning_rate,
    'max_bin': max_bin,
    'num_iterations': num_iterations
}

gbm = lgb.train(params,
                lgb_train,
                num_boost_round=2000,
                valid_sets=lgb_eval,
                early_stopping_rounds=500)

probs = gbm.predict(X_test, num_iteration=gbm.best_iteration)
fpr, tpr, thresholds = roc_curve(y_test, probs)

st.write('------------------------------------')
st.write('Confusion Matrix:')
st.write(confusion_matrix(y_test, np.where(probs > 0.5, 1, 0)))

st.write('------------------------------------')
st.write('Classification Report:')
report = classification_report(y_test, np.where(probs > 0.5, 1, 0), output_dict=True)
report_matrix = pd.DataFrame(report).transpose()
st.dataframe(report_matrix)

st.write('------------------------------------')
st.write('ROC:')
plot_roc(fpr, tpr)
```
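The tunable variables above (num_leaves, max_depth, learning_rate, …) come from `definitions.py` via the star import, which is not shown here. A hypothetical sketch of what those Streamlit sidebar controls could look like (the ranges and defaults are my own guesses, not taken from the original file):

```python
import streamlit as st

# Hypothetical sidebar widgets feeding the params dict above;
# ranges and defaults are illustrative, not from the original definitions.py.
num_leaves         = st.sidebar.slider('num_leaves', 2, 256, 31)
max_depth          = st.sidebar.slider('max_depth', 1, 16, 6)
min_data_in_leaf   = st.sidebar.slider('min_data_in_leaf', 1, 100, 20)
feature_fraction   = st.sidebar.slider('feature_fraction', 0.1, 1.0, 1.0)
min_data_per_group = st.sidebar.slider('min_data_per_group', 1, 200, 100)
max_cat_threshold  = st.sidebar.slider('max_cat_threshold', 1, 64, 32)
learning_rate      = st.sidebar.slider('learning_rate', 0.001, 0.5, 0.1)
max_bin            = st.sidebar.slider('max_bin', 16, 512, 255)
num_iterations     = st.sidebar.slider('num_iterations', 50, 2000, 100)
```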
Upload to Hugging Face
I already introduced Hugging Face in a previous article (This Tencent algorithm, I put it online, come and play with it!), so I'll just go over the steps again briefly.
Step 1: Sign up for a Hugging Face account
Step 2: Create a Space, and remember to select Streamlit as the SDK
Step 3: Clone the newly created Space repository, add your code, then push it back
```bash
git lfs install
git add .
git commit -m "commit from $beihai"
git push
```
Note: if Git asks for your username and password on every push, you can run `git config --global credential.helper store` once to save your credentials.
Once the push finishes, go back to your Space page to see the result.