Hello everyone! In Day 63 of the 100 Days of Machine Learning series, "Thoroughly Mastering LightGBM", I introduced the principles of LightGBM and a minimalist LightGBM model. Recently I found that Hugging Face pairs nicely with Streamlit, so I built a simple visual tuning tool for LightGBM to help you understand LightGBM more deeply.
Website: huggingface.co/spaces/beih…
I exposed just a few parameters, which you can tweak to see how the model's metrics change in real time. The code is also included in this article; if you have ideas for improving it, feel free to leave a comment. Here is a detailed walkthrough of the implementation:
LightGBM parameters
After a model is built, we need to evaluate it and keep adjusting its parameters, features, or algorithm based on the evaluation results until the results are satisfactory.
LightGBM has core parameters, learning control parameters, IO parameters, objective parameters, metric parameters, network parameters, GPU parameters, and model parameters. The ones I modify most often are the core parameters, learning control parameters, and metric parameters.
| Control Parameters | Meaning | Usage |
|---|---|---|
| max_depth | Maximum depth of a tree | When the model overfits, consider lowering max_depth first |
| min_data_in_leaf | Minimum number of records a leaf may have | Default 20; increase it to combat overfitting |
| feature_fraction | For example, 0.8 means 80% of the features are randomly selected to build the tree in each iteration | Used when boosting is random forest (rf) |
| bagging_fraction | Fraction of the data used in each iteration | Speeds up training and reduces overfitting |
| early_stopping_round | Training stops if a validation metric has not improved in the last early_stopping_round rounds | Speeds up analysis and avoids wasted iterations |
| lambda | Regularization strength | Typically 0 to 1 |
| min_gain_to_split | Minimum gain required to make a split | Controls how many useful splits a tree makes |
| max_cat_group | Finds split points on group boundaries | Makes it easier to find split points when the number of categories is large |
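To make this concrete, here is a minimal sketch of how these control parameters can be collected into the plain Python dict that LightGBM's training API expects. The values are purely illustrative, not tuned recommendations:

```python
# Illustrative control-parameter values; tune them on your own validation data.
control_params = {
    'max_depth': 6,               # cap tree depth when the model overfits
    'min_data_in_leaf': 20,       # default 20; raise it to fight overfitting
    'feature_fraction': 0.8,      # use 80% of the features per iteration
    'bagging_fraction': 0.8,      # use 80% of the rows per iteration
    'bagging_freq': 5,            # re-sample the rows every 5 iterations
    'lambda_l1': 0.1,             # L1 regularization ("lambda" in the table)
    'lambda_l2': 0.1,             # L2 regularization
    'min_gain_to_split': 0.0,     # minimum gain required to split a node
}
# early_stopping_round is passed to lgb.train() separately,
# as in the training code later in this article.
```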
| Core Parameters | Meaning | Usage |
|---|---|---|
| task | What the data is used for | train or predict |
| application | Purpose of the model | regression, binary, multiclass |
| boosting | Algorithm to use | gbdt, rf (random forest), dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling) |
| num_boost_round | Number of boosting iterations | Usually 100+ |
| learning_rate | How much each tree contributes to the final result; the step size of each iteration | Commonly 0.1, 0.001, 0.003… |
| num_leaves | Maximum number of leaves in one tree | Default 31 |
| device | Hardware used for training | cpu or gpu |
| metric | Evaluation metric | mae: mean absolute error, mse: mean squared error, binary_logloss: binary classification loss, multi_logloss: multiclass classification loss |
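As a minimal, self-contained sketch of how the core parameters plug into `lgb.train` (the values are placeholders, and the dataset simply mirrors the breast-cancer data used by the app below):

```python
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Toy binary-classification setup
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

train_set = lgb.Dataset(X_train, y_train)
valid_set = lgb.Dataset(X_test, y_test, reference=train_set)

core_params = {
    'objective': 'binary',       # "application" above: regression / binary / multiclass
    'boosting': 'gbdt',          # gbdt, rf, dart or goss
    'learning_rate': 0.1,        # commonly 0.1, 0.01, 0.003, ...
    'num_leaves': 31,            # default 31
    'metric': 'binary_logloss',  # also mae, mse, multi_logloss, ...
}

booster = lgb.train(core_params, train_set,
                    num_boost_round=100,     # number of iterations, usually 100+
                    valid_sets=[valid_set])

preds = (booster.predict(X_test) > 0.5).astype(int)
print('test accuracy:', (preds == y_test).mean())
```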
| Faster Speed | Better Accuracy | Reduce Over-fitting |
|---|---|---|
| Use a smaller max_bin | Use a larger max_bin | Use a smaller max_bin |
| | Use a larger num_leaves | Use a smaller num_leaves |
| Use feature sub-sampling via feature_fraction | | Use feature sub-sampling via feature_fraction |
| Use bagging via bagging_fraction and bagging_freq | | Use bagging via bagging_fraction and bagging_freq |
| | Use more training data | Use more training data |
| Use save_binary to speed up data loading | Use categorical features directly | Use min_data_in_leaf and min_sum_hessian_in_leaf |
| Use parallel learning | Try dart | Use lambda_l1, lambda_l2 and min_gain_to_split for regularization |
| | Use a larger num_iterations with a smaller learning_rate | Use max_depth to limit tree depth |
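One hedged way to read the table above is as three sets of parameter overrides. The concrete numbers below are illustrative, not official recommendations:

```python
# Illustrative overrides corresponding to the three columns above.
faster_speed = {
    'max_bin': 63,              # smaller max_bin -> faster histogram building
    'bagging_fraction': 0.8,    # with bagging_freq, sub-samples the rows
    'bagging_freq': 5,
    'feature_fraction': 0.8,    # sub-samples the features
}

better_accuracy = {
    'max_bin': 511,             # larger max_bin
    'num_leaves': 127,          # larger num_leaves
    'num_iterations': 1000,     # more iterations with a smaller learning_rate
    'learning_rate': 0.01,
}

less_overfitting = {
    'max_bin': 63,
    'num_leaves': 15,
    'min_data_in_leaf': 50,
    'min_sum_hessian_in_leaf': 10.0,
    'lambda_l1': 0.1,
    'lambda_l2': 0.1,
    'min_gain_to_split': 0.1,
    'max_depth': 6,
}
```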
Model evaluation metrics
Taking a classification model as an example, the common evaluation metrics are as follows:
The confusion matrix reflects model performance comprehensively, and many other metrics can be derived from it.
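For instance, with scikit-learn a confusion matrix and some of the metrics derived from it can be computed like this (the labels and predictions are toy values, not real results):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # toy ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # toy model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print('TN, FP, FN, TP =', tn, fp, fn, tp)
print('precision =', precision_score(y_true, y_pred))  # TP / (TP + FP)
print('recall    =', recall_score(y_true, y_pred))     # TP / (TP + FN)
print('f1        =', f1_score(y_true, y_pred))
```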
The ROC curve (Receiver Operating Characteristic curve) plots the false positive rate (FPR) on the x-axis against the recall (true positive rate) on the y-axis at different classification thresholds. It measures how well the model captures the minority class while mistakenly flagging the majority class.
AUC (Area Under the ROC Curve) is often used as the most important metric for judging the stability of a binary classification model. The area under the ROC curve is the AUC: the larger the AUC, the closer the ROC curve is to the upper-left corner, and the better the model.
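Here is a short sketch of computing the ROC points and the AUC with scikit-learn, again with toy scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                    # toy labels
scores = np.array([0.1, 0.4, 0.8, 0.7, 0.3, 0.2, 0.9, 0.6])    # toy predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, scores)   # one (FPR, TPR) pair per threshold
auc = roc_auc_score(y_true, scores)
print('AUC =', auc)   # closer to 1 -> ROC curve closer to the top-left corner
```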
Streamlit implementation
I won't introduce Streamlit again; regular readers should be very familiar with it by now. Here are a few of the small projects I built with it before:
- Building machine learning apps, it’s so easy
- I created a website for this GIF
- Stop it, Huaqiang! I’m using machine learning to help you pick out watermelons
- It took a month to make a pure machine learning website
The core code is below; the complete code is on GitHub, and a Star is always welcome:
github.com/tjxj/visual…
```python
from definitions import *

st.set_option('deprecation.showPyplotGlobalUse', False)

# Load the breast-cancer dataset and split it
breast_cancer = load_breast_cancer()
data = breast_cancer.data
target = breast_cancer.target
X_train, X_test, y_train, y_test = train_test_split(data, target)

lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

params = {
    'num_leaves': num_leaves,
    'max_depth': max_depth,
    'min_data_in_leaf': min_data_in_leaf,
    'feature_fraction': feature_fraction,
    'min_data_per_group': min_data_per_group,
    'max_cat_threshold': max_cat_threshold,
    'learning_rate': learning_rate,
    'max_bin': max_bin,
    'num_iterations': num_iterations
}

gbm = lgb.train(params,
                lgb_train,
                num_boost_round=2000,
                valid_sets=lgb_eval,
                early_stopping_rounds=500)

probs = gbm.predict(X_test, num_iteration=gbm.best_iteration)
fpr, tpr, thresholds = roc_curve(y_test, probs)

st.write('------------------------------------')
st.write('Confusion Matrix:')
st.write(confusion_matrix(y_test, np.where(probs > 0.5, 1, 0)))

st.write('------------------------------------')
st.write('Classification Report:')
report = classification_report(y_test, np.where(probs > 0.5, 1, 0), output_dict=True)
report_matrix = pd.DataFrame(report).transpose()
st.dataframe(report_matrix)

st.write('------------------------------------')
st.write('ROC:')
plot_roc(fpr, tpr)
```
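The tunable variables above (num_leaves, max_depth, learning_rate, …) come from `definitions.py` via the star import, which is not shown here. A hypothetical sketch of what those Streamlit sidebar controls could look like (the ranges and defaults are my own guesses, not taken from the original file):

```python
import streamlit as st

# Hypothetical sidebar widgets feeding the params dict above;
# ranges and defaults are illustrative, not from the original definitions.py.
num_leaves         = st.sidebar.slider('num_leaves', 2, 256, 31)
max_depth          = st.sidebar.slider('max_depth', 1, 16, 6)
min_data_in_leaf   = st.sidebar.slider('min_data_in_leaf', 1, 100, 20)
feature_fraction   = st.sidebar.slider('feature_fraction', 0.1, 1.0, 1.0)
min_data_per_group = st.sidebar.slider('min_data_per_group', 1, 200, 100)
max_cat_threshold  = st.sidebar.slider('max_cat_threshold', 1, 64, 32)
learning_rate      = st.sidebar.slider('learning_rate', 0.001, 0.5, 0.1)
max_bin            = st.sidebar.slider('max_bin', 16, 512, 255)
num_iterations     = st.sidebar.slider('num_iterations', 50, 2000, 100)
```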
Upload to Hugging Face
I already introduced Hugging Face in a previous article (This Tencent algorithm, I put it online, come and play with it!), so I'll just go over the steps again briefly.
Step 1: Sign up for a Hugging Face account
Step 2: Create a Space, and remember to select Streamlit as the SDK
Step 3: Clone the newly created Space repository, add your code, then push it back
```bash
git lfs install
git add .
git commit -m "commit from $beihai"
git push
```
Note: if Git asks for your username and password on every push, you can run `git config --global credential.helper store` once to save your credentials.
Once the push finishes, go back to your Space page to see the result.