- By Han Xinzi @Showmeai
- Tutorial address: www.showmeai.tech/tutorials/4…
- Article address: www.showmeai.tech/article-det…
- Statement: All rights reserved. To reprint, please contact the platform and the author and credit the source.
Introduction
LightGBM is a boosting ensemble model developed by Microsoft. Like XGBoost, it is an optimized, efficient implementation of GBDT. The two share the same underlying principles, but LightGBM does better than XGBoost in several respects, most notably training speed and memory usage.
This article covers how to apply LightGBM in engineering practice. Readers interested in the principles behind LightGBM can refer to another ShowMeAI article, Illustrated Machine Learning | LightGBM Model Explained.
1. LightGBM installation
LightGBM is a powerful and widely used Python machine learning library, and it is relatively easy to install.
1.1 Python and IDE Environment Settings
For Python environment and IDE setup, see the ShowMeAI article Illustrated Python | Installation and Environment Setup.
1.2 Installing the Tool Library
(1) Linux/Mac, etc
On these systems, LightGBM can be installed by running the following command on the command line and waiting for it to finish.
pip install lightgbm
You can also use a PyPI mirror hosted in mainland China for faster downloads:
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple lightgbm
(2) Windows
On Windows, the most convenient approach is to download the LightGBM wheel that matches your Python version from www.lfd.uci.edu/~gohlke/pyt… and then install it with the following command.
pip install lightgbm-3.3.2-cp310-cp310-win_amd64.whl
2. LightGBM Parameter Manual
In a previous ShowMeAI article, we explained the three types of XGBoost parameters: general parameters, learning task parameters, and booster parameters. LightGBM offers a richer set of tunable parameters: core parameters, learning control parameters, IO parameters, target parameters, metric parameters, network parameters, GPU parameters, and model parameters. The ones most often adjusted are the core, learning control, and metric parameters. The sections below describe these parameters; refer to the LightGBM Chinese documentation for more details.
2.1 Parameter Description
(1) Core parameters
- config or config_file: a string giving the path to the configuration file. The default is an empty string.
- task: a string specifying the task to perform. Possible values:
  - train or training: a training task. This is the default.
  - predict, prediction, or test: a prediction task.
  - convert_model: a model conversion task, which converts the model file to if-else format.
- application, objective, or app: a string giving the problem type. Possible values:
  - regression, regression_l2, mean_squared_error, mse, l2_root, root_mean_squared_error, or rmse: a regression task with the L2 loss function. The default is regression.
  - regression_l1, mae, or mean_absolute_error: a regression task with the L1 loss function.
  - huber: a regression task with the Huber loss function.
  - fair: a regression task with the Fair loss function.
  - poisson: a Poisson regression task.
  - quantile: a quantile regression task.
  - quantile_l2: a quantile regression task with the L2 loss function.
  - mape or mean_absolute_percentage_error: a regression task with the MAPE loss function.
  - gamma: a gamma regression task.
  - tweedie: a Tweedie regression task.
  - binary: a binary classification task, with log loss as the objective function.
  - multiclass: a multi-class classification task, with softmax as the objective function. The num_class parameter must be set.
  - multiclassova, multiclass_ova, ova, or ovr: a multi-class classification task, with a one-vs-all binary objective function. The num_class parameter must be set.
  - xentropy or cross_entropy: the objective function is cross-entropy (with optional linear weights). Labels must lie between 0 and 1.
  - xentlambda or cross_entropy_lambda: an alternative parameterization of cross_entropy. Labels must lie between 0 and 1.
  - lambdarank: a ranking task. In a lambdarank task, labels must be integers, with a higher value indicating higher relevance. The label_gain parameter can be used to set the gain (weight) of each integer label.
- boosting, boost, or boosting_type: a string giving the base learner algorithm. Possible values:
  - gbdt: the traditional gradient boosting decision tree. This is the default.
  - rf: random forest.
  - dart: GBDT with dropout.
  - goss: GBDT with Gradient-based One-Side Sampling.
- data, train, or train_data: a string giving the name of the file that holds the training data. The default is an empty string. LightGBM uses it to train the model.
- valid, test, valid_data, or test_data: a string giving the name of the file that holds the validation set. The default is an empty string. LightGBM outputs the metrics for this data set. If there are multiple validation sets, separate them with commas.
- num_iterations, num_iteration, num_tree, num_trees, num_round, num_rounds, or num_boost_round: an integer giving the number of boosting iterations. The default is 100.
  - For the Python/R packages, this parameter is ignored. In Python, use the num_boost_round argument of train()/cv() instead.
  - Internally, LightGBM builds num_class*num_iterations trees for multiclass problems.
- learning_rate or shrinkage_rate: a floating point number giving the learning rate. The default value is 0.1. In DART, it also affects the normalization weights of dropped trees.
- num_leaves or num_leaf: an integer giving the number of leaves in one tree. The default value is 31.
- tree_learner or tree: a string giving the tree learner, mainly used for parallel learning. The default is serial. Possible values:
  - serial: single-machine tree learner.
  - feature: feature-parallel tree learner.
  - data: data-parallel tree learner.
  - voting: voting-parallel tree learner.
- num_threads, num_thread, or nthread: an integer giving the number of threads LightGBM uses. The default is OpenMP_default.
  - For the best speed, set it to the number of physical CPU cores, not the number of threads (most CPUs use hyper-threading to expose 2 threads per core).
  - When the data set is small, do not set it too high.
  - For parallel learning, do not use all CPU cores, because this degrades network performance.
- device: a string specifying the computing device. The default is cpu. The value can be gpu or cpu.
  - A smaller max_bin is recommended for faster computation.
  - To speed up learning, the GPU uses 32-bit floating-point summation by default. You can set gpu_use_dp=True to enable 64-bit floating point, but this slows down training.
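Because most core parameters accept several aliases, a parameter dict can end up naming the same setting twice. The following is a minimal sketch of normalizing aliases before training. The ALIASES table and the canonicalize helper are hypothetical names introduced here for illustration (they are not part of LightGBM), the table covers only a few of the aliases listed above, and treating objective as the primary name is an assumption.

```python
# Hypothetical helper, not part of LightGBM: maps a few of the aliases
# listed above to a single primary name.
ALIASES = {
    "num_iterations": ["num_iteration", "num_tree", "num_trees",
                       "num_round", "num_rounds", "num_boost_round"],
    "learning_rate": ["shrinkage_rate"],
    "num_leaves": ["num_leaf"],
    "num_threads": ["num_thread", "nthread"],
    "objective": ["application", "app"],   # assumption: objective is the primary name
    "boosting": ["boost", "boosting_type"],
}
_TO_PRIMARY = {alias: primary
               for primary, aliases in ALIASES.items()
               for alias in aliases}


def canonicalize(params):
    """Return a copy of params with known aliases replaced by primary names."""
    out = {}
    for key, value in params.items():
        primary = _TO_PRIMARY.get(key, key)
        if primary in out:
            # The same setting was given under two different names.
            raise ValueError(f"parameter {primary!r} given more than once")
        out[primary] = value
    return out


clean = canonicalize({"boosting_type": "gbdt", "app": "regression",
                      "num_leaf": 31, "shrinkage_rate": 0.1})
print(clean)
```

Rejecting duplicates is a design choice of this sketch; it makes an accidental clash such as num_leaves together with num_leaf visible instead of silently picking one.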
(2) Learning control parameters
- max_depth: an integer limiting the maximum depth of the tree model. The default value is -1. A value less than 0 means no limit.
- min_data_in_leaf, min_data_per_leaf, min_data, or min_child_samples: an integer giving the minimum number of samples in a leaf node. The default value is 20.
- min_sum_hessian_in_leaf, min_sum_hessian_per_leaf, min_sum_hessian, min_hessian, or min_child_weight: a floating point number giving the minimum sum of Hessians in a leaf node (that is, the minimum sum of sample weights in the leaf). The default value is 1e-3.
- feature_fraction, sub_feature, or colsample_bytree: a floating point number in the range [0.0, 1.0]. The default value is 1.0. If it is less than 1.0, LightGBM randomly selects a subset of the features in each iteration. For example, 0.8 means that 80% of the features are selected before each tree is trained.
- feature_fraction_seed: an integer giving the random seed for feature_fraction. The default value is 2.
- bagging_fraction, sub_row, or subsample: a floating point number in the range [0.0, 1.0]. The default value is 1.0. If it is less than 1.0, LightGBM randomly selects a subset of the samples in each iteration (sampling without replacement). For example, 0.8 means that 80% of the samples are selected, without replacement, before each tree is trained.
- bagging_freq or subsample_freq: an integer; bagging is performed every bagging_freq iterations. If it is 0, bagging is disabled.
- bagging_seed or bagging_fraction_seed: an integer giving the random seed for bagging. The default value is 3.
- early_stopping_round, early_stopping_rounds, or early_stopping: an integer. The default value is 0. If a validation metric does not improve within early_stopping_round rounds, training stops. If the value is 0, early stopping is disabled.
- lambda_l1 or reg_alpha: a floating point number giving the L1 regularization coefficient. The default is 0.
- lambda_l2 or reg_lambda: a floating point number giving the L2 regularization coefficient. The default is 0.
- min_split_gain or min_gain_to_split: a floating point number giving the minimum gain required to perform a split. The default is 0.
- drop_rate: a floating point number in the range [0.0, 1.0] giving the dropout rate. The default is 0.1. This parameter is only used in DART.
- skip_drop: a floating point number in the range [0.0, 1.0] giving the probability of skipping dropout. The default is 0.5. This parameter is only used in DART.
- max_drop: an integer giving the maximum number of trees dropped in one iteration. The default value is 50. A value less than or equal to 0 means no limit. This parameter is only used in DART.
- uniform_drop: a Boolean value indicating whether trees are dropped uniformly. The default value is False. This parameter is only used in DART.
- xgboost_dart_mode: a Boolean value indicating whether XGBoost DART mode is used. The default value is False. This parameter is only used in DART.
- drop_seed: an integer giving the random seed for dropout. The default value is 4. This parameter is only used in DART.
- top_rate: a floating point number in the range [0.0, 1.0] giving the retention ratio of large-gradient data in GOSS. The default value is 0.2. This parameter is only used in GOSS.
- other_rate: a floating point number in the range [0.0, 1.0] giving the retention ratio of small-gradient data in GOSS. The default value is 0.1. This parameter is only used in GOSS.
- min_data_per_group: an integer giving the minimum amount of data in each categorical group. The default value is 100. It applies to categorical features.
- max_cat_threshold: an integer giving the maximum size of the value set for a categorical feature. The default value is 32.
- cat_smooth: a floating point number used for probability smoothing of categorical features. The default value is 10. It reduces the effect of noise in categorical features, especially for categories with very little data.
- cat_l2: a floating point number giving the L2 regularization coefficient used in categorical splits. The default value is 10.
- top_k or topk: an integer used in voting-parallel learning. The default is 20. A larger value gives more accurate results but slows down training.
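To make the roles of top_rate and other_rate concrete, here is a minimal pure-Python sketch of GOSS-style sampling. It assumes the scheme described in the LightGBM paper: rows with the largest absolute gradients are always kept, a random fraction of the remaining rows is sampled, and the sampled rows are up-weighted by (1 - top_rate) / other_rate. The goss_sample function is a hypothetical illustration, not LightGBM's internal implementation.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {row_index: weight} for the rows kept by a GOSS-style sampler."""
    n = len(gradients)
    top_n = int(n * top_rate)      # rows kept because their gradient is large
    other_n = int(n * other_rate)  # rows sampled from the small-gradient rest
    # Row indices sorted by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    kept = {i: 1.0 for i in order[:top_n]}
    # Up-weight the sampled small-gradient rows to compensate for the rows dropped.
    amplify = (1.0 - top_rate) / other_rate
    rng = random.Random(seed)
    for i in rng.sample(order[top_n:], other_n):
        kept[i] = amplify
    return kept


grads = [0.01 * i for i in range(100)]  # made-up gradients
weights = goss_sample(grads)
print(len(weights), sorted(set(weights.values())))
```

With the default rates, 30% of the rows take part in building each tree, which is where the speed-up comes from.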
(3) IO parameters
- max_bin: an integer giving the maximum number of bins (buckets). The default value is 255. LightGBM compresses memory automatically based on this value. For example, with max_bin=255, LightGBM uses uint8 to represent each feature value.
- min_data_in_bin: an integer giving the minimum number of samples in each bin. The default value is 3. It avoids bins that contain only one sample.
- data_random_seed: an integer giving the random seed used when partitioning data for parallel learning (feature parallelism excluded). The default is 1.
- output_model, model_output, or model_out: a string giving the name of the file in which the trained model is saved. The default is LightGBM_model.txt.
- input_model, model_input, or model_in: a string giving the file name of the input model. The default is an empty string. For prediction tasks, this model is used to predict; for training tasks, training continues from this model.
- output_result, predict_result, or prediction_result: a string giving the name of the file in which prediction results are stored. The default is LightGBM_predict_result.txt.
- pre_partition or is_pre_partition: a Boolean value indicating whether the data has been pre-partitioned. The default value is False. If True, different machines use different partitions for training. It is used for parallel learning (feature parallelism excluded).
- is_sparse, is_enable_sparse, or enable_sparse: a Boolean value indicating whether sparse optimization is enabled. The default is True.
- two_round, two_round_loading, or use_two_round_loading: a Boolean value indicating whether two-round loading is used. The default value is False, meaning the data is loaded in one pass. By default, LightGBM maps the data file into memory and loads features from memory, which makes loading faster. With very large data files, however, memory can run out. If the data file is too large, set this to True.
- save_binary, is_save_binary, or is_save_binary_file: a Boolean value indicating whether the data set (including validation sets) is saved to a binary file. The default value is False. If True, later data loading is faster.
- verbosity or verbose: an integer controlling how much intermediate information is output. The default value is 1. If less than 0, only fatal messages are output; if 0, error and warning messages are also output; if greater than 0, info messages are output as well.
- header or has_header: a Boolean value indicating whether the input data has a header. The default is False.
- label or label_column: a string giving the label column. The default is an empty string. You can also specify an integer, such as label=0 to make column 0 the label column, or refer to a column by name with a prefix, for example label=prefix:label_name.
- weight or weight_column: a string giving the sample weight column. The default is an empty string. You can also specify an integer, such as weight=0 to make column 0 the weight column. Note: the index is counted after the label column has been removed. If the label is column 0 and the weight is column 1, then weight=0. You can also refer to a column by name with a prefix, for example weight=prefix:weight_name.
- query, query_column, group, or group_column: a string giving the query/group ID column. The default is an empty string. You can also specify an integer, such as query=0 to make column 0 the query column. Note: the index is counted after the label column has been removed. If the label is column 0 and the query is column 1, then query=0. You can also refer to a column by name with a prefix, for example query=prefix:query_name.
- ignore_column, ignore_feature, or blacklist: a string giving the columns to ignore during training. The default is an empty string. Columns can be given by index, for example ignore_column=0,1,2 ignores columns 0, 1, and 2. Note: the index is counted after the label column has been removed. You can also refer to columns by name with a prefix, for example ignore_column=prefix:ign_name1,ign_name2.
- categorical_feature, categorical_column, cat_feature, or cat_column: a string giving the categorical feature columns. The default is an empty string. Columns can be given by index, for example categorical_feature=0,1,2 makes columns 0, 1, and 2 categorical features. Note: the index is counted after the label column has been removed. You can also refer to columns by name with a prefix, for example categorical_feature=prefix:cat_name1,cat_name2. In a categorical feature, negative values are treated as missing values.
- predict_raw_score, raw_score, or is_predict_raw_score: a Boolean value indicating whether to predict the raw score. The default is False. If True, only the raw score is predicted. This parameter is only used for prediction tasks.
- predict_leaf_index, leaf_index, or is_predict_leaf_index: a Boolean value indicating whether to output, for each sample, the index of the leaf it falls into in each tree. The default is False. During prediction, each sample is assigned to one leaf in every tree; this parameter outputs the indices of those leaves. It is only used for prediction tasks.
- predict_contrib, contrib, or is_predict_contrib: a Boolean value indicating whether to output the contribution of each feature to the prediction for each sample. The default is False. The output has shape [nsamples, nfeatures+1], where the +1 accounts for the bias term. The contributions of one sample sum to its predicted result. This parameter is only used for prediction tasks.
- bin_construct_sample_cnt or subsample_for_bin: an integer giving the number of samples used to construct the histogram bins. The default value is 200000. If the data is very sparse, a larger value can give better training results, at the cost of a longer data loading time.
- num_iteration_predict: an integer giving how many trees to use in prediction. The default value is -1. A value less than or equal to 0 means that all trees of the model are used. This parameter is only used for prediction tasks.
- pred_early_stop: a Boolean value indicating whether early stopping is used to speed up prediction. The default is False. If True, accuracy may be affected.
- pred_early_stop_freq: an integer giving how often early stopping is checked during prediction. The default is 10.
- pred_early_stop_margin: a floating point number giving the margin threshold for early stopping in prediction. The default is 10.0.
- use_missing: a Boolean value indicating whether special handling of missing values is enabled. The default is True. Setting it to False disables missing value handling.
- zero_as_missing: a Boolean value indicating whether all zeros (including values not shown in libSVM/sparse matrices) are treated as missing. The default is False. If False, only np.nan is treated as missing. If True, both np.nan and zero are treated as missing.
- init_score_file: a string giving the path to the initial score file used in training. The default is an empty string, which means train_data_file + ".init" is used (if it exists).
- valid_init_score_file: a string giving the path to the initial score file used in validation. The default is an empty string, which means valid_data_file + ".init" is used (if it exists). If there are several files (one per validation set), separate them with commas.
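The rule that weight, query, ignore_column, and categorical_feature indices are counted after the label column has been removed is easy to get wrong. The sketch below converts such an index back to a column position in the raw data file; raw_column_index is a hypothetical helper written for this illustration only.

```python
def raw_column_index(index_after_label_removed, label_column=0):
    """Map an index counted without the label column to a raw file column."""
    if index_after_label_removed >= label_column:
        # Columns at or after the label position shift right by one in the raw file.
        return index_after_label_removed + 1
    return index_after_label_removed


# Label in raw column 0, weights in raw column 1: the setting must be weight=0.
print(raw_column_index(0, label_column=0))
# Label in raw column 2: index 1 is unaffected, index 2 refers to raw column 3.
print(raw_column_index(1, label_column=2), raw_column_index(2, label_column=2))
```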
(4) Target parameters
- sigmoid: a floating point number, the parameter of the sigmoid function. The default value is 1.0. It is used for binary classification and lambdarank tasks.
- alpha: a floating point number used in the Huber loss function and in quantile regression. The default value is 0.9 in recent LightGBM releases. It is used for Huber regression and quantile regression tasks.
- fair_c: a floating point number used in the Fair loss function. The default value is 1.0. It is used for Fair regression tasks.
- gaussian_eta: a floating point number that controls the width of the Gaussian function. The default value is 1.0. It is used for regression_l1 and Huber regression.
- poisson_max_delta_step: a floating point number used as a parameter of Poisson regression. The default value is 0.7. It is used for Poisson regression tasks.
- scale_pos_weight: a floating point number used to adjust the weight of positive samples. The default value is 1.0. It is used for binary classification tasks.
- boost_from_average: a Boolean value indicating whether the initial score is set to the label average (which can speed up convergence). The default is True. It is used for regression tasks.
- is_unbalance or unbalanced_set: a Boolean value indicating whether the training data is unbalanced. The default is False. It is used for binary classification tasks.
- max_position: an integer giving the position at which NDCG is optimized. The default is 20. It is used for lambdarank tasks.
- label_gain: a sequence of floating point numbers giving the gain of each label. The default values are 0,1,3,7,15,… . It is used for lambdarank tasks.
- num_class or num_classes: an integer giving the number of classes in a multi-class task. The default is 1. It is used for multi-class tasks.
- reg_sqrt: a Boolean value, False by default. If True, the model fits $\sqrt{label}$ and the prediction is automatically converted back to $pred^2$. It is used for regression tasks.
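The default label_gain sequence 0, 1, 3, 7, 15, … is 2^i - 1 for integer label i. The sketch below generates that sequence and uses it in a textbook NDCG@k computation to show how label_gain feeds into the ranking metric. The function names are hypothetical, and returning 0.0 when the ideal DCG is zero is a choice made for this sketch rather than a statement about LightGBM's behavior.

```python
import math


def default_label_gain(num_labels):
    """Gain for integer labels 0..num_labels-1: 0, 1, 3, 7, 15, ..."""
    return [2 ** i - 1 for i in range(num_labels)]


def dcg_at_k(labels_in_ranked_order, k, label_gain):
    """Discounted cumulative gain over the top k positions."""
    return sum(label_gain[label] / math.log2(pos + 2)
               for pos, label in enumerate(labels_in_ranked_order[:k]))


def ndcg_at_k(labels_in_ranked_order, k, label_gain):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(labels_in_ranked_order, reverse=True), k, label_gain)
    if ideal == 0:
        return 0.0  # sketch-level choice for queries with no relevant items
    return dcg_at_k(labels_in_ranked_order, k, label_gain) / ideal


gains = default_label_gain(5)
print(gains)
print(ndcg_at_k([2, 0, 1], k=3, label_gain=gains))
```

Because the gain grows exponentially with the label, placing a highly relevant item low in the ranking costs far more than misplacing a marginally relevant one.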
(5) Metric parameters
- metric: a string specifying the evaluation metric. Defaults: l2 for regression problems, binary_logloss for binary classification problems, and ndcg for lambdarank problems. To use several metrics, separate them with commas. Possible values:
  - l1, mean_absolute_error, mae, or regression_l1: absolute loss.
  - l2, mean_squared_error, mse, regression_l2, or regression: squared loss.
  - l2_root, root_mean_squared_error, or rmse: square root of the squared loss.
  - quantile: the loss used in quantile regression.
  - mape or mean_absolute_percentage_error: MAPE loss.
  - huber: Huber loss.
  - fair: Fair loss.
  - poisson: negative log-likelihood of Poisson regression.
  - gamma: negative log-likelihood of gamma regression.
  - gamma_deviance: residual deviance of gamma regression.
  - tweedie: negative log-likelihood of Tweedie regression.
  - ndcg: NDCG.
  - map or mean_average_precision: mean average precision.
  - auc: AUC.
  - binary_logloss or binary: log loss for binary classification.
  - binary_error: classification error rate for binary classification.
  - multi_logloss, multiclass, softmax, multiclassova, multiclass_ova, ova, or ovr: log loss for multi-class classification.
  - multi_error: classification error rate for multi-class classification.
  - xentropy or cross_entropy: cross-entropy.
  - xentlambda or cross_entropy_lambda: intensity-weighted cross-entropy.
  - kldiv or kullback_leibler: KL divergence.
- metric_freq or output_freq: a positive integer giving how often (in iterations) the metric is output. The default value is 1.
- train_metric, training_metric, or is_training_metric: a Boolean value, False by default. If True, the metric on the training data is also output during training.
- ndcg_at, ndcg_eval_at, or eval_at: a list of integers giving the positions at which NDCG is evaluated. The default values are 1, 2, 3, 4, and 5.
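For reference, several of the metrics above can be written in a few lines of plain Python. This is a sketch of the standard textbook definitions (unweighted, with a 0.5 threshold for binary_error); the function names mirror the metric names but are not LightGBM functions.

```python
import math


def l1(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def l2(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def l2_root(y_true, y_pred):
    """Root mean squared error: the square root of l2."""
    return math.sqrt(l2(y_true, y_pred))


def binary_logloss(y_true, prob, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


def binary_error(y_true, prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction differs from the label."""
    return sum((p > threshold) != bool(t) for t, p in zip(y_true, prob)) / len(y_true)


print(l1([1, 2, 3], [1, 2, 5]), l2([1, 2, 3], [1, 2, 5]), l2_root([1, 2, 3], [1, 2, 5]))
print(binary_logloss([1, 0], [0.8, 0.3]), binary_error([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6]))
```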
2.2 Parameter Impact and Parameter Adjustment Suggestions
The following summarizes how the core parameters affect the model, along with the corresponding tuning suggestions.
(1) Control tree growth
- num_leaves: the number of leaf nodes. It is the main parameter controlling the complexity of the tree model.
  - For a level-wise tree, the equivalent value is $2^{depth}$, where depth is the tree depth. However, with the same number of leaves, a leaf-wise tree is much deeper than a level-wise tree, which easily leads to overfitting. Therefore, num_leaves should be smaller than $2^{depth}$. A leaf-wise tree has no fixed notion of depth, because there is no exact mapping from the number of leaves to depth.
- min_data_in_leaf: the minimum number of samples in each leaf node.
  - It is an important parameter for handling overfitting in leaf-wise trees. Setting it to a large value avoids growing an overly deep tree, but it can also lead to underfitting.
- max_depth: the maximum depth of a tree. This parameter explicitly limits the depth of the tree.
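The relationship between num_leaves and depth can be checked with a few lines of code. The helper below is a hypothetical sanity check, not a LightGBM function: it compares num_leaves with the 2^max_depth leaves of a full level-wise tree of the same depth.

```python
def max_num_leaves(max_depth):
    """Number of leaves of a full binary tree with the given depth."""
    return 2 ** max_depth


def check_num_leaves(num_leaves, max_depth):
    """Return True when num_leaves stays below the level-wise equivalent."""
    if max_depth < 0:
        # A negative max_depth means no depth limit, so there is no bound to compare with.
        return True
    return num_leaves < max_num_leaves(max_depth)


print(check_num_leaves(31, 7))    # 31 < 2**7 = 128
print(check_num_leaves(200, 7))   # 200 is not below 128
```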
(2) Faster training speed
- Use bagging by setting the bagging_fraction and bagging_freq parameters.
- Use feature subsampling by setting the feature_fraction parameter.
- Use a smaller max_bin.
- Use save_binary to speed up data loading in later training runs.
(3) Better model effect
- Use a larger max_bin (training may slow down).
- Use a smaller learning_rate together with a larger num_iterations.
- Use a larger num_leaves (may lead to overfitting).
- Use more training data.
- Try dart.
(4) Alleviate the problem of over-fitting
- Use a smaller max_bin.
- Use a smaller num_leaves.
- Use min_data_in_leaf and min_sum_hessian_in_leaf.
- Use bagging by setting bagging_fraction and bagging_freq.
- Use feature subsampling by setting feature_fraction.
- Use more training data.
- Use lambda_l1, lambda_l2, and min_gain_to_split for regularization.
- Try max_depth to avoid growing overly deep trees.
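Put together, the overfitting suggestions above might translate into a parameter dict like the following. This is only a sketch: every numeric value is an assumption chosen for illustration and should be tuned on your own data, and bagging_enabled is a hypothetical helper that checks the one interaction that is easy to miss (bagging needs both a fraction below 1.0 and a non-zero frequency).

```python
# Illustrative starting point only; all numeric values are assumptions.
regularized_params = {
    "objective": "binary",
    "max_depth": 7,
    "num_leaves": 63,               # kept below 2**max_depth = 128
    "max_bin": 127,                 # smaller max_bin
    "min_data_in_leaf": 50,
    "min_sum_hessian_in_leaf": 1e-2,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,              # must be non-zero, otherwise bagging stays disabled
    "feature_fraction": 0.8,
    "lambda_l1": 0.1,
    "lambda_l2": 1.0,
    "min_gain_to_split": 0.01,
}


def bagging_enabled(params):
    """Bagging only takes effect with bagging_freq > 0 and bagging_fraction < 1.0."""
    return params.get("bagging_freq", 0) > 0 and params.get("bagging_fraction", 1.0) < 1.0


print(bagging_enabled(regularized_params))
```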
3. Built-in modeling method of LightGBM
3.1 Built-in modeling method
LightGBM has a built-in (native) modeling interface with the following data format and core training method:
- Data in the lightgbm.Dataset format.
- Training through the lightgbm.train interface.
The following is a simple official example that reads tab-separated data files, wraps them in the Dataset format, and trains a model with the specified parameters.
# coding: utf-8
import json
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import mean_squared_error

# Load the data set
print('Load data...')
df_train = pd.read_csv('./data/regression.train.txt', header=None, sep='\t')
df_test = pd.read_csv('./data/regression.test.txt', header=None, sep='\t')

# Split labels (column 0) and features for the training and test sets
y_train = df_train[0].values
y_test = df_test[0].values
X_train = df_train.drop(0, axis=1).values
X_test = df_test.drop(0, axis=1).values

# Build the LightGBM Dataset objects
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# Specify the parameters as a dict
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'auc'},
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

print('Start training...')
# Train the model
# Note: early_stopping_rounds is accepted by LightGBM 3.x (the version installed above);
# LightGBM 4.0+ removed it in favor of callbacks=[lgb.early_stopping(5)].
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=20,
                valid_sets=lgb_eval,
                early_stopping_rounds=5)

# Save the model to a file
print('Save the model...')
gbm.save_model('model.txt')

print('Begin to predict...')
# Predict with the best iteration found by early stopping
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)

# Evaluate
print('The estimated RMSE is :')
print(mean_squared_error(y_test, y_pred) ** 0.5)
Load data...
Start training...
[1] valid_0's l2: 0.24288 valid_0's auc: 0.764496
Training until validation scores don't improve for 5 rounds.
...
[9] valid_0's l2: 0.215351 valid_0's auc: 0.809041
...
Early stopping, best iteration is:
[9] valid_0's l2: 0.215351 valid_0's auc: 0.809041
Save the model...
Begin to predict...
The estimated RMSE is :
0.4640593794679212
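The early stopping behavior in this run (training stops once the validation metric has gone 5 rounds without improving, and the best iteration is reported) can be sketched in plain Python. The function below is a simplified illustration of the idea for a single metric, not LightGBM's implementation, and the scores are made up rather than taken from the run above.

```python
def train_with_early_stopping(scores, early_stopping_round, higher_is_better=False):
    """Scan per-iteration validation scores.

    Returns (best_iteration, last_iteration_run), both 1-based.
    """
    best_iter, best_score = 0, None
    for it, score in enumerate(scores, start=1):
        if best_score is None:
            improved = True
        elif higher_is_better:
            improved = score > best_score
        else:
            improved = score < best_score
        if improved:
            best_iter, best_score = it, score
        elif it - best_iter >= early_stopping_round:
            # No improvement for early_stopping_round rounds: stop here.
            return best_iter, it
    return best_iter, len(scores)


# Made-up validation losses (lower is better).
losses = [0.50, 0.45, 0.43, 0.44, 0.46, 0.45, 0.47, 0.48, 0.40]
print(train_with_early_stopping(losses, early_stopping_round=5))
```

In this made-up series the loss would have improved again at round 9, but training never gets there; this is the trade-off behind choosing early_stopping_round.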
3.2 Setting sample weight
LightGBM modeling is flexible: it lets us assign a different weight to each sample during learning. Doing so is simple: we provide the model with an array of weights whose length equals the number of samples.
The following is a typical example. binary.train and binary.test are read and loaded into the lightgbm.Dataset format, and the sample weights are passed as a Dataset construction argument (here as array-like data read from the weight files). The model is then trained with the built-in lightgbm.train interface.
# coding: utf-8
import json
import lightgbm as lgb
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings("ignore")

# Load the data set
print('Load data...')
df_train = pd.read_csv('./data/binary.train', header=None, sep='\t')
df_test = pd.read_csv('./data/binary.test', header=None, sep='\t')
W_train = pd.read_csv('./data/binary.train.weight', header=None)[0]
W_test = pd.read_csv('./data/binary.test.weight', header=None)[0]

# Split labels (column 0) and features
y_train = df_train[0].values
y_test = df_test[0].values
X_train = df_train.drop(0, axis=1).values
X_test = df_test.drop(0, axis=1).values
num_train, num_feature = X_train.shape

# Pass the sample weights when building the Dataset objects
lgb_train = lgb.Dataset(X_train, y_train,
                        weight=W_train, free_raw_data=False)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train,
                       weight=W_test, free_raw_data=False)

# Set the parameters
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

# Generate the feature names
feature_name = ['feature_' + str(col) for col in range(num_feature)]

print('Start training...')
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                valid_sets=lgb_train,  # evaluate on the training set
                feature_name=feature_name,
                categorical_feature=[21])
Load data...
Start training...
[1] training's binary_logloss: 0.68205
[2] training's binary_logloss: 0.673618
[3] training's binary_logloss: 0.665891
[4] training's binary_logloss: 0.656874
[5] training's binary_logloss: 0.648523
[6] training's binary_logloss: 0.641874
[7] training's binary_logloss: 0.636029
[8] training's binary_logloss: 0.629427
...
[10] training's binary_logloss: 0.617593
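To see what a per-sample weight does to a metric such as binary_logloss, the sketch below computes the weighted average of the per-sample losses, normalized by the sum of the weights. It is a plain-Python illustration of that standard definition; weighted_binary_logloss is a hypothetical name and the inputs are made up.

```python
import math


def weighted_binary_logloss(y_true, prob, weight=None, eps=1e-15):
    """Weighted mean of per-sample log loss; weight needs one entry per sample."""
    if weight is None:
        weight = [1.0] * len(y_true)
    if len(weight) != len(y_true):
        raise ValueError("weight must have the same length as the samples")
    total = 0.0
    for t, p, w in zip(y_true, prob, weight):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= w * (t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / sum(weight)


y = [1, 0]
p = [0.8, 0.3]
print(weighted_binary_logloss(y, p))                 # both samples count equally
print(weighted_binary_logloss(y, p, weight=[3, 1]))  # the first sample dominates
```

Scaling all weights by the same constant leaves the result unchanged; only the relative weights matter.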
3.3 Model storage and loading
The model object produced by the training process above can be saved with its save_model method. The saved model can later be loaded back into memory with lightgbm.Booster and used to predict on the test set.
The code is as follows:
# Check the feature names
print('Complete 10 rounds of training...')
print('The seventh feature is :')
print(repr(lgb_train.feature_name[6]))
# Save the model (the ./model directory must already exist)
gbm.save_model('./model/lgb_model.txt')
# Feature names
print('Feature Name :')
print(gbm.feature_name())
# Feature importance
print('Feature importance :')
print(list(gbm.feature_importance()))
# Load the model
print('Load model for prediction')
bst = lgb.Booster(model_file='./model/lgb_model.txt')
# Predict
y_pred = bst.predict(X_test)
# Evaluate on the test set
print('RMSE on test set is :')
print(mean_squared_error(y_test, y_pred) ** 0.5)
Complete 10 rounds of training...
The seventh feature is :
'feature_6'
Feature Name :
['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20', 'feature_21', 'feature_22', 'feature_23', 'feature_24', 'feature_25', 'feature_26', 'feature_27']
Feature importance :
[8, 5, 1, 19, 7, 33, 2, 0, 2, 10, 5, 2, 0, 9, 3, 3, 0, 2, 2, 5, 1, 0, 36, 3, 33, 45, 29, 35]
Load model for prediction
RMSE on test set is :
0.4629245607636925
3.4 Continue training
LightGBM is a boosting model, so each training round adds a new base learner. LightGBM also supports continuing training from an existing model and parameters, without starting from scratch every time.
Here is a typical example. We load the LightGBM model that has already been trained for 10 rounds (an ensemble of 10 trees) and continue training from it. Along the way, some parameters are changed: the learning rate is adjusted, and techniques such as bagging are tuned to reduce overfitting.
# Continue training
# Initialize from the model saved in ./model/lgb_model.txt
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model='./model/lgb_model.txt',
                valid_sets=lgb_eval)
print('Initialize the old model and complete the 10-20 training rounds... ')
# Adjust hyperparameters during training
# For example, decay the learning rate every round. The reset_parameter callback
# is used because the learning_rates argument was removed in LightGBM 4.0
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,
                valid_sets=lgb_eval,
                callbacks=[lgb.reset_parameter(learning_rate=lambda iter: 0.05 * (0.99 ** iter))])
print('Gradually adjust the learning rate to complete round 20-30... ')
# Adjust other hyperparameters
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,
                valid_sets=lgb_eval,
                callbacks=[lgb.reset_parameter(bagging_fraction=[0.7] * 5 + [0.6] * 5)])
print('Gradually adjust the Bagging ratio to complete round 30 to 40... ')
[11] valid_0's binary_logloss: 0.616177
[12] valid_0's binary_logloss: 0.611792
[13] valid_0's binary_logloss: 0.607043
[14] valid_0's binary_logloss: 0.602314
[15] valid_0's binary_logloss: 0.598433
[16] valid_0's binary_logloss: 0.595238
[17] valid_0's binary_logloss: 0.592047
[18] valid_0's binary_logloss: 0.588673
[19] valid_0's binary_logloss: 0.586084
[20] valid_0's binary_logloss: 0.584033
Initialize the old model and complete the 10-20 training rounds...
[21] valid_0's binary_logloss: 0.616177
[22] valid_0's binary_logloss: 0.611834
[23] valid_0's binary_logloss: 0.607177
[24] valid_0's binary_logloss: 0.602577
[25] valid_0's binary_logloss: 0.59831
[26] valid_0's binary_logloss: 0.595259
[27] valid_0's binary_logloss: 0.592201
[28] valid_0's binary_logloss: 0.589017
[29] valid_0's binary_logloss: 0.586597
[30] valid_0's binary_logloss:
Gradually adjust the learning rate to complete round 20-30...
[31] valid_0's binary_logloss: 0.616053
[32] valid_0's binary_logloss: 0.612291
[33] valid_0's binary_logloss: 0.60856
[34] valid_0's binary_logloss: 0.605387
[35] valid_0's binary_logloss: 0.601744
[36] valid_0's binary_logloss: 0.598556
[37] valid_0's binary_logloss: 0.595585
[38] valid_0's binary_logloss: 0.593228
[39] valid_0's binary_logloss: 0.59018
[40] valid_0's binary_logloss: 0.588391
Gradually adjust the Bagging ratio to complete round 30 to 40...
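To see what the two schedules used above actually produce, the values can be tabulated in plain Python. This is only an illustrative sketch: the function name decayed_learning_rate is hypothetical, and it assumes the round index starts at 0 within each call to lgb.train. In the example above the two schedules are applied in separate training calls; they are printed side by side here only for comparison.

```python
def decayed_learning_rate(iteration, base=0.05, decay=0.99):
    """Exponential decay: the same formula as the lambda used above."""
    return base * (decay ** iteration)


# A callable schedule is evaluated once per round.
lr_schedule = [decayed_learning_rate(i) for i in range(10)]

# A list schedule supplies one explicit value per round, so its length
# has to match num_boost_round (10 in the example above).
bagging_schedule = [0.7] * 5 + [0.6] * 5

for i, (lr, bag) in enumerate(zip(lr_schedule, bagging_schedule)):
    print(f"round {i}: learning_rate={lr:.5f}, bagging_fraction={bag}")
```

A function is convenient for smooth decay, while a list is convenient for step changes such as dropping the bagging ratio halfway through.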
3.5 User-defined loss function
LightGBM supports custom loss functions and evaluation criteria during training. A custom loss function must return the first and second derivatives of the loss with respect to the predicted scores, and a custom evaluation criterion computes a metric from the labels and the predicted values. The loss function drives tree-structure learning during training, while the evaluation criterion is typically used to assess performance on the validation set.
# A custom loss function must return the first and second derivatives of the loss
def loglikelihood(preds, train_data):
    labels = train_data.get_label()
    preds = 1. / (1. + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1. - preds)
    return grad, hess
# Custom evaluation function: returns (metric name, value, is_higher_better)
# Note: with a custom objective, preds are raw scores (before the sigmoid)
def binary_error(preds, train_data):
    labels = train_data.get_label()
    return 'error', np.mean(labels != (preds > 0.5)), False
# fobj is accepted by LightGBM 3.x; in LightGBM 4.0 and later, pass the function
# as params['objective'] instead
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,
                fobj=loglikelihood,
                feval=binary_error,
                valid_sets=lgb_eval)
print('Complete round 40-50 with custom loss function and assessment criteria... ')
[41] valid_0's binary_logloss: 0.614429 valid_0's error: 0.268
[42] valid_0's binary_logloss: 0.610689 valid_0's error: 0.26
[43] valid_0's binary_logloss: 0.606267 valid_0's error: 0.264
[44] valid_0's binary_logloss: 0.601949 valid_0's error: 0.258
[45] valid_0's binary_logloss: 0.597271 valid_0's error: 0.266
[46] valid_0's binary_logloss: 0.593971 valid_0's error: 0.276
[47] valid_0's binary_logloss: 0.591427 valid_0's error: 0.278
[48] valid_0's binary_logloss: 0.588301 valid_0's error: 0.284
[49] valid_0's binary_logloss: 0.586562 valid_0's error: 0.288
[50] valid_0's binary_logloss:
Complete round 40-50 with custom loss function and assessment criteria...
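The gradient and Hessian returned by the custom loss above are the derivatives of the binary log loss with respect to the raw score: the gradient is p - y and the Hessian is p * (1 - p), where p is the sigmoid of the score. The following sketch checks those formulas against finite differences using only the standard library. All function names and the test points are illustrative assumptions, not LightGBM code.

```python
import math


def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))


def logloss(score, label):
    """Binary log loss as a function of the raw (pre-sigmoid) score."""
    p = sigmoid(score)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def analytic_grad_hess(score, label):
    """Closed-form derivatives, matching the custom loss function above."""
    p = sigmoid(score)
    return p - label, p * (1.0 - p)


def numeric_grad_hess(score, label, eps=1e-4):
    """Central finite differences for the first and second derivative."""
    f_plus = logloss(score + eps, label)
    f_mid = logloss(score, label)
    f_minus = logloss(score - eps, label)
    grad = (f_plus - f_minus) / (2 * eps)
    hess = (f_plus - 2 * f_mid + f_minus) / (eps ** 2)
    return grad, hess


# Arbitrary (score, label) pairs chosen for illustration.
results = []
for score, label in [(-1.5, 0.0), (0.3, 1.0), (2.0, 0.0)]:
    analytic = analytic_grad_hess(score, label)
    numeric = numeric_grad_hess(score, label)
    results.append((analytic, numeric))
    print(score, label, analytic, numeric)
```

The same kind of check is a cheap way to validate any other custom objective before handing it to the training loop.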
4. LightGBM estimator interface
4.1 SKLearn-style estimator interface
Like XGBoost, LightGBM supports modeling through SKLearn's unified estimator interface. The following is a typical reference example in which the training and test sets are read as DataFrames. LGBMRegressor is initialized directly from lightgbm and trained with fit; its usage and interface are the same as those of other SKLearn estimators.
# coding: utf-8
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
# load data
print('Load data... ')
df_train = pd.read_csv('./data/regression.train.txt', header=None, sep='\t')
df_test = pd.read_csv('./data/regression.test.txt', header=None, sep='\t')
# Extract features and labels
y_train = df_train[0].values
y_test = df_test[0].values
X_train = df_train.drop(0, axis=1).values
X_test = df_test.drop(0, axis=1).values
print('Start training... ')
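# Initialize LGBMRegressor
gbm = lgb.LGBMRegressor(objective='regression',
                        num_leaves=31,
                        learning_rate=0.05,
                        n_estimators=20)
# Use fit function to fit
# Early stopping is passed as a callback; the early_stopping_rounds argument
# of fit was removed in LightGBM 4.0
gbm.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric='l1',
        callbacks=[lgb.early_stopping(stopping_rounds=5)])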
# prediction
print('Begin to predict... ')
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_)
# Evaluate the predicted results
print('The RMSE of the predicted result is :')
print(mean_squared_error(y_test, y_pred) ** 0.5)
Load data...
Start training...
[1] valid_0's l1:
Training until validation scores don't improve for 5 rounds.
[2] valid_0's l1: 0.486563
[3] valid_0's l1: 0.481489
[4] valid_0's l1: 0.476848
[5] valid_0's l1: 0.47305
[6] valid_0's l1: 0.469049
[7] valid_0's l1: 0.465556
[8] valid_0's l1: 0.462208
[9] valid_0's l1: 0.458676
[10] valid_0's l1:
[11] valid_0's l1: 0.452047
[12] valid_0's l1: 0.449158
[13] valid_0's l1: 0.44608
[14] valid_0's l1:
[15] valid_0's l1: 0.440643
[16] valid_0's l1: 0.437687
[17] valid_0's l1: 0.435454
[18] valid_0's l1: 0.433288
[19] valid_0's l1: 0.431297
[20] valid_0's l1: 0.428946
Did not meet early stopping. Best iteration is:
[20] valid_0's l1: 0.428946
Begin to predict...
The RMSE of the predicted result is :
0.4441153344254208
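The early-stopping rule in the example above ("stop when the validation score has not improved for 5 rounds") can be sketched in a few lines of plain Python. This is a simplified illustration of the idea for a lower-is-better metric, not LightGBM's actual implementation; the function name and the score lists are hypothetical.

```python
def early_stopping_round(scores, patience):
    """Scan per-round validation scores (lower is better).

    Returns (best_round, best_score, stopped_early), with rounds counted from 1.
    Training stops once `patience` rounds pass without a new best score.
    """
    best_round, best_score = 0, float("inf")
    for round_idx, score in enumerate(scores, start=1):
        if score < best_score:
            best_round, best_score = round_idx, score
        elif round_idx - best_round >= patience:
            return best_round, best_score, True
    return best_round, best_score, False


# Hypothetical validation curve that stalls after round 3.
stalled = [0.50, 0.48, 0.47, 0.471, 0.472, 0.473, 0.474, 0.475, 0.46]
print(early_stopping_round(stalled, patience=5))

# Hypothetical curve that keeps improving, as in the log above.
improving = [0.49, 0.48, 0.47]
print(early_stopping_round(improving, patience=5))
```

In the log above the l1 score improves in every round, so the stopping condition is never met and the best iteration is simply the last one.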
4.2 Grid search for hyperparameter tuning
As mentioned above, LightGBM's estimator interface is used in the same way as other estimators in SKLearn, so we can also use SKLearn's hyperparameter tuning tools for model tuning.
The following is a code example that tunes hyperparameters with a typical grid search combined with cross-validation. We provide a dictionary of candidate parameter lists and use GridSearchCV to run cross-validation experiments and select the best LightGBM hyperparameters among the candidates.
# Select the optimal hyperparameters with scikit-learn grid search cross-validation
estimator = lgb.LGBMRegressor(num_leaves=31)
param_grid = {
    'learning_rate': [0.01, 0.1, 1],
    'n_estimators': [20, 40]
}
gbm = GridSearchCV(estimator, param_grid)
gbm.fit(X_train, y_train)
print('The optimal hyperparameter found by grid search is :')
print(gbm.best_params_)
The optimal hyperparameter found by grid search is :
{'learning_rate': 0.1, 'n_estimators': 40}
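It helps to know how much work a grid search implies before running it. The sketch below enumerates the candidate combinations for the grid used above with itertools.product; the helper name expand_grid is hypothetical, and the fold count of 5 is an assumption based on the default cv setting in recent scikit-learn versions.

```python
from itertools import product

param_grid = {
    'learning_rate': [0.01, 0.1, 1],
    'n_estimators': [20, 40],
}


def expand_grid(grid):
    """Return every combination of the candidate values as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[key] for key in keys))]


candidates = expand_grid(param_grid)
n_folds = 5  # assumption: default number of cross-validation folds
total_fits = len(candidates) * n_folds

for candidate in candidates:
    print(candidate)
print(f"{len(candidates)} candidates x {n_folds} folds = {total_fits} model fits")
```

Because the number of fits grows multiplicatively with every added parameter, it is usually worth keeping each candidate list short.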
4.3 Plotting and interpretation
LightGBM supports visual presentation and interpretation of model training, including visualization of loss function values and evaluation criteria results during training, sorting and visualization of feature importance after training, and visualization of base learners (such as decision trees).
The reference code is as follows:
# coding: utf-8
import lightgbm as lgb
import pandas as pd
try:
    import matplotlib.pyplot as plt
except ImportError:
    raise ImportError('You need to install matplotlib for plotting.')
# Load the data set
print('Load data... ')
df_train = pd.read_csv('./data/regression.train.txt', header=None, sep='\t')
df_test = pd.read_csv('./data/regression.test.txt', header=None, sep='\t')
# Extract features and labels
y_train = df_train[0].values
y_test = df_test[0].values
X_train = df_train.drop(0, axis=1).values
X_test = df_test.drop(0, axis=1).values
# Construct the Dataset format used by LGB
lgb_train = lgb.Dataset(X_train, y_train)
lgb_test = lgb.Dataset(X_test, y_test, reference=lgb_train)
# set parameters
params = {
    'num_leaves': 5,
    'metric': ('l1', 'l2'),
    'verbose': 0
}
evals_result = {} # to record eval results for plotting
print('Start training... ')
# training
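gbm = lgb.train(params,
                lgb_train,
                num_boost_round=100,
                valid_sets=[lgb_train, lgb_test],
                feature_name=['f' + str(i + 1) for i in range(28)],
                categorical_feature=[21],
                # Record metrics for plotting and log every 10 rounds; the evals_result
                # and verbose_eval arguments were removed in LightGBM 4.0
                callbacks=[lgb.record_evaluation(evals_result),
                           lgb.log_evaluation(period=10)])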
print('Drawing during training... ')
ax = lgb.plot_metric(evals_result, metric='l1')
plt.show()
print('Draw the importance of features... ')
ax = lgb.plot_importance(gbm, max_num_features=10)
plt.show()
print('Draw the 84th tree... ')
ax = lgb.plot_tree(gbm, tree_index=83, figsize=(20, 8), show_info=['split_gain'])
plt.show()
#print(' Draw the 84th tree with Graphviz... ')
#graph = lgb.create_tree_digraph(gbm, tree_index=83, name='Tree84')
#graph.render(view=True)
Load data...
Start training...
[10] training's l1: 0.457448 valid_1's l2: 0.21641 valid_1's l1: 0.456464
[20] training's l2: 0.205099 training's l1: 0.436869 valid_1's l2: 0.201616 valid_1's l1: 0.434057
[30] training's l2: 0.197421 training's l1: 0.421302 valid_1's l2: 0.192514 valid_1's l1: 0.417019
[40] training's l2: 0.192856 training's l1: 0.411107 valid_1's l2: 0.187258 valid_1's l1: 0.406303
[50] training's l1: 0.403695 valid_1's l2: 0.183688 valid_1's l1: 0.398997
[60] training's l1: 0.398704 valid_1's l2: 0.181009 valid_1's l1: 0.393977
[70] training's l1: 0.394876 valid_1's l2: 0.178803 valid_1's l1: 0.389805
[80] training's l1: 0.391147 valid_1's l2: 0.176799 valid_1's l1: 0.386476
[90] training's l1: 0.388101 valid_1's l2: 0.175775 valid_1's l1: 0.384404
[100] training's l1: 0.385174 valid_1's l2: 0.175321 valid_1's l1: 0.382929
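The evals_result dictionary filled during training is what the metric plot is drawn from. It is nested by dataset name and then by metric name, with one value per boosting round. The sketch below mimics that layout with short, made-up lists (the numbers are hypothetical, not taken from the run above) and shows how to look up the best round for a metric; the helper name best_round is also hypothetical.

```python
# Hypothetical recorded history in the same nested layout:
# dataset name -> metric name -> list with one value per round.
evals_result = {
    'training': {'l1': [0.48, 0.46, 0.45], 'l2': [0.23, 0.22, 0.21]},
    'valid_1': {'l1': [0.47, 0.455, 0.457], 'l2': [0.225, 0.215, 0.216]},
}


def best_round(history, dataset, metric):
    """Return (round, value) of the lowest recorded value, rounds counted from 1."""
    values = history[dataset][metric]
    best_index = min(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]


print(best_round(evals_result, 'valid_1', 'l1'))
print(best_round(evals_result, 'training', 'l2'))
```

Here the dataset names 'training' and 'valid_1' follow the names that appear in the training log above.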
References
- Illustrated machine learning algorithms | from beginner to master series
- Illustrated machine learning | LightGBM model explanation