• By Han Xinzi @ShowMeAI
  • Tutorial address: www.showmeai.tech/tutorials/4…
  • Article address: www.showmeai.tech/article-det…
  • Statement: All rights reserved. For reprints, please contact the platform and the author, and indicate the source.

Introduction

LightGBM is a boosting ensemble model developed by Microsoft. Like XGBoost, LightGBM is an optimized and efficient implementation of GBDT. The two are similar in principle, but LightGBM outperforms XGBoost in many respects.

In this article, ShowMeAI presents the engineering application methods of LightGBM. Readers interested in the principles behind LightGBM are welcome to consult another ShowMeAI article, Illustrated Machine Learning | LightGBM Model Explained.

1. LightGBM installation

LightGBM, a common library in the powerful Python machine learning toolbox, is also relatively easy to install.

1.1 Python and IDE Environment Settings

For Python environment and IDE setup, refer to the ShowMeAI article Illustrated Python | Installation and Environment Setup.

1.2 Installing the Tool Library

(1) Linux/Mac, etc.

On these systems, installing LightGBM is easy: simply type the following command on the command line and wait for the installation to complete.

pip install lightgbm

You can also choose a PyPI mirror in China for a faster installation:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple lightgbm

(2) Windows

For Windows, the most efficient and convenient installation method is to download the LightGBM wheel of the matching version from www.lfd.uci.edu/~gohlke/pyt… and then install it with the following command:

pip install lightgbm-3.3.2-cp310-cp310-win_amd64.whl
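
After installation, a quick sanity check confirms that the library imports correctly (a minimal sketch; the printed version depends on what you installed):

import lightgbm as lgb

# Print the installed LightGBM version
print(lgb.__version__)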

2. LightGBM Parameter Manual

In a previous ShowMeAI article, we explained XGBoost's three types of parameters: general parameters, learning-objective parameters, and booster parameters. LightGBM's tunable parameters are even more abundant, including core parameters, learning control parameters, IO parameters, objective parameters, metric parameters, network parameters, GPU parameters, and model parameters. In practice, the most frequently adjusted ones are the core parameters, learning control parameters, and metric parameters. Below we expand on these parameters; for more details, refer to the LightGBM Chinese documentation.

2.1 Parameter Description

(1) Core parameters

  • `config` or `config_file`: a string specifying the path to the configuration file. The default is an empty string.

  • `task`: a string specifying the task to perform. Can be:

    • `train` or `training`: a training task. This is the default.
    • `predict`, `prediction`, or `test`: a prediction task.
    • `convert_model`: a model conversion task, which converts the model file into if-else format.
  • `application`, `objective`, or `app`: a string specifying the problem type. Can be:

    • `regression`, `regression_l2`, `mean_squared_error`, `mse`, `l2`, `l2_root`, `root_mean_squared_error`, or `rmse`: a regression task with L2 loss. The default is `regression`.
    • `regression_l1`, `mae`, or `mean_absolute_error`: a regression task with L1 loss.
    • `huber`: a regression task with the Huber loss function.
    • `fair`: a regression task with the Fair loss function.
    • `poisson`: a Poisson regression task.
    • `quantile`: a quantile regression task.
    • `quantile_l2`: a quantile regression task with L2 loss.
    • `mape` or `mean_absolute_percentage_error`: a regression task with MAPE loss.
    • `gamma`: a gamma regression task.
    • `tweedie`: a Tweedie regression task.
    • `binary`: a binary classification task, using log loss as the objective function.
    • `multiclass`: a multiclass classification task, using softmax as the objective function. The `num_class` parameter must be set.
    • `multiclassova`, `multiclass_ova`, `ova`, or `ovr`: a multiclass classification task with a one-vs-all binary objective function. The `num_class` parameter must be set.
    • `xentropy` or `cross_entropy`: the objective function is cross entropy (with optional linear weights). Labels must lie in [0, 1].
    • `xentlambda` or `cross_entropy_lambda`: an alternative parameterization of cross entropy. Labels must lie in [0, 1].
    • `lambdarank`: a ranking task. In a `lambdarank` task, labels should be integers, with larger values indicating higher relevance. The `label_gain` parameter can be used to set the gain (weight) of the integer labels.
  • `boosting`, `boost`, or `boosting_type`: a string specifying the base learner algorithm. Can be:

    • `gbdt`: traditional gradient boosting decision tree. This is the default.
    • `rf`: random forest.
    • `dart`: GBDT with dropout.
    • `goss`: GBDT with Gradient-based One-Side Sampling.
  • `data`, `train`, or `train_data`: a string specifying the file containing the training data. The default is an empty string. LightGBM uses this data to train the model.

  • `valid`, `test`, `valid_data`, or `test_data`: a string specifying the file(s) containing the validation set(s). The default is an empty string. LightGBM outputs metrics for these datasets. Multiple validation sets are separated by commas.

  • `num_iterations`, `num_iteration`, `num_tree`, `num_trees`, `num_round`, `num_rounds`, or `num_boost_round`: an integer specifying the number of boosting iterations. The default is 100.

    • For the Python/R packages, this parameter is ignored. In Python, use the `num_boost_round` argument of `train()`/`cv()` instead.
    • Internally, LightGBM builds `num_class * num_iterations` trees for multiclass problems.
  • `learning_rate` or `shrinkage_rate`: a floating-point number specifying the learning rate. The default is 0.1. In `dart`, it also affects the normalized weights of dropped trees.

  • `num_leaves` or `num_leaf`: an integer specifying the maximum number of leaves in one tree. The default is 31.

  • `tree_learner` or `tree`: a string specifying the tree learner, used for parallel learning. The default is `serial`. Can be:

    • `serial`: single-machine tree learner.
    • `feature`: feature-parallel tree learner.
    • `data`: data-parallel tree learner.
    • `voting`: voting-parallel tree learner.
  • `num_threads`, `num_thread`, or `nthread`: an integer specifying the number of threads LightGBM uses. The default is the OpenMP default.

    • For best speed, set it to the number of physical CPU cores, not the number of threads (most CPUs use hyper-threading to provide two threads per core).
    • Do not set it too large when the dataset is small.
    • For parallel learning, do not use all CPU cores, as this can lead to poor network performance.
  • `device`: a string specifying the computing device. The default is `cpu`. Can be `gpu` or `cpu`.

    • A smaller `max_bin` is recommended for faster computation.
    • To speed up learning, the GPU uses 32-bit floating-point summation by default. You can set `gpu_use_dp=True` to enable 64-bit floating point, but this slows down training.
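
To show how the core parameters above fit together in the Python package, here is a minimal sketch of a binary classification run on synthetic data (the data and all parameter values are purely illustrative):

import numpy as np
import lightgbm as lgb

# Synthetic binary-classification data, for illustration only
X = np.random.rand(500, 10)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

params = {
    'objective': 'binary',   # binary classification with log loss
    'boosting': 'gbdt',      # traditional gradient boosting decision tree
    'learning_rate': 0.1,
    'num_leaves': 31,
}

# In the Python package, the iteration count is passed to train(), not num_iterations
gbm = lgb.train(params, lgb.Dataset(X, y), num_boost_round=100)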

(2) Learning control parameters

  • `max_depth`: an integer limiting the maximum depth of the tree model. The default is -1. A negative value means no limit.
  • `min_data_in_leaf`, `min_data_per_leaf`, `min_data`, or `min_child_samples`: an integer specifying the minimum number of samples in a leaf node. The default is 20.
  • `min_sum_hessian_in_leaf`, `min_sum_hessian_per_leaf`, `min_sum_hessian`, `min_hessian`, or `min_child_weight`: a floating-point number specifying the minimum sum of Hessians in a leaf node (that is, the minimum sum of sample weights in the leaf). The default is 1e-3.
  • `feature_fraction`, `sub_feature`, or `colsample_bytree`: a floating-point number in [0.0, 1.0]. The default is 1.0. If less than 1.0, LightGBM randomly selects that fraction of the features in each iteration. For example, 0.8 means 80% of the features are selected before each tree is trained.
  • `feature_fraction_seed`: an integer, the random seed for `feature_fraction`. The default is 2.
  • `bagging_fraction`, `sub_row`, or `subsample`: a floating-point number in [0.0, 1.0]. The default is 1.0. If less than 1.0, LightGBM randomly selects that fraction of the samples (without replacement) in each iteration. For example, 0.8 means 80% of the samples are selected (without replacement) before each tree is trained.
  • `bagging_freq` or `subsample_freq`: an integer; bagging is performed every `bagging_freq` iterations. If 0, bagging is disabled.
  • `bagging_seed` or `bagging_fraction_seed`: an integer, the random seed for bagging. The default is 3.
  • `early_stopping_round`, `early_stopping_rounds`, or `early_stopping`: an integer. The default is 0. Training stops if a validation-set metric has not improved within the last `early_stopping_round` rounds. If 0, early stopping is disabled (see the sketch after this list for its use in the Python API).
  • `lambda_l1` or `reg_alpha`: a floating-point number, the L1 regularization coefficient. The default is 0.
  • `lambda_l2` or `reg_lambda`: a floating-point number, the L2 regularization coefficient. The default is 0.
  • `min_split_gain` or `min_gain_to_split`: a floating-point number, the minimum gain required to perform a split. The default is 0.
  • `drop_rate`: a floating-point number in [0.0, 1.0], the dropout rate. The default is 0.1. Used only in `dart`.
  • `skip_drop`: a floating-point number in [0.0, 1.0], the probability of skipping dropout. The default is 0.5. Used only in `dart`.
  • `max_drop`: an integer, the maximum number of trees dropped in one iteration. The default is 50. A value less than or equal to 0 means no limit. Used only in `dart`.
  • `uniform_drop`: a Boolean indicating whether trees are dropped uniformly. The default is False. Used only in `dart`.
  • `xgboost_dart_mode`: a Boolean indicating whether to use XGBoost's DART mode. The default is False. Used only in `dart`.
  • `drop_seed`: an integer, the random seed for dropout. The default is 4. Used only in `dart`.
  • `top_rate`: a floating-point number in [0.0, 1.0], the retention ratio of large-gradient data in GOSS. The default is 0.2. Used only in `goss`.
  • `other_rate`: a floating-point number in [0.0, 1.0], the retention ratio of small-gradient data in GOSS. The default is 0.1. Used only in `goss`.
  • `min_data_per_group`: an integer, the minimum amount of data per categorical group. The default is 100.
  • `max_cat_threshold`: an integer, the maximum size of the split set for categorical features. The default is 32.
  • `cat_smooth`: a floating-point number used for probability smoothing of categorical features. The default is 10. It reduces the effect of noise in categorical features, especially for categories with very little data.
  • `cat_l2`: a floating-point number, the L2 regularization coefficient in categorical splits. The default is 10.
  • `top_k` or `topk`: an integer used in voting parallelism. The default is 20. A larger value gives more accurate results but slows down training.
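
Among these, early stopping is most often exercised through the Python training API rather than the parameter dictionary. A minimal sketch on synthetic data (illustrative values, using the same lgb.train signature as the examples later in this article):

import numpy as np
import lightgbm as lgb

# Synthetic regression data, for illustration only
X = np.random.rand(500, 10)
y = 2 * X[:, 0] + np.random.randn(500) * 0.1

lgb_train = lgb.Dataset(X[:400], y[:400])
lgb_valid = lgb.Dataset(X[400:], y[400:], reference=lgb_train)

params = {
    'objective': 'regression',
    'min_data_in_leaf': 20,   # guard against overly deep leaf-wise trees
    'lambda_l2': 0.1,         # L2 regularization
}

# Stop when the validation metric has not improved for 10 consecutive rounds
gbm = lgb.train(params, lgb_train, num_boost_round=200,
                valid_sets=lgb_valid, early_stopping_rounds=10)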

(3) the IO parameters

  • `max_bin`: an integer, the maximum number of bins that feature values are bucketed into. The default is 255. LightGBM compresses memory accordingly; for example, with `max_bin=255` it uses uint8 to represent each feature value.
  • `min_data_in_bin`: an integer, the minimum number of samples per bin. The default is 3. It avoids bins that contain only a single sample.
  • `data_random_seed`: an integer, the random seed for data partitioning in parallel learning (excluding feature parallelism). The default is 1.
  • `output_model`, `model_output`, or `model_out`: a string, the file name for saving the trained model. The default is `LightGBM_model.txt`.
  • `input_model`, `model_input`, or `model_in`: a string, the file name of the input model. The default is an empty string. For prediction tasks, this model is used to predict on the data; for training tasks, training continues from this model.
  • `output_result`, `predict_result`, or `prediction_result`: a string specifying the file for storing prediction results. The default is `LightGBM_predict_result.txt`.
  • `pre_partition` or `is_pre_partition`: a Boolean indicating whether the data is already partitioned. The default is False. If True, different machines train on different partitions. Used in parallel learning (excluding feature parallelism).
  • `is_sparse`, `is_enable_sparse`, or `enable_sparse`: a Boolean indicating whether sparse optimization is enabled. The default is True.
  • `two_round`, `two_round_loading`, or `use_two_round_loading`: a Boolean indicating whether to use two-round loading. The default is False, meaning a single load. By default, LightGBM maps the data file into memory and loads features from memory, which makes loading faster, but memory can run out when the data file is very large. In that case, set this to True.
  • `save_binary`, `is_save_binary`, or `is_save_binary_file`: a Boolean indicating whether to save the dataset (including validation sets) to a binary file. The default is False. If True, it speeds up future data loading.
  • `verbosity` or `verbose`: an integer controlling how much information is output. The default is 1. If negative, only fatal messages are output; if 0, error and warning messages are also output; if positive, info messages are printed as well.
  • `header` or `has_header`: a Boolean indicating whether the input data has a header row. The default is False.
  • `label` or `label_column`: a string specifying the label column. The default is an empty string. An integer can also be given, e.g. `label=0` means column 0 is the label column. A column-name prefix can also be used, e.g. `label=prefix:label_name`.
  • `weight` or `weight_column`: a string specifying the sample-weight column. The default is an empty string. An integer can also be given, e.g. `weight=0` means column 0 is the weight column. Note: the index is counted after the label column has been removed. If the label is column 0 and the weight is column 1, specify `weight=0`. A column-name prefix can also be used, e.g. `weight=prefix:weight_name`.
  • `query`, `query_column`, `group`, or `group_column`: a string specifying the query/group ID column. The default is an empty string. An integer can also be given, e.g. `query=0` means column 0 is the query column. Note: the index is counted after the label column has been removed. If the label is column 0 and the query is column 1, specify `query=0`. A column-name prefix can also be used, e.g. `query=prefix:query_name`.
  • `ignore_column`, `ignore_feature`, or `blacklist`: a string specifying columns to be ignored during training. The default is an empty string. Columns can be given by index, e.g. `ignore_column=0,2` means columns 0 and 2 are ignored. Note: the index is counted after the label column has been removed. A column-name prefix can also be used, e.g. `ignore_column=prefix:ign_name1,ign_name2`.
  • `categorical_feature`, `categorical_column`, `cat_feature`, or `cat_column`: a string specifying the categorical feature columns. The default is an empty string. Columns can be given by index, e.g. `categorical_feature=0,2` means columns 0 and 2 are treated as categorical. Note: the index is counted after the label column has been removed. A column-name prefix can also be used, e.g. `categorical_feature=prefix:cat_name1,cat_name2`. In categorical features, negative values are treated as missing values.
  • `predict_raw_score`, `raw_score`, or `is_predict_raw_score`: a Boolean indicating whether to predict raw scores. The default is False. If True, only the raw scores are predicted. Used only for prediction tasks.
  • `predict_leaf_index`, `leaf_index`, or `is_predict_leaf_index`: a Boolean indicating whether to output, for each sample, the index of the leaf it falls into in every tree. The default is False. During prediction, each sample is assigned to one leaf node in each tree; this parameter outputs those leaf indices. Used only for prediction tasks.
  • `predict_contrib`, `contrib`, or `is_predict_contrib`: a Boolean indicating whether to output each feature's contribution to each sample's prediction. The default is False. The output has shape [n_samples, n_features + 1]; the extra column is the bias contribution. All contributions sum to the sample's prediction. Used only for prediction tasks.
  • `bin_construct_sample_cnt` or `subsample_for_bin`: an integer, the number of samples used to construct histograms. The default is 200,000. If the data is very sparse, a larger value gives better training results but increases data loading time.
  • `num_iteration_predict`: an integer specifying how many subtrees to use in prediction. The default is -1. A value less than or equal to 0 means all subtrees of the model are used. Used only for prediction tasks.
  • `pred_early_stop`: a Boolean indicating whether to use early stopping to speed up prediction. The default is False. If True, accuracy may be affected.
  • `pred_early_stop_freq`: an integer, the frequency of checking for early stopping during prediction. The default is 10.
  • `pred_early_stop_margin`: a floating-point number, the margin threshold for early stopping during prediction. The default is 10.0.
  • `use_missing`: a Boolean indicating whether missing-value handling is used. The default is True. If False, missing-value handling is disabled.
  • `zero_as_missing`: a Boolean indicating whether all zeros (including zeros not represented in libsvm/sparse matrices) are treated as missing values. The default is False. If False, only nan is treated as missing; if True, both np.nan and zero are treated as missing values.
  • `init_score_file`: a string, the path to the initial-score file used during training. The default is an empty string, meaning `train_data_file + ".init"` (if it exists).
  • `valid_init_score_file`: a string, the path to the initial-score file used during validation. The default is an empty string, meaning `valid_data_file + ".init"` (if it exists). If there are multiple validation sets, separate the paths with commas.
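
In the Python API, the three prediction switches above map to keyword arguments of Booster.predict. A minimal sketch (assuming gbm is a trained Booster and X_test a feature matrix, as in the examples later in this article):

# Raw scores, before any objective transformation such as the sigmoid
raw = gbm.predict(X_test, raw_score=True)

# The leaf index each sample falls into in every tree: shape [n_samples, n_trees]
leaves = gbm.predict(X_test, pred_leaf=True)

# Per-feature contributions plus a bias column: shape [n_samples, n_features + 1]
contribs = gbm.predict(X_test, pred_contrib=True)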

(4) Target parameters

  • `sigmoid`: a floating-point number, the parameter of the sigmoid function. The default is 1.0. Used in binary classification and lambdarank tasks.
  • `alpha`: a floating-point number used in the Huber loss and quantile regression. The default is 0.9. Used in Huber regression and quantile regression tasks.
  • `fair_c`: a floating-point number used in the Fair loss. The default is 1.0. Used in Fair regression tasks.
  • `gaussian_eta`: a floating-point number controlling the width of the Gaussian function. The default is 1.0. Used in regression_l1 and Huber regression.
  • `poisson_max_delta_step`: a floating-point number, a parameter of Poisson regression. The default is 0.7. Used in Poisson regression tasks.
  • `scale_pos_weight`: a floating-point number that adjusts the weight of positive samples. The default is 1.0. Used in binary classification tasks.
  • `boost_from_average`: a Boolean indicating whether to adjust the initial score to the label mean (which can make convergence faster). The default is True. Used in regression tasks.
  • `is_unbalance` or `unbalanced_set`: a Boolean indicating whether the training data is unbalanced. The default is False. Used in binary classification tasks.
  • `max_position`: an integer, the NDCG position to be optimized. The default is 20. Used in lambdarank tasks.
  • `label_gain`: a sequence of floating-point numbers giving the gain of each label. The defaults are 0, 1, 3, 7, 15, ... Used in lambdarank tasks.
  • `num_class` or `num_classes`: an integer, the number of classes in a multiclass task. The default is 1. Used in multiclass tasks.
  • `reg_sqrt`: a Boolean, False by default. If True, the model fits sqrt(label) instead of the label, and predictions are automatically converted back by squaring (pred^2). Used in regression tasks.
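
As noted above, multiclass objectives must be paired with num_class. A minimal sketch on synthetic data (all values illustrative):

import numpy as np
import lightgbm as lgb

# Synthetic 3-class data, for illustration only
X = np.random.rand(300, 5)
y = np.random.randint(0, 3, size=300)

params = {
    'objective': 'multiclass',
    'num_class': 3,              # required for multiclass objectives
    'metric': 'multi_logloss',
}
gbm = lgb.train(params, lgb.Dataset(X, y), num_boost_round=50)

# Predictions come back as class probabilities, shape [n_samples, num_class]
proba = gbm.predict(X)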

(5) Measurement parameters

  • `metric`: a string specifying the evaluation metric. By default: `l2` for regression problems, `binary_logloss` for binary classification problems, and `ndcg` for lambdarank problems. Multiple metrics are separated by commas. Can be:
    • `l1`, `mean_absolute_error`, `mae`, or `regression_l1`: absolute loss.
    • `l2`, `mean_squared_error`, `mse`, `regression_l2`, or `regression`: squared loss.
    • `l2_root`, `root_mean_squared_error`, or `rmse`: root of the squared loss.
    • `quantile`: the loss in quantile regression.
    • `mape` or `mean_absolute_percentage_error`: MAPE loss.
    • `huber`: Huber loss.
    • `fair`: Fair loss.
    • `poisson`: the negative log-likelihood of Poisson regression.
    • `gamma`: the negative log-likelihood of gamma regression.
    • `gamma_deviance`: the residual deviance of gamma regression.
    • `tweedie`: the negative log-likelihood of Tweedie regression.
    • `ndcg`: NDCG.
    • `map` or `mean_average_precision`: mean average precision.
    • `auc`: AUC.
    • `binary_logloss` or `binary`: the log loss in binary classification.
    • `binary_error`: the classification error rate in binary classification.
    • `multi_logloss`, `multiclass`, `softmax`, `multiclassova`, `multiclass_ova`, `ova`, or `ovr`: the log loss in multiclass classification.
    • `multi_error`: the classification error rate in multiclass classification.
    • `xentropy` or `cross_entropy`: cross entropy.
    • `xentlambda` or `cross_entropy_lambda`: intensity-weighted cross entropy.
    • `kldiv` or `kullback_leibler`: KL divergence.
  • `metric_freq` or `output_freq`: an integer, the frequency (in iterations) at which metric results are output. The default is 1.
  • `train_metric`, `training_metric`, or `is_training_metric`: a Boolean, False by default. If True, metrics are also output on the training set during training.
  • `ndcg_at`, `ndcg_eval_at`, or `eval_at`: a list of integers specifying the positions at which NDCG is evaluated. The defaults are 1, 2, 3, 4, 5.
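
A minimal sketch of tracking several metrics at once on a validation set (synthetic data; all values illustrative):

import numpy as np
import lightgbm as lgb

# Synthetic binary-classification data, for illustration only
X = np.random.rand(400, 8)
y = (X[:, 0] > 0.5).astype(int)

lgb_train = lgb.Dataset(X[:300], y[:300])
lgb_valid = lgb.Dataset(X[300:], y[300:], reference=lgb_train)

params = {
    'objective': 'binary',
    'metric': ['binary_logloss', 'auc'],  # report several metrics at once
    'metric_freq': 1,                     # output metrics every iteration
}
gbm = lgb.train(params, lgb_train, num_boost_round=30, valid_sets=lgb_valid)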

2.2 Parameter Impact and Parameter Adjustment Suggestions

The following summarizes how the core parameters affect the model, along with the corresponding tuning suggestions.

(1) Control tree growth

  • `num_leaves`: the number of leaf nodes. It is the main parameter controlling the complexity of the tree model.
    • For a level-wise tree, the number of leaves is 2^depth, where depth is the tree depth. However, with the same number of leaves, a leaf-wise tree grows much deeper than a level-wise tree, which very easily leads to overfitting. Therefore, `num_leaves` should be set smaller than 2^depth. In a leaf-wise tree, there is no exact concept of depth, since there is no proper mapping from the number of leaves to depth.
  • `min_data_in_leaf`: the minimum number of samples per leaf node.
    • It is an important parameter for handling overfitting in leaf-wise trees. Setting it to a large value avoids growing overly deep trees, but setting it too large can lead to underfitting.
  • `max_depth`: the maximum depth of a tree. This parameter explicitly limits tree depth.
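
Following the guideline above, a num_leaves ceiling can be derived from a target depth. A small sketch (the helper function and its shrink factor are our own illustration, not part of LightGBM):

# Hypothetical helper: keep num_leaves safely below 2**max_depth
def suggest_num_leaves(max_depth, shrink=0.75):
    return max(2, int((2 ** max_depth) * shrink))

params = {
    'max_depth': 7,
    'num_leaves': suggest_num_leaves(7),  # 96, below 2**7 = 128
    'min_data_in_leaf': 20,
}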

(2) Faster training speed

  • Use bagging by setting `bagging_fraction` and `bagging_freq`.
  • Use feature subsampling by setting `feature_fraction`.
  • Use a smaller `max_bin`.
  • Use `save_binary` to speed up data loading in future training runs.

(3) Better model accuracy

  • Use a larger `max_bin` (training may slow down).
  • Use a smaller `learning_rate` and a larger `num_iterations`.
  • Use a larger `num_leaves` (may lead to overfitting).
  • Use more training data.
  • Try `dart`.

(4) Alleviating overfitting

  • Use a smaller `max_bin`.
  • Use a smaller `num_leaves`.
  • Use `min_data_in_leaf` and `min_sum_hessian_in_leaf`.
  • Use bagging by setting `bagging_fraction` and `bagging_freq`.
  • Use feature subsampling by setting `feature_fraction`.
  • Use more training data.
  • Use `lambda_l1`, `lambda_l2`, and `min_gain_to_split` for regularization.
  • Try `max_depth` to avoid growing overly deep trees.
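
Putting several of these levers together, here is an illustrative anti-overfitting configuration (the values are hypothetical starting points, not tuned recommendations):

params = {
    'objective': 'binary',
    'max_bin': 127,                    # smaller than the default 255
    'num_leaves': 15,                  # smaller than the default 31
    'min_data_in_leaf': 50,
    'min_sum_hessian_in_leaf': 1e-2,
    'bagging_fraction': 0.8,           # bagging...
    'bagging_freq': 5,                 # ...every 5 iterations
    'feature_fraction': 0.8,           # feature subsampling
    'lambda_l1': 0.1,
    'lambda_l2': 0.1,
    'min_gain_to_split': 0.01,
    'max_depth': 6,
}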

3. Built-in modeling method of LightGBM

3.1 Built-in modeling method

LightGBM's built-in modeling approach uses the following data format and core training interface:

  • Data built with the `lightgbm.Dataset` format.
  • Training through the `lightgbm.train` interface.

Below is a simple official example that loads the data, constructs it into the Dataset format, and specifies parameters for modeling.

# coding: utf-8
import json
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import mean_squared_error


# Load the dataset
print('Load data... ')
df_train = pd.read_csv('./data/regression.train.txt', header=None, sep='\t')
df_test = pd.read_csv('./data/regression.test.txt', header=None, sep='\t')

# Set training set and test set
y_train = df_train[0].values
y_test = df_test[0].values
X_train = df_train.drop(0, axis=1).values
X_test = df_test.drop(0, axis=1).values

# Construct the LightGBM Dataset format
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# Set the parameters
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'auc'},
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

print('Start training... ')
# training
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=20,
                valid_sets=lgb_eval,
                early_stopping_rounds=5)

# Save the model
print('Save the model... ')
# Save the model to a file
gbm.save_model('model.txt')

print('Begin to predict... ')
# prediction
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
# assessment
print('The estimated RMSE is :')
print(mean_squared_error(y_test, y_pred) ** 0.5)

Load data...
Start training...
[1]	valid_0's l2: 0.24288	valid_0's auc: 0.764496
Training until validation scores don't improve for 5 rounds.
...
[9]	valid_0's l2: 0.215351	valid_0's auc: 0.809041
Early stopping, best iteration is:
[9]	valid_0's l2: 0.215351	valid_0's auc: 0.809041
Save the model...
Begin to predict...
The estimated RMSE is:
0.4640593794679212

3.2 Setting sample weight

LightGBM modeling is very flexible: it allows us to assign a different weight to each training sample. The setup is simple; we just provide the model with an array of weights whose length equals the number of samples.

Below is a typical example. The binary.train and binary.test files are read and loaded as lightgbm.Dataset inputs, and the sample weights are set through the lightgbm.Dataset constructor parameters (here as numpy arrays). Training then proceeds through the built-in lightgbm.train interface.

# coding: utf-8
import json
import lightgbm as lgb
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings("ignore")

# Load the dataset
print('Load data... ')
df_train = pd.read_csv('./data/binary.train', header=None, sep='\t')
df_test = pd.read_csv('./data/binary.test', header=None, sep='\t')
W_train = pd.read_csv('./data/binary.train.weight', header=None)[0]
W_test = pd.read_csv('./data/binary.test.weight', header=None)[0]

y_train = df_train[0].values
y_test = df_test[0].values
X_train = df_train.drop(0, axis=1).values
X_test = df_test.drop(0, axis=1).values

num_train, num_feature = X_train.shape

# Load the weights together with the data
lgb_train = lgb.Dataset(X_train, y_train,
                        weight=W_train, free_raw_data=False)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train,
                       weight=W_test, free_raw_data=False)

# Set the parameters
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

# Generate feature names
feature_name = ['feature_' + str(col) for col in range(num_feature)]

print('Start training... ')
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                valid_sets=lgb_train,  # Evaluate the training set
                feature_name=feature_name,
                categorical_feature=[21])
Loading data...
Start training...
[1]	training's binary_logloss: 0.68205
[2]	training's binary_logloss: 0.673618
[3]	training's binary_logloss: 0.665891
[4]	training's binary_logloss: 0.656874
[5]	training's binary_logloss: 0.648523
[6]	training's binary_logloss: 0.641874
[7]	training's binary_logloss: 0.636029
[8]	training's binary_logloss: 0.629427
...
[10]	training's binary_logloss: 0.617593

3.3 Model storage and loading

The model object produced by the modeling process above can be saved with the save_model member function. The saved model can then be loaded back into memory via lightgbm.Booster and used to predict on the test set.

The specific code is as follows:

# check the feature name
print('Complete 10 rounds of training... ')
print('The seventh feature is :')
print(repr(lgb_train.feature_name[6]))

# Storage model
gbm.save_model('./model/lgb_model.txt')

# Feature name
print('Feature Name :')
print(gbm.feature_name())

# Feature importance
print('Feature importance :')
print(list(gbm.feature_importance()))

# Load model
print('Load model for prediction')
bst = lgb.Booster(model_file='./model/lgb_model.txt')

# prediction
y_pred = bst.predict(X_test)

# Evaluate the effect in the test set
print('RMSE on the test set is:')
print(mean_squared_error(y_test, y_pred) ** 0.5)

Complete 10 rounds of training...
The 7th feature is:
'feature_6'
Feature names:
['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20', 'feature_21', 'feature_22', 'feature_23', 'feature_24', 'feature_25', 'feature_26', 'feature_27']
Feature importance:
[8, 5, 1, 19, 7, 33, 2, 0, 2, 10, 5, 2, 0, 9, 3, 3, 0, 2, 2, 5, 1, 0, 36, 3, 33, 45, 29, 35]
Load model for prediction
RMSE on the test set is:
0.4629245607636925

3.4 Continue training

LightGBM is a boosting model: each training round adds new base learners to the ensemble. LightGBM also supports continuing training from an existing model and its parameters, without having to restart training from scratch each time.

Below is a typical example. We load the LightGBM model trained for 10 rounds above (i.e., an ensemble of 10 trees) and continue training on top of it, with some parameter-level changes: the learning rate is adjusted, and techniques such as bagging are added to alleviate overfitting.

# Continue training
# Load the model from ./model/lgb_model.txt for initialization
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model='./model/lgb_model.txt',
                valid_sets=lgb_eval)

print('Initialize the old model and complete the 10-20 training rounds... ')

# Adjust hyperparameters during training
# For example, the learning rate is adjusted here
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,
                learning_rates=lambda iter: 0.05 * (0.99 ** iter),
                valid_sets=lgb_eval)

print('Gradually adjust the learning rate to complete round 20-30... ')

# Adjust other hyperparameters
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,
                valid_sets=lgb_eval,
                callbacks=[lgb.reset_parameter(bagging_fraction=[0.7] * 5 + [0.6] * 5)])

print('Gradually adjust the Bagging ratio to complete round 30 to 40... ')

[11]	valid_0's binary_logloss: 0.616177
[12]	valid_0's binary_logloss: 0.611792
[13]	valid_0's binary_logloss: 0.607043
[14]	valid_0's binary_logloss: 0.602314
[15]	valid_0's binary_logloss: 0.598433
[16]	valid_0's binary_logloss: 0.595238
[17]	valid_0's binary_logloss: 0.592047
[18]	valid_0's binary_logloss: 0.588673
[19]	valid_0's binary_logloss: 0.586084
[20]	valid_0's binary_logloss: 0.584033
Initialize with the old model and complete training rounds 10-20...
[21]	valid_0's binary_logloss: 0.616177
[22]	valid_0's binary_logloss: 0.611834
[23]	valid_0's binary_logloss: 0.607177
[24]	valid_0's binary_logloss: 0.602577
[25]	valid_0's binary_logloss: 0.59831
[26]	valid_0's binary_logloss: 0.595259
[27]	valid_0's binary_logloss: 0.592201
[28]	valid_0's binary_logloss: 0.589017
[29]	valid_0's binary_logloss: 0.586597
...
Gradually adjust the learning rate to complete training rounds 20-30...
[31]	valid_0's binary_logloss: 0.616053
[32]	valid_0's binary_logloss: 0.612291
[33]	valid_0's binary_logloss: 0.60856
[34]	valid_0's binary_logloss: 0.605387
[35]	valid_0's binary_logloss: 0.601744
[36]	valid_0's binary_logloss: 0.598556
[37]	valid_0's binary_logloss: 0.595585
[38]	valid_0's binary_logloss: 0.593228
[39]	valid_0's binary_logloss: 0.59018
[40]	valid_0's binary_logloss: 0.588391
Gradually adjust the bagging fraction to complete training rounds 30-40...

3.5 User-defined loss function

LightGBM supports custom loss functions and custom evaluation metrics during training. A custom loss function must return the first- and second-order derivatives of the loss (gradient and Hessian), while a custom evaluation metric computes a score from the data's labels and predictions. The loss function drives tree-structure learning during training, whereas the evaluation metric is typically used to assess performance on the validation set.

# A custom loss function must provide the first- and second-order derivatives of the loss
def loglikelihood(preds, train_data):
    labels = train_data.get_label()
    preds = 1. / (1. + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1. - preds)
    return grad, hess


# Custom evaluation function
def binary_error(preds, train_data):
    labels = train_data.get_label()
    return 'error', np.mean(labels != (preds > 0.5)), False


gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,
                fobj=loglikelihood,
                feval=binary_error,
                valid_sets=lgb_eval)

print('Complete round 40-50 with custom loss function and assessment criteria... ')

[41]	valid_0's binary_logloss: 0.614429	valid_0's error: 0.268
[42]	valid_0's binary_logloss: 0.610689	valid_0's error: 0.26
[43]	valid_0's binary_logloss: 0.606267	valid_0's error: 0.264
[44]	valid_0's binary_logloss: 0.601949	valid_0's error: 0.258
[45]	valid_0's binary_logloss: 0.597271	valid_0's error: 0.266
[46]	valid_0's binary_logloss: 0.593971	valid_0's error: 0.276
[47]	valid_0's binary_logloss: 0.591427	valid_0's error: 0.278
[48]	valid_0's binary_logloss: 0.588301	valid_0's error: 0.284
[49]	valid_0's binary_logloss: 0.586562	valid_0's error: 0.288
...
Complete rounds 40-50 with the custom loss function and evaluation metric...

4. LightGBM estimator interfaces

4.1 SKLearn-style estimator interface

Like XGBoost, LightGBM supports modeling through SKLearn's unified estimator interface. Below is a typical reference example: the training and test sets are read in DataFrame format, an LGBMRegressor is initialized directly, and fit is called to train the model. Its usage and interface are the same as those of other SKLearn estimators.

# coding: utf-8
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

# load data
print('Load data... ')
df_train = pd.read_csv('./data/regression.train.txt', header=None, sep='\t')
df_test = pd.read_csv('./data/regression.test.txt', header=None, sep='\t')

# Extract features and labels
y_train = df_train[0].values
y_test = df_test[0].values
X_train = df_train.drop(0, axis=1).values
X_test = df_test.drop(0, axis=1).values

print('Start training... ')
# Initialize LGBMRegressor
gbm = lgb.LGBMRegressor(objective='regression',
                        num_leaves=31,
                        learning_rate=0.05,
                        n_estimators=20)

# Use fit function to fit
gbm.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric='l1',
        early_stopping_rounds=5)

# prediction
print('Begin to predict... ')
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_)
# Evaluate the predicted results
print('The RMSE of the predicted result is :')
print(mean_squared_error(y_test, y_pred) ** 0.5)

Loading data...
Start training...
Training until validation scores don't improve for 5 rounds.
[3]	valid_0's l1: 0.481489
[4]	valid_0's l1: 0.476848
[5]	valid_0's l1: 0.47305
...
[20]	valid_0's l1: 0.428946
Did not meet early stopping. Best iteration is:
[20]	valid_0's l1: 0.428946
The RMSE of the predicted result is:
0.4441153344254208

4.2 Grid search parameter tuning

As mentioned above, LightGBM's estimator interface works the same way as the other estimators in SKLearn, so we can also tune the model with SKLearn's hyperparameter tuning methods.

Below is a typical code example of hyperparameter tuning with grid search and cross-validation. We provide a dictionary of candidate parameter lists, and GridSearchCV runs cross-validated experiments to select the best LightGBM hyperparameters from the candidates.

# Select the optimal hyperparameters with scikit-learn grid search cross-validation
estimator = lgb.LGBMRegressor(num_leaves=31)

param_grid = {
    'learning_rate': [0.01, 0.1, 1],
    'n_estimators': [20, 40]
}

gbm = GridSearchCV(estimator, param_grid)

gbm.fit(X_train, y_train)

print('The optimal hyperparameter found by grid search is :')
print(gbm.best_params_)

The optimal hyperparameters found by grid search are:
{'learning_rate': 0.1, 'n_estimators': 40}

4.3 Visualization and interpretation

LightGBM supports visualizing and interpreting the model training process, including plotting the loss and evaluation-metric values during training, ranking and plotting feature importance after training, and visualizing the base learners (such as decision trees).

The reference code is as follows:

# coding: utf-8
import lightgbm as lgb
import pandas as pd

try:
    import matplotlib.pyplot as plt
except ImportError:
    raise ImportError('You need to install matplotlib for plotting.')

# Load the dataset
print('Load data... ')
df_train = pd.read_csv('./data/regression.train.txt', header=None, sep='\t')
df_test = pd.read_csv('./data/regression.test.txt', header=None, sep='\t')

# Extract features and labels
y_train = df_train[0].values
y_test = df_test[0].values
X_train = df_train.drop(0, axis=1).values
X_test = df_test.drop(0, axis=1).values

# Construct the LightGBM Dataset format
lgb_train = lgb.Dataset(X_train, y_train)
lgb_test = lgb.Dataset(X_test, y_test, reference=lgb_train)

# set parameters
params = {
    'num_leaves': 5,
    'metric': ('l1', 'l2'),
    'verbose': 0
}

evals_result = {}  # to record eval results for plotting

print('Start training... ')
# training
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=100,
                valid_sets=[lgb_train, lgb_test],
                feature_name=['f' + str(i + 1) for i in range(28)],
                categorical_feature=[21],
                evals_result=evals_result,
                verbose_eval=10)

print('Drawing during training... ')
ax = lgb.plot_metric(evals_result, metric='l1')
plt.show()

print('Draw the importance of features... ')
ax = lgb.plot_importance(gbm, max_num_features=10)
plt.show()

print('Draw the 84th tree... ')
ax = lgb.plot_tree(gbm, tree_index=83, figsize=(20, 8), show_info=['split_gain'])
plt.show()

#print(' Draw the 84th tree with Graphviz... ')
#graph = lgb.create_tree_digraph(gbm, tree_index=83, name='Tree84')
#graph.render(view=True)

Load data...
Start training...
[10]	training's l1: 0.457448	valid_1's l2: 0.21641	valid_1's l1: 0.456464
[20]	training's l1: 0.436869	valid_1's l2: 0.201616	valid_1's l1: 0.434057
[30]	training's l1: 0.421302	valid_1's l2: 0.192514	valid_1's l1: 0.417019
[40]	training's l1: 0.411107	valid_1's l2: 0.187258	valid_1's l1: 0.406303
[50]	training's l1: 0.403695	valid_1's l2: 0.183688	valid_1's l1: 0.398997
[60]	training's l1: 0.398704	valid_1's l2: 0.181009	valid_1's l1: 0.393977
[70]	training's l1: 0.394876	valid_1's l2: 0.178803	valid_1's l1: 0.389805
[80]	training's l1: 0.391147	valid_1's l2: 0.176799	valid_1's l1: 0.386476
[90]	training's l1: 0.388101	valid_1's l2: 0.175775	valid_1's l1: 0.384404
[100]	training's l1: 0.385174	valid_1's l2: 0.175321	valid_1's l1: 0.382929

References

  • Illustrated Machine Learning Algorithms | From Beginner to Master series of tutorials
  • Illustrated Machine Learning | LightGBM Model Explained

ShowMeAI recommended series of tutorials

  • Illustrated Python programming: From beginner to Master series of tutorials
  • Illustrated Data Analysis: From beginner to master series of tutorials
  • The mathematical Basics of AI: From beginner to Master series of tutorials
  • Illustrated Big Data Technology: From beginner to master
  • Illustrated Machine learning algorithms: Beginner to Master series of tutorials
  • Machine learning: Teach you how to play machine learning series

Related articles recommended

  • Application practice of Python machine learning algorithm
  • SKLearn introduction and simple application cases
  • SKLearn most complete application guide
  • XGBoost modeling applications in detail
  • LightGBM modeling applications in detail
  • Python Machine Learning Integrated Project – E-commerce sales estimates
  • Python Machine Learning Integrated Project — E-commerce Sales Estimation
  • Machine learning feature engineering most complete interpretation
  • Application of Featuretools
  • AutoML Automatic machine learning modeling