
This article introduces the basic concepts and evaluation metrics of multi-label classification and summarizes several ways to improve the performance of multi-label models: modeling techniques, supervised feature selection, unsupervised feature selection, and upsampling.

What is multi-label classification?

Binary classification assigns a given input to one of two categories, 1 or 0. Multi-label (or multi-target) classification predicts several binary targets simultaneously from the same input. For example, a model can predict both whether a picture shows a dog or a cat and whether the animal's coat is long or short.

Targets are not mutually exclusive in multi-label classification, which means that an input can belong to more than one class at the same time.

The rest of this article summarizes some common methods for improving the performance of multi-label classification models.

Evaluation metrics

Most metrics used for binary classification can be applied to the multi-label case by computing the metric for each label column and then averaging the scores. One such metric is logarithmic loss, also called binary cross-entropy. When the classes are imbalanced, ROC AUC is a more informative measure.
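As a minimal sketch of the column-wise averaging described above, the snippet below computes log loss and ROC AUC per label and then averages them. The toy arrays `y_true` and `y_prob` are made up for illustration and do not come from any real model.

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

# Toy data (illustrative only): 6 samples, 3 binary labels.
y_true = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
])
y_prob = np.array([
    [0.9, 0.2, 0.7],
    [0.1, 0.8, 0.3],
    [0.8, 0.6, 0.2],
    [0.3, 0.1, 0.9],
    [0.7, 0.4, 0.4],
    [0.2, 0.9, 0.6],
])
n_labels = y_true.shape[1]

# Log loss (binary cross-entropy) for each label column, then the mean.
per_label_loss = [
    log_loss(y_true[:, j], y_prob[:, j], labels=[0, 1]) for j in range(n_labels)
]
mean_log_loss = float(np.mean(per_label_loss))

# ROC AUC for each label column, then the mean.
per_label_auc = [roc_auc_score(y_true[:, j], y_prob[:, j]) for j in range(n_labels)]
mean_auc = float(np.mean(per_label_auc))

# scikit-learn can compute the same macro average directly
# when it is given a label indicator matrix.
macro_auc = roc_auc_score(y_true, y_prob, average="macro")
```

Note that ROC AUC is undefined for a label column that contains only one class, so rare labels may need to be handled separately when validation folds are small.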

ROC AUC curve

Modeling techniques

Before we get into the fancy tricks with features, let’s share some tips for designing models that are suitable for multi-label classification cases.

For most non-neural-network models, the only option is to train one classifier per target and then combine the predictions. The scikit-learn library provides a simple wrapper class for this, OneVsRestClassifier.

Although this enables the classifier to perform multi-label tasks, it is not the ideal approach, and it has several drawbacks. First, training takes longer because a new model is trained for each target. Second, the models cannot learn the relationships or correlations between labels, since each one sees only its own target.

The second problem can be addressed with two-stage training, in which the first-stage target predictions are combined with the original features as the input for a second stage. The downside is that training time increases substantially, because you now train twice as many models as before.
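A short sketch of the wrapper in use is shown below. The synthetic dataset from `make_multilabel_classification` and the choice of `LogisticRegression` as the base model are assumptions made for illustration; any scikit-learn classifier could take its place.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data: Y is a (n_samples, n_labels) 0/1 indicator matrix.
X, Y = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=4, random_state=0
)

# One independent logistic regression is fitted per label column.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)

Y_pred = clf.predict(X)        # 0/1 prediction per label
Y_prob = clf.predict_proba(X)  # one probability per label
```

After fitting, `clf.estimators_` holds one fitted model per label, which makes the training cost of this approach easy to see.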
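The two-stage idea could be sketched as follows. The helper `make_model` and the variable names are hypothetical. Using out-of-fold predictions for the first stage is an added precaution that the text above does not prescribe: it keeps the second stage from training on predictions the first stage made for rows it had already seen.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.multiclass import OneVsRestClassifier

X, Y = make_multilabel_classification(
    n_samples=300, n_features=10, n_classes=4, random_state=0
)

def make_model():
    # Hypothetical helper: one logistic regression per label.
    return OneVsRestClassifier(LogisticRegression(max_iter=1000))

# Stage 1: out-of-fold label probabilities for every training row.
oof_prob = np.zeros(Y.shape, dtype=float)
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    fold_model = make_model().fit(X[train_idx], Y[train_idx])
    oof_prob[val_idx] = fold_model.predict_proba(X[val_idx])

# Stage-1 model refitted on all rows, used only at prediction time.
stage1 = make_model().fit(X, Y)

# Stage 2: original features plus the stage-1 probabilities for all labels,
# so each per-label model can use information about the other labels.
X_stage2 = np.hstack([X, oof_prob])
stage2 = make_model().fit(X_stage2, Y)

# Prediction for new rows chains the two stages.
X_new = X[:5]
X_new_stage2 = np.hstack([X_new, stage1.predict_proba(X_new)])
Y_new_pred = stage2.predict(X_new_stage2)
```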

Neural networks are better suited to this situation. The number of labels is the number of output neurons in the network. We can apply any binary classification loss to each output, and the model outputs all targets simultaneously. This solves both problems of the non-neural-network approach: we only need to train one model, and the network can learn correlations between labels through the hidden layers shared by the output neurons.
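As a small illustration, scikit-learn's `MLPClassifier` accepts a label indicator matrix directly and then uses one logistic output unit per label. It stands in here for whichever deep learning framework is actually used, and the layer size and iteration count are arbitrary choices for the sketch.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.neural_network import MLPClassifier

X, Y = make_multilabel_classification(
    n_samples=300, n_features=10, n_classes=4, random_state=0
)

# A single network with one hidden layer shared by all labels.
net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
net.fit(X, Y)

Y_prob = net.predict_proba(X)  # one probability per label, all from one model
Y_pred = net.predict(X)        # 0/1 indicator matrix
```

Because the hidden layer is shared, every output unit is computed from the same learned representation, which is how label correlations can influence the predictions.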

Supervised feature selection methods

Before starting any feature engineering or selection, features should be normalized or standardized. A quantile transformation (QuantileTransformer in scikit-learn) reduces the skewness of the data and, when configured with a normal output distribution, maps each feature to an approximately normal distribution. Another option is to standardize the features by subtracting the mean and dividing by the standard deviation. Both aim to make the data better behaved for the model, but standardization only shifts and rescales each feature without changing the shape of its distribution, whereas QuantileTransformer reshapes it and has a higher computational cost.

Using supervised feature selection in this context is a bit tricky because most algorithms are designed for a single target. To solve this problem, we can convert the multi-label case into a multi-class problem. One popular approach is label powerset (LabelPowerset), where each unique label combination in the training data is converted into a class. The scikit-multilearn library provides tools for this.
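The two options can be compared on a skewed toy feature matrix, as in the sketch below. The log-normal data and the `n_quantiles` value are assumptions chosen only to make the effect visible.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer, StandardScaler

# Right-skewed toy features (illustrative only).
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 3))

# Quantile transform: maps each feature to an approximately normal shape.
qt = QuantileTransformer(n_quantiles=100, output_distribution="normal", random_state=0)
X_qt = qt.fit_transform(X)

# Standardization: subtract the mean, divide by the standard deviation.
# This changes location and scale only; the skew of each feature remains.
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
```

In a real pipeline both transformers would be fitted on the training split only and then applied to validation and test data with `transform`.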

Tool links:

Scikit. Ml/API/skmulti…

After the transformation, we can use information gain or the chi-squared test (chi2) to select features. While this approach works, it becomes difficult when there are hundreds or even thousands of unique label combinations, and that is where an unsupervised feature selection approach might be better.

Unsupervised feature selection methods

Unsupervised methods do not rely on labels, so we do not need to account for the multi-label nature of the problem.
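The sketch below shows the idea end to end. To stay dependency-free it builds the label powerset by hand with NumPy rather than calling scikit-multilearn, and it uses scikit-learn's `chi2` and `mutual_info_classif` (mutual information, used here as the information-gain score). The synthetic data and `k=10` are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

# Synthetic data; the features are non-negative counts, which chi2 requires.
X, Y = make_multilabel_classification(
    n_samples=300, n_features=20, n_classes=4, random_state=0
)

# Label powerset by hand: every distinct row of Y becomes one class id.
combos, y_powerset = np.unique(Y, axis=0, return_inverse=True)
y_powerset = y_powerset.ravel()

# Chi-squared selection against the single multi-class target.
selector = SelectKBest(chi2, k=10)
X_selected = selector.fit_transform(X, y_powerset)
kept_columns = selector.get_support(indices=True)

# Mutual information scores for the same target, one per feature.
mi_scores = mutual_info_classif(
    X, y_powerset, discrete_features=True, random_state=0
)
```

The number of rows in `combos` is the number of classes the selector has to deal with, which is the quantity that grows quickly as the number of labels increases.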

Here are some algorithms:

  • Principal component analysis (PCA) or other similar factor analysis methods. This removes redundant information from the features and extracts a compact representation for the model. An important caveat is that the data should be standardized before applying PCA, so that each feature contributes equally to the analysis. Another trick with PCA is to concatenate the reduced features back onto the original data as additional information that the model can choose to use, rather than replacing the original features with the reduced ones.

  • Variance threshold. This is a simple and effective way to reduce the dimensionality of the features. We discard features whose variance, or spread, is low. The selection threshold can be tuned; 0.5 is a common starting value.

  • Clustering. We can create a new feature by clustering the input data and then assigning each row its cluster as a new feature column.

KMeans Clustering
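The three techniques above can be combined as in the following sketch. The random toy data, the number of components, and the number of clusters are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

# Toy features (illustrative only); the last column is almost constant.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
X[:, 7] = 1.0 + 0.01 * rng.normal(size=200)

# Variance threshold on the raw features. It is applied before scaling,
# because standardized features all have unit variance.
vt = VarianceThreshold(threshold=0.5)
X_vt = vt.fit_transform(X)

# Standardize, then reduce with PCA.
X_scaled = StandardScaler().fit_transform(X_vt)
pca = PCA(n_components=3, random_state=0)
X_pca = pca.fit_transform(X_scaled)

# Cluster the rows and keep each row's cluster id as a new column.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
cluster_id = kmeans.fit_predict(X_scaled)

# Concatenate instead of replace: kept features + PCA components + cluster id.
X_augmented = np.hstack([X_vt, X_pca, cluster_id.reshape(-1, 1)])
```

The cluster id is a categorical value, so models that treat inputs as ordered numbers may work better with a one-hot encoding of it.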

Upsampling method

Upsampling is used when the classification data is highly imbalanced: we generate artificial samples for rare classes so that the model pays more attention to them. To create new samples in a multi-label setting, we can use MLSMOTE, the Multi-Label Synthetic Minority Over-sampling Technique.

MLSMOTE Project address:…

MLSMOTE is a variant of the original SMOTE method. After generating the features of a synthetic sample for a minority class and assigning it the corresponding minority label, we also generate its other labels by counting how often each label occurs among the neighboring data points and keeping the labels that appear in more than half of them.
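The procedure can be illustrated with the simplified sketch below. It is not the code from the project linked above: the function `mlsmote_sketch`, its parameters, and the toy data are hypothetical, it oversamples a single chosen minority label, and it assumes there are no duplicate rows among the minority samples.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mlsmote_sketch(X, Y, minority_label, n_new, k=5, seed=0):
    """Create n_new synthetic rows from samples that carry minority_label."""
    rng = np.random.default_rng(seed)
    idx = np.flatnonzero(Y[:, minority_label] == 1)
    if len(idx) < 2:
        raise ValueError("need at least two minority samples")
    X_min, Y_min = X[idx], Y[idx]
    k = min(k, len(idx) - 1)

    # Nearest neighbours among the minority samples only.
    # Column 0 of `neigh` is the sample itself (assuming no duplicate rows).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neigh = nn.kneighbors(X_min)

    X_new = np.empty((n_new, X.shape[1]))
    Y_new = np.empty((n_new, Y.shape[1]), dtype=Y.dtype)
    for i in range(n_new):
        seed_row = rng.integers(len(idx))
        neighbours = neigh[seed_row, 1:]
        ref = rng.choice(neighbours)

        # Features: random point on the segment between seed and neighbour.
        gap = rng.random()
        X_new[i] = X_min[seed_row] + gap * (X_min[ref] - X_min[seed_row])

        # Labels: keep those present in more than half of seed + neighbours.
        votes = Y_min[np.append(neighbours, seed_row)].sum(axis=0)
        Y_new[i] = (votes > (k + 1) / 2).astype(Y.dtype)
    return X_new, Y_new

# Toy data (illustrative only): label 2 is rare, with 8 positives out of 100.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
Y = (rng.random((100, 4)) < 0.5).astype(int)
Y[:, 2] = 0
Y[:8, 2] = 1

X_syn, Y_syn = mlsmote_sketch(X, Y, minority_label=2, n_new=20, k=5)
X_balanced = np.vstack([X, X_syn])
Y_balanced = np.vstack([Y, Y_syn])
```

Every synthetic row keeps the minority label, because all of its neighbours carry it, while the remaining labels follow the majority vote.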

