Preface

This article introduces the basic concepts and evaluation metrics of multi-label classification, and summarizes several methods for improving the performance of multi-label classification models: modeling techniques, supervised feature selection, unsupervised feature selection and upsampling.

This article is from the technical summary series of the public account CV technical Guide.

Follow the public account CV technical Guide, which focuses on computer vision technical summaries, tracking of the latest technology, and interpretations of classic papers.

What is multi-label classification?

As we all know, binary classification assigns a given input to one of two categories, 1 or 0. Multi-label (or multi-target) classification predicts multiple binary targets simultaneously from a given input. For example, our model can predict whether a given picture shows a dog or a cat, and also whether its coat is long or short.

Targets are not mutually exclusive in multi-label classification, which means that an input can belong to more than one class at the same time.

This article will summarize some common methods to improve the performance of multi-label classification models.

Evaluation metrics

Most of the metrics used for binary classification can be applied to multi-label problems by computing the metric for each label column and then averaging the scores. One metric we can use is logarithmic loss, also known as binary cross entropy. For a metric that is less sensitive to class imbalance, we can use ROC-AUC.

Figure: ROC AUC curve
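As a minimal sketch of the per-column averaging described above, the following computes binary cross entropy for each label column and a macro-averaged ROC-AUC with scikit-learn. The arrays `y_true` and `y_prob` are made-up toy values used only for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

# Toy ground truth: 6 samples, 3 binary targets (one column per label).
y_true = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
])
# Hypothetical predicted probabilities for the same samples and labels.
y_prob = np.array([
    [0.9, 0.2, 0.7],
    [0.1, 0.8, 0.3],
    [0.8, 0.6, 0.2],
    [0.3, 0.1, 0.9],
    [0.7, 0.4, 0.4],
    [0.2, 0.9, 0.6],
])

# Binary cross entropy computed column by column, then averaged.
per_label_loss = [
    log_loss(y_true[:, j], y_prob[:, j], labels=[0, 1])
    for j in range(y_true.shape[1])
]
mean_log_loss = float(np.mean(per_label_loss))

# ROC-AUC computed per label and macro-averaged over the label columns.
macro_auc = roc_auc_score(y_true, y_prob, average="macro")

print(f"mean log loss: {mean_log_loss:.4f}, macro ROC-AUC: {macro_auc:.4f}")
```

In this toy data every positive is scored above every negative within each column, so the macro ROC-AUC is perfect even though the log loss is not zero; the two metrics measure different things.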

Modeling techniques

Before we get into the fancy tricks with features, let’s share some tips for designing models that are suitable for multi-label classification cases.

For most non-neural-network models, the usual option is to train one classifier per target and then combine the predictions. The scikit-learn library provides a simple wrapper class for this, OneVsRestClassifier.
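A minimal sketch of the wrapper in use, assuming a synthetic dataset from `make_multilabel_classification` and logistic regression as the base classifier (both are illustrative choices, not taken from the original article):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic data: Y is a binary indicator matrix with one column per label.
X, Y = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=3, random_state=0
)

# One independent logistic regression is fitted for each label column.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, Y)

Y_pred = ovr.predict(X)        # hard 0/1 decisions, same shape as Y
Y_prob = ovr.predict_proba(X)  # one probability per sample and label

print(len(ovr.estimators_), Y_pred.shape, Y_prob.shape)
```

The fitted wrapper holds one estimator per label, which makes the cost of this approach visible: the number of models grows with the number of targets.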

Although this wrapper lets the classifier handle multi-label tasks, it is not the ideal approach and has several drawbacks. First, training takes longer, because we train a new model for each target. Second, the models cannot learn the relationships or correlations between different labels.

The second problem can be addressed with two-stage training, in which the first-stage target predictions are combined with the original features as the input data for the second stage. The downside is that training time increases dramatically, because you now have to train twice as many models as before.
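The two-stage idea can be sketched as follows. This is one possible reading, not the author's implementation: the use of out-of-fold predictions (via `KFold`) to avoid leaking labels into the second stage is an added assumption, and `make_model` and `predict_two_stage` are hypothetical helper names.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.multiclass import OneVsRestClassifier

X, Y = make_multilabel_classification(
    n_samples=300, n_features=10, n_classes=3, random_state=0
)


def make_model():
    # Hypothetical helper: one logistic regression per label.
    return OneVsRestClassifier(LogisticRegression(max_iter=1000))


# Stage 1: out-of-fold probabilities, so no row is scored by a model
# that saw that row's labels during training.
stage1_oof = np.zeros(Y.shape, dtype=float)
splitter = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in splitter.split(X):
    fold_model = make_model().fit(X[train_idx], Y[train_idx])
    stage1_oof[valid_idx] = fold_model.predict_proba(X[valid_idx])

# Stage 2: original features plus the stage-1 predictions for all labels,
# which lets each second-stage classifier see the other labels' scores.
X_stage2 = np.hstack([X, stage1_oof])
stage2 = make_model().fit(X_stage2, Y)

# For new data, a stage-1 model fitted on the full training set is used.
stage1 = make_model().fit(X, Y)


def predict_two_stage(X_new):
    first_pass = stage1.predict_proba(X_new)
    return stage2.predict_proba(np.hstack([X_new, first_pass]))


print(predict_two_stage(X[:5]).shape)
```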

Neural networks are better suited to this situation. The number of labels is the number of output neurons in the network. We can then apply any binary classification loss to the model, which outputs all targets simultaneously. This solves both problems of non-neural-network models: we only need to train one model, and the network can learn the correlations between labels because all output neurons share the same hidden layers.
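A small sketch of a single network with one output neuron per label. The article does not name a framework; scikit-learn's `MLPClassifier` is used here only because it keeps the example short, and the dataset and layer size are arbitrary illustrative choices. When fitted on a binary indicator matrix, it uses logistic (sigmoid) output units, one per label.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.neural_network import MLPClassifier

X, Y = make_multilabel_classification(
    n_samples=300, n_features=10, n_classes=4, random_state=0
)

# One shared hidden layer; the output layer gets one neuron per label.
net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
net.fit(X, Y)

# Independent probabilities per label: rows do not need to sum to 1.
Y_prob = net.predict_proba(X)

print(net.n_outputs_, net.out_activation_, Y_prob.shape)
```

Because each output is an independent sigmoid rather than one softmax over all labels, several labels can be active for the same input.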

Supervised feature selection methods

Before starting any feature engineering or selection, features should be normalized or standardized. Using QuantileTransformer reduces the skewness of the data and, when configured to output a normal distribution, makes the features approximately normally distributed. Another option is to standardize the features by subtracting the mean and then dividing by the standard deviation. Both aim to make the data more robust for modeling, but standardization is a linear rescaling that leaves the shape of each distribution unchanged, while QuantileTransformer has a higher computational cost.
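To make the difference concrete, here is a small sketch comparing the two transforms on a deliberately skewed, randomly generated feature matrix. The data, the `skewness` helper and the `n_quantiles` value are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer, StandardScaler


def skewness(a):
    # Sample skewness per column: third central moment over std cubed.
    centered = a - a.mean(axis=0)
    return (centered ** 3).mean(axis=0) / (centered ** 2).mean(axis=0) ** 1.5


rng = np.random.default_rng(0)
# Hypothetical, strongly right-skewed features (log-normal draws).
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 3))

# Map each feature onto a normal distribution via its empirical quantiles.
qt = QuantileTransformer(
    n_quantiles=100, output_distribution="normal", random_state=0
)
X_quantile = qt.fit_transform(X)

# Subtract the mean and divide by the standard deviation.
X_standard = StandardScaler().fit_transform(X)

print("raw skewness:         ", skewness(X))
print("after quantile map:   ", skewness(X_quantile))
print("after standardization:", skewness(X_standard))
```

Note that `QuantileTransformer` defaults to a uniform output distribution, so `output_distribution="normal"` has to be set explicitly to get normally distributed features.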

Using supervised feature selection in this context is a bit tricky, because most algorithms are designed for a single target. To work around this, we can convert the multi-label case into a multi-class problem. One popular approach is LabelPowerset, in which each unique label combination in the training data is converted into one class. The scikit-multilearn library provides tools for this.

Tool links:

scikit.ml/api/skmulti…

After the transformation, we can use information gain or the chi-squared test (chi2) to select features. While this approach works, things get tricky when there are hundreds or even thousands of unique label combinations, and that is where an unsupervised feature selection approach might be better.
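The transformation and selection steps can be sketched without scikit-multilearn by encoding each label combination as an integer class by hand. This is a simplified stand-in for LabelPowerset, not that library's API; the dataset, `k=5`, and the use of mutual information as a proxy for information gain are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

# Features from this generator are non-negative counts, as chi2 requires.
X, Y = make_multilabel_classification(
    n_samples=300, n_features=20, n_classes=3, random_state=0
)

# Label powerset by hand: read each row of Y as a binary number, so every
# distinct label combination maps to its own integer class.
weights = 1 << np.arange(Y.shape[1])
y_powerset = Y @ weights

# Chi-squared selection against the single multi-class target.
selector = SelectKBest(score_func=chi2, k=5)
X_chi2 = selector.fit_transform(X, y_powerset)

# Mutual information between each (discrete) feature and the class.
mi_scores = mutual_info_classif(
    X, y_powerset, discrete_features=True, random_state=0
)
top_mi = np.argsort(mi_scores)[::-1][:5]

print("label combinations:", len(np.unique(y_powerset)))
print("chi2 kept columns: ", np.flatnonzero(selector.get_support()))
print("top MI columns:    ", top_mi)
```

With L labels there can be up to 2^L classes, which is why this encoding becomes unwieldy as the number of distinct combinations grows.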

Unsupervised feature selection methods

In the unsupervised method, we do not need to consider the nature of the multi-label case because the unsupervised method does not rely on labels.

Here are some algorithms:

  • Principal component analysis (PCA) or other similar factor analysis methods. These remove redundant information from the features and extract useful signal for the model. An important caveat is to make sure the data is standardized before applying PCA, so that each feature contributes equally to the analysis. Another trick with PCA is to concatenate the reduced features back onto the original data as additional information that the model can choose to use, instead of replacing the original features with the reduced ones.

  • Variance threshold. This is a simple and effective way to reduce the feature dimension. We discard features that have low variance, that is, little spread. The selection threshold can be tuned, generally starting from 0.5.

  • Clustering. We can create a new feature by clustering the input data and then assigning each row its cluster as a new feature column.

Figure: KMeans clustering
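The three unsupervised techniques listed above can be sketched together. The data is randomly generated with known per-column spreads so that the effect of the 0.5 variance threshold is easy to see; the number of PCA components and clusters are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features with standard deviations 0.1, 0.2, 1, 2, 3, 0.5.
scales = np.array([0.1, 0.2, 1.0, 2.0, 3.0, 0.5])
X = rng.normal(size=(500, 6)) * scales

# Variance threshold: drop columns whose variance is at most 0.5.
# This is applied to the raw data, since variance depends on scale.
vt = VarianceThreshold(threshold=0.5)
X_vt = vt.fit_transform(X)

# PCA on standardized data, then appended to the original features
# instead of replacing them.
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=2, random_state=0).fit_transform(X_scaled)
X_with_pca = np.hstack([X, X_pca])

# KMeans cluster id for each row, added as a new feature column.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
cluster_id = kmeans.fit_predict(X_scaled)
X_with_cluster = np.column_stack([X, cluster_id])

print(X_vt.shape, X_with_pca.shape, X_with_cluster.shape)
```

The cluster id is a category rather than a quantity, so depending on the downstream model it may be preferable to one-hot encode it instead of passing it as a single numeric column.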

Upsampling method

Upsampling is useful when the class distribution is highly imbalanced: we generate artificial samples for rare classes so that the model pays more attention to them. To create new samples in a multi-label setting, we can use MLSMOTE (Multi-Label Synthetic Minority Over-sampling Technique).

MLSMOTE Project address:

Github.com/niteshsukhw…

This is a variant of the original SMOTE method. After generating data for a minority class and assigning the corresponding minority label, we also generate the other labels of the new data point by counting how often each label occurs among the neighboring data points and keeping the labels that occur in more than half of them.
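The procedure can be sketched in a few lines. This is a simplified illustration of the idea, not the code from the project linked above: the function name `mlsmote_sketch`, its parameters, handling only one minority label per call, and including the seed point in the label vote are all assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def mlsmote_sketch(X, Y, minority_label, n_new, k=5, seed=0):
    """Simplified MLSMOTE-style oversampling for one minority label."""
    rng = np.random.default_rng(seed)
    # Rows carrying the minority label are the seeds for new samples.
    mask = Y[:, minority_label] == 1
    X_min, Y_min = X[mask], Y[mask]
    if len(X_min) < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, len(X_min) - 1)
    # Ask for k + 1 neighbours: the first one is normally the point itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbours = nn.kneighbors(X_min)

    X_new = np.empty((n_new, X.shape[1]))
    Y_new = np.empty((n_new, Y.shape[1]), dtype=Y.dtype)
    for i in range(n_new):
        seed_idx = rng.integers(len(X_min))
        neigh_idx = neighbours[seed_idx, 1:]
        ref_idx = rng.choice(neigh_idx)
        # Features: random point on the segment from seed to neighbour,
        # as in the original SMOTE.
        gap = rng.random()
        X_new[i] = X_min[seed_idx] + gap * (X_min[ref_idx] - X_min[seed_idx])
        # Labels: keep those present in more than half of the seed and
        # its neighbours.
        votes = Y_min[np.append(neigh_idx, seed_idx)].sum(axis=0)
        Y_new[i] = (votes > (k + 1) / 2).astype(Y.dtype)
    return X_new, Y_new


# Toy data in which the third label is rare (12 positives out of 200).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Y = (rng.random((200, 3)) < 0.5).astype(int)
Y[:, 2] = 0
Y[:12, 2] = 1

X_new, Y_new = mlsmote_sketch(X, Y, minority_label=2, n_new=30)
X_aug, Y_aug = np.vstack([X, X_new]), np.vstack([Y, Y_new])
print(Y[:, 2].sum(), "->", Y_aug[:, 2].sum())
```

Because every seed and neighbour carries the minority label, that label always wins the vote for the synthetic rows, while the remaining labels follow the local majority.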

By Andy Wang

Compilation: CV technical Guide

Original link:

Andy-wang.medium.com/bags-of-tri…
