Text classification is a foundational NLP task. It appears everywhere in search, recommendation, dialogue, and other scenarios, and has mature research branches and datasets such as sentiment analysis, news classification, and tag classification.
This article covers the principles, pros and cons, and practical tricks of common deep learning text classification models. It is one chapter of "Getting Started with NLP" and will be improved continuously. Your comments are welcome:
https://github.com/leerumor/nlp_tutorial
Fasttext
Paper: https://arxiv.org/abs/1607.01759
Code: https://github.com/facebookresearch/fastText
Fasttext is a handy tool from Facebook that provides both text classification and word vector training.
Fasttext’s classification implementation is simple: convert the input into a word vector, average it, and run it through a linear classifier to get the categories. Input word vectors can be pre-trained or randomly initialized to be trained along with the classification task.
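As a quick illustration, here is a minimal sketch with the official fasttext Python package; the file name and hyperparameters are placeholders, not recommendations:

```python
# Minimal supervised fastText sketch using the official `fasttext` package.
# train.txt is a placeholder file: one sample per line, labels prefixed
# with "__label__", e.g. "__label__sports the game went into overtime".
import fasttext

model = fasttext.train_supervised(
    input="train.txt",
    dim=100,         # word vector dimension
    wordNgrams=2,    # word-level bigram features
    minn=3, maxn=6,  # char n-gram range, helps with OOV long-tail words
    epoch=5,
    loss="softmax",  # switch to "hs" (hierarchical softmax) for many classes
)
labels, probs = model.predict("the game went into overtime")
print(labels, probs)
```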
Fasttext is still widely used today and has the following advantages:
- The model itself has low complexity, yet quickly produces a solid baseline for the task
- Facebook uses C++ to further improve computing efficiency
- Char-level n-grams are used as additional features; for example, the trigrams of "paper" are [pap, ape, per]. Both the input words and their n-grams are converted into vectors. This solves the OOV (out-of-vocabulary) problem for long-tail words on the one hand, and improves performance through n-gram features on the other (see the toy snippet after this list)
- When there are too many categories, hierarchical Softmax is supported for classification to improve efficiency
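A toy illustration of how the char n-grams are built (real fastText also hashes them into a fixed number of buckets; the boundary symbols < and > are part of its actual scheme):

```python
def char_ngrams(word: str, n: int = 3) -> list[str]:
    """Build char n-grams fastText-style: wrap the word in boundary
    symbols, then slide a window of size n over it."""
    w = f"<{word}>"
    return [w[i:i + n] for i in range(len(w) - n + 1)]

print(char_ngrams("paper"))  # ['<pa', 'pap', 'ape', 'per', 'er>']
```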
For long texts and scenarios with strict speed requirements, Fasttext is the first choice for a baseline. It is also good for training word vectors on unsupervised corpora for text representation. Further improvement, though, requires more sophisticated models.
TextCNN
Paper: https://arxiv.org/abs/1408.5882
Code: https://github.com/yoonkim/CNN_sentence
TextCNN is a model proposed by Yoon Kim in 2014, which pioneered the use of CNN to encode N-gram features.
The model structure is shown in the figure below. Convolution on images is two-dimensional, while TextCNN uses a "one-dimensional" convolution of size filter_size * embedding_dim, where one dimension always equals the embedding dimension, so each filter extracts filter_size-gram information. Taking one sample as an example, the overall forward pass is:
- Embed the words, obtaining a [seq_length, embedding_dim] matrix
- Apply N convolution kernels, obtaining N one-dimensional feature maps of length seq_length - filter_size + 1
- Max-pool each feature map over the time dimension (max-over-time pooling), obtaining N 1x1 values
- Concatenate these into an N-dimensional vector, which serves as the sentence representation
- Project the N-dimensional vector to the number of classes and apply softmax
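A minimal PyTorch sketch of this forward pass; the dimensions and hyperparameters are illustrative, not the paper's exact settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, num_classes=2,
                 filter_sizes=(2, 3, 4), num_filters=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # One Conv2d per filter size; each kernel spans the full embedding
        # dimension, so it only slides along time ("one-dimensional" conv).
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, num_filters, (fs, embed_dim)) for fs in filter_sizes]
        )
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, x):                      # x: [batch, seq_len]
        e = self.embed(x).unsqueeze(1)         # [batch, 1, seq_len, embed_dim]
        feats = []
        for conv in self.convs:
            fm = F.relu(conv(e)).squeeze(3)    # [batch, filters, seq_len-fs+1]
            # max-over-time pooling: keep the strongest feature per filter
            feats.append(F.max_pool1d(fm, fm.size(2)).squeeze(2))
        h = torch.cat(feats, dim=1)            # sentence representation
        return self.fc(h)                      # logits; softmax for probs
```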
In practice, TextCNN has many knobs worth tuning (see this paper [1]):
- Filter size: determines the length of the extracted n-gram features and mainly depends on the data. If the average text length is under 50, sizes under 10 are fine; otherwise, go longer. When tuning, first grid-search over a single size to find the optimum, then try combinations of the optimal size with its neighbors
- Filter number: affects the dimension of the final feature vector; too many filters slow down training. Tune within 100 to 600
- CNN activation function: try identity, ReLU, and tanh
- Regularization: i.e. regularizing the CNN parameters. Dropout or L2 can be used, but the effect is small; try a small dropout rate (<0.5), and L2 helps even less
- Pooling method: choose among mean, max, and k-max pooling depending on the situation. Max performs well in most cases, because classification tasks rarely need fine-grained semantics and capturing the strongest features is enough
- Embedding table: Chinese can be input at the char level or the word level, or both, which improves results. If training data is sufficient (100k+ samples), embeddings can also be trained from scratch
- Distill BERT's logits, using unsupervised in-domain data
- Deepen the fully connected part: the original paper used only one fully connected layer; three or four layers work better [2]
TextCNN is well suited as a strong baseline for short and medium-length text, but not for long text, because convolution kernels usually cannot be set large enough to capture long-distance features; max-pooling also has its limits and discards some useful features. Besides, on closer thought, TextCNN is essentially the same as the traditional n-gram bag-of-words model; its good performance largely comes from the introduction of word vectors [3], which solve the sparsity problem of bag-of-words.
DPCNN
Paper: https://ai.tencent.com/ailab/media/publications/ACL3-Brady.pdf
Code: https://github.com/649453932/Chinese-Text-Classification-Pytorch
As noted above, TextCNN is too shallow to capture long-distance dependencies. Can we simply stack several CNN layers? Try it if you are interested; you will find it is not as easy as it sounds. It was not until 2017 that Tencent proposed DPCNN, a model that successfully made TextCNN deeper:
In the figure above, ShallowCNN refers to TextCNN. DPCNN's core improvements are as follows (a sketch of the repeated block follows the list):
- Region embedding: instead of weighted convolution as in a plain CNN, pool n words and then apply a 1x1 convolution; experiments showed similar results, and the author argues the former has stronger representation power and overfits more easily
- Downsampling: a 1/2 pooling layer (kernel size=3, stride=2) doubles the sequence length the model can encode at each stage (sketch it on paper and you will see why)
- Residual connections, following ResNet, mitigate vanishing gradients
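A hedged PyTorch sketch of the repeatable DPCNN stage (1/2 pooling plus two equal-width convolutions with a residual connection); the 250-channel width follows common reimplementations, not a hard rule:

```python
import torch.nn as nn

class DPCNNBlock(nn.Module):
    """One repeatable DPCNN stage: 1/2 pooling (size=3, stride=2) halves
    the sequence length, then two size-3 convolutions with a residual
    connection. Each stage doubles the text span each position covers."""
    def __init__(self, channels=250):
        super().__init__()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):              # x: [batch, channels, seq_len]
        x = self.pool(x)               # [batch, channels, ~seq_len/2]
        h = self.conv1(self.relu(x))   # pre-activation ordering
        h = self.conv2(self.relu(h))
        return x + h                   # residual connection
```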
With some subtle improvements above, DPCNN is 1-2 percentage points better than TextCNN.
TextRCNN
Paper: https://dl.acm.org/doi/10.5555/2886521.2886636
Code: https://github.com/649453932/Chinese-Text-Classification-Pytorch
In addition to DPCNN’s way of increasing receptive fields, RNN can also alleviate the problem of long-distance dependence. Here is a classic TextRCNN.
The forward process of the model is:
- Get the embedding e(w_i) of each word w_i
- Run a bidirectional RNN to obtain the left and right context representations c_l(w_i) and c_r(w_i)
- Concatenate [c_l(w_i); e(w_i); c_r(w_i)] and apply a tanh transformation
- Max-pool over time to obtain the sentence representation, then classify
The "convolutional" part of the name is really just the max-pooling step. Adding the RNN makes it 1-2 percentage points better than a pure CNN.
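A minimal PyTorch sketch in the spirit of these steps; like most reimplementations, it uses the BiLSTM outputs directly as the left/right context:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextRCNN(nn.Module):
    # Dimensions are illustrative.
    def __init__(self, vocab_size, embed_dim=128, hidden=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden, bidirectional=True,
                           batch_first=True)
        # transform [left context; embedding; right context] with tanh
        self.proj = nn.Linear(2 * hidden + embed_dim, hidden)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                      # x: [batch, seq_len]
        e = self.embed(x)                      # [batch, seq_len, embed_dim]
        ctx, _ = self.rnn(e)                   # [batch, seq_len, 2*hidden]
        h = torch.tanh(self.proj(torch.cat([ctx, e], dim=2)))
        h = h.transpose(1, 2)                  # [batch, hidden, seq_len]
        s = F.max_pool1d(h, h.size(2)).squeeze(2)  # max over time
        return self.fc(s)
```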
TextBiLSTM+Attention
Paper: https://www.aclweb.org/anthology/P16-2034.pdf
Code: https://github.com/649453932/Chinese-Text-Classification-Pytorch
From the methods introduced above, a natural text classification framework emerges: encode the tokens in context, pool them into a sentence representation, then classify. For the final pooling, max-pooling usually performs best, because text classification is mostly topic classification: one or two key words in a sentence can decide the label, while the rest is largely noise that contributes nothing. For finer-grained analysis, however, max-pooling may throw away useful features, in which case attention can be used to fuse the sentence instead:
BiLSTM needs no further explanation here; just note that the attention score is computed after a transformation:

$$u_i = \tanh(W h_i + b), \qquad \alpha_i = \frac{\exp(u_i^\top u_w)}{\sum_j \exp(u_j^\top u_w)}, \qquad s = \sum_i \alpha_i h_i$$

where $u_w$ is the context vector, randomly initialized and updated during training, and $h_i$ is the encoder state at position $i$. The result $s$ is the sentence representation, which is then classified.
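A sketch of this attention pooling in PyTorch, assuming encoder outputs h of shape [batch, seq_len, hidden]:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Attention pooling in the spirit of the formulas above;
    a sketch, not the paper's exact code."""
    def __init__(self, hidden):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)
        # u_w: the context vector, randomly initialized, learned in training
        self.u_w = nn.Parameter(torch.randn(hidden))

    def forward(self, h):                       # h: [batch, seq_len, hidden]
        u = torch.tanh(self.proj(h))            # [batch, seq_len, hidden]
        scores = u @ self.u_w                   # [batch, seq_len]
        alpha = torch.softmax(scores, dim=1)    # attention weights
        return (alpha.unsqueeze(2) * h).sum(1)  # sentence vector
```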
The same attention trick can also replace pooling after a CNN encoder. According to the experimental results, adding attention brings about a 2-point improvement. For sentiment analysis, where the classification result is determined by the sentence as a whole rather than a few key words, RNN-based encoders are the first choice.
HAN
Paper: https://www.aclweb.org/anthology/N16-1174.pdf
Code: https://github.com/richliao/textClassifier
The models above do sentence-level classification. They can be applied to long documents at the discourse level, but speed and accuracy drop. Hence some researchers proposed the Hierarchical Attention classification framework: first encode each sentence with BiGRU+Attention to get sentence vectors, then encode the sentence vectors with BiGRU+Attention to get a document-level representation for classification:
The method is intuitive, but the result is less than 1 point better than avg or max pooling.
BERT
BERT's principles and code need no introduction here.
For BERT classification, the following optimizations are worth trying:
- Try different pre-trained models, such as RoBERTa, BERT-wwm, ALBERT
- Besides the [CLS] vector, try avg or max pooling over the sentence, or even combine different layers (see the sketch after this list)
- Do incremental pre-training on in-domain data
- Ensemble distillation: ensemble several large models and distill them into one
- Train with multi-task learning first, then transfer to your own task
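A hedged sketch of the pooling alternatives from the second point, using the HuggingFace transformers API; the model name is only an example:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

inputs = tokenizer("文本分类示例", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

hidden = out.last_hidden_state                 # [batch, seq_len, hidden]
mask = inputs["attention_mask"].unsqueeze(-1)  # ignore padding positions

cls_vec = hidden[:, 0]                                       # [CLS] vector
avg_vec = (hidden * mask).sum(1) / mask.sum(1)               # masked mean
max_vec = hidden.masked_fill(mask == 0, -1e9).max(1).values  # masked max
```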
Other models
Besides the commonly used models above, popular ones such as Capsule Network [4] and TextGCN [5] involve more background knowledge, so this article skips them for now (hehe).
They are rarely used in production, but can be handy in machine learning competitions. Capsule Network has been shown to be far better than CNN and LSTM on multi-label transfer tasks [6], but there has been little research on it since 2018. TextGCN can learn more global information and suits semi-supervised scenarios, but it performs poorly on longer texts that require word-order information [7].
Tricks
That about covers the models. Below are some hard-won lessons from my own data processing; different opinions are welcome ~
Data set construction
First, building the label system. When I get a task, I label one or two hundred samples myself and see how many are hard to decide (taking more than a second of thought). If that ratio is too high, the task definition is problematic: the label system may be unclear, or the target categories may be too hard to separate. In that case, give feedback to the project owner instead of pressing on.
Second, building the training and evaluation sets. Construct two evaluation sets: one matching the real online data distribution, to reflect online performance, and one uniformly sampled after rule-based filtering, to reflect the model's true ability. The training set should match the evaluation sets' distribution as closely as possible. Sometimes we borrow ready-made labeled data from a similar domain; in that case, adjust its distribution (sentence length, punctuation, cleanliness, etc.) until it is impossible to tell whether a sentence comes from the task itself or was borrowed.
Finally, data cleaning:
- Remove strong text patterns: in news topic classification, for example, high-frequency fields like "reported by XX" or "edited by XX" are useless. Count fragment or word frequencies over the corpus and remove useless elements that appear very often. For instance, when judging whether a sentence was meaningless chit-chat, I found that adding a period would flip a sample from positive to negative, because the expected chit-chat rarely contained periods (a typing-habit thing); removing this pattern helped a lot
- Correct labeling errors: I have done this again and again, turning myself from an algorithm engineer into an annotator. In short, merge the training and evaluation sets, train a model on them for two or three epochs (to prevent overfitting), then predict on the same data. Sort the samples the model gets wrong by abs(label - prob): if there are few, review them yourself; if there are many, send them back to the annotators. This can improve data quality by several points (a sketch follows)
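A sketch of that relabeling loop, assuming binary labels in {0, 1}; train_model and predict_proba are hypothetical placeholders for your own training and inference code:

```python
import numpy as np

def suspicious_samples(texts, labels, top_k=500):
    # Underfit on purpose: only 2 epochs, so the model does not
    # memorize the label noise it is supposed to expose.
    model = train_model(texts, labels, epochs=2)   # hypothetical
    probs = predict_proba(model, texts)            # hypothetical, P(label=1)
    gap = np.abs(np.asarray(labels) - probs)       # |label - prob|
    order = np.argsort(-gap)                       # most suspicious first
    return [(texts[i], labels[i], probs[i]) for i in order[:top_k]]
```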
Long text
For simple tasks (such as news classification), fastText alone works just fine.
With BERT, the simplest approach is crude truncation, for example taking the first sentence plus the last sentence, or the first sentence plus a few words screened by TF-IDF; alternatively, predict each sentence separately and merge the results at the end.
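A sketch of head+tail truncation; the 128-token head is one common split, not a rule:

```python
def head_tail_truncate(tokens, max_len=512, head=128):
    """Keep the first `head` tokens and fill the rest of the budget
    from the end of the document."""
    if len(tokens) <= max_len:
        return tokens
    tail = max_len - head
    return tokens[:head] + tokens[-tail:]
```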
There are also a few models to try, such as XLNet, Reformer, and Longformer.
If it is an offline task and there is time to run everything, just trust the model's encoding ability.
Few samples
Since I started using BERT, I have rarely been troubled by imbalanced or scarce data, so just train a first version without overthinking it.
If you only have a few hundred samples, you can first turn the classification problem into a matching problem, or use that idea to label some high-confidence data, or try self-supervised and semi-supervised methods.
Robustness
In real applications, robustness is a serious issue; otherwise bad cases get awkward: how do you explain that a sentence is classified correctly, yet adding one word makes it wrong?
Here you can directly apply some crude data augmentation: add stop words, add punctuation, delete words, replace synonyms, and so on; if performance drops, clean the augmented training data.
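A crude sketch of such perturbations; a real pipeline would add synonym replacement and stop-word insertion on top:

```python
import random

def perturb(words):
    """Randomly drop a word or insert punctuation, to probe whether
    predictions flip on trivially edited inputs."""
    w = list(words)
    if random.random() < 0.5 and len(w) > 1:
        w.pop(random.randrange(len(w)))               # delete a random word
    else:
        w.insert(random.randrange(len(w) + 1), "，")  # insert punctuation
    return w

print(perturb("这个 产品 很 好用".split()))
```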
Of course, advanced techniques such as adversarial training and contrastive learning can also help, usually by about 1 point, but they will not necessarily avoid the awkward situation above.
Conclusion
Text classification is one of the most common tasks in industry and the first task most NLP beginners take on. I started from nothing and practiced text classification all the way from training to deployment. Many models were presented above, but only a few see common use in real tasks. Some quick selection advice:
In practice, deployment is mainly a game played with data. Data determines the model's upper bound: most manual annotation is only about 95% accurate, while text classification usually demands higher accuracy than that. Rather than struggling with fancy architectures, take a good look at the bad cases and do some data augmentation to improve the model's robustness.
References
[1] A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification: arxiv.org/pdf/1510.03…
[2] Which is more important, the convolution layer or the classification layer?: www.zhihu.com/question/27…
[3] From the classic text classification model TextCNN to the deep model DPCNN: zhuanlan.zhihu.com/p/35457093
[4] Lifting the veil on a delicious Capsule feast: kexue.fm/archives/48…
[5] Graph Convolutional Networks for Text Classification: arxiv.org/abs/1809.05…
[6] Exploring Capsule Networks for text classification: zhuanlan.zhihu.com/p/35409788
[7] What do you think of GNN?: www.zhihu.com/question/30…