In intelligent customer service, intent classification is central. Some intents have only a handful of training samples, the so-called long-tail problem, which a traditional supervised classification model cannot solve. Machine learning has a dedicated subfield for classification in such scenarios: few-shot learning. Below are some summaries of representative work.

1. Matching Networks (NIPS 2016)

This is an early paper on one-shot learning, from Google DeepMind.

Model description

This paper proposes a few-shot learning model that differs from the ordinary machine-learning setup, in which a model is trained on all the training data and then evaluated on a test set. Few-shot learning introduces the concept of an episode in the training stage: an episode is formed by randomly selecting N categories and K samples per category (N-way K-shot). The loss is computed on each episode, and the model is updated episode by episode.
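For concreteness, here is a minimal sketch of how an N-way K-shot episode could be sampled from a labeled dataset. The `dataset` layout (a dict mapping each label to its examples) and the query size are illustrative assumptions, not details from the paper.

```python
import random

def sample_episode(dataset, n_way=5, k_shot=5, n_query=15):
    """Sample one N-way K-shot episode from a dict {label: [examples]}."""
    classes = random.sample(list(dataset.keys()), n_way)    # pick N categories at random
    support, query = [], []
    for label in classes:
        examples = random.sample(dataset[label], k_shot + n_query)
        support += [(x, label) for x in examples[:k_shot]]  # K samples per class for the support set
        query += [(x, label) for x in examples[k_shot:]]    # held-out samples for the query set
    return support, query
```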

Model introduction

This model solves the following problem: given a support set $S$, how to obtain the probability that a test sample $\hat{x}$ belongs to label $\hat{y}$. In mathematical terms: $P(\hat{y} \mid \hat{x}, S)$.

The author gives the calculation method as follows:

$$\hat{y} = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i$$

Here $a(\hat{x}, x_i)$ is an attention weight: every sample $(x_i, y_i)$ in the support set simply contributes to the prediction with a different weight, and the model's loss is computed on this weighted prediction. The calculation of $a$ is also relatively simple: the embeddings of the query $\hat{x}$ and of the support samples $x_i$ are compared by cosine distance and normalized with a softmax:

$$a(\hat{x}, x_i) = \frac{e^{c(f(\hat{x}),\, g(x_i))}}{\sum_{j=1}^{k} e^{c(f(\hat{x}),\, g(x_j))}}$$

where $f(\cdot)$ and $g(\cdot)$ are the respective feature extractors (encoders). The authors give the encoders on the support-set and query-set sides their own special designs (see the paper for details); both are essentially sequence encoders built on a bi-LSTM.
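As a rough PyTorch sketch of this attention step (cosine similarity between the query embedding and the support embeddings, a softmax, then a weighted sum of one-hot labels); the encoders $f$ and $g$ are assumed to be given, and the full-context-embedding refinements from the paper are omitted:

```python
import torch
import torch.nn.functional as F

def matching_predict(query_emb, support_embs, support_labels, n_way):
    """query_emb: (d,), support_embs: (k, d), support_labels: (k,) class ids."""
    # a(x_hat, x_i): softmax over cosine similarities
    sims = F.cosine_similarity(query_emb.unsqueeze(0), support_embs, dim=1)  # (k,)
    attn = F.softmax(sims, dim=0)                                            # (k,)
    # y_hat = sum_i a(x_hat, x_i) * y_i, with y_i as one-hot labels
    one_hot = F.one_hot(support_labels, num_classes=n_way).float()           # (k, n_way)
    return attn @ one_hot                                                    # (n_way,) class probabilities
```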

The training process

The process of iteration is as follows:

  1. Select a few categories (e.g., 5 categories) and select a small number of samples in each category (e.g., 5 samples per category);
  2. Divide the selected set into support-set and query-set;
  3. Using the support set of this iteration, calculate the error on the query set;
  4. Calculate gradients and update the parameters.

Such a process is referred to in the paper as an episode.
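Putting the steps together, a schematic training loop might look like the following; `sample_episode` reuses the sketch above, and `episode_loss` is a hypothetical method standing in for the cross-entropy over $P(\hat{y} \mid \hat{x}, S)$ on the query set.

```python
def train(model, dataset, optimizer, n_episodes=10000, n_way=5, k_shot=5):
    for _ in range(n_episodes):
        # Steps 1-2: sample an episode and split it into support and query sets
        support, query = sample_episode(dataset, n_way, k_shot)
        # Step 3: compute the query-set error conditioned on the support set
        loss = model.episode_loss(support, query)  # hypothetical helper, not from the paper
        # Step 4: compute gradients and update the parameters
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```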

paper code

2. Relation Network

Model description

This paper is Learning to Compare: Relation Network for Few-Shot Learning (CVPR 2018).

Model introduction

This model mainly consists of two parts: 1. The embedding module, which is again an encoder that encodes the input samples in the support set and the query set; this paper uses a CNN with four convolution blocks.

2. The relation module.

The main difference from the Matching Network is that the embedding produced by the encoder is not used directly to compute the probability distribution of $\hat{x}$'s label. Instead, the Relation Network concatenates the embedding of each support-set sample with the embedding of the query-set sample, and feeds the result into the relation network (which is just a neural network) to predict the probability that they belong to the same label. This can be expressed by the following formula:

$$r_{i,j} = g_{\phi}\big(\mathcal{C}(f_{\varphi}(x_i),\, f_{\varphi}(x_j))\big)$$

where $\mathcal{C}(\cdot,\cdot)$ denotes concatenation, $f_{\varphi}$ is the embedding module, and $g_{\phi}$ is the relation module. The loss function is that of a typical regression problem (positive pairs are regressed to 1, negative pairs to 0):

$$\varphi, \phi \leftarrow \arg\min_{\varphi, \phi} \sum_{i=1}^{m} \sum_{j=1}^{n} \big(r_{i,j} - \mathbf{1}(y_i == y_j)\big)^2$$

To put it simply, the innovation of the Relation Network is to use a neural network, instead of a fixed metric such as Euclidean distance, to compute the matching degree between two feature vectors.
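A minimal sketch of that idea: concatenate the two embeddings and let a small network regress a relation score in [0, 1]. The layer sizes are illustrative, and the paper's convolutional relation module is simplified here to an MLP.

```python
import torch
import torch.nn as nn

class RelationModule(nn.Module):
    """Learned similarity g_phi: score in [0, 1] for a (support, query) embedding pair."""
    def __init__(self, emb_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, support_emb, query_emb):
        pair = torch.cat([support_emb, query_emb], dim=-1)  # concatenation instead of a fixed metric
        return self.net(pair).squeeze(-1)                   # relation score, regressed to 1 / 0 with MSE
```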


3. Induction Network

The model, proposed by the Alibaba AliMe team, is mainly divided into three modules: 1. Encoder, 2. Induction, 3. Relation. Note how the Support Set and Query Set are built: both are drawn from the same C classes.

The following is the model structure of this paper:

1.Encoder Module

They encode sentences using a bidirectional LSTM with self-attention; the resulting sentence representation is $e$. Of course, you could also use a Transformer or BERT to encode the sentences.
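A rough sketch of such a BiLSTM + self-attention sentence encoder; the dimensions and the attention parameterization are illustrative and may differ from the paper.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """BiLSTM over word embeddings + self-attention pooling -> sentence vector e."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=128, attn_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Sequential(
            nn.Linear(2 * hidden_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1)
        )

    def forward(self, token_ids):                      # (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))        # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)   # attention over time steps
        return (weights * h).sum(dim=1)                # sentence vector e: (batch, 2*hidden)
```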

2.Induction Module

The purpose of this module is to design a nonlinear mapping that maps all K sentence vectors of class $i$ in the Support Set onto a single class vector $c_i$. They adopt the idea of Hinton's capsule networks, and the main algorithm is as follows:

It essentially applies the dynamic routing strategy from capsule networks. The resulting $c_i$ can thus be regarded as the class vector representation of class $i$.
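A sketch of this dynamic-routing induction step for one class, assuming its K support embeddings are stacked into a (K, d) tensor; the shared transform `W_s` and the number of routing iterations follow the general capsule-network recipe, so treat this as a simplified reading rather than a faithful reimplementation of the paper's algorithm.

```python
import torch

def squash(v, dim=-1, eps=1e-8):
    """Capsule-style nonlinearity: shrink short vectors, keep direction."""
    norm_sq = (v ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * v / torch.sqrt(norm_sq + eps)

def induce_class_vector(support_embs, W_s, n_iters=3):
    """support_embs: (K, d) sample vectors of one class -> class vector c_i: (d,)."""
    e_hat = squash(support_embs @ W_s)                       # shared nonlinear transform per sample
    b = torch.zeros(e_hat.size(0))                           # routing logits, one per support sample
    for _ in range(n_iters):
        d = torch.softmax(b, dim=0)                          # coupling coefficients
        c = squash((d.unsqueeze(1) * e_hat).sum(dim=0))      # candidate class vector
        b = b + e_hat @ c                                    # update logits by agreement
    return c
```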

3.Relation Module

For each query in the Query Set, the same Encoder Module produces the corresponding sentence vector $e^q$, which is then compared with the class vector representation from the Induction Module. Comparing the relation between two vectors is a typical problem; it is handled with a neural network layer followed by a sigmoid, which produces the relation score $r_{iq}$ between query $q$ and class $i$.
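A sketch of this scoring step, using a neural tensor layer followed by a sigmoid; the number of tensor slices `h` and the initialization scale are assumptions.

```python
import torch
import torch.nn as nn

class NeuralTensorRelation(nn.Module):
    """Relation score r_iq between a class vector c_i and a query vector e_q."""
    def __init__(self, dim, h=100):
        super().__init__()
        self.M = nn.Parameter(torch.randn(h, dim, dim) * 0.05)  # h bilinear interaction slices
        self.out = nn.Linear(h, 1)

    def forward(self, c, e_q):
        v = torch.relu(torch.einsum("d,hde,e->h", c, self.M, e_q))  # v_k = relu(c^T M_k e_q)
        return torch.sigmoid(self.out(v)).squeeze(-1)               # r_iq in [0, 1]
```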

4.Objective Function

To fit the relation score $r_{iq}$ to the ground truth $y_q$ (1 for the same category, 0 for different categories), the problem is treated as regression and MSE is used to compute the loss. This can be regarded as a pair-wise training scheme, which checks whether each class vector matches the corresponding samples in the Query Set. Given a Support Set $S$ with $C$ categories in an episode, and a Query Set $Q$ with $n$ samples per category, the loss function is as follows:
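In this notation, with $\mathbf{1}(y_q = i)$ as the ground-truth indicator, the episode loss amounts to:

$$L(S, Q) = \sum_{i=1}^{C} \sum_{q=1}^{n} \big( r_{iq} - \mathbf{1}(y_q = i) \big)^2$$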

Implementation details

For a 5-way 5-shot setting, each episode contains C = 5 categories, with K = 5 samples per category in the Support Set. The Query Set contains 20 samples per category. Each episode therefore requires 20 × 5 + 5 × 5 = 125 sentences.

Ctrip implementation details:

Training: C classes (C = 15) are randomly selected for the Support Set, and K samples (K = 20) are selected for each class. The Query Set is a batch (batch size = 128), constructed so that each of the C selected classes has at least one positive example in the batch. The remaining (batch size − C) samples are sampled uniformly at random from all unused data. The Query Set is trained in <query, label> format, where label is a one-hot vector over the C classes: if the query's target label is among the C classes, the corresponding position is 1; otherwise the vector is all zeros.
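A rough sketch of this batch construction; the function and variable names are mine, and details such as restricting the padding samples to previously unused data are simplified.

```python
import random

def build_query_batch(data_by_class, support_classes, batch_size=128):
    """Guarantee at least one positive query per support class, then pad with random samples."""
    batch = [(random.choice(data_by_class[c]), c) for c in support_classes]  # one positive per class
    all_classes = list(data_by_class.keys())
    while len(batch) < batch_size:                       # fill the rest with uniformly sampled data
        c = random.choice(all_classes)
        batch.append((random.choice(data_by_class[c]), c))
    labels = [
        [1.0 if c == sc else 0.0 for sc in support_classes]  # one-hot over the C support classes,
        for _, c in batch                                     # all zeros if the true class is outside them
    ]
    return [x for x, _ in batch], labels
```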

Encoder: a two-layer Transformer with 8 heads; word embeddings and character embeddings are concatenated as input.

Number of categories: 230

Predict:

  1. Pre-compute and save the model's label (class) embeddings;
  2. Compare the query against the label embeddings of all classes, select the argmax, and use a threshold to decide whether to reject the prediction, as sketched below.
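A sketch of that prediction path: compare the query embedding against the pre-saved class embeddings, take the argmax, and reject below a threshold. The use of cosine similarity and the default threshold value are assumptions.

```python
import torch
import torch.nn.functional as F

def predict(query_emb, class_embs, class_names, threshold=0.5):
    """query_emb: (d,), class_embs: (n_classes, d) pre-saved label/class embeddings."""
    scores = F.cosine_similarity(query_emb.unsqueeze(0), class_embs, dim=1)  # (n_classes,)
    best = int(torch.argmax(scores))
    if scores[best] < threshold:          # reject low-confidence predictions
        return None, float(scores[best])
    return class_names[best], float(scores[best])
```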

4. Few-shot Text Classification with Distributional Signatures

5. Dynamic Memory Induction Networks for Few-Shot Text Classification