ICLR 2021

openreview.net/forum?id=JW…

github.com/ShuoYang-19…

Idea

Calibrate the distribution of the few-shot data.

How is it calibrated? By transferring the distribution statistics of classes that have sufficient samples.

Samples are then drawn from the calibrated distribution, so that more samples are available to train the classifier together with the originals.

Each feature dimension is assumed to follow a Gaussian distribution, so the mean and variance can easily be transferred from similar categories (which have many samples and therefore well-estimated statistics).

Rather than calibrating the distribution of the raw input space (high-dimensional and harder to estimate), the authors calibrate the distribution in feature space.

As evidence, the author shows a table of animal classes in which the similarity of means and variances between related categories is apparent: the mean and variance similarities between arctic fox and white fox both reach 97%, while between arctic fox and beer bottle they are only 34% and 11%. This leads to the paper's central hypothesis: each feature dimension follows a Gaussian distribution, and similar classes have similar means and variances in feature space.

Given labeled data D = {(x_i, y_i)}, where x_i ∈ R^d is the feature vector of a sample and y_i ∈ C, with C the set of class labels. This set is divided into base classes C_b and novel classes C_n; the base and novel label sets are disjoint.

The goal is the standard few-shot routine: train on base-class data, then construct N-way K-shot tasks from the novel classes to test generalization. The few-shot support set is S = {(x_i, y_i)}_{i=1}^{N×K} and the query set is Q = {(x_i, y_i)}_{i=N×K+1}^{N×K+N×q}, meaning each class has K training samples and q test samples. The model is finally evaluated by the average accuracy over many sampled tasks.
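
To make the episode setup concrete, here is a minimal sketch (my own, not from the paper's repository) of sampling one N-way K-shot task from pre-extracted novel-class features; the function name and data layout are illustrative assumptions:

```python
import numpy as np

def sample_episode(features, labels, n_way=5, k_shot=1, q_query=15, rng=None):
    """Sample one N-way K-shot task from novel-class data.

    features: (n, d) array of feature vectors
    labels:   (n,) array of integer class labels
    Returns support (N*K) and query (N*q) feature/label arrays.
    """
    rng = np.random.default_rng(rng)
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    sx, sy, qx, qy = [], [], [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        sx.append(features[idx[:k_shot]])                    # K support samples
        sy += [c] * k_shot
        qx.append(features[idx[k_shot:k_shot + q_query]])    # q query samples
        qy += [c] * q_query
    return np.vstack(sx), np.array(sy), np.vstack(qx), np.array(qy)
```

Averaging a metric over many such episodes gives the evaluation protocol described above.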

First, estimate the distribution of the base-class features

The mean of base class i is:


\mu_{i}=\frac{\sum_{j=1}^{n_{i}} x_{j}}{n_{i}}

Here x_j denotes the j-th sample in base class i, and n_i denotes the total number of samples in class i

The covariance matrix of base class i is:


\Sigma_{i}=\frac{1}{n_{i}-1}\sum_{j=1}^{n_{i}}(x_{j}-\mu_{i})(x_{j}-\mu_{i})^{T}
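
These two statistics are straightforward to compute from pooled base-class features; a minimal NumPy sketch (the function name and data layout are my own assumptions, not the authors' code):

```python
import numpy as np

def base_class_stats(features, labels):
    """Estimate per-class mean and covariance of base-class features.

    features: (n, d) array of feature vectors
    labels:   (n,) array of integer class labels
    Returns dicts {class: mean (d,)} and {class: covariance (d, d)}.
    """
    means, covs = {}, {}
    for c in np.unique(labels):
        x = features[labels == c]           # all samples of class c
        means[c] = x.mean(axis=0)           # mu_i
        covs[c] = np.cov(x, rowvar=False)   # unbiased 1/(n_i - 1) covariance
    return means, covs
```

`np.cov(..., rowvar=False)` uses the same 1/(n_i - 1) normalization as the formula above.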

The features of the novel support and query samples are first transformed so that their distribution looks more like a Gaussian, using Tukey's Ladder of Powers transformation with hyperparameter \lambda: for \lambda \neq 0 the transformation is \tilde{x} = x^{\lambda}; for \lambda = 0 it is \tilde{x} = \log(x)
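
The transformation itself is one line; a small sketch assuming non-negative features (which the post-ReLU features used later guarantee):

```python
import numpy as np

def tukey_transform(x, lam=0.5):
    """Tukey's Ladder of Powers: x^lam for lam != 0, log(x) for lam == 0.

    Assumes non-negative inputs (e.g. post-ReLU activations).
    """
    x = np.asarray(x, dtype=float)
    return np.power(x, lam) if lam != 0 else np.log(x)
```

With the paper's setting lam=0.5 this is simply an element-wise square root.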

www.youtube.com/watch?v=PKE…

The calibration process

Let’s start with one-shot

For the single sample of a class in the support set, its feature after the above transformation is denoted \tilde{x}

The first step is to find which base classes are similar to this class. This relies on the paper's most important assumption, and the author uses the Euclidean distance to measure similarity

Compute the (negative squared) Euclidean distance between \tilde{x} and the mean of every base class, giving the set S_d:


S_{d}=\{-||\mu_{i}-\tilde{x}||^{2} \mid i \in C_{b}\}

Take the top-k nearest classes from the set above, giving the set S_N:


S_{N}=\{i \mid -||\mu_{i}-\tilde{x}||^{2} \in \mathrm{topk}(S_{d})\}

Then calibrate the distribution of the support set class as follows:


\mu'=\frac{\sum_{i \in S_{N}}\mu_{i}+\tilde{x}}{k+1}


\Sigma'=\frac{\sum_{i \in S_{N}}\Sigma_{i}}{k}+\alpha

\alpha is a hyperparameter that controls the dispersion of the calibrated distribution
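
Putting the three formulas together, one-shot calibration can be sketched as follows (a NumPy illustration, not the authors' implementation; here \alpha is assumed to be added element-wise to the covariance):

```python
import numpy as np

def calibrate(x_tilde, base_means, base_covs, k=2, alpha=0.21):
    """Calibrate the distribution of one (Tukey-transformed) support sample.

    Picks the k base classes whose means are closest to x_tilde (Euclidean),
    then averages their means (with x_tilde) and covariances (plus alpha).
    base_means: (C, d) array, base_covs: (C, d, d) array.
    """
    # S_d: negative squared Euclidean distances to all base-class means
    dists = -np.sum((base_means - x_tilde) ** 2, axis=1)
    topk = np.argsort(dists)[-k:]                       # S_N: k nearest classes
    mu = (base_means[topk].sum(axis=0) + x_tilde) / (k + 1)
    sigma = base_covs[topk].sum(axis=0) / k + alpha     # alpha added element-wise
    return mu, sigma
```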

In the few-shot case, for a support-set class y with K samples, the above procedure is applied to each of the K samples, yielding a set of means and covariances S_y = \{(\mu_{1}', \Sigma_{1}'), \ldots, (\mu_{K}', \Sigma_{K}')\}

Sampling features


D_{y}=\{(x,y) \mid x \sim \mathcal{N}(\mu,\Sigma),\ \forall(\mu,\Sigma) \in S_{y}\}

The total number of synthetic feature vectors to generate per support class is a hyperparameter, and they are distributed evenly across the distributions in the set. For example, if 10 samples are to be generated and a 5-shot class has 5 calibrated distributions, 2 samples are drawn from each.
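
Sampling then just draws from each calibrated Gaussian; a sketch of the even split described above (function and argument names are my own):

```python
import numpy as np

def sample_features(calibrated, label, total=10, rng=None):
    """Draw `total` synthetic feature vectors for one support class,
    split evenly across its calibrated distributions.

    calibrated: list of (mu, sigma) pairs, i.e. the set S_y.
    """
    rng = np.random.default_rng(rng)
    per_dist = total // len(calibrated)       # even split (remainder dropped)
    feats = [rng.multivariate_normal(mu, sigma, size=per_dist)
             for mu, sigma in calibrated]
    x = np.vstack(feats)
    y = np.full(len(x), label)
    return x, y
```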

The generated feature vectors are then combined with the original few-shot samples (remembering that the originals have gone through Tukey's Ladder of Powers transformation) to train the classifier
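
The final training step is ordinary supervised fitting on the concatenated features; a sketch using scikit-learn's logistic regression (matching the LR option the experiments mention; the helper name is my own):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_classifier(support_x, support_y, gen_x, gen_y):
    """Fit a logistic-regression classifier on the real (transformed)
    support features concatenated with the generated features."""
    X = np.vstack([support_x, gen_x])
    y = np.concatenate([support_y, gen_y])
    return LogisticRegression(max_iter=1000).fit(X, y)
```

At test time, query features are passed through the same Tukey transformation before calling `predict`.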

The experimental details

Three datasets were used

  • miniImageNet

100 classes with 600 samples each, split into 64 base classes, 16 validation classes, and 20 novel classes

  • tieredImageNet

608 classes with an average of 1281 samples per class: 351 base, 97 validation, and 160 novel classes

  • CUB

11788 images in 200 classes, split into 100 base, 50 validation, and 50 novel classes

Evaluation metric: average accuracy over 10000 tasks

A WideResNet feature extractor is trained on the base classes

Note that the feature representation is extracted from the penultimate layer (after the ReLU activation) of the feature extractor

Using the post-ReLU penultimate layer satisfies the requirement of Tukey's Ladder of Powers transformation, since ReLU outputs are non-negative

Then logistic regression (LR) or an SVM is used for prediction

The number of generated feature samples is 750, k = 2 (top-k), \lambda = 0.5, and \alpha is 0.21, 0.21, and 0.3 for the three datasets respectively

On miniImageNet, accuracy with the Tukey transformation was about 5 percentage points higher than without it, although the improvement was not significant at 5-way 5-shot

More generated samples is not always better: within a certain range the accuracy improves, but beyond a threshold it begins to decline. If you run the experiment yourself, this parameter needs tuning