Selected from arXiv

Andrew Ilyas et al.

Compiled by Heart of the Machine

Contributors: Lu Xue, Siyuan

Are adversarial examples a bug in our models? Can the problem be completely solved through adversarial training or other means? The MIT researchers show that adversarial examples stem from features in the data themselves, and that from a supervised learning perspective, these robust and non-robust features are equally important.

Adversarial examples have attracted much attention in machine learning, but the reasons for their existence and prevalence remain unclear. A study from the Massachusetts Institute of Technology (MIT) shows that the emergence of adversarial examples can be attributed directly to the presence of non-robust features: features derived from patterns in the data distribution that are highly predictive, yet brittle and incomprehensible to humans.

The researchers construct a theoretical framework for capturing these features, and use it to establish their widespread presence in standard datasets. Finally, they present a simple task setting in which the observed adversarial-example phenomenon can be rigorously tied to a misalignment between the (human-specified) notion of robustness and the inherent geometry of the data.

Adversarial Examples Are Not Bugs, They Are Features

Paper link: https://arxiv.org/pdf/1905.02175.pdf

What is an adversarial example?

In recent years, the vulnerability of deep neural networks has attracted a great deal of attention, in particular the adversarial example phenomenon: small perturbations of natural inputs, which humans perceive as essentially unchanged, can cause state-of-the-art classifiers to make incorrect predictions.

Given an image of a panda, as shown below, an attacker adds a tiny amount of noise to the image, and the model misclassifies it as a gibbon with very high confidence, even though the two images are nearly indistinguishable to the human eye. As machine learning is deployed at scale, errors of this kind become critical for system security.

The image above is the adversarial example shown by Ian Goodfellow in 2014, generated with an algorithm called FGSM (the fast gradient sign method).
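FGSM perturbs an input by a single step in the direction of the sign of the loss gradient with respect to the pixels. Below is a minimal PyTorch sketch of the idea, assuming a classifier `model`, an input batch `x` scaled to [0, 1], and integer labels `y`; the epsilon value is illustrative, not the one used for the panda image.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.007):
    """One signed-gradient step of size epsilon that increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()  # move each pixel in the direction that hurts the model most
    return x_adv.clamp(0, 1).detach()    # keep the image in a valid range
```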

Given how harmful adversarial examples can be, it is important to understand why they exist. Most previous studies in this area treat adversarial examples as artifacts of the high dimensionality of the input space or of statistical fluctuations in the training data.

From this point of view, it is natural to treat adversarial robustness as a goal that can be pursued independently of maximizing standard accuracy, for example through improved regularization methods or pre- and post-processing of network inputs and outputs.

A new way to understand adversarial examples

So why do adversarial examples exist? Are they a bug in deep neural networks? Many previous studies have tried to explain the adversarial example phenomenon with theoretical models, but none of them account for everything that is observed in practice.

The new MIT study offers a different perspective. In contrast to previous models, it regards adversarial vulnerability as a fundamental consequence of the dominant supervised learning paradigm. Specifically, the authors claim:

Adversarial vulnerability is a direct result of a model's sensitivity to well-generalizing features in the data.

Their hypothesis also explains adversarial transferability: adversarial perturbations computed for one model can often be transferred to another, independently trained model. Since any two models are likely to learn similar non-robust features, perturbations that manipulate such features apply to both. Finally, this perspective casts adversarial vulnerability as a purely human-centric phenomenon, since from the standard supervised learning point of view, non-robust features are just as important as robust ones.

The paper also shows that methods which enhance model interpretability by imposing "priors" on explanations actually hide features that are genuinely "meaningful" and predictive to standard models. Thus, producing explanations that are both meaningful to humans and faithful to the underlying model cannot be pursued independently of how the model itself is trained.

MIT’s main approach

To support this theory, the researchers show that it is possible to disentangle robust and non-robust features in standard image classification datasets. Specifically, given any training dataset, they can construct:

A "robustified" version for robust classification (see Figure 1A): The researchers show that it is possible to effectively remove non-robust features from a dataset. By creating a training set semantically similar to the original, standard training on it yields a model with good robust accuracy on the original, unmodified test set. This finding suggests that adversarial vulnerability is not necessarily tied to the standard training framework, but is also a property of the dataset. (A rough construction sketch is given below, after Figure 1.)

A "non-robust" version for standard classification (see Figure 1B): The researchers construct a training set whose inputs are nearly identical to the originals, but which all appear incorrectly labeled to humans. In fact, the inputs in the new training set are associated with their labels only through small adversarial perturbations (and hence only through non-robust features). Despite the lack of any predictive human-visible information, training on this dataset yields good accuracy on the original, unmodified test set.

Figure 1: Conceptual diagram of the experiments in Chapter 3. In A, the researchers disentangle features into robust and non-robust ones. In B, they construct a dataset that appears mislabeled to humans (via adversarial perturbations) but still yields good accuracy on the original test set.
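As a rough sketch of how the "robustified" dataset in Figure 1A could be built, in the spirit of the paper's approach: each new image is optimized so that its features under a robust (adversarially trained) model match those of an original training image, and is then paired with that image's original label. Here `robust_features` is assumed to return the penultimate-layer representation of such a robust model; the optimizer and hyperparameters are illustrative, not the paper's exact settings.

```python
import torch

def robustify(x, x_init, robust_features, steps=1000, lr=0.1):
    """Optimize x_r (started from x_init) so its robust features match those of x."""
    target = robust_features(x).detach()
    x_r = x_init.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x_r], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (robust_features(x_r) - target).pow(2).sum()  # match robust representation
        loss.backward()
        opt.step()
        x_r.data.clamp_(0, 1)  # stay inside the valid image range
    return x_r.detach()        # paired with the original label in the new dataset
```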

Finally, the connection between adversarial examples and non-robust features is studied rigorously in a concrete classification task: maximum-likelihood classification between two Gaussian distributions. The task is based on a model by Tsipras et al., which the MIT researchers extend in several ways:

  • First, in this setting, adversarial vulnerability can be expressed precisely as a difference between the intrinsic data geometry and the geometry of the adversary's perturbation set.

  • Second, the classifier obtained by robust training uses a combination of these two geometries.

  • Finally, the gradients of standard models can be significantly misaligned with the inter-class direction, capturing a phenomenon observed in practice in more complex settings.

The experiment

The core premise of the theoretical framework proposed in this study is that both robust and non-robust features exist in standard classification tasks and both provide useful information for classification. To confirm this, the researchers conducted several experiments, described conceptually in Figure 1.

Disentangling robust and non-robust features

Given the new robust training set (see Figure 2a), the researchers used standard (non-robust) training to produce a classifier, and then tested its performance on the original test set D, with results shown in Figure 2b. The classifier trained on this new dataset achieves good accuracy in both the standard and the adversarial setting.

Given the new non-robust training set (see Figure 2a), the researchers obtained a classifier in the same way. The results show that a classifier trained on this dataset also achieves good standard accuracy, but is barely robust at all (see Figure 2b).

These findings support the hypothesis that adversarial examples arise from (non-robust) features of the data itself.
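As a rough illustration of the evaluation protocol behind Figure 2b, assuming a PyTorch `test_loader` over the original test set and any attack function such as the FGSM sketch above:

```python
import torch

def evaluate(model, test_loader, attack=None):
    """Accuracy on clean inputs, or on attacked inputs if `attack` is given."""
    correct, total = 0, 0
    for x, y in test_loader:
        x_eval = attack(model, x, y) if attack is not None else x
        with torch.no_grad():
            correct += (model(x_eval).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total

# standard_acc = evaluate(model, test_loader)
# robust_acc   = evaluate(model, test_loader, attack=fgsm_attack)
```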

Non-robust features suffice for standard classification

Can a model trained only on non-robust features perform well on the standard test set? The researchers ran an experiment to find out.

New datasets are constructed by adversarially perturbing each input x toward a target class t and relabeling it as t, with t chosen either uniformly at random or deterministically from the original label. Standard (non-robust) classifiers are then trained on each of the three datasets (the original D and the two new ones) and evaluated on the original test set D, as shown in Table 1 below. The models trained on the new datasets generalize to the original test set, indicating that non-robust features are indeed useful in the standard setting.
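A rough sketch of how one example of such a dataset could be constructed, assuming a standard PyTorch classifier `model`, a single input x of shape (1, C, H, W) in [0, 1], and a target label t given as a length-1 long tensor; the L2 budget, step size, and iteration count are illustrative only:

```python
import torch
import torch.nn.functional as F

def targeted_attack(model, x, t, epsilon=0.5, step=0.1, iters=100):
    """L2-bounded targeted perturbation: nudge x until `model` predicts class t."""
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), t)
        grad, = torch.autograd.grad(loss, x_adv)
        # Step toward the target class (normalized gradient descent on the loss).
        x_adv = x_adv.detach() - step * grad / (grad.norm() + 1e-12)
        delta = (x_adv - x).renorm(p=2, dim=0, maxnorm=epsilon)  # project into the L2 ball
        x_adv = (x + delta).clamp(0, 1)
    return x_adv.detach()

# The new training pair is (x_adv, t): to a human it looks mislabeled, and only
# non-robust features tie the perturbed image to its new label.
```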

Transferability

Training classifiers of different architectures on the non-robust dataset, the researchers found that the test accuracy of each architecture tracks how well adversarial examples transfer from the original model to a standard classifier with that architecture. This supports the hypothesis that adversarial transferability arises when models learn similar brittle (non-robust) features of the underlying dataset.
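A rough sketch of how such transfer can be measured, assuming two independently trained PyTorch classifiers and any attack function (for example, the FGSM sketch earlier in this article):

```python
import torch

def transfer_rate(source_model, target_model, attack, x, y):
    """Fraction of adversarial examples crafted on the source model that also fool the target."""
    x_adv = attack(source_model, x, y)  # gradients of the *source* model only
    with torch.no_grad():
        fooled = target_model(x_adv).argmax(dim=1) != y
    return fooled.float().mean().item()
```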

The core theoretical framework of the paper

The paper proposes a theoretical framework for learning from (non-robust) features. Its core premise is that both robust and non-robust features exist in standard classification tasks and both provide useful information for classification. In Chapter 3 of the paper, the researchers provide evidence for this hypothesis by demonstrating that the two kinds of features can be separated.

The experiments in Chapter 3 show that the conceptual framework of robust and non-robust features strongly predicts the empirical behavior of state-of-the-art models on real datasets. To deepen the understanding of these phenomena, the MIT researchers instantiate the framework in a concrete setting and theoretically study the properties of the corresponding model.

The MIT researchers' model is similar to that of Tsipras et al. [Tsi+19] in the sense that it contains a dichotomy between robust and non-robust features, but the model in this study extends it in several ways:

  1. Adversarial vulnerability is explicitly expressed as the difference between the intrinsic data metric and the L2 metric.

  2. Robust learning corresponds to learning a combination of these two metrics.

  3. The gradients of adversarially trained models align better with the adversary's metric.

Vulnerability from metric misalignment (non-robust features)

Robust learning

Figure 4 shows a visualization of robust optimization and its effect under an L2-constrained adversary.

Figure 4: Empirical demonstration of the effect of Theorem 2. As the adversarial perturbation budget ε grows, the learned mean µ remains essentially constant, but the learned covariance "blends" with the identity matrix, effectively adding more and more uncertainty to the non-robust feature.
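To illustrate the intuition numerically (this is only a toy sketch of the blending effect, not the paper's exact Theorem 2), consider classifying N(+µ, Σ) against N(-µ, Σ), where the optimal linear rule has gradient direction Σ⁻¹µ. Blending the covariance with the identity, as the figure describes, makes this gradient align increasingly well with the inter-class direction µ; the blending weight `lam` below simply stands in for the dependence on ε:

```python
import numpy as np

mu = np.array([1.0, 1.0]) / np.sqrt(2)       # inter-class direction
Sigma = np.diag([1.0, 0.01])                 # second coordinate: highly predictive but non-robust

for lam in [0.0, 0.1, 1.0, 10.0]:
    Sigma_blend = Sigma + lam * np.eye(2)    # covariance "blended" with the identity
    grad = np.linalg.solve(Sigma_blend, mu)  # gradient direction of the linear classifier
    cos = grad @ mu / (np.linalg.norm(grad) * np.linalg.norm(mu))
    print(f"lam={lam:5.1f}  cosine(gradient, mu)={cos:.3f}")
```

As `lam` grows, the cosine approaches 1: the classifier's gradient points more and more along the inter-class direction, which relates to the gradient interpretability discussion below.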

Gradient interpretability

This article was compiled by Heart of the Machine. For reprints, please contact this official account for authorization.
