Oscar Chang, a doctoral student in computer science at Columbia University, has written a post on seven misconceptions about machine learning, which AI Technology Review has compiled below.

Here are seven misconceptions that circulate in the deep learning community, many of them long-held assumptions that recent research has called into question:

Myth 1: TensorFlow is a tensor library

Myth 2: Image datasets reflect the real image distribution of the natural world

Myth 3: Machine learning researchers do not use test sets for validation

Myth 4: Neural network training uses all the data points in the training set

Myth 5: We need batch normalization to train ultra-deep residual networks

Myth 6: Attention is better than convolution

Myth 7: Saliency maps are a robust way to interpret neural networks

Each is explained in turn below.

Myth 1: TensorFlow is a tensor library

In fact, TensorFlow is a matrix operation library, which differs significantly from a tensor operation library.

In the NeurIPS 2018 paper Computing Higher Order Derivatives of Matrix and Tensor Expressions, the researchers show that their new automatic differentiation library, based on tensor calculus, produces significantly more compact expression trees. This is because tensor calculus uses index notation, which makes forward mode and reverse mode identical.

Matrix calculus, by contrast, hides the indices for notational convenience, which often leads to overly complicated automatic differentiation expression trees.

Consider matrix multiplication C = AB. In forward mode we have:

dC = dA · B + A · dB

In reverse mode we have:

grad(A) = grad(C) · Bᵀ,  grad(B) = Aᵀ · grad(C)

To carry out the multiplications correctly, we have to pay attention to the order of multiplication and the placement of transposes. For a machine learning practitioner this is only a minor notational inconvenience, but for the program it carries a computational cost.
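As a quick illustration, here is a minimal NumPy sketch (my own, not from the paper; the variable names are illustrative) that checks the reverse-mode rules for C = AB against a finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.normal(size=(3, 4)), rng.normal(size=(4, 2))
G = rng.normal(size=(3, 2))           # upstream gradient dL/dC for some scalar loss L

# Reverse-mode rules for C = A @ B: note the ordering and the transposes.
grad_A = G @ B.T                      # dL/dA
grad_B = A.T @ G                      # dL/dB

# Finite-difference check of dL/dA, where L = sum(G * (A @ B)).
eps = 1e-6
fd = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        Ap = A.copy(); Ap[i, j] += eps
        fd[i, j] = (np.sum(G * (Ap @ B)) - np.sum(G * (A @ B))) / eps

print(np.allclose(grad_A, fd, atol=1e-4))   # True
```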

The next example is more striking: consider the determinant c = det(A). In forward mode we have:

dc = det(A) · tr(A⁻¹ · dA)

In reverse mode we have:

grad(A) = grad(c) · det(A) · A⁻ᵀ

Here it is clear that the two modes cannot be represented by the same expression tree, since they are composed of different operations.

In general, the automatic differentiation methods implemented in TensorFlow and other libraries (such as Mathematica, Maple, Sage, SymPy, ADOL-C, TAPENADE, Theano, PyTorch, and HIPS Autograd) produce different, and inefficient, expression trees in forward and reverse mode. Tensor calculus avoids these problems by preserving the commutativity of multiplication through index notation (see the paper for details).

The researchers tested their method for auto-differentiation in reverse mode, or back-propagation, on three different problems and measured the time it took to compute the Hessian matrix.

The first problem is optimizing a quadratic function of the form xᵀAx; the second is solving a logistic regression; the third is solving a matrix factorization.
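For a sense of what is being benchmarked, here is a minimal PyTorch sketch (my own illustration, not the authors' tensor-calculus library) that computes the Hessian of the quadratic form xᵀAx with an ordinary autodiff library:

```python
import torch
from torch.autograd.functional import hessian

n = 100
A = torch.randn(n, n)

def quadratic(x):
    # f(x) = x^T A x; its Hessian is A + A^T
    return x @ A @ x

x = torch.randn(n)
H = hessian(quadratic, x)                       # built by differentiating the expression tree twice
print(torch.allclose(H, A + A.T, atol=1e-4))    # True
```

The paper's point is that building such second derivatives through repeated passes over a matrix-calculus expression tree is far more expensive than it needs to be.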

On CPU, the new method is two orders of magnitude faster than popular automatic differentiation libraries such as TensorFlow, Theano, PyTorch, and HIPS Autograd.

On GPUs, the researchers found that the speedup was even more dramatic, outpacing these popular libraries by nearly three orders of magnitude.

Significance: Computing derivatives of quadratic or higher-order functions with current deep learning libraries costs more than it should. This includes computing general fourth-order tensors such as Hessians (e.g., in MAML and second-order Newton's methods). Fortunately, second-order functions are not common in deep learning. But they are widespread in traditional machine learning: the SVM dual problem, least-squares regression, LASSO, Gaussian processes...

Myth 2: Image datasets reflect the real image distribution of the natural world

One might think that neural networks are now better than humans at object recognition. Not quite. On ImageNet and other curated image datasets they may indeed outperform humans, but on real images from the natural world they are no better at object recognition than an ordinary adult. This is because the distribution of images sampled by current image datasets differs from the distribution of images in the real world as a whole.

In the 2011 paper Unbiased Look at Dataset Bias, the researchers investigated the existence of dataset bias by training a classifier on 12 popular image datasets to determine which dataset a given image comes from.

A random guess would be correct 1/12 ≈ 8% of the time, whereas the experiment achieved more than 75% accuracy.

The researchers also trained an SVM on HOG features and found that it reached 39% accuracy, far above chance. If the experiment were repeated today with a state-of-the-art CNN, the classifier would likely do even better.
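A minimal sketch of this "name that dataset" experiment, assuming the images have already been loaded as same-size grayscale arrays with dataset labels (the helper name and parameters here are my own, not from the paper):

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

def name_that_dataset(images, dataset_labels):
    # images: list of same-size grayscale arrays; dataset_labels: which dataset each came from
    X = np.stack([hog(img, pixels_per_cell=(16, 16)) for img in images])
    X_tr, X_te, y_tr, y_te = train_test_split(X, dataset_labels, test_size=0.2, random_state=0)
    clf = LinearSVC().fit(X_tr, y_tr)
    # Accuracy well above 1/12 indicates dataset bias.
    return clf.score(X_te, y_te)
```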

If image datasets truly represented real images from the natural world, it should not be possible to tell which dataset a particular image came from.

But biases in the data make each dataset distinctive. For example, ImageNet contains so many "racing cars" that they can hardly be considered representative of "cars" in the general sense.

The researchers further measured the value of each dataset by training a classifier on one dataset and evaluating its performance on the others. By this metric, LabelMe and ImageNet are the least biased datasets, scoring 0.58 in a "basket of currencies" of the other datasets. All the scores are less than 1, meaning that training on a different dataset always yields lower test accuracy. In an ideal world without dataset bias, some of these scores should be above 1.
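A rough sketch of this cross-dataset evaluation, assuming pre-extracted features and train/test splits for each dataset (this is my own simplification of the paper's protocol; the function and variable names are illustrative):

```python
from sklearn.linear_model import LogisticRegression

def cross_dataset_value(datasets):
    # datasets: dict name -> ((X_train, y_train), (X_test, y_test))
    models = {name: LogisticRegression(max_iter=1000).fit(*splits[0])
              for name, splits in datasets.items()}
    self_acc = {name: models[name].score(*datasets[name][1]) for name in datasets}
    # Relative "value" of training on src when deployed on tgt:
    # cross-dataset accuracy divided by within-dataset accuracy (should be ~1 if unbiased).
    return {(src, tgt): models[src].score(*datasets[tgt][1]) / self_acc[tgt]
            for src in datasets for tgt in datasets if src != tgt}
```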

The author concludes gloomily:

"So what is the value of current datasets when used to train algorithms that will be deployed in the real world? The answer that emerges can be summarized as: better than nothing, but not by much."

Myth 3: Machine learning researchers do not use test sets for validation

In the very first machine learning course, we learn to split a dataset into a training set, a validation set, and a test set. A model is trained on the training set, and its performance is evaluated on the validation set to guide the developer in tuning the model so that it will perform as well as possible in the real world. The test set should not be touched until tuning is finished, so that it can provide an unbiased estimate of how the model will actually perform in a real-world scenario. If a developer "cheats" by using the test set during training or validation, the model risks overfitting to biases specific to that dataset, and such bias information does not generalize beyond the dataset.
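As a minimal sketch of the standard protocol (using scikit-learn; the placeholder data and the 60/20/20 proportions are just an illustrative choice):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)   # placeholder data

# Carve off a held-out test set first, then split the remainder into train and validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

# Tune the model against (X_val, y_val); evaluate on (X_test, y_test) exactly once, at the very end.
```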

Machine learning research is highly competitive, and new algorithms and models are usually judged by their test-set performance. As a result, researchers have little incentive to write or submit papers whose test-set results are not state of the art. Taken together, this means that in machine learning research, using the test set for validation is in practice a common phenomenon.

What are the effects of this “cheating”?

In the paper Do CIFAR-10 Classifiers Generalize to CIFAR-10?, the researchers investigated this question by creating a new test set for CIFAR-10. To do so, they parsed the image annotations in the Tiny Images repository and followed the sub-selection process used in the original data collection.

The researchers chose CIFAR-10 because it is one of the most widely used datasets in machine learning and was the second most popular dataset at NeurIPS 2017 (after MNIST). The creation process of CIFAR-10 is also well documented and publicly available, and the vast Tiny Images repository contains enough fine-grained label data to make it possible to rebuild a test set with minimal distribution shift.

The researchers found that many neural network models showed a significant drop in accuracy (4% to 15%) when switching from the original test set to the new one, but the relative ranking of the models remained fairly stable.

In general, the better-performing models showed a smaller drop in accuracy than the lower-performing ones. This is encouraging: at least on CIFAR-10, as the research community develops better machine learning models and methods, the generalization loss caused by this kind of "cheating" is becoming less severe.

Myth 4: Neural network training uses all the data points in the training set

It is often said that data is the new wealth, and that the more data we have, the better we can train our data-hungry, over-parameterized deep learning models.

In the ICLR 2019 paper An Empirical Study of Example Forgetting During Deep Neural Network Learning, the researchers show that significant redundancy exists in several common small image datasets. Strikingly, on CIFAR-10, 30% of the data points can be removed without significantly affecting test-set accuracy.

A forgetting event occurs when the network classifies a sample correctly at time t but misclassifies it at time t+1, where time is measured in the number of SGD updates. To keep the tracking tractable, the researchers evaluate forgetting only on the samples in the current mini-batch at each SGD update, rather than on every sample in the dataset. Samples that never undergo a forgetting event during training are called unforgettable samples.
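A minimal sketch of how such tracking could be wired into a training loop (my own illustration of the bookkeeping, not the authors' code; the function and array names are hypothetical):

```python
import numpy as np

def update_forgetting(prev_correct, forgetting_counts, batch_indices, batch_correct):
    """Update per-example forgetting statistics for the examples in one mini-batch.

    prev_correct: bool array, whether each example was correct the last time it was seen
    forgetting_counts: int array, number of forgetting events per example
    batch_indices: indices of the examples in this mini-batch
    batch_correct: bool array, whether the current model classifies them correctly
    """
    was_correct = prev_correct[batch_indices]
    # A forgetting event: previously correct, now wrong.
    forgetting_counts[batch_indices] += (was_correct & ~batch_correct).astype(int)
    prev_correct[batch_indices] = batch_correct

# Examples with forgetting_counts == 0 (among those learned at least once)
# are the "unforgettable" ones.
```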

The researchers found that 91.7% of MNIST, 75.3% of permuted MNIST, 31.3% of CIFAR-10, and 7.62% of CIFAR-100 consist of unforgettable samples. This makes intuitive sense: as the diversity and complexity of an image dataset increase, the network forgets samples more easily.

Forgettable samples seem to exhibit more unusual and distinctive features than unforgettable ones. The researchers liken them to support vectors in an SVM, because they appear to delineate the decision boundary.

Unforgettable samples, by contrast, encode mostly redundant information. If we sort the samples by how forgettable they are, we can compress the dataset by removing most of the unforgettable ones.

On CIFAR-10, 30% of the data can be removed without affecting test-set accuracy, and removing 35% costs only 0.2% accuracy. If the 30% were chosen at random rather than by forgettability, accuracy would drop by a more substantial 1%.
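A sketch of that pruning step, given forgetting counts like those tracked above (the 30% fraction follows the paper; the function and variable names are my own):

```python
import numpy as np

def prune_by_forgetting(forgetting_counts, fraction=0.30):
    # Drop `fraction` of the dataset, starting with the least-forgotten
    # (most redundant) examples, and keep the rest.
    order = np.argsort(forgetting_counts)          # least forgotten first
    n_drop = int(fraction * len(forgetting_counts))
    return order[n_drop:]                          # indices of examples to keep
```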

Similarly, on CIFAR-100, 8% of the data can be removed without compromising the accuracy of the test set.

These findings indicate that there is significant data redundancy in neural network training, just as in SVM training, non-support vector data can be removed without affecting model decisions.

Meaning: If we could determine which samples are unforgettable before training starts, we could save storage space and training time by removing them.

Myth 5: We need batch normalization to train ultra-deep residual networks

For a long time it was believed that "training a deep network from randomly initialized parameters by gradient descent, directly optimizing a supervised objective (such as the log probability of the correct label), would not work well."

Since then, there have been a number of clever random initialization methods, activation functions, optimization methods, and other structural innovations such as residual connections to make it easier to train deep neural networks using gradient descent.

But the real breakthrough came with the introduction of batch normalization (and subsequent normalization techniques), which mitigates vanishing and exploding gradients by controlling the scale of the activations at each layer of a deep network.

Notably, in this year's paper Fixup Initialization: Residual Learning Without Normalization, the researchers show that a 10,000-layer deep network can be trained effectively with vanilla SGD, without introducing any normalization.

The researchers compared one epoch of training for residual networks of different depths on CIFAR-10 and found that standard initialization methods already failed at 100 layers, while both Fixup and batch normalization succeeded at 10,000 layers.

Through theoretical analysis, the researchers show that "the gradient norm of certain layers is, in expectation, lower-bounded by a quantity that grows with network depth", i.e., the exploding gradient problem.

To avoid this problem, the core idea of Fixup is to rescale the weights of the m layers inside each of the L residual branches by a factor that depends on both m and L.
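A simplified PyTorch-style sketch of that idea, assuming a residual branch made of m ≥ 2 weight layers (this is my own rendering of the recipe, using the paper's L^(-1/(2m-2)) scaling; see the paper for the full initialization, which also adds scalar multipliers and biases):

```python
import torch.nn as nn

def fixup_scale_branch(branch_layers, num_branches_L):
    """Rescale the weights of one residual branch, Fixup-style.

    branch_layers: the m weight layers (e.g., nn.Linear / nn.Conv2d) in this branch
    num_branches_L: total number of residual branches L in the network
    """
    m = len(branch_layers)
    scale = num_branches_L ** (-1.0 / (2 * m - 2))   # factor depending on both m and L
    for layer in branch_layers[:-1]:
        layer.weight.data.mul_(scale)
    # The last layer of each residual branch is initialized to zero.
    nn.init.zeros_(branch_layers[-1].weight)
```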

Fixup makes it possible to train a 110-layer residual network on CIFAR-10 with a large learning rate, and its test-set performance is comparable to that of the same architecture trained with batch normalization.

The researchers further demonstrate that, without any normalization, Fixup-based networks achieve scores comparable to LayerNorm-based networks on the ImageNet dataset and on English-German machine translation.

Myth 6: Attention is better than convolution

In the machine learning community, there is a growing consensus that attention is a superior alternative to convolution. Notably, Vaswani et al. observed that "the computational cost of a separable convolution is equal to that of a self-attention layer combined with a point-wise feed-forward layer."

Even state-of-the-art GANs show self-attention outperforming standard convolution at modeling long-range, multi-scale dependencies.

In the ICLR 2019 paper Pay Less Attention with Lightweight and Dynamic Convolutions, the researchers question the parameter efficiency and effectiveness of self-attention at modeling long-range dependencies, and propose convolution variants, inspired by self-attention, that are more parameter-efficient.

Lightweight convolutions are depthwise-separable, softmax-normalized along the temporal dimension, share weights across the channel dimension, and reuse the same weights at every time step (like an RNN). Dynamic convolutions are lightweight convolutions that use different weights at each time step.
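A minimal PyTorch sketch of a lightweight convolution over a sequence, assuming the channels are divided into a few weight-sharing groups (this is a simplification of the paper's implementation; the function name and shapes are my own):

```python
import torch
import torch.nn.functional as F

def lightweight_conv(x, kernel):
    """x: (batch, channels, time); kernel: (num_heads, kernel_size), channels divisible by num_heads."""
    B, C, T = x.shape
    H, K = kernel.shape
    # Softmax-normalize the kernel along the temporal (kernel) dimension.
    w = F.softmax(kernel, dim=-1)
    # Share the same kernel across all channels within each of the H groups; depthwise (groups=C).
    w = w.repeat_interleave(C // H, dim=0).unsqueeze(1)       # (C, 1, K)
    return F.conv1d(x, w, padding=K // 2, groups=C)

# Example: batch of 4, 8 channels, 50 time steps, 2 heads, kernel width 3
out = lightweight_conv(torch.randn(4, 8, 50), torch.randn(2, 3))
```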

These techniques make lightweight and dynamic convolutions orders of magnitude more efficient than traditional, non-separable convolutions.

The researchers also demonstrate that these new convolutions match or exceed self-attention baselines, with a similar or smaller number of parameters, on machine translation, language modeling, and abstractive summarization tasks.

Myth 7: Saliency maps are a robust way to interpret neural networks

Although neural networks are often regarded as black boxes, there have been many attempts to interpret them. Saliency maps, and other similar methods that assign importance scores to features or training samples, are among the most popular.

It is tempting to conclude that a given image is classified a certain way because particular parts of the image played an important role in the network's decision. There are several ways to compute saliency maps, usually based on a network's activations on a given image and on the gradients propagated through the network.
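As a minimal illustration, a gradient-based saliency map can be sketched in PyTorch as follows (a generic vanilla-gradients example of my own, not the specific methods studied in the paper):

```python
import torch

def vanilla_saliency(model, image, target_class):
    # image: tensor of shape (1, 3, H, W); we need gradients w.r.t. the input pixels
    image = image.clone().detach().requires_grad_(True)
    score = model(image)[0, target_class]
    score.backward()
    # Pixel importance = maximum absolute gradient across the colour channels.
    return image.grad.abs().max(dim=1).values.squeeze(0)
```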

In the AAAI 2019 paper Interpretation of Neural Networks is Fragile, the researchers show that the saliency map of a given image can be distorted by an imperceptible perturbation.

"The monarch butterfly is classified as a monarch not because of the patterns on its wings, but because of some unimportant green leaves in the background."

High-dimensional images usually lie close to the decision boundaries built by deep neural networks, which makes them vulnerable to adversarial attacks. An adversarial attack moves an image across the decision boundary, whereas an adversarial interpretation attack moves the image along the contours of the decision boundary, keeping it within the same decision region.

To carry out this attack, the basic method the researchers use is a variant of FGSM (the Fast Gradient Sign Method) introduced by Goodfellow et al., one of the earliest methods for mounting effective adversarial attacks. This also suggests that other, more recent and sophisticated adversarial attacks could likewise be used to attack the interpretability of neural networks.
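For reference, here is a minimal PyTorch sketch of the basic FGSM step (the interpretation attack in the paper replaces the classification loss with an objective that distorts the saliency map while keeping the prediction; this generic version is only illustrative, and the function name is my own):

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, eps=0.01):
    # One signed-gradient step on the classification loss: the generic FGSM attack.
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    perturbed = image + eps * image.grad.sign()
    return perturbed.clamp(0, 1).detach()
```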

Meaning:

As deep learning becomes more common in high-stakes applications such as medical imaging, we must be careful about how the decisions made by neural networks are interpreted. For example, while it is great that a CNN can identify a spot on an MRI image as a malignant tumor, such results should not be trusted if they rest on fragile interpretation methods.