From The Gradient, by Alan L. Yuille and Chenxi Liu. Translated by Heart of the Machine.

Deep learning has been at the core of recent progress in artificial intelligence. Despite its great success, it has clear limitations: compared with the human visual system, deep learning is far less versatile, flexible, and adaptable, and it can fail in unexpected ways on complex natural images. In this article, scholars from Johns Hopkins University present some of the limitations of deep learning and their ideas on how to address them.

The researchers argue that deep neural networks, in their current form, are unlikely to be the best solution for building general-purpose intelligent machines or for understanding the mind and the brain, although many mechanisms of deep learning will likely persist in future systems.

This article is a condensed version of the paper "Deep Nets: What Have They Ever Done for Vision?"

  • Paper link: https://arxiv.org/pdf/1805.04025.pdf

The history of deep learning

What we are witnessing now is the third rise of artificial intelligence. The first two waves, in the 1950s–1960s and the 1980s–1990s, had considerable impact at the time, but both eventually cooled off because neural networks neither delivered the performance improvements they were supposed to nor helped us understand biological visual systems. The third wave, beginning in the early 2000s, is different: deep learning has decisively outperformed alternative approaches on many benchmarks and in real-world applications. Most of the basic ideas of deep learning were developed during the second wave, but their power was not unleashed until large datasets and sufficient computing power (especially GPUs) became available.

The ups and downs of deep learning reflect changes in intellectual fashion and the shifting popularity of different algorithms. The second wave rose partly because classic AI had promised much and delivered little, which contributed to the AI winter of the mid-1980s. The retreat of the second wave coincided with the rise of support vector machines, kernel methods, and related approaches. Credit is due to the neural network researchers who persisted despite disappointing results; over time, their methods re-emerged. Today it is hard to find research that does not involve neural networks, and that is not entirely a good thing either. One cannot help but wonder whether the field might have progressed faster if researchers had pursued a more diverse set of approaches rather than following trends. It is also worrying that students of artificial intelligence often ignore older techniques altogether and chase the newest fashion.

Success and failure

Before AlexNet came along, the computer vision community was skeptical of deep learning. In 2012, AlexNet swept all competitors in the ImageNet image classification challenge, and in the following years researchers proposed ever better neural network architectures for object classification. Deep learning was also quickly adapted to other visual tasks, such as object detection, where an image contains one or more objects and the goal is to localize and label them: an initial stage proposes possible object positions and sizes, and the network then refines this information to determine the final category and location. On the most important object recognition benchmark before ImageNet, the PASCAL object recognition challenge, these methods surpassed the previous state of the art, Deformable Part Models. Other deep learning architectures have likewise improved performance on several classic tasks, as shown below:

Figure 1: Deep learning can perform many different visual tasks, including boundary detection, semantic segmentation, semantic boundaries, surface normal estimation, saliency, human part detection, and object detection.

But while deep learning surpasses previous techniques, it is not general-purpose. Here we identify three main limitations.

First, deep learning almost always requires large amounts of annotated data. This biases computer vision researchers toward problems for which annotation is feasible, rather than problems that are genuinely important.

There are ways to reduce the need for supervision, including transfer learning, few-shot learning, unsupervised learning, and weakly supervised learning, but so far their achievements have been less impressive than those of supervised learning.

Second, deep networks perform well on benchmark datasets but can perform poorly on real-world images outside those datasets. All datasets have their own biases. Such biases were especially evident in early vision datasets, and researchers found that neural networks would exploit them as shortcuts, for example using the background to make judgments (fish used to be very easy to detect in Caltech101 because fish images were the only ones with water in the background). These effects can be reduced by using larger datasets and deeper networks, but the problem remains.

In the figure below, a deep network trained on ImageNet to recognize sofas may fail on viewpoints that are under-represented in its training images. Deep networks are biased toward the typical cases and give little weight to the rare ones in a dataset. In real-world applications these biases are problematic, and using such systems for visual detection can have serious consequences: the datasets used to train self-driving cars, for example, almost never include a baby sitting in the middle of the road.

Figure 2: UnrealCV allows vision researchers to easily manipulate synthetic scenes, for example by changing the viewpoint of a sofa. We found that the average precision (AP) of Faster R-CNN when detecting the sofa ranged from 0.1 to 1.0, showing extreme sensitivity to viewpoint. This is presumably because the training data was biased toward particular viewpoints.
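To make this kind of experiment concrete, here is a minimal sketch (not the authors' code) of probing a pretrained detector's sensitivity to viewpoint. It assumes images of a sofa pre-rendered at different camera azimuths, for example with UnrealCV; the filenames sofa_azXXX.png are hypothetical placeholders, and torchvision's off-the-shelf Faster R-CNN stands in for the exact model used in the study.

```python
# Minimal sketch: probe a pretrained detector's sensitivity to viewpoint
# using pre-rendered images of a sofa (hypothetical filenames).
import torch
from torchvision.io import read_image, ImageReadMode
from torchvision.models.detection import fasterrcnn_resnet50_fpn

COUCH_ID = 63  # "couch" in the COCO label set used by torchvision detectors

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

for azimuth in range(0, 360, 30):
    # images assumed to be rendered beforehand at each azimuth
    img = read_image(f"sofa_az{azimuth:03d}.png", mode=ImageReadMode.RGB)
    img = img.float() / 255.0
    with torch.no_grad():
        pred = model([img])[0]
    scores = pred["scores"][pred["labels"] == COUCH_ID]
    best = scores.max().item() if scores.numel() else 0.0
    print(f"azimuth {azimuth:3d} deg: best 'couch' score = {best:.2f}")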

Third, deep networks are overly sensitive to changes in an image that would not fool human perception. They are sensitive not only to standard adversarial attacks, which cause imperceptible changes to the image, but also to changes in context. Figure 3 shows the effect of photoshopping a guitar into a picture of a monkey in the jungle. This causes the network to misidentify the monkey as a human and the guitar as a bird, presumably because a guitar is more likely to be held by a human than by a monkey, and a bird is more likely than a guitar to appear near a monkey in the jungle. Recent work has given many examples of deep networks being overly sensitive to context, such as placing an elephant in a room.

Figure 3: Deep networks fail under occlusion. Left: occluded by a motorcycle, the monkey is identified as a human. Middle: occluded by a bicycle, the monkey is identified as a human, and the jungle context leads the network to misidentify the handlebars as a bird. Right: occluded by a guitar, the monkey is identified as a human, and the jungle leads the network to misidentify the guitar as a bird.
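This context-sensitivity experiment can be approximated with a few lines of code. The sketch below is our illustration, not the original study: it pastes a patch cropped from one image into another and compares a pretrained classifier's top prediction before and after. The filenames, patch size, and paste coordinates are arbitrary assumptions.

```python
# Minimal sketch: compare a classifier's top prediction before and after
# pasting an occluder patch into the scene (hypothetical input files).
import torch
from torchvision.io import read_image, ImageReadMode
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights).eval()
preprocess = weights.transforms()
labels = weights.meta["categories"]

def top1(img):
    """Return the top-1 label and its probability for a uint8 CHW image."""
    with torch.no_grad():
        probs = model(preprocess(img).unsqueeze(0)).softmax(dim=1)
    p, idx = probs.max(dim=1)
    return labels[idx.item()], p.item()

scene = read_image("monkey.jpg", mode=ImageReadMode.RGB)
occluder = read_image("guitar.jpg", mode=ImageReadMode.RGB)[:, :120, :120]

print("original:", top1(scene))
occluded = scene.clone()
# paste the 120x120 patch at an arbitrary location (assumes the scene is big enough)
occluded[:, 200:320, 100:220] = occluder
print("occluded:", top1(occluded))
```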

This over-sensitivity to context can be traced to the limited size of datasets. For any object, a dataset contains only a limited number of contexts, so the network develops biases toward them. For example, in early image captioning datasets we observed that giraffes appeared only near trees, so the generated captions failed to mention giraffes in images without trees, even when the giraffe was the dominant object.

For data-driven approaches such as deep networks, capturing the enormous variability of backgrounds and the many nuisance factors is a major difficulty. Ensuring that a network handles all of them would seem to require an arbitrarily large dataset, which in turn poses enormous challenges for both training and testing.

“Large data sets” are not big enough

Combinatorial explosion

None of the problems mentioned above is necessarily fatal to deep learning, but they are early signs of the real problem: the set of possible real-world images is combinatorially large, so no dataset, however big, can represent the complexity of the real world.

What does it mean for a domain to be combinatorial? Imagine constructing a visual scene by selecting objects from an object dictionary and placing them in different configurations: the number of ways to do so is exponential. Even an image containing a single object has similar complexity, because the object can be occluded in countless ways and the background can vary endlessly.
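A quick back-of-the-envelope calculation shows how fast this blows up. The numbers below are illustrative assumptions, not figures from the article: even a modest object dictionary and a coarse grid of positions yield astronomically many distinct scenes.

```python
# Illustrative count of distinct scenes: choose k object types out of N
# and place each in one of P grid cells (ignoring pose, occlusion, lighting,
# background, all of which only make the count larger).
from math import comb

N, P = 1000, 100          # object dictionary size, number of grid positions
for k in range(1, 6):     # number of objects in the scene
    scenes = comb(N, k) * P**k
    print(f"{k} objects: ~{scenes:.2e} distinct scenes")
```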

Although humans adapt naturally to such variation in the visual environment, deep networks are far more sensitive and error-prone, as Figure 3 shows. We note that this combinatorial explosion may not arise in some visual tasks; deep networks are typically very successful in medical imaging, for example, because there is relatively little contextual variation (the pancreas is always close to the duodenum). But for many applications, the complexity of the real world cannot be captured without an exponentially large dataset.

This flaw poses significant problems, because the standard paradigm of training and testing a model on a finite random sample becomes impractical: such samples can never be large enough to characterize the underlying distribution of the data. We therefore face two new questions:

1. In tasks that require large data sets to capture real-world combinatorial complexity, how can algorithms be trained on limited data sets to perform well?

2. If we can only test on a limited subset, how can we effectively test these algorithms to ensure that they perform well on large data sets?

Overcoming the combinatorial explosion

In their current form, deep neural networks and other methods may be unable to overcome the combinatorial explosion; for both training and testing, datasets never seem big enough. Here are some potential solutions.

Compositionality

Compositionality is a fundamental principle that can be expressed poetically as "an expression of faith that the world is knowable, that one can take things apart, understand them, and mentally reassemble them at will." The key assumption is that structures are hierarchical, composed of more elementary substructures combined according to a set of grammatical rules. This implies that the substructures and the grammar can be learned from finite data and will generalize to combinatorial situations.

Unlike deep networks, compositional models require structured representations that make their structures and substructures explicit. Compositional models can extrapolate beyond the data they have seen, reason about the system, intervene, perform diagnostics, and answer many different questions on the basis of the same underlying knowledge structure. As Stuart Geman put it, "the world is compositional or God exists"; otherwise God would have to handcraft human intelligence. Although deep networks exhibit a limited form of compositionality, with high-level features built from the responses of low-level features, this is not the compositionality discussed here.
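To make the idea of an explicit, hierarchical representation concrete, here is a toy sketch (our illustration, not the paper's model): an object is described as a tree of named parts, a simple grammar rule says a part is present only if all of its sub-parts are, and evidence is combined recursively from the leaves up.

```python
# Toy compositional representation: an object is a tree of parts, and the
# score of a part is computed bottom-up from detector scores on its leaves.
from dataclasses import dataclass, field

@dataclass
class Part:
    name: str
    children: list = field(default_factory=list)

    def score(self, evidence):
        """Combine leaf evidence (a dict of detector scores) bottom-up."""
        if not self.children:
            return evidence.get(self.name, 0.0)
        # toy grammar rule: a part is present only if all sub-parts are (min-pooling)
        return min(child.score(evidence) for child in self.children)

face = Part("face", [
    Part("eyes", [Part("left_eye"), Part("right_eye")]),
    Part("nose"),
    Part("mouth"),
])

evidence = {"left_eye": 0.9, "right_eye": 0.8, "nose": 0.7, "mouth": 0.95}
print(face.score(evidence))   # 0.7: limited by the weakest required sub-part
```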

Figure 4: From (a) to (c), variability and occlusion increase. (c) is an example from a large combinatorial dataset, essentially a CAPTCHA. Interestingly, studies on CAPTCHAs show that compositional models perform well on them while deep networks perform poorly.

Figure 4 shows an example of compositionality, related to analysis by composition.

Several conceptual advantages of compositional models have been demonstrated on visual problems, such as performing multiple tasks with the same underlying model and recognizing CAPTCHAs. Non-visual examples make the same point. Attempts to train deep networks to perform IQ tests have not been successful: the goal in this task is to predict the missing image in a 3×3 grid, given the images in the other eight cells, and the underlying rules are compositional (possibly with distractors). Conversely, for some natural language applications, the dynamic architecture of neural module networks appears flexible enough to capture meaningful compositions and outperforms conventional deep networks; in fact, we recently confirmed that, after joint training, the individual modules do carry out their intended compositional functions (e.g., AND, OR, FILTER(RED)).
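The module-composition idea behind neural module networks can be illustrated with a toy example (a deliberate simplification on our part; real modules are small learned networks operating on image features). Each module maps attention over scene objects to new attention, and modules such as FILTER and AND are chained according to the structure of the question.

```python
# Toy illustration of composing modules over attention weights on scene objects.
import numpy as np

objects = [{"color": "red", "shape": "cube"},
           {"color": "blue", "shape": "cube"},
           {"color": "red", "shape": "ball"}]

def FILTER(attr, value):
    """Return a module that keeps attention only on objects with attr == value."""
    return lambda att: att * np.array([o[attr] == value for o in objects], float)

def AND(att_a, att_b):
    """Intersect two attention maps."""
    return np.minimum(att_a, att_b)

ones = np.ones(len(objects))
red_cubes = AND(FILTER("color", "red")(ones), FILTER("shape", "cube")(ones))
print(red_cubes)   # [1. 0. 0.] -> only the red cube survives
```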

Compositional models have many desirable theoretical properties, such as interpretability and the ability to generate samples. This makes errors easier to diagnose, so they are harder to fool than black-box methods such as deep networks. But learning compositional models is hard, because it requires learning the building blocks and the grammars (and even the nature of the grammars is debated). Moreover, to perform analysis by composition, they need generative models of objects and scene structures, and putting distributions on images is difficult except in special cases such as faces, letters, and regularly textured images.

What's more, dealing with the combinatorial explosion requires learning causal models of the 3D world and of how those models generate images. Studies of human infants suggest that they learn by building causal models that predict the structure of their environment; this causal understanding allows them to learn from limited data and generalize to new environments. The contrast is like that between Newton's laws, which give a causal understanding with a minimal number of free parameters, and the Ptolemaic model of the solar system, which gives very accurate predictions but requires a lot of data to fix its details.

Testing on combinatorial data

A key difficulty in testing vision algorithms against real-world combinatorial complexity is that we can only ever test on a finite amount of data. Game theory offers some guidance: it focuses on the worst case rather than the average case. As argued above, average-case results on a finite dataset may be meaningless if the dataset does not capture the combinatorial complexity of the problem. And if the goal is to develop vision algorithms for self-driving cars or for diagnosing cancer in medical images, it clearly makes sense to focus on the worst case, where failure can have serious consequences.

If failure modes can be captured in a low-dimensional space of hazard factors, we can study them using computer graphics and grid search. But for most visual tasks, especially those involving combinatorial data, it is hard to identify a small number of hazard factors that can be isolated and tested. One strategy is to extend the standard notion of an adversarial attack to include non-local structure, allowing complex operations that change an image or scene without significantly affecting human perception, such as occlusion or changing the physical properties of the objects being viewed. Applying this strategy to vision algorithms that deal with combinatorial data remains challenging; however, if the algorithms are designed with compositionality in mind, their explicit structure may make it possible to diagnose and characterize their failure modes.
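As a concrete illustration of worst-case testing over a low-dimensional space of hazard factors, the sketch below grid-searches viewpoint and occlusion level and reports the minimum score rather than the average. render_scene and detector_score are hypothetical hooks into a graphics engine and the model under test; the grid values are arbitrary.

```python
# Minimal sketch of worst-case evaluation over a grid of hazard factors.
import itertools

azimuths = range(0, 360, 30)            # viewpoint hazard factor
occlusion_levels = [0.0, 0.2, 0.4, 0.6]  # fraction of the object occluded

def evaluate(render_scene, detector_score):
    """Return (worst score, azimuth, occlusion) over the hazard-factor grid."""
    worst = None
    for az, occ in itertools.product(azimuths, occlusion_levels):
        score = detector_score(render_scene(azimuth=az, occlusion=occ))
        if worst is None or score < worst[0]:
            worst = (score, az, occ)
    return worst

# Usage (with user-supplied hooks):
#   worst = evaluate(render_scene, detector_score)
#   print("worst-case score %.2f at azimuth %d, occlusion %.1f" % worst)
```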

Summary

A few years ago, Aude Oliva and Alan Yuille co-organized a National Science Foundation-sponsored symposium on the Frontiers of Computer Vision (MIT CSAIL, 2011), at which a frank exchange of views was encouraged. Attendees were sharply divided over the potential of deep networks for computer vision. Yann LeCun boldly predicted that everyone would soon be using deep networks, and his prediction was right. Their remarkable success has made computer vision enormously popular, greatly increased the interaction between academia and industry, led to applications of computer vision in many fields, and produced many other important research results. Even so, deep networks still face huge challenges that we must overcome if we are to achieve general artificial intelligence and understand biological vision systems. Some of our concerns echo those raised in recent critiques of deep networks. As researchers begin to tackle increasingly complex visual tasks under increasingly realistic conditions, arguably the toughest challenge is to develop algorithms that can handle the combinatorial explosion. While deep networks may well be part of the solution, we believe complementary approaches involving compositional principles and causal models are also needed to capture the basic structure of the data. Moreover, in the face of the combinatorial explosion, we need to rethink how we train and evaluate vision algorithms.

Original article: https://thegradient.pub/the-limitations-of-visual-deep-learning-and-how-we-might-fix-them/