From its roots in natural language processing to its prowess in image classification and generation, is the all-conquering Transformer poised to take over artificial intelligence?

Imagine walking into your local hardware store and seeing a new type of hammer on the shelf. You’ve heard about this hammer: It hits faster and more accurately than other hammers, and over the years it has made many others obsolete for most purposes.

And with a few tweaks, such as an attachment here and a screw there, the hammer can turn into a saw, cutting as fast as any alternative. Some experts at the forefront of tool development say the hammer could herald the convergence of all tools into one device.

A similar story is playing out in artificial intelligence. The multifunctional new hammer is a kind of artificial neural network called the Transformer: a network of nodes trained on existing data to “learn” how to perform certain tasks. It was originally built to handle language, but it has recently begun to influence other areas of AI.

The Transformer first appeared in a 2017 paper, “Attention Is All You Need.” In other AI approaches, the system first focuses on local patches of the input data and then builds up to the whole. In a language model, for example, nearby words are grouped together first. The Transformer, by contrast, runs its operations so that every element in the input data connects, or attends, to every other element. The researchers call this “self-attention.” It means that as soon as training begins, the Transformer can see traces of the entire data set.
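To make “self-attention” concrete, here is a minimal NumPy sketch of a single attention head, with made-up toy dimensions and random weights rather than anything from the paper: every token is projected into a query, a key and a value, and a softmax over the query–key scores decides how much each token draws on every other token.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a sequence X of shape (n, d).

    Every position produces a query, a key and a value; the softmax of the
    scaled query-key dot products decides how strongly each position attends
    to every other position, so even one layer links distant elements.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # project the inputs
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # pairwise compatibility
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over positions
    return weights @ V                             # attention-weighted values

# Toy example: 5 tokens with 8-dimensional embeddings and random weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # -> (5, 8)
```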

Until the Transformer came along, AI’s progress in language tasks lagged behind progress in other areas. “Natural language processing is kind of a latecomer to the deep learning revolution that’s happened over the last 10 years,” says Anna Rumshisky, a computer scientist at the University of Massachusetts, Lowell. “In a sense, NLP was behind computer vision, and Transformer has changed that.”

Transformer quickly became the front-runner for applications, such as word recognition, that focus on analyzing and predicting text. It sparked a wave of tools, like OpenAI’s GPT-3, that train on hundreds of billions of words and generate coherent new text.

Transformer’s success has led AI researchers to wonder: What else can this model do?

The answers are unfolding now, and Transformer is proving to be surprisingly versatile. In some vision tasks, such as image classification, neural networks built on Transformer are faster and more accurate than those without it. Transformer is also driving emerging research in other areas of AI, from processing several kinds of input at once to completing planning tasks.

“Transformer seems to be quite transformative across many aspects of machine learning, including computer vision,” said Vladimir Haltakov, who works on computer vision for self-driving cars at BMW in Munich.

Just a decade ago, the different subfields of AI had little to say to one another, but Transformer’s arrival suggests that convergence is possible. “I think Transformer is so popular because it shows the potential to become universal,” said Atlas Wang, a computer scientist at the University of Texas at Austin. “There’s good reason to try Transformer across the whole spectrum of AI tasks.”

From “Language” to “Vision”

A few months after “Attention Is All You Need” was published, the first promising moves to expand Transformer’s reach began. Alexey Dosovitskiy, then at Google Brain’s Berlin office, was working on computer vision, the subfield of AI focused on teaching computers how to process and classify images.


Like almost everyone else in the field, he had been using convolutional neural networks (CNNs). For years, CNNs drove all the big leaps in deep learning, especially in computer vision. A CNN performs feature recognition by repeatedly applying filters to the pixels of an image. It is thanks to CNNs that photo apps can sort your pictures by face or tell an avocado from a cloud, and CNNs were therefore considered indispensable for visual tasks.

At the time, Dosovitskiy was working on one of the field’s biggest challenges: scaling up CNNs to train on ever-larger data sets of ever-higher-resolution images without piling on processing time. But then he watched Transformer replace the previous tools of choice for almost every language-related AI task. “We were obviously inspired by what was happening,” he said. “We wondered, could we do something similar with vision?” The idea made a certain kind of sense: after all, if Transformer could handle words in large data sets, why not images?

The eventual result: a network called Vision Transformer, or ViT, which the researchers presented at a conference in May 2021. The model’s architecture is nearly identical to that of the first Transformer proposed in 2017, with only minor changes that let it analyze images instead of words. “Language tends to be discrete,” Rumshisky says, “so the image has to be made discrete as well.”

The ViT team knew they could not imitate the language approach exactly, since self-attention over every individual pixel would be prohibitively expensive in computing time. Instead, they divided the larger image into square units, or tokens. The size is arbitrary, since the tokens can be made larger or smaller depending on the resolution of the original image (the default is 16 pixels on a side), but by processing pixels in groups and applying self-attention to each patch rather than each pixel, ViT can quickly work through enormous training data sets and produce increasingly accurate classifications.
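As a rough illustration of that tokenization step (the function name, image size and patch size below are illustrative assumptions, not the ViT code), here is how an image can be cut into 16 × 16 patches and flattened into the word-like tokens that self-attention then operates on:

```python
import numpy as np

def image_to_patch_tokens(image, patch=16):
    """Cut an (H, W, C) image into non-overlapping patch x patch squares and
    flatten each square into one vector, the word-like token ViT attends over."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must divide evenly into patches"
    tokens = (image
              .reshape(H // patch, patch, W // patch, patch, C)
              .transpose(0, 2, 1, 3, 4)           # regroup into a grid of patches
              .reshape(-1, patch * patch * C))    # one flat vector per patch
    return tokens

# A 224 x 224 RGB image becomes 14 x 14 = 196 tokens of length 16*16*3 = 768.
img = np.zeros((224, 224, 3))
print(image_to_patch_tokens(img).shape)  # -> (196, 768)
```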

ViT classified images with better than 90% accuracy, a far better result than Dosovitskiy had expected, and set a new state-of-the-art top-1 accuracy on the ImageNet data set. ViT’s success suggested that convolutions may not be as essential to computer vision as researchers had believed.

“I think it’s quite likely that CNNs will be replaced by Vision Transformer or its derivatives in the medium term,” says Neil Houlsby of Google Brain’s Zurich office, who worked with Dosovitskiy on ViT. Future models, he thinks, could be pure Transformers or could add self-attention to existing architectures.

Several further results back up these predictions. Researchers routinely test their image-classification models against the ImageNet database, and at the start of 2022 an updated version of ViT was second only to a newer approach that combines CNNs with Transformers. Pure CNNs, the longtime champions, barely made the top 10.

How Transformer works

The ImageNet results show that Transformer can compete with leading CNNs. But Maithra Raghu, a computer scientist at Google Brain’s Mountain View, California, office, wondered whether it “sees” images the same way CNNs do. Neural networks are notoriously hard-to-decipher black boxes, but there are ways to peer inside, for example by examining the network’s inputs and outputs layer by layer to learn how the training data flows through. Raghu’s team did essentially that: they took ViT apart.


Her team identified ways in which self-attention leads to a different means of perception within the algorithm. Ultimately, Transformer’s power comes from the way it processes an image’s encoded data. “With a CNN, you start off very local and slowly get a global perspective,” Raghu said. A CNN recognizes an image pixel by pixel, identifying features like corners or lines as it builds up from the local to the global. But in a Transformer with self-attention, even the very first layer of processing makes connections between image locations that are far apart (just as it does with language). If a CNN’s approach is like starting at a single pixel and slowly zooming out, a Transformer slowly brings the whole fuzzy image into focus.
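A back-of-the-envelope sketch makes the contrast concrete. Assuming stride-1 convolutions with a small kernel and, for comparison, full self-attention over something like ViT’s 196 patch tokens (both numbers are just illustrative), a CNN’s receptive field widens only a little with each added layer, while attention is global from the first layer:

```python
def conv_receptive_field(layers, kernel=3):
    """Positions a single output can 'see' after stacking `layers` stride-1
    convolutions: it grows slowly, by (kernel - 1) positions per layer."""
    return 1 + layers * (kernel - 1)

def attention_receptive_field(layers, sequence_length):
    """With self-attention, one layer already connects every position to every other."""
    return sequence_length if layers >= 1 else 1

for depth in (1, 2, 4, 8):
    print(depth, conv_receptive_field(depth), attention_receptive_field(depth, 196))
# 1 3 196
# 2 5 196
# 4 9 196
# 8 17 196
```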

The difference is easier to understand in language, the domain where Transformer first took hold. Consider these sentences: “The owl spotted a squirrel. It tried to grab it with its talons but only got the end of its tail.” The structure of the second sentence is confusing: what do those “it”s refer to? A CNN that focuses only on the words immediately around “it” would struggle, but a Transformer that connects every word to every other word can work out that the owl did the grabbing and the squirrel lost part of its tail.

Clearly, Transformer’s approach to images is fundamentally different from that of convolutional networks, and researchers have only grown more excited. Transformer’s versatility in handling data ranging from one-dimensional strings, like sentences, to two-dimensional arrays, like images, suggests that the model can handle many other kinds of data. Wang, for example, believes Transformer could be a big step toward a convergence of neural network architectures, yielding a universal approach to computer vision, and perhaps to other AI tasks as well. “Of course there are limitations to actually making it happen, but it would certainly be great to have a universal model where you could put all kinds of data into one machine.”

Outlook on ViT

Now researchers hope to apply Transformer to an even harder task: generating new images. Language tools such as GPT-3 can produce new text from their training data. In a paper published last year, “TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up,” Wang combined two Transformer models and tried to do the same thing with images, a much more difficult problem. When the dual-Transformer network was trained on the faces of more than 200,000 celebrities, it synthesized new face images at moderate resolution. The generated celebrity faces were impressive, and at least as convincing as those created by CNNs, according to the inception score, a standard way of evaluating images generated by neural networks.

Wang argues that Transformer’s success in generating images is even more surprising than ViT’s ability to classify them. “A generative model needs to synthesize; it needs to be able to add information so the result looks plausible,” he said. And just as in classification, Transformer-based methods are replacing convolutional networks in generation as well.

Raghu and Wang also see new uses for Transformer in multimodal processing. “It used to be tricky,” Raghu says, because each type of data had its own specialized model and the methods were isolated. But Transformer offers a way to combine multiple input sources.

“There are lots of interesting applications that combine some of these different types of data,” Raghu says. Multimodal networks might, for example, power a system that reads a person’s lips in addition to listening to their voice. “You could have a rich representation of both language and visual information,” Raghu says, “and in a much deeper way than before.”

These faces were created by a Transformer-based network after training on a data set of more than 200,000 celebrity faces.

A new crop of studies points to a range of further uses for Transformer in other areas of AI, including teaching robots to recognize human body movements, training machines to recognize emotions in speech, and detecting stress levels in electrocardiograms. Another program with Transformer components is AlphaFold, which made headlines for tackling the 50-year-old protein-folding problem with its ability to quickly predict protein structures.

Transformer isn’t all you need

Even as Transformer helps unify and improve AI tools, it comes, like other emerging technologies, with a steep cost: a Transformer model requires an enormous amount of computing power in the pre-training phase before it can beat its established competitors.

That could be a problem. “People are increasingly interested in high-resolution images,” Wang says. That training expense could be a drag on Transformer’s spread. Raghu, however, believes the training barrier can be overcome with the help of sophisticated filters and other tools.

Wang also points out that even though Vision Transformers are already driving advances in AI, many of the newest models still incorporate the best parts of convolution. That means future models are more likely to use both approaches than to ditch CNNs altogether, he says.

At the same time, this suggests a promising future for hybrid architectures that take advantage of Transformer in ways today’s researchers cannot yet predict. “Maybe we shouldn’t jump to the conclusion that Transformer is the perfect model,” Wang said. But it is increasingly clear that Transformer will be at least part of nearly every new super-tool in the AI shop.

Original article: www.quantamagazine.org/will-transf… Finally, you’re welcome to follow my WeChat official account, Duibainotes, which tracks frontier topics in machine learning such as NLP, recommender systems, and contrastive learning. I also share my entrepreneurial experiences and reflections on life there. Readers who want to talk further can add me on WeChat to discuss technical problems with me. Thank you!