Author | Andrew Ng

Translator | Nuclear Cola, Yan Liu

On December 23, machine learning guru Andrew Ng published his latest article in The Batch, the weekly AI newsletter he edits. In it, Ng reviewed the year's major developments in global AI: multimodal models, large models, AI-generated audio, the Transformer architecture, and national AI regulatory initiatives.

Hello, this is Duibai.

The year 2021 is coming to an end.

Ng recently published a Christmas message on the theme "give roses to others, and the fragrance stays on your hands."

As the end of 2021 approaches, you may be cutting back on work in preparation for winter vacation. I’m looking forward to taking a break from work, and I hope you are, too.

December is sometimes called the season of giving. If you have free time and want to know what to do with it, I think one of the best things any of us can do is think about how we can help others.

Historian and philosopher Will Durant once said, "We are what we repeatedly do." If you consistently seek to lift up others, it will not only help them but, perhaps just as important, also make you a better person; it is your repeated behavior that defines who you are. There is also a classic study showing that spending money on others may make you happier than spending it on yourself.

So, this holiday season, I hope you can take some time off. Rest, relax, recharge! Connect with loved ones you haven’t had enough time to connect with in the past year. If time permits, do something meaningful to help others. This could be leaving an encouraging comment on a blog post, sharing advice or encouragement with a friend, answering an AI question in an online forum, or donating to a worthy cause. Among the educational and/or technology-related charities, some of my favorites are the Wikimedia Foundation, Khan Academy, Electronic Frontier Foundation, and Mozilla Foundation.

Ng also talked about the development of the AI community. "The AI community has been highly collaborative since its earliest days," he said. It felt like a group of fearless pioneers marching out into the world. People are eager to help, offer advice, encourage one another, and make introductions. Those who helped us often got nothing in return, so we repay them by helping those who follow. I want to keep that spirit alive as the AI community grows. I pledge to keep working to build the AI community. I hope you will, too!

I also urge you to consider ways, large and small, to reach out to those outside the AI community. There are still many places in the world that don’t have advanced technology. Billions of dollars and billions of lives are affected by our decisions. This gives us a special opportunity to do good in the world.

Ng reviewed the progress of global AI in 2021 and looked ahead to the future of AI technology in 2022 and beyond.

Back in 2021

Over the past year, the world struggled with extreme weather, inflation, supply chain disruptions, and the COVID-19 pandemic.

In the tech world, telecommuting and online conferencing continued throughout the year. The AI community continues to work to bridge the world and advance machine learning while enhancing its ability to benefit all industries.

This time, we want to focus on the future of AI technology in 2022 and beyond.

The takeoff of multimodal AI

While deep learning models that handle a single modality, such as GPT-3 for text and EfficientNet for images, attracted plenty of attention, the most impressive progress of the year came from AI models that learn the relationships between text and images.

Background information

OpenAI opened the year in multimodal learning with CLIP (which matches images with text) and DALL·E (which generates images from input text). DeepMind's Perceiver IO classifies text, images, video, and point clouds. Stanford University's ConVIRT attaches text labels to medical X-ray images.
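
To make the idea concrete, here is a minimal sketch of CLIP-style image-text matching, assuming the Hugging Face transformers library and its "openai/clip-vit-base-patch32" checkpoint; the image file name is a hypothetical placeholder.

```python
# Score how well each caption describes an image (CLIP-style image-text matching).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("chest_xray.png")                 # hypothetical local image file
captions = ["a chest x-ray", "a photo of a cat"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image           # image-to-text similarity scores
print(logits.softmax(dim=-1))                       # probability assigned to each caption
```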

Important benchmark

Although most of these new multimodal systems are still experimental, some have already made breakthroughs in practical applications.

  • The open-source community combined CLIP with generative adversarial networks (GANs) to produce compelling works of digital art. Artist Martin O'Leary used Samuel Taylor Coleridge's poem "Kubla Khan" as input to create the psychedelic piece "Sinuous Rills."

  • Facebook says its multimodal hate speech detector flags and removes 97 percent of abusive and harmful content on the social network. The system classifies memes and other image-text pairs as "benign" or "harmful" based on 10 data types, including text, images, and video.

  • Google says it has added multimodal (and multilingual) capabilities to its search engine. Its Multitask Unified Model (MUM) returns text, audio, image, and video links in response to queries submitted in any of 75 languages.

Behind the news

This year’s multimodal developments come from decades of solid research.

Back in 1989, researchers at Johns Hopkins University and the University of California, San Diego developed a system that classified vowels using both the audio and the visual data of people speaking.

Over the next two decades, more research groups experimented with multimodal applications such as indexing digital video libraries and classifying human emotions from audio and visual data.

Current state of development

Images and text are each so complex that for a long time researchers could focus on only one or the other, and during that time they developed largely separate sets of techniques.

But over the past decade, computer vision and natural language processing have converged on neural networks, finally making it possible to merge the two, and even audio is finding its place in the mix.

Trillion-scale parameters

Over the past year, models have gone from big to bigger.

Background information

Google kicked off 2021 with the Switch Transformer, the first model to exceed a trillion parameters, at 1.6 trillion in all.

The Beijing Academy of Artificial Intelligence responded with Wudao 2.0, which contains 1.75 trillion parameters.

Important benchmark

Simply raising the parameter count accomplishes nothing special by itself. But as processing power and data sources grow, deep learning really is starting to establish the principle that bigger is better.

Deep-pocketed AI vendors are piling on parameters at a feverish pace, both to improve performance and to flex their muscles. Especially for language models, the Internet supplies vast amounts of unlabeled data for unsupervised and semi-supervised pretraining.

Since 2018, the parameter arms race has run from BERT (110 million) through GPT-2 (1.5 billion), Megatron-LM (8.3 billion), Turing-NLG (17 billion), and GPT-3 (175 billion) before finally crossing the trillion mark.

It’s good, but…

The path to ever-larger models also brings new challenges, confronting developers with four stark obstacles.

  • **Data:** Large models need to absorb enormous amounts of data, but traditional sources such as the web and digital libraries often cannot supply high-quality material at that scale. For example, BookCorpus, a dataset of 11,000 e-books that researchers have used to train more than 30 large language models, contains religious bias: it mainly discusses Christianity and Islam, with little reference to other religions.

The AI community recognizes that data quality directly determines model quality, but it has not yet agreed on effective methods for compiling large, high-quality datasets.

  • **Speed:** Today's hardware still struggles with such large models, and training and inference can slow dramatically as data shuttles in and out of memory.

To reduce latency, the Google team behind the Switch Transformer developed a way for each token to activate only a subset of each layer of the model (a toy sketch of this routing idea appears after this list). Their best model produced predictions 66 percent faster than a conventional model with one-thirtieth as many parameters.

In addition, Microsoft's DeepSpeed library parallelizes processing across data, layers, and groups of layers, and it reduces redundant processing by dividing tasks between CPUs and GPUs.

  • **Power consumption:** Training such large networks consumes a great deal of electricity. A 2019 study found that training a 200-million-parameter transformer model on eight Nvidia P100 GPUs produced carbon emissions (assuming fossil-fuel power generation) roughly equal to those of a typical car over five years of driving.

To be sure, a new generation of AI accelerator chips, such as Cerebras's WSE-2 and Google's latest TPUs, promises to reduce emissions, and the supply of wind, solar, and other clean energy keeps growing. I believe AI research will become less damaging to the environment.

  • **Model delivery:** These massive models are difficult to run on consumer or edge devices, so deployment at real scale requires either Internet access to remote servers or slimmed-down versions, and both approaches currently have drawbacks.
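
As mentioned in the speed item above, the Switch Transformer routes each token through only one expert feed-forward block per layer. Below is a toy sketch of that top-1 routing idea in PyTorch; the class name, dimensions, and expert count are hypothetical choices for illustration, not Google's implementation.

```python
# Switch-style top-1 routing: each token is processed by only one expert per layer.
import torch
import torch.nn as nn

class SwitchFFN(nn.Module):
    def __init__(self, d_model=64, d_ff=256, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # per-token routing scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                              # x: (num_tokens, d_model)
        gate = torch.softmax(self.router(x), dim=-1)   # routing probabilities
        weight, expert_idx = gate.max(dim=-1)          # pick one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i                     # tokens routed to this expert
            if mask.any():
                out[mask] = weight[mask].unsqueeze(1) * expert(x[mask])
        return out

tokens = torch.randn(10, 64)                           # 10 token embeddings
print(SwitchFFN()(tokens).shape)                       # torch.Size([10, 64])
```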

Current state of development

The natural language leaderboards are still dominated by models below the trillion-parameter mark, simply because trillion-scale parameters remain so difficult to handle.

But it is a safe bet that more members will join the trillion-parameter club in the coming years, and the trend will continue. OpenAI's planned successor to GPT-3 is rumored to have an even more staggering parameter count.

AI-generated audio content goes mainstream

Musicians and filmmakers have grown accustomed to AI-enabled audio production tools.

Background information

Professional media producers use neural networks to generate new voices and modify old ones. Naturally, voice actors are not amused.

Important benchmark

Generative models can learn the characteristics of existing recordings and produce convincing replicas. Some producers use the technology to create original voices, others to imitate existing ones.

  • US startup Modulate uses a generative adversarial network to create new voices in real time, letting gamers and chat users build vocal avatars. Transgender people have also used it to tune their voices to match their gender identity.

  • Sonantic is a startup specializing in voice synthesis. Actor Val Kilmer lost most of his voice after throat surgery in 2015, and the company built a synthetic voice for him from archival recordings of his speech.

  • Filmmaker Morgan Neville hired a software company to recreate the voice of the late travel host for his documentary Roadrunner: A Film About Anthony Bourdain. The move drew the ire of Bourdain's widow, who said she had not given permission.

It’s good, but…

That controversy is not an isolated one.

Voice actors also worry that the technology could threaten their livelihoods. Fans of The Witcher 3: Wild Hunt, 2015's Game of the Year, have even used the technology to recreate the voices of the original cast in a fan-made mod.

Behind the news

The recent trend towards mainstreaming audio generation is a natural continuation of earlier research.

  • OpenAI's Jukebox, for example, was trained on 1.2 million songs and uses a pipeline of autoencoders, transformers, and decoders to generate complete recordings in styles ranging from Elvis Presley to Eminem.

  • In 2019, an anonymous AI developer released a technology that lets users recreate the voices of animated and video-game characters from lines of text in as little as 15 seconds.

Current state of development

Generated audio and video not only give media producers the ability to repair and enhance archival material; they also make it possible to create entirely new material from scratch.

But the ethical and legal questions the technology raises are also growing. If voice actors are replaced by AI, who is responsible for their losses? What ownership disputes arise when the voice of a deceased person is recreated in a commercial work? Can AI be used to create new albums for dead artists, and would that be the right thing to do?

One architecture, driving everything

The Transformer architecture is rapidly expanding its reach.

Background information

The Transformer architecture was originally developed for natural language processing, but it has become something of a cure-all for deep learning. In 2021, people were already using it to discover drugs, recognize speech and images, and more.

Important benchmark

Transformers have proven themselves in visual tasks, earthquake prediction, and protein classification and synthesis.

Over the past year, researchers have begun pushing it into new, broader fields.

  • TransGAN is a generative adversarial network that incorporates transformers to ensure that each generated pixel is consistent with the pixels generated before it. The work achieved strong results on measures of how closely generated images resemble the original training data.

  • Facebook's TimeSformer uses the architecture to recognize actions in video clips. Instead of interpreting sequences of words in text, it interprets the sequence of relationships among video frames. It outperformed convolutional neural networks, analyzing longer clips in less time and thus keeping energy consumption lower.

  • Researchers at Facebook, Google, and the University of California, Berkeley trained GPT-2 on text and then froze its self-attention and feed-forward layers. From there, they could fine-tune the model for very different use cases, ranging from math and logic problems to computer vision (a minimal sketch of this freezing recipe appears after this list).

  • DeepMind has released an open source version of AlphaFold 2, which uses Transformers to predict the 3D structure of proteins based on their amino acid sequences. The model has caused a stir in the medical community and is widely regarded as having great potential to advance drug discovery and reveal biological principles.
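
As referenced in the frozen-GPT-2 item above, here is a minimal sketch of that idea, assuming the Hugging Face transformers library: keep the pretrained self-attention and feed-forward weights fixed and train only small input and output layers for a new task. The adapter layers and the toy 16-dimensional task are hypothetical illustrations, not the researchers' actual setup.

```python
# Freeze GPT-2's self-attention and feed-forward blocks; train only small adapters.
import torch.nn as nn
from transformers import GPT2Model

gpt2 = GPT2Model.from_pretrained("gpt2")

for block in gpt2.h:                                  # each Transformer block
    for param in block.attn.parameters():             # self-attention weights
        param.requires_grad = False
    for param in block.mlp.parameters():              # feed-forward weights
        param.requires_grad = False

# Trainable pieces around the frozen backbone (hypothetical task: classify
# sequences of 16-dimensional feature vectors into two classes).
input_proj = nn.Linear(16, gpt2.config.n_embd)        # project task inputs to GPT-2 width
classifier = nn.Linear(gpt2.config.n_embd, 2)         # prediction head

def predict(features):                                # features: (batch, seq_len, 16)
    hidden = gpt2(inputs_embeds=input_proj(features)).last_hidden_state
    return classifier(hidden[:, -1])                  # predict from the final position
```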

Behind the news

The Transformer debuted in 2017 and rapidly changed the way language processing models are designed. Its self-attention mechanism tracks how each element of a sequence relates to every other element, which makes it suitable for analyzing not only sequences of words but also sequences of pixels, video frames, amino acids, and seismic waves.
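
To make the mechanism concrete, here is a minimal sketch of single-head scaled dot-product self-attention in PyTorch; the dimensions and random projection matrices are chosen purely for illustration.

```python
# Scaled dot-product self-attention: every element of a sequence attends to every other.
import torch

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projection matrices
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / k.shape[-1] ** 0.5             # pairwise relationships between elements
    weights = torch.softmax(scores, dim=-1)           # attention weights
    return weights @ v                                # weighted mix of the sequence's values

d_model, d_k, seq_len = 8, 8, 5
x = torch.randn(seq_len, d_model)                     # embeddings of words, pixels, frames, ...
out = self_attention(x, *(torch.randn(d_model, d_k) for _ in range(3)))
print(out.shape)                                      # torch.Size([5, 8])
```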

Large language models based on the Transformer have established a new standard recipe: pretraining on large unlabeled corpora followed by fine-tuning for specific tasks with a limited number of labeled examples.
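
A minimal sketch of that pretrain-then-fine-tune recipe, assuming the Hugging Face transformers and datasets libraries; the bert-base-uncased checkpoint and the small IMDB subset stand in for any pretrained model and any limited labeled set.

```python
# Fine-tune a pretrained language model on a small labeled dataset.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# A small labeled subset stands in for the "limited number of labeled examples."
dataset = load_dataset("imdb", split="train[:1000]")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length"),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1),
    train_dataset=dataset,
)
trainer.train()
```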

The ubiquity of Transformer’s architecture may herald the future of AI models that can solve multiple domains and problems.

Current state of development

Several concepts have rapidly become standard in the evolution of deep learning: the ReLU activation function, the Adam optimizer, the attention mechanism, and now the Transformer.

Developments over the past year have proved that this architecture is indeed alive and well.

Governments around the world have introduced laws related to artificial intelligence

Governments around the world are enacting new laws and proposals to control the impact of AI automation on modern society.

Background information

With AI’s potential impact on privacy, fairness, security and international competition, governments around the world are stepping up efforts to regulate AI.

Important benchmark

AI-related laws often reflect countries' value judgments about the political order, including how to balance social justice with individual freedom.

  • The European Union has drafted regulations that ban or restrict machine learning applications according to risk category. Real-time facial recognition and social credit systems would be banned; applications such as critical infrastructure control, law enforcement assistance, and biometrics would require detailed documentation demonstrating that the AI solution is safe and subject to continuous human oversight.

The draft rules, issued in April, are still in the legislative process and are not expected to be enacted for at least another 12 months.

  • Starting next year, China's Internet regulator will oversee AI systems and recommendation algorithms that could undermine social order and good morals. The targets include systems that spread misinformation, encourage addictive behavior, or harm national security. Companies must seek approval before deploying any algorithm that could sway public sentiment, and offending algorithms are not allowed online.

  • The U.S. government has proposed an AI Bill of Rights to protect citizens from systems that could violate privacy and civil rights, and it is collecting public comments on the proposal until January 15. Below the federal level, several state and municipal governments have begun restricting facial recognition systems, and New York City passed a law requiring bias audits of hiring algorithms.

  • The UN High Commissioner for Human Rights has called on member states to suspend certain uses of AI, including those that could violate human rights, limit access to basic services, or misuse private data.

Behind the news

The AI community is moving towards a regulatory consensus.

A recent survey of 534 machine learning researchers found that 68 percent of respondents believe that trust and reliability should carry more weight in model deployment. Respondents also generally trusted international institutions such as the European Union and the United Nations more than national governments.

Current state of development

Outside China, most AI-related regulations are still under review. But as the proposals stand, AI practitioners must prepare for the inevitable prospect of full government involvement.

If you found this useful, please share it on your Moments!

Finally, I invite you to follow my WeChat official account, Duibainotes, which tracks frontier topics in machine learning such as NLP, recommender systems, and contrastive learning. I also share my entrepreneurial experience and reflections on life. If you would like to talk further, feel free to add me on WeChat to discuss technical problems. Thank you!