Editor’s note: As network performance improves, video has become a primary way for people to access information, entertainment, and leisure. At the same time, many video creators have emerged who turn the people and things they love into creative videos, a skill that this editor, a complete novice with no talent for video production, can only envy. Thanks to technology, however, that ability may soon be within everyone’s reach. GODIVA, an open-domain video generation pre-training model proposed by Microsoft Research Asia, uses a 3D attention mechanism to overcome the challenge of modeling continuity across video frames, and can generate video from a text script in seconds.
Question: How many steps does it take to put an elephant into a refrigerator? …No, wait, wrong question. Let’s try that again.
Question: What are the steps to go from a text script to a creative video?
Answer: Quite a few: visual concept, storyboard design, color matching, scene selection, decorative elements, on-location shooting, animation production, and more. It is, to put it mildly, a long and arduous journey.
However, as artificial intelligence technology continues to advance, in the future we may only need to enter a text script to generate a video directly, completing the whole process in a single step.
Recently, the Natural Language Computing group at Microsoft Research Asia released an innovative research result, GODIVA (paper link: arxiv.org/abs/2104.14…), a pre-trained model that generates open-domain videos directly from natural-language descriptions.
Yes, you read that right: this is the work of researchers in natural language processing (NLP). Why did NLP scientists start studying video, and how did they build this technology? Let’s find out.
Cross-modal intelligence in natural language and vision
In fact, acquiring information by reading text and engaging in spoken dialogue is only part of how humans learn as they develop. Much of our knowledge also comes from visual input: birds can fly in the sky but cannot swim in the water, and so on. Because this kind of information is taken for granted and rarely written down in text or speech, researchers training NLP models on large-scale text have become increasingly aware that existing models lack such common-sense knowledge, which appears far more often in pictures and videos.
Previously, limits on computing power and differences in how AI handled each domain made it difficult to learn cross-domain, multi-modal content. In recent years, with the rapid development of NLP, new underlying models such as the Transformer have continued to emerge and have been applied to model training in computer vision (CV) and other fields.
As the underlying models of natural language processing and computer vision grow more alike, researchers in each field increasingly try algorithms from the other to improve the representation and reasoning abilities of their models. At ACL, the premier NLP conference, many recent papers address multi-modal question answering, multi-modal summarization, and multi-modal content retrieval; CVPR and other top computer vision venues likewise feature many cross-modal methods that incorporate NLP tasks. Cross-domain, multi-modal machine learning is starting to pay off.
“From the perspective of NLP research, we hope to learn from video and image signals the common-sense information that cannot be captured in text, so as to supplement the common-sense and physical knowledge that existing NLP models lack and ultimately make them perform better. It also allows NLP to build an intrinsic connection with image and video tasks,” said Nan Duan, senior research fellow in the Natural Language Computing group at Microsoft Research Asia. “This is where our video generation research started.”
GODIVA’s innovative 3D attention mechanism tackles the challenge of continuous video modeling
Most current video generation technology is based on generative adversarial networks (GANs), while Microsoft Research Asia’s video generation is based on VQ-VAE. For NLP researchers, the latter approach is closer to the methods of the NLP field: it maps video and image information into text-like discrete symbols that can then be processed as serialized sequences.
Here there is no essential difference between video and images, since a video can be split into many frames, i.e., pictures. A VQ-VAE model can encode each video frame into a discrete representation, so that image information can be aligned with the corresponding text and serialized into the tokens that NLP methods handle best, making full use of existing NLP models and algorithms. After large-scale pre-training, the discrete sequence is decoded back into video frames by the VQ-VAE model, and the frames are stitched together to form the final video.
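To make this tokenization step concrete, here is a minimal PyTorch-style sketch of the idea: a convolutional encoder turns a frame into a grid of feature vectors, each vector is snapped to its nearest codebook entry to produce discrete token ids, and a decoder maps those ids back to pixels. The class name, codebook size, and network shapes are illustrative assumptions, not GODIVA’s actual configuration.

```python
# Illustrative sketch of the VQ-VAE tokenization step described above.
# Class name, codebook size, and shapes are assumptions, not GODIVA's exact settings.
import torch
import torch.nn as nn

class FrameVQVAE(nn.Module):
    def __init__(self, num_tokens=8192, dim=256):
        super().__init__()
        # Convolutional encoder: pixels -> grid of continuous feature vectors
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, 4, stride=4), nn.ReLU(),
            nn.Conv2d(dim, dim, 4, stride=4),
        )
        # Codebook: each grid vector is snapped to its nearest entry
        self.codebook = nn.Embedding(num_tokens, dim)
        # Decoder: discrete codes -> reconstructed pixels
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, 4, stride=4), nn.ReLU(),
            nn.ConvTranspose2d(dim, 3, 4, stride=4),
        )

    def encode(self, frame):                          # frame: (B, 3, H, W)
        z = self.encoder(frame)                       # (B, dim, h, w)
        b, d, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, d)   # (B*h*w, dim)
        # Squared L2 distance to every codebook entry, then nearest-neighbour lookup
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        ids = dist.argmin(dim=1).view(b, h * w)       # discrete visual tokens
        return ids, (h, w)

    def decode(self, token_ids, h, w):
        z = self.codebook(token_ids)                  # (B, h*w, dim)
        z = z.view(-1, h, w, z.size(-1)).permute(0, 3, 1, 2)
        return self.decoder(z)                        # (B, 3, H, W) reconstructed frame
```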
The advantage of this method is that each generated frame correlates closely with the text. However, ensuring smooth transitions between generated frames and handling the long sequences that video generation requires are technical difficulties the researchers had to overcome. To address these two challenges, they introduced a 3D sparse attention mechanism across frames: when generating a given region of the current frame, the model also considers visual information along that region’s three dimensions of row, column, and time (as shown in Figure 1).
Figure 1: Mask matrix of the 3D sparse attention
Figure 1 shows the 3D sparse attention mask matrix when the input text is 3 tokens long and the output video has 2 frames of 4 tokens each. Each row corresponds to one of the eight visual tokens (v_1, v_2, …, v_8) and indicates which tokens must be attended to when generating that token. Red, blue, and purple mark positions attended to only by the row, column, and temporal attention, respectively; green marks positions shared by all three attention mechanisms, and black marks positions that are not attended to at all. For example, the green first three columns of the first row indicate that all three attention mechanisms look at the language tokens t_1, t_2, and t_3 when generating v_1. In the second row, the first three columns are green and the fourth is blue, meaning that besides all three mechanisms attending to t_1, t_2, and t_3, the column attention also attends to v_1 when generating v_2. This is because, when a frame consists of four tokens, v_1 precedes v_2 along the column axis (as shown in Figure 2). Similarly, the fourth row indicates that when generating v_4 the model additionally attends to v_2 and v_3, besides t_1, t_2, and t_3: as Figure 2 shows, v_2 is the preceding token in v_4’s row and v_3 is the preceding token in v_4’s column. It is worth noting that, to reduce computation, the model no longer attends to v_1, which is farther from v_4.
Figure 2: Token arrangement when each video frame consists of 4 tokens
This design has three advantages. First, sparse attention saves a great deal of computation (as Figure 1 shows, a large number of positions are black), which makes long-sequence modeling feasible. Second, the row, column, and temporal dimensions let the model account for spatial and temporal dependencies when generating each visual region, producing videos that are smoother both within and between frames. Third, since all text information is considered when generating each visual token (the first three columns in Figure 1 are green), the generated video stays consistent with the text.
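Following the description of Figures 1 and 2, the sketch below shows how such a sparse mask might be constructed for the toy setting above (3 text tokens, 2 frames of 2×2 visual tokens). The within-frame token ordering and the choice to keep only the immediately preceding neighbour along each axis are simplifying assumptions for illustration, not GODIVA’s exact implementation.

```python
# Sketch of the 3D sparse attention mask from Figure 1, for the toy setting in the
# text: 3 language tokens, 2 frames, each frame a 2x2 grid of visual tokens.
# Token ordering and neighbour choices are simplified assumptions.
import numpy as np

def build_sparse_mask(n_text=3, n_frames=2, rows=2, cols=2):
    n_vis = n_frames * rows * cols
    n = n_text + n_vis
    mask = np.zeros((n_vis, n), dtype=bool)       # one row per generated visual token

    def idx(f, r, c):
        # flatten (frame, row, col) -> position in the token sequence, after the text
        return n_text + f * rows * cols + c * rows + r

    for f in range(n_frames):
        for r in range(rows):
            for c in range(cols):
                q = idx(f, r, c) - n_text                  # query row in the mask
                mask[q, :n_text] = True                    # every visual token sees all text tokens
                if c > 0:
                    mask[q, idx(f, r, c - 1)] = True       # row attention: previous token in the same row
                if r > 0:
                    mask[q, idx(f, r - 1, c)] = True       # column attention: previous token in the same column
                if f > 0:
                    mask[q, idx(f - 1, r, c)] = True       # temporal attention: same position, previous frame
    return mask

print(build_sparse_mask().astype(int))
```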
Figure 3: GODIVA model diagram
Figure 3 shows the overall GODIVA model. As can be seen, the model iteratively generates the sequence of visual tokens by cyclically stacking the row, column, and temporal sparse attention described above. Once these tokens are assembled, the video is output frame by frame through the VQ-VAE decoder.
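As a rough illustration of this generation loop, the sketch below samples visual tokens one at a time and then decodes them frame by frame. Here `transformer` and `vqvae_decoder` stand for assumed interfaces (the decoder matching the earlier tokenizer sketch); the frame count, grid size, and sampling strategy are illustrative, not the released model.

```python
# High-level sketch of the generation loop in Figure 3: visual tokens are produced
# autoregressively under the row/column/temporal sparse attention, then grouped per
# frame and decoded by the VQ-VAE decoder. All interfaces and sizes are assumptions.
import torch

@torch.no_grad()
def generate_video(text_ids, transformer, vqvae_decoder,
                   n_frames=10, tokens_per_frame=64, grid=(8, 8)):
    """text_ids: (1, T) tokenized description; returns a list of decoded frames."""
    visual_ids = torch.empty(1, 0, dtype=torch.long)
    for _ in range(n_frames * tokens_per_frame):
        # The transformer is assumed to apply the 3D sparse attention mask internally
        # and to return next-token logits over the VQ-VAE codebook.
        logits = transformer(text_ids, visual_ids)          # (1, codebook_size)
        next_id = torch.multinomial(logits.softmax(-1), 1)  # sample one visual token
        visual_ids = torch.cat([visual_ids, next_id], dim=1)

    frames = []
    h, w = grid
    for f in range(n_frames):
        frame_ids = visual_ids[:, f * tokens_per_frame:(f + 1) * tokens_per_frame]
        frames.append(vqvae_decoder(frame_ids, h, w))        # one decoded video frame
    return frames
```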
Beyond these technical challenges, another difficulty of text-based video generation is that evaluating the results is relatively subjective. The same text, for example “a child and a dog playing by the swimming pool,” could correspond to thousands of different videos, which makes it hard to judge generated videos against annotated data and poses a major challenge for automatic evaluation. To address this, researchers at Microsoft Research Asia combined human review with automatic evaluation. For the automatic side, they designed a metric called RM (Relative Matching) based on CLIP (paper link: arxiv.org/abs/2103.00…):
As shown in the formula, t denotes the input text description, v^(l) and v̂^(l) denote frame l of the real video v and of the generated video v̂, respectively, and CLIP(t, v̂^(l)) denotes the similarity between t and v̂^(l) computed with the CLIP model. Experimental results show that this metric can reliably pick out, from a set of candidate text descriptions, the one that corresponds to a generated video (by choosing the description with the maximum RM score, as shown in Figure 4), demonstrating that the video content generated by GODIVA correlates well with the input text description.
Figure 4: Similarity between input text and ground-truth videos
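For illustration only, the sketch below computes a CLIP-based relative matching score with the open-source `clip` package, assuming RM to be the frame-averaged ratio of the text’s similarity to each generated frame versus the corresponding real frame; the exact formula used in the paper may differ.

```python
# Hedged sketch of a CLIP-based relative matching (RM) score: per frame, compare the
# text's similarity to the generated frame against its similarity to the real frame,
# then average over frames. The exact definition in the paper may differ.
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_similarity(text, pil_image):
    """Cosine similarity between a text description and a single video frame."""
    tokens = clip.tokenize([text]).to(device)
    image = preprocess(pil_image).unsqueeze(0).to(device)
    with torch.no_grad():
        t = model.encode_text(tokens)
        v = model.encode_image(image)
    t = t / t.norm(dim=-1, keepdim=True)
    v = v / v.norm(dim=-1, keepdim=True)
    return (t * v).sum().item()

def relative_matching(text, real_frames, generated_frames):
    """Average over frames of CLIP(text, generated frame) / CLIP(text, real frame)."""
    ratios = [clip_similarity(text, g) / clip_similarity(text, r)
              for g, r in zip(generated_frames, real_frames)]
    return sum(ratios) / len(ratios)
```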
Currently, GODIVA has achieved good test results by pre-training on the public HowTo100M dataset and fine-tuning on the public MSR-VTT dataset. Although the current version generates only ten frames of video, the output is highly coherent and closely correlated with the text, preliminarily verifying the feasibility of text-based video generation. As algorithms are updated and computing power improves, the researchers will further improve video length, quality, resolution, and other details.
Here is a look at some of the videos GODIVA currently generates from text:
Input text: Digit 9 is moving down then up.
Output video:
Input text: Digit 7 moves right then left while Digit 3 moves down then up.
Output video:
Input text: A baseball game is played.
Output video:
Input text: A girl on The Voice Kids talks to the nation.
Output video:
Convergence of multiple technologies is the trend; can video created “out of thin air” be far behind?
Regarding video generation, you may wonder: does the AI model search for and filter an existing video that matches the text, or does it create an entirely new one? The question is a bit philosophical. Picasso once said, “Good artists copy, great artists steal.” Artists integrate and build upon the essence of everything they have absorbed, and AI is no exception to this practice.
Generally speaking, text-to-video generation can be divided into three kinds: the first is retrieval-based, screening out the existing video that best matches the text (related paper link: arxiv.org/abs/2104.08…); the second assembles material drawn from existing videos; and the third is generated entirely by the model itself.
GODIVA, the text-to-video technology currently being developed by Microsoft Research Asia, sits somewhere between the second and third approaches: part of the output draws on existing videos, and part is generated by the AI model itself. As core technologies for text-to-video generation, VQ-VAE and GAN each have certain deficiencies as well as their own advantages.
“In the future, fusing VQ-VAE and GAN so that their strengths complement each other will become a research direction for text-to-video generation. We are also trying to combine multiple AI technologies in innovative ways to improve the quality and length of generated video content, and we hope that by focusing on video understanding and generation research we can advance NLP pre-training models in cutting-edge areas such as multimodal processing and common-sense knowledge acquisition,” Duan said.
GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions
Paper link: arxiv.org/abs/2104.14…
Authors: Chenfei Wu, Lun Huang (Duke University), Qianxi Zhang, Binyang Li, Lei Ji, Fan Yang, Guillermo Sapiro (Duke University), Nan Duan
Microsoft Research Asia’s latest research result: generating video from text in just one step.