New AI technologies are changing not only the form and content of audio and video services, such as video effects and new features, but more importantly the quality and cost of those services. This article is compiled from an online lecture by Wu Jinzhong, deep learning solution architect at NVIDIA, and discusses in detail the audio and video AI technologies enabled by NVIDIA's GPU hardware codec solutions and the CUDA parallel computing architecture, along with the latest research and practical applications.
Speaker / Wu Jinzhong
Article compilation / LiveVideoStack
I am Wu Jinzhong, a deep learning solution architect at NVIDIA. Today I would like to introduce NVIDIA's solutions for live streaming scenarios.
NVIDIA proposes two solutions, Broadcast Engine (RBX) and MAXINE, targeting consumer GPUs and data center GPUs respectively. I'll start with a brief overview of the two solutions, then go into detail about video compression, codec, and transcoding, and finally NGC, NVIDIA's hub of highly optimized software that makes it easy for developers to use GPUs.
1. RBX & MAXINE
At present, for NVIDIA's live streaming scenarios, GPU capabilities are divided into two parts according to the platform they run on: one is RBX, the other is MAXINE. RBX relies on the computing power of RTX GPUs and supports real-time AI effects such as background noise removal, green screen, super resolution, and face tracking. MAXINE targets data centers and is NVIDIA's latest solution released in 2020. Both RBX and MAXINE are essentially collections of AI capabilities.
The picture on the right shows the stack structure of RBX and MAXINE. Their functionality is provided by a set of SDKs: the green screen and super resolution just mentioned come from the Video Effects SDK, background noise removal from the Audio Effects SDK, and face tracking from the AR Effects SDK. The codec function is also integrated, using the GPU's hardware codec units to provide streaming capability. Jarvis is NVIDIA's conversational AI solution; bringing Jarvis into live streaming provides transcription, ASR, TTS, and other functions. Audio2Face is the voice-driven avatar feature used in the Misty demo shown at GTC. The latest results from NVIDIA Research will continue to be added as well; the AI video compression technology is explained in detail later.
Further down is the CUDA parallel computing architecture, which provides the GPU's parallel computing power. All of the AI models just mentioned need to be accelerated through CUDA and TensorRT to run in real time. TensorRT is NVIDIA's SDK for accelerating AI model inference. Starting with the Volta generation (for example V100), Tensor Cores, dedicated matrix-multiply units, have been added to the GPU to further accelerate model inference; Tensor Cores can also speed up training through mixed-precision training. Below that is the GPU hardware layer.
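As a small illustration of TensorRT acceleration (not taken from the talk), the trtexec tool that ships with TensorRT can build and benchmark an FP16 engine from an ONNX model; the model file name below is a placeholder:

```
# Build a TensorRT engine from an ONNX model with FP16 enabled
# (FP16 uses Tensor Cores where the GPU supports them)
trtexec --onnx=model.onnx --fp16 --saveEngine=model.plan
```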
1.1 RBX
These are the main features of RBX, which fall into several broad categories. Video effects include green screen, super resolution, Upscale, and artifact reduction. The green screen feature separates the foreground from the background, which can then be blurred or replaced with other pictures, videos, or game footage. Super resolution and Upscale both enhance resolution: super resolution additionally reduces artifacts on top of the video enhancement and is computation-heavy, while Upscale has a certain sharpening effect and is a lightweight algorithm. Artifact reduction is primarily used to remove artifacts caused by low bit rates.
Audio effects include background noise removal, for example PC fan noise, people whispering in the background during a live stream, or rain outside the window. AR effects include face detection, tracking, landmark points, and 3D mesh generation. Face detection outputs 2D bounding boxes and supports multiple boxes; users can also specify the number of output boxes according to their own scenario.
Facial landmark detection outputs 126-point landmarks and can track the eyes, lips, and eyeballs. Its input can be an image plus bounding boxes, or just an image, in which case faces are detected internally and the bounding boxes are generated. 3D mesh generation works similarly: the input is generally an image plus landmark points; if landmark points are not available they can be computed internally, and the 3D mesh is then obtained by fitting the landmark points. The URL at the bottom right contains more details.
1.2 MAXINE
MAXINE is NVIDIA's solution for video conferencing. According to statistics, more than 30 million video conferences are held worldwide every day, and due to the impact of COVID-19 video conferencing is becoming a routine need. MAXINE shares many features with RBX, but MAXINE is deployed in the cloud. One advantage of cloud deployment is that it is decoupled from terminal computing power and provides users with a consistent quality of service. MAXINE also offers green screen, super resolution, artifact reduction, background noise removal, transcription, NLU, TTS, and AR effects, and uses TensorRT and Tensor Cores for acceleration. MAXINE can be deployed either as Kubernetes microservices or through DeepStream, and it also provides SDKs for development, which users can integrate into their own systems as needed.
DeepStream is NVIDIA's end-to-end video streaming solution. It is based on the open source framework GStreamer and mainly integrates two parts: the GPU codec function and AI inference. DeepStream's plugin structure makes it easy for users to develop their own features, as in the example pipeline below.
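As a rough illustration (my own sketch, not from the talk), a minimal DeepStream-style GStreamer pipeline chains hardware decode, batching, TensorRT inference, and on-screen display; the element names come from DeepStream's GStreamer plugins, and the file and config paths are placeholders:

```
# Hardware decode -> batch -> TensorRT inference -> overlay -> display
gst-launch-1.0 filesrc location=sample_720p.h264 ! h264parse ! nvv4l2decoder ! \
  m.sink_0 nvstreammux name=m batch-size=1 width=1280 height=720 ! \
  nvinfer config-file-path=config_infer_primary.txt ! \
  nvvideoconvert ! nvdsosd ! nveglglessink
```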
*BROADCAST APP DEMO
This is the background noise removal demo. This feature is already embedded in popular live streaming software such as OBS. Other examples in the videos include the green screen effect, with the background blurred or replaced by a video or game footage, and the camera-follow effect.
*NVIDIA MAXINE
This is the super resolution effect: 360p on the left, 720p on the right. Next is the camera-follow effect just mentioned, which is deployed in the cloud, followed by the green screen effect, background noise removal, Audio2Face, translation in video conferencing, AI video compression, and face orientation correction.
2. AI video compression technology
Here is a detailed explanation of AI video compression, a recent result from NVIDIA Research. There are two core requirements in video conferencing. One is bandwidth reduction: video occupies a large amount of bandwidth, so reducing bandwidth is a core requirement. The other is user experience: in a video conference the face is generally not oriented directly toward the camera, so the effect of face-to-face communication cannot be achieved. If the face orientation can be corrected, the user experience improves greatly.
Let's look at traditional video compression first. Taking H.264 as an example, it uses I-frame, P-frame, and B-frame coding. Even so, the amount of data transmitted over the network is still large, and video conferencing has its own characteristics: in general there is only one user in each stream and the background is static. The one-shot free-view technique introduced here transmits only one initial image and then a very small amount of data per frame for video reconstruction. The initial image carries the user's appearance (identity) information; each subsequent frame only needs to transmit the head pose and the facial expression delta. The benefit is that the amount of data transmitted per frame is greatly reduced, further compressing the video, while the use of GAN techniques ensures the quality of the reconstruction.
In the figure on the right, the first image is the original video, the second is the reconstructed video without face orientation correction, and the third is the reconstruction after the face has been aligned. This is possible mainly because of the key-point decomposition technique proposed in the paper: the facial identity information and the motion information are decomposed, so during reconstruction the identity information can be kept unchanged while the head pose is adjusted, thereby correcting the face orientation. Compared with existing methods, this work is free-view: the head can be rotated over a larger angle around the viewpoint of the original image.
NVIDIA has also done a lot of related work on GANs. Relevant to this paper are pix2pixHD; vid2vid, which addresses temporal consistency; few-shot vid2vid, which reconstructs a video from fewer images; and one-shot free-view synthesis, which handles large-angle rotation from a single shot. In addition, FOMM (First Order Motion Model) is closely related to this work. Unlike FOMM, this paper uses implicit 3D key points instead of 2D key points, and separates head pose, motion information, and identity information so that the face orientation can be corrected.
Now let's look at the process. The method contains two steps. The first step is feature extraction, covering both the source image and the driving video. The decomposed information has three parts. The first is appearance (identity), which is independent of the user's motion and can be extracted from a single picture of the user; it is then fed into the downstream network for reconstruction. The second is the head pose R/T (rotation and translation), which needs to be transmitted every frame. The third is the facial expression delta, which also needs to be transmitted every frame to reconstruct the facial expression.
The network in the lower-left figure takes two inputs: the source image and the driving video. Since it is the person in the source image who needs to be reconstructed, the appearance identity information is extracted from the source image; it is independent of the user's pose, so only one picture of the user is needed. The expression delta and the head pose R/T are then extracted from the driving video, because we want the target's expression to drive the source image. With these three pieces of information, reconstruction can be carried out. If the head pose is not taken from the driving video but specified by the user, pose editing can be achieved.
The figure on the right shows the feature extraction process, in which every box is a DL model; there are four DL models in total. The first is the appearance feature extractor F, which extracts the appearance features; it only needs to run once and outputs the appearance feature fs. The second is the head pose estimator H, which outputs the rotation and translation vectors that need to be transmitted every frame. The third estimates the expression delta, which also needs to be transmitted every frame. The fourth is the key-point detector L, which extracts a fixed number of 3D key points. These 3D key points are implicit; they are not points in real 3D space the way a 3D mesh is. The paper uses 20 key points, learned by the network in an unsupervised manner. L is only run on the source image, and its outputs Xc,k and Jc,k provide the default-pose key-point information.
Looking at the blue arrow in the figure: once we have the key points Xc,k and Jc,k, we can rotate and translate them and finally add the expression delta, which gives us the key points used to reconstruct the target image. Here Ru and Tu are the user-specified head pose.
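Putting the description above into a formula (my own schematic notation, not quoted verbatim from the paper): the driving key points are roughly

Xd,k = R · Xc,k + T + δk

where R and T are the head-pose rotation and translation (estimated from the driving video, or the user-specified Ru and Tu for pose editing) and δk is the per-key-point expression delta.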
To see the separation of identity and motion information visually: the two images on the left are the source image and the driving image. The key points Xc,k are implicit key points in 3D space; for demonstration only five of them are shown. After rotating and translating the key points, the reconstructed image is the rotated and shifted image: its head pose is now aligned with the head pose in the driving image, but the facial expression is not yet consistent. Finally, the expression delta is added and the image is reconstructed. Comparing the reconstruction results, the identity information remains unchanged while the head pose and expression are consistent with the driving video.
That was the first step; the second step is video reconstruction, which has two parts: first, feature volume warping, and second, feeding the warped feature volume w(fs) to the generator for frame reconstruction. The idea of the feature volume is borrowed from FOMM. One improvement is to replace the original explicit 2D key points with implicit 3D key points, so that the key points can be rotated and translated. Those interested can refer to the paper.
Now let's look at the data compression part. The transmitted data consists of two parts: the original image, which only needs to be transmitted once, and the head pose R/T plus the expression delta, which need to be transmitted every frame. These data are also transmitted in compressed form. Let's look at the flow chart.
The input picture is D; after encoding we obtain the head pose R/T and the facial expression delta. These data are transmitted to the receiver for reconstruction, and during transmission they are compressed with entropy coding.
This graph shows the amount of data that needs to be transmitted per frame in two configurations (20 key points and adaptive key points).
Looking at the average, fewer than 100 bytes need to be transmitted per frame. Overall, the AI-based video compression technique can reduce network bandwidth by roughly a factor of ten compared with H.264. Of course, when new objects appear in the background, or there are occluders such as hats, glasses, or masks, a new source image needs to be uploaded, and reconstruction then proceeds from that new source image; the new source image can also be uploaded in compressed form.
This is a comparison of the two images. On the left is H.264, which uses 97 KB per frame; on the right is the AI video compression technique, which uses only 0.1 KB per frame. The reconstruction quality of AI video compression at low bit rate is still very good.
Looking at the rendered results, at the same bandwidth many details are lost with H.264 encoding, while AI video compression reconstructs the frame better under low bandwidth.
These are some of the resources.
An important module in live streaming is the codec. The codec units and the compute units on the GPU are independent of each other and can be used simultaneously; what they share is video memory and PCIe bandwidth. The encoder supports the commonly used H.264 and H.265 formats, while the decoder supports more formats, such as VP8 and VP9. AV1 decoding has been added in the latest Ampere architecture, with resolutions up to 8192×8192. On the GPU, H.264 can be encoded and decoded up to 4096×4096, and H.265 up to 8192×8192.
Let's take a look at codec performance; shown here is 1080p decoding, comparing the decoding capability of each GPU for H.264, HEVC, HEVC 10-bit, and VP9. The highest line is T4, whose decoding performance is relatively strong across formats. T4 is mainly used for online service scenarios, and its goal is to run a variety of tasks well. For H.264 the T4 decodes at about 1000 frames per second, and for H.265 at about 2100 frames per second.
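As a rough way to reproduce this kind of decode measurement (my own illustration, not a command from the talk), FFmpeg can decode on the GPU and simply discard the output; input.mp4 is a placeholder 1080p file:

```
# Measure GPU decode throughput: decode with NVDEC (cuvid) and discard the frames
ffmpeg -vsync 0 -hwaccel cuvid -c:v h264_cuvid -i input.mp4 -f null -
```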
On the left is H.264 1080p30fps encoding; on the right is HEVC 1080p30fps encoding.
With the T4 FAST preset, 19 and 13 concurrent channels can be supported, respectively. That is the latency-insensitive case. In latency-sensitive cases such as cloud gaming and live video, B-frames and look-ahead cannot be used, and the number of concurrent channels drops to 17 and 12. Overall, the number of concurrent channels supported by the GPU is higher than that supported by the CPU.
In transcoding scenarios, the Codec SDK has been integrated into FFmpeg, so the workflow is the same as the traditional one: the video stream comes in –> GPU decoding is selected (h264_cuvid) –> some filtering (scale/transpose) is done on the GPU –> GPU encoding is selected (h264_nvenc) –> the video is output, as in the example below.
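A minimal sketch of such a GPU transcoding pipeline (the file names and target resolution are placeholders, and scale_npp assumes FFmpeg was built with the NPP-based scaler):

```
# Decode on the GPU (NVDEC), scale on the GPU, encode on the GPU (NVENC)
ffmpeg -vsync 0 -hwaccel cuvid -c:v h264_cuvid -i input.mp4 \
       -vf scale_npp=1280:720 -c:a copy -c:v h264_nvenc -b:v 5M output.mp4
```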
One-to-many transcoding on the GPU is also supported, as are multiple GPUs, which can be selected for example via -gpu list or -hwaccel_device 1. Mixed operation of GPU and CPU is supported as well: if a format is not supported on the GPU, decoding can be done on the CPU, and the decoded data can then be transferred to the GPU through hwupload_cuda (sketched below). The GPU also supports common filters such as resize, scale, crop, and so on.
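Hedged sketches of the two cases just described; file names, resolutions, and bitrates are placeholders:

```
# 1:N transcode: decode once with NVDEC, produce two renditions with NVENC
ffmpeg -vsync 0 -hwaccel cuvid -c:v h264_cuvid -i input.mp4 \
       -vf scale_npp=1280:720 -c:v h264_nvenc -b:v 3M out_720p.mp4 \
       -vf scale_npp=640:360  -c:v h264_nvenc -b:v 1M out_360p.mp4

# Mixed CPU/GPU: decode a format NVDEC does not handle on the CPU,
# then upload the frames to the GPU with hwupload_cuda before encoding
ffmpeg -i input.webm -vf hwupload_cuda,scale_npp=1280:720 -c:v h264_nvenc out_gpu.mp4
```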
This is a typical FFmpeg command line. -vsync 0 is typically specified to ensure that frames are neither duplicated nor dropped; this parameter is often used when comparing CPU and GPU results. -hwaccel cuvid ensures that the decoded data stays on the GPU. Then H.264 hardware decoding is selected, the input video is given, the audio is copied, H.264 encoding is used with the bitrate set to 5M, and finally the output video is written. Put together, it looks like the command below.
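Reconstructed from the description above (the file names are placeholders):

```
ffmpeg -vsync 0 -hwaccel cuvid -c:v h264_cuvid -i input.mp4 \
       -c:a copy -c:v h264_nvenc -b:v 5M output.mp4
```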
A relatively neglected point in transcoding is the implicit data exchange between CPU and GPU. To keep decoded data resident on the GPU, the -hwaccel cuvid parameter needs to be specified; when it is not, FFmpeg copies the decoded frames back to the CPU by default, and when the GPU is then used for encoding the data has to be copied back to the GPU again. This is an unnecessary round trip, so try to avoid it.
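To make the difference concrete (an illustration, not a command from the talk):

```
# Decoded frames are copied back to system memory, then copied to the GPU again for NVENC
ffmpeg -c:v h264_cuvid -i input.mp4 -c:v h264_nvenc output.mp4

# Decoded frames stay in GPU memory all the way to the encoder
ffmpeg -hwaccel cuvid -c:v h264_cuvid -i input.mp4 -c:v h264_nvenc output.mp4
```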
Sometimes there are rendering tasks during a live stream. On a headless server, OpenGL is usually used together with EGL for GPU off-screen rendering. The simplest way to retrieve the rendered result is to read the pixels back to the CPU with glReadPixels and then do the subsequent processing, such as filtering, before finally writing to disk.
A more efficient way is to map the rendered result into the CUDA context through CUDA/OpenGL interop after rendering, filter it with CUDA kernels, copy it to the CPU, and finally write it to disk.
Photorealistic rendering can also use ray tracing through the OptiX SDK. On the V100, for example, ray tracing uses CUDA cores for ray intersection. Starting from the Turing generation, RT cores were added (for example on T4) to accelerate ray intersection; they can be used directly through the OptiX SDK or through the ray tracing extension in Vulkan, and the same capability is also exposed through DXR. On the server side, the OpenGL development environment can be downloaded from NGC, Vulkan is available as a container, and OptiX can be downloaded from its product page with complete sample code and documentation. For Vulkan ray tracing, refer to the documentation at the bottom. If you need to evaluate ray tracing performance, you can profile with Nsight Graphics.
The figure above summarizes how data is exchanged between the modules. It involves the decoder and encoder, provided by the Codec SDK; AI inference and training, provided by CUDA and the various frameworks; and OpenGL and Vulkan, which are part of the driver. With so many modules needing to interoperate, CUDA acts as the conduit. For the decoder, pixels can be manipulated with CUDA kernels between the decoder's LockFrame and UnlockFrame calls; for the encoder, CUDA kernels can be called before EncodeFrame to do color space conversion and the like. For OpenGL, resources first need to be registered with CUDA and, once rendering is complete, mapped into and unmapped from the CUDA context. For Vulkan, data can be copied to CUDA using the transfer capabilities of the graphics and compute queues. The AI workloads themselves already live in a CUDA context, so no extra data exchange is needed.
3. NGC
Here's NGC, which provides containers for NVIDIA's solutions, pre-trained models and scripts, Helm charts, GPU-enabled training frameworks, and more. The goal is to accelerate product development and deployment for AI, data science, HPC, visualization, and so on. Currently there are over 100 containers and over 30 pre-trained models for the x86, Arm, and POWER platforms, and cloud, data center, and edge scenarios are all supported.
The resources provided by NGC are security-verified and fully tested, and NGC is updated monthly to provide the latest features and best performance for the supported scenarios. It supports Docker containers, Singularity, bare metal, VMs, Kubernetes, and so on.
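For example (the image name and tag below are only illustrative; the actual containers are listed in the NGC catalog), pulling and running an NGC container typically looks like:

```
# Pull a framework container from the NGC registry and start it with GPU access
docker pull nvcr.io/nvidia/tensorrt:20.12-py3
docker run --gpus all -it --rm nvcr.io/nvidia/tensorrt:20.12-py3
```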
This table shows training performance comparisons between V100 and A100 in three different scenarios. In the BERT case, the first bar shows the performance of the V100 with release 20.05 and the second bar shows the V100 with release 20.07: even on the same V100 GPUs, performance improves through software upgrades. With the latest-generation A100 GPU, training performance improves significantly, and the NGC team will continue to optimize it.
For CV scenarios NGC supports ResNet-50, SSD, MobileNet, and VGG16; for NLP there are WaveGlow, BERT, and NeMo (NeMo is NVIDIA's open source toolkit for building conversational AI applications); for recommendation there are Wide & Deep, DLRM, and so on. NGC provides pre-trained models that can be used for transfer learning or directly for inference, along with sample code, custom models, and scripts to reproduce the results.
NGC also provides the Transfer Learning Toolkit to address problems common in industrial scenarios, such as insufficient training data, insufficient labels, and mismatched data distributions or scenes, where an existing model needs to be adapted. For AI inference, NGC provides the TensorRT container, which uses layer fusion, kernel auto-tuning, and other techniques to optimize the inference process, reducing latency and improving throughput. TensorRT is also integrated into the mainstream frameworks, where its capabilities can be used directly. NGC provides TensorRT pre-trained models for T4 and V100, including FP32, FP16, and INT8 models, among others.
Triton is a lightweight inference server provided by NVIDIA. Using CUDA streams, Triton supports heterogeneous multi-GPU setups, parallel inference of multiple models, and model import from existing frameworks such as TensorRT, TensorFlow, and PyTorch. The project is open source. The tools provided by NVIDIA can be downloaded from NGC as Docker containers, which is quick and convenient, or from the product pages, for example as shown below.
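A minimal sketch of pulling and launching Triton from NGC (the image tag and the model repository path are placeholders):

```
# Pull the Triton inference server container and serve a local model repository
docker pull nvcr.io/nvidia/tritonserver:20.12-py3
docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
       -v /path/to/model_repository:/models \
       nvcr.io/nvidia/tritonserver:20.12-py3 tritonserver --model-repository=/models
```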
* Video viewing address: mp.weixin.qq.com/s/u-F0VxEsi…