In the media and communications industry, video is an increasingly important carrier of information. To suit different playback scenarios, a video's container format, encoding format, resolution, bit rate, and other media stream attributes often need to be modified; these operations are collectively referred to as video transcoding.
Built on a distributed processing cluster and large-scale distribution of system resources, NetEase Yunxin's VOD service meets the playback requirements of all terminal devices and provides enterprise users with fast, stable cloud services for video uploading, storage, transcoding, playback, and downloading. Yunxin's distributed task processing system carries the media processing capability, whose main functions are audio/video remuxing, transcoding, merging, screenshots, encryption, and adding or removing watermarks, along with a series of pre-processing functions such as image stretching, image enhancement, and volume normalization.
Video transcoding is a core function of media processing, and transcoding large video files usually takes a long time. To improve service quality, we focus on raising the video transcoding rate. This article centers on shard transcoding and introduces NetEase Yunxin's work on, and gains in, transcoding speed.
Factors affecting transcoding performance and optimization directions
The common video transcoding process is as follows:
In the transcoding process, the bottleneck lies mainly in the video stream, so our discussion of speed improvement focuses on video stream processing; audio stream processing is not considered. For video stream processing, we discuss the following aspects:
- Source video: in general, the longer the source video, the longer encoding and decoding take.
- Encapsulation and codec: for processes that require no decoding or encoding, such as remuxing and keyframe-aligned clipping, very little computation is needed, generally 1–2 s. If re-encoding is required, the time depends on the source video and the output encoding parameters, which commonly include encoding format, bit rate, resolution, and frame rate. Different encoding algorithms have different compression rates and computational complexity, and thus different costs; for example, AV1 encoding takes longer than H.264 encoding. Likewise, the higher the target bit rate, resolution, or frame rate, the more computation is usually required.
- Horizontal and vertical scaling of computing resources: the stronger a processor's single-core performance, the shorter the transcoding time. Using GPUs, which are better suited to image processing, also helps reduce transcoding time, as does increasing the concurrency of the transcoding execution flow. Concurrency here can mean multithreading or multi-processing: multithreading raises the thread count within a single process, while multi-processing slices the file and runs multiple processes over the slices in parallel.
- Cluster task scheduling: a multi-tenant cloud service usually designs its priority scheduling algorithm around per-tenant resource allocation and tenant priorities. Scheduling efficiency shows up in several ways: scheduling tasks in less time, achieving high throughput with fewer cluster resources, and balancing priority against starvation.
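As a concrete illustration of the multi-process slicing mentioned above, the sketch below builds one ffmpeg command per slice so that each slice can be transcoded by its own process. The file names, codec choice, and equal-duration split are illustrative assumptions, not the production logic.

```python
def build_slice_commands(src, total_sec, num_slices):
    """Build one ffmpeg command per slice; each command can run as an
    independent process, giving multi-process parallelism over one file."""
    seg = total_sec / num_slices
    cmds = []
    for i in range(num_slices):
        start = i * seg
        cmds.append([
            "ffmpeg",
            "-ss", f"{start:.3f}",        # seek to the slice start (seconds)
            "-t", f"{seg:.3f}",           # slice duration
            "-i", src,
            "-c:v", "libx264", "-an",     # video only; audio handled separately
            f"slice_{i}.mp4",             # illustrative output name
        ])
    return cmds

cmds = build_slice_commands("input.mp4", total_sec=600, num_slices=4)
# Each entry in cmds could now be handed to subprocess.run in parallel.
```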
Given the above factors, we pursue the following optimization directions: improving hardware capability, optimizing encoding, shard transcoding, and improving cluster scheduling efficiency.
Dedicated hardware acceleration
Multimedia processing is a typical computation-intensive scenario, so optimizing the overall computing performance of multimedia applications is very important. The CPU is a general-purpose computing resource, and offloading video and image processing to dedicated hardware is a common scheme. Chip manufacturers such as Intel, NVIDIA, AMD, ARM, and TI all offer multimedia hardware acceleration solutions that improve coding efficiency for high-bit-rate, high-resolution video scenarios.
Our transcoding system is mainly based on the FFmpeg multimedia processing framework. Vendor solutions supported on Linux include Intel's VA-API (Video Acceleration API) and NVIDIA's VDPAU (Video Decode and Presentation API for Unix); both vendors also support their more proprietary stacks, Intel Quick Sync Video and NVENC/NVDEC. At present, we mainly use the video acceleration capability of Intel integrated graphics, combined with the FFmpeg community's QSV plugin and VAAPI plugin, to hardware-accelerate the AVDecoder, AVFilter, and AVEncoder modules. Hardware acceleration technologies, vendors, and communities continue to improve, and we will detail further practice in this area in upcoming articles in this series.
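As a sketch of what the QSV path looks like in practice, the helper below assembles an FFmpeg command that decodes, scales, and encodes on Intel hardware. The flags follow FFmpeg's QSV plugin, but the resolution, bit rate, and file names are illustrative, and the exact options available depend on the FFmpeg build and driver.

```python
def qsv_transcode_cmd(src, dst, width=1280, height=720, bitrate="2M"):
    """Assemble a full hardware pipeline: QSV decode -> QSV scale -> QSV encode,
    covering the AVDecoder, AVFilter, and AVEncoder modules in one command."""
    return [
        "ffmpeg",
        "-hwaccel", "qsv",                         # hardware-accelerated decode
        "-c:v", "h264_qsv",                        # QSV H.264 decoder
        "-i", src,
        "-vf", f"scale_qsv=w={width}:h={height}",  # hardware scaling filter
        "-c:v", "h264_qsv",                        # QSV H.264 encoder
        "-b:v", bitrate,
        dst,
    ]

cmd = qsv_transcode_cmd("input.mp4", "out_720p.mp4")
```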
AMD high-core-count servers
This mainly refers to servers equipped with AMD EPYC series processors. Compared with our previous online servers, they offer stronger single-core computing power and better parallel capacity. The stronger single cores bring a general improvement across decoding, pre-processing, and encoding, while the very high core count makes the single-machine multi-process scenario in our shard transcoding scheme more powerful and largely avoids cross-machine IO for media data.
Self-developed codecs
NE264/NE265 are video encoders independently developed by NetEase Yunxin and already applied in NERTC and in Yunxin's live streaming and VOD. Beyond better coding performance, NE264's key technical advantage is low bandwidth with high picture quality, which suits high-bit-rate, high-definition live scenarios (such as game streaming, online concerts, and product launches): it keeps the subjective picture quality unchanged while saving 20% to 30% of the average bit rate. We will not go further here; interested readers can follow the "NetEase Zhiqi Technology+" WeChat official account.
Shard transcoding
If the optimizations above are vertical, the shard transcoding described in this section is horizontal. A video stream is essentially a series of images, divided into a series of GOPs bounded by IDR frames, and each GOP is an independently decodable set of images. This characteristic of video files means we can borrow the algorithmic idea of MapReduce: cut a video file into multiple shards, transcode the shards in parallel, and finally merge them into a complete file.
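Because each GOP is independently decodable, split points must land on IDR frames. The sketch below, with assumed inputs (a list of keyframe timestamps, such as ffprobe could report), picks the keyframes nearest to equal-duration boundaries.

```python
def choose_split_points(keyframe_ts, duration, num_shards):
    """Pick the IDR-frame timestamps closest to equal-duration boundaries,
    so every shard starts on a GOP boundary and decodes independently."""
    targets = [duration * i / num_shards for i in range(1, num_shards)]
    return [min(keyframe_ts, key=lambda k: abs(k - t)) for t in targets]

# Keyframes every 2 s in a 12 s clip, split into 3 shards:
points = choose_split_points([0.0, 2.0, 4.0, 6.0, 8.0, 10.0], 12.0, 3)
```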
Task scheduling
In addition to optimizing the flow of a single transcoding computation, we also need to improve the overall scheduling efficiency of cluster resources. In the scheduling algorithm, the scheduling node must not only accept task submissions but also complete the key process of task delivery, whose design must balance multi-tenant allocation, task priority preemption, and maximizing cluster resource utilization.
We designed two task delivery mechanisms:
- The Master node pushes tasks to compute nodes
- Compute nodes proactively pull tasks from the Master node
The advantage of the former is better real-time behavior; its disadvantage is that the Master's view of a compute node's resources is a snapshot, and in some cases stale snapshot information can overload certain nodes. The advantage of the latter is that nodes take tasks as needed, so no node becomes overloaded, and task selection is easier to program; the disadvantage is that the Master has weaker real-time control over global resource allocation.
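A minimal sketch of the pull model, using an in-memory stand-in for the Master (the real Dispatch Center API is not shown): the node fetches work only when it has capacity, so a stale snapshot can never overload it, and task selection (here a simple capability filter) is easy to program.

```python
class FakeMaster:
    """In-memory stand-in for the Dispatch Center; fetch_task would be an RPC."""
    def __init__(self, tasks):
        self.tasks = list(tasks)

    def fetch_task(self, capabilities):
        # Return the first queued task this node can run, else None.
        for i, t in enumerate(self.tasks):
            if t["type"] in capabilities:
                return self.tasks.pop(i)
        return None

def pull_loop(master, capabilities, max_idle_polls=1):
    """Compute-node side of the pull model: take tasks as needed."""
    done, idle = [], 0
    while idle < max_idle_polls:
        task = master.fetch_task(capabilities)
        if task is None:
            idle += 1            # production code would sleep / back off here
            continue
        done.append(task["id"])  # production code would run the transcode
    return done

master = FakeMaster([{"id": 1, "type": "hw"}, {"id": 2, "type": "sw"}])
finished = pull_loop(master, {"sw"})   # this node only handles software transcodes
```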
Shard transcoding in practice
Media processing flow
The simplified media processing flow is shown in the figure below and can be divided into four steps: shard remuxing (when needed), video sharding, parallel transcoding, and video merging.
Sharding process
When cluster resources are sufficient, task scheduling and distribution generally see no backlog or resource preemption. In that case, processing the video stream itself consumes 80%–90% of the whole task cycle, so optimization at this stage yields the best cost-performance benefits.
Improving hardware capability and optimizing encoding both aim at raising the computing efficiency of a single transcoding process, but the resources one process can call on are limited, and so is its speed on large video files. Here, therefore, we discuss how to use the distributed MapReduce idea to shorten the time a transcoding task takes. The following sections elaborate on the implementation of shard transcoding.
The infrastructure of the shard transcoding process is shown in the figure above. First, we introduce the following concepts:
- Parent task: similar to a Job in Hadoop; the transcoding job submitted by the client, to be split into multiple small shards.
- Subtask: similar to a Task in Hadoop; each shard is packaged into a Task subtask that can be independently scheduled and executed.
- Parent Worker: compute node that performs the parent task;
- Child Worker: compute node that performs subtasks.
The main process of shard transcoding:
- The Dispatch Center sends a transcoding job to Worker0. Worker0 decides whether to shard-transcode based on policies such as the master switch, job configuration, and video file size.
- If shard transcoding is chosen, the video is split into N slices.
- The N transcoding subtasks are packaged and submitted to the Dispatch Center.
- The Dispatch Center dispatches the N subtasks to N workers that meet the requirements.
- As workers 1 to N finish transcoding, each sends a callback to Worker0.
- Worker0 downloads the transcoded shard videos from the N workers and, once all transcoding is complete, merges the shards together.
- Worker0 sends a callback to the client.
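The steps above can be sketched end to end. Everything here is an in-memory stand-in (the split, transcode, and merge calls are illustrative, not the real service API); the point is the parent worker's control flow.

```python
def split_video(src, n):
    # Stand-in for real slicing: identify each slice by (source, index).
    return [(src, i) for i in range(n)]

def transcode_slice(slice_id):
    # Stand-in for a shard worker: dispatch, transcode, and call back.
    return ("done", slice_id)

def run_job(src, n_slices, shard=True):
    """Parent-worker (Worker0) control flow for one transcoding job."""
    if not shard:                                    # master switch / size policy
        return ("done", (src, "whole"))
    slices = split_video(src, n_slices)              # split into N slices
    results = [transcode_slice(s) for s in slices]   # N subtasks, run "in parallel"
    merged = [r[1] for r in results]                 # download and merge shards
    return ("merged", src, len(merged))              # then call back the client

outcome = run_job("input.mp4", 4)
```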
Subtask scheduling
In the scheduling system, each user's task queue is independent, with a separately configured task quota. When the Dispatch Center receives a fetch-job request from a compute node, the scheduler first selects, from the user queues, the one with the smallest proportion of used quota (a simple model is the number of scheduled tasks divided by the user's total quota), then returns from its head a subtask that matches the compute node's criteria. Subtask scheduling differs from ordinary task scheduling in priority and node selection and needs separate design; we give a brief introduction here.
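The fair-share queue selection just described can be sketched as follows; the queue representation is an assumption for illustration.

```python
def pick_user_queue(queues):
    """queues: user -> {"used": scheduled tasks, "quota": total quota, "tasks": [...]}.
    Return the non-empty queue whose used/quota ratio is smallest, so the
    tenant with the least consumed share is served first."""
    candidates = {u: q for u, q in queues.items() if q["tasks"]}
    if not candidates:
        return None
    return min(candidates, key=lambda u: candidates[u]["used"] / candidates[u]["quota"])

queues = {
    "tenant_a": {"used": 5, "quota": 10, "tasks": ["t1"]},
    "tenant_b": {"used": 1, "quota": 10, "tasks": ["t2", "t3"]},
    "tenant_c": {"used": 0, "quota": 10, "tasks": []},   # empty queue: skipped
}
chosen = pick_user_queue(queues)
```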
- Subtask priority: subtasks need not be re-queued in their user queues; the goal is to schedule them immediately. The parent task has already been scheduled, and sharding exists only to speed up its execution, so making the slices compete with not-yet-scheduled tasks would be unfair to the task and would weaken the acceleration. Slicing tasks are therefore placed in a global high-priority queue that is selected first.
- Subtask scheduling node selection
Subtask node selection is mainly affected by the following factors:
- Machine type
Machine types are divided into hardware transcoding machines and ordinary transcoding machines. Because the encoders used in the two environments differ, the merged video might show defects, so we schedule the subtasks onto the same machine type as the parent task.
- Code version
Different code versions may produce shards that cannot be merged together properly. When such a version iteration occurs, the code version on a compute node's worker determines which other compute nodes the subtasks can be scheduled to.
- Data locality
When tasks on the parent worker run concurrently, multiple uploads and downloads proceed at the same time, lengthening the IO stage for shard files. Subtasks are therefore preferentially executed on the parent worker itself, saving network upload and download time.
The straggler problem
In shard transcoding, the straggler problem refers to the situation in which most subtasks have completed but a few remain, so the parent worker cannot enter the next step for a long time and the task is blocked. This is a common phenomenon in distributed systems, and research papers on the problem are plentiful.
How this problem is solved greatly affects system efficiency. If the parent worker simply waits for the subtasks, the task may wait a long time, contrary to our original intention of speeding things up. On the principle that a task must complete within bounded time, we have the following optimization directions:
1. Redundant scheduling
This solution borrows MapReduce's answer to stragglers in Hadoop (speculative execution): when the timeout is reached and a subtask is still unfinished, the parent worker submits a new Task subtask for the same shard file to the Dispatch Center for rescheduling and re-execution. When one of the copies completes, the other is cancelled.
The idea is to trade space for time: rather than pinning hope on a single node, let the copies race. But when this happens too often it creates many redundant tasks, and there is no guarantee the new subtasks will not block as well.
2. Parent-worker takeover
To address the shortcomings of redundant scheduling, we optimize it: when the timeout is reached and subtasks are still unfinished, the parent worker selects the shard with the least progress and transcodes it itself. As before, completion of one copy cancels the redundant other. If unfinished subtasks remain, the parent worker keeps selecting and transcoding them itself until all subtasks are complete.
The difference from the first scheme is that redundant tasks are not rescheduled to other workers; the parent worker executes them preferentially, transcoding shards continuously until the job is complete. The biggest advantage is that the parent worker never sits in an unbounded wait and no unbounded resources are consumed. Only in a few cases, when the parent worker itself is heavily loaded, are other workers with idle resources considered.
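Under stated assumptions about how subtask state is tracked (a start time, a progress percentage, and a done flag), the takeover selection can be sketched as:

```python
def pick_straggler(subtasks, timeout_sec, now):
    """After the timeout, return the unfinished subtask with the least
    progress so the parent worker can transcode that shard itself;
    return None while no subtask qualifies yet."""
    late = [t for t in subtasks
            if not t["done"] and now - t["start"] > timeout_sec]
    return min(late, key=lambda t: t["progress"], default=None)

subtasks = [
    {"id": 0, "done": True,  "start": 0.0, "progress": 100},
    {"id": 1, "done": False, "start": 0.0, "progress": 35},
    {"id": 2, "done": False, "start": 0.0, "progress": 80},
]
victim = pick_straggler(subtasks, timeout_sec=60, now=120.0)  # slowest shard
```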
Subtask progress tracking
When the parent worker selects a subtask to execute redundantly, it needs to collect subtask progress and pick the slowest one. For progress calculation, we divide a transcoding into four stages: waiting for scheduling, download and preparation, transcoding execution, and upload and wrap-up.
The start of each stage marks a new overall progress level:
Waiting for scheduling 0% → Download and prepare 20% → Transcoding calculation execution 30% → Upload and finish 90% → End 100%
Transcoding execution spans the largest share of the progress bar (30% to 90%), and its speed cannot be predicted in advance, so the transcoding progress must be computed in detail: the transcoding execution flow periodically emits metric logs, from which progress is monitored and calculated.
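The stage boundaries above map to an overall percentage roughly as follows; the within-stage fraction for the transcoding stage would come from the periodic metric logs.

```python
STAGE_BOUNDS = [("waiting", 0), ("download", 20), ("transcode", 30),
                ("upload", 90), ("end", 100)]

def overall_progress(stage, stage_fraction=0.0):
    """Map (stage, fraction complete within that stage) to an overall percent,
    interpolating between this stage's start and the next stage's start."""
    names = [s for s, _ in STAGE_BOUNDS]
    starts = dict(STAGE_BOUNDS)
    if stage == "end":
        return 100.0
    lo = starts[stage]
    hi = starts[names[names.index(stage) + 1]]
    return lo + (hi - lo) * stage_fraction
```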
HLS/DASH encapsulation
HLS differs from other container formats in that it consists of multiple TS segment files plus M3U8 playlists. Sharding an HLS transcoding task would add complexity to transmitting and managing the shard videos. Our solution is therefore to first transcode the source video into MP4, merge the video on the parent worker, and then convert the merged whole into HLS encapsulation.
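The final step, converting the merged MP4 into HLS, is a plain remux since the shards are already transcoded. A sketch of the command (segment length and file names are illustrative):

```python
def hls_remux_cmd(merged_mp4, out_m3u8, segment_sec=6):
    """Re-wrap an already-transcoded MP4 into HLS (TS segments + M3U8 playlist)
    without touching the codec data."""
    return [
        "ffmpeg", "-i", merged_mp4,
        "-c", "copy",                   # no re-encode: shards are already final
        "-f", "hls",
        "-hls_time", str(segment_sec),  # target TS segment duration
        "-hls_list_size", "0",          # keep every segment in the playlist
        out_m3u8,
    ]

cmd = hls_remux_cmd("merged.mp4", "index.m3u8")
```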
Test data
By recording and comparing the speed of transcoding the same video to different resolutions, we find that each individual optimization improves transcoding speed to a different degree. In actual online scenarios, we usually combine several optimization methods according to user settings and video attributes.
Test video 1 properties:
Duration: 00:47:19.21, bitrate: 6087 kb/s
Stream #0:0: Video: h264 (High), yuv420p, 2160×1216, 5951 kb/s, 30 fps
Stream #0:1: Audio: aac (LC), 44100 Hz, stereo, fltp, 127 kb/s
Test Video 2 properties:
Duration: 02:00:00.86, bitrate: 4388 kb/s
Stream #0:0: Video: h264 (High), yuvj420p, 1920×1080, 4257 kb/s, 25 fps
Stream #0:1: Audio: aac (LC), 48000 Hz, stereo, fltp, 125 kb/s
Conclusion
That concludes this article. The NetEase Yunxin transcoding team improves video transcoding speed mainly along the dimensions of scheduling optimization, hardware capability, self-developed encoding, and shard transcoding, and test results show a significant speed-up. We have also highlighted the main design of the shard transcoding module in the Yunxin transcoding system. We will keep exploring, accelerating, and covering more scenarios; in later articles in this series we will also share cluster resource scheduling algorithms and hardware transcoding practice. Welcome to keep following us.
About the author
Luo Weiheng is a senior server development engineer at NetEase Yunxin and a core member of the Yunxin transcoding team. He holds a master's degree from the School of Computer Science, Wuhan University, and is currently responsible for the design and development of Yunxin's media task processing system, committed to improving the quality of Yunxin's video transcoding service.