Takeaway

AI algorithms are widely used in the video entertainment industry. One of the biggest bottlenecks in video data processing is frame extraction latency, which often accounts for most of the total service time. In addition, the AI algorithms applied in different businesses have different requirements for video frame extraction.

Therefore, this article introduces an efficient, general-purpose frame extraction tool for AI video inference services that reduces overall service latency. To address the varied frame extraction needs of different AI algorithms, it provides generalized functionality across different usage scenarios.

AI algorithms are now used throughout iQIYI's AI video inference services. At present there are hundreds of services processing video data, and each service is composed of multiple algorithms. These AI algorithms have different requirements for input video data and are deployed on different hardware platforms, which poses various challenges for video inference services.

For example, the main challenges of the video audit (content review) service are as follows: to ensure a good user experience, the review must be completed within a very short time; videos uploaded by users come in a wide variety of encoding formats, so the frame extraction tool must support all of them; the review covers vulgar, gory, violent, political, harmful-to-children and other categories; and some algorithms are deployed on GPUs while others run on CPUs, so the tool must deliver very low latency on both.

For long-video businesses such as line generation, transition point detection, action recognition, and video segmentation, the main challenges are as follows: the frame extraction tool must guarantee the accuracy of its results, i.e., each extracted frame and its timestamp must match the original video exactly; and in high-throughput scenarios designed to improve resource utilization, long videos must still be processed as quickly as possible to improve the working efficiency of colleagues on the different business lines.

1. Overall service latency is high and hardware resource utilization is low

For example, a 1-hour, 25 FPS, 1080P video yields 90,000 images after full frame extraction, so a single job takes a long time and seriously hurts production efficiency. The AI video inference pipeline mainly consists of download, frame extraction, preprocessing, AI algorithm inference, post-processing, and upload, and frame extraction plus inference take up most of the time. For example, extracting all frames of a 1-hour H.264 1080P video as JPEG images on 4 cores of an Intel Xeon Gold 6148 CPU takes 760 seconds.

When preprocessing or frame extraction consumes CPU resources, the GPU may sit partially idle; likewise, a single algorithm model may consume only a fraction of the GPU's resources. Both lead to low GPU utilization.

2. Algorithm requirements vary greatly and deployment hardware differs

iQIYI's AI algorithms are flourishing in the video and image field, and different algorithms have different frame extraction requirements across businesses: how many frames to extract per second; whether images in various formats need to be saved during extraction; whether RGB data can be written directly to host memory or GPU memory; whether only key frames should be extracted; whether frames should be scaled or cropped during extraction; whether image timestamps are needed; and so on. This diversity makes it hard for any single existing solution to cover all requirements.

Survey of existing solutions

At present, video decoding hardware platforms mainly include CPUs, GPUs, FPGAs, and dedicated codec chips. FPGAs do not fully support AI algorithms, and dedicated decoder chips are too single-purpose, so CPUs and GPUs serve as the common codec hardware on the service side. The most commonly used CPU decoding tool is FFmpeg, which can satisfy most frame extraction needs of different AI algorithms. The CPU and GPU approaches are described below.

1. General CPU frame extraction schemes for AI algorithms

At present, the most common ways to feed FFmpeg-based CPU frame extraction into AI algorithms are as follows:

1. FFmpeg extracts video frames and saves them as images, which the AI algorithm then reads: the most traditional approach is to download the video, use FFmpeg to decode it and save images to disk, then have the AI algorithm read the images, preprocess them, run inference, post-process, and upload the results. This adds unnecessary time to overall service processing and has two major drawbacks:

A. The combined latency of frame extraction and inference is too high because every module blocks: each step can start only after the previous step has fully finished;

B. A lossless 1080P RGB image needs about 5 MB of storage, so a 1-hour 1080P video with all frames extracted and saved as raw images needs roughly 450 GB, which creates enormous storage pressure.

For this reason, extracted frames are usually saved in JPEG format. JPEG has a very high compression ratio: a 1080P JPEG often needs only about 0.1 MB, saving dozens of times the storage of raw images. Its defect is lossy compression: after a JPEG is read back, some information is lost relative to the original frame, which may reduce the accuracy of AI inference. Moreover, when the AI algorithm reads the image, the JPEG must first be decoded back to YUV and then converted to the RGB format the algorithm needs. Extracting frames to JPEG for the algorithm is therefore actually the worst policy, but some services genuinely need the images saved to disk, so the scheme is still used. A minimal sketch of this traditional pipeline follows.
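As a rough illustration (not iQIYI's internal tool), Scheme 1 can be approximated by driving FFmpeg's CLI from Python; every stage blocks on the previous one, and every frame round-trips through a lossy JPEG on disk:

```python
import glob
import os
import subprocess

import cv2  # OpenCV; used only to read the JPEGs back, assumed available

VIDEO, OUT_DIR = "input.mp4", "frames"  # illustrative paths
os.makedirs(OUT_DIR, exist_ok=True)

# Stage 1 (blocking): decode the whole video and write every frame as JPEG.
subprocess.run(
    ["ffmpeg", "-v", "error", "-i", VIDEO, f"{OUT_DIR}/%06d.jpg"],
    check=True,
)

# Stage 2 (blocking; starts only after stage 1 fully finishes): the AI side
# re-reads each JPEG, paying an extra decode cost and inheriting lossy artifacts.
for path in sorted(glob.glob(f"{OUT_DIR}/*.jpg")):
    bgr = cv2.imread(path)                      # JPEG -> YUV -> BGR internally
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)  # models usually expect RGB
    # run_inference(rgb)  # placeholder for the model call
```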

2. To address the defects of the above scheme, a better way to feed FFmpeg CPU frame extraction into AI algorithms is currently used: after the video is decoded to YUV, the color space is converted to RGB and kept in memory; the AI algorithm reads the RGB data directly from memory, and all stages are pipelined so that each stage runs asynchronously. A minimal pipelined sketch follows.
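This in-memory, pipelined scheme can be sketched as below (illustrative only, assuming a known resolution; FFmpeg's rawvideo output is piped straight into Python and a bounded queue decouples decoding from inference):

```python
import queue
import subprocess
import threading

W, H = 1920, 1080
FRAME_BYTES = W * H * 3  # rgb24
frames = queue.Queue(maxsize=64)  # bounded queue decouples the two stages

def decoder(video):
    # Producer: FFmpeg decodes straight to raw RGB on stdout; nothing hits disk.
    proc = subprocess.Popen(
        ["ffmpeg", "-v", "error", "-i", video,
         "-f", "rawvideo", "-pix_fmt", "rgb24", "-"],
        stdout=subprocess.PIPE)
    while True:
        buf = proc.stdout.read(FRAME_BYTES)
        if len(buf) < FRAME_BYTES:
            break
        frames.put(buf)
    frames.put(None)  # end-of-stream marker

threading.Thread(target=decoder, args=("input.mp4",), daemon=True).start()

# Consumer: inference runs concurrently with decoding instead of after it.
while (buf := frames.get()) is not None:
    pass  # preprocess + run the model on the in-memory RGB frame
```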

Compared with Scheme 1, Scheme 2 greatly reduces latency and no longer requires lossy image compression, so it preserves the original pixel information and avoids accuracy loss. However, today's videos are usually high resolution. For 1080P and 4K videos, frame extraction can take longer than the AI inference itself even in in-memory mode (in-memory: after decoding, YUV is converted to RGB and kept directly in host or GPU memory rather than written to disk). Especially after the AI model has been through graph optimization, operator optimization, and quantization, the main bottleneck of overall service latency becomes the long CPU frame extraction time: for a 1-hour 1080P video on a Xeon Gold 6148, in-memory extraction takes 350 seconds, while disk-write extraction (disk-write: the decoded YUV frames are re-encoded, usually as JPEG, and saved to non-volatile storage such as an HDD or SSD) takes 760 seconds. In addition, as a video entertainment company, iQIYI's AI algorithms often need accurate timestamps to map each extracted frame back to its exact position in the video, while open-source FFmpeg cannot directly provide accurate per-frame timestamps during extraction.

2. General GPU frame extraction schemes for AI algorithms

GPU frame extraction based on NVIDIA's hardware decoder is significantly faster than CPU extraction with FFmpeg: H.264 1080P video can be decoded at over 500 FPS on a V100 GPU and over 1000 FPS on a T4. GPU frame extraction therefore has much lower latency than CPU extraction; its main defects are as follows:

1. Compared with FFmpeg it provides far fewer functions; it lacks features such as extracting N frames per second, extracting only key frames, and saving JPEG images after decoding;

2. Decoding supports only some formats and cannot cover all cases;

3. The decoded image stays in GPU memory, while the preprocessing the AI algorithm needs before inference still runs on the CPU. This forces time-consuming data transfers and unnecessary latency, especially when every frame must be processed: the CPU-GPU copies in both directions take a long time and cannot be fully hidden by computation.

The ideal fix is to implement the preprocessing directly on the GPU with CUDA kernels. However, with so many services, hand-writing CUDA optimizations for every algorithm's preprocessing costs too much manpower, so that approach cannot be extended to all services. A hedged sketch of GPU-side preprocessing follows.
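One generic way to approximate GPU-side preprocessing (a sketch assuming CuPy is available; it stands in for the hand-written CUDA kernels discussed above, and the mean/std values are illustrative) is to keep the frame on the GPU and express the work as array operations:

```python
import cupy as cp  # assumed available; substitutes for custom CUDA kernels

# ImageNet-style normalization constants, illustrative values only.
MEAN = cp.asarray([0.485, 0.456, 0.406], dtype=cp.float32)
STD = cp.asarray([0.229, 0.224, 0.225], dtype=cp.float32)

def preprocess_on_gpu(rgb_u8):
    """rgb_u8: (H, W, 3) uint8 CuPy array already resident in GPU memory."""
    x = rgb_u8.astype(cp.float32) / 255.0
    x = (x - MEAN) / STD          # broadcasts over the channel axis, on GPU
    return x.transpose(2, 0, 1)   # HWC -> CHW, as most frameworks expect
```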

4. Most current AI algorithms are written in Python, which makes it difficult to call GPU frame extraction directly. Although NVIDIA provides tools that let Python drive GPU decoding, they impose many environment constraints and sometimes conflict with the AI algorithm's dependencies, so they cannot satisfy most AI services.

Implementation of a general, efficient frame extraction scheme for video inference

Based on the above survey, this section elaborates on the optimizations and functional improvements of frame extraction on CPU and GPU, the addition of a Python interface, and the pipelining of the frame extraction tool with the AI algorithm in the overall workflow.

1. Improvement and optimization of CPU frame extraction

(1) Accurately obtaining the timestamp of each extracted frame: a video carries two kinds of timestamps, the presentation timestamp (PTS) and the decoding timestamp (DTS). The DTS tells the decoder in which order frames enter the decoder, while the PTS is the point in time at which a frame is actually displayed during playback. If the video contains no B-frames, DTS and PTS follow the same order; once B-frames and P-frames are present, their orders differ. In AI services the timestamp that matters is the PTS, which corresponds to the actual playback time of a frame. The inference results produced for extracted frames must align exactly with PTS positions in the video, which strictly requires that the PTS obtained during extraction be completely accurate. However, FFmpeg does not directly return each frame's PTS during extraction, so this scheme optimizes FFmpeg's output control logic to guarantee that the PTS of every extracted frame matches its position in the original video stream. The underlying concept is illustrated below.
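Since the modified FFmpeg output logic is internal to the tool, the concept can instead be illustrated with PyAV (a sketch, assuming the `av` package is installed), which exposes each decoded frame's PTS:

```python
import av  # PyAV: Python bindings for FFmpeg's libraries

with av.open("input.mp4") as container:
    stream = container.streams.video[0]
    for frame in container.decode(stream):
        # frame.pts is expressed in stream.time_base units; multiplying by
        # the time base yields the presentation time in seconds.
        seconds = float(frame.pts * stream.time_base)
        print(frame.pts, round(seconds, 3))
```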

(2) Acceleration of CPU frame extraction in both in-memory and disk-write modes: this scheme trades resources for speed. For the disk-write mode, multiple threads each extract a portion of the video. For the in-memory mode, the frames produced by multi-threaded segment extraction must be handed to the algorithm in order: each worker thread is responsible for several short time slices, the frames each thread produces are ordered and calibrated by timestamp, and the data delivered by the workers in turn is guaranteed to be identical to the result of single-threaded extraction. A simplified sketch is shown below the figure caption.

Figure 4: Optimization of multi-threaded in-memory CPU frame extraction
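A simplified sketch of the segmented-extraction idea (illustrative only; plain `-ss` seeking lands on keyframes, so the real tool additionally calibrates segment boundaries by timestamp, as described above):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

W, H = 1920, 1080
FRAME_BYTES = W * H * 3  # rgb24

def decode_segment(video, start, dur):
    # Each worker decodes one time slice to raw RGB with its own FFmpeg process.
    cmd = ["ffmpeg", "-v", "error", "-ss", str(start), "-t", str(dur),
           "-i", video, "-f", "rawvideo", "-pix_fmt", "rgb24", "-"]
    raw = subprocess.run(cmd, capture_output=True, check=True).stdout
    return [raw[i:i + FRAME_BYTES] for i in range(0, len(raw), FRAME_BYTES)]

def parallel_extract(video, total_sec, workers=4):
    step = total_sec / workers
    with ThreadPoolExecutor(workers) as pool:
        parts = pool.map(decode_segment, [video] * workers,
                         (i * step for i in range(workers)),
                         [step] * workers)
    # map() preserves submission order, so concatenating the segments yields
    # the same frame sequence a single-threaded decode would produce.
    return [f for part in parts for f in part]
```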

2. Improvement and optimization of GPU frame extraction

(1) Added and improved GPU frame extraction functions: after discussing requirements with the algorithm and business teams, the scheme adds GPU support for extracting N frames per unit time, extracting only key frames, extracting frames from a given time range, obtaining accurate timestamps, and saving JPEG and other image formats during extraction. Saving a JPEG requires encoding the YUV image, and that encoding consumes CPU time; to reduce the latency of JPEG saving, CUDA kernels were written to encode YUV images into JPEG on the GPU, reaching a peak of 3000 FPS on a V100. The N-frames-per-second sampling rule can be sketched as follows.
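The sampling rule reduces to a simple PTS-based filter over decoded frames (a sketch of the selection logic only, not the tool's implementation):

```python
def sample_n_per_second(frames, n):
    """frames: iterable of (pts_seconds, image) in presentation order.
    Keeps the first frame at or after each 1/n-second boundary."""
    next_t = 0.0
    for ts, img in frames:
        if ts >= next_t:
            next_t += 1.0 / n
            yield ts, img
```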

(2) Keeping extracted frames directly in GPU memory: some important services must return results promptly, and the business wants latency minimized. This scheme therefore decodes the video to YUV on the GPU and calls CUDA kernels there to perform the YUV-to-RGB conversion and all other preprocessing, minimizing data copies between CPU and GPU. A hedged sketch of the color conversion step follows.
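The YUV-to-RGB step can be sketched with CuPy array operations using BT.601 full-range coefficients (an assumption for illustration; the production version uses dedicated CUDA kernels and must match the video's actual color space):

```python
import cupy as cp

def nv12_to_rgb(y, uv):
    """y: (H, W) uint8 luma plane; uv: (H/2, W/2, 2) interleaved chroma,
    both already resident in GPU memory (NV12 layout, BT.601 full range)."""
    yf = y.astype(cp.float32)
    # Upsample chroma 2x along both axes to match the luma resolution.
    u = cp.repeat(cp.repeat(uv[..., 0].astype(cp.float32) - 128, 2, 0), 2, 1)
    v = cp.repeat(cp.repeat(uv[..., 1].astype(cp.float32) - 128, 2, 0), 2, 1)
    r = yf + 1.402 * v
    g = yf - 0.344136 * u - 0.714136 * v
    b = yf + 1.772 * u
    return cp.clip(cp.stack([r, g, b], axis=-1), 0, 255).astype(cp.uint8)
```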

(3) Adding the option to copy frames back to host memory, and letting the AI algorithm call GPU frame extraction and CUDA functions directly from Python: since most current AI algorithms are developed in Python and it is impractical for developers to rewrite them all, this scheme uses Pybind11 so that the C++ GPU frame extraction code and Python can call each other conveniently. The same mechanism is also provided for latency-sensitive services that want to call CUDA preprocessing functions from the Python side. A hypothetical usage sketch follows.
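What calling such a binding from Python might look like (entirely hypothetical names: the module, class, and parameters below are illustrative stand-ins, not the actual iQIYI API):

```python
# Hypothetical pybind11-built extension; not a real published package.
import gpu_frame_extractor as gfe

extractor = gfe.Extractor(gpu_id=0)              # hypothetical constructor
for pts, frame in extractor.decode("input.mp4",
                                   fps=1,          # N frames per second
                                   to_host=False): # keep data in GPU memory
    pass  # frame stays on the GPU; hand it straight to the model
```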

Figure 5 shows the overall framework of GPU frame extraction. The video data and the frame extraction requirements are passed to the GPU frame extraction module through Pybind11. The CUDA initialization module creates the CUDA context only once, in the main thread, and worker threads push that context during decoding, avoiding reinitialization for every extraction job. After a frame is decoded, the post-processing module completes the color space conversion and satisfies the AI algorithm's image processing requirements; if the algorithm wants its pre- and post-processing on the GPU, the corresponding CUDA kernels are called, otherwise the image data is output directly to the AI algorithm. In addition, GPU frame extraction and AI inference run in parallel, and the other stages execute asynchronously, reducing latency as much as possible.

Figure 5: Framework of the optimized GPU frame extraction

3. Optimization of the overall process

GPU frame extraction has lower latency than CPU extraction, but it supports only common formats such as H.264, H.265, and VPx; it does not support less common encodings such as H.263, and whether a user-uploaded video can be GPU-decoded cannot be known in advance. In this scheme, ffprobe is used to obtain the video's encoding format, and the video is then routed to GPU or CPU decoding accordingly. In addition, considering that the input data may not be a valid video and that AI algorithms can have bugs, the scheme implements detailed exception handling and log management. A minimal probe-and-route sketch is shown below.
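The routing decision can be sketched as follows (the ffprobe flags are real; the supported-codec set is illustrative and should match the NVDEC capabilities of the actual deployed GPUs):

```python
import subprocess

GPU_CODECS = {"h264", "hevc", "vp8", "vp9"}  # illustrative NVDEC-decodable set

def pick_decoder(path):
    # Ask ffprobe for the first video stream's codec name.
    codec = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=codec_name",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return "gpu" if codec in GPU_CODECS else "cpu"
```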

Conclusion

To meet different application scenarios and requirements, this solution was optimized from different angles for different businesses. In the short-video "publish first, review later" business, it meets the requirement that a 5-minute video be reviewed within 30 seconds; in the low-brow content detection sub-service, for example, performance improved 10x compared with the previous approach of extracting frames with FFmpeg, saving images, and then processing them. In the long-video "line generation" service, frames previously had to be extracted on CPU and then uploaded to a cloud GPU container for inference; after optimization, this improved 10.6x. In the long-video "transition point analysis" service, the algorithm input requires JPEG images of every frame, so although overall performance still doubled, the improvement is smaller than with the in-memory scheme. Evidently, to minimize overall service latency, the best approach is to keep the decoded data in GPU memory, without writing to disk, for the AI algorithm to consume directly.

As AI video inference services are widely used across iQIYI's business lines, the AI service team needs to provide rich AI algorithm models while reducing processing time as much as possible, from the perspectives of saving hardware resources, improving work efficiency, and meeting the needs of latency-sensitive businesses. Further work will therefore proceed on several fronts: migrating existing high-latency online services, improving the frame extraction tool, providing an acceleration function library for algorithm pre- and post-processing, and further optimizing the AI algorithm models.
