Editor | Debra

This issue is excerpted from a white paper entitled "How Cloud Service Providers (CSPs) Can Benefit from the Intel® Xeon® Skylake Platform for Media Cloud Applications," written by Intel's Yang Lu. To receive the full text of the white paper, reply with the keyword "Intel" in the AI Front public account.

We are pleased to invite Mr. Lu Yang, senior architect in Intel's Data Center Marketing Department, to share with us how to build and improve high-performance video cloud services.

First, let me introduce today's lecturer: Lu Yang, a senior architect in Intel's Data Center Marketing Department, mainly responsible for Internet customers in China. He has more than 10 years of development and Internet experience at Intel. He works closely with front-line ISVs and customers in the Asia-Pacific region around the Intel IA architecture, helping customers with system performance measurement and tuning, platform migration and upgrades, and in-depth technical cooperation. This presentation covers the following six aspects:

  • Media cloud computing industry status quo

  • Impact of instruction set upgrade

  • Traditional video processing applications

  • HEVC/H.265 application

  • Image processing application

  • Deep learning applications for video and images

Media cloud computing industry status quo

Media processing is one of the hottest and fastest-growing applications in the current cloud computing industry, especially the video and image processing applications and services that consume large amounts of computing and storage resources. Cloud service providers (CSPs) have been working to improve the efficiency of media cloud computing, especially the performance of video processing, analytics, search, and streaming media.

This presentation introduces the new technologies of Intel's Xeon platform and their performance improvements for media cloud computing. It explains how the new SIMD (Single Instruction, Multiple Data) AVX-512 instruction set can help improve performance for video processing, image processing, and video deep learning applications.

Media cloud computing applications and services have been growing rapidly in recent years. By 2020, video will account for 82 percent of all traffic consumed by Internet users.

In recent years, emerging media cloud computing applications have appeared one after another: video transcoding, video understanding and analysis, deep learning, surveillance, video search, video broadcasting, conferencing, cloud gaming, and artificial intelligence (AI).


The more complex cloud computing scenarios and business requirements become, the greater the challenge to back-end server processing power, especially as 4K/8K video formats rise and mature. 4K and 8K video formats offer higher video quality and a better end-user experience, but they also require more processing power and more storage and network bandwidth.


Nowadays, huge numbers of videos and pictures are generated, uploaded, and downloaded every day. Processing this massive amount of data in the most efficient way is a great challenge for media cloud service providers (CSPs). In China, media cloud service providers are striving to find the most efficient media cloud solutions and platforms to achieve the best performance while maintaining video quality, ensuring a better user experience and more reasonable cost.

Impact of instruction set upgrade

The basic media cloud computing modules include video transcoding, editing, feature extraction, and analysis. Among them, video transcoding consumes the most computing resources and is the basis for further processing and analysis. Intel's SIMD vectorization technology is key to optimizing these computation-intensive operations.

The current Xeon platform integrates AVX-512 technology. This presentation covers the performance characteristics and coding methods of this technology, and how to use the IA architecture platform to solve technical problems and achieve performance gains. The solution provides media cloud customers with the following advantages:

  • Efficient video processing solutions for the cloud service provider industry. Basic video encoding and transcoding performance improves by 2 to 4 times, which can significantly increase the throughput of a media processing cluster.

  • High-performance image processing solutions for a variety of online applications. Reduce online image processing latency and bandwidth.

  • Accelerate deep learning algorithms to help eliminate performance bottlenecks in applications from emerging cloud service providers.

Across generations of x86 platforms, the instruction set has evolved from MMX through SSE, AVX, and AVX2 to AVX-512, and the vector width has grown from 64 bits to 512 bits (the sketch after the list below shows the same operation written at each width):

MMX: 8x 64-bit registers (MM0–MM7)

SSE: 8x 128-bit registers (XMM0–XMM7); 4 single-precision FP values per XMM register

SSE2: wider integer vectors (128-bit); double-precision FP in XMM registers

SSE3: upgraded vector instructions

SSE4 (4.1 and 4.2): 16x 128-bit registers (XMM0–XMM15), with new instructions (47 + 7 new instructions)

AVX: 256-bit wide SIMD for floating-point calculations; 16x 256-bit registers (YMM0–YMM15)

AVX2: 256-bit SIMD for integer computation, plus new FMA (fused multiply-add) instructions and other extended instructions

AVX-512: 512-bit SIMD instructions for integer and floating-point calculations, with two 512-bit FMA units
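To make the width progression concrete, here is a minimal sketch (our illustration, not taken from the white paper) of the same element-wise 32-bit integer addition written at the SSE, AVX2, and AVX-512 widths. It assumes a compiler and CPU with AVX-512F support (e.g., GCC/Clang with -mavx512f or -march=skylake-avx512) and array lengths that are multiples of the vector width:

```c
#include <immintrin.h>
#include <stdint.h>

/* The same element-wise 32-bit integer addition at three vector widths.
   For simplicity, n is assumed to be a multiple of the lane count. */
void add_sse(const int32_t *a, const int32_t *b, int32_t *c, int n) {
    for (int i = 0; i < n; i += 4) {              /* 4 lanes per __m128i  */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(c + i), _mm_add_epi32(va, vb));
    }
}

void add_avx2(const int32_t *a, const int32_t *b, int32_t *c, int n) {
    for (int i = 0; i < n; i += 8) {              /* 8 lanes per __m256i  */
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(c + i), _mm256_add_epi32(va, vb));
    }
}

void add_avx512(const int32_t *a, const int32_t *b, int32_t *c, int n) {
    for (int i = 0; i < n; i += 16) {             /* 16 lanes per __m512i */
        __m512i va = _mm512_loadu_si512((const void *)(a + i));
        __m512i vb = _mm512_loadu_si512((const void *)(b + i));
        _mm512_storeu_si512((void *)(c + i), _mm512_add_epi32(va, vb));
    }
}
```

Each step up doubles the number of 32-bit lanes processed per instruction, from 4 to 8 to 16, which is exactly where the peak-throughput doubling described below comes from.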

The new Xeon platform introduces 512-bit vectorization technology that delivers up to a two-fold increase in peak performance compared to AVX2. These microarchitecture-level upgrades, especially SIMD vectorization, can significantly improve video and image processing performance.

Traditional video processing applications

So how can AVX-512 technology be used to improve the performance of media cloud computing? In traditional video processing, offline video transcoding applications are typically CPU- and memory-intensive, and their most time-consuming core code can be optimized with SIMD vectorization.

Some functions cannot be automatically vectorized by the compiler, and there is no comparable high-performance API to call directly, so we need to analyze the code and manually rewrite it with SIMD vectorization instructions according to the developer manuals and related documentation. If you refactor such code with the SIMD instruction set, you can expect performance gains well beyond what compiler tuning alone delivers.
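To make this concrete, consider the sum of absolute differences (SAD), a classic motion-estimation hotspot in video encoders. The sketch below is our simplified illustration (not x264's actual code): the scalar loop is difficult for a compiler to map onto the dedicated PSADBW instruction, while the SSE2 rewrite uses it directly via the _mm_sad_epu8 intrinsic:

```c
#include <immintrin.h>
#include <stdint.h>

/* Scalar sum of absolute differences over a 16x16 pixel block from two
   frames, given their row strides. This is the kind of hotspot loop that
   often has to be rewritten by hand with intrinsics. */
int sad_16x16_scalar(const uint8_t *a, int stride_a,
                     const uint8_t *b, int stride_b) {
    int sum = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++) {
            int d = a[x] - b[x];
            sum += d < 0 ? -d : d;
        }
        a += stride_a;
        b += stride_b;
    }
    return sum;
}

/* SSE2 rewrite: _mm_sad_epu8 (PSADBW) computes the absolute differences
   of 16 byte pairs and sums them into two 64-bit halves per call. */
int sad_16x16_sse2(const uint8_t *a, int stride_a,
                   const uint8_t *b, int stride_b) {
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < 16; y++) {
        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        acc = _mm_add_epi64(acc, _mm_sad_epu8(va, vb));
        a += stride_a;
        b += stride_b;
    }
    /* Add the low and high 64-bit partial sums. */
    return _mm_cvtsi128_si32(_mm_add_epi64(acc, _mm_srli_si128(acc, 8)));
}
```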

Taking the traditional H.264 video transcoding program as an example, as shown in Figure 3, most of the hotspot functions have already been vectorized with SIMD (SSE/AVX2) instructions, but AVX-512 technology has not been used so far.


Figure 5: 8×4 block calculation and SIMD optimization diagram

Taking the basic 8×4 block computing function as an example, to make the function x264_pixel_satd_8x4 easier to understand, we designed the graph model in Figure 5. We can see that the 8×4 block SATD calculation requires two Hadamard transformations (first by row, then by column). We also notice that the Hadamard transformation has four inputs, so we need to store the data in four separate SIMD registers for parallel computation. We can load each 128 bits of data (i.e., the 8x 16-bit integers in each black box) into an __m128i, forming a total of 4x __m128i to be used as input for the first Hadamard transformation. In addition, a matrix transpose is needed to form the second Hadamard transformation (by column). To test the performance of the SIMD vectorization implementation, we first randomly generate the elements of two pixel matrices using rand() % 10. To ensure accuracy, we tested the functions separately instead of in batches; the performance results are shown in Table 2.
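For reference, the 4-point Hadamard butterfly at the heart of this computation can be sketched with SSE2 intrinsics as follows. This is our illustrative sketch of the "four inputs in four SIMD registers" idea, not x264's exact code:

```c
#include <emmintrin.h>  /* SSE2 */

/* 4-point Hadamard butterfly applied in parallel across the 8x 16-bit
   lanes of four rows; an SATD kernel applies it once by row and once by
   column (after a transpose). */
static inline void hadamard4(__m128i *r0, __m128i *r1,
                             __m128i *r2, __m128i *r3) {
    __m128i s0 = _mm_add_epi16(*r0, *r1);   /* a+b */
    __m128i d0 = _mm_sub_epi16(*r0, *r1);   /* a-b */
    __m128i s1 = _mm_add_epi16(*r2, *r3);   /* c+d */
    __m128i d1 = _mm_sub_epi16(*r2, *r3);   /* c-d */
    *r0 = _mm_add_epi16(s0, s1);            /* a+b+c+d */
    *r2 = _mm_sub_epi16(s0, s1);            /* a+b-c-d */
    *r1 = _mm_add_epi16(d0, d1);            /* a-b+c-d */
    *r3 = _mm_sub_epi16(d0, d1);            /* a-b-c+d */
}
```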

From Table 2, we can see that the best SIMD vectorization implementation in each row (highlighted in red) achieves a three-to-five-fold speedup compared to the original version. Specifically, in the 8×4 block calculation example, the SSE code achieved a 3.26-fold performance improvement over the original version and outperformed the other two SIMD implementations.

The reason is that the 8×4 block makes full use of SSE's 128-bit registers, whereas AVX2 and AVX-512 both waste some register space. Importantly, however, the AVX-512 code achieved the best performance in the 8×16 and 16×16 block calculations, of which 16×16 was the target function in our initial profiling (that is, x264_pixel_satd_16x16).

Based on the programming model and the AVX-512 vectorization code examples described above, customers can rewrite the hot functions in their own video applications to achieve maximum performance gains.

HEVC/H.265 application

Video coding standards have evolved mainly through the familiar ITU-T and ISO/IEC standards. H.265/HEVC (High Efficiency Video Coding), introduced in 2013, is the latest video codec standard from ISO/IEC and ITU-T; it maximizes compression capability while minimizing quality loss. HEVC/H.265 helps video cloud service providers deliver high-quality video at lower bandwidth and further supports ultra-high-resolution video services in 4K (4096×2160) and 8K (7680×4320).

The computational complexity of H.265/HEVC codecs is more than four times that of H.264/MPEG-4 AVC, placing unprecedented demands on the video processing power of back-end server platforms, and SIMD vectorization on x86 has proven to significantly improve HEVC encoding performance.

As shown in Figure 6, SIMD (SSE/AVX2) instructions have already been integrated into the H.265/HEVC source code, but AVX-512 has not been enabled. Next, we will use the DCT (discrete cosine transform) as an example to illustrate how to use the AVX-512 instruction set to accelerate HEVC video processing applications, including the 4×4, 8×8, 16×16 (the eighth hottest function), and 32×32 (the fourth hottest function) calculations.

To find the source code for the DCT, we disabled ASM in the x265 ccmake configuration and then used VTune to view the source code. The source code for DCT32 is shown in Figure 7.

Let's first show how to use SIMD for the basic 4×4 block calculation on Intel's new Xeon platform, and then how to use AVX-512 to accelerate the calculation further.

Figure 8 shows the core concept of DCT4. As we can see, the basic operation in DCT4 is a matrix product. Thus, we can load each 128 bits of data (that is, the 4x 32-bit integers in each black box) into an __m128i and perform SIMD operations on it, such as _mm_mullo_epi32, _mm_add_epi32, and _mm_srai_epi32. Unlike the SATD optimization in x264, AVX2 and AVX-512 can further accelerate DCT4 by reducing the number of loop iterations. In particular, AVX2 can reduce the number of iterations from 4 to 2 by loading 2×4×32 = 256 bits of data into an __m256i at a time, while AVX-512 can further reduce the number of iterations to 1 by loading 4×4×32 = 512 bits of data into an __m512i. The 8×8, 16×16, and 32×32 block calculations can be optimized with an approach similar to the 4×4 example above: SSE first accelerates the matrix products within each iteration, while AVX2 and AVX-512 accelerate further by reducing the number of iterations.
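A sketch of one such multiply-accumulate step is shown below. Each output row of DCT4 is a dot product of the four source rows with four constant coefficients, followed by rounding and a right shift; the coefficients c0..c3 and the shift are illustrative parameters of ours, not HEVC's exact constants:

```c
#include <smmintrin.h>  /* SSE4.1 for _mm_mullo_epi32 */

/* One output row of a 4x4 DCT: dst = (c0*s0 + c1*s1 + c2*s2 + c3*s3 +
   round) >> shift, computed across the 4x 32-bit lanes of each __m128i. */
static inline __m128i dct4_row(__m128i s0, __m128i s1,
                               __m128i s2, __m128i s3,
                               int c0, int c1, int c2, int c3, int shift) {
    __m128i acc = _mm_mullo_epi32(s0, _mm_set1_epi32(c0));
    acc = _mm_add_epi32(acc, _mm_mullo_epi32(s1, _mm_set1_epi32(c1)));
    acc = _mm_add_epi32(acc, _mm_mullo_epi32(s2, _mm_set1_epi32(c2)));
    acc = _mm_add_epi32(acc, _mm_mullo_epi32(s3, _mm_set1_epi32(c3)));
    acc = _mm_add_epi32(acc, _mm_set1_epi32(1 << (shift - 1))); /* rounding */
    return _mm_srai_epi32(acc, shift);
}
```

The AVX2 and AVX-512 variants are structurally identical; they replace __m128i with __m256i or __m512i to process two or four rows per call, which is exactly the loop-count reduction described above.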

To test the performance of the SIMD implementations, we randomly generate the elements of the source matrix using rand() % 40 - rand() % 40. The test results are shown in Table 3.

As you can see from Table 3, the AVX-512 code was the most efficient choice for the 4×4, 8×8, and 32×32 data sets, providing 44.46%, 70.45%, and 37.60% performance improvements over the initial code, respectively. In theory, AVX-512 should be the most efficient choice in all cases, including DCT4, DCT8, DCT16, and DCT32.

However, initialization operations such as _mm512_set_epi32 are time-consuming and therefore cancel out some of that advantage. In DCT16, SSE can use constant initialization operations (such as _mm_set1_epi32) and achieves the best performance, 43.50% faster than the initial code.

Image processing application

To optimize performance on x86 platforms and make it easy for customers to leverage and deploy advanced Intel architecture technologies, Intel has developed a complete set of high-performance libraries and tools for client and server platforms across a variety of domains, such as the system performance analysis tool VTune, the Intel compiler ICC, the Math Kernel Library MKL, cluster analysis tools, the Integrated Performance Primitives library IPP, the graphics and image development kit Media SDK, and the multi-threaded programming library TBB.

IPP provides optimized, thread-parallel and vectorized implementations for the following applications and algorithms (a minimal usage sketch follows the list):

  1. Image, video and audio processing

  2. Data communication

  3. Data compression and encryption

  4. Signal processing, etc.
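As a minimal usage sketch (a toy example of ours, assuming the IPP headers and libraries are installed and linked), note how a single call hides all CPU-specific SIMD dispatch:

```c
#include <ipp.h>    /* Intel IPP headers; link against the IPP libraries */
#include <stdio.h>

/* Minimal IPP usage sketch: the library dispatches each call to the best
   SIMD (SSE/AVX2/AVX-512) code path for the running CPU. */
int main(void) {
    ippInit();                        /* select the optimal CPU code path */
    Ipp32f a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    Ipp32f b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    Ipp32f c[8];
    ippsAdd_32f(a, b, c, 8);          /* vectorized element-wise addition */
    printf("c[0]=%.1f c[7]=%.1f\n", c[0], c[7]);
    return 0;
}
```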

GraphicsMagick is a common image processing library that is widely used in many cloud processing applications. To better support existing image processing applications, IPP 2018 provides a set of performance-enhancing plug-ins and patches that map a series of IPP library functions to the GraphicsMagick API, delivering performance improvements with minimal manual intervention. In this way, many image processing functions, such as resizeImage, scaleImage, GaussianBlurImage, flipImage, and flopImage, are optimized. Below, image scaling is used as an example to evaluate the performance of IA SIMD technology, especially the contribution of the AVX-512 vectorization instruction set on the new Xeon platform.

Figure 9 shows the results of optimizing the initial GraphicsMagick functions with the IPP image scaling API. Since the IPP library implements and integrates multiple SIMD instruction sets, AVX-512 performance is more than 20% higher than AVX2 for these five standard image scaling applications.

Deep learning applications for video and images

Video and image files contain a great deal of valuable information, such as time and place, people and their behavior, and even what people are wearing and how the environment changes.

As shown in Figure 10, we can use data mining and machine learning to extract a great deal of useful information according to its type, and then analyze it to draw the conclusions we need: for example, fighting piracy, helping to find criminal suspects or missing persons, analyzing relationships between videos, and promoting more business models based on people's interests. The huge potential business value of advertising and promotion based on video content has prompted media cloud customers to invest more resources in video analysis and further data mining and deep learning applications.

Deep learning technology is one of the fastest growing areas in cloud computing data centers and has become the latest major force driving the server market. Many media cloud service providers have already begun to delve into and develop this area.

High-dimensional, highly parallel video and image processing is a typical computation-intensive application: extracting and analyzing video data consumes a great deal of computing resources, and x86 SIMD vectorization instructions can greatly improve computational throughput and program execution efficiency.

Intel has developed high-performance libraries to optimize the performance of deep learning applications. The Intel Math Kernel Library (Intel MKL) is designed to accelerate mathematical functions for machine learning, science, engineering, finance, and design applications, covering dense and sparse linear algebra (BLAS: Basic Linear Algebra Subprograms; LAPACK: linear algebra routines; PARDISO: a sparse matrix solver), FFT, vector math, summary statistics, deep neural networks, and more. These traditional deep learning programs maximize processor performance through optimal threading and SIMD vectorization.
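For example, a small SGEMM through MKL's standard CBLAS interface (a toy sketch of ours; link against MKL) lets the library choose its AVX2 or AVX-512 kernels at run time:

```c
#include <mkl.h>
#include <stdio.h>

/* Minimal MKL sketch: a small SGEMM (C = A*B), the core primitive behind
   dense layers and im2col-based convolutions in deep learning frameworks.
   MKL selects the AVX2/AVX-512 kernel for the running CPU automatically. */
int main(void) {
    const int n = 4;
    float A[16], B[16], C[16];
    for (int i = 0; i < 16; i++) { A[i] = 1.0f; B[i] = 2.0f; C[i] = 0.0f; }
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, A, n, B, n, 0.0f, C, n);
    printf("C[0] = %.1f\n", C[0]);   /* expect 8.0 = sum of 4 * (1*2) */
    return 0;
}
```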

Taking Caffe's scoring and training programs as an example, Intel AVX2/AVX-512 instructions speed up the scoring process by two to nine times and the training process by two to four times, as shown below. The AVX-512 instructions on Intel's new Xeon platform provide 10% to 40% more performance than AVX2.


Media cloud computing applications are becoming more and more popular and are becoming an important part of the traditional data center and mobile Internet industries. For these data- and computation-intensive applications, the x86 platform has built a remarkable ecosystem for video, image, audio, and, further, deep learning computing.

This presentation has detailed the new AVX-512 technology and instruction set on Intel's new Xeon platform and demonstrated how it can be leveraged to optimize media cloud applications. With the emergence of new business and usage patterns and the steady upgrade of the IA platform, high-performance, high-reliability technology is bound to bring more benefits to more and more media processing applications.

Q&A session

Q1: Will the improvement of AVX-512 over AVX2 have any perceptible impact at the user level?

A1: Lu Yang: As mentioned in the article, the improvement of AVX-512 over AVX2 is in program execution efficiency; the user-visible effect is reduced program execution time and improved throughput.

Q2: Is the source code for Caffe optimized with AVX-512 open source? How can developers use it effectively?

A2: Lu Yang: Intel-optimized Caffe is open source. You can get it with git clone:

https://github.com/intel/caffe.git

In fact, Intel has optimized many open-source DL/ML projects, which can be found at software.intel.com.

Q3: Are MKL and IPP now free to use? Are they open source?

A3: Lu Yang: IPP can be downloaded from https://software.intel.com/en-us/intel-ipp

MKL can be downloaded from https://software.intel.com/en-us/mkl. Both are free to download; some, but not all, of the code is open source.

Q4: The presentation mentioned that AVX-512 instructions provide 10% to 40% more performance on Intel's new Xeon platform than AVX2. What is the main reason for this improvement?

A4: Lu Yang: The improvement of AVX-512 over AVX2 comes from doubling the bit width of the vector registers.

Q5: In terms of deep learning, how should hardware and software be combined? What is the role of hardware?

A5: Lu Yang: Chip and network upgrades first depend on demand: be clear about which workloads need gigabit, 10-gigabit, 25-gigabit, or 100-gigabit networking, and whether RDMA is needed. The computing capacity of the chip should match the corresponding network performance, and network optimization is also crucial. Using DPDK can greatly improve the CPU's network processing capacity and reduce cost; of course, DPDK requires an investment in the relevant technical capabilities, which increases labor cost. A comprehensive evaluation is needed to achieve the optimal TCO.

In fact, whether for deep learning or other Internet applications, optimal performance can only be achieved by upgrading software and hardware together. Upgrading the hardware architecture brings more powerful CPUs and newer microarchitecture improvements, including CPU frequency, core count, register architecture, cache-line upgrades, and memory and network improvements. The software's role is to detect and enable these new capabilities; otherwise, the advantages and technologies of the hardware platform cannot be exploited.