This article is based on Pranav Sodhani's WWDC 2021 session "Evaluate Videos with the Advanced Video Quality Tool." Pranav Sodhani is a member of Apple's display and color technology team with expertise in algorithm development, machine learning, color science, and video technology. The translator, Tao Jinliang, is a senior audio and video development engineer at NetEase Yunxin with years of end-to-end audio and video experience.

This article focuses on how the Advanced Video Quality Tool (AVQT) helps us accurately assess the perceived quality of compressed video files. Built on the AVFoundation framework, AVQT supports a wide range of video formats, codecs, resolutions, and frame rates in both SDR and HDR, enabling a simple and efficient workflow: there is no need, for example, to first decode videos to raw pixel formats. AVQT uses Metal to offload heavy pixel-level computation to the GPU, achieving high processing speeds and typically analyzing video faster than real-time frame rates. With its ease of use and computational efficiency, AVQT makes it possible to catch low-quality videos in a catalog before they reach users of an application.

Background

In this talk, we will introduce AVQT (Advanced Video Quality Tool) and show how it can be used to evaluate the perceived quality of compressed video in an application or content-creation workflow. Let’s start with a typical video delivery workflow.

In this workflow, a high-quality source video goes through video compression and optional video scaling to produce versions at lower bit rates. These lower-bit-rate videos can then be easily transmitted over bandwidth-constrained networks.

This workflow might use the AVFoundation API (for example, AVAssetWriter), an application such as Compressor, or your own video compression pipeline.

Scaling and compressing the source video, however, can introduce visible artifacts that lower its subjective quality. One example is the blocking artifact in compressed video, as shown in the frame on the right.

Another example is blurring, where details in the video begin to disappear. Such artifacts can adversely affect the consumer’s video quality experience, and consumers expect a high-quality video experience, so delivering high-quality video is very important.

The first step in doing this is to assess the quality of the delivered content. The most accurate way is to have real people watch the video and rate its quality, but if we want to evaluate a large amount of video, this is time-consuming and does not scale. Instead, we can use an objective measure of video quality so that the process can be automated for speed and scalability.

In this setup, the perceptual video quality tool takes the compressed video and the source video as input and outputs a video quality score, a floating-point number between 1 and 5 that simulates how real people would rate the compressed video.

What is AVQT?

Today, we are excited to offer developers the Advanced Video Quality Tool, or AVQT. Let’s learn more about AVQT.

So what exactly is AVQT? AVQT is a macOS command-line executable that attempts to mimic how real people rate the quality of compressed video. We can use it to calculate frame-level and segment-level scores, where a segment is usually a few seconds long. AVQT also supports all AVFoundation-based video formats, including SDR and HDR formats such as HDR10, HLG, and Dolby Vision.

Three main features of AVQT

Next, we’ll discuss three key attributes of AVQT that make it useful across applications. First, we’ll look at its consistency with subjective perception, then its fast computation speed, and finally why it’s important to set viewing parameters when predicting video quality. Let’s look at each in detail.

Subjective perception consistency

AVQT correlates closely with human perception of video quality and works across a variety of content types, such as animation, natural scenes, and sports. We found that traditional video quality metrics, such as PSNR and structural similarity (SSIM), do not track subjective quality consistently across different content types.

Let’s look at an example.

This is a frame from a high-quality sports clip, our first source video. Looking at the same frame in the compressed video, we can see that it still has sufficiently high perceived quality, with a PSNR score of about 35 and an AVQT score of 4.4.

Next, we run the same test on the second source video. The compressed video in this case has visible artifacts; in particular, we can see some artifacts on the face. Interestingly, it received roughly the same PSNR score of about 35 as the previous video, but this time AVQT rated it around 2.5, which indicates poor quality. We believe the AVQT score is the correct prediction here. This is just one example we chose to illustrate how evaluation across content types can go wrong.

We wanted to test AVQT’s perceptual accuracy on different video sets. Therefore, we evaluated it on publicly available video quality datasets. These data sets include source videos, compressed videos, and video quality scores provided by human subjects.

Here, we take a look at the results of two data sets: Waterloo IVC 4K and VQEG HD3:

  • Waterloo IVC 4K dataset: 20 source videos and 480 compressed videos covering encoding and scaling artifacts, spanning four different video resolutions and two different video coding standards.
  • VQEG HD3 dataset: relatively small, with 9 source videos and 72 compressed videos, generated by video encoding at 1080p resolution.

To measure the performance of a video quality metric objectively, we use the Pearson correlation coefficient and the RMSE distance measure (a small numerical sketch follows the list):

  • The Pearson correlation coefficient, or PCC, measures the correlation between the predicted score and the subjective score, with a higher PCC indicating a better correlation.
  • RMSE measures the difference between the predicted and the subjective scores, and a lower RMSE value implies higher prediction accuracy.

Now, we want to evaluate AVQT’s ability to predict the scores given by human subjects. In the figure below, the X-axis is the actual subjective video quality score and the Y-axis is the AVQT-predicted score; each point represents one compressed video.

As the scatter plot shows, apart from a few outliers, AVQT does a good job of predicting the subjective scores on this dataset, which is also reflected in the high PCC and low RMSE values. We see similarly strong performance on the VQEG HD3 dataset.

Fast calculation speed

Let’s continue with AVQT’s computation speed. We all know that high computing speed is important for scalability. AVQT’s algorithm is designed and optimized to run fast on Metal, which lets us work through large video files very quickly. It also handles all the required preprocessing natively, so we don’t have to decode and scale the videos offline. **AVQT can process 1080p video at about 175 frames per second.** So if we have a 10-minute, 24 fps, 1080p video, AVQT can compute its quality score in roughly a minute and a half (see the quick check below).

Setting viewing parameters

The last property we’ll discuss is the viewing parameters. Our viewing setup affects the quality we perceive when watching a video. In particular, factors such as display size, display resolution, and viewing distance can mask or exaggerate artifacts in the video.

To account for this, AVQT takes these parameters as tool inputs and tries to predict the correct trend as they change. Let’s consider two scenarios:

In scenario A, we watch a 4K video on a 4K monitor at a viewing distance of 1.5 screen heights. In scenario B, we watch the same video on the same monitor, but now at a distance of three screen heights. Obviously, in scenario B we miss some details that are visible when looking closely, which means the video quality we perceive in scenario B will be higher than in scenario A. We can use AVQT to compute scores for the same video at these two viewing distances and confirm that it reflects this trend.

As shown in the figure above, AVQT scores increase as the viewing distance increases from 1.5 screen heights to 3 screen heights. For more technical details, check out the README documentation provided with the tool.

Now that everyone is excited about AVQT, let’s take a look at how to use the tool properly. We’ll be making AVQT available to everyone soon via the Apple Developer portal.

Let’s start with a demonstration. First, I have downloaded AVQT and installed it on my system. Running `which AVQT` shows that AVQT is installed in /usr/local/bin. We can now call AVQT with the help flag to read about the different flags it supports, and more.

The current directory has a sample reference video and a sample compressed video that I will use to run AVQT. We provide the reference and test files as input and specify an output file named sample_output.csv. The tool prints progress on screen and reports segment-level scores. The default segment duration is 6 seconds, and since this clip is 5 seconds long, we get only one segment. Looking at the output file, we can see the frame-level scores, with the segment-level score at the bottom.

In addition to the options shown during the demo, the tool has several additional features built in.

For example, we can use the segment-duration and temporal-pooling flags to change how frame-level scores are aggregated. Similarly, viewing settings can be specified using the viewing distance and display resolution flags.

See the README file for more details. So far, we’ve looked at some of AVQT’s key properties and demonstrated how to use the command-line tool on a pair of videos to generate a video quality score.

Use cases for AVQT

Now let’s look at a specific case: using AVQT to optimize the bit rates of HLS layers.

HLS layers are encoded at different bit rates, and we know that selecting these bit rates is not always a simple process.

To help with this, we have published bit rate guidelines in the HLS Authoring Specification document. These bit rates are only initial encoding targets for delivering typical content over HLS. We also know that different content has different encoding complexity, which means the optimal bit rates vary from content to content.

Therefore, bitrates that work for one type of content, such as animated movies, may not work for sporting events.

Let’s look at how we can use AVQT as feedback to help determine the best bit rates for our content (a sketch of this loop follows). First, we start with the initial target bit rates, which we use to encode our source video and create the HLS layers. We then compute video quality scores with AVQT from the source video and the encoded HLS layers. Finally, we analyze the AVQT scores to decide whether to increase or decrease the target bit rate of an HLS layer.

To demonstrate this, let’s pick a specific HLS layer. In this case, we chose the 2160p layer at 11.6 megabits per second. We then encode the first two sequences, animation and sports, at the recommended bit rate. Once the encoded layers are ready, we use AVQT to compute their video quality scores.

The following figure shows the AVQT scores of the two video sequences. For this particular layer, we want high video quality, so we set the threshold to 4.5, which means near excellent quality. As you can see, although this bit rate is sufficient for this animation clip, it is not sufficient for the sports clip.

Therefore, we go back and use this feedback to raise the bit rate target for the sports clip and recompute its AVQT score.

Here we increase the bit rate by 10%. Plotting the new AVQT score for the sports clip, the updated score is now above the 4.5 threshold we wanted, and it is also closer to the video quality of the animated content.

Finally, we hope the presentation has made one thing clear: video compression can introduce visible artifacts, and these artifacts can affect the consumer’s video quality experience.

We can use AVQT to evaluate the quality of compressed video. AVQT is available as a macOS command-line tool, runs fast, and lets you set viewing parameters. It also supports all AVFoundation-based video formats and can be used to optimize video quality across HLS layers.

Conclusion

This concludes the translation of Pranav Sodhani’s presentation at WWDC 2021. Please point out any mistakes you find.

At present, NetEase Yunxin’s cross-platform super-resolution feature can use AVQT to evaluate subjective image quality after super-resolution. You are welcome to try the latest SDK and experience the super-resolution feature.

About the author

Pranav Sodhani is a member of Apple’s display and color technology team, with expertise in algorithm development, machine learning, color science, and video technology. He received his master’s degree in Computer Science from the University of California, Los Angeles (UCLA) in 2017 and his bachelor’s degree in Electrical Engineering from IIT G in 2015. He has conducted research at universities in Canada and South Korea and presented papers at international machine learning conferences. He has received numerous scholarships and awards, including the O.P. Jindal Engineering and Management Scholarship (OPJEMS), the Mitacs Globalink Award, and a gold medal at the 4th International Mathematical Olympiad. He is also the author of “Haha Fluonly – Pranav Sodhani Original Joke Book,” published in 2018 for an Indian audience.

  • Session video: developer.apple.com/videos/play…
  • Reference: www.its.bldrdoc.gov/vqeg/vqeg-h…

For more technical content, follow the [NetEase Smart Enterprise Technology+] WeChat official account.