H.266/VVC affine motion model

Introduction

In July 2020, the 19th meeting of the Joint Video Experts Team (JVET) concluded, marking the completion of a new generation of video coding standard, Versatile Video Coding (H.266/VVC, hereafter referred to as VVC). This came just seven years after the previous-generation standard, High Efficiency Video Coding (H.265/HEVC). In recent years, the explosive popularity of short-form video, online conferencing, and other mobile video applications has driven rapid growth in the global demand for video; the industry feels enormous pressure on bandwidth and storage and urgently needs higher-performance video compression. Although HEVC, finalized in 2013, significantly improved compression performance over the previous-generation standard, Advanced Video Coding (H.264/AVC), it still could not keep pace with the industry's growing needs. In response, the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) jointly established JVET in 2015 with the strategic goal of surpassing HEVC. After nearly three years of exploratory work, formal development of the VVC standard was launched in April 2018, and after more than two years of effort the JVET experts completed the standard at the 19th meeting. According to official evaluations, VVC can encode video of the same quality at roughly half the bitrate of HEVC.

VVC adopts many new techniques, such as multi-type tree block partitioning, history-based motion vector prediction, and cross-component linear model (CCLM) prediction. This paper focuses on the affine motion model in VVC. The standard text, reference software, and related academic literature are all publicly available and voluminous; to offer readers something of additional value, this paper emphasizes the background and principles of the relevant methods rather than exhaustive technical detail, so that readers can get a view of the whole picture in a short time.

Early studies of video coding found that a purely translational motion model cannot effectively represent complex motion such as rotation and scaling, whereas an affine motion model can describe these motions well. Since the 1980s, researchers have studied global affine transformation, local affine transformation, and related methods, trying to introduce affine motion models into video coding to improve the efficiency of inter-frame prediction. Although research on affine motion models in video coding remained active for nearly 30 years, for a long time such models were not adopted by any mainstream video coding standard; the methods stayed on paper rather than being built into chips. The reasons come down to roughly two points. First, the complexity of affine motion model algorithms was extremely high, beyond the processing capability of practical products of the time, at both the encoder and the decoder. Second, much of the research was rather academic and did not consider how the algorithms could be embedded cleanly into the framework of modern video coding standards. Modern standards are built on a block-based hybrid video coding framework; if the affine motion model cannot be integrated smoothly into this framework, it is difficult for it to be adopted by a standard.

Fortunately, after the birth of HEVC, a series of creative studies by Chinese scholars gradually solved these two problems and paved the way for the affine motion model to enter an industrial standard. Huang et al. [1] were the first to break the deadlock, using three corner motion vectors to represent a 6-parameter affine model as a special candidate of Merge mode. This method creatively reuses the motion vectors already stored in the hybrid coding framework to represent affine models, opening the door for affine models to be integrated into the block-based hybrid framework. Li et al. [2] further developed the method of [1]: they reduced the model from six parameters to four and, more importantly, generalized the approach from Merge mode to ordinary Inter mode. To simplify the encoder, they introduced a fast gradient-based motion estimation method for the affine model; to simplify the decoder, they coarsened the motion compensation granularity of the affine model from the pixel level to the 4×4 sub-block level. [2] is a milestone: it showed that the affine motion model had essentially overcome the two obstacles above and was only one step away from standardization. After the VVC project started, Zhang et al. [3] further improved the coding efficiency of the affine motion model, proposing adaptive switching between the 4-parameter and 6-parameter models and re-prediction of the corner motion vectors. VVC finally adopted this improved affine motion model. After nearly 40 years of development, affine motion model technology in video coding has at last come to fruition. Looking back, one can see how long and tortuous the road from conception to industrial application of a technology can be.

From affine transformation to affine motion compensation

To better understand affine motion modeling, we first briefly review affine transformations. The affine transformation is a common invertible transformation in geometry [4]. It is defined as follows: "if a set of collinear points remains collinear after an invertible transformation, the transformation is an affine transformation." By definition, an affine transformation preserves collinearity; geometrically, it can also be shown that non-collinear points remain non-collinear and parallel lines remain parallel after an affine transformation. The affine transformation can be expressed as
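The displayed equation is not reproduced in this copy. A plausible reconstruction of formula (1), assuming one common parameterization of the 6-parameter affine transformation (the symbol names are an assumption, chosen to match the parameters a–f and a′, d′ referred to later):

\[
x' = a\,x + b\,y + e, \qquad y' = c\,x + d\,y + f \tag{1}
\]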

Here (x, y) and (x′, y′) denote the coordinates of a point before and after the transformation. An affine transformation of the form in formula (1) is called a 6-parameter affine transformation and is the general form of the affine transformation. It can describe five kinds of transformations and their combinations: translation, mirroring, rotation, scaling, and shearing. Imposing constraints on formula (1) gives us
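Formula (2) is likewise missing here; under the same assumed parameterization, the constraints d = a and c = −b reduce formula (1) to the 4-parameter form:

\[
x' = a\,x + b\,y + e, \qquad y' = -b\,x + a\,y + f \tag{2}
\]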

An affine transformation of the form in formula (2) is called a 4-parameter affine transformation and is a special case of the affine transformation. It can describe three kinds of transformations and their combinations: translation, rotation, and scaling. The expressive power of the 4-parameter affine transformation is therefore weaker than that of the 6-parameter form. However, in natural video, translation, rotation, and scaling are common while mirroring and shearing are rare, so the 4-parameter affine transformation offers the advantage of fewer parameters for video coding. VVC allows both types of affine transformation.

When an affine transformation is used in video coding, that is, when the so-called "affine motion model" is applied, what we actually mean is that an affine relationship holds between the coordinates (x, y) of the current pixel to be encoded and the coordinates (x′, y′) of its reference pixel used for motion compensation. From formulas (1) and (2) we can easily obtain
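The corresponding displayed equations are also missing; writing the motion vector as the displacement from the current position to the reference position, formulas (1) and (2) translate into the motion-field forms below (a reconstruction, presumably formulas (3) and (4), with a′ = a − 1 and d′ = d − 1):

\[
\mathbf{mv}(x,y) = \begin{pmatrix} mv_h(x,y) \\ mv_v(x,y) \end{pmatrix}
                 = \begin{pmatrix} x' - x \\ y' - y \end{pmatrix}
                 = \begin{pmatrix} a'\,x + b\,y + e \\ c\,x + d'\,y + f \end{pmatrix} \tag{3}
\]

and, for the 4-parameter case,

\[
\mathbf{mv}(x,y) = \begin{pmatrix} a'\,x + b\,y + e \\ -b\,x + a'\,y + f \end{pmatrix} \tag{4}
\]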

where mv(x, y) = (mv_h(x, y), mv_v(x, y))^T denotes the motion vector at pixel position (x, y). Since the model parameters can be chosen arbitrarily, in the following we simply write a and d for a′ and d′ without risk of confusion.

In VVC, each coding block can use one (uni-prediction) or two (bi-prediction) affine models. In practice, we do not store or transmit the parameters a–f directly; instead, the so-called corner (control-point) motion vector representation carries these parameters implicitly. As shown in Fig. 1, the 4-parameter affine model can be represented by the corner motion vector mv0 at the top-left corner and mv1 at the top-right corner, while the 6-parameter affine model can be represented by mv0, mv1, and the corner motion vector mv2 at the bottom-left corner. After a simple derivation (omitted here), we obtain the following.

For the 4-parameter model:
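(The displayed equation is missing from this copy; the following is a reconstruction of the standard 4-parameter corner-MV form, with mv0 = (mv_{0h}, mv_{0v}) the top-left corner motion vector.)

\[
\begin{cases}
mv_h(x,y) = a\,(x - x_0) - b\,(y - y_0) + mv_{0h} \\
mv_v(x,y) = b\,(x - x_0) + a\,(y - y_0) + mv_{0v}
\end{cases} \tag{5}
\]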

For the 6-parameter model:
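(Again a reconstruction of the standard form.)

\[
\begin{cases}
mv_h(x,y) = a\,(x - x_0) + c\,(y - y_0) + mv_{0h} \\
mv_v(x,y) = b\,(x - x_0) + d\,(y - y_0) + mv_{0v}
\end{cases} \tag{6}
\]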

where
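(The coefficient definitions are missing from this copy; the standard reconstruction, with mv1 and mv2 the top-right and bottom-left corner motion vectors, is:)

\[
a = \frac{mv_{1h} - mv_{0h}}{W}, \quad
b = \frac{mv_{1v} - mv_{0v}}{W}, \quad
c = \frac{mv_{2h} - mv_{0h}}{H}, \quad
d = \frac{mv_{2v} - mv_{0v}}{H}
\]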

In the above formulas, the top-left, top-right, and bottom-left corner points are assumed to be located at (x0, y0), (x0 + W, y0), and (x0, y0 + H), respectively, where W and H are the width and height of the current coding block.

Obviously, in theory, once the two or three corner motion vectors of a coding block are given, the motion vector of any position inside the block can be computed from the 4-parameter or 6-parameter model above. In the actual VVC design, however, to save computation the motion vector is computed only once per 4×4 sub-block, at the sub-block center; all pixels in that 4×4 sub-block share this motion vector and are motion-compensated together, as shown in Fig. 2. This is an approximation of the affine model and causes some loss of coding efficiency, but it is necessary to keep computational complexity and hardware bandwidth under control.
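As a concrete illustration of the sub-block derivation just described, the following minimal Python sketch evaluates the affine motion field once at each 4×4 sub-block center. All names are illustrative and floating-point arithmetic is used; the VVC specification works in fixed-point arithmetic at 1/16-pel MV precision with specific rounding and clipping, which is omitted here.

```python
def affine_mv(x, y, cpmv, x0, y0, w, h):
    """Evaluate the affine motion field at picture position (x, y).

    cpmv holds 2 corner MVs [mv0, mv1] for the 4-parameter model or
    3 corner MVs [mv0, mv1, mv2] for the 6-parameter model, where mv0, mv1, mv2
    are the (horizontal, vertical) MVs of the top-left, top-right and bottom-left
    corners of a block located at (x0, y0) with width w and height h.
    """
    (mv0h, mv0v), (mv1h, mv1v) = cpmv[0], cpmv[1]
    a = (mv1h - mv0h) / w
    b = (mv1v - mv0v) / w
    if len(cpmv) == 2:                  # 4-parameter model: rotation + scaling
        c, d = -b, a
    else:                               # 6-parameter model: two extra degrees of freedom
        (mv2h, mv2v) = cpmv[2]
        c = (mv2h - mv0h) / h
        d = (mv2v - mv0v) / h
    dx, dy = x - x0, y - y0
    return (a * dx + c * dy + mv0h, b * dx + d * dy + mv0v)


def subblock_mvs(cpmv, x0, y0, w, h, sub=4):
    """Return {(sx, sy): (mv_h, mv_v)} with one MV per sub x sub sub-block,
    evaluated once at the sub-block centre and shared by all of its pixels."""
    mvs = {}
    for sy in range(0, h, sub):
        for sx in range(0, w, sub):
            cx, cy = x0 + sx + sub / 2, y0 + sy + sub / 2   # sub-block centre
            mvs[(sx, sy)] = affine_mv(cx, cy, cpmv, x0, y0, w, h)
    return mvs


# Example: a 16x16 block at (64, 32) with a zoom-like motion described by
# two corner MVs (4-parameter model).
if __name__ == "__main__":
    mvs = subblock_mvs(cpmv=[(0.0, 0.0), (4.0, 0.0)], x0=64, y0=32, w=16, h=16)
    print(mvs[(0, 0)], mvs[(12, 0)])
```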

Affine coding modes in VVC

Corner motion vectors play a key role in VVC affine motion compensation. According to how the corner motion vectors are obtained, the affine coding modes in VVC fall into two categories: affine Inter mode and affine Merge mode.

When affine Inter mode is used, motion vector differences must be encoded and transmitted, as in ordinary Inter mode. The difference is that ordinary Inter mode transmits only one motion vector per prediction direction (List 0/1), whereas affine Inter mode transmits two (4-parameter model) or three (6-parameter model) motion vectors, namely the corner motion vectors. The encoder may choose the 4-parameter or the 6-parameter model according to coding performance and transmits a flag to inform the decoder of the choice; the decoder then decodes two or three corner motion vectors accordingly. In affine Inter mode, VVC supports three motion vector precisions, {1/16, 1/4, 1} pixel, and the encoder signals the selected precision to the decoder.

Similar to ordinary Inter mode, affine Inter mode builds a list of two predicted corner-motion-vector candidates (each candidate being a set of corner motion vector predictors). Predicted corner motion vectors are generated mainly in two ways: inheritance (Fig. 3) and construction (Fig. 4). In the inheritance approach, a neighboring affine-coded block (for example, the block containing A0 in Fig. 3) is used to derive the predicted corner motion vectors of the current block. Looking back at formulas (5) and (6), the target position (x, y) is actually unrestricted and may lie inside or outside the block. If we apply formula (5) or (6) of the block containing A0 with the target position set to the top-left corner of the current block (i.e., (x0, y0)), we obtain the predicted top-left corner motion vector of the current block; the predictors of the other two corners are obtained in the same way. Corner motion vectors obtained by inheritance have high prediction accuracy and are preferred whenever available. The construction approach does not require the neighboring blocks to be affine-coded; it directly uses the motion vectors stored in the 4×4 sub-blocks adjacent to each corner to predict the corner motion vectors. For example, {mvA, mvB, mvC} are used to generate the predicted corner motion vector at the top-left corner, {mvD, mvE} at the top-right corner, and {mvF, mvG} at the bottom-left corner. If inheritance and construction together yield fewer than two candidates, VVC fills the remaining candidates with an ordinary (translational) predicted motion vector copied to every corner, which is equivalent to predicting an affine model that contains only translational motion. Note that a predicted corner motion vector generated in VVC is required to point to the same reference frame as the corner motion vector it predicts.
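To make the inheritance approach concrete, here is a rough Python sketch (illustrative names, not the VTM implementation; normative rounding and clipping are omitted) that derives a 6-parameter predicted corner-MV set by evaluating a neighboring affine block's model at the current block's corners:

```python
def inherited_cpmv_predictor(nb_cpmv, nb_pos, nb_size, cur_pos, cur_size):
    """nb_cpmv: [mv0, mv1, mv2] of a neighbouring affine block located at nb_pos
    with dimensions nb_size; returns predicted [mv0, mv1, mv2] for the current block."""
    (nx0, ny0), (nw, nh) = nb_pos, nb_size
    (mv0h, mv0v), (mv1h, mv1v), (mv2h, mv2v) = nb_cpmv
    a = (mv1h - mv0h) / nw
    b = (mv1v - mv0v) / nw
    c = (mv2h - mv0h) / nh
    d = (mv2v - mv0v) / nh

    def mv_at(x, y):                          # evaluate the neighbour's affine model
        return (a * (x - nx0) + c * (y - ny0) + mv0h,
                b * (x - nx0) + d * (y - ny0) + mv0v)

    (cx0, cy0), (cw, ch) = cur_pos, cur_size
    return [mv_at(cx0, cy0),                  # predicted top-left corner MV
            mv_at(cx0 + cw, cy0),             # predicted top-right corner MV
            mv_at(cx0, cy0 + ch)]             # predicted bottom-left corner MV
```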

With the predicted corner motion vectors in hand, the corner motion vectors themselves can be encoded. The top-left corner motion vector is treated like an ordinary motion vector: only its difference from the top-left predictor (the MVD) needs to be coded. For the other two corners, VVC introduces corner-MVD re-prediction: the MVDs of these two corners are predicted once more from the MVD of the top-left corner. On the coding side, we have
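The displayed equation (formula (7)) is missing from this copy; a reconstruction consistent with the corner-MVD prediction used in VVC, where mvp_i denotes the predictor of corner motion vector mv_i and mvd_0, mvd′_1, mvd′_2 are the differences actually written to the bitstream, is:

\[
mvd_0 = mv_0 - mvp_0, \qquad
mvd'_1 = (mv_1 - mvp_1) - mvd_0, \qquad
mvd'_2 = (mv_2 - mvp_2) - mvd_0 \tag{7}
\]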

Corner-MVD re-prediction improves the coding efficiency of the corner motion vectors. Figure 5 shows an example. Affine motion is commonly decomposed into a translational part and a non-translational part. In the VVC affine model, the top-left corner motion vector mv0 represents the translational part, while mv1 − mv0 and mv2 − mv0 represent the non-translational part. If we predict the translational and non-translational parts independently, we get mvd0 = mv0 − mvp0, mvd′1 = (mv1 − mv0) − (mvp1 − mvp0), and mvd′2 = (mv2 − mv0) − (mvp2 − mvp0), i.e., formula (8); expanding the brackets shows that this is exactly what formula (7) encodes.

Affine Merge mode appears in VVC as a sub-block Merge mode. There are two types of sub-block Merge modes in VVC: sub-block temporal motion vector prediction and affine Merge. The first entry of the VVC sub-block Merge candidate list is the sub-block temporal motion vector prediction candidate, and the remaining entries are affine Merge candidates; up to five affine Merge candidates are supported. As in affine Inter mode, the predicted corner motion vectors of affine Merge mode are generated mainly by inheritance (Fig. 3) and construction (Fig. 4). Unlike affine Inter mode, affine Merge mode transmits neither reference frame information nor MVDs: the reference frame information and motion vectors are derived directly from the Merge candidate, and the predicted corner motion vectors obtained by inheritance or construction are used directly as the corner motion vectors of the affine Merge candidate. In addition, the construction approach is more flexible in affine Merge mode: besides the motion vectors stored in the 4×4 sub-blocks adjacent to the top-left, top-right, and bottom-left corners, the temporal predicted motion vector at the bottom-right corner can also be used as a bottom-right corner motion vector, which after a simple conversion can be used to compute the top-left, top-right, or bottom-left corner motion vectors.

Further simplification of affine motion model

Although the method in [2] is already greatly simplified compared with traditional affine motion model coding methods, it is still too complex for practical application, especially in hardware. To address this, VVC further simplifies the motion compensation of the affine motion model, mainly as follows:

  1. For the 4:2:0 chroma format, chroma blocks are divided into 4×4 sub-blocks rather than 2×2 sub-blocks, so that one chroma sub-block corresponds to four luma sub-blocks. The motion vector of a chroma sub-block is obtained by averaging the motion vectors of the corresponding top-left and bottom-right luma sub-blocks (see the sketch after this list). This reduces the bandwidth required for chroma motion compensation.

  2. VVC uses an 8-tap sub-pixel interpolation filter for motion compensation of ordinary Inter blocks, but a 6-tap sub-pixel interpolation filter for affine-coded blocks. The reason is that bi-predicted 4×4 blocks never occur in ordinary Inter coding, whereas bi-predicted 4×4 sub-blocks do occur in affine-coded blocks, and the smaller the block, the higher the per-pixel computation load. Using a 6-tap filter keeps the number of additions and multiplications needed for sub-pixel interpolation in affine blocks as low as possible.

  3. In hardware implementations, the smaller the motion compensation block, the higher the average memory bandwidth requirement. To limit the bandwidth increase caused by 4×4 sub-block motion compensation, VVC adopts a bounding-box scheme, as shown in Figure 6. In affine mode, each group of four 4×4 sub-blocks arranged in a 2×2 pattern (i.e., covering an 8×8 region) inside the current block must have all of its reference pixels fall within a common bounding box that is only slightly larger than the reference area needed to motion-compensate an ordinary 8×8 block. In this way, the reference pixels for the four 4×4 sub-blocks can be fetched in one access, and the required bandwidth is only slightly higher than that of 8×8 motion compensation. If this condition is not satisfied, the current block falls back to ordinary translational motion compensation instead of affine motion compensation.
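The chroma sub-block MV averaging mentioned in item 1 can be sketched as follows. This is a minimal illustration for 4:2:0 content only; the function and variable names are assumptions, and fixed-point rounding is omitted.

```python
def chroma_subblock_mv(luma_mvs, cx, cy):
    """luma_mvs: {(x, y): (mv_h, mv_v)} of the 4x4 luma sub-blocks, with (x, y) in
    luma samples relative to the block origin (multiples of 4).
    (cx, cy): position of a 4x4 chroma sub-block in chroma samples (multiples of 4);
    in 4:2:0 it covers the 8x8 luma area starting at (2*cx, 2*cy)."""
    tl = luma_mvs[(2 * cx, 2 * cy)]           # top-left luma 4x4 sub-block
    br = luma_mvs[(2 * cx + 4, 2 * cy + 4)]   # bottom-right luma 4x4 sub-block
    return ((tl[0] + br[0]) / 2.0, (tl[1] + br[1]) / 2.0)
```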

Performance of affine motion model in VVC

JVET published an official evaluation of the coding efficiency of each coding tool in document JVET-S0013. The experimental platform was VTM-9 under the JVET common test conditions, including the random access (RA) and low-delay B (LDB) configurations. The methodology is to disable the tool under test and measure the resulting performance loss of the VTM encoder; the larger the loss, the more powerful the tool. Table 1 shows the results per test-sequence class and the overall averages. The results show that the affine motion model provides more than 3% coding performance gain in VVC on average, at a cost of roughly 20%–30% in encoder run time and about 3%–8% in decoder run time. Compared with the other coding tools listed in JVET-S0013, the affine motion model is arguably one of the most powerful coding tools in VVC apart from the flexible block partitioning structure.

In addition, Table 2 lists the results for each individual sequence. The performance of the affine motion model depends strongly on the content of the sequence. For sequences rich in non-translational motion, such as Cactus, CatRobot, BQSquare, and SlideShow, the tool greatly improves coding efficiency, exceeding 10% or even 15% in the best cases. For sequences with little non-translational motion, such as CampfireParty and BQTerrace, the tool brings little benefit.

Table 1: Average coding efficiency of affine motion model

                          RA                                         LDB
Class        Y        U        V        EncT    DecT     Y        U        V        EncT    DecT
Class A1     2.16%    2.08%    1.81%    81%     96%      –        –        –        –       –
Class A2     6.13%    4.27%    3.91%    79%     96%      –        –        –        –       –
Class B      3.20%    2.44%    2.34%    80%     97%      3.74%    3.02%    3.44%    73%     91%
Class C      1.46%    1.02%    0.87%    83%     98%      2.19%    1.62%    1.45%    79%     94%
Class E      –        –        –        –       –        3.10%    1.90%    2.51%    62%     93%
Overall      3.11%    2.36%    2.16%    81%     97%      3.06%    2.27%    2.54%    72%     92%
Class D      2.35%    1.53%    1.52%    85%     99%      5.08%    4.14%    4.45%    81%     92%
Class F      3.14%    2.37%    2.34%    86%     99%      4.48%    3.21%    3.87%    80%     98%

(EncT and DecT are encoder and decoder run times with the tool disabled, relative to the anchor; "–" means the class is not tested in that configuration. Classes D and F are optional classes and are not included in the overall average.)

Table 2: Specific coding efficiency of affine motion model

                                          RA                          LDB
Class        Sequence                Y        Cb       Cr        Y        Cb       Cr
A1 (4K)      Tango                   1.31%    1.72%    1.07%     –        –        –
             FoodMarket4             4.70%    4.30%    4.11%     –        –        –
             CampfireParty           0.47%    0.21%    0.24%     –        –        –
A2 (4K)      CatRobot                8.38%    6.16%    5.31%     –        –        –
             DaylightRoad            6.08%    4.08%    3.76%     –        –        –
             ParkRunning3            3.93%    2.57%    2.66%     –        –        –
B (1080p)    MarketPlace             4.73%    3.41%    4.06%     3.99%    2.69%    2.90%
             RitualDance             2.39%    1.82%    1.60%     1.74%    1.25%    1.48%
             Cactus                  7.34%    5.48%    4.58%     10.88%   9.69%    9.80%
             BasketballDrive         1.23%    1.13%    1.02%     1.76%    1.42%    1.47%
             BQTerrace               0.30%    0.36%    0.44%     0.33%    0.04%    1.54%
C (WVGA)     BasketballDrill         0.67%    0.34%    0.16%     0.86%    0.18%    -0.06%
             BQMall                  1.06%    0.87%    0.72%     1.27%    0.48%    0.67%
             PartyScene              2.81%    2.09%    2.00%     5.31%    5.31%    4.28%
             RaceHorses              1.30%    0.78%    0.62%     1.31%    0.51%    0.91%
D (WQVGA)    BasketballPass          0.68%    -0.15%   0.02%     0.77%    0.11%    -0.07%
             BQSquare                5.57%    3.98%    4.12%     15.24%   13.47%   14.80%
             BlowingBubbles          1.86%    1.71%    1.56%     2.94%    2.59%    2.44%
             RaceHorses              1.28%    0.57%    0.40%     1.35%    0.39%    0.65%
E (720p)     FourPeople              –        –        –         1.71%    0.91%    1.18%
             Johnny                  –        –        –         3.95%    2.53%    3.63%
             KristenAndSara          –        –        –         3.66%    2.25%    2.72%
F (SCC)      BasketballDrillText     0.53%    0.17%    0.07%     0.40%    0.39%    0.08%
             ArenaOfValor            1.91%    0.90%    0.95%     1.78%    0.93%    0.76%
             SlideEditing            0.06%    0.05%    0.11%     -0.07%   -0.12%   -0.31%
             SlideShow               10.08%   8.34%    8.25%     15.80%   11.66%   14.97%

("–" means the class is not tested in that configuration.)

Afterword

Video compression technology, and video coding standards in particular, has come a long way from H.261/MPEG-1 in the early 1990s to today. After more than 30 years of development it has grown into a very large and complex body of knowledge. The VVC coding tool set, as a representative of the new generation of coding technology, is far more sophisticated than the traditional techniques the industry is familiar with. This paper has introduced one distinctive new tool in VVC, the affine motion model, in the hope of guiding interested readers and popularizing knowledge of VVC. Readers who wish to explore the theoretical basis of affine motion models further are referred to [5][6]. Owing to limited space, some related techniques, such as prediction refinement with optical flow (PROF), are not covered here; technical details can be found in [7].

References

  1. H. Huang, J. Woods, Y. Zhao, and H. Bai, "Control-point representation and differential coding affine-motion compensation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 10, pp. 1651–1660, Oct. 2013.
  2. L. Li, H. Li, D. Liu, Z. Li, H. Yang, S. Lin, H. Chen, and F. Wu, "An efficient four-parameter affine motion model for video coding," IEEE Transactions on Circuits and Systems for Video Technology, Apr. 2017.
  3. K. Zhang, Y. Chen, L. Zhang, W. Chien, and M. Karczewicz, "An improved framework of affine motion compensation in video coding," IEEE Transactions on Image Processing, vol. 28, no. 3, Mar. 2019.
  4. You Chengye (ed.), Analytic Geometry, Peking University Press, Jan. 2004.
  5. K. Zhang, L. Zhang, H. Liu, J. Xu, Z. Deng, and Y. Wang, "Interweaved prediction for video coding," IEEE Transactions on Image Processing, vol. 29, pp. 6422–6437, 2020.
  6. H. Meuel and J. Ostermann, "Analysis of affine motion-compensated prediction in video coding," IEEE Transactions on Image Processing, vol. 29, pp. 7359–7374, 2020.
  7. J. Luo and Y. He, "CE2-related: Prediction refinement with optical flow for affine mode," JVET-N0236, Mar. 2019.