By Yuwen He

1. Overview

In recent years, upgraded network infrastructure, the popularization of smartphones, and continuous advances in video technology have enabled users to enjoy higher-quality video services. Figure 1 shows the trends in video quality. One direction is higher image resolution, from standard definition (SD) and high definition (HD) to ultra high definition (UHD, 4K/8K); higher resolution enlarges the viewing field and thereby improves the user's subjective experience. Another direction is higher pixel quality, including greater bit depth, wide color gamut (WCG), and high dynamic range (HDR); these improvements strengthen the perception of texture detail, brightness, and color richness, again improving subjective quality. Combining UHD resolution with the extended dynamic range and color gamut of BT.2020 improves the subjective experience further.

Building on this, 360 video provides an omnidirectional 360-degree view: users can get an immersive viewing experience through a head-mounted display, and can easily switch viewing angles with a head-mounted display or a smartphone according to their interests. As shown in Figure 2, since a smartphone carries a 3D attitude sensor, the user can switch viewing angles simply by changing the phone's orientation. This changes the traditional fixed-view mode of video watching and increases users' engagement with the content. Currently, 360 video is typically shot with multiple wide-angle cameras capturing simultaneously from several viewpoints (for example, two or six directions), and the views are then stitched together to synthesize the 360 video. These properties make 360 video well suited to virtual reality (VR) and augmented reality (AR) applications, and with the spread of 360 video capture and display devices there are already good demonstration applications in entertainment, education, social media, industrial design, interactive games, online shopping, telemedicine, virtual tourism, and other areas. With the rollout of 5G networks and smartphones with stronger video capture and graphics processing capabilities, the quality of 360 video applications will improve greatly, and low-latency, real-time interactive 360 video applications will become possible.

Figure 1. Trends in video quality

Figure 2. Users switch perspectives when watching 360 videos

2. 360 Video Coding Challenges

Since 360 video has a much larger field of view than UHD/HD video, it needs a resolution above 16K to match the clarity of ordinary HD within the viewport. This demands very efficient compression and poses new challenges for 360 video coding and transmission. The Joint Video Experts Team (JVET) of ITU-T and ISO/IEC finalized the new video compression standard H.266/VVC (Versatile Video Coding) in 2020, which is tailored to SDR, HDR, and 360 video to meet the needs of different video applications now and in the future. JVET carried out comprehensive studies on 360 video, including projection methods from the sphere to the 2D plane, coded-frame packing methods, 360 video quality evaluation methods, and 360 video coding optimization methods, and through extensive exploration and testing formed an effective system for 360 video processing, coding, and quality evaluation [2][3][4]. Figure 3 shows the 360 video processing flow: capture, stitching, projection to the 2D plane, encoding, decoding, and extraction of the display region for a given viewing direction.

Figure 3. 360 video processing flow

JVET specifies 14 2D plane projection formats for 360 video testing [3]: equirectangular projection (ERP), adjusted equal-area projection (AEP), cubemap projection (CMP), adjusted cubemap projection (ACP), equi-angular cubemap (EAC), hybrid equi-angular cubemap (HEC), generalized cubemap projection (GCMP), rotated sphere projection (RSP), fisheye, segmented sphere projection (SSP), octahedron projection (OHP), icosahedron projection (ISP), truncated square pyramid (TSP), and equatorial cylindrical projection (ECP). A 360 video can cover the entire sphere (longitude [-180, 180], latitude [-90, 90]), while some applications only cover the hemisphere in front of the user (longitude [-90, 90], latitude [-90, 90]). The 360Lib reference software [4] supports conversion among these projection formats, computation of various quality metrics, and extraction of the displayed video along a user-specified viewing trajectory to support subjective quality evaluation of 360 video. This 360 video processing reference software can be combined with the H.266/VVC reference software to support the entire pipeline of 360 video projection format conversion, coded-frame packing, and encoding.
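To make the projection conversions concrete, here is a minimal sketch of the ERP mapping between pixel coordinates and the sphere, assuming a width × height luma picture and the usual sample-center convention; the function names are illustrative and not taken from 360Lib.

```python
import math

def erp_pixel_to_sphere(i, j, width, height):
    """Map an ERP pixel (i, j) to spherical coordinates.

    Uses the sample-center convention (i + 0.5, j + 0.5); returns
    (longitude, latitude) in radians, with longitude in [-pi, pi]
    and latitude in [-pi/2, pi/2].
    """
    u = (i + 0.5) / width        # normalized horizontal position in [0, 1]
    v = (j + 0.5) / height       # normalized vertical position in [0, 1]
    longitude = (u - 0.5) * 2.0 * math.pi
    latitude = (0.5 - v) * math.pi
    return longitude, latitude

def sphere_to_erp_pixel(longitude, latitude, width, height):
    """Inverse mapping: spherical coordinates back to ERP pixel coordinates."""
    u = longitude / (2.0 * math.pi) + 0.5
    v = 0.5 - latitude / math.pi
    return u * width - 0.5, v * height - 0.5
```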

Figure 4. ERP projection

Figure 5. CMP projection; grey areas are padded regions

Compared with conventional video coding, 360 video coding faces three main challenges. (1) The 2D plane projection format used to synthesize the 360 video may differ from the one used for encoding: synthesis generally uses the ERP format, while encoding may use another projection format chosen for coding efficiency. (2) The sampling rate on the 360-degree sphere is not uniform across the pixels of the coded picture. For ERP (Figure 4), the images of the north and south polar regions are severely stretched and deformed; for CMP (Figure 5), the sampling rate of the corresponding sphere also varies within each face: the farther from the face center, the higher the sampling rate, so objects near the face edges appear stretched and deformed (quantified in the sketch after Figure 6). (3) If the user's viewport spans multiple faces of the 2D plane projection, discontinuous seam artifacts appear, as indicated by the red arrows in Figure 6; they arise because the coding process does not account for the boundaries between the faces.

(a) ERP format encoding; (b) CMP format encoding

Figure 6. Seam artifacts observed when ERP-coded and CMP-coded 360 video is displayed
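The non-uniform CMP sampling in challenge (2) can be quantified directly. The sketch below, a hedged illustration rather than 360Lib code, computes the relative spherical sampling density of a CMP face sample from the Jacobian of the rectilinear projection: a face corner is sampled roughly 5.2 times more densely than the face center.

```python
def cmp_relative_sampling_density(u, v):
    """Relative spherical sampling density of a CMP face sample.

    (u, v) are face-plane coordinates in [-1, 1], with the face at unit
    distance from the sphere center. Uniform sampling on the face plane
    concentrates samples on the sphere by a factor of
    (u^2 + v^2 + 1)^(3/2), so density is lowest at the face center
    and highest at the corners.
    """
    return (u * u + v * v + 1.0) ** 1.5

# Face center vs. face corner:
print(cmp_relative_sampling_density(0.0, 0.0))   # 1.0
print(cmp_relative_sampling_density(1.0, 1.0))   # 3**1.5, about 5.2
```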

3. 360 Video Coding Technologies

To address these challenges in 360 video coding, H.266/VVC defines several coding tools specifically for 360 video. They are introduced one by one below.

(1) Wrap-around Motion Compensation

Since 360 video is defined on a sphere, when the 2D planar projection is packed into coded frames, the pixels outside a frame boundary actually exist; they are simply located elsewhere in the frame. For example, in the packed ERP picture shown in Figure 4, the left boundary and the right boundary correspond to the same boundary on the sphere, so the pixels outside the left boundary are actually inside the right boundary (and vice versa). Based on this property, H.266/VVC defines horizontal wrap-around motion compensation. As shown in Figure 7, if the prediction reference block of the current coding block extends beyond the picture boundary, the part outside (the dotted shaded area on the left of Figure 7) can find its corresponding block inside the picture by translating to the right by the ERP width (the "Reference block (wrapped around)" in Figure 7). In this way, pixels outside the picture are taken from the continuous corresponding region on the sphere rather than generated by the repetitive padding of conventional coding, which improves prediction accuracy. H.266/VVC defines wrap-around compensation only in the horizontal direction, and the wrap-around offset is signalled in the PPS (see Table 1). The number of integer luma samples by which the wrapped reference is shifted horizontally can be computed from the coded picture width and the wrap-around parameter (pps_pic_width_minus_wraparound_offset in Table 1; a sketch follows Table 1). This technique improves prediction accuracy across the picture boundary, so it effectively reduces the seam artifacts mentioned above for moving video, especially when objects move across the picture boundary.

Figure 7. Wrap-around motion compensation for padded ERP

Table 1. Wrap-around offset signalled in the PPS:
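To make the wrap-around behavior concrete, here is a minimal sketch of reference-sample fetching with and without horizontal wrap-around, assuming an unpadded ERP picture whose wrap-around offset equals the picture width; names and the simplified modulo handling are illustrative, not the exact ClipH formula of the specification.

```python
def fetch_ref_sample(ref_frame, x, y, wrap_offset=None):
    """Fetch a luma reference sample, handling out-of-picture positions.

    ref_frame: 2D array indexed as ref_frame[y][x].
    wrap_offset: horizontal wrap-around offset in luma samples; None
    falls back to conventional repetitive edge padding (clamping).
    """
    height = len(ref_frame)
    width = len(ref_frame[0])
    # Vertical positions are always clamped (repetitive padding).
    y = min(max(y, 0), height - 1)
    if wrap_offset is None:
        x = min(max(x, 0), width - 1)     # conventional horizontal clamping
    else:
        x = x % wrap_offset                # horizontal wrap-around
    return ref_frame[y][x]

# A sample two columns left of the picture wraps to column 6 of an
# 8-sample-wide frame instead of repeating the edge sample:
frame = [[c + 10 * r for c in range(8)] for r in range(4)]
print(fetch_ref_sample(frame, -2, 1, wrap_offset=8))  # 16 (row 1, column 6)
```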

(2) Virtual Boundary

A 360 video projection format may contain multiple faces (such as CMP). When the faces are placed into the coded frame by operations such as translation, rotation, flipping, scaling, and padding, discontinuous boundaries may arise inside the coded picture. Figure 8 shows an example: the six CMP faces are placed in two rows and three columns, with the right, front, and left faces in the first row and the bottom, back, and top faces in the second row, so the content on the two sides of the boundary between the rows (the yellow line in Figure 8) is discontinuous. If in-loop filtering (deblocking, sample adaptive offset (SAO), and adaptive loop filtering (ALF)) references pixels from the other side when filtering near such a boundary, seam artifacts appear when the 360 video is displayed normally (Figure 9 [7]). To avoid these seams, H.266/VVC defines syntax for virtual boundaries: users can define virtual boundaries inside the picture, whose position information is coded in the sequence parameter set (SPS) (see Table 2) or the picture header (PH), and in-loop filtering at these virtual boundaries cannot reference pixels on the other side of the boundary. sps_virtual_boundaries_enabled_flag enables or disables the virtual boundary definition. In the example of Figure 8, the in-loop filtering problem can be solved by defining a horizontal virtual boundary at the position of the yellow line (see the sketch after Table 2).

Figure 8. Discontinuous boundary within a CMP (3×2 layout) picture

Figure 9. Seam artifacts produced when in-loop filtering of a discontinuous boundary references pixels on the other side of the boundary

Table 2. Virtual boundary signalling in the SPS:
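As a minimal illustration of how a decoder might gate in-loop filtering at virtual boundaries, the sketch below checks a horizontal filtering edge against the signalled boundary positions; the function name and picture size are hypothetical, not from the VVC reference software.

```python
def filtering_allowed_across_y(edge_y, virtual_boundary_ys):
    """Return False when a horizontal filtering edge lies on a signalled
    horizontal virtual boundary, so in-loop filters (deblocking, SAO,
    ALF) must not reference samples on the other side."""
    return edge_y not in set(virtual_boundary_ys)

# For the 3x2 CMP layout of Figure 8 with a hypothetical picture height
# of 1536 samples, one horizontal virtual boundary at mid-height keeps
# filtering from crossing the discontinuous row boundary:
pic_height = 1536
virtual_boundary_ys = [pic_height // 2]
print(filtering_allowed_across_y(768, virtual_boundary_ys))  # False: on the seam
print(filtering_allowed_across_y(64, virtual_boundary_ys))   # True
```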

(3) Subpicture coding

If the full 360-degree view were transmitted to the client all at once, the bit-rate requirement would be very high. To facilitate transmission, H.266/VVC defines the subpicture coding tool [1][8]. The encoder can choose to make each subpicture independently decodable, and the image resolution assigned to different parts of the 360 video can be chosen according to their importance. For example, because users mostly look around the front view, the front area can be given a higher resolution, the left and right areas a lower resolution, and the top, bottom, and back areas a still lower resolution. As shown in Figure 10, the 360 video coded frame can be divided into four regions: the front viewport area (red dashed box), the left viewport area (yellow dashed box), the right viewport area (blue dashed box), and the top/bottom/back viewport area (green dashed box). These four regions can be combined and coded in a single frame. Each region consists of multiple subpictures, so the four regions can be transmitted together or as a combination of several subpictures. This gives 360 video transmission great flexibility: an application can transmit selectively according to bandwidth and the user's viewing area, for example extracting only the subpictures covering the user's viewing window within the front viewport area (see the sketch after Figure 10).

Figure 10. Subpicture region partitioning corresponding to 360 video viewports
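As a rough illustration of viewport-dependent delivery built on subpictures, the sketch below selects which regions of the Figure 10 layout to request for a given viewing direction; the region names and angle thresholds are invented for the example, not taken from the standard.

```python
def select_regions(yaw_deg, pitch_deg):
    """Pick which coded regions to request for a viewing direction.

    The front region is always requested at high resolution; side and
    top/bottom/back regions are added only when the viewport approaches
    them. Thresholds are illustrative.
    """
    regions = ["front"]                      # high-resolution front viewport
    if yaw_deg < -30:
        regions.append("left")
    elif yaw_deg > 30:
        regions.append("right")
    if abs(pitch_deg) > 45 or abs(yaw_deg) > 120:
        regions.append("top_bottom_back")
    return regions

print(select_regions(yaw_deg=40, pitch_deg=0))   # ['front', 'right']
```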

(4) ERP and GCMP indication (SEI messages)

For correct display of 360 video, the H.266/VVC standard defines supplemental enhancement information (SEI) messages for the ERP and GCMP 2D plane projection formats [6]. The receiver can use this SEI information for post-processing and correct display of the video. Table 3 defines the ERP SEI message, where erp_left_guard_band_width and erp_right_guard_band_width indicate the number of guard-band pixels padded outward from the left and right edges of the ERP picture, respectively. These pixels are coded together with the ERP picture; after decoding, they can be used for blending at the picture boundary [10] to improve continuity there and avoid seams (see the sketch after Table 4). Table 4 defines the GCMP SEI message. GCMP can represent both regular and customized CMP by defining a sampling function (gcmp_mapping_function_type) on each cube face. gcmp_packing_type specifies the coded-frame packing format, gcmp_guard_band_type specifies how pixels outside a discontinuous boundary are padded, and gcmp_guard_band_samples_minus1 plus 1 specifies the number of pixels padded outside the discontinuous boundary. The JVET reference software 360Lib [4] supports these SEI messages and can output them during 360 video projection format encoding [9]; users can combine the decoded video with the output SEI information to convert and display the video in the correct format.

Table 3. ERP SEI message [6]:

Table 4. GCMP SEI message [6]:
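To illustrate the guard-band blending idea, here is a minimal sketch for a right-side ERP guard band, assuming the picture's last guard_band_width columns are padded copies of its leftmost columns (one possible layout); the linear weighting is an illustrative ramp in the spirit of [10], not its exact formula.

```python
def blend_erp_right_boundary(frame, guard_band_width):
    """Distance-weighted blending of interior left-edge samples with
    their padded copies in the right-side guard band.

    frame: 2D array whose last guard_band_width columns duplicate the
    leftmost columns of the ERP content.
    """
    height = len(frame)
    width = len(frame[0])
    erp_width = width - guard_band_width     # width of the actual ERP content
    for j in range(height):
        for k in range(guard_band_width):
            # Deeper inside the picture, trust the interior sample more.
            w = (k + 1.0) / (guard_band_width + 1.0)
            inside = frame[j][k]                     # interior sample near the left edge
            padded = frame[j][erp_width + k]         # its padded copy in the guard band
            frame[j][k] = (1.0 - w) * padded + w * inside
    return frame
```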

4. 360 Video Quality Evaluation

Since 360 video is coded in the 2D projection space, where the pixels correspond to varying sampling rates on the sphere, and what users see is the video rendered on the sphere, quality evaluation of 360 video cannot simply apply traditional metrics (such as PSNR) directly to the decoded pixels. After testing, JVET adopted an end-to-end 360 video quality evaluation method, shown in Figure 11. It jointly considers the influence of three stages on coding performance: first, pre-processing, comprising 2D plane projection format conversion (including downsampling) and coded-frame packing; second, the codec itself; and third, post-processing, comprising pixel blending and projection format conversion (including upsampling). The resolution of the original 360 video is relatively high (such as 6K or 8K), while the coded resolution is lower, generally 4K (UHD) or 2K (HD), so the 360 video must be downsampled before coding. Coded-frame packing places the 360 video in projection format into an ordinary coded frame by translating, rotating, flipping, scaling, and padding the faces, so that it can be encoded by a conventional video encoder (H.266/H.265/H.264). One post-processing step reduces the seam artifacts described in Section 2 by taking a distance-based weighted average of the coded padding pixels and the pixels of the 360 video itself; the other upsamples the coded resolution back to the original source resolution for quality comparison.

Figure 11. End-to-end quality evaluation method [2]

Figure 12 shows the weights used to compute WS-PSNR for the ERP and CMP formats. For ERP, the weight is largest at the equator, smallest at the poles, and identical within each latitude. For CMP, the weight is largest at the center of each cube face and smallest at the face edges, and pixel positions at the same distance from a face center have the same weight (see the sketch after Figure 12).

(a) Weight w(i, j) when SSE is computed in the ERP format

(b) Weight w(i, j) when SSE is computed in the CMP format

Figure 12. Weights used by WS-PSNR to calculate distortion
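The ERP weighting can be written down compactly. The sketch below computes WS-PSNR for an ERP picture using the row weight w(i, j) = cos((j + 0.5 - H/2) * pi / H) from [5]; it is a simple reference implementation for illustration, not 360Lib code.

```python
import math

def erp_ws_psnr(ref, dec, max_val=255.0):
    """WS-PSNR for ERP pictures: weighted MSE with the ERP weight
    w(i, j) = cos((j + 0.5 - H/2) * pi / H), constant along each row.

    ref, dec: 2D arrays of equal size holding luma samples.
    """
    height = len(ref)
    width = len(ref[0])
    num, den = 0.0, 0.0
    for j in range(height):
        w = math.cos((j + 0.5 - height / 2.0) * math.pi / height)
        for i in range(width):
            diff = ref[j][i] - dec[j][i]
            num += w * diff * diff
            den += w
    wmse = num / den
    if wmse == 0:
        return float("inf")
    return 10.0 * math.log10(max_val * max_val / wmse)

# Uniform error of 1 gives WMSE = 1, so WS-PSNR = 20*log10(255) = 48.13 dB:
ref = [[128.0] * 16 for _ in range(8)]
dec = [[127.0] * 16 for _ in range(8)]
print(round(erp_ws_psnr(ref, dec), 2))
```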

The ERP and GCMP projection formats were tested with the HM-16.16, VTM-12.0, and 360Lib-12.0 reference software according to the 360 video test conditions specified by JVET [11]; the results are listed in Table 5. Quality was evaluated with the end-to-end WS-PSNR described above. Both ERP and GCMP use padding during coded-frame packing and blending-based filtering in post-decoding processing. Table 5(a) shows the performance of VTM relative to HM in the ERP projection format; (b) shows the performance of VTM relative to HM in the GCMP projection format; and (c) shows the performance of GCMP relative to ERP under VTM.

Table 5. Comparison of test results of VTM and HM on 360 video projection formats:

References

[1] B. Bross, J. Chen, S. Liu, Y.-K. Wang, "Versatile Video Coding," JVET-S2001, 19th Meeting, 22 June to 1 July 2020.

[2] P. Hanhart, Y. He, Y. Ye, J. Boyce, Z. Deng, L. Xu, “360-degree Video Quality Evaluation”, PCS 2018.

[3] Y. Ye, J. Boyce, “Algorithm descriptions of projection format conversion and video quality metrics in 360Lib”, JVET-Q2004, March 2020.

[4] 360Lib software, jvet.hhi.fraunhofer.de/svn/svn_360…

[5] Y. Sun, A. Lu, L. Yu, "AHG8: WS-PSNR for 360 video objective quality evaluation," Joint Video Exploration Team of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, JVET-D0040, Oct. 2016, Chengdu, China.

[6] J. Boyce, V. Drugeon, G. J. Sullivan, Y.-K. Wang, “Supplemental enhancement information messages for coded video bitstreams (Draft 5),” JVET-S2007, June 2020.

[7] S.-Y. Lin, L. Liu, J.-L. Lin, Y.-C. Chang, C.-C. Ju, P. Hanhart, Y. He, "AHG12: Loop filter disabled across virtual boundaries," JVET-N0438, Mar. 2019.

[8] Y.-K. Wang, et al., "AHG12: Text for subpicture agreements integration," JVET-O1143, July 2019.

[9] Y. Wang, Y. He, L. Zhang, "SW Support of 360 SEI Messages," JVET-S0257, July 2020.

[10] L. Lee, J.-L. Lin, Y. Wang, Y. He, L. Zhang, “AHG6: Blending with padded samples for GCMP”, JVET-T0118, Oct. 2020.

[11] Y. He, J. Boyce, K. Choi, J.-L. Lin, “JVET common test conditions and evaluation procedures for 360° video”, JVET-U2012, Jan. 2021.