The clarity of VR content has long been a concern and a key factor in how immersive the experience feels. Yet many users complain that even when they watch 4K or 8K VR videos, the actual experience falls short of 1080p video on a phone. **Did I buy a fake VR device? Did I just watch some fake 4K content?**

Why doesn’t VR video look sharp enough? There are a few things worth clearing up first.

I. 4K picture quality does not equal 4K look and feel

Traditional video playback is what users are familiar with: watching on phones, iPads, TVs, and other screens. The screen in front of us is relatively small, and the viewer simply stares at it. Traditional video viewing has evolved for many years, **during which the user experience has improved mainly by increasing the resolution of the video.** From DVD to today's 4K/8K, video clarity has increased many times over.

Photo from the Internet

Is there an end to this climb in resolution? Does increasing the resolution always improve the viewing experience?

In fact, there is another important metric that sits between screen resolution and actual viewing experience: PPD (Pixels Per Degree), the core standard for measuring perceived video clarity.

Take a phone with a Retina screen. Assuming the user holds it about 30 cm away, the screen occupies roughly 10 degrees of the field of view. By Apple's definition of a Retina screen, the display is about 600 pixels wide across that span, so each degree of the field of view gets about 60 pixels, i.e. 60 PPD, which corresponds to retina-level visual quality.
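The arithmetic is simple enough to sketch in a few lines (the screen width below is an illustrative assumption, not a measured Apple spec):

```python
import math

def ppd(h_pixels: int, screen_width_m: float, distance_m: float) -> float:
    """Pixels per degree for a flat screen viewed head-on.

    The screen subtends 2*atan(w / 2d) degrees of the field of view;
    PPD is simply the horizontal pixel count divided by that angle.
    """
    fov_deg = math.degrees(2 * math.atan(screen_width_m / (2 * distance_m)))
    return h_pixels / fov_deg

# ~600 px across a ~5 cm wide region, viewed from 30 cm away
print(round(ppd(600, 0.05, 0.30)))   # ~63 PPD: retina level
```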

It is in this sense that **a 4K screen resolution does not necessarily mean an ultra-HD viewing experience; the key is how many pixels per degree the human eye actually receives.** When viewing a traditional screen from a normal distance, the user basically sees the whole screen at once, so the full screen resolution translates directly into the viewing experience.

But in a three-dimensional VR video, the user is effectively examining a small piece of the picture through a magnifying glass, so the resolution required for the same perceived clarity is dramatically higher.

II. PPD of VR glasses

Photo from the Internet

So, on to the second question: **how much PPD does a user get from a 2K phone screen 40 cm away, or from an 8K TV 2 meters away?** The figure below gives rough numbers.

Table data from the presentation "Key Technologies and Standards Development of VR and Free-View Video" by Wang Ronggang, Shenzhen Graduate School of Peking University

It can be seen that when watching 4K TV, **a user's PPD can reach 80+, already beyond the upper limit of the retina-level display effect.** Some researchers have run experiments in which participants, not told which was which, watched 4K and 8K TVs and then judged which was sharper. Correct and incorrect judgments turned out to be almost evenly split.

But back to VR: on the VR devices shown above, PPD plummets without exception, falling far short of the 60 PPD retina standard. Why do the same pixels look so blurry in a VR device?

Photo from the Internet

A typical VR video is mapped onto a sphere, with the user effectively standing at its center and looking outward. **Because the human eye's field of view is limited, the user can only see a small part of the 360-degree sphere at any moment.** The rest of the sphere comes into view only when the user rotates their head.

The area the user sees is called the “Viewport” and is the yellow area in the image above.

At this point it is easy to understand why VR looks unclear: if the whole sphere is a 4K video, the user sees only a small region of roughly 1K x 1K pixels at a time, so the PPD drops sharply and the picture inevitably looks blurry.
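A back-of-the-envelope check makes the drop concrete (assuming a 3840x1920 equirectangular frame and a headset with a 100-degree horizontal field of view, both typical figures rather than numbers from this article):

```python
# A 4K equirectangular frame spreads 3840 px across 360 degrees of longitude.
frame_width_px = 3840
hfov_deg = 100                                   # assumed headset horizontal FOV

viewport_px = frame_width_px * hfov_deg / 360    # ~1067 px across the viewport
ppd = frame_width_px / 360                       # ~10.7 PPD, regardless of FOV
print(round(viewport_px), round(ppd, 1))         # 1067 10.7 -- far below 60 PPD
```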

III. How to improve the clarity of VR video

So how can we improve the clarity of VR video, and with it the user experience?

The first option is to raise the resolution of the video itself. In theory, as long as the full frame is sharp enough, VR will look and feel better.

But can we simply keep raising the resolution: 8K when 4K is not sharp enough, 16K when 8K is not? Is that workable?

**The reality is that at these resolutions, the complexity of encoding, transmission, and decoding grows steeply.** Today very few mobile devices can decode a full 8K picture in real time, let alone anything above 8K.

As a result, current hardware quickly hits a performance bottleneck and cannot handle such ultra-high-definition streams at scale, in mass-produced devices delivered to users.

The more common solution in today's VR devices is this: since the user can only see a small region at a time, decode only that region for the user to watch, and when the user rotates the viewport, update the corresponding region.

Since current coding techniques mainly operate on rectangular blocks, we can partition the original picture into a 4x4 grid and encode each block independently. If the original picture is 8K (7680x4320), each block is exactly 1080p. It then seems that all we need to do is determine where the user's current viewport sits, find the few blocks it covers, decode only those, and render them to the window.

**This does solve part of the problem and lightens the decoding burden, but the partitioning is far from ideal.** Look carefully at the figure above and you can see why: with the viewport positioned as shown, we must decode nine blocks at once (9/16 of the full picture), yet the viewport is centered on just one of those nine; the eight blocks around the edge are fully decoded but contribute only a thin strip each to the rendered image. Decoding resources are still wasted.
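A small sketch of that coverage calculation (the viewport rectangle below is illustrative, chosen so it straddles tile boundaries):

```python
def covered_tiles(vp_x, vp_y, vp_w, vp_h, frame_w=7680, frame_h=4320, cols=4, rows=4):
    """Return the (row, col) index of every tile a viewport rectangle touches."""
    tile_w, tile_h = frame_w // cols, frame_h // rows
    c0, c1 = vp_x // tile_w, (vp_x + vp_w - 1) // tile_w
    r0, r1 = vp_y // tile_h, (vp_y + vp_h - 1) // tile_h
    return [(r, c) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]

# A viewport that straddles tile boundaries touches 9 of the 16 tiles,
# even though 8 of them contribute only a thin strip to the picture.
print(len(covered_tiles(1500, 900, 2600, 1400)))   # 9
```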

And if the picture is subdivided further, the number of decoder instances grows. Generally speaking, phones and standalone VR headsets have a limited number of hardware decoders; you cannot create arbitrarily many, and opening too many hardware decoders at once is simply not feasible.

Facing this problem, the iQiyi technical team, after a series of trials, put forward its own solution and chose Tile-based block coding, making 8K video under VR a reality.

Simply put, **a frame is divided into a number of rectangular sub-blocks (tiles); every tile shares the same encoding parameters but is encoded independently.** At decode time we only need to assemble M x N tiles into one rectangle: the frame data of each tile is concatenated end to end and fed to a single decoder, the decoded image comes out as an M x N mosaic, and during rendering each region is mapped to the corresponding part of the window. This reduces the number of decoders required.

Scene photo from the iQiyi World Conference

As shown in the figure above, the 8K video is divided into an 8x8 grid, and the user viewport is the yellow area, which covers 12 tiles. Suppose we combine every 4 tiles into a 2x2 rectangle and send each rectangle to a decoder together: then only three decoders are needed to cover the scene.
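A minimal sketch of that grouping step (the visible-tile list is illustrative, and the grouping is naively aligned to a fixed grid; a real player would pack groups against the actual viewport):

```python
def group_into_rects(tiles, group_rows=2, group_cols=2):
    """Greedily bucket (row, col) tile indices into grid-aligned 2x2 decoder groups."""
    groups = {}
    for r, c in tiles:
        key = (r // group_rows, c // group_cols)   # which 2x2 cell this tile falls in
        groups.setdefault(key, []).append((r, c))
    return list(groups.values())

# 12 visible tiles: rows 2-3, columns 2-7 of the 8x8 grid
visible = [(r, c) for r in range(2, 4) for c in range(2, 8)]
print(len(group_into_rects(visible)))   # 3 -- three 2x2 groups, three decoders
```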

At this point, decoding and displaying the area inside the user's viewport is essentially solved. But a large area outside the viewport is still dark, and the user is certain to rotate the view. What happens when they quickly turn to look somewhere else?

Tiles outside the field of view are all encoded in units of GOPs (Groups of Pictures). A newly needed tile can only be decoded from the start of its GOP, and decoding that frame sequence takes time. So when the viewport rotates quickly, the user would see regions where no decoding has happened yet.

Scene photo from the iQiyi World Conference

The solution is to additionally encode a low-definition stream, say 1080p or 2K, whose content is exactly the same as the full 8K picture but at a lower resolution and bit rate. During playback, one decoder is always kept running on this 1080p/2K stream, and each decoded frame is immediately pasted onto the whole render sphere as a fallback layer, so the user never sees a "black field".
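Conceptually the compositing works like the toy sketch below (the class and method names are assumptions for illustration, not iQiyi's actual renderer API):

```python
class SpherePainter:
    """Toy stand-in for the renderer: tracks which layer each sphere region shows."""
    def __init__(self, rows=8, cols=8):
        self.layer = [["base"] * cols for _ in range(rows)]

    def paint_full(self):                  # the low-res fallback covers the whole sphere
        for row in self.layer:
            row[:] = ["base"] * len(row)

    def paint_tile(self, r, c):            # a decoded 8K tile replaces the fallback there
        self.layer[r][c] = "8K"

sphere = SpherePainter()
sphere.paint_full()
for r, c in [(2, 2), (2, 3), (3, 2), (3, 3)]:   # tiles already decoded this GOP
    sphere.paint_tile(r, c)
# Everywhere else the user still sees the 1080p/2K fallback, never a black field.
```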

Scene photo from the iQiyi World Conference

As the user rotates the view, **several new 8K tiles are decoded and pasted onto the corresponding positions of the sphere, so the user again sees a sharper picture.** In the figure above, if the user moves the viewport to the right, the red tiles move out of the viewport, the green tiles move in, and the green tiles are updated within one GOP.

Scene photo from the iQiyi World Conference

Tile combination is very flexible: any M x N rectangle will do, so 2x2, 2x3, 4x2, and so on can be combined freely.

The overall decode-and-render architecture is shown in the figure above. After the data is received, the tile numbers are computed from the user's current head pose, yielding the list of tiles inside the current viewport. **The tiles are then combined into rectangles (2x2, 2x3, and so on), and each combined group is sent to its own decoding thread,** so multiple threads decode in parallel. The frames output by each decoder are synchronized by PTS and finally handed to the renderer, which performs either anti-distortion rendering or direct rendering.
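Sketched as code, one frame of that pipeline might look roughly like this (a simplified skeleton with stand-in decode logic; the real player is of course far more involved):

```python
from concurrent.futures import ThreadPoolExecutor

def decode_group(group, pts):
    """Stand-in for one hardware decoder instance handling one tile rectangle."""
    return pts, group, f"frame@{pts}"       # real code would return decoded pixels

def render_frame(visible_tiles, pts, pool):
    # 1. combine visible tiles into rectangles (see the grouping sketch above)
    groups = [visible_tiles[i:i + 4] for i in range(0, len(visible_tiles), 4)]
    # 2. one decode task per group, all running in parallel
    futures = [pool.submit(decode_group, g, pts) for g in groups]
    # 3. collect the outputs and align them on PTS before rendering
    return sorted(f.result() for f in futures)

with ThreadPoolExecutor(max_workers=3) as pool:
    visible = [(r, c) for r in range(2, 4) for c in range(2, 8)]   # 12 tiles
    print(len(render_frame(visible, pts=0, pool=pool)))            # 3 decoded groups
```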

Photo from the Internet

The user's current pose is reported by the sensor in real time. By converting from spherical coordinates to a planar Cartesian system, we obtain the latitude and longitude range of the current viewport, and from that the tiles it covers on the sphere.
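A simplified version of that mapping (assuming an equirectangular tile layout and ignoring wrap-around at the 180-degree seam, which real code must handle):

```python
def viewport_tiles(yaw_deg, pitch_deg, hfov=100, vfov=90, rows=8, cols=8):
    """Map a head pose to the ERP tiles its viewport covers.

    yaw in [-180, 180), pitch in [-90, 90]; each tile spans 360/cols degrees
    of longitude and 180/rows degrees of latitude.
    """
    lon_lo, lon_hi = yaw_deg - hfov / 2, yaw_deg + hfov / 2
    lat_lo, lat_hi = pitch_deg - vfov / 2, pitch_deg + vfov / 2
    col_lo = int((lon_lo + 180) // (360 / cols))
    col_hi = int((lon_hi + 180) // (360 / cols))
    row_lo = int((90 - lat_hi) // (180 / rows))   # row 0 is the North Pole band
    row_hi = int((90 - lat_lo) // (180 / rows))
    clamp = lambda v, top: max(0, min(top, v))
    return [(r, c)
            for r in range(clamp(row_lo, rows - 1), clamp(row_hi, rows - 1) + 1)
            for c in range(clamp(col_lo, cols - 1), clamp(col_hi, cols - 1) + 1)]

print(len(viewport_tiles(yaw_deg=0, pitch_deg=0)))   # tiles visible looking dead ahead
```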

This method makes streaming playback on VR devices both sharper and smoother, and lowers the hardware bar, turning an 8K look and feel in VR into reality.

IV. VR industry standards and the future

VR industry metrics from the slides of Visbit CTO Zhou Changyin at the LiveVideoStack 2019 conference

In fact, the experience of using a VR device is influenced by many factors besides clarity. **The figure above lists general indicators of VR immersion.** A few of them are expanded on below.

1. MTP latency

MTP (Motion-to-Photon) latency is the time from the input action (a head rotation) to the display update (light emitted from the refreshed screen).

Photo from the Internet

This indicator is very important: a good experience requires an MTP latency of no more than 20 ms. If the latency is too large, the user turns their head but the picture does not follow in time, and the experience becomes dizzying.

2. Image error caused by lens distortion

Photo from the Internet

Radial distortion, in which light rays bend more the farther they pass from the center of the lens, is further classified into pincushion distortion and barrel distortion. Usually the distortion rate should be kept at about 1%.
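For reference, radial distortion is commonly modeled as a polynomial in the distance from the optical center (the Brown model; the coefficients below are made-up examples):

```python
def radial_distort(x, y, k1, k2):
    """Brown-model radial distortion of a normalized image point (x, y).

    r^2 = x^2 + y^2 measures distance from the optical center; negative
    coefficients give barrel distortion, positive ones pincushion. VR
    renderers apply the inverse ("anti-distortion") pass so the picture
    looks straight again after passing through the headset lens.
    """
    r2 = x * x + y * y
    scale = 1 + k1 * r2 + k2 * r2 * r2
    return x * scale, y * scale

print(radial_distort(0.5, 0.0, k1=-0.10, k2=-0.02))   # mild barrel: point pulled inward
```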

Photo from the Internet

**Tangential distortion** occurs when the lens is not parallel to the camera sensor plane, mostly due to assembly deviation; the image looks as if it is "lying down."

3. Monocular versus binocular visual effects

Photo from the Internet

Humans are binocular animals. In the real world, the two eyes see a target with a certain parallax, which the brain perceives as a stronger sense of three-dimensionality and an obvious depth of field.

Other indicators, such as resolution, frame rate, and levels, are more familiar to users and close to their counterparts in ordinary video, so they will not be repeated here.

These are some of the technical indicators of VR playback, each of which has a direct impact on the user experience.

Finally, we want to talk about the future progress and exploration of VR. We see the following directions:

1. Reduce the bit rate

The picture is from the presentation "Key Technologies and Standards Development of VR and Free-View Video" by Wang Ronggang, Shenzhen Graduate School of Peking University

Video coding standards keep evolving as definition improves. According to the "5G High-tech Video: VR Video Technology White Paper (2020)" issued by the State Administration of Radio and Television, encoding standards such as H.265 and AVS2 can still be used at 8K VR definition, but more advanced standards such as H.266/AVS3 will be needed to support resolutions above 8K in the future. AVS3 is expected to roughly halve the bit rate compared with AVS2.

Photo from the Internet

**Reducing the bit rate can also be achieved by changing the projection.** The traditional ERP (Equirectangular Projection) greatly stretches the regions near the North and South Poles, producing a great deal of pixel redundancy, which adds extra bit rate during coding.

The cube projection and pyramid projection shown above are improvements on ERP that can effectively reduce the number of pixels to be coded.
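A quick pixel count shows the saving (standard geometry, with a common face-size choice of W/4, not figures from this article):

```python
W = 7680                            # ERP frame width (8K)
erp_pixels = W * (W // 2)           # equirectangular frames are 2:1
cube_pixels = 6 * (W // 4) ** 2     # cubemap: 6 faces, each W/4 x W/4
print(cube_pixels / erp_pixels)     # 0.75 -- 25% fewer pixels to encode
```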

2. Reduce the transmission load

Photo from the Internet

Transmitting all 360 degrees of spherical data over the network obviously demands more bandwidth. Judging from actual usage of VR goggles, users rarely turn a full 180 degrees while watching a video. So we can consider loading only half the sphere, and updating that half-sphere as the user's view turns.

**At the same time, prediction algorithms based on deep learning or other AI techniques can be used to predict hot spots in the picture and the user's future motion trajectory,** so that data for those hot spots, or along the predicted trajectory, can be loaded in advance.
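As a toy illustration of trajectory-based prefetching (a naive linear extrapolation; production systems would use learned models):

```python
def predict_yaw(samples, horizon_s=0.5):
    """Extrapolate head yaw linearly from the last two (t, yaw) sensor samples."""
    (t0, y0), (t1, y1) = samples[-2:]
    velocity = (y1 - y0) / (t1 - t0)          # degrees per second
    return y1 + velocity * horizon_s          # where the user will likely look next

history = [(0.00, 10.0), (0.05, 12.0)]        # turning right at 40 deg/s
print(predict_yaw(history))                   # ~32 deg: prefetch tiles around there
```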

3. Optimize the decoding module

iQiyi World Conference

The decoding module can support mixed CPU+GPU scheduling: for example, the CPU decodes only the relatively low-definition background stream, while the GPU decodes the many HD tiles.

The 8×8 partitioning can also be refined slightly. For example, the pictures in the North and South Pole regions are simple and map to fewer original pixels on the spherical model, so the tiles in those regions can be made larger, letting one decoder cover more of the original picture.

4. From 3DoF to 6DoF

Photo from the Internet

**DoF (Degrees of Freedom) is an important metric in VR technology.** It refers to the basic ways an object can move in space: translation along three axes (forward/backward, left/right, up/down) and rotation about three axes (pitch, yaw, roll). We will not explain DoF in depth here. In general, **the more DoF, the more flexibly an object can move in space; likewise, the higher the DoF, the more ways the user can interact with the device.**

Photo from the Internet

As we all know, a natural, immersive interactive experience is the constant pursuit of VR technology. Many devices already support 6DoF. We believe that in the near future, more and more people will experience the deep immersion of VR!

Some pictures are from the Internet. If there are any copyright issues, please contact us promptly.