Author: Wang Hui

Background:

Panoramic video is a key technology for delivering 3D video in VR/AR. Traditional panoramic video, however, offers only three degrees of freedom (3DoF): the observer's position is fixed, and only yaw, pitch, and roll can be experienced. With six degrees of freedom (6DoF), the observer can move freely within a limited space, experiencing not only yaw, pitch, and roll but also front/back, up/down, and left/right translation, as shown in the figure below.

Panoramic video content falls into two categories: computer-graphics renderings and footage of real scenes. For the latter, the camera position is fixed at capture time, so if the viewpoint is moved during playback, data is missing at the new position, which causes irregular stretching and deformation of the displayed image, as shown below (Figure 1). Capturing a separate panoramic image for every possible viewpoint would require far too much data. To keep the data volume manageable, previous 6DoF panoramic video methods introduce depth information, and acquiring that depth typically requires building an array of color and/or depth cameras.

This paper presents a low-cost, easy-to-use 6DoF panoramic video technology. We introduce deep neural networks that not only predict the depth view of a panoramic view but also automatically fill in the data that goes missing when the viewpoint moves, so the observer can "walk at will" within a certain range of free space and see undistorted images, as shown below (Figure 2). In particular, our method estimates panoramic depth views well without relying on a depth camera, so its use is not restricted and it works both indoors and outdoors.

Figure 1

Figure 2

Technical Introduction:

Depth information in panoramic video:

First, let's look at what depth information means in panoramic video. In addition to its color, each pixel in a video or photo can carry a depth value (the D in RGBD), which represents the distance from that pixel to the camera's imaging plane. In a panoramic video, every pixel in the surrounding 360-degree space carries such a distance, providing rich 360-degree scene structure information.
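To make this concrete, the sketch below shows how a pixel of a panorama plus its depth value maps to a 3D point. The equirectangular (latitude/longitude) storage format, the radial interpretation of depth, and the axis conventions are assumptions for illustration; the text does not specify them.

```python
import numpy as np

def equirect_pixel_to_3d(u, v, depth, width, height):
    """Map an equirectangular pixel (u, v) with depth (meters) to a 3D point
    in the camera-centered frame. Axis conventions are illustrative only."""
    # Longitude in [-pi, pi), latitude in [-pi/2, pi/2].
    lon = (u / width) * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (v / height) * np.pi

    # Unit viewing direction on the sphere, scaled by the depth value.
    direction = np.array([np.cos(lat) * np.sin(lon),
                          np.sin(lat),
                          np.cos(lat) * np.cos(lon)])
    return depth * direction

# Example: the pixel at the image center, 2.5 m away, lies straight ahead.
print(equirect_pixel_to_3d(2048, 1024, 2.5, 4096, 2048))  # -> [0. 0. 2.5]
```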

Depth is generally captured in two ways: actively (structured light, ToF) or passively (Multi-View Stereo in computer vision, which computes depth from multiple photographs). The active approach requires dedicated depth-acquisition hardware and suffers from problems such as interference outdoors and multi-device, multi-path interference. The passive approach, on the other hand, requires complex computation and is hard to make robust, especially on low-texture, repetitive, transparent, or specular surfaces.

Panoramic view depth estimation model:

We propose a deep learning method to estimate the depth view corresponding to a panoramic view. The network uses the classic encoder-decoder structure: the encoder can be any commonly used backbone, such as ResNet or VGG, and the depth decoder converts the encoded features into depth values. To support depth estimation for high-resolution panoramic video, we combine the losses from each scale for multi-scale estimation, which enables high-resolution depth reconstruction of the target panoramic view.

Panoramic view depth estimation model
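The text does not give the exact network definition, so the following is a minimal PyTorch sketch of the structure it describes: a ResNet backbone as the encoder and a small decoder with a depth head at each scale, whose per-scale losses are summed into a single multi-scale loss. The layer sizes, sigmoid output, and L1 loss are illustrative choices, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class PanoDepthNet(nn.Module):
    """Encoder-decoder depth estimator with multi-scale outputs (sketch)."""

    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)  # encoder backbone (ResNet, as in the text)
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
        # Small decoder: each stage upsamples 2x and reduces the channel count.
        chans = [512, 256, 128, 64, 32]
        self.up_blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 3, padding=1),
                          nn.ReLU(inplace=True))
            for i in range(4)
        ])
        # One 1x1 head per decoder scale, so a loss can be attached at each scale.
        self.depth_heads = nn.ModuleList([nn.Conv2d(c, 1, 1) for c in chans[1:]])

    def forward(self, rgb):
        x = self.encoder(rgb)  # 1/32-resolution features
        depths = []
        for up, head in zip(self.up_blocks, self.depth_heads):
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
            x = up(x)
            depths.append(torch.sigmoid(head(x)))  # normalized depth per scale (illustrative)
        return depths  # coarse-to-fine multi-scale predictions

def multi_scale_loss(preds, gt):
    """Combine the losses from every scale, as described in the text (L1 here)."""
    loss = 0.0
    for pred in preds:
        gt_s = F.interpolate(gt, size=pred.shape[-2:], mode="bilinear", align_corners=False)
        loss = loss + F.l1_loss(pred, gt_s)
    return loss

# Example forward pass on a downscaled equirectangular panorama.
net = PanoDepthNet()
outputs = net(torch.randn(1, 3, 256, 512))
print([o.shape for o in outputs])
```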

The trained model is then used for prediction. For example, given the panoramic view on the left of the figure below as input, the model outputs the corresponding depth view shown on the right:

Panoramic views and their corresponding depth view results

RGB+D training data:

The panoramic depth estimation model above requires a large amount of RGB+D training data, and few open datasets exist for this purpose. We therefore collected and generated two categories of training data to build our own panoramic RGBD dataset of panoramic views and their corresponding depth views: ground-truth data captured with self-built equipment, and ground-truth data generated by computer graphics.

Acquisition equipment

First, we built capture equipment for panoramic views and their depth views, using a ToF camera for depth acquisition, as shown in the figure above. A consumer phone, a ToF sensor, a PTZ head, a tripod, and a dedicated capture app acquire color and depth information simultaneously; the professional PTZ head can shoot multiple RGB and D image pairs at fixed angle intervals. To improve the quality of the ToF captures, especially the black holes caused by overexposure and underexposure, we propose a real-time, low-cost depth completion algorithm to enhance the raw depth maps. The process and results are shown below (left: warp-transformed depth map; right: enhanced depth map).

RGB+D image pair generation algorithm

Depth completion algorithm
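The completion algorithm itself is not detailed in the text. As one plausible low-cost recipe, the sketch below fills invalid (zero) ToF pixels with a morphological closing followed by OpenCV inpainting; the millimeter depth format, kernel sizes, and the inpainting choice are assumptions for illustration, not necessarily the authors' algorithm.

```python
import cv2
import numpy as np

def complete_depth(depth_mm, max_depth_mm=5000.0):
    """Fill invalid (zero) pixels in a raw ToF depth map (illustrative sketch).

    depth_mm is assumed to be a uint16 depth image in millimeters, where 0 marks
    pixels lost to over-/under-exposure.
    """
    # Close small holes cheaply first.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    closed = cv2.morphologyEx(depth_mm, cv2.MORPH_CLOSE, kernel)

    # Inpaint the remaining holes on an 8-bit copy, then restore the depth scale.
    invalid = (closed == 0).astype(np.uint8)
    depth8 = np.clip(closed.astype(np.float32) / max_depth_mm * 255.0, 0, 255).astype(np.uint8)
    filled8 = cv2.inpaint(depth8, invalid, 5, cv2.INPAINT_TELEA)
    filled = filled8.astype(np.float32) / 255.0 * max_depth_mm

    # Keep the original measurements wherever they were valid.
    out = np.where(invalid > 0, filled, closed.astype(np.float32))
    return out.astype(np.uint16)
```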

As the flow chart above shows, the color map and depth map are aligned with the help of a calibration algorithm, producing an RGB+D image pair. For multiple RGB+D image pairs we adopt the classic panorama stitching approach: precise camera poses are obtained through feature point matching, and the final panoramic view and its corresponding depth view are produced by stitching and post-processing, serving as our training ground truth, as shown in the figure below.

Collected training data
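For the stitching step just described, the following sketch illustrates the feature-matching idea on a pair of adjacent PTZ captures using ORB keypoints and a RANSAC homography; bundle adjustment, blending, and carrying the aligned depth map through the same transform are omitted. It is a simplified stand-in, not the full pipeline.

```python
import cv2
import numpy as np

def pairwise_alignment(img_a, img_b):
    """Estimate the warp between two adjacent PTZ captures via feature matching (sketch)."""
    orb = cv2.ORB_create(4000)
    kpts_a, desc_a = orb.detectAndCompute(img_a, None)
    kpts_b, desc_b = orb.detectAndCompute(img_b, None)

    # Brute-force Hamming matching with cross-check for robustness.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(desc_a, desc_b), key=lambda m: m.distance)[:500]

    src = np.float32([kpts_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kpts_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # RANSAC keeps only geometrically consistent matches; for a camera that
    # only rotates on the PTZ head, this homography encodes the relative pose.
    H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H, int(inlier_mask.sum())
```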

Second, we use computer graphics to construct a batch of modeled scenes and render panoramic views and their corresponding depth views with engines such as Unreal and Unity, adding training data to compensate for gaps in our self-collected database.

Panoramic view filling techniques:

As mentioned above, because the camera position of a live-shot panoramic video is fixed, moving the viewpoint during playback leaves missing data at the new position. To solve this, this paper proposes a panoramic view filling technique. We detect discontinuities in the depth view by thresholding, dilate the depth map at those discontinuities (i.e., expand into the background side of the discontinuity), and convert the dilated result into a binary map that serves as the input mask for the filling/inpainting model. The model then fills and repairs the color pixels inside the mask, producing the background filling result and effectively solving the problem of background pixels that are occluded by the foreground at the fixed capture viewpoint.

Panoramic view filling technique

Filling result details (left) and corresponding depth dilation results (right) (top: the bookshelf below the TV)
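A minimal sketch of the mask construction described above: threshold the depth gradient to find discontinuities, dilate the result, and use the binary map as the inpainting mask. The gradient threshold, the dilation radius, and the use of cv2.inpaint as a stand-in for the learned inpainting model are illustrative assumptions.

```python
import cv2
import numpy as np

def occlusion_mask(depth, grad_thresh=0.3, dilate_px=15):
    """Build the binary inpainting mask from depth discontinuities (sketch).

    depth: float32 depth view aligned with the color panorama. Suitable
    threshold and dilation values depend on the scene scale and on how far
    the viewer is allowed to move.
    """
    # Threshold the depth gradient to locate discontinuities (fg/bg boundaries).
    gx = cv2.Sobel(depth, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(depth, cv2.CV_32F, 0, 1, ksize=3)
    edges = (np.sqrt(gx * gx + gy * gy) > grad_thresh).astype(np.uint8) * 255

    # Dilate so the mask covers background pixels that become visible
    # when the viewpoint translates.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (dilate_px, dilate_px))
    return cv2.dilate(edges, kernel)

# The mask is fed to the inpainting model together with the color panorama;
# cv2.inpaint below is only a simple stand-in for that learned model.
# filled = cv2.inpaint(color_pano, occlusion_mask(depth_pano), 7, cv2.INPAINT_TELEA)
```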

Free-viewpoint roaming rendering:

With the panoramic view, its corresponding depth view, the background filling view, and that view's corresponding depth view in hand, we can render the 6DoF panoramic video through free-viewpoint roaming. Our rendering process is as follows:

Rendering process

In our method, the user can choose a free viewpoint, changing both the three-dimensional position of the viewpoint and the three-dimensional orientation of the line of sight, and a 6DoF panoramic video is rendered accordingly. Our advantage is that the rendered results are high quality, free of stretching deformation, and exhibit motion parallax, because the computed depth information has been added to the panoramic view. In addition, we estimate and pre-generate the filling pixels (RGB+D) for the missing background regions, which fills the holes well and compensates for what the capture could not record, finally achieving low-cost, high-quality 6DoF panoramic video.
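As an illustration of the rendering step, the sketch below forward-warps an equirectangular RGB+D panorama to a translated viewpoint: each pixel is unprojected to 3D using its depth, shifted by the viewer's translation, and re-projected, with nearer points overwriting farther ones. It is a simplified stand-in for the actual renderer; viewer rotation and filling from the background RGB+D layer are only indicated.

```python
import numpy as np

def render_translated_view(rgb, depth, t, out_shape=None):
    """Re-render an equirectangular RGB+D panorama from a translated viewpoint (sketch).

    rgb:   (H, W, 3) color panorama
    depth: (H, W)    depth in meters, aligned with rgb
    t:     (3,)      viewer translation in meters (the 6DoF position offset)
    """
    H, W = depth.shape
    oh, ow = out_shape or (H, W)

    # Pixel grid -> viewing directions on the unit sphere (equirectangular).
    v, u = np.mgrid[0:H, 0:W]
    lon = (u + 0.5) / W * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (v + 0.5) / H * np.pi
    dirs = np.stack([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)

    # 3D points in the original camera frame, shifted into the new viewer frame.
    pts = dirs * depth[..., None] - np.asarray(t, dtype=np.float32)
    r = np.linalg.norm(pts, axis=-1)

    # Project back to equirectangular coordinates of the translated viewpoint.
    new_lon = np.arctan2(pts[..., 0], pts[..., 2])
    new_lat = np.arcsin(np.clip(pts[..., 1] / np.maximum(r, 1e-6), -1.0, 1.0))
    nu = ((new_lon + np.pi) / (2.0 * np.pi) * ow).astype(int) % ow
    nv = np.clip(((np.pi / 2.0 - new_lat) / np.pi * oh).astype(int), 0, oh - 1)

    # Splat far points first so nearer points overwrite them (motion parallax).
    order = np.argsort(-r, axis=None)
    out = np.zeros((oh, ow, 3), dtype=rgb.dtype)
    out[nv.reshape(-1)[order], nu.reshape(-1)[order]] = rgb.reshape(-1, 3)[order]
    return out  # remaining zero pixels are holes to fill from the background layer
```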

Overall solution:

The 6DoF panoramic video solution described in this paper includes a panoramic (RGB+D) capture device; an algorithm SDK (which can be flexibly deployed as a cloud service) covering panorama stitching, panoramic depth estimation, and the background filling model; and a client player for free-viewpoint roaming rendering.

Our overall solution can be used for VR/AR and all types of 6DoF video (including bullet time, free-viewpoint video, etc.). In particular, we designed an end-cloud computing framework (below): the 6DoF panoramic video data is first stored in the cloud; as the user interacts in the client player, the free-viewpoint roaming rendering parameters are recomputed and uploaded to the cloud; the cloud renders the panoramic video with those parameters and delivers the current HD rendering result back to the user's player for display.

The end-cloud computing framework lets us efficiently render panoramic video data at 10K-16K resolution (RGB+D) using cloud computing devices such as GPUs. The high-resolution rendering results (1K-2K views) are then delivered to the user's client. This framework effectively combines the strengths of the device and the cloud to obtain low-latency, high-quality 6DoF panoramic video.
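The transport between the player and the cloud renderer is not specified in the text; the sketch below only illustrates the interaction loop, with a hypothetical HTTP endpoint and payload format.

```python
import json
from dataclasses import dataclass, asdict

import requests  # any transport would do; a hypothetical HTTP endpoint is assumed

@dataclass
class ViewParams:
    """Free-viewpoint rendering parameters the player uploads to the cloud."""
    position: tuple       # (x, y, z) viewer translation in meters
    orientation: tuple    # (yaw, pitch, roll) of the line of sight, in degrees
    fov_deg: float        # field of view of the requested rendered view
    frame_index: int      # which frame of the panoramic video to render

def request_rendered_view(params, endpoint="https://example.com/render"):
    """Upload rendering parameters and receive the cloud-rendered HD view (sketch).

    The endpoint URL and payload layout are hypothetical; the text only states
    that interaction parameters go up and HD rendering results come back.
    """
    resp = requests.post(endpoint,
                         data=json.dumps(asdict(params)),
                         headers={"Content-Type": "application/json"},
                         timeout=1.0)
    resp.raise_for_status()
    return resp.content  # encoded frame for the player to decode and display

# Example: the user has walked 0.2 m forward while looking straight ahead.
frame = request_rendered_view(ViewParams((0.0, 0.0, 0.2), (0.0, 0.0, 0.0), 90.0, 42))
```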

End-cloud computing framework

Conclusion:

The 6DoF panoramic video technology presented in this paper combines deep learning to produce, at low cost, the depth video corresponding to traditional panoramic video, along with high-quality filled background pixels (RGB+D). During free-viewpoint roaming it therefore generates 6DoF video that is free of stretching deformation and exhibits motion parallax. In addition, the complete solution, including the self-developed capture hardware and the end-cloud computing framework, produces high-definition results with low latency in practical applications and can be used for VR/AR and next-generation 6DoF video.