1. Foreword

All objects in the physical world are three-dimensional, and describing and displaying them is an important part of computer graphics. 3D reconstruction is one of the core techniques of computer graphics and image processing and is widely used in many fields. For example, the medical industry can use 3D models of organs to simulate surgical anatomy or to assist therapy; the entertainment industry can use 3D models to animate and simulate people and animals; the architecture industry can use 3D building models to verify the spatial rationality and the aesthetic visual effect of buildings and landscape designs. At present most research focuses on model recognition, but that is only one part of computer vision: computer vision in the full sense should go beyond two dimensions and perceive the three-dimensional environment [1]. We live in three dimensions, and to interact with and perceive the world more effectively, we need to restore it to three dimensions.

2. Introduction to common 3D representation methods

The computer screen itself is a two-dimensional plane. The reason we can nevertheless perceive images that look like real three-dimensional objects [2] is that differences in color and gray level on the screen create an illusion, so the two-dimensional display is perceived as a three-dimensional scene. According to color theory, the convex parts of the edges of 3D objects generally appear brighter, while the concave parts appear slightly darker because they are occluded from the light; together with the way the human eye images nearby objects as large and distant objects as small, this produces a sense of three-dimensionality. As for how to express a 3D model inside a computer, there are different representations for different usage requirements, which can be roughly divided into four categories: Depth Map, Point Cloud [3], Voxel [4] and Mesh [5].

2.1 Depth Map

Fig.1 Depth map of a cube

A depth map is a 2D image in which each pixel records the distance from the viewpoint to the surface of the occluder (the occluder being the object that blocks the light and casts a shadow). It is equivalent to keeping only the Z-axis of the 3D information, and the vertices corresponding to these pixels are all “visible” to the observer.
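As a small illustration of what a depth map encodes, the following NumPy sketch back-projects a depth map into the visible 3D surface points; the pinhole intrinsics fx, fy, cx, cy are hypothetical placeholders, not values from any particular sensor.

```python
import numpy as np

# Minimal sketch: back-project a depth map into 3D points.
# The intrinsics (fx, fy, cx, cy) are hypothetical pinhole-camera parameters.
def depth_to_points(depth, fx=500.0, fy=500.0, cx=None, cy=None):
    h, w = depth.shape
    cx = w / 2.0 if cx is None else cx
    cy = h / 2.0 if cy is None else cy
    v, u = np.mgrid[0:h, 0:w]              # pixel coordinates
    z = depth                              # each pixel stores only a distance (Z)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)    # (H, W, 3) "visible" surface points

points = depth_to_points(np.ones((4, 4))) # a flat surface 1 unit away
```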

2.2 Voxel

Fig.2 Minecraft

A voxel (volume pixel) is the smallest unit of model data in three-dimensional space and is a kind of regular data, conceptually similar to a pixel in a two-dimensional image. Voxels themselves contain no data about their position in space (i.e. their coordinates); positions are inferred from their arrangement relative to other voxels. As shown in Fig.2, Minecraft, a popular PC 3D game, lets players freely stack voxel blocks in their own world to construct their own personalized 3D characters and environments.

2.3 Mesh

Fig.3 Dolphin mesh

A polygon mesh is a collection of vertices and polygons that represents the shape of a polyhedron in 3D computer graphics; it is a kind of irregular structured data. These meshes are usually made up of triangles, quadrilaterals, or other simple convex polygons. The most commonly used is the triangle mesh, which typically stores three types of information (a minimal data-structure sketch follows the list below):

Vertices: each triangle has three vertices, which may be shared with other triangles.

Edges: an edge connects two vertices, and each triangle has three edges.

Faces: each triangle corresponds to a face, which can be represented by a list of vertices or a list of edges.
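To make these three types of information concrete, here is a minimal NumPy sketch of a triangle mesh (a tetrahedron) that stores vertices and faces explicitly and derives the edges from the faces; the array layout and helper name are illustrative, not a particular library's API.

```python
import numpy as np

# Minimal sketch of a triangle mesh: explicit vertices and faces, derived edges.
vertices = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])      # 4 vertices of a tetrahedron
faces = np.array([[0, 1, 2],
                  [0, 1, 3],
                  [0, 2, 3],
                  [1, 2, 3]])               # each face is a triple of vertex indices

def edges_from_faces(faces):
    # Each triangle contributes 3 edges; edges shared by two triangles are kept once.
    e = np.concatenate([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]])
    return np.unique(np.sort(e, axis=1), axis=0)

edges = edges_from_faces(faces)             # (6, 2) for a tetrahedron
```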

2.4 Point Cloud

Fig.4 Donut point cloud image

A point cloud records data as a set of discrete points. Each point can carry rich information: the three-dimensional coordinates X, Y, Z, color, classification value, intensity value, time stamp, and so on. A point cloud in effect atomizes the real world, and high-precision point cloud data can be used to restore it (a minimal sketch of such a structure is shown below).
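A point cloud can be stored simply as an N x k array with one row per point; the sketch below uses random values and mirrors the attributes listed above (real formats such as LAS or PLY define the exact fields and types).

```python
import numpy as np

# Minimal sketch: a point cloud as an (N, k) array, one row per point.
N = 1000
xyz = np.random.rand(N, 3)                       # 3D coordinates X, Y, Z
rgb = np.random.randint(0, 256, (N, 3))          # per-point color
intensity = np.random.rand(N, 1)                 # return-signal intensity
cloud = np.hstack([xyz, rgb, intensity])         # (N, 7) point attributes
```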

So which of these should we use as our common 3D model representation? As the introduction above suggests, voxels are limited by resolution and expressive power and therefore lose many details; a point cloud has no connectivity between points and therefore lacks the surface information of an object; by comparison, a mesh is lightweight and captures rich shape detail.

3D reconstruction techniques can be roughly divided into two kinds, contact and non-contact. Among the non-contact techniques, the more common ones are based on active vision, such as laser scanning, structured light, the shadow method and Kinect, and those based on machine learning, such as statistical learning, neural networks, deep learning and semantic methods.

3. 3D Reconstruction technology based on active vision [7]

3.1 Laser scanning method

The laser scanning method uses a laser rangefinder to measure the real scene. First, the rangefinder emits a beam of light onto the surface of the object; then the distance between the object and the rangefinder is determined from the time difference between sending and receiving the signal, from which the size and shape of the measured object are obtained (a one-line sketch of this time-of-flight relation is given below).
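The core relation is simple time-of-flight; the sketch below is a minimal illustration with made-up numbers rather than a real rangefinder reading.

```python
# Minimal sketch of the time-of-flight idea behind laser ranging:
# distance = (speed of light x round-trip time) / 2. Values are illustrative.
C = 299_792_458.0                       # speed of light in m/s
round_trip_s = 6.67e-9                  # time between sending and receiving the pulse
distance_m = C * round_trip_s / 2.0     # roughly 1 metre
```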

3.2 Structured light method

In the structured light method, a projection device, an image acquisition device and the object to be measured form a 3D reconstruction system according to a calibration criterion. A structured light pattern with a known regularity is projected onto the surface of the measured object and onto a reference plane; a visual sensor then captures images, yielding the projected structured-light information on the object surface and on the reference plane. Finally, the triangulation principle, image processing and other techniques are applied to the captured image data to compute the depth of the object surface, realizing the conversion from 2D images to a 3D model (a simplified triangulation sketch follows). According to the projected pattern, structured light can be divided into point, line, plane, grid and color structured light.
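The sketch below illustrates only the triangulation step, for an idealized rectified camera-projector pair; the focal length and baseline are made-up placeholders, and a real structured-light system also involves pattern decoding and calibration.

```python
import numpy as np

# Simplified triangulation for a rectified camera-projector pair (assumed setup):
# a stripe projected at column u_proj is observed at column u_cam, and the
# offset (disparity) converts to depth by similar triangles.
def triangulate_depth(u_proj, u_cam, focal_px=800.0, baseline_m=0.1):
    disparity = u_proj - u_cam                        # shift caused by surface depth
    disparity = np.where(disparity == 0, np.nan, disparity)
    return focal_px * baseline_m / disparity          # depth in metres

depth = triangulate_depth(np.array([120.0, 150.0]), np.array([100.0, 135.0]))
```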

3.3 Shadow method

The shadow method is a simple, reliable and low-power way to reconstruct a 3D model of an object. It is based on weak structured light. Compared with the traditional structured light method it has very low requirements: a camera simply faces an object illuminated by a light source, an object moved in front of the light casts a moving shadow over the target, and by observing the spatial position of that shadow the three-dimensional structure of the object can be reconstructed.

3.4 Kinect technology

The Kinect sensor is a consumer 3D camera that has developed rapidly in recent years. It uses laser speckle ranging to obtain the depth of the scene directly. The Kinect sensor is shown in the figure below: the camera sits in the middle, and the left and right lenses, referred to as the 3D depth sensor, can focus and acquire depth information, color information and other data at the same time. Kinect must be calibrated before use, and the calibration is most often done with Zhang Zhengyou's (Zhang's) method (see the OpenCV sketch below).
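Zhang's method estimates the camera parameters from several views of a planar checkerboard, and OpenCV's calibrateCamera implements it; the sketch below is a minimal example in which the image file names and board dimensions are hypothetical.

```python
import cv2
import numpy as np

# Minimal sketch of checkerboard calibration (Zhang's method via OpenCV).
# Board size (9x6 inner corners, 25 mm squares) and file names are hypothetical.
board_size = (9, 6)
objp = np.zeros((board_size[0] * board_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2) * 0.025

obj_points, img_points = [], []
for path in ["calib_01.png", "calib_02.png"]:         # hypothetical image files
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        continue
    found, corners = cv2.findChessboardCorners(gray, board_size)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Returns the intrinsic matrix K, distortion coefficients and per-view poses.
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
```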

4. 3D reconstruction technology based on Pixel2Mesh

Pixel2Mesh (Generating 3D Mesh Models from Single RGB Images)

4.1 Overall Framework

Fig.5 Network architecture diagram

1. First, an input image is given (Input Image).

2. For any input image, a fixed-size ellipsoid (with three-axis radii of 0.2, 0.2 and 0.8 m respectively) is initialized as its initial 3D shape (Ellipsoid Mesh).

3. The whole network can be divided into two parts: an image feature extraction network and a hierarchical mesh deformation network.

1. The upper part is responsible for extracting features from the input image with a fully convolutional neural network (CNN) [8].

2. The lower part is responsible for extracting 3D mesh features with a graph convolutional neural network (GCN) [9] and continuously deforming the mesh, gradually turning the ellipsoid into the required 3D model; in this example the goal is the final aircraft model.

4. Notice that the Perceptual Feature Pooling layer in the figure connects the 2D image information above with the 3D mesh information below; that is, the node states of the graph convolutional network on the 3D mesh are updated with reference to the 2D image features. This process can be seen as the Mesh Deformation step.

5. Another key component is Graph Unpooling. This module increases the number of nodes in the graph stage by stage; as the figure shows, the number of vertices grows from 156 to 628 to 2466, which is precisely the coarse-to-fine idea in action.

4.2 Graph convolutional neural network (GCN)

Let us first look at how a graph convolutional neural network [6] extracts features. RNNs and CNNs are generally used to extract features from, respectively, one-dimensional sequential data such as speech and two-dimensional grid data such as images, both living in Euclidean space. GCN, in contrast, extracts features from data organized as a graph structure. GCN has a natural advantage in representing 3D structures: as described earlier, a 3D mesh describes an object by vertices, edges and faces, which corresponds directly to a graph G = (V, E, F), where V is the set of vertices, E the set of edges and F the feature vectors attached to the vertices.

The graph convolution is defined as:

$$f_p^{l+1} = w_1 f_p^l + \sum_{q \in N(p)} w_2 f_q^l$$

where $f_p^l$ and $f_p^{l+1}$ are the feature vectors of vertex $p$ before and after the convolution, $N(p)$ is the set of neighbouring vertices of $p$, and $w_1$ and $w_2$ are the parameters to be learned.
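A minimal NumPy sketch of this per-vertex update is shown below; the weights are random placeholders and the ReLU non-linearity is an assumption for illustration.

```python
import numpy as np

# Minimal sketch of the per-vertex graph convolution: each vertex mixes its own
# feature (through W1) with the sum of its neighbours' features (through W2).
def graph_conv(F, neighbors, W1, W2):
    # F: (N, f_in) vertex features; neighbors[p] lists the neighbour indices of p
    F_out = F @ W1
    for p, nbrs in enumerate(neighbors):
        F_out[p] += F[nbrs].sum(axis=0) @ W2
    return np.maximum(F_out, 0.0)            # assumed ReLU non-linearity

F = np.random.rand(4, 8)                     # 4 vertices, 8-dim features
neighbors = [[1, 2], [0, 2, 3], [0, 1], [1]]
W1, W2 = np.random.rand(8, 16), np.random.rand(8, 16)
F_next = graph_conv(F, neighbors, W1, W2)    # (4, 16)
```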

The hidden-layer propagation of graph convolution can be written as:

$$H^{i+1} = f(H^i, A)$$

where $f$ is a propagation rule, for example $f(H^i, A) = \sigma(A H^i W^i)$ with a non-linearity $\sigma$. Each hidden layer $H^i$ is a feature matrix of size $N \times f_i$ ($N$ is the number of nodes in the graph and $f_i$ the number of features per node at layer $i$), and each row of the matrix is the $f_i$-dimensional feature of the corresponding node. $A$ is the $N \times N$ adjacency matrix. Multiplying by the layer-$i$ weight matrix $W^i$ of size $f_i \times f_{i+1}$, an input of size $N \times f_i$ produces $(N \times N)(N \times f_i)(f_i \times f_{i+1}) \rightarrow N \times f_{i+1}$. In each hidden layer, GCN aggregates this information using the propagation rule $f$ to form the features of the next layer, so the features of the graph become more and more abstract with every successive layer.
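The dimension bookkeeping above can be checked with a toy matrix-form propagation, here assuming the simple rule f(H, A) = ReLU(A H W).

```python
import numpy as np

# Toy check of the shapes: A is (N, N), H_i is (N, f_i), W_i is (f_i, f_{i+1}),
# so A @ H @ W has shape (N, f_{i+1}).
N, f_i, f_i1 = 5, 8, 16
A = np.random.randint(0, 2, (N, N)).astype(float)   # toy adjacency matrix
A = np.maximum(A, A.T)                               # make it symmetric
H = np.random.rand(N, f_i)
W = np.random.rand(f_i, f_i1)

H_next = np.maximum(A @ H @ W, 0.0)                  # propagation rule f = ReLU(AHW)
assert H_next.shape == (N, f_i1)
```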

From the two expressions above, it can be seen that each node of the graph convolutional network is updated from its own features together with the features of its neighbouring nodes.

4.3 Mesh Deformation Block

Function: takes the 2D CNN features together with the current 3D vertex positions and shape features as input, and outputs new 3D vertex positions and shape features.

Fig.6 Mesh Deformation Block

In order to generate the 3D mesh model corresponding to the object shown in the input image, the Mesh Deformation Block must bring the 2D CNN features (P) extracted from the input image into the mesh; this requires fusing the image features with the current vertex positions (C_{i-1}) of the mesh. The fused features are then concatenated with the mesh shape features (F_{i-1}) attached to the vertices of the input graph, and the result is fed into a G-ResNet based module. G-ResNet is a graph-based ResNet that produces new vertex coordinates (C_i) and 3D shape features (F_i) for each vertex (a structural sketch follows).
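Structurally, one deformation block can be sketched as follows; pool_image_features and g_resnet are placeholders for the components described in sections 4.4 and 4.5, and the shapes follow the numbers given in the text.

```python
import numpy as np

# Structural sketch of one Mesh Deformation Block (not the authors' code).
def mesh_deformation_block(P, C_prev, F_prev, pool_image_features, g_resnet):
    # P: 2D CNN feature maps of the input image
    # C_prev: (N, 3) current vertex coordinates; F_prev: (N, 128) shape features
    perceptual = pool_image_features(P, C_prev)           # (N, 1280) pooled 2D features
    fused = np.concatenate([perceptual, F_prev], axis=1)  # (N, 1408) per-vertex input
    C_new, F_new = g_resnet(fused)                        # new coordinates and features
    return C_new, F_new
```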

4.4 Perceptual Feature Pooling Layer

Function: Fuses 3D vertex positions with 2D CNN features

Fig.7 Perceptual Feature Pooling Layer

The module extracts the corresponding information from the image features P according to the 3D vertex coordinates, and then fuses the extracted features with the vertex features from the previous step. Concretely, given the three-dimensional coordinates of a vertex, the camera intrinsics are used to compute the vertex's two-dimensional projection on the input image plane, and bilinear interpolation over the four neighbouring feature-map pixels gathers the feature at that location, which can then be fed into the GCN to extract the structural features of the graph. In particular, the features extracted from the ‘conv3_3’, ‘conv4_3’ and ‘conv5_3’ layers are concatenated, giving a total of 1280 channels (256 + 512 + 512). This perceptual feature is then concatenated with the 128-dimensional 3D feature from the input mesh, for a total dimension of 1408.
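The sketch below illustrates the pooling for a single vertex; it assumes a simple pinhole projection and feature maps stored as (H, W, C) arrays, whereas the real network uses the dataset's actual camera parameters.

```python
import numpy as np

# Minimal sketch of perceptual feature pooling for one vertex.
def bilinear_sample(feat, x, y):
    h, w = feat.shape[:2]
    x, y = np.clip(x, 0, w - 1), np.clip(y, 0, h - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * feat[y0, x0] + wx * (1 - wy) * feat[y0, x1]
            + (1 - wx) * wy * feat[y1, x0] + wx * wy * feat[y1, x1])

def pool_vertex(vertex, feat_maps, fx, fy, cx, cy, img_size=224):
    X, Y, Z = vertex
    u, v = fx * X / Z + cx, fy * Y / Z + cy          # 2D projection of the vertex
    pooled = []
    for fm in feat_maps:                             # e.g. conv3_3, conv4_3, conv5_3
        s = fm.shape[0] / img_size                   # rescale to this feature level
        pooled.append(bilinear_sample(fm, u * s, v * s))
    return np.concatenate(pooled)                    # 256 + 512 + 512 = 1280 channels
# Concatenating the 128-dim shape feature of the vertex then gives 1408 dims.
```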

4.5 G-ResNet

Function: extracts features over the graph structure.

After obtaining the 1408-dimensional feature of each vertex, which represents both the 3D mesh information and the 2D image information, the model needs a GCN that predicts the new position and shape feature of each vertex; this requires efficient exchange of information between vertices [10]. However, as introduced in section 4.2, each graph convolution only exchanges features between neighbouring vertices, which greatly limits how quickly information propagates. To solve this problem, a very deep G-ResNet is built by means of shortcut connections. In this framework, the G-ResNet of every block has the same structure, consisting of 14 graph residual layers with 128 channels.
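A structural sketch of such a residual stack is given below; it reuses the graph_conv function sketched in section 4.2, assumes the input has already been projected to 128 channels, and places one shortcut every two layers as an illustrative choice rather than the paper's exact layout.

```python
# Structural sketch of G-ResNet: 14 graph convolutions with 128 channels and
# shortcut connections, so information can travel beyond one-hop neighbours.
def g_resnet(F_in, neighbors, weights, num_layers=14):
    # weights: list of (W1, W2) pairs, one per layer, each mapping 128 -> 128;
    # F_in is assumed to be (N, 128) after an initial projection.
    H = F_in
    for i in range(0, num_layers, 2):
        shortcut = H
        H = graph_conv(H, neighbors, *weights[i])
        H = graph_conv(H, neighbors, *weights[i + 1])
        H = H + shortcut                   # residual (shortcut) connection
    return H
```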

4.6 Graph Unpooling Layer

Function: increases the number of vertices in the graph convolutional network.

Fig.8 Graph Unpooling schematic diagram

Because each graph convolutional block works with a fixed number of vertices, starting from a mesh with fewer vertices and adding more only when necessary reduces memory overhead and produces better results. A simple method would be to add a vertex [11] at the center of each triangle and connect it to the triangle's three vertices; however, this leads to an imbalance in vertex degree. Inspired instead by the vertex-adding strategy of the mesh subdivision algorithms commonly used in computer graphics, a vertex is added at the midpoint of each edge and connected to the edge's two endpoints (see Fig.8a). The 3D feature of a newly added vertex is set to the average of its two neighbours, and if three vertices are added to the same triangle (dashed lines), they are connected to one another as well. Thus every triangle in the original mesh yields four new triangles, and the number of vertices grows uniformly with the number of edges in the original mesh.
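The edge-based unpooling described above can be sketched as follows: one new vertex per edge, placed at the midpoint (features averaged the same way), with every original triangle split into four.

```python
import numpy as np

# Minimal sketch of edge-based graph unpooling.
def graph_unpool(vertices, faces):
    edge_mid = {}                          # (i, j) -> index of the new midpoint vertex
    new_vertices = list(vertices)

    def midpoint(i, j):
        key = (min(i, j), max(i, j))
        if key not in edge_mid:
            edge_mid[key] = len(new_vertices)
            new_vertices.append((vertices[i] + vertices[j]) / 2.0)
        return edge_mid[key]

    new_faces = []
    for a, b, c in faces:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        # each original triangle becomes four triangles
        new_faces += [[a, ab, ca], [ab, b, bc], [ca, bc, c], [ab, bc, ca]]
    return np.array(new_vertices), np.array(new_faces)
```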

4.7 Loss

1. Chamfer Loss [3]

The Chamfer loss l_c measures the distance between the predicted mesh vertices and the ground-truth point set: each predicted vertex p is matched to its nearest ground-truth point q, and vice versa, and the squared distances are accumulated. Its purpose is to constrain the positions of the mesh vertices and pull them towards the correct locations, but on its own it is not enough to produce a good 3D mesh.
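A minimal NumPy sketch of the squared Chamfer distance between the predicted vertices P and a ground-truth point set Q:

```python
import numpy as np

# Minimal sketch of the squared Chamfer distance: each predicted vertex is
# pulled towards its nearest ground-truth point, and vice versa.
def chamfer_loss(P, Q):
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)   # (|P|, |Q|) pairwise distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

P = np.random.rand(100, 3)      # predicted mesh vertices
Q = np.random.rand(500, 3)      # points sampled from the ground-truth surface
loss = chamfer_loss(P, Q)
```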

2. Normal Loss

The normal loss requires that the edges between a vertex and its neighbouring vertices be perpendicular to the surface normal observed in the ground truth; optimizing this loss is equivalent to forcing the normal of the locally fitted tangent plane to be consistent with the observation.

3. Laplacian Regularization

Laplacian regularization encourages neighbouring vertices to move in the same way, which prevents vertices from moving too freely and avoids mesh self-intersection, and it maintains the relative positions of neighbouring vertices during deformation.

4. Edge Length Regularization

Edge length regularization prevents outliers (flying vertices) by constraining the edge length, so that the distance between connected vertices cannot deviate too much.

The final loss is a weighted sum of the four terms above (a combined sketch follows):

$$l_{all} = l_c + \lambda_1 l_n + \lambda_2 l_{lap} + \lambda_3 l_{edge}$$

where the $\lambda_i$ are weighting hyperparameters.
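The remaining terms can be sketched as below; chamfer_loss is the function sketched earlier, and the weights lam_* are illustrative hyperparameters rather than the paper's exact values.

```python
import numpy as np

# Sketch of the regularizers and the combined objective (illustrative weights).
def laplacian_term(V_before, V_after, neighbors):
    # Penalize changes in the Laplacian coordinate (vertex minus mean of neighbours).
    def delta(V):
        return np.array([V[p] - V[nbrs].mean(axis=0) for p, nbrs in enumerate(neighbors)])
    return ((delta(V_after) - delta(V_before)) ** 2).sum(axis=1).mean()

def edge_length_term(V, edges):
    # Discourage overly long edges / flying vertices.
    return ((V[edges[:, 0]] - V[edges[:, 1]]) ** 2).sum(axis=1).mean()

def normal_term(V, edges, gt_points, gt_normals):
    # Edges around a vertex should be perpendicular to the normal of the
    # nearest ground-truth point (their dot product is driven towards zero).
    d2 = ((V[edges[:, 0]][:, None] - gt_points[None]) ** 2).sum(-1)
    n = gt_normals[d2.argmin(axis=1)]              # nearest ground-truth normal per edge
    e = V[edges[:, 0]] - V[edges[:, 1]]
    return ((e * n).sum(axis=1) ** 2).mean()

def total_loss(V_bef, V_aft, edges, neighbors, gt_pts, gt_nrm,
               lam_n=1e-3, lam_lap=0.3, lam_edge=0.1):
    return (chamfer_loss(V_aft, gt_pts)
            + lam_n * normal_term(V_aft, edges, gt_pts, gt_nrm)
            + lam_lap * laplacian_term(V_bef, V_aft, neighbors)
            + lam_edge * edge_length_term(V_aft, edges))
```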

5. Summary

Fig.9 Results

Due to the limitations of deep neural networks, previous methods mostly produce voxel or point cloud representations, which are not easy to convert into a mesh. Pixel2Mesh uses a neural network based on a graph structure to gradually deform an ellipsoid into the correct geometric shape. This article has introduced the background of 3D mesh reconstruction, the common representations, and the Pixel2Mesh algorithm. The contributions are summarized as follows:

(1) An end-to-end neural network directly generates 3D object data represented as a mesh from a single color image.

(2) A graph convolutional neural network represents the 3D mesh information, and the ellipsoid is gradually deformed using features extracted from the input image to produce the correct geometric shape.

(3) To make the whole deformation process more stable, a coarse-to-fine strategy is adopted.

(4) Several different loss functions are designed for the generated mesh so that the whole model produces better results.

Fig.10 Aircraft Mesh effect

Fig.11 Stool Mesh effect

Future work: this algorithm is applied to the 3D reconstruction of single objects; it can be expected to extend to more general cases, such as scene-level reconstruction, and to learn multi-image, multi-view reconstruction (Pixel2Mesh++) [12].

References:

  1. Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5939-5948, 2019.

  2. Angjoo Kanazawa, Shubham Tulsiani, Alexei A. Efros, and Jitendra Malik. Learning category-specific mesh reconstruction from image collections. In ECCV, 2018.

  3. Fan, H., Su, H., Guibas, L.J.: A point set generation network for 3D object reconstruction from a single image. In CVPR, 2017.

  4. Choy, C.B., Xu, D., Gwak, J., Chen, K., Savarese, S.: 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In ECCV, 2016.

  5. Rohit Girdhar, David F. Fouhey, Mikel Rodriguez, and Abhinav Gupta. Learning a predictable and generative vector representation for objects. In ECCV, 2016.

  6. Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2016.

  7. Zheng Taixiong, Huang Shuai, Li Yongfu, Feng Mingchi. Review on key technologies of 3D reconstruction based on vision. Acta Automatica Sinica, 2020, 46(4): 631-652. doi: 10.16383/j.aas.2017.c170502

  8. Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3D reconstruction in function space. In CVPR, pages 4460-4470, 2019.

  9. Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-CNN: Octree-based convolutional neural networks for 3D shape analysis. ACM Transactions on Graphics (TOG), 36(4):72, 2017.

  10. Christian Hane, Shubham Tulsiani, and Jitendra Malik. Hierarchical surface prediction for 3D object reconstruction. In 3DV, 2017.

  11. Sunghoon Im, Hae-Gon Jeon, Stephen Lin, and In So Kweon. DPSNet: End-to-end deep plane sweep stereo. In ICLR, 2018.

  12. Chao Wen, Yinda Zhang, Zhuwen Li, and Yanwei Fu. Pixel2Mesh++: Multi-View 3D Mesh Generation via Deformation. In ICCV, 2019.

This article comes from the Information AI Image and Graphics Lab (AIIG) team.