I. Overview

Paper: https://arxiv.org/ftp/arxiv/papers/2003/2003.09644.pdf

Code: https://github.com/charlesq34/pointnet2

Paper framework

Research objective:

To enhance PointNet's ability to recognize fine-grained patterns and to generalize to complex scenes, enabling it to learn deep point set features efficiently and robustly.

Solution:

  1. Apply PointNet recursively on a nested partitioning of the input point set, forming a hierarchical neural network.
  2. During training, use random input dropout to adaptively weight patterns detected at different scales, and combine multi-scale features according to the input data.

Research Contributions:

  1. PointNet++ leverages neighborhoods at multiple scales for robustness and detail capture, and effectively learns hierarchical features with respect to the distance metric.
  2. For the non-uniform point sampling problem, two novel set abstraction layers are proposed that intelligently aggregate multi-scale information according to local point density.

Experimental data sets: MNIST, ModelNet40, SHREC15, ScanNet

II. Related Work

The idea of hierarchical feature learning has been very successful. Among all learning models, the convolutional neural network is one of the most prominent. However, convolution does not apply to unordered point sets with distance metrics, which are the focus of our work.

PointNet and Order Matters examine how deep learning can be applied to unordered sets. However, they ignore the underlying distance metric even when the point set carries one. As a result, they cannot capture the local context of a point and are sensitive to global set transformations and normalization. In this work, we target points sampled from a metric space and address these problems by explicitly considering the underlying distance metric in our design.

Points sampled from a metric space are usually noisy and non-uniform in sampling density. This hampers effective point feature extraction and makes learning difficult. One of the key issues is selecting an appropriate scale for point feature design. Several methods have previously been developed in geometric processing and in photogrammetry and remote sensing to address this problem. Unlike all of these works, our approach learns to extract point features and to balance multiple feature scales in an end-to-end fashion.

Besides point sets, there are several popular deep learning representations for 3D data in metric spaces, including volumetric grids and geometric graphs. However, none of these works explicitly take into account the problem of non-uniform sampling density.

III. Method of This Paper

3.1 Hierarchical point set feature learning

While PointNet uses a single max pooling operation to aggregate the whole point set, our new architecture builds a hierarchical grouping of points and progressively abstracts larger and larger local regions along the hierarchy. Our hierarchy is composed of a number of set abstraction levels. At each level, a set of points is processed and abstracted to produce a new set with fewer elements. A set abstraction level consists of three key layers: the sampling layer, the grouping layer and the PointNet layer. The sampling layer selects a set of points from the input points, which define the centroids of local regions. The grouping layer then constructs local region sets by finding "neighboring" points around the centroids. The PointNet layer uses a mini-PointNet to encode local region patterns into feature vectors.

A set abstraction level takes an $N \times (d + C)$ matrix as input, representing $N$ points with $d$-dimensional coordinates and $C$-dimensional point features. It outputs an $N' \times (d + C')$ matrix of $N'$ subsampled points, each with $d$-dimensional coordinates and a new $C'$-dimensional feature vector summarizing the local context.
Sampling layer Given input points $\{x_1, x_2, \ldots, x_n\}$, we use iterative farthest point sampling (FPS) to select a subset $\{x_{i_1}, x_{i_2}, \ldots, x_{i_m}\}$ such that $x_{i_j}$ is the point farthest (in metric distance) from the already-selected set $\{x_{i_1}, x_{i_2}, \ldots, x_{i_{j-1}}\}$. Compared with random sampling, this covers the whole point set better for the same number of centroids. Unlike CNNs, which scan a vector space agnostic of the data distribution, our sampling strategy generates receptive fields in a data-dependent manner.
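To make the sampling layer concrete, here is a minimal NumPy sketch of iterative farthest point sampling; the function name and the arbitrary choice of starting point are our own illustration (the released code uses an optimized CUDA kernel):

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Iterative farthest point sampling (FPS), a minimal sketch.

    points: (n, d) array of point coordinates.
    m: number of centroids to select.
    Returns indices of the m selected points.
    """
    n = points.shape[0]
    selected = np.zeros(m, dtype=np.int64)
    # Distance from each point to its nearest already-selected centroid.
    dist = np.full(n, np.inf)
    selected[0] = 0  # start from an arbitrary point (here: index 0)
    for j in range(1, m):
        # Update nearest-centroid distances with the last selected point.
        last = points[selected[j - 1]]
        dist = np.minimum(dist, np.linalg.norm(points - last, axis=1))
        # Pick the point farthest from all centroids selected so far.
        selected[j] = int(np.argmax(dist))
    return selected

# Usage: pick 128 centroids from a cloud of 1024 3D points.
cloud = np.random.rand(1024, 3)
centroid_idx = farthest_point_sampling(cloud, 128)
```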

Grouping layer The input to this layer is a point set of size $N \times (d + C)$ and a set of centroid coordinates of size $N' \times d$. The output is a group of point sets of size $N' \times K \times (d + C)$, where each group corresponds to a local region and $K$ is the number of points in the neighborhood of the centroid. $K$ varies from group to group, but the subsequent PointNet layer can convert a flexible number of points into a fixed-length local region feature vector. In a convolutional neural network, the local region of a pixel consists of the pixels whose array indices lie within a certain Manhattan distance of it (the kernel size). In a point set sampled from a metric space, the neighborhood of a point is defined by metric distance.
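As an illustration of the grouping layer, the following brute-force ball-query sketch groups points by metric distance; the function name and the cap of at most `max_k` neighbors per group are our own assumptions, not the authors' implementation:

```python
import numpy as np

def ball_query(points, centroids, radius, max_k):
    """Group points within `radius` of each centroid (ball-query sketch).

    points: (n, d) coordinates; centroids: (n_prime, d) centroid coordinates.
    Returns a list of n_prime index arrays, each of length <= max_k.
    Variable group sizes are fine: the subsequent PointNet layer maps a
    flexible number of points to a fixed-length feature vector.
    """
    groups = []
    for c in centroids:
        d = np.linalg.norm(points - c, axis=1)
        idx = np.flatnonzero(d < radius)[:max_k]  # cap at max_k neighbors
        groups.append(idx)
    return groups
```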

PointNet layer The input to this layer is $N'$ local regions of points with data size $N' \times K \times (d + C)$. Each local region in the output is abstracted by its centroid and a local feature encoding the centroid's neighborhood; the output data size is $N' \times (d + C')$. The point coordinates in each local region are first translated into a local frame relative to the centroid: $x_i^{(j)} = x_i^{(j)} - \hat{x}^{(j)}$ for $i = 1, 2, \ldots, K$ and $j = 1, 2, \ldots, d$, where $\hat{x}$ is the coordinate of the centroid. Using PointNet as the basic building block for local pattern learning, we capture point-to-point relationships within local regions through relative coordinates together with point features.
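The sketch below illustrates the centroid-relative translation and the mini-PointNet idea (a shared per-point MLP followed by symmetric max pooling) for a single local region; the weight-matrix interface is our own simplification, not the released architecture:

```python
import numpy as np

def mini_pointnet(group_xyz, centroid, weights):
    """One PointNet-layer step for a single local region (sketch).

    group_xyz: (K, d) neighbor coordinates; centroid: (d,) centroid coords.
    weights: list of (in_dim, out_dim) matrices standing in for the shared
    per-point MLP (biases and batch norm omitted for brevity).
    """
    # Translate into a local frame relative to the centroid:
    # x_i^(j) <- x_i^(j) - xhat^(j), i = 1..K, j = 1..d
    local = group_xyz - centroid
    h = local
    for w in weights:
        h = np.maximum(h @ w, 0.0)  # shared MLP applied point-wise, then ReLU
    # Symmetric max pooling aggregates the K points into one region feature.
    return h.max(axis=0)

# Usage: encode a 32-point neighborhood of a 3D centroid as a 128-dim feature.
region = np.random.rand(32, 3)
feat = mini_pointnet(region, region.mean(axis=0),
                     [np.random.randn(3, 64), np.random.randn(64, 128)])
```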

3.2 Robust feature learning under non-uniform sampling density

As mentioned earlier, it is common for a point set to have non-uniform density across different regions. This non-uniformity poses a significant challenge for point set feature learning: features learned on dense data may not generalize to sparsely sampled regions, and consequently models trained on sparse point clouds may not recognize fine-grained local structures.

Ideally, we want to inspect a point set as closely as possible to capture the finest details in densely sampled regions. Such close inspection is prohibited in low-density areas, however, because local patterns may be corrupted by undersampling; in that case we should instead look for larger-scale patterns in a greater vicinity. To achieve this goal, we propose density-adaptive PointNet layers (pictured below) that learn to combine features from regions of different scales as the input sampling density changes. We call our hierarchical network with density-adaptive PointNet layers PointNet++.



In PointNet++, each set abstraction level extracts local patterns at multiple scales and combines them intelligently according to local point density. In terms of grouping local regions and combining features from different scales, we propose the two density-adaptive layers shown above.

3.2.1 Multi-scale Grouping (MSG)

As shown in figure (a) above, a simple and effective way to capture multi-scale patterns is to apply grouping layers at different scales and then extract features for each scale with a corresponding PointNet. Features at different scales are concatenated to form a multi-scale feature. We train the network to learn an optimized strategy for combining the multi-scale features. This is achieved by randomly dropping input points with a probability assigned per instance, which we call random input dropout (DP). Specifically, for each training point set, we draw a dropout ratio $\theta$ uniformly from $[0, p]$, where $p \le 1$. We then drop each point independently with probability $\theta$. In practice we set $p = 0.95$ to avoid generating an empty set. In doing so, we present the network with training sets of varying sparsity (caused by $\theta$) and varying uniformity (caused by the randomness of the dropout). During testing, we keep all available points.
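A minimal sketch of random input dropout as described above; note that this version simply removes dropped points, and the keep-at-least-one guard is our own safeguard rather than part of the paper:

```python
import numpy as np

def random_input_dropout(points, p=0.95):
    """Training-time random input dropout (DP) sketch.

    Draw a dropout ratio theta uniformly from [0, p], then drop each point
    independently with probability theta. p < 1 avoids an empty set; we
    additionally keep one point as a guard (our own safeguard).
    """
    theta = np.random.uniform(0.0, p)
    keep = np.random.rand(len(points)) >= theta
    if not keep.any():  # guard against a degenerate empty set
        keep[np.random.randint(len(points))] = True
    return points[keep]
```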

3.2.2 Multi-resolution Grouping (MRG)

The MSG approach above is computationally expensive because it runs a local PointNet in a large-scale neighborhood for every centroid. In particular, since the number of centroids is usually quite large at the lowest level, the time cost is significant. Here we propose an alternative that avoids this expensive computation but still retains the ability to adaptively aggregate information according to the distributional properties of the points. In figure (b), the feature of a region at some level $L_i$ is the concatenation of two vectors. One vector (left in the figure) is obtained by summarizing the features of each subregion at the lower level $L_{i-1}$ using a set abstraction level. The other vector (right) is obtained by directly processing all raw points in the local region with a single PointNet. When the density of the local region is low, the first vector may be less reliable than the second, because the subregions over which it is computed contain even sparser points and suffer more from undersampling; in this case the second vector should be weighted higher. Conversely, when the density of the local region is high, the first vector provides finer detail, thanks to its ability to recursively inspect the region at higher resolution at lower levels. Compared with MSG, this method avoids large-scale neighborhood feature extraction at the lowest level and is therefore more computationally efficient.
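To illustrate the two-vector concatenation, here is a toy sketch; max pooling stands in for the set abstraction summary of the sub-regions, and all names are our own illustration rather than the paper's implementation:

```python
import numpy as np

def mrg_feature(subregion_feats, raw_region_feat):
    """MRG sketch: concatenate two views of one region.

    subregion_feats: (S, C1) features of the S sub-regions from the lower
    level, summarized here by max pooling (standing in for set abstraction).
    raw_region_feat: (C2,) feature from a single PointNet run on all raw
    points in the region. The network learns which half to trust as the
    local density changes; no explicit weighting is coded here.
    """
    summarized = subregion_feats.max(axis=0)              # left vector
    return np.concatenate([summarized, raw_region_feat])  # plus right vector
```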

3.3 Point feature propagation for set segmentation

In the set abstraction levels, the original point set is subsampled. However, in set segmentation tasks such as semantic point labeling, we want point features for all of the original points. One solution is to always sample all points as centroids in all set abstraction levels, but this incurs a high computational cost. Another is to propagate features from the subsampled points back to the original points. We adopt a hierarchical propagation strategy with distance-based interpolation and skip links across levels. In a feature propagation level, we propagate point features from $N_l \times (d + C)$ points to $N_{l-1}$ points, where $N_{l-1}$ and $N_l$ ($N_l \le N_{l-1}$) are the input and output point set sizes of set abstraction level $l$. We achieve feature propagation by interpolating the feature values $f$ of the $N_l$ points at the coordinates of the $N_{l-1}$ points. Among the many choices for interpolation, we use an inverse distance weighted average based on the $k$ nearest neighbors (by default $p = 2$, $k = 3$ in the equation below). The interpolated features on the $N_{l-1}$ points are then concatenated with the skip-linked point features from the corresponding set abstraction level. The concatenated features are passed through a "unit PointNet", which is similar to a one-by-one convolution in CNNs: a few shared fully connected and ReLU layers update each point's feature vector. The process is repeated until we have propagated features to the original point set.


$$f^{(j)}(x) = \frac{\sum_{i=1}^{k} w_i(x)\, f_i^{(j)}}{\sum_{i=1}^{k} w_i(x)}, \qquad \text{where } w_i(x) = \frac{1}{d(x, x_i)^p}, \quad j = 1, \ldots, C$$
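The interpolation step can be sketched directly from the equation above; this brute-force NumPy version (our own illustration) uses the default $k = 3$, $p = 2$:

```python
import numpy as np

def interpolate_features(xyz_coarse, feats_coarse, xyz_fine, k=3, p=2):
    """Inverse-distance-weighted feature interpolation sketch.

    Propagates features from N_l coarse points to N_{l-1} fine points:
    f(x) = sum_i w_i(x) f_i / sum_i w_i(x), with w_i(x) = 1 / d(x, x_i)^p,
    taken over the k nearest coarse neighbors of each fine point.
    """
    out = np.empty((len(xyz_fine), feats_coarse.shape[1]))
    for n, x in enumerate(xyz_fine):
        d = np.linalg.norm(xyz_coarse - x, axis=1)
        idx = np.argsort(d)[:k]                    # k nearest coarse points
        w = 1.0 / np.maximum(d[idx], 1e-10) ** p   # avoid division by zero
        out[n] = (w[:, None] * feats_coarse[idx]).sum(0) / w.sum()
    return out
```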

IV. Experiments

Data sets: MNIST, ModelNet40, SHREC15, ScanNet

4.1 Classification of point sets in Euclidean metric spaces

We evaluate our network on classifying point clouds sampled from 2D (MNIST) and 3D (ModelNet40) Euclidean spaces. MNIST images are converted into 2D point clouds of digit pixel locations, and 3D point clouds are sampled from mesh surfaces of ModelNet40 shapes. By default we use 512 points for MNIST and 1024 points for ModelNet40. In the last row of Figure 2 below (our baseline), we use face normals as additional point features, and we also use more points ($N = 5000$) to further improve performance. All point sets are normalized to zero mean and fit within a unit sphere. We use a three-level hierarchical network with three fully connected layers. For MNIST images, we first normalize all pixel intensities to $[0, 1]$ and select all pixels with intensity greater than 0.5 as valid digit pixels. The digit pixels are then converted to a 2D point cloud with coordinates in $[-1, 1]$, with the image center as the origin. Augmentation points are created to bring the point set up to a fixed cardinality (512 in our case); we generate them by jittering the initial point cloud (random offsets drawn from a Gaussian $N(0, 0.01)$ and clipped to 0.03). For ModelNet40, we uniformly sample $N$ points from CAD model surfaces according to face area. For all experiments, we train with the Adam optimizer at a learning rate of 0.001. For data augmentation, we randomly scale objects, perturb object positions, and perturb point sample positions. We also follow Volumetric and Multi-View CNNs for Object Classification on 3D Data in randomly rotating objects to augment ModelNet40. We use TensorFlow with GTX 1080 and Titan X GPUs; all layers are implemented in CUDA to run on the GPU. Training our model to convergence takes about 20 hours.
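A sketch of the jitter augmentation described above, assuming the stated $N(0, 0.01)$ noise clipped at 0.03; the released augmentation pipeline may differ in detail:

```python
import numpy as np

def jitter_points(points, sigma=0.01, clip=0.03):
    """Jitter augmentation sketch: Gaussian offsets, clipped.

    points: (n, d) point coordinates. Each coordinate receives independent
    N(0, sigma^2) noise, clipped to [-clip, clip], matching the scheme above.
    """
    noise = np.clip(sigma * np.random.randn(*points.shape), -clip, clip)
    return points + noise

# Usage: generate augmentation points from an existing MNIST point cloud.
augmented = jitter_points(np.random.rand(512, 2))
```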

Experimental results In Tables 1 and 2, we compare our approach with a representative set of prior state-of-the-art methods. Note that PointNet (vanilla) in Table 2 is basic PointNet; it is equivalent to our hierarchical network with only one level.

On MNIST, we see that our method reduces the error rates of PointNet (vanilla) and PointNet by a relative 60.8% and 34.6%, respectively. In ModelNet40 classification, we also see that, using the same input data size (1024 points) and features (coordinates only), our method is significantly stronger than PointNet. Second, we observe that point-set-based approaches can achieve performance better than or comparable to mature image CNNs: on MNIST, our method (based on a 2D point set) achieves accuracy close to the Network-in-Network CNN. On ModelNet40, our approach with normal information significantly outperforms the previous state-of-the-art method, MVCNN.

Robustness analysis of sampling density variation

Sensor data captured directly in the real world often suffers from severely irregular sampling, as shown below. Our approach selects point neighborhoods at multiple scales and learns to balance descriptiveness and robustness by weighting them appropriately.



We randomly drop points during testing (see the left of the figure below) to validate our network's robustness to non-uniform and sparse data. On the right of the figure, we can see that MSG+DP (multi-scale grouping with random input dropout during training) and MRG+DP are very robust to sampling density variation: going from 1024 to 256 test points, MSG+DP performance drops by less than 1%. Moreover, it achieves the best performance at almost all sampling densities compared with the alternatives. PointNet (vanilla) is fairly robust under density variation because it focuses on global abstraction rather than fine details; however, this loss of detail also makes it weaker than our method. SSG (an ablated PointNet++ with single-scale grouping per level) fails to generalize to sparse sampling densities, while SSG+DP compensates for this by randomly dropping points during training.

4.2 Segmentation of point sets for semantic scene annotation

To verify that our method is suitable for large-scale point cloud analysis, we also evaluate it on a semantic scene annotation task.

The goal is to predict semantic object labels for points in indoor scans. We use the fully convolutional neural network on voxelized scans from ScanNet as our baseline. It relies purely on scan geometry rather than RGB information and reports per-voxel accuracy. For a fair comparison, we remove RGB information in all our experiments and convert our point cloud label predictions to voxel labels following ScanNet. We also compare with PointNet. Our method outperforms all baselines by a large margin. Compared with ScanNet, which learns on voxelized scans, we learn directly on point clouds, avoiding additional quantization error, and perform data-dependent sampling for more effective learning. Compared with PointNet, our method introduces hierarchical feature learning and captures geometric features at different scales. This is important for understanding scenes at multiple levels and labeling objects of various sizes.

Experimental results

The blue bars in the figure below represent accuracy per voxel.

As you can see below, PointNet correctly captures the overall layout of the room but fails to discover the furniture. In contrast, our approach is much better at segmenting objects other than the room layout.

Robustness analysis of sampling density variation

To test how our trained model performs on scans with non-uniform sampling density, we synthesize virtual scans of ScanNet scenes and evaluate our network on this data. We evaluate our framework in three settings (SSG, MSG+DP, MRG+DP) and compare with the baseline, PointNet.

To generate training data from the ScanNet scenes, we sample 1.5 m × 1.5 m × 3 m cubes from the initial scenes, keeping a cube only if ≥ 2% of its voxels are occupied and ≥ 70% of its surface voxels have valid annotations. We sample such training cubes on the fly and rotate them randomly along the up-axis. Augmentation points are added to the point set to form a fixed cardinality (8192 in our case). During testing, we similarly split a test scene into smaller cubes, first obtain label predictions for every point in each cube, and then merge the predictions from all cubes in the same scene. If a point receives different labels from different cubes, we obtain the final point label prediction by a simple majority vote, as sketched below.
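The per-point merging step can be sketched as a simple majority vote over cube predictions; the data layout (a dict of label lists per point) is our own illustration:

```python
from collections import Counter

def merge_cube_predictions(per_cube_preds):
    """Majority-vote merge of per-cube label predictions (sketch).

    per_cube_preds: dict mapping point id -> list of labels, one label per
    cube that contained the point. Returns the final label for each point.
    """
    return {pid: Counter(labels).most_common(1)[0][0]
            for pid, labels in per_cube_preds.items()}

# Usage: a point seen in three cubes with labels [2, 2, 5] gets label 2.
final = merge_cube_predictions({0: [2, 2, 5], 1: [7]})
```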

The performance comparison is shown by the yellow bars in the experimental results figure above. We see that SSG performance degrades considerably due to the sampling density shift from uniform point clouds to virtual scan scenes. The MRG network, by contrast, is more robust to the density shift because it can automatically switch to coarser-granularity features when sampling is sparse. Even though there is a domain gap between the training data (uniform point clouds with random dropout) and the unevenly dense scan data, our MSG network is only slightly affected and achieves the best accuracy among the compared methods. All of this demonstrates the effectiveness of our density-adaptive layer design.

4.3 Classification of point sets in non-Euclidean metric spaces

In this experiment, we show an extension of our method to non-Euclidean spaces, addressing the problem of recognizing the same object in different poses.



In non-rigid shape classification, a good classifier should correctly classify (a) and (c) in the figure into the same category even though their poses differ, which requires knowledge of the intrinsic structure. Shapes in SHREC15 are 2D surfaces embedded in 3D space, and geodesic distances along the surfaces naturally induce a metric space. Experiments show that adopting PointNet++ in this metric space is an effective way to capture the intrinsic structure of the underlying point sets.

For each shape in the SHREC15 track, we first construct the metric space induced by pairwise geodesic distances, using the method of Interior Distances Using Barycentric Coordinates to obtain an embedding metric that mimics geodesic distance. Next, we extract intrinsic point features in this metric space, including WKS, HKS and multi-scale Gaussian curvature. We use these features as input and then sample and group points according to the underlying metric space. In this way, our network learns to capture multi-scale intrinsic structure unaffected by the specific pose of a shape. An alternative design choice is to use the $XYZ$ coordinates as point features, or the Euclidean space $\Bbb R^3$ as the underlying metric space; experiments show that these are not the best options.

Experiment details: We randomly sample 1024 points on each shape for training and testing. To generate the intrinsic input features, we extract 100-dimensional WKS, HKS and multi-scale Gaussian curvature, giving each point a 300-dimensional feature vector, and then reduce the feature dimension to 64 by principal component analysis. Following Interior Distances Using Barycentric Coordinates, we construct an 8-dimensional embedding that mimics geodesic distance; it describes our non-Euclidean metric space and is used when selecting point neighborhoods.

Experimental results: DeepGM extracts geodesic moments as shape features and uses a stacked sparse autoencoder to process these features and predict shape categories. Our approach, using the non-Euclidean metric space and intrinsic features, achieves the best performance in all settings and significantly outperforms DeepGM.

4.4 Feature Visualization



We visualize what has been learned by the first-level kernels of our hierarchical network. A voxel grid is created in space, and the local point sets that activate certain neurons the most in grid cells are aggregated (using up to 100 examples). Grid cells with high votes are kept and converted back into 3D point clouds, which represent the patterns recognized by the neurons. Since the model is trained on ModelNet40, which consists mostly of furniture, we can see planes, double planes, lines, corners and other structures in the visualization.

V. Conclusion

In this work, we propose PointNet++, a powerful neural network architecture for processing point sets sampled in a metric space. PointNet++ recursively operates on nested partitionings of the input point set and effectively learns hierarchical features with respect to the distance metric. For the non-uniform point sampling problem, we propose two novel set abstraction layers that intelligently aggregate multi-scale information according to local point density. These contributions enable us to achieve state-of-the-art performance on challenging 3D point cloud benchmarks. How to accelerate inference by sharing more computation within each local region, especially in the MSG and MRG layers, is a problem worth exploring in the future. It is also interesting that in high-dimensional metric spaces, where CNN-based methods are computationally infeasible, our method can scale well.