1. Overview

Paper: https://arxiv.org/pdf/1612.00593.pdf

Code: https://github.com/charlesq34/pointnet

Paper framework: Research objective: use a deep learning network to directly process 3D point cloud data.

Research idea: first guarantee the invariance of the point cloud to certain spatial transformations with alignment networks, then concatenate global and local features so that both kinds of information can be exploited together.

Solution: propose PointNet, a deep learning model that directly processes 3D geometric data (such as point clouds or meshes).

Research Contributions:

  1. Designed a novel deep network architecture that directly takes an unordered 3D point set as input.
  2. Addressed the invariance of point clouds to certain spatial transformations.
  3. Showed how to train such a network to perform 3D shape classification, shape part segmentation, and scene semantic parsing tasks.

Experimental datasets: ModelNet40, ShapeNet, Stanford 3D Semantic Parsing dataset

2. About point clouds

Concept of point cloud: a point cloud is a massive set of points expressing the spatial distribution and surface characteristics of a target in the same spatial reference frame. It contains rich information, including 3D coordinates (X, Y, Z), color, intensity, time, and so on. A point cloud is essentially a long list of points (an N×3 matrix, where N is the number of points).

Point clouds contain the following information:

  1. Point clouds obtained by the laser measurement principle include 3D coordinates (XYZ) and laser reflection intensity. The intensity is related to the target's surface material, roughness, and the incident angle, as well as the instrument's emission energy and laser wavelength.
  2. Point clouds obtained by the photogrammetry principle include 3D coordinates (XYZ) and color information (RGB).
  3. Point clouds obtained by combining laser measurement and photogrammetry include 3D coordinates (XYZ), laser reflection intensity, and color information (RGB).

Point cloud processing:

Point cloud data is a subset of points in Euclidean space, with the following three characteristics:

2.1. Unorderedness

Unlike pixel arrays in images or voxel arrays in volumetric grids, a point cloud is a set of points without a specific order. Point cloud data is a set and is insensitive to the order of its elements, which means a model consuming point clouds needs to be invariant to the different permutations of the input. Methods currently used in the literature include:

  1. Reorder unordered data.
  2. Data enhancement with all permutations of data and then using the RNN model.
  3. Symmetry functions are used to ensure permutation invariance.

Because the third method is concise and easy to implement in a model, the authors choose it: the symmetric function max pooling is used to extract features from the point cloud data.
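To see why max pooling qualifies as a symmetric function, consider this minimal Python (PyTorch) sketch; it is our own illustration, not code from the paper's repository:

```python
import torch

points = torch.randn(1024, 64)            # 1024 points with 64-dim features
shuffled = points[torch.randperm(1024)]   # same points, different order

# Max over the point dimension produces one 64-dim feature vector;
# it is identical no matter how the points are permuted.
assert torch.equal(points.max(dim=0).values, shuffled.max(dim=0).values)
```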

2.2. Spatial relations between points

An object usually consists of a certain number of points in a specific space; that is, there are spatial relationships among these points. The model therefore needs to be able to capture local structures from nearby points as well as the combinatorial interactions among local structures. To make effective use of this spatial relationship, the authors propose concatenating local features with global features to aggregate information.

2.3. Rigid transformation invariance

The object represented by point cloud data should be invariant to certain spatial transformations, such as rotation and translation. The authors propose aligning the point cloud data before feature extraction to ensure this invariance. The alignment is achieved by training a small network to predict a transformation matrix and multiplying it with the input point cloud.

3. Current research

Most existing point cloud features are handcrafted for specific tasks. Point features typically encode statistical properties of certain points and are designed to be invariant to certain transformations; they are usually classified as intrinsic or extrinsic, and can also be divided into local features and global features. For a given task, finding the optimal feature combination is not trivial.

Depending on the representation of the 3D data, there are several deep learning approaches. Volumetric CNNs: represent objects as voxels in space and apply 3D convolutions analogous to the 2D case. However, due to data sparsity and the computational cost of 3D convolution, the voxel representation is constrained by its time and space complexity; it struggles with very large point clouds and is no longer a mainstream method.

Multiview CNNs: render the 3D object into 2D images from multiple viewpoints. This approach applies conventional CNNs to the 2D images, and the per-view features are aggregated by a view-pooling procedure to describe the 3D object.

Spectral CNNs: apply spectral CNNs on meshes. However, these methods are currently limited to manifold meshes such as organic objects, and it is not obvious how to extend them to non-isometric shapes such as furniture.

Feature-based DNNs: first convert the 3D data into a vector by extracting traditional shape features, then use a fully connected network to classify the shape. We think these methods are limited by the representational power of the extracted features.

From the data-structure point of view, a point cloud is an unordered set of vectors. While most deep learning research has focused on regular inputs such as sequences (speech and language processing), images, and volumes (video or 3D data), not much work has been done on deep learning over point sets. A recent study by Oriol Vinyals et al. explores this question: they use a read-process-write network with an attention mechanism to consume unordered input sets and show that their network can sort numbers. However, since their work focuses on generic sets and NLP applications, the role of geometry in the sets is missing.

4. Method of this paper

A point cloud is a set of 3D points $\{P_i \mid i = 1, \dots, n\}$, where each point $P_i$ is a vector of its $(x, y, z)$ coordinates plus extra feature channels (such as color, normals, etc.). For simplicity and clarity, unless otherwise stated we only use the $(x, y, z)$ coordinates as a point's channels.

Our network architecture is motivated by the properties of point sets in $\Bbb R^n$.

4.1. Properties of Point Sets in $\Bbb R^n$

Our input is a subset of points from Euclidean space. It has three main properties:

  1. Unorderedness
  2. Spatial relationships between points
  3. Rigid transformation invariance

(These properties were explained in detail in “2. About point clouds”.)

4.2. PointNet Architecture

Original description:

The classification network (the blue part in the figure) takes $n$ points as input, applies the input and feature transforms, and then aggregates the point features by max pooling. The output is classification scores for $k$ classes. The segmentation network (the light yellow part in the figure) is an extension of the classification network: it concatenates global and local features and outputs per-point scores. “MLP” stands for multi-layer perceptron, and the numbers in parentheses are layer sizes. Batchnorm is used for all layers with ReLU. Dropout is used for the last MLP in the classification network.

The main flow of the network is (a code sketch follows the list):

  1. The input is an n×3 2D tensor, where n is the number of points and 3 corresponds to the XYZ coordinates.
  2. The input data is first multiplied by an input transform learned by a T-Net, ensuring the model's invariance to certain spatial transformations.
  3. After per-point features are extracted by several MLPs, a feature transform from a second T-Net is used to align the features; a max pool over each feature dimension then yields the final global feature.
  4. For the classification task, the global feature is passed through an MLP to predict the final classification scores.
  5. For the segmentation task, the global feature is concatenated with the per-point local features learned earlier, and per-point classification results are obtained through an MLP.
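A minimal PyTorch sketch of this flow is given below. It is a simplified reading of the architecture, with both T-Nets omitted for brevity; the layer sizes follow the figure, while the class and variable names are our own:

```python
import torch
import torch.nn as nn

class PointNetSketch(nn.Module):
    """Simplified PointNet classification branch (T-Nets omitted)."""
    def __init__(self, num_classes=40):
        super().__init__()
        # Shared per-point MLPs, implemented as 1x1 convolutions.
        self.mlp1 = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 64, 1), nn.BatchNorm1d(64), nn.ReLU())
        self.mlp2 = nn.Sequential(
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU())
        self.classifier = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, num_classes))

    def forward(self, xyz):                  # xyz: (B, N, 3)
        x = xyz.transpose(1, 2)              # (B, 3, N)
        local_feat = self.mlp1(x)            # (B, 64, N) per-point features
        x = self.mlp2(local_feat)            # (B, 1024, N)
        global_feat = x.max(dim=2).values    # (B, 1024) symmetric max pool
        return self.classifier(global_feat), local_feat
```

The segmentation head (step 5) would concatenate `local_feat` with `global_feat` broadcast over the points; see 4.2.2 below.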

Our network has three key modules:

  • a max pooling layer as a symmetric function that aggregates information from all the points;
  • a structure that combines local and global information;
  • two joint alignment networks that align both the input points and the point features.

We discuss the reasons behind these design choices in separate subsections below:

4.2.1 Symmetry functions for Unordered Input

This part sits in the classification network (the blue part in the figure) and forms its backbone. (Personal understanding.) As stated earlier, the methods currently used in the literature for unordered input include:

  1. Reorder unordered data.
  2. Data enhancement with all permutations of data and then using the RNN model.
  3. Symmetry functions are used to ensure permutation invariance.

The paper points out that the reordering method is not stable to point perturbations in high-dimensional space. The RNN idea treats the point set as a sequential signal and hopes that training the RNN with randomly permuted sequences will make it invariant to input order; however, OrderMatters has shown that order does matter and cannot be totally omitted. And while RNNs are relatively robust to input ordering for sequences of small length (dozens), it is hard to scale them to thousands of input elements, which is the common size of point sets.

To make the result unaffected by the input order, the authors propose to approximate a general function defined on a point set by applying a symmetric function to the elements of the set:


$$f(\{x_1, \dots, x_n\}) = g(h(x_1), \dots, h(x_n))$$

where

$$f : 2^{\Bbb R^N} \to \Bbb R, \qquad h : \Bbb R^N \to \Bbb R^K, \qquad g : \underbrace{\Bbb R^K \times \cdots \times \Bbb R^K}_{n} \to \Bbb R$$

$g$ is a symmetric function, fitted as the composition of a single-variable function and a max pooling function. $h$ is fitted with a multi-layer perceptron (MLP), which maps each point into a high-dimensional space.
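As a concrete reading of this decomposition, here is a minimal PyTorch sketch; the layer sizes and names are ours, chosen for illustration only:

```python
import torch
import torch.nn as nn

K = 1024                                   # output dimension of h
h = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, K))
gamma = nn.Sequential(nn.Linear(K, 256), nn.ReLU(), nn.Linear(256, 40))

def f(point_set):                          # point_set: (n, 3), any order
    lifted = h(point_set)                  # h(x_1), ..., h(x_n): (n, K)
    pooled = lifted.max(dim=0).values      # MAX, a symmetric function: (K,)
    return gamma(pooled)                   # g = gamma ∘ MAX
```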

4.2.2 Local and Global Information Aggregation

This section concerns the segmentation network (the light yellow part in the figure).

The previous stage (the classification network) outputs a vector $[f_1, \dots, f_K]$, the global feature of the input point cloud. After computing the global point cloud feature vector, we feed it back to the per-point features by concatenating the global feature with each point feature. New per-point features are then extracted from the combined features, so that each new per-point feature is aware of both local and global information.
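In tensor terms, the concatenation amounts to broadcasting the global feature over the points. A hedged sketch (shapes follow the architecture figure; names are ours):

```python
import torch

B, N = 8, 1024
point_feats = torch.randn(B, N, 64)    # per-point (local) features
global_feat = torch.randn(B, 1024)     # max-pooled global feature

# Broadcast the global feature to every point and concatenate,
# giving (B, N, 64 + 1024) = (B, N, 1088) as in the architecture figure.
combined = torch.cat(
    [point_feats, global_feat.unsqueeze(1).expand(-1, N, -1)], dim=-1)
```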

With this modification, our network is able to predict per-point quantities that rely on both local geometry and global semantics. For example, we can accurately predict per-point normals, which verifies that the network is able to summarize information from a point's local neighborhood. Experiments show that the model achieves state-of-the-art results on both shape part segmentation and scene segmentation.

PointNet normal reconstruction result:

4.2.3 Joint Alignment Network

If the point cloud undergoes certain geometric transformations (such as a rigid transformation), its semantic labels must remain invariant. We therefore expect the representation learned from the point set to be invariant to these transformations. A natural solution is to align all input sets to a canonical space before feature extraction. Spatial Transformer Networks (Max Jaderberg et al.) align 2D images by sampling and interpolation, implemented with a specially tailored layer on the GPU. Instead, we predict an affine transformation matrix with a mini-network (T-Net) and directly apply this transformation to the coordinates of the input points. The T-Net resembles the large network and is composed of three basic modules: point-independent feature extraction, max pooling, and fully connected layers. The first T-Net takes the raw point cloud as input and regresses a 3×3 matrix. It consists of a shared MLP(64, 128, 1024) applied to each point (layer output sizes 64, 128, 1024), a max pool across the points, and two fully connected layers of output sizes 512 and 256. The output matrix is initialized as the identity matrix. All layers, except the last one, include ReLU and batch normalization. The second T-Net has the same structure as the first, except that it outputs a 64×64 matrix.
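The description above translates into a small module roughly like the following PyTorch sketch. The naming is ours and details such as normalization placement in the FC layers are simplified assumptions:

```python
import torch
import torch.nn as nn

class TNet(nn.Module):
    """Sketch of the input-transform T-Net: shared MLP(64, 128, 1024)
    per point, max pool, FC(512, 256), then a k*k matrix whose
    prediction starts at the identity."""
    def __init__(self, k=3):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(
            nn.Conv1d(k, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU())
        self.fc = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU())
        self.out = nn.Linear(256, k * k)
        nn.init.zeros_(self.out.weight)         # start as the identity
        with torch.no_grad():
            self.out.bias.copy_(torch.eye(k).flatten())

    def forward(self, points):                  # points: (B, N, k)
        feat = self.mlp(points.transpose(1, 2)).max(dim=2).values
        transform = self.out(self.fc(feat)).view(-1, self.k, self.k)
        return torch.bmm(points, transform)     # aligned points, (B, N, k)
```

The second T-Net would be `TNet(k=64)`, applied to the 64-dim per-point features instead of the raw coordinates.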

This idea can be further extended to the alignment of feature space. We can insert another alignment network on the point features and predict a feature transformation matrix to align features from different input point clouds. However, the transformation matrix in feature space has a much higher dimension than the spatial transformation matrix, which greatly increases the difficulty of optimization. We therefore add a regularization term to our softmax training loss, constraining the feature transformation matrix to be close to an orthogonal matrix:


$$L_{reg} = \|I - AA^T\|_F^2$$

where $A$ is the feature alignment matrix predicted by the T-Net. We find that adding the regularization term makes the optimization more stable and lets our model achieve better performance.
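A direct transcription of this loss, assuming a batch of predicted 64×64 matrices (our sketch):

```python
import torch

def feature_transform_regularizer(A):
    """L_reg = ||I - A A^T||_F^2, averaged over the batch.
    A: (B, 64, 64) predicted feature alignment matrices."""
    I = torch.eye(A.size(1), device=A.device).expand_as(A)
    diff = I - torch.bmm(A, A.transpose(1, 2))
    return (diff ** 2).sum(dim=(1, 2)).mean()
```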

4.3 Theoretical Analysis

4.3.1 Function fitting ability of the network

For a network processing point clouds, small perturbations of the input point set should not drastically change the output values, such as classification or segmentation scores; this is formalized through the continuity of the set function.

Let $\chi = \{S : S \subseteq [0, 1]^m \text{ and } |S| = n\}$, and let $f : \chi \to \Bbb R$ be a set function on $\chi$ that is continuous with respect to the Hausdorff distance $d_H(\cdot, \cdot)$; that is, $\forall \epsilon > 0, \exists \delta > 0$ such that for any $S, S' \in \chi$, if $d_H(S, S') < \delta$ then $|f(S) - f(S')| < \epsilon$. If there are enough neurons in the max pooling layer, i.e., the $K$ in the formula of 4.2.1 is large enough, PointNet can approximate any such function.

The authors then state two theorems. Theorem 1: suppose $f : \chi \to \Bbb R$ is a set function continuous with respect to the Hausdorff distance $d_H(\cdot, \cdot)$. Then $\forall \epsilon > 0$, there exist a continuous function $h$ and a symmetric function $g(x_1, \dots, x_n) = \gamma \circ \mathrm{MAX}$ such that for any $S \in \chi$ (note: $f \circ g = f(g)$):


$$\left|\, f(S) - \gamma\!\left(\underset{i=1,\dots,n}{\mathrm{MAX}}\{h(x_i)\}\right) \right| < \epsilon$$

where $x_1, \dots, x_n$ is the full list of elements in $S$ in arbitrary order, $\gamma$ is a continuous function, and $\mathrm{MAX}$ is the vector max operator that takes $n$ vectors as input and returns a new vector of the element-wise maxima. This theorem mainly says that the expressive power of PointNet is governed by the dimension of the max pooling layer, i.e., the $K$ in the formula of 4.2.1: the larger $K$ is, the stronger the network's expressive power.

4.3.2 Network stability

The following theorem tells us that small data corruptions or extra noise points in the input set are unlikely to change the output of our network. Theorem 2: suppose $u : \chi \to \Bbb R^K$ such that $u = \underset{i=1,\dots,n}{\mathrm{MAX}}\{h(x_i)\}$ and $f = \gamma \circ u$. Then:

(a) $\forall S$, $\exists\, C_S, N_S \subseteq \chi$ such that $f(T) = f(S)$ if $C_S \subseteq T \subseteq N_S$;

(b) $|C_S| \leq K$.

(a) says that for any input set $S$, there exist a minimal set $C_S$ and a maximal set $N_S$ such that any set $T$ between $C_S$ and $N_S$ yields the same network output as $S$. (b) says that the size of the minimal set $C_S$ is bounded above by $K$, the dimension of the max pooling output. In other words, PointNet learns to summarize a shape by a sparse set of key points, based on which it can determine the object category. This ability makes PointNet robust to noise and data loss.

5. Experiments

The experiment is divided into four parts.

  1. Demonstrate that PointNet can be applied to multiple 3D recognition tasks.
  2. Provide detailed experiments to validate the network design.
  3. Visualize what the network learns.
  4. Analyze time and space complexity.

5.1. Applications

5.1.1 3D object classification. Experimental design: we use the ModelNet40 shape classification benchmark, which contains 12,311 CAD models from 40 man-made object categories, split into 9,843 for training and 2,468 for testing. 1,024 points are uniformly sampled from the mesh surfaces according to face area and normalized into a unit sphere. During training, the point cloud is augmented on the fly by randomly rotating the object along the up-axis (the coordinate axis pointing upward) and jittering the position of each point with Gaussian noise of mean 0 and standard deviation 0.02 (a sketch of this augmentation follows).
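A NumPy sketch of this augmentation, assuming the up-axis is y (treat the axis choice and the function name as our assumptions):

```python
import numpy as np

def augment(points, sigma=0.02):
    """Random rotation about the up (y) axis plus Gaussian jitter
    with mean 0 and std 0.02, as described above. points: (N, 3)."""
    theta = np.random.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot_y = np.array([[c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])              # rotation about the y axis
    jitter = np.random.normal(0.0, sigma, points.shape)
    return points @ rot_y.T + jitter
```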

Experimental results: although PointNet is stronger than most prior networks, it still falls slightly short of the multi-view based method (MVCNN), which the authors attribute to the fine geometric details that rendered images can capture but sampled points lose.

5.1.2 3D object part segmentation. Part segmentation is a challenging fine-grained 3D recognition task: given a 3D scan or mesh model, the task is to assign a part category label (for example, chair leg, cup handle) to each point or face.

Experimental design: we evaluate on the ShapeNet part dataset, which contains 16,881 shapes from 16 categories, annotated with 50 parts in total. Most object categories are labeled with two to five parts. We formulate part segmentation as a per-point classification problem. The evaluation metric is point-based mIoU (mean Intersection over Union). For each shape S of category C, the shape's mIoU is computed as follows: for each part type of category C, compute the IoU (Intersection over Union) between the ground truth and the prediction; if the union of ground-truth and predicted points is empty, count the part's IoU as 1. We then average the IoUs over all part types of category C to get the mIoU of that shape. To compute a category's mIoU, we take the average of the mIoUs of all shapes in that category (a sketch of the per-shape computation follows).

We compare our segmentation version of PointNet (the segmentation network in the architecture diagram) with two traditional methods, "Interactive Shape Co-segmentation via Label Propagation" and "A Scalable Active Framework for Region Annotation in 3D Shape Collections" (both of which exploit point-wise geometric features and correspondences between shapes), as well as our own 3D CNN baseline.
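A per-shape mIoU computation following the convention above (our sketch; `part_ids` holds the part labels of the shape's category):

```python
import numpy as np

def shape_miou(pred, gt, part_ids):
    """Per-shape mIoU: average the IoU over all part labels of the
    shape's category; an empty union counts as IoU = 1."""
    ious = []
    for p in part_ids:                      # all part labels of category C
        inter = np.sum((pred == p) & (gt == p))
        union = np.sum((pred == p) | (gt == p))
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))
```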

Experimental results:

Compared with Yi et al.'s approach, PointNet improves the mean mIoU by 2.3% and outperforms the 3D CNN baseline in most categories.

Supplementary experiment:

Experiments were also conducted on simulated Kinect scans to test the robustness of these methods. For each CAD model in the ShapeNet part dataset, we use the Blensor Kinect Simulator to generate an incomplete point cloud from six random viewpoints. We train PointNet on the complete shapes and on the partial scans with identical network architecture and training settings. The results are shown below:

The mean mIoU drops by only 5.3%. As the figure shows, some of the partial data are challenging, but the predictions are still reasonable.

5.1.3 Semantic segmentation in scenes. Our part segmentation network can easily be extended to scene semantic segmentation, where the point labels are semantic object classes rather than object part labels. We experiment on the Stanford 3D Semantic Parsing dataset, which contains Matterport-scanner 3D scans of six areas comprising 271 rooms. Each point in the scans is annotated with a semantic label from 13 categories (chair, table, floor, wall, and so on, plus clutter). To prepare the training data, we first split the points by room and then partition each room into blocks of 1m × 1m area. We train our segmentation version of PointNet to predict the class of every point in each block. Each point is represented by a 9-dimensional vector of XYZ, RGB, and the position normalized (from 0 to 1) relative to the room (a sketch of this feature construction follows). At training time, we randomly sample 4,096 points in each block on the fly; at test time, we test on all the points. We follow the same protocol as "3D Semantic Parsing of Large-Scale Indoor Spaces" and use a k-fold strategy for training and testing. We compare our method with a baseline using handcrafted point features. The baseline extracts the same 9-dimensional local features plus three additional ones (local point density, local curvature, and normal), and uses a standard MLP as the classifier.
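A sketch of the per-point input construction described above; the function and argument names are ours:

```python
import numpy as np

def block_points(xyz, rgb, room_min, room_size, num_samples=4096):
    """Build the 9-dim per-point input: XYZ, RGB, and the point
    position normalized to [0, 1] within the room."""
    normalized = (xyz - room_min) / room_size               # room-relative
    feats = np.concatenate([xyz, rgb, normalized], axis=1)  # (N, 9)
    idx = np.random.choice(len(feats), num_samples, replace=True)
    return feats[idx]                                       # sampled points
```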

Experimental results: as the quantitative results in the table show, the PointNet approach significantly outperforms the baseline. The figure above shows qualitative segmentation results: our network outputs smooth predictions and is robust to missing points and occlusion.

5.2. Analysis of architecture design

This part of the experiments validates our design choices through ablation studies and also shows the influence of the network's hyperparameters.

5.2.1 Comparison with other order-invariant methods

The ablation experiments still use the ModelNet40 shape classification problem as the test bed for comparing these schemes. The baselines include multi-layer perceptrons applied to unsorted and sorted points of $n \times 3$ arrays, an RNN model that treats the input points as a sequence, and models based on symmetric functions.

The symmetric operations in our experiments include max pooling, average pooling, and an attention-based weighted sum. The attention approach is similar to that in Order Matters: a scalar score is predicted from each point feature, then the scores are normalized across the points by computing a softmax [1][2], and finally the weighted sum of the normalized scores and the point features is computed (a sketch follows the links below). As shown in the figure, the max pooling operation achieves the best performance by a large margin, which validates our choice.

  • [1] Preliminary study of Softmax.
  • [2] Why SoftMax?
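For comparison with max pooling, a hedged sketch of the attention-based aggregation; the scalar score predictor is hypothetical, and Order Matters' exact design may differ:

```python
import torch
import torch.nn as nn

score_net = nn.Linear(1024, 1)   # hypothetical per-point score predictor

def attention_pool(point_feats):                 # point_feats: (n, 1024)
    """Predict a score per point, softmax-normalize across points,
    then take the weighted sum. Like max pooling, the result is
    invariant to the order of the points."""
    weights = torch.softmax(score_net(point_feats).squeeze(-1), dim=0)
    return (weights.unsqueeze(-1) * point_feats).sum(dim=0)   # (1024,)
```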

5.2.2 Effectiveness of the input and feature transformations

The table above shows the positive effects of the input and feature transformations (for alignment). The most basic architecture already achieves quite reasonable results; using the input transformation gives a 0.8% performance improvement. The regularization loss is necessary for the higher-dimensional feature transform to work. Combining both transformations with the regularization term yields the best performance.

5.2.3 Robustness test

We use the same architecture as the max pooling network in 5.2.1, with the input points normalized into a unit sphere; the results are shown in the figure above. For missing points, when 50% of the points are missing, the accuracy drops by only 2.4% with furthest input sampling and 3.8% with random input sampling. For outliers, we evaluated two models: one trained on points with $(x, y, z)$ coordinates only, the other trained on $(x, y, z)$ plus point density. Even when 20% of the points are outliers, the network still achieves more than 80% accuracy.

5.3. Visualizing PointNet

While the critical points jointly determine the global shape feature of a given shape, any point cloud that falls between the critical point set and the upper-bound shape gives exactly the same feature. All figures are color-coded to show depth information. The figure above visualizes, for some sample shapes $S$, the critical point set $C_S$ and the upper-bound shape $N_S$; any point set between the two gives exactly the same global shape feature $f(S)$.

The critical point set $C_S$ sketches the skeleton of the shape, while the upper-bound shape $N_S$ indicates the largest possible point cloud that gives the same global shape feature $f(S)$ as the input $S$. This means that losing some non-critical points does not change the global feature at all, which reflects the robustness of PointNet.

5.4. Time and space complexity analysis

The table above summarizes the space (number of parameters in the network) and time (floating-point operations per sample) complexity of PointNet.

We also compared PointNet to a representative set of volume-based and multi-view architectures from previous work.

While MVCNN and Subvolume (a 3D CNN) achieve high performance, PointNet is far more efficient in computational cost (measured in FLOPs/sample: 141× and 8× more efficient, respectively). PointNet is also much more space-efficient than MVCNN in terms of #params (a 17× reduction in parameters).

Moreover, PointNet is much more scalable: its space and time complexity is $O(N)$, linear in the number of input points. In contrast, since convolution dominates the computation time, the time complexity of multi-view methods grows quadratically with image resolution, and that of volumetric-convolution-based methods grows cubically with the volume resolution.

As a rule of thumb, with TensorFlow on a 1080X GPU, PointNet can process more than one million points per second for point cloud classification (around 1K objects/second) or semantic segmentation (around 2 rooms/second), showing great potential for real-time applications.

6. Conclusion

In this work, we propose PointNet, a novel deep neural network that directly consumes point clouds. Our network provides a unified approach to a number of 3D recognition tasks, including object classification, part segmentation, and semantic segmentation, while obtaining results on standard benchmarks that are on par with or better than the state of the art. We also provide theoretical analysis and visualizations to help understand the network.