Title
Graph-Based Object Classification for Neuromorphic Vision Sensing
Summary
NVS (neuromorphic vision sensing) can significantly improve sampling rate, energy efficiency, and robustness to lighting changes. However, NVS does not represent a scene as frames, so it cannot directly take advantage of state-of-the-art convolutional neural networks (CNNs). The authors propose a compact graph representation for NVS events and couple it with a residual graph network for classification. This residual graph CNN preserves the spatio-temporal coherence of spike events while requiring less computation and storage. At present, nearly all image recognition datasets are derived from CMOS active pixel sensors (APS). APS sensing is cumbersome for machine learning systems because of its limited frame rate, high inter-frame redundancy, slow and blurry shutter adjustment under changing lighting, and high power requirements.
In recent years, NVS-based cameras have emerged: hardware that outputs asynchronous streams of events (also known as "spikes") reflecting changes in scene reflectance. This representation significantly reduces memory usage, power consumption, and temporal redundancy while providing low latency and very high dynamic range, making it attractive for image recognition. However, due to the scarcity of NVS datasets and of algorithms tailored to event data, NVS-based recognition has so far lagged behind APS-based recognition. Previous approaches either artificially group events into frames or derive complex feature descriptors that do not necessarily provide good features for hard tasks and are sensitive to noise. This paper therefore proposes representing the NVS event stream as a graph and training a residual graph convolutional neural network on it, achieving state-of-the-art accuracy with only about 1/5 of the convolution operations of ResNet50. The graph convolution used in this paper is spatial graph convolution (the alternative being spectral convolution). In addition, most currently available NVS datasets are generated by simulators or recorded by replaying APS datasets on a display screen in front of an NVS device.
However, datasets obtained this way cannot capture real scene reflectance changes, so the authors also created and released an NVS recording dataset of ASL handshapes covering 24 letters (A-Y, excluding J).
System Overview
- The classification architecture for NVS data works as follows: first, events are subsampled over a spatio-temporal grid to keep computation and memory manageable; then the sampled events are assembled into a radius-neighborhood graph, which is processed by the proposed residual graph CNNs to perform object classification. (Pipeline figure omitted: sampling, graph construction, residual graph CNN, classification.)
The NVS event stream can be expressed as $\{e_i\}_{i=1}^{N} = \{(x_i, y_i, t_i, p_i)\}_{i=1}^{N}$, where $(x_i, y_i)$ is the spatial address of the event on the $H \times W$ sensor grid, $t_i$ its timestamp, $p_i \in \{+1, -1\}$ its polarity, and $N$ the number of events. To reduce the storage and computation cost, uniform grid sampling is used to reduce the number of events, yielding a subset $\{e_i\}_{i=1}^{M} \subseteq \{e_i\}_{i=1}^{N}$ with $M \leq N$. The sampled events define a directed graph $G = \{\mathcal{V}, \mathcal{E}, U\}$, where $\mathcal{V}$ is the vertex set, $\mathcal{E}$ the edge set, and $U$ contains pseudo-coordinates that locally define the spatial relations between connected nodes (connectivity is defined by a radius-neighborhood strategy). Two nodes $i, j$ are connected only if their weighted spatio-temporal Euclidean distance $d_{i,j}$ is at most the radius $R$:

$$d_{i,j} = \sqrt{\alpha\left(|x_i - x_j|^2 + |y_i - y_j|^2\right) + \beta\,|t_i - t_j|^2} \leq R$$

where $\alpha$ and $\beta$ are weight parameters compensating for the difference between spatial and temporal grid resolution. The pseudo-coordinates of connected nodes $i, j$ are $u(i, j) = \left[\,|x_i - x_j|,\; |y_i - y_j|\,\right]$.
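As a concrete illustration (not the authors' code), here is a minimal numpy sketch of the radius-neighborhood construction; the event-array layout and the brute-force distance computation are assumptions made for brevity:

```python
import numpy as np

def build_event_graph(events, alpha=1.0, beta=0.5e-5, radius=3.0):
    """Build a radius-neighborhood graph from sampled events.

    events: (M, 4) array with columns (x, y, t, p).
    Returns the edge list and the pseudo-coordinates u(i, j) for every
    connected pair, following d_ij = sqrt(alpha*(dx^2 + dy^2) + beta*dt^2) <= R.
    """
    xy, t = events[:, :2], events[:, 2]
    # Pairwise weighted spatio-temporal distances. O(M^2) is fine for the
    # small sampled event counts; a k-d tree would scale better.
    dxy2 = ((xy[:, None, :] - xy[None, :, :]) ** 2).sum(-1)
    dt2 = (t[:, None] - t[None, :]) ** 2
    d = np.sqrt(alpha * dxy2 + beta * dt2)
    src, dst = np.nonzero((d <= radius) & ~np.eye(len(events), dtype=bool))
    # Pseudo-coordinates: absolute spatial offsets between edge endpoints.
    u = np.abs(xy[src] - xy[dst])
    return np.stack([src, dst]), u
```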
- Graph convolutions come in two kinds. Spectral convolution requires the same graph structure for every input and processes the whole graph at once, so it is unsuitable when every sample is a different (and possibly large) graph. Spatial convolution is analogous to conventional convolution in CNNs: it aggregates a new feature vector for each vertex from its neighborhood, weighted by the convolution kernel. Spatial convolution is therefore the right choice for NVS data. The convolution at node $i$ is defined as:

$$(f \otimes g)(i) = \frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} f(j) \cdot g(u(i, j))$$
where $f(j)$ is the feature of node $j$, $\mathcal{N}(i)$ is the set of neighbors of node $i$, and $g(u(i, j))$ is the kernel weight evaluated at the pseudo-coordinates $u(i, j)$.
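A toy sketch of this operation for scalar node features, with the kernel `g` left as an arbitrary callable (the paper's actual kernel is the B-spline form given next):

```python
import numpy as np

def spatial_graph_conv(feat, edges, u, g):
    """Toy spatial graph convolution: the new feature at node i is the
    mean over j in N(i) of f(j) * g(u(i, j)).

    feat:  (M,) float feature per node
    edges: (2, E) array of (i, j) index pairs
    u:     (E, 2) pseudo-coordinates per edge
    g:     callable mapping a pseudo-coordinate to a kernel weight
    """
    out = np.zeros_like(feat)
    deg = np.zeros(len(feat))
    for i, j, uij in zip(edges[0], edges[1], u):
        out[i] += feat[j] * g(uij)
        deg[i] += 1
    return out / np.maximum(deg, 1)  # average over each neighborhood
```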
$\mathbf{g} = \left(g_1, \dots, g_l, \dots, g_{M_{\text{in}}}\right)$ denotes the multiple kernels of one layer, and $f_l$ the $l$-th dimension of a node's feature vector. The generalized convolution is then:

$$(\mathbf{f} \otimes \mathbf{g})(i) = \frac{1}{|\mathcal{N}(i)|} \sum_{l=1}^{M_{\text{in}}} \sum_{j \in \mathcal{N}(i)} f_l(j) \cdot g_l(u(i, j))$$

Each kernel is parameterized with B-spline basis functions, $g_l(\mathbf{u}) = \sum_{p \in \mathcal{P}} w_{p,l} \cdot \prod_{n=1}^{d} N_{n, p_n}(u_n)$, so the full convolution producing output channel $l'$ is:

$$f_{l'}(i) = \xi\left(\frac{1}{|\mathcal{N}(i)|} \sum_{l=1}^{M_{\text{in}}} \sum_{j \in \mathcal{N}(i)} f_l(j) \cdot \sum_{p \in \mathcal{P}} w_{p,l} \prod_{n=1}^{d} N_{n, p_n}(u_n) + b_{l'}\right)$$

where $l' = 1, \dots, M_{\text{out}}$ indexes the output feature maps and $\xi$ is the activation. To accelerate convergence, batch normalization is applied:

$$f'_{l'} = \frac{f_{l'} - \mathrm{E}\left[f_{l'}\right]}{\sqrt{\operatorname{Var}\left[f_{l'}\right] + \epsilon}} \cdot \gamma + \beta$$
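This B-spline kernel is the one introduced by SplineCNN (Fey et al.), available in PyTorch Geometric as `SplineConv`; a hedged usage sketch, with illustrative layer sizes and pseudo-coordinates assumed to be rescaled to [0, 1]:

```python
import torch
from torch_geometric.nn import SplineConv

# One layer: 1 input channel (event polarity) to 32 output channels,
# 2-D pseudo-coordinates, 8 B-spline control points per dimension.
conv = SplineConv(in_channels=1, out_channels=32, dim=2, kernel_size=8)

x = torch.rand(500, 1)                         # node features
edge_index = torch.randint(0, 500, (2, 2000))  # radius-graph edges
edge_attr = torch.rand(2000, 2)                # pseudo-coords in [0, 1]
out = torch.relu(conv(x, edge_index, edge_attr))  # shape (500, 32)
```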
- Pooling layer: first the graph is partitioned into clusters on a fixed grid over the node coordinates; then all nodes within a cluster are aggregated into one new node, whose coordinates and features are recomputed (a sketch follows). For node coordinates $\left(x'_i, y'_i\right) \in \mathbb{R}^{H' \times W'}$ and cluster size $s_h \times s_w$, the pooled graph has resolution $\left\lceil \frac{H'}{s_h} \right\rceil \times \left\lceil \frac{W'}{s_w} \right\rceil$. Each cluster is pooled into a single node: the pooled node's coordinates are the mean of the coordinates of the nodes in the cluster, and its feature vector is their mean (average pooling) or maximum (max pooling). If any nodes in two different clusters are connected, the two pooled nodes are assumed to be connected by an edge.
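A minimal sketch of such grid-cluster pooling, assuming mean coordinates and max features (the cell size is illustrative):

```python
import numpy as np

def grid_pool(coords, feats, cell=(4, 4)):
    """Pool graph nodes on a fixed spatial grid.

    coords: (M, 2) node coordinates; feats: (M, C) node features.
    Nodes falling in the same grid cell form one cluster; the pooled
    node takes the mean coordinate and the max feature of its cluster.
    """
    cells = (coords // np.array(cell)).astype(int)
    _, cluster = np.unique(cells, axis=0, return_inverse=True)
    n = cluster.max() + 1
    new_coords = np.zeros((n, 2))
    new_feats = np.full((n, feats.shape[1]), -np.inf)
    counts = np.zeros(n)
    for i, c in enumerate(cluster):
        new_coords[c] += coords[i]
        new_feats[c] = np.maximum(new_feats[c], feats[i])
        counts[c] += 1
    return new_coords / counts[:, None], new_feats, cluster
```

Edges of the pooled graph can then be obtained by relabeling the original edge list through `cluster` and dropping duplicates.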
- Residual graph CNN: the residual shortcut is itself a graph convolution whose kernel size is 1 in every dimension, so that its output matches the dimensionality of the block's output feature map. The shortcut features are added element-wise to the block's node features, followed by a ReLU activation. This mirrors ResNet and is not expanded further here.
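A hedged PyTorch Geometric sketch of one such block; the channel sizes are illustrative, and the pointwise linear shortcut stands in for the paper's kernel-size-1 graph convolution to keep the sketch simple:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import SplineConv

class ResidualGraphBlock(nn.Module):
    """One residual block: graph conv + batch norm, plus a shortcut
    whose output dimension matches the block's output feature map."""
    def __init__(self, c_in, c_out, dim=2, kernel_size=8):
        super().__init__()
        self.conv = SplineConv(c_in, c_out, dim=dim, kernel_size=kernel_size)
        self.bn = nn.BatchNorm1d(c_out)
        # Stand-in for the paper's kernel-size-1 graph-conv shortcut.
        self.shortcut = nn.Linear(c_in, c_out)

    def forward(self, x, edge_index, edge_attr):
        out = self.bn(self.conv(x, edge_index, edge_attr))
        return torch.relu(out + self.shortcut(x))  # add, then ReLU
```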
- Datasets: most existing NVS datasets are recorded by replaying APS datasets on a screen or are generated by simulation software, losing part of the scene's reflectance-change information and failing to reflect real-world conditions. The authors therefore created a large dataset, ASL-DVS: 24 classes of ASL letter handshapes (A-Y, excluding J), recorded in an indoor environment, with 4,200 samples per letter, each about 100 ms long.
Evaluation
- Comparison on NVS datasets: six datasets are used, including N-MNIST, MNIST-DVS, N-Caltech101, CIFAR10-DVS, N-CARS, and ASL-DVS. For each sample, a 30 ms stream of events is extracted; the convolution kernel size $k$ is set to 8, the radius $R$ to 3, and the weight parameters $\alpha$ and $\beta$ to 1 and $0.5 \times 10^{-5}$, respectively. For N-MNIST and MNIST-DVS, whose samples are small, two residual blocks are used; the remaining datasets use one residual block. Since spatial resolution differs across datasets, different pooling cluster sizes are used. G-CNNs (without the residual term) and RG-CNNs (with it) are both evaluated. Dropout is set to 0.5, and data augmentation is applied, including flipping and rotation. The Adam optimizer is used with batch size 64 (see the sketch below). Against other event-based algorithms (HOTS, H-First, SNN, and HATS) and other graph convolution operators (GCN, ChebConv, MoNet, and GIN), the proposed RG-CNNs achieve the best results.
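The reported optimizer settings, expressed as a minimal PyTorch sketch with a stand-in model and random data (the learning rate is an assumption; the summary does not state one for Adam):

```python
import torch

# Stand-in model; in practice this is the G-CNN / RG-CNN being trained.
model = torch.nn.Sequential(torch.nn.Dropout(p=0.5), torch.nn.Linear(64, 24))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr assumed
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(
        torch.rand(256, 64), torch.randint(0, 24, (256,))),
    batch_size=64, shuffle=True)  # batch size 64 as reported

for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
```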
- Comparison with conventional deep CNNs: VGG-19, Inception-V4, and ResNet-50. To build inputs in the format these CNNs expect, the NVS data is split into 30 ms segments and aggregated into 2-channel frames at the sensor's native resolution, where the two channels encode the counts of positive and negative events at each location (a sketch follows). These frames are augmented with scaling, random cropping, flipping, and normalization. Training uses stochastic gradient descent with momentum 0.9, L2 regularization weight $0.1 \times 10^{-4}$, and learning rate $10^{-3}$. Despite heavy augmentation and L2 regularization, the frame-based CNNs still underperform models trained directly on NVS data, because events converted into frames carry less information; on almost all datasets, the proposed method's accuracy exceeds that of traditional frame-based CNNs.
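A minimal sketch of this event-to-frame conversion (the event-array layout is an assumption):

```python
import numpy as np

def events_to_frame(events, height, width):
    """Aggregate a 30 ms event slice into a 2-channel count frame:
    channel 0 counts positive events, channel 1 negative events."""
    frame = np.zeros((2, height, width), dtype=np.float32)
    for x, y, t, p in events:
        ch = 0 if p > 0 else 1
        frame[ch, int(y), int(x)] += 1
    return frame
```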
- Comparing FLOP counts and time complexity: for a conventional CNN layer, the number of floating-point operations is

$$\mathrm{FLOPs} = 2 H W\left(C_{\text{in}} K^{2}+1\right) C_{\text{out}}$$
where $H$, $W$, and $C_{\text{in}}$ are the height, width, and number of channels of the input feature map, $K$ is the convolution kernel size, and $C_{\text{out}}$ is the number of output channels.
For graph convolution, the number of floating-point operations is

$$\mathrm{FLOPs} = N_{\text{edge}}(m+1)^{d}\left(3 C_{\text{in}} C_{\text{out}} + 7 d\right) + \left(N_{\text{edge}} + N_{\text{node}}\right) C_{\text{out}}$$

which contains three parts:
- the B-spline bases: $N_{\text{edge}}(m+1)^{d}$ threads each performing $7d$ floating-point operations (4 additions and 3 multiplications), where $N_{\text{edge}}$ is the number of edges, $m$ the B-spline degree, and $d$ the dimension of the pseudo-coordinates;
- $3 N_{\text{edge}} C_{\text{in}} C_{\text{out}}(m+1)^{d}$: the factor 3 comes from 1 addition and 2 multiplications per kernel term, where $C_{\text{in}}$ and $C_{\text{out}}$ are the numbers of input and output channels;
- $\left(N_{\text{edge}} + N_{\text{node}}\right) C_{\text{out}}$: where $N_{\text{node}}$ is the number of nodes in the graph.
G-CNNs and RG-CNNs have fewer weights and require less computation than conventional deep CNNs, mainly because the graph representation is compact, reducing the amount of data to be processed.
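Plugging representative numbers into the two formulas makes the comparison concrete; the layer and graph sizes below are illustrative assumptions, not figures from the paper:

```python
def cnn_flops(H, W, C_in, K, C_out):
    # FLOPs = 2 * H * W * (C_in * K^2 + 1) * C_out
    return 2 * H * W * (C_in * K**2 + 1) * C_out

def graph_conv_flops(N_edge, N_node, m, d, C_in, C_out):
    # FLOPs = N_edge (m+1)^d (3 C_in C_out + 7d) + (N_edge + N_node) C_out
    return (N_edge * (m + 1)**d * (3 * C_in * C_out + 7 * d)
            + (N_edge + N_node) * C_out)

# A 64->64 frame conv on a 56x56 map vs. a graph conv over a sparse
# event graph of ~512 nodes and ~2000 edges (degree-1 splines, d=2).
print(cnn_flops(56, 56, 64, 3, 64))                 # ~2.32e8
print(graph_conv_flops(2000, 512, 1, 2, 64, 64))    # ~9.86e7
```

Because the event graph touches only locations where events actually fired, $N_{\text{edge}}$ stays far below the $HW$ positions a dense convolution must visit.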
Conclusion
- A graph-based representation of neuromorphic spike events is proposed, enabling fast end-to-end training and inference.
- Residual graph CNNs (RG-CNNs) are introduced for NVS-based classification. Results show that they require less computation and memory than conventional CNNs while outperforming existing methods across multiple datasets.
- The largest NVS dataset recorded under real-world conditions (ASL-DVS) was created and open-sourced to the research community.