Foreword

Recently, the top computer vision conference ECCV 2020 officially announced its paper acceptance results. This article introduces a paper from the iQiyi team. The researchers propose the Boundary Content Graph Neural Network (BC-GNN), which uses a graph neural network to model the relationship between boundary prediction and content prediction, generating more accurate temporal boundaries and more reliable content confidence scores.

The temporal action proposal generation task requires accurately locating clips containing high-quality action content in untrimmed long videos, and it plays an important role in video understanding. Most existing methods first generate start and end boundaries, then combine them into candidate proposals, and finally evaluate the content confidence of each candidate temporal segment. This pipeline ignores the relationship between boundary prediction and content prediction.

To solve this problem, iQiyi proposes the Boundary Content Graph Neural Network (BC-GNN), which models the relationship between boundary prediction and content prediction through a graph neural network, exploiting their internal relationship to generate more accurate temporal boundaries and reliable content confidence scores.

In BC-GNN, the content of a candidate temporal segment is treated as an edge of the graph, and the boundaries (start point and end point) of the candidate segment are treated as nodes of the graph. A reasoning method is then designed to update the features of edges and nodes. The updated features are used to predict the boundary probabilities and the content confidence, and finally to generate high-quality proposals. The method achieves leading performance on both the ActivityNet-1.3 and THUMOS14 public datasets for temporal action proposal generation and temporal action detection.

Paper link:

Arxiv.org/abs/2008.01…

Research method

[Figure: overall framework of BC-GNN]

The figure above shows the overall framework of BC-GNN, which consists of five stages:

1) Feature extraction (Feature Encoding)

2) Base Module

3) Graph Construction Module (GCM)

4) Graph Reasoning Module (GRM)

5) Output Module

Feature extraction module

The researchers encode videos into features using a two-stream network, which has achieved good results in video action recognition. The two-stream network consists of two branches, spatial and temporal. The input of the spatial branch is a single RGB image, used to extract spatial features, while the input of the temporal branch is a stack of optical flow images, used to extract motion features. An untrimmed long video is divided into T snippets, each of which is encoded into a D-dimensional feature vector by the two-stream network. This D-dimensional vector is the concatenation of the last-layer outputs of the spatial and temporal branches, so the video is encoded into a T×D feature matrix, where T is the length of the feature sequence and D is the feature dimension. BC-GNN then consists of four modules: the base module, the graph construction module, the graph reasoning module, and the output module.
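As a rough sketch of this encoding step (the dimensions and the `spatial_branch` / `temporal_branch` placeholders below are illustrative assumptions, not the paper's actual networks):

```python
import numpy as np

T = 100                  # number of snippets (assumed)
D_S, D_T = 1024, 1024    # per-branch feature dims (assumed)

def spatial_branch(rgb_frame):
    """Placeholder for the spatial stream's last-layer output."""
    return np.random.rand(D_S)

def temporal_branch(flow_stack):
    """Placeholder for the temporal stream's last-layer output."""
    return np.random.rand(D_T)

# each snippet becomes one D-dimensional vector: the concatenation
# of the two branches' last-layer outputs
features = np.stack([
    np.concatenate([spatial_branch(None), temporal_branch(None)])
    for _ in range(T)
])
print(features.shape)    # (T, D) with D = D_S + D_T = 2048
```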

Base module

The base module consists of two 1D convolutional layers; it mainly enlarges the receptive field and serves as the foundation of the whole network.
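A minimal PyTorch sketch of such a base module (channel counts, kernel size, and activation are assumptions; the paper's hyperparameters may differ):

```python
import torch
import torch.nn as nn

class BaseModule(nn.Module):
    """Two stacked 1D convolutions over the snippet feature sequence."""
    def __init__(self, in_dim=2048, hidden_dim=256):
        super().__init__()
        # kernel_size > 1 is what enlarges the temporal receptive field
        self.conv1 = nn.Conv1d(in_dim, hidden_dim, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1)

    def forward(self, x):                  # x: (batch, D, T)
        return torch.relu(self.conv2(torch.relu(self.conv1(x))))

out = BaseModule()(torch.randn(1, 2048, 100))   # -> (1, 256, 100)
```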

Graph construction module

The graph construction module builds a boundary content graph, as shown in the figure above. The boundary content graph is a bipartite graph: a special kind of graph whose vertices are divided into two disjoint sets U and V, such that every edge connects a vertex in U to a vertex in V. In the construction process, the time corresponding to each snippet of the video can be regarded as a candidate start point or end point of a proposal, which yields a start-point set N_s and an end-point set N_e; N_s and N_e serve as the two disjoint vertex sets of the boundary content graph. Let t_(s,i) and t_(e,j) denote the times corresponding to any start point n_(s,i) in N_s and any end point n_(e,j) in N_e, where i, j = 1, 2, … When t_(e,j) > t_(s,i), there is an edge connecting n_(s,i) and n_(e,j), denoted d_(i,j).

An undirected graph, shown in (a), is obtained when the edges connecting start points and end points have no direction. Since a start point represents the start time of a proposal and an end point represents its end time, the edges connecting them should be directed: the information carried from start point to end point differs from that carried from end point to start point. The researchers therefore transform the undirected graph in (a) into the directed graph in (b) by splitting each undirected edge into two directed edges that connect the same pair of nodes in opposite directions.
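A minimal sketch of this construction (representing nodes by snippet indices and edges by index pairs is an illustrative choice):

```python
T = 100  # number of snippets; each snippet time is a candidate boundary

starts = range(T)   # start-point set N_s
ends = range(T)     # end-point set N_e

# bipartite edges d_(i,j) exist only when t_(e,j) > t_(s,i)
undirected = [(i, j) for i in starts for j in ends if j > i]

# figure (b): split each undirected edge into two directed edges
# that connect the same pair of nodes in opposite directions
directed = [("s->e", i, j) for i, j in undirected] + \
           [("e->s", i, j) for i, j in undirected]
```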

Before the graph reasoning operation, the researchers assign a feature to every node and edge of the constructed boundary content graph. To obtain these features, three parallel 1D convolutions are attached after the base module, producing three feature matrices: the start feature matrix F_s, the end feature matrix F_e, and the content feature matrix F_c. These three matrices share the same temporal and feature dimensions, with size T×D. For any start node n_(s,i) with corresponding time t_(s,i), the node's feature is row i−1 (zero-based) of F_s. Similarly, the feature of any end node n_(e,j) is row j−1 of F_e. If an edge connects n_(s,i) and n_(e,j), the feature of edge d_(i,j) is obtained as follows (a code sketch follows the list):

1) First, the rows of the content feature matrix F_c from row i−1 to row j−1 are linearly interpolated along the temporal dimension to obtain a feature matrix of fixed size N×D (N is a manually chosen constant);

2) The N×D matrix is then reshaped into an (N·D)×1 vector;

3) A fully connected layer is applied to the (N·D)×1 vector, yielding a feature vector of dimension D; this is the feature of edge d_(i,j).

In the directed graph, before node and edge features are updated, the two opposite-direction edges connecting the same pair of nodes share the same feature vector.
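A minimal PyTorch sketch of this edge-feature extraction (the values of `N` and `D` and the helper name `edge_feature` are illustrative; `F.interpolate` stands in for the paper's linear interpolation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, N = 256, 16                      # feature dim and sampling length (assumed)
fc = nn.Linear(N * D, D)            # maps the flattened content segment to D dims

def edge_feature(F_c, i, j):
    """Feature of edge d_(i,j); i, j are zero-based row indices into F_c (T, D)."""
    segment = F_c[i:j + 1].t().unsqueeze(0)          # (1, D, j-i+1)
    resized = F.interpolate(segment, size=N, mode="linear",
                            align_corners=True)      # (1, D, N)
    flat = resized.squeeze(0).t().reshape(-1)        # (N*D,)
    return fc(flat)                                  # (D,)

F_c = torch.randn(100, D)           # toy content feature matrix (T=100)
e_ij = edge_feature(F_c, 10, 40)    # feature for edge d_(10,40)
```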

Graph reasoning module

To enable information exchange between nodes and edges, a new graph reasoning method is proposed, which consists of two steps: edge feature updating and node feature updating. The edge feature updating step aims to aggregate the attributes of the two nodes connected by an edge; the update process is as follows:
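The equation image from the original article did not survive; a plausible reconstruction consistent with the description below (an assumption, not necessarily the paper's exact formula) updates each directed edge by fusing it with its head node's feature:

$$\tilde{e}^{\,s\to e}_{(i,j)} = \sigma\big((n_{s,i} \times \theta_{s2e}) * e^{\,s\to e}_{(i,j)}\big), \qquad \tilde{e}^{\,e\to s}_{(i,j)} = \sigma\big((n_{e,j} \times \theta_{e2s}) * e^{\,e\to s}_{(i,j)}\big)$$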

where σ denotes the ReLU activation function, θ_s2e and θ_e2s are different trainable parameters, × denotes matrix multiplication, and ∗ denotes element-wise multiplication.

The node feature updating step aims to aggregate the attributes of an edge and its adjacent nodes; the update process is as follows:
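Again, the original equation image is missing; a plausible reconstruction from the description below (an assumption) first normalizes each edge feature over all K edges sharing its head node, then aggregates the weighted head node features at each tail node:

$$\bar{e}_{(h,t)} = \frac{\tilde{e}_{(h,t)}}{\sum_{k=1}^{K} \tilde{e}_{(h,k)}}, \qquad n'_{t} = \sigma\Big(\Big(\sum_{h} \bar{e}_{(h,t)} * n_{h}\Big) \times \theta_{node}\Big)$$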

where e_(h,t) denotes the feature of the edge pointing from head node h to tail node t, and K denotes the total number of edges with h as their head node. To keep the numerical scale of the output features from growing, the researchers normalize the edge features before updating the node features, and then use the updated edge features as the weights of the corresponding head node features. σ denotes the ReLU activation function, and θ_node denotes trainable parameters.
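Putting the two steps together, a minimal PyTorch sketch built on the assumed forms above (so the same caveats apply; all names here are illustrative):

```python
import torch

D = 256
theta_s2e = torch.randn(D, D, requires_grad=True)    # edge params (start -> end)
theta_e2s = torch.randn(D, D, requires_grad=True)    # edge params (end -> start)
theta_node = torch.randn(D, D, requires_grad=True)   # node-update params

def update_edges(n_head, e, theta):
    """Edge update: fuse each directed edge with its head node's feature."""
    return torch.relu((n_head @ theta) * e)           # (E, D)

def update_nodes(n_head, e_upd, head_idx, tail_idx, num_tail):
    """Node update: normalized edge features weight the head node features."""
    # normalize each edge over all edges sharing the same head node
    sums = torch.zeros(n_head.size(0), D).index_add_(0, head_idx, e_upd)
    e_norm = e_upd / (sums[head_idx] + 1e-6)
    # aggregate weighted head-node features at each tail node
    msgs = e_norm * n_head[head_idx]
    agg = torch.zeros(num_tail, D).index_add_(0, tail_idx, msgs)
    return torch.relu(agg @ theta_node)

# toy graph: 4 start nodes, 4 end nodes, edges wherever j > i
n_s, n_e = torch.randn(4, D), torch.randn(4, D)
i_idx, j_idx = zip(*[(i, j) for i in range(4) for j in range(4) if j > i])
i_idx, j_idx = torch.tensor(i_idx), torch.tensor(j_idx)
e = torch.randn(len(i_idx), D)                        # shared initial edge features

e_s2e = update_edges(n_s[i_idx], e, theta_s2e)        # start -> end direction
e_e2s = update_edges(n_e[j_idx], e, theta_e2s)        # end -> start direction
new_n_e = update_nodes(n_s, e_s2e, i_idx, j_idx, 4)   # end nodes gather s->e edges
new_n_s = update_nodes(n_e, e_e2s, j_idx, i_idx, 4)   # start nodes gather e->s edges
```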

Output module

The output module is shown in the overall framework diagram of BC-GNN. A candidate proposal is generated from a pair of nodes and the edges connecting them, and the confidences of its start point, end point, and content are computed from the updated node features and edge features, respectively. The specific process is as follows:
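The corresponding equations are also missing from the extracted article; a plausible form (an assumption, with f_s, f_e, f_c standing for fully connected layers and σ_sig for the sigmoid function) scores each candidate proposal (t_(s,i), t_(e,j)) as:

$$p_{s,i} = \sigma_{sig}\big(f_s(n'_{s,i})\big), \qquad p_{e,j} = \sigma_{sig}\big(f_e(n'_{e,j})\big), \qquad p_{c,(i,j)} = \sigma_{sig}\big(f_c\big([\tilde{e}^{\,s\to e}_{(i,j)}; \tilde{e}^{\,e\to s}_{(i,j)}]\big)\big)$$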

Experiments

Temporal action proposal generation experiments

As the two tables above show, the researchers achieve leading results on both widely used datasets.

Temporal action detection experiments

The temporal action detection results are obtained by classifying the proposals. As the results above show, the method proposed by the researchers achieves leading performance on both datasets.

Compared with a conventional GCN algorithm, BC-GNN transforms the undirected graph into a directed graph and adds an edge feature updating step. To verify the effectiveness of these two strategies, ablation experiments are performed on the temporal action proposal generation task of the ActivityNet-1.3 dataset. As the table and result curves below show, both strategies improve the results.

Compared with the currently common algorithms that split boundary prediction and content prediction into two separate steps, the method proposed in this paper uses a graph neural network to model the relationship between boundary prediction and content prediction, linking the two processes: high-quality action content helps refine boundaries, and accurate boundary locations help content confidence prediction. In addition, a new graph reasoning method is proposed that incorporates boundary information and content information to update the corresponding node and edge features. This idea of jointly modeling two related steps can be applied to other similar tasks. At present, most effective methods for temporal action detection, including this paper, first extract action proposals and then classify them; this two-stage approach increases the complexity and computation of the whole pipeline, and more designs and explorations of this problem can be expected in the future.

You may also like:

ICCV 2019 paper | iQIYI: using unlabeled data to optimize a face recognition model

ICME 2019 paper analysis | iQIYI's pioneering real-time adaptive bitrate algorithm evaluation system
