HOI Task: PPDM paper reading [Intensive Reading]

Abstract

The paper proposes a single-stage HOI (human-object interaction) detection method that achieves state-of-the-art results, and it is the first HOI detection method to run in real time. Traditional HOI detection methods consist of two stages, and their effectiveness and efficiency are limited by this sequential and independent architecture. The paper proposes PPDM (Parallel Point Detection and Matching), a new HOI detection framework. In PPDM, an HOI is defined as a point triplet <human point, interaction point, object point>, where the human point and the object point are the centers of the corresponding detection boxes and the interaction point is the midpoint between the human point and the object point.

PPDM consists of two parallel branches: a point detection branch and a point matching branch. The point detection branch predicts the three points. The point matching branch predicts the displacements from the interaction point to the corresponding human point and object point. A human point and an object point that originate from the same interaction point are considered matched.
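
As a minimal sketch of the point triplet and of the displacements that the matching branch regresses, assuming boxes are given as (x1, y1, x2, y2) tuples (the function names are hypothetical and not taken from the paper):

```python
def box_center(box):
    """Return the center (x, y) of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def point_triplet(human_box, object_box):
    """Build <human point, interaction point, object point> from two boxes."""
    human_pt = box_center(human_box)
    object_pt = box_center(object_box)
    # The interaction point is the midpoint between the two centers.
    interaction_pt = ((human_pt[0] + object_pt[0]) / 2.0,
                      (human_pt[1] + object_pt[1]) / 2.0)
    return human_pt, interaction_pt, object_pt


def displacement(src, dst):
    """Displacement vector that moves point src onto point dst."""
    return (dst[0] - src[0], dst[1] - src[1])


human_pt, interaction_pt, object_pt = point_triplet((10, 20, 50, 100), (60, 40, 100, 80))
to_human = displacement(interaction_pt, human_pt)    # target for the human displacement
to_object = displacement(interaction_pt, object_pt)  # target for the object displacement
```

Because the interaction point is the midpoint, the two ground-truth displacements are equal in length and opposite in direction.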

In the author's parallel framework, the interaction points implicitly provide context and regularization for human and object detection. HOI detection accuracy improves because isolated detection boxes, which are unlikely to form meaningful HOI triplets, are suppressed. Moreover, matching between human and object boxes is performed only around a limited number of filtered candidate interaction points, which saves a lot of computation. In addition, a new dataset, HOI-A, is established.

1. Introduction

Traditional HOI approaches consist of two stages. The first stage is human-object proposal detection, which produces a large number (M×N) of candidate human-object pairs. The second stage predicts the interaction for each human-object proposal pair. The effectiveness and efficiency of this two-stage approach are limited by its sequential and independent design. Proposal generation is based entirely on object detection confidence, and each human or object proposal is generated independently, without considering whether two proposals are likely to combine into a meaningful HOI triplet in the second stage. As a result, the generated human-object proposals may be of low quality, and the second stage has to scan all of the human-object proposal pairs linearly, which is expensive. The author therefore argues that a non-sequential, highly coupled framework is needed.

The first branch of PPDM performs point detection: it estimates the three center points (interaction, human and object points), the corresponding sizes (human and object) and two local offsets (human and object points). The interaction point can be seen as providing contextual information for the detection of people and objects; that is, estimating interaction points implicitly enhances human and object detection. (Personal understanding: estimating the interaction point requires a larger receptive field, because information about both the person and the object is needed, and this larger receptive field can also benefit the detection of the person and the object.) The second branch performs point matching: it estimates the displacements from the interaction point to the human point and the object point.

The author makes three contributions: (1) The HOI detection task is formulated as a point detection and matching problem, and the single-stage PPDM is proposed. (2) PPDM is the first HOI detection method to run in real time while achieving state-of-the-art performance on the HICO-DET and HOI-A benchmarks. (3) A new dataset, HOI-A, is introduced.

2. Related Work

Omitted…

3. Parallel Point Detection and Matching

3.1 Overview

Figure 3. The author first applies a key-point heatmap prediction network, such as Hourglass-104 or DLA-34, to extract features. a) Point detection branch: based on the extracted visual features, three convolution modules predict the heatmaps of the interaction points, human center points and object center points. In addition, the 2-D sizes and local offsets of the human and the object are regressed to generate the final boxes. b) Point matching branch: the first step of this branch regresses the displacements from the interaction point to the human center point and to the object center point. Based on the predicted points and displacements, the second step matches each interaction point with a human center point and an object center point to generate a set of triplets.

3.2 Point Detection

As in Figure 3, the input image is $I$, and the feature extractor produces the feature $V$. The human center point is written $(x^h, y^h)$, and its corresponding size is $(w^h, h^h)$. The local offset $\delta c^h$ compensates for the discretization error caused by the output stride. The low-resolution point corresponding to the GT human center point (used to generate the heatmap) is $(\overline{x}^h, \overline{y}^h) = (\frac{x^h}{d}, \frac{y^h}{d})$, where $d$ is the downsampling factor.
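
The relationship between the low-resolution point and the local offset can be illustrated with a short sketch. It assumes that the low-resolution coordinate is rounded down to an integer grid cell, so that the local offset is exactly the fractional part that is lost; the function names are hypothetical.

```python
import math


def to_low_res(point, stride):
    """Map an input-image point to output-map coordinates.

    Returns the integer grid cell and the fractional local offset that
    is lost by the discretization.
    """
    fx, fy = point[0] / stride, point[1] / stride
    gx, gy = math.floor(fx), math.floor(fy)  # assumption: round down to the grid
    return (gx, gy), (fx - gx, fy - gy)


def to_input_res(grid_point, local_offset, stride):
    """Invert to_low_res: add the offset back, then scale by the stride."""
    return ((grid_point[0] + local_offset[0]) * stride,
            (grid_point[1] + local_offset[1]) * stride)


grid_point, local_offset = to_low_res((103, 58), stride=4)
restored = to_input_res(grid_point, local_offset, stride=4)
```

With a stride of 4, the point (103, 58) falls in grid cell (25, 14) with local offset (0.75, 0.5); adding the offset back before scaling recovers the original coordinates.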

Point location loss. Detecting points directly is difficult, so the author follows the key-point estimation approach and maps the points to Gaussian-kernel heatmaps, which turns point detection into a heatmap estimation task. The three GT low-resolution points are mapped to three Gaussian heatmaps: the human point heatmap $\overline{C}^h$, the object point heatmap $\overline{C}^o$ and the interaction point heatmap $\overline{C}^a$, where $\overline{C}^o$ and $\overline{C}^a$ are multi-channel. Three convolution modules are added on top of the feature map $V$ to generate the three heatmaps. The loss function is:

Size and offset loss. Four convolution modules are added on top of the feature map $V$ to generate the 2-D sizes and local offsets of the human and the object respectively. The local offset loss is denoted $L_{off}$.
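
To make the heatmap targets used by the point location loss more concrete, the following sketch renders one ground-truth heatmap channel from a list of low-resolution center points. It is only an illustration: the fixed `sigma` and the function name are assumptions, and the paper may choose the kernel width differently.

```python
import math


def gaussian_heatmap(height, width, centers, sigma):
    """Render one heatmap channel with a Gaussian peak at each center.

    centers holds integer (x, y) grid points; sigma is a fixed standard
    deviation (an assumption of this sketch).
    """
    heatmap = [[0.0] * width for _ in range(height)]
    for cx, cy in centers:
        for y in range(height):
            for x in range(width):
                value = math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
                # Where kernels overlap, keep the larger response.
                heatmap[y][x] = max(heatmap[y][x], value)
    return heatmap


human_heatmap = gaussian_heatmap(height=8, width=8, centers=[(2, 3), (6, 5)], sigma=1.0)
```

Each center gets a peak value of 1.0 that decays smoothly with distance, so a prediction that is slightly off the exact grid cell is penalized less than one that is far away.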

3.3 Point Matching

The offset branch consists of two convolution modules.

Displacement loss:

Whether a human center point or object center point is matched to an interaction point depends on two things: the interaction point plus the predicted displacement must be close to the detected human/object center point, and that detected point must have a high confidence score.

! [] (gitee.com/weifagan/My… matching.png)
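
The two matching criteria described in this section can be sketched as follows. The way the criteria are combined here (distance divided by confidence) and all names are assumptions made for illustration, not a statement of the exact rule used in the paper.

```python
import math


def match_point(interaction_pt, displacement, candidates):
    """Pick the detected point that best agrees with an interaction point.

    candidates is a list of (x, y, confidence) detections. The cost used
    here, distance divided by confidence, is an assumption of this sketch.
    """
    ref_x = interaction_pt[0] + displacement[0]
    ref_y = interaction_pt[1] + displacement[1]

    def cost(candidate):
        x, y, confidence = candidate
        distance = math.hypot(x - ref_x, y - ref_y)
        return distance / max(confidence, 1e-6)  # guard against zero confidence

    return min(candidates, key=cost)


human_candidates = [(31, 60, 0.9), (29, 61, 0.2), (80, 60, 0.95)]
object_candidates = [(79, 59, 0.8), (30, 60, 0.9)]
matched_human = match_point((55, 60), (-25, 0), human_candidates)
matched_object = match_point((55, 60), (25, 0), object_candidates)
```

In this example the human candidate at (29, 61) is about as close to the reference point as the one at (31, 60), but its low confidence makes it lose the match.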

3.4 Loss and Inference

The final loss is:

! [] (gitee.com/weifagan/My… loss final.png)

In the inference stage, the author first applies a 3×3 max-pooling operation to the predicted heatmaps of the human, object and interaction points, then selects the top-K human center points, object center points and interaction points according to their confidence scores, and finally matches them into triplets. For each matched human center point, the final box is:

! [] (gitee.com/weifagan/My… box final.png)
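
The inference steps above can be sketched in plain Python. A point is kept as a peak when it equals the maximum of its 3×3 neighborhood, which is what comparing a heatmap with its max-pooled version achieves. The helper names are hypothetical, and `decode_box` assumes that the predicted width and height are expressed in input-image pixels.

```python
def local_peaks(heatmap):
    """Return (score, x, y) for every cell that is the maximum of its 3x3 window."""
    height, width = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            window = [heatmap[ny][nx]
                      for ny in range(max(0, y - 1), min(height, y + 2))
                      for nx in range(max(0, x - 1), min(width, x + 2))]
            if heatmap[y][x] == max(window):
                peaks.append((heatmap[y][x], x, y))
    return peaks


def top_k_points(heatmap, k):
    """Keep the k peaks with the highest confidence."""
    return sorted(local_peaks(heatmap), reverse=True)[:k]


def decode_box(grid_point, local_offset, size, stride):
    """Build (x1, y1, x2, y2) from a grid point, its local offset and its size."""
    cx = (grid_point[0] + local_offset[0]) * stride
    cy = (grid_point[1] + local_offset[1]) * stride
    w, h = size  # assumption: size is given in input-image pixels
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)


heat = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.0],
    [0.1, 0.3, 0.2, 0.6],
    [0.0, 0.0, 0.1, 0.2],
]
points = top_k_points(heat, k=2)
box = decode_box((25, 14), (0.75, 0.5), (40, 80), stride=4)
```

Only the two local maxima survive the peak test, so neighboring cells with slightly lower scores do not produce duplicate detections.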

4. Personal Summary

1. What problems does the article solve?

The paper addresses the limited effectiveness and efficiency of traditional two-stage HOI detection.

2. Explain the ideas in your own words

The paper proposes PPDM, a parallel single-stage HOI detection network. PPDM first uses a key-point heatmap prediction network to extract features, followed by two parallel branches: a point detection branch and a point matching branch. The point detection branch predicts three points (human center point, object center point and interaction point) together with the corresponding sizes and local offsets. The point matching branch predicts the displacements from the interaction point to the human center point and the object center point. At inference, the top-K human center points, object center points and interaction points are selected according to their confidence scores and finally matched into triplets.

3. Key factors

Predicting points directly is difficult, so the points are mapped to Gaussian-kernel heatmaps and point detection is converted to heatmap estimation.
Traditional HOI detection consists of two sequential stages, proposal detection followed by interaction prediction, while PPDM uses two parallel branches. One branch predicts the human and object boxes and the interaction point, while the other predicts the displacements from the interaction point to the human and object center points.
Traditional HOI methods detect humans and objects in isolation, without considering the connection between them, while PPDM estimates the human center point, interaction point and object center point together. To detect interaction points well, the receptive field must be enlarged so that it covers the human-object context, which takes the connection between them into account.

4. For my own use

By using a key-point heatmap network, direct point prediction can be converted to heatmap prediction.
The parallel branches of PPDM are each responsible for a different task.