By Ilias Mansouri


Introduction

With pose estimation, as the name suggests, we try to infer the pose of an object or person from an image. This involves identifying and locating key points on the body. Keypoint identification is a challenging task due to small joints, occlusion, lack of context, and the rotation and orientation of the body. Since the rest of this article focuses on human pose estimation, the key points are the major joints such as the knees, elbows, shoulders, and wrists.

In terms of classification, pose estimators can be divided into the following categories:

  • Dimensionality (2D vs. 3D)
  • Single-pose vs. multi-pose (detecting one or several people or objects)
  • Methodology (keypoint-based vs. instance-based)

A 2D pose estimator predicts the 2D positions of key points in an image or video frame, while a 3D pose estimator adds depth to the prediction, lifting the detected person into 3D; this is obviously more challenging. Single-pose estimators detect and track a single person or object, while multi-pose estimators handle several at once. Methodologically, keypoint-based models first detect all instances of each particular key point and then group the key points into skeletons, whereas instance-based pose estimators first run an object detector to find person instances and then estimate the key points within each cropped region. In the literature, these are referred to as the bottom-up and top-down approaches, respectively.

The top-down approach applies a person detector to the image and then runs a single-person pose estimator on each detection to infer its key points. If the person detector fails, the pose estimation will most likely fail too. Moreover, the amount of computation grows with the number of people. The bottom-up approach suffers less from these shortcomings, but associating the keypoint candidates with the right individuals remains challenging.


DeepPose

DeepPose was the first work to apply a deep neural network (DNN) to the challenge of human pose estimation. Below, we find the architecture used.

From the input image, the position of each body joint is directly regressed. By passing the initial pose estimate through a cascade of such DNNs, the joint predictions are progressively refined, achieving SOTA results at the time. A minimal sketch of the idea follows.
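To make the regression formulation concrete, here is a minimal PyTorch sketch; the backbone below is illustrative (the paper used an AlexNet-style network), and all layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class JointRegressor(nn.Module):
    """DeepPose-style regressor: a CNN maps an image directly to
    normalized (x, y) coordinates for K body joints."""
    def __init__(self, num_joints=14):
        super().__init__()
        self.num_joints = num_joints
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        self.head = nn.Linear(64 * 4 * 4, num_joints * 2)

    def forward(self, x):
        return self.head(self.features(x)).view(-1, self.num_joints, 2)

# Cascade idea: stage 0 predicts joints from the full image; each later
# stage sees a crop around the previous estimate and regresses a
# correction, progressively refining the prediction.
model = JointRegressor()
coords = model(torch.randn(1, 3, 220, 220))  # shape (1, 14, 2)
```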

Deep(er) Cut

DeepCut reformulates the problem of estimating the poses of an unknown number of people in an image as an optimization problem, asking to:

  • Generate a set of candidate body parts in the image and select a subset of them
  • Label each body part in this subset (for example, arm, leg, head)
  • Group the body parts belonging to the same person

These three problems are then jointly modeled as an integer linear program (ILP), sketched schematically below.
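Heavily simplified relative to the paper's full formulation, the ILP can be sketched with binary variables $x_{d,c}$ (candidate $d$ receives part label $c$) and $y_{d,d'}$ (candidates $d$ and $d'$ belong to the same person):

$$\min_{x,\,y}\;\sum_{d,c}\alpha_{d,c}\,x_{d,c}\;+\;\sum_{d<d'}\beta_{d,d'}\,y_{d,d'}$$

subject to each candidate receiving at most one label ($\sum_c x_{d,c}\le 1$), clustering being allowed only between selected candidates, and transitivity constraints such as $y_{d,d'}+y_{d',d''}-1\le y_{d,d''}$, so that "belongs to the same person" behaves as an equivalence relation. The unary costs $\alpha$ come from the part detectors and the pairwise costs $\beta$ from the pairwise terms.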

To find all the body parts in the image, an adapted version of Fast R-CNN (AFR-CNN) is used. Specifically, the adaptation replaces selective-search proposal generation with a deformable part model (DPM) and enlarges the detection window size so that the DPM can capture more context.

The problem of describing visual objects, and of finding such an object in an actual photograph, has been studied since the 1970s. In true engineering fashion, an object is modeled as a set of parts arranged in a deformable configuration.

Part-based model of a human

Humans are represented as collections of parts arranged in a deformable configuration, and the appearance of each part is modeled separately. Each pair of parts is connected by a spring, which introduces the required deformability.

Suspecting that the DPM-based detector might be suboptimal (which it is), the authors instead trained a dense CNN detector based on VGG. Body-part detection was reformulated as multi-label classification: the model outputs a part probability score map for each candidate. In addition, similar to other segmentation tasks, dilated convolutions yielding a stride of 8 are used for finer part localization.

DeeperCut builds on DeepCut's dense CNN but uses a ResNet backbone instead. As with the VGG backbone, the original stride of 32 px is too large; however, due to memory limitations, applying the hole ("à trous") algorithm throughout is not feasible. The ResNet architecture is therefore adjusted by removing the final classification layer and reducing the stride of the first convolutional layer to prevent excessive downsampling. In the fifth convolutional block, all 3×3 convolutions have holes (dilation) added, and deconvolution layers are used for upsampling.

DeeperCut also benefits from a larger receptive field to infer the locations of other nearby parts. This insight, termed the image-conditioned pairwise term, allows pairwise probabilities between parts to be computed.

Pairwise part prediction: a logistic regression is trained on per-pair costs, with the regressed offsets and angles used as features to produce pairwise part probabilities.

DeepCut solved a single ILP instance over all candidate body parts in the image. DeeperCut instead proposes an incremental three-stage optimization:

  • Solve the ILP for heads and shoulders
  • Add elbows and wrists to the stage-1 solution and re-optimize the ILP
  • Add the remaining body parts to the stage-2 solution and re-optimize the ILP

Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation

In this paper, the detection pipeline consists of a convolutional network and a Markov random field (MRF). As in the previously discussed work, a ConvNet is used to localize body parts. The architecture is shown in the figure below:

Multi-Resolution Sliding-Window With Overlapping Receptive Fields

The architecture processes the input image with a sliding-window approach to produce a per-pixel heat map representing the likelihood of each joint position.

There are two overlapping multi-resolution receptive fields: one for a 64×64 input (the upper convolutional path) and one for a 128×128 input downsampled to 64×64, which feeds more "context" into the lower convolutional path. Both inputs are normalized with local contrast normalization (LCN), sketched below, before being passed into the network.
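As a rough illustration of what LCN does, here is a minimal NumPy/SciPy sketch; the Gaussian window width is an arbitrary assumption:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_normalize(img, sigma=2.0, eps=1e-4):
    """Local contrast normalization: subtract a Gaussian-weighted local
    mean, then divide by the local standard deviation."""
    local_mean = gaussian_filter(img, sigma)
    centered = img - local_mean
    local_std = np.sqrt(gaussian_filter(centered ** 2, sigma))
    return centered / np.maximum(local_std, eps)

patch = np.random.rand(64, 64).astype(np.float32)
normalized = local_contrast_normalize(patch)
```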

The main advantage of overlapping fields, the authors note, is the ability to see a larger part of the image at a relatively small increase in weights. In addition, LCN minimizes the overlapping spectral content between the two windows. Since this model demands considerable computing power, it was subsequently streamlined, as shown below.

The concepts of multi-resolution (lower ConvNet) and sliding window (upper ConvNet) are retained. The high-context, low-resolution input requires half the stride of the sliding-window model, so four downsampled images need to be processed. The sliding-window feature maps are replicated, and the low-resolution feature maps are added and interleaved, resulting in an output heat map of lower resolution than the input.

The part detector outputs many anatomically incorrect poses, since nothing models the implicit constraints between body key points. This problem is cleverly addressed with a higher-level spatial model that imposes interconnection constraints and anatomical consistency on the poses. The spatial model is expressed as an MRF. By first training the part detector and then reusing its heat-map output to train the spatial model, an MRF is obtained that formulates the joint dependencies of the graphical model. Finally, the combined part detector + spatial model is fine-tuned end to end with backpropagation.

Efficient Object Localization Using Convolutional Networks

Building on the work above, this study implements a multi-resolution ConvNet to estimate joint offset positions within a small image region. Below, we find the architecture; the similarities to the previously discussed architecture are easy to see.

In addition, a SpatialDropout layer is added. The authors found that, due to the strong spatial correlation within feature maps, standard dropout does not prevent overfitting. The solution is to drop entire 1D feature maps (channels), which promotes independence between them; a sketch follows below. As before, the (coarse) heat map is passed to the MRF, which filters out anatomically infeasible poses.
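In modern frameworks this idea maps to channel-wise dropout. A minimal PyTorch sketch, assuming `nn.Dropout2d` as the SpatialDropout equivalent:

```python
import torch
import torch.nn as nn

# Standard dropout zeroes individual activations, but neighboring
# pixels in a feature map are strongly correlated, so information
# leaks back in. SpatialDropout instead zeroes whole channels;
# in PyTorch this corresponds to nn.Dropout2d.
spatial_dropout = nn.Dropout2d(p=0.5)
features = torch.randn(8, 128, 32, 32)   # (batch, channels, H, W)
out = spatial_dropout(features)          # entire channels are dropped
```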

The next step is to recover the spatial information lost through pooling. This is achieved with another ConvNet that refines the coarse heat-map results.

Convolutional Pose Machines

Convolutional Pose Machines (CPM) inherit and build on the Pose Machine (PM) architecture, which integrates rich spatial interactions between body parts across scales into a modular, sequential framework. As we will see, CPM advances PM by using a convolutional architecture that learns image features and spatial context directly.

As we will see below, a PM is a sequential prediction algorithm that emulates message passing to predict a confidence for each body part. The basic principle is that the estimated confidence of each body part is refined iteratively at every stage. Message passing can be understood as a sequence of probabilistic classifiers in which the output of one predictor (a multi-class classifier of any type) becomes the input to the next.

Architecture of a 1 Stage Pose Machine (a) and a 2 Stage Pose Machine (b)

At each stage, a classifier predicts the position of each body part with a confidence, based on the outputs of previous classifiers and on image features, and these predictions are refined stage by stage. Finally, we can observe that hierarchical representations are created by processing the image at several scales.

Level 1, as seen in the image, is a coarse representation of the whole body; level 2 represents compositions of body parts; and level 3, the finest level, consists of the regions around the key points. A single multi-class predictor per stage is trained across all hierarchy levels, i.e., each predictor is trained to output a set of confidences for every key point from feature vectors that may come from any level.

In row (a) below, we can see how spatial correlations between the body-part confidences are captured by concatenating the confidence scores at a position z, producing vectorized patches. For long-range interactions, non-maximum suppression yields a list of peaks (high-confidence positions) for each key point/body part, from which offsets in polar coordinates can be computed.

Replacing the prediction and feature-extraction components with CNNs then yields the CPM, an end-to-end architecture.

Architecture of the Pose Machine (a & b) and Convolutional Pose Machine (c & d)

The first stage of the architecture creates feature maps from a growing receptive field over the input image. Subsequent stages refine the predictions for each body part using both the input image and the feature maps of the previous stage. Intermediate loss layers prevent the gradients from vanishing during training.

As described in the paper, subsequent predictors can eliminate false estimates by using the previous feature maps as strong cues for where certain parts should be. By gradually increasing the receptive field, the model learns to incorporate contextual information into the feature maps, enabling it to learn complex relationships between body parts without explicitly modeling any graphical representation of the human body. A minimal sketch of one refinement stage follows.
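Here is a minimal PyTorch sketch of one such refinement stage, with illustrative channel counts and kernel sizes (the paper's exact layer configuration differs):

```python
import torch
import torch.nn as nn

class CPMStage(nn.Module):
    """One refinement stage of a Convolutional Pose Machine (sketch).
    It takes image features plus the previous stage's belief maps and
    outputs refined belief maps, one channel per body part."""
    def __init__(self, feat_ch=32, num_parts=14):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(feat_ch + num_parts, 64, 7, padding=3), nn.ReLU(),
            nn.Conv2d(64, 64, 7, padding=3), nn.ReLU(),  # large kernels grow the receptive field
            nn.Conv2d(64, num_parts, 1),
        )

    def forward(self, image_feats, prev_beliefs):
        return self.refine(torch.cat([image_feats, prev_beliefs], dim=1))

stage = CPMStage()
feats = torch.randn(1, 32, 46, 46)
beliefs = torch.randn(1, 14, 46, 46)
refined = stage(feats, beliefs)  # (1, 14, 46, 46); a loss is applied per stage
```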

Stacked Hourglass Networks

Motivated by the need to capture information at every scale, a novel CNN architecture was developed in which features at all scales are processed to capture the spatial relationships of the human body: local evidence is essential for identifying body parts, while anatomical consistency is better recognized at larger scales.

Architecture of an hourglass module

In the figure above, we immediately recognize the symmetric layout of bottom-up and top-down processing. This type of architecture has been discussed before for semantic segmentation, where it is referred to as a conv-deconv or encoder-decoder architecture.

In general, a set of convolution and max-pooling layers processes the input features. After each max-pooling layer, the network branches off and applies further convolutions to the features at the original, pre-pooled resolution. In the figure above, each block consists of such a set of layers; the precise configuration of the convolutional layers is quite flexible.

Judging by the success of ResNets, the authors ultimately use one residual module per block. Once the minimum resolution is reached, the decoder, or top-down path, takes over, in which the network effectively combines features across scales. Finally, not visible in the image, two 1×1 convolutions are applied to generate a set of heat maps, each predicting the probability that a particular key point is present. A minimal recursive sketch of the module follows.
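A minimal recursive PyTorch sketch of a single hourglass module, assuming a plain convolutional block stands in for the residual module:

```python
import torch
import torch.nn as nn

def conv_block(c):
    # Stand-in for the residual module used at every block.
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU())

class Hourglass(nn.Module):
    """Recursive hourglass (sketch): pool, recurse at half resolution,
    upsample, and add a skip branch processed at the original scale."""
    def __init__(self, depth=4, ch=64):
        super().__init__()
        self.skip = conv_block(ch)
        self.down = conv_block(ch)
        self.inner = Hourglass(depth - 1, ch) if depth > 1 else conv_block(ch)
        self.up = conv_block(ch)

    def forward(self, x):
        skip = self.skip(x)
        y = nn.functional.max_pool2d(x, 2)
        y = self.up(self.inner(self.down(y)))
        y = nn.functional.interpolate(y, scale_factor=2, mode="nearest")
        return skip + y

hg = Hourglass()
out = hg(torch.randn(1, 64, 64, 64))  # same spatial size as the input
```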

By stacking a series of hourglass modules, each feeding its output into the next, the network gains a mechanism for reevaluating features and higher-order spatial relationships. As before, intermediate loss functions are key. With a single hourglass, loss (supervision) can only be applied after the upsampling stage, so the features cannot be reassessed in a larger global context.

This means that if we want the network to improve its predictions, those predictions must be made not only at the local scale but also at a larger scale, so that they can be related to the wider image context. Below, we can observe the proposed solution:

An overview of the intermediate supervision process for applying losses to generated heat maps (blue)

Intermediate heat maps are generated and a loss is applied to them; these heat maps are then mapped back to feature space with 1×1 convolutions and combined with the features output by the previous hourglass module.

Training uses a sequence of eight hourglass modules that do not share weights. A mean squared error loss on the heat maps is used, with each module receiving the same loss function and the same ground truth, as sketched below.
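A minimal sketch of that intermediate supervision, assuming each module's predicted heat maps are collected into a list:

```python
import torch

def stacked_hourglass_loss(predicted_heatmaps, target_heatmaps):
    """Intermediate supervision (sketch): every one of the 8 hourglass
    modules predicts a full set of heat maps, and the same MSE loss
    against the same ground truth is applied to each of them."""
    return sum(torch.nn.functional.mse_loss(pred, target_heatmaps)
               for pred in predicted_heatmaps)
```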

OpenPose

OpenPose is also the first open-source library for real-time multi-person keypoint detection and is an improvement on CMUPose. CMUPose proposed the first bottom-up pose estimator using Part Affinity Fields (PAFs).

Given the input image, heat maps representing the probability that a key point appears at each pixel and part affinity vector fields are generated. Both are produced by the two-branch multi-stage CNN shown below.

A feature map F is first generated from the input image by the fine-tuned first 10 layers of VGG. F is then used as input to the first stage of each branch. Branch 1 (the top branch) predicts the keypoint confidence maps, while branch 2 predicts the part affinity fields. In subsequent stages, the confidence maps and affinity fields are refined by concatenating the previous predictions of both branches with the feature map F. At the end of each stage, an L2 loss is applied between the estimates and the ground truth.

As seen before, a confidence map is a 2D heat map expressing the belief that a key point is present at a given pixel. Part affinity fields are 2D vector fields that encode the direction from one part of a limb to the other. The advantage of this representation is that it preserves information about both the position and the orientation of the limb's support region. Non-maximum suppression yields a set of candidate body-part positions, each of which could belong to one of several people. Line integrals over the affinity fields, quantifying how well the field aligns with a candidate limb, are then used to match body parts to individuals; a sketch follows.
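A rough NumPy sketch of such a line-integral score, assuming the PAF is given as two scalar maps `paf_x` and `paf_y` and key points are (x, y) pixel coordinates:

```python
import numpy as np

def paf_score(paf_x, paf_y, p1, p2, num_samples=10):
    """Score a candidate limb between key points p1 and p2 by the line
    integral of the part affinity field along the segment: sample
    points on the line and average the dot product between the field
    and the unit vector from p1 to p2."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    v = p2 - p1
    norm = np.linalg.norm(v)
    if norm < 1e-6:
        return 0.0
    u = v / norm
    score = 0.0
    for t in np.linspace(0.0, 1.0, num_samples):
        x, y = (p1 + t * v).astype(int)
        score += paf_x[y, x] * u[0] + paf_y[y, x] * u[1]
    return score / num_samples

paf_x, paf_y = np.zeros((64, 64)), np.ones((64, 64))
print(paf_score(paf_x, paf_y, (10, 10), (10, 40)))  # field aligned with limb: 1.0
```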

Building on the CMUPose work, OpenPose refines only the PAFs at each stage, dropping the per-stage refinement of the body-part confidences. Below, we can observe that the PAFs, which encode the part-to-part associations, are predicted first and then fed into the CNN stages that infer the confidence maps.

Architecture of multi-stage OpenPose

Network depth is increased by replacing each 7×7 convolution layer with three consecutive 3×3 kernels whose outputs are concatenated. Computationally, processing is roughly halved because the PAFs and confidence maps no longer both need refining at every stage: the PAFs are refined first and passed on, and only then are the confidence maps refined. If the PAFs are known, the locations of body parts can be inferred, but not vice versa.

(Higher)HRNet

HRNet introduces a novel architecture in which subnetworks from high to low resolution are connected in parallel rather than in series, as in most existing solutions, thereby maintaining a high-resolution representation throughout.

HRNet architecture

Rich high-resolution features are obtained through repeated multi-scale fusion across the subnetworks, so that each high-to-low-resolution representation receives information from the other parallel representations. Downsampling uses strided convolutions, while upsampling uses 1×1 convolutions followed by nearest-neighbor upsampling. The heat maps are regressed from the main high-resolution branch.

Building on this preliminary work, HigherHRNet addresses two major challenges:

  • How to improve inference performance for small persons without sacrificing performance for large persons?
  • How to generate high-resolution heat maps for keypoint detection of small persons?

Using HRNet as the backbone, HigherHRNet (below) adds a deconvolution module in which heat maps are predicted from higher-resolution feature maps.

The stem is a sequence of two 3×3 convolution layers that reduces the resolution to a quarter, after which the features are fed through the HRNet backbone. A 4×4 deconvolution layer, followed by BatchNorm and ReLU, takes the features and predicted heat maps as input and generates a feature map twice the size of its input.

Four residual blocks are added after the deconvolution layer to refine the high-resolution feature map. Finally, bilinear interpolation upsamples the low-resolution heat maps so that the heat maps of the feature pyramid can be aggregated, and the final prediction is obtained by averaging all of them, as sketched below.
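A minimal PyTorch sketch of that aggregation step; tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def aggregate_heatmaps(heatmaps):
    """HigherHRNet-style aggregation (sketch): bilinearly upsample every
    pyramid level to the highest resolution, then average the heat maps
    to form the final prediction."""
    target_size = heatmaps[0].shape[-2:]  # highest-resolution level first
    upsampled = [F.interpolate(h, size=target_size, mode="bilinear",
                               align_corners=False) for h in heatmaps]
    return torch.stack(upsampled).mean(dim=0)

h1 = torch.randn(1, 17, 128, 128)  # higher-resolution head
h2 = torch.randn(1, 17, 64, 64)    # lower-resolution head
final = aggregate_heatmaps([h1, h2])  # (1, 17, 128, 128)
```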

PifPaf

PifPaf was developed to estimate human poses in crowded urban environments, making it suitable for self-driving cars, delivery robots, and more. Below, we see a ResNet backbone with two heads: the Part Intensity Field (PIF) predicts the location, size, and confidence of key points, while the Part Association Field (PAF) predicts the associations between key points.

PifPaf Architecture

More specifically, at each location PIF outputs a confidence, a vector pointing to the closest key point with a spread, and a scale. As shown below, the raw confidence map is very coarse; its localization is therefore improved by fusing it with the vector field, which produces a higher-resolution confidence map. The scale, i.e., the spatial extent of a joint, can also be learned from this field; together with the spread, it helps improve pose-estimation performance across people of different sizes.

Left: confidence map, Middle: vector field, Right: fused confidence map

The PAF connects the joint locations into poses from the bottom up by associating pairs of key points. Examples of its 19 associations are:

  • Left ankle to left knee
  • Left hip to right hip
  • Nose to right eye

PAF is associated with the left shoulder and left hip

For a given feature map, at each position the PAF predicts a confidence together with the origins of the two vectors pointing to the associated key points (top left). Associations with a confidence above 0.5 are shown on the right.

Finally, the decoder takes the two fields (PIF and PAF) and converts them into sets of 17 coordinates, each representing a human skeleton. A greedy algorithm places seeds for all keypoint types in a priority queue ordered by decreasing confidence. These candidates (seeds) are popped from the queue, and connections to other joints are added with the help of the PAF fields. PAF associations are scored, since double connections between the current and the next key point are possible. Finally, non-maximum suppression is applied per keypoint type to produce the human skeletons. A simplified sketch of this decoding follows.
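A heavily simplified sketch of the greedy decoding for a single skeleton (the real decoder handles many people at once); `paf_connect` is a hypothetical helper standing in for the PAF lookup:

```python
import heapq

def decode(seeds, paf_connect):
    """Greedy PifPaf-style decoding (simplified, single skeleton).
    `seeds` is a list of (confidence, joint_type, x, y) candidates
    from the PIF; `paf_connect(joint_type, x, y)` is assumed to return
    (joint_type, x, y, confidence) neighbors proposed by the PAF."""
    queue = [(-c, jt, x, y) for c, jt, x, y in seeds]
    heapq.heapify(queue)                 # highest confidence popped first
    skeleton = {}
    while queue:
        neg_c, jt, x, y = heapq.heappop(queue)
        if jt in skeleton:
            continue                     # this joint type is already placed
        skeleton[jt] = (x, y, -neg_c)
        for njt, nx, ny, nc in paf_connect(jt, x, y):
            heapq.heappush(queue, (-nc, njt, nx, ny))
    return skeleton
```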

DirectPose

DirectPose proposes the first multi-person pose estimator in which keypoint annotations are used directly for end-to-end training and, at inference, the model maps the input to the key points of each individual instance without any box detection. Building on the emergence of anchor-free object detectors, which directly regress the corners of the target bounding box, the researchers asked whether this detection technique could also be used to detect key points.

The underlying idea is that keypoint detection can be restated as detecting a special "bounding box" with more than two corners. The authors showed that this performs poorly, mainly because a single feature vector is used to regress all the key points. They address this challenge by extending the fully convolutional one-stage object detection (FCOS) architecture with an output branch for keypoint detection.

FCOS Architecture

FCOS reformulates object detection in a per-pixel manner. Similar to semantic segmentation, FCOS treats the positions on the input image as training samples, rather than the anchor boxes of anchor-based detectors. Positions that fall inside a ground-truth bounding box are considered positive and receive:

  • The class label of the ground-truth box
  • A 4D vector representing the distances from the position to the four edges of the bounding box, used as the regression target for that position

Using a feature pyramid network (FPN) provides robustness to objects of different scales. The feature maps produced by the backbone (ResNet-50) are convolved with 1×1 kernels. Feature levels P3, P4, P5, P6, and P7 have strides of 8, 16, 32, 64, and 128, respectively. Except for P6 and P7, the corresponding lateral and top-down pathways are merged by addition. Multi-level prediction also handles the case of two ground-truth boxes of different sizes overlapping each other.

FCOS restricts the regression range of each feature level using the thresholds 0, 64, 128, 256, 512, and infinity for levels P3 through P7; each threshold is the maximum distance that feature level Pn is allowed to regress. If a location still falls inside overlapping bounding boxes, the smallest box is selected, as in the sketch below.
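A minimal sketch of this level-assignment rule, using the thresholds from the text:

```python
def assign_fpn_level(l, t, r, b):
    """FCOS-style level assignment (sketch): a location may regress a
    box only if its maximum edge distance falls inside the range of
    its pyramid level (P3..P7)."""
    ranges = {"P3": (0, 64), "P4": (64, 128), "P5": (128, 256),
              "P6": (256, 512), "P7": (512, float("inf"))}
    m = max(l, t, r, b)
    for level, (lo, hi) in ranges.items():
        if lo < m <= hi:
            return level
    return None

print(assign_fpn_level(30, 40, 100, 20))  # 'P4': max edge distance is 100
```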

Because different feature levels regress different size ranges, separate heads are required. Finally, because many low-quality boxes are predicted at positions far from the object center, the authors introduce the notion of center-ness: an extra head that predicts a normalized distance based on the position's distances to the four edges of the bounding box, given below.
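For reference, FCOS computes center-ness from a position's distances $l, t, r, b$ to the box edges as

$$\text{center-ness} = \sqrt{\frac{\min(l,r)}{\max(l,r)} \cdot \frac{\min(t,b)}{\max(t,b)}},$$

which approaches 1 near the box center and is used to down-weight low-quality detections far from it.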

DirectPose treats the key points as a very special bounding box with K corners. However, the experiments show poor performance with this naive scheme, due to the lack of alignment between the features and the predicted key points: many key points lie far from the center of the feature vector's receptive field, and the response of a feature to an input weakens as the input moves away from the receptive-field center.

Therefore, a Keypoint Alignment Module (KPAM) is proposed. Taking a 256-channel feature map as input, KPAM slides densely over it.

The locator, as the name implies, predicts for each keypoint instance the position indices of the feature vectors to sample, each a feature vector of length 256. For the n-th key point, the n-th convolution layer takes the n-th sampled feature vector as input and predicts the keypoint coordinates relative to the position of that sampled feature vector.

By summing the K offsets from the locator with those from KPAlign, we obtain coordinates, which are then scaled to match the original feature map. Finally, a small adjustment groups the key points that always occur in one region (nose, eyes, and ears) so that they share the same feature vector.

Finally, we can see how KPAM replaces the bounding-box module of the FCOS architecture discussed above. Note the additional heat-map branch, which serves as an auxiliary task/loss to make the regression-based task more learnable.

DirectPose Architecture

Conclusion

Clearly, pose estimation poses considerable challenges. The bottom-up approach has repeatedly proven superior to the top-down approach, but its key points still need to be associated with individual people. This grouping or assembly process, which yields the final instance-aware key points, can be accomplished with heuristics, human skeleton models (pictorial structures), and/or stacked confidence maps. Moreover, the complexity explodes once an unknown number of people can appear anywhere and at any scale in the image. Human interaction, articulation, and of course occlusion further complicate the keypoint assembly process.

Pose estimation has important applications in human-computer interaction, action recognition, surveillance, image understanding, threat prediction, robotics, AR and VR, animation, and gaming.

Medium.com/@ilias_mans…

