Abstract: In deep neural networks for human pose estimation, the mean squared error (MSE) is commonly used as the loss function. Although MSE is computationally simple, it does not guarantee consistency with the prediction results: different predicted heatmaps output by the network can have the same MSE yet yield different keypoint locations. To address this issue, this paper proposes an asymmetric mean squared error (AMSE) loss function based on MSE. AMSE adds a penalty term on the predicted heatmap that penalizes larger predicted values, so that the loss value is consistent with the prediction results. Experimental results on the COCO val2017 dataset show that models trained with AMSE predict more accurately than models trained with MSE.
Keywords: Human pose estimation; mean square error; asymmetric mean square error
1. Introduction
Multi-person pose estimation is one of the fundamental challenges in many computer vision applications, such as behavior recognition and human-computer interaction [1-3]. Its main purpose is to identify and locate key points of different human bodies in an image.
Since Toshev et al. applied deep learning to human pose estimation [4], the field has gradually shifted from traditional methods to deep learning [5-8]. Toshev et al. directly regressed the coordinates of human key points with a neural network, while Tompson et al. used multi-resolution images as input to extract multi-scale features and predicted key point heatmaps from them [9]. Current human pose estimation frameworks fall into two categories. The first is the two-stage method [10-12], which first detects the bounding box of each person in the image and then locates the key points of the human body within each box. The second is the key-point-based method [13-15], which first locates all key points in the image and then groups them into individual people. The two-stage method is usually more effective because it makes better use of the global semantic information of the image.
Current state-of-the-art human pose estimation methods focus on novel network models, such as CPN and SBN. CPN addresses hard-to-detect keypoints by integrating multi-level features in a pyramid network. SBN provides a simple and efficient method for human pose estimation, achieving good results by adding only deconvolution layers after the last layer of ResNet [16]. These methods all predict heatmaps and use MSE to compute the loss between the predicted heatmap and the labeled heatmap. However, MSE has an inherent shortcoming as a loss function: the MSE value between a predicted heatmap and the labeled heatmap is not consistent with the mAP metric. Two predicted heatmaps with the same MSE can produce different error rates, which we call the inconsistency problem. To solve this problem, this paper proposes the Asymmetric Mean Squared Error (AMSE), which guides the model to select a better output and thereby maintain consistency. Experiments show that, with only a small increase in computation, a model trained with AMSE performs better than a model trained with MSE.
In summary, the main contributions of this paper are as follows:
This paper analyzes the inconsistency issues arising from the calculation of MSE values using predicted heatmaps and labeled heatmaps in human pose estimation tasks.
Asymmetric mean square error (AMSE) is proposed as an improved loss function to address the inconsistency problem.
2. Asymmetric mean square error
2.1 Mean Square Error
The human pose estimation method based on heatmap representation takes a color image of a certain size as input and outputs a set of 2D heatmaps representing the localization of human body parts, as shown in Figure 1.
Figure 1. 2D heatmap
Where S = (S1, S2, ..., SJ) represents J heatmaps, each representing a key point. The formula for calculating the MSE value between heatmaps is defined as follows:
$$\mathrm{MSE} = \frac{1}{M} \sum_{j=1}^{J} \lVert S_j - G_j \rVert_2^2 \tag{1}$$
where M = J×W×H and Gj ∈ R^{W×H} is the labeled heatmap of the j-th keypoint, an image generated by placing a Gaussian spot at the keypoint location. For the predicted heatmap Sj of the j-th keypoint, the final keypoint coordinates Kj are determined by the position of the maximum value in the heatmap:
$$K_j = \arg\max_{(x, y)} S_j(x, y) \tag{2}$$
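The two operations described above, the MSE between predicted and labeled heatmaps and the maximum-position readout of a keypoint, can be sketched as follows. This is a minimal illustration in plain Python using nested lists; the function names `heatmap_mse` and `argmax_keypoint` and the 3×3 example values are assumptions made for this sketch, not part of the original method.

```python
def heatmap_mse(pred, target):
    """MSE over J heatmaps of size H x W, given as nested lists [J][H][W]."""
    total, count = 0.0, 0
    for s_j, g_j in zip(pred, target):
        for s_row, g_row in zip(s_j, g_j):
            for s, g in zip(s_row, g_row):
                total += (s - g) ** 2
                count += 1
    return total / count  # count equals M = J * W * H


def argmax_keypoint(heatmap):
    """Return the (x, y) position of the largest value in one H x W heatmap."""
    best_value, best_xy = None, (0, 0)
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if best_value is None or value > best_value:
                best_value, best_xy = value, (x, y)
    return best_xy


# Hypothetical example: one keypoint (J = 1) on a 3 x 3 heatmap.
target = [[[0.0, 0.5, 0.0],
           [0.5, 1.0, 0.5],
           [0.0, 0.5, 0.0]]]
pred = [[[0.1, 0.4, 0.0],
         [0.5, 0.9, 0.6],
         [0.0, 0.5, 0.1]]]

loss = heatmap_mse(pred, target)
keypoints = [argmax_keypoint(s_j) for s_j in pred]
print(loss, keypoints)
```

Note that the readout depends only on which position holds the largest value, while the loss depends on the absolute differences at every position.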
The best current human pose estimation methods all use MSE as the loss function [17]. However, MSE cannot ensure the consistency of prediction results: predicted heatmaps with the same MSE value can lead to different predicted keypoint locations. This problem is called the inconsistency problem.
2.2 Problem Analysis
For a given labeled heatmap G0 and MSE value, there exist multiple predicted heatmaps S* that satisfy the following formula:
(3)
These predicted heatmaps S* differ from one another, yet all yield the same MSE value. To simplify the problem, assume S* satisfies the following condition:
(4)
As shown in formula (4), each point on the predicted heatmap has only two possibilities: larger or smaller than the target value. Taking a one-dimensional heatmap as an example, and assuming the labeled heatmap is [0.5, 1, 0.5]^T, there are 2^3 = 8 predicted heatmaps that satisfy formula (4), as shown in Figure 2. In the figure, bold text indicates a value that is 0.5 larger than the target value at the corresponding position, and non-bold text indicates a value that is 0.5 smaller.
Figure 2. Predicted heatmaps with the same MSE
As shown in Figure 2, although the MSE value is the same, the keypoint locations obtained from predicted heatmaps (a)-(e) and from predicted heatmaps (f)-(h) differ by one pixel after the maximum-value operation of formula (2), so the final results differ. This is the inconsistency problem of MSE. The problem is mainly caused by the operation of formula (2): this operation is insensitive to the absolute values of the predicted heatmap, whereas the relative order of those values is decisive for the prediction. Only when the relative order of the values in the predicted heatmap is the same as in the labeled heatmap does formula (2) give a consistent result. MSE, in contrast, aims to reduce the absolute difference between the prediction and the target. This mismatch between MSE and formula (2) causes the inconsistency problem.
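The one-dimensional example can be checked numerically. The sketch below enumerates all eight heatmaps that differ from the labeled heatmap [0.5, 1, 0.5] by ±0.5 at every position, computes the MSE of each, and tests whether the position of the maximum still matches the labeled peak. The enumeration order is arbitrary and need not match the panel order (a)-(h) of Figure 2; ties in the maximum are resolved to the lowest index, which is an assumption of this sketch.

```python
from itertools import product

target = [0.5, 1.0, 0.5]
offset = 0.5
true_peak = target.index(max(target))  # position of the labeled keypoint


def mse(pred, ref):
    """Mean squared error between two 1D heatmaps."""
    return sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref)


def argmax(values):
    """Index of the maximum value; ties resolve to the lowest index."""
    return values.index(max(values))


results = []
for signs in product((+1, -1), repeat=len(target)):
    pred = [t + s * offset for t, s in zip(target, signs)]
    results.append((pred, mse(pred, target), argmax(pred) == true_peak))

for pred, loss, correct in results:
    print(pred, loss, "correct" if correct else "off by one pixel")
```

Every heatmap has the same MSE of 0.25, but only five of the eight keep the maximum at the labeled position, which matches the split between (a)-(e) and (f)-(h) described above.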
For a related problem, the image processing field proposed the structural similarity index (SSIM) [18]: for the same MSE, the perceived visual quality of an image improves as its structural similarity increases. In the human pose estimation task, however, the annotation heatmap is generated by placing Gaussian spots at the key points, and each 64×64 annotation heatmap is nonzero only where a Gaussian spot is placed. The annotation heatmap is therefore very sparse and lacks rich edge and texture information, which makes SSIM unsuitable. If the range of the Gaussian spots is increased to make the texture more pronounced, the key points are localized less accurately.
Therefore, MSE remains one of the most widely used loss functions. To address the existing problems, this paper proposes an asymmetric mean squared error (AMSE) to improve it.
2.3 Asymmetric mean square error
Because all heatmaps in Figure 2 have the same MSE value, MSE cannot distinguish between them, even though they lead to different prediction results. As shown in Figures 2(a) and 2(b), prediction is best when all values of the predicted heatmap are greater than, or all are less than, the corresponding values of the labeled heatmap. Therefore, forcing the model to output values similar to those in Figures 2(a) and 2(b) may improve its performance. Adding the square of the model's output value to the original MSE loss function guides the model toward smaller predicted values, similar to those in Figure 2(b). The formula is defined as follows:
(5)
In the formula, M = J×W×H, Gj ∈ R^{W×H}, and Sj ∈ R^{W×H}, where Gj and Sj represent the labeled heatmap and predicted heatmap of the j-th keypoint, respectively. When β = 0.01, this loss function is called the Regularized Mean Squared Error (RMSE). RMSE penalizes larger values in the predicted heatmap by adding an L2 penalty. However, because of the squared term, the loss is not zero even when the predicted value equals the target value, so the predicted value is always penalized. The RMSE curves for target values of 1, 0.5, and 0.25 are shown in Figure 3. The point where RMSE reaches its minimum is not the target value, which leads to poor prediction results. Nevertheless, adding a penalty on larger predicted values does bias the model toward smaller outputs, which benefits prediction performance.
Figure 3. RMSE prediction curve
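The shift of the minimum can be illustrated for a single pixel. The sketch below assumes a per-pixel loss of the form (s − g)² + β·s², which follows the verbal description of adding the squared output to MSE; the exact form and normalization of formula (5) are not reproduced in this text, so this form, the function names, and the symbol β for the penalty coefficient are assumptions. Setting the derivative 2(s − g) + 2βs to zero gives the minimizer s = g / (1 + β), which is always below a positive target g.

```python
def penalized_loss(s, g, beta):
    """Assumed per-pixel form: squared error plus an L2 penalty on the output."""
    return (s - g) ** 2 + beta * s ** 2


def penalized_minimizer(g, beta):
    """Closed-form minimizer of penalized_loss with respect to s."""
    return g / (1.0 + beta)


beta = 0.01
rows = []
for g in (1.0, 0.5, 0.25):  # target values used for the curves in Figure 3
    s_star = penalized_minimizer(g, beta)
    rows.append((g, s_star, penalized_loss(s_star, g, beta), penalized_loss(g, g, beta)))

for g, s_star, loss_at_min, loss_at_target in rows:
    print(f"target={g}: minimizer={s_star:.4f}, "
          f"loss at minimizer={loss_at_min:.6f}, loss at target={loss_at_target:.6f}")
```

Under this assumed form, the loss at the target value equals β·g² rather than zero, and the minimizer lies slightly below the target, consistent with the two drawbacks described above.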
If a loss function not only minimizes the loss when reaching the target value but also imposes a greater penalty on larger predicted values, then it can better guide the model to output smaller values and avoid the drawbacks of RMSE. For this purpose, we propose Asymmetric Mean Squared Error (AMSE), defined as follows:
(6)
In the formula, Wj is a matrix generated by the model and treated as a constant during backpropagation. Wj is introduced so that AMSE reaches its minimum when the predicted value equals the target value. When Wj equals the target value, formula (6) takes the following form:
(7)
Like MSE, formula (7) reaches its minimum when the predicted value equals the target value, but it additionally penalizes larger predicted values. Therefore, Wj is a feasible choice. The curves of AMSE for target values of 1, 0.5, and 0.25 are shown in Figure 4.
Figure 4. AMSE prediction curve
As shown in Figure 4, the AMSE prediction curve exhibits linear asymmetry and reaches its minimum when the predicted value equals the target value. Experiments demonstrate that Wj does not need to be equal to the target value, and different forms of Wj are equally effective.
3. Experiment and Analysis
3.1 Experimental Data and Model
The experiments are conducted on the COCO Keypoint Challenge dataset [19], which requires predicting the keypoint coordinates of multiple people in uncontrolled environments. The dataset contains more than 200,000 images and 250,000 labeled human instances, of which 150,000 instances are publicly available as training and validation sets. Following reference [10], the model is trained only on the COCO train2017 dataset without any additional data, and tested on the val2017 dataset. Object keypoint similarity (OKS) is used as the evaluation metric. OKS plays the same role as IoU in object detection: it is computed from the distance between the predicted point and the target point, normalized by the scale of the person.
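As a rough illustration of the metric, the sketch below computes OKS for one person following the publicly documented COCO definition: each labeled keypoint contributes exp(−d² / (2·s²·k²)), where d is the distance between prediction and ground truth, s² is the object area, and k is a per-keypoint constant, and the contributions are averaged over labeled keypoints. The function name `oks` and all numeric values in the example, including the k constants, are hypothetical; the official evaluation code should be used for reported results.

```python
import math


def oks(pred, gt, visible, area, k):
    """Object keypoint similarity for one person.

    pred, gt: lists of (x, y) coordinates; visible: visibility flags
    (keypoints with flag 0 are unlabeled and ignored); area: object
    area (s squared); k: per-keypoint constants.
    """
    numerator, labeled = 0.0, 0
    for (px, py), (gx, gy), v, k_i in zip(pred, gt, visible, k):
        if v > 0:
            d2 = (px - gx) ** 2 + (py - gy) ** 2
            numerator += math.exp(-d2 / (2.0 * area * k_i ** 2))
            labeled += 1
    return numerator / labeled if labeled else 0.0


# Hypothetical example with three keypoints, the last one unlabeled.
gt = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0)]
pred = [(11.0, 10.0), (20.0, 20.0), (99.0, 99.0)]
visible = [2, 1, 0]
area = 100.0
k = [0.1, 0.1, 0.1]

print(oks(gt, gt, visible, area, k))    # perfect prediction
print(oks(pred, gt, visible, area, k))  # first keypoint off by one pixel
```

A perfect prediction gives an OKS of 1, and the score decays smoothly as predictions move away from the labeled positions, more slowly for larger people.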
Although neural network structures and experiments keep growing in complexity, SBN remains one of the best human pose estimation methods while staying simple and effective. We therefore use SBN as the experimental baseline to verify the performance of AMSE. ResNet is a commonly used backbone network for image feature extraction; SBN simply adds several deconvolutional layers after the last layer of ResNet. Following SBN, we add three deconvolutional layers after the last layer of ResNet, each followed by batch normalization and a ReLU activation. Each deconvolutional layer has 256 filters of size 4×4 with a stride of 2. Finally, a 1×1 convolution adjusts the number of output channels to produce the predicted heatmaps. The labeled heatmap is generated by placing a 2D Gaussian spot at each keypoint location.
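The generation of a labeled heatmap can be sketched as follows. This is a minimal illustration that places an unnormalized 2D Gaussian with peak value 1 at the keypoint; the function name `gaussian_heatmap`, the standard deviation `sigma`, the 64×64 size, and the keypoint position are assumptions for the example, since these settings are not stated here.

```python
import math


def gaussian_heatmap(width, height, cx, cy, sigma):
    """Return an H x W nested list with a Gaussian spot centered at (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


# Hypothetical keypoint at column 20, row 30 of a 64 x 64 heatmap.
heatmap = gaussian_heatmap(width=64, height=64, cx=20, cy=30, sigma=2.0)

peak = max(max(row) for row in heatmap)
far_value = heatmap[0][63]  # value far from the keypoint
print(peak, far_value)
```

Values fall off quickly away from the keypoint, so most of the heatmap is close to zero, which is the sparsity discussed in Section 2.2.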
3.2 Model Training and Testing
The ResNet backbone was initialized with weights pre-trained on the ImageNet classification task. During training, the bounding box of each labeled person was adjusted to a fixed aspect ratio of 4:3 by extending one side of the box. The fixed-ratio box was then cropped from the image and rescaled to a resolution of 256×192, the same as in the SBN experiments, for comparison. Data augmentation included image flipping, 30% scale variation, and 40° rotation. The model was trained on 4 GPUs for 140 epochs. The learning rate was set to 0.001 and reduced to 0.0001 and 0.00001 at epochs 90 and 120, respectively. The batch size was 128, and the optimizer was Adam [20]. All experiments with ResNet-50 and ResNet-101 were implemented in PyTorch. Unless otherwise stated, ResNet-50 was used as the default backbone.
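Two of the training details above can be sketched in a few lines: fixing the bounding-box aspect ratio and the stepwise learning-rate schedule. The sketch assumes that 4:3 means height:width (consistent with the 256×192 crop), that the shorter side is extended about the box center, and that the rate drops at the start of epochs 90 and 120 counted from zero; the function names are hypothetical.

```python
def fix_aspect_ratio(x, y, w, h, ratio_h_over_w=4.0 / 3.0):
    """Extend one side of box (x, y, w, h) about its center so h / w == ratio."""
    cx, cy = x + w / 2.0, y + h / 2.0
    if h < w * ratio_h_over_w:
        h = w * ratio_h_over_w   # box too wide: extend the height
    else:
        w = h / ratio_h_over_w   # box too tall: extend the width
    return cx - w / 2.0, cy - h / 2.0, w, h


def learning_rate(epoch, base_lr=1e-3, milestones=(90, 120), gamma=0.1):
    """Step schedule: multiply the rate by gamma at each milestone epoch."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


square_box = fix_aspect_ratio(0.0, 0.0, 30.0, 30.0)
tall_box = fix_aspect_ratio(0.0, 0.0, 30.0, 80.0)
schedule = [learning_rate(e) for e in (0, 89, 90, 119, 120, 139)]
print(square_box, tall_box, schedule)
```

Because a side is only ever extended, never shrunk, the adjusted box always contains the original box, so no part of the labeled person is cropped away by the adjustment itself.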
Following references [10,11], the experiment adopted a two-stage approach and used a pre-trained Mask R-CNN for the first stage, person bounding-box detection [21]. The detector achieved 56.4 mAP on COCO val2017. As in conventional methods [22], the predicted heatmaps of the original and flipped images were averaged and then used to predict the keypoint locations. The final keypoint location was obtained by shifting the position of the highest response by a quarter of a pixel in the direction of the second highest response.
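The test-time post-processing can be sketched as follows. The flip step mirrors the heatmaps of the flipped image back and swaps left/right keypoint channels before averaging. The quarter-pixel step here uses the sign of the difference between the two neighbors of the maximum along each axis, which is a common approximation of "toward the second highest response" and is an assumption of this sketch, as are the function names and example values.

```python
def flip_back(heatmaps, flip_pairs):
    """Mirror each H x W heatmap horizontally and swap left/right channels."""
    mirrored = [[row[::-1] for row in hm] for hm in heatmaps]
    for a, b in flip_pairs:
        mirrored[a], mirrored[b] = mirrored[b], mirrored[a]
    return mirrored


def average(maps_a, maps_b):
    """Element-wise mean of two [J][H][W] heatmap stacks."""
    return [[[(p + q) / 2.0 for p, q in zip(row_a, row_b)]
             for row_a, row_b in zip(hm_a, hm_b)]
            for hm_a, hm_b in zip(maps_a, maps_b)]


def sign(value):
    return (value > 0) - (value < 0)


def refine_keypoint(heatmap):
    """Argmax position shifted by 0.25 toward the larger neighbor on each axis."""
    h, w = len(heatmap), len(heatmap[0])
    px, py = max(((x, y) for y in range(h) for x in range(w)),
                 key=lambda p: heatmap[p[1]][p[0]])
    fx, fy = float(px), float(py)
    if 0 < px < w - 1:
        fx += 0.25 * sign(heatmap[py][px + 1] - heatmap[py][px - 1])
    if 0 < py < h - 1:
        fy += 0.25 * sign(heatmap[py + 1][px] - heatmap[py - 1][px])
    return fx, fy


# Hypothetical 3 x 3 heatmap: the peak is at (1, 1), with larger
# responses to the right and below it.
example = [[0.0, 0.1, 0.0],
           [0.2, 1.0, 0.6],
           [0.0, 0.3, 0.0]]
refined = refine_keypoint(example)

# Hypothetical two-channel stack where channels 0 and 1 are a left/right pair.
restored = flip_back([[[1, 2, 3]], [[4, 5, 6]]], flip_pairs=[(0, 1)])
merged = average([[[1.0, 3.0]]], [[[3.0, 5.0]]])
print(refined, restored, merged)
```

The quarter-pixel shift partly compensates for the quantization introduced by reading keypoints from a heatmap that is smaller than the input image.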
3.3 Experimental Results and Analysis
The experimental results for different hyperparameters are shown in Table 1. When β=0, AMSE degenerates into MSE, and this result can be used as a benchmark for comparison. When β=0.01, the experimental result is 0.6 points higher than the benchmark result, reaching 73.0 AP. The experiment also shows that AMSE is not sensitive to the selection of the hyperparameter β. Good results can be obtained when the value is between 0.01 and 0.1. Unless otherwise stated, β=0.01 is assumed to be the default value in the experiment.
Table 2 compares AMSE and MSE with different backbone networks. The gt-boxes column indicates whether ground-truth bounding boxes are used at test time. AMSE consistently outperforms MSE across all backbone networks, regardless of whether ground-truth boxes are used. Furthermore, when ground-truth boxes are used during testing, AMSE also improves the results with ResNet-101 as the backbone. These results show that AMSE is effective at improving model performance. Compared to MSE, with ResNet-50 as the backbone, AMSE improves the results by 0.6 and 0.2 points with and without ground-truth boxes, respectively. This indicates that AMSE yields larger improvements when accurate bounding boxes are available at test time.
Table 3 compares this method with Hourglass, CPN, and SBN. The person detector used by SBN has an AP of 56.4, the same as in this method; the person detector used by CPN and Hourglass has an AP of 55.3. OHKM indicates whether online hard keypoint mining is used [23]. Our reproduction of SBN matches the results of the public code, so our results can be compared directly with those listed in the SBN paper. As shown in Table 3, although SBN already outperforms Hourglass and CPN, AMSE still improves the final result by 0.2 and 0.4 points, respectively. The cost of using AMSE as the loss function is only a small increase in computation during training. Since AMSE improves even SBN, the best lightweight method, it should also be applicable to other human pose estimation methods with weaker results. Example predictions are shown in Figure 5.
Figure 5. Example of a predicted heatmap
Table 1. Experimental results with different hyperparameters
Table 2. Experimental results with different backbone networks
Table 3. Experimental results with different models
4. Conclusion
This paper introduces and analyzes the inconsistency problem that arises when the Mean Squared Error (MSE) between the predicted heatmap and the labeled heatmap is used in human pose estimation tasks. To address this problem, this paper proposes a novel and efficient Asymmetric Mean Squared Error (AMSE) loss function, which extends MSE with a penalty term on the predicted heatmap. Experimental results on the COCO val2017 dataset show that, when ground-truth bounding boxes are used for testing, AMSE improves the final performance by approximately 0.5 points. Although this method is proposed for human pose estimation, it should also be applicable to any task that uses MSE as the loss function and is sensitive to the relative order of the predicted values.
References: