1 Introduction
Object detection is a fundamental task in computer vision. Depending on whether candidate regions are explicitly extracted, detection methods are generally divided into one-stage and two-stage networks. One-stage methods directly regress object class probabilities and location coordinates; common examples include YOLOv1, YOLOv2, YOLOv3, SSD, DSSD, and RetinaNet. Two-stage methods first extract candidate regions and then feed these regions into a classifier for classification and detection; common examples include R-CNN, SPP-Net, Fast R-CNN, Faster R-CNN, Mask R-CNN, and Cascade R-CNN. Compared with one-stage networks, two-stage networks achieve higher accuracy but run more slowly.
Furthermore, based on whether anchor boxes are used to extract candidate bounding boxes, object detection frameworks can also be divided into anchor-based methods, anchor-free methods, and methods that combine both. Anchor-based algorithms include Faster R-CNN, SSD, YOLOv2, and YOLOv3; anchor-free algorithms include CornerNet, ExtremeNet, CenterNet, and FCOS; and methods combining anchor-based and anchor-free branches include FSAF, GA-RPN, and SFace.
Currently, mainstream detectors such as Faster R-CNN, SSD, YOLOv2, and YOLOv3 rely on a set of predefined anchor boxes, and the use of anchor boxes is widely considered key to their success. Despite this success, anchor-based methods still have some drawbacks: (1) because the scales and aspect ratios of the anchor boxes are preset, even a carefully designed detector struggles with candidate objects whose shapes vary widely, especially small objects, which hinders the detector's generalization ability; (2) to achieve a high recall rate, anchor boxes must be densely tiled over the input image (e.g., for an image with a short side of 800, more than 180k anchor boxes are placed in a Feature Pyramid Network (FPN)); most of these anchors are labeled as negative samples during training, which exacerbates the imbalance between positive and negative samples; (3) anchor boxes involve complex computations, such as calculating the intersection over union (IoU) with the ground truth.
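For reference, the IoU computation mentioned in point (3) can be sketched in a few lines of Python; boxes here are assumed to be given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Anchor-based training repeats this calculation between every anchor and every ground-truth box, which is one source of the computational cost noted above.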
To overcome the shortcomings of anchor-based methods, CornerNet adopts keypoint detection to extract candidate regions: a single convolutional neural network detects each target bounding box as a pair of keypoints, its top-left and bottom-right corners. By detecting targets as keypoint pairs, it eliminates the manually designed anchor boxes common in traditional detectors. However, CornerNet has problems of its own: (1) its ability to perceive information inside objects is relatively weak, which limits its performance; (2) when pairing keypoints, CornerNet assumes that the embeddings of corners belonging to the same object should be as close as possible and those of different objects as far apart as possible, but experiments show that deciding whether to combine two corners by the distance between their embedding vectors often produces pairing errors; (3) pairing keypoints to form candidate regions generates a large number of false positive candidates, which both reduces detection accuracy and costs considerable time. This paper proposes a new anchor-free two-stage detection algorithm to address these three problems.
2 Keypoint-Based Object Detection Method
This paper takes CornerNet as the baseline and proposes a two-stage, anchor-free object detection method based on detecting three keypoints. As shown in Figure 1, the first stage uses anchor-free keypoint detection to locate corner keypoints and center keypoints, and determines whether the center point falls within the central region to remove false candidates, i.e., to extract candidate regions. The second stage sends the candidate regions surviving the first-stage filtering to a multi-class classifier for classification and detection.
Figure 1. Network framework of the two-stage object detection method based on keypoint detection
2.1 Anchor-Free Detection of Three Keypoints
To detect corner points, this paper first uses a CornerNet-based keypoint detection method to locate the top-left and bottom-right corner points; then, corner pooling is used to generate two heatmaps for the top-left and bottom-right corners to represent the positions of different categories of keypoints; finally, the offset of the corner keypoints is corrected.
Furthermore, to enhance the network's ability to perceive the internal information of objects, this paper adds a center keypoint detection branch and employs center pooling to strengthen the features of the center point. At the same time, the concept of object centrality is defined: a centrality greater than 0.7 indicates that the center keypoint falls within the central region, which effectively resolves how to determine the central region for objects of different sizes. Only predicted bounding boxes whose center points fall within their central regions are retained; the rest are removed. Note that when a center keypoint falls within several different predicted bounding boxes, only the box with the highest centrality is kept and the redundant boxes are discarded, reducing the probability of false positives. This is illustrated in Figure 2.
Figure 2. Filtering false positive candidate regions using center key points
2.1.1 Corner Keypoint Detection
For corner keypoints, this paper follows CornerNet in locating the two corner keypoints of each object, at its top-left and bottom-right corners respectively. Three heatmaps are computed (for the top-left corners, the bottom-right corners, and the center points; each value on a heatmap gives the probability that a keypoint appears at the corresponding position), with resolution reduced to 1/4 of the input image. Training involves two detection losses, one for locating the top-left corner keypoints and one for the bottom-right corner keypoints, plus an offset loss, as shown in Equations (1) to (3). After the heatmaps are computed, a fixed number of keypoints is extracted from them (the top k from the top-left heatmap and the top k from the bottom-right heatmap), and each corner keypoint is assigned a class label.
$$
L_{det} = -\frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{H}\sum_{j=1}^{W}
\begin{cases}
(1 - p_{cij})^{\alpha}\log(p_{cij}), & \text{if } y_{cij} = 1\\
(1 - y_{cij})^{\beta}(p_{cij})^{\alpha}\log(1 - p_{cij}), & \text{otherwise}
\end{cases}
\tag{1}
$$
where $C$ is the number of object categories; $H$ and $W$ are the height and width of the heatmap, respectively; $p_{cij}$ is the predicted score for class $c$ at position $(i, j)$ on the heatmap; $y_{cij}$ is the ground-truth heatmap augmented with an unnormalized Gaussian; $N$ is the number of objects in the image; and $\alpha$ and $\beta$ are hyperparameters controlling the contribution of each point.
$$
o_k = \left(\frac{x_k}{n} - \left\lfloor \frac{x_k}{n} \right\rfloor,\ \frac{y_k}{n} - \left\lfloor \frac{y_k}{n} \right\rfloor\right)
\tag{2}
$$
$$
L_{off} = \frac{1}{N}\sum_{k=1}^{N} \text{SmoothL1}\left(o_k, \hat{o}_k\right)
\tag{3}
$$
where $o_k$ is the offset, which captures the precision information lost during the rounding calculation; $x_k$ and $y_k$ are the $x$ and $y$ coordinates of corner $k$; $(x_k, y_k)$ is mapped onto the heatmap as $(\lfloor x_k/n \rfloor, \lfloor y_k/n \rfloor)$, where $n$ is the downsampling factor (4 in this paper) and $\lfloor \cdot \rfloor$ denotes rounding down. Specifically, one set of offsets shared by the top-left corners of all classes and another shared by the bottom-right corners are predicted, trained with a Smooth L1 loss.
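The keypoint extraction step described above, taking a fixed number of top-scoring positions from a heatmap, can be sketched in plain Python. The nested-list heatmap layout used here is illustrative only, not the paper's actual tensor format:

```python
def topk_keypoints(heatmap, k):
    """Return the k highest-scoring keypoints from a per-class heatmap.

    heatmap: nested list indexed as [class][row][col], each entry a score.
    Returns (score, class, row, col) tuples, best first.
    """
    flat = []
    for c, class_map in enumerate(heatmap):
        for y, row in enumerate(class_map):
            for x, score in enumerate(row):
                flat.append((score, c, y, x))
    # Sorting the tuples in reverse order ranks primarily by score.
    flat.sort(reverse=True)
    return flat[:k]
```

In the paper this is done once for the top-left heatmap and once for the bottom-right heatmap, yielding k candidates from each.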
When pairing keypoints, CornerNet assumes that the embeddings of keypoints belonging to the same object should be as close as possible and those of different objects as far apart as possible; however, experiments show that this pairing often errs. To make full use of the internal information of objects, this paper abandons this mechanism and defers the keypoint pairing problem to the multi-class classifier in the second stage.
2.1.2 Centrality: Defining the Central Region
To effectively eliminate the large number of false positive candidate regions, this paper checks whether the center keypoint falls within the central region of the target bounding box. Because bounding boxes vary in size, the central region cannot be set to a fixed value. This paper therefore proposes a scale-adaptive definition of the central region, introducing a new quantitative indicator, centrality, as shown in Equation (4).
$$
\text{centrality} = \sqrt{\frac{\min(l, r)}{\max(l, r)} \times \frac{\min(t, b)}{\max(t, b)}}
\tag{4}
$$
where $l$, $r$, $t$, and $b$ are the distances from the center point to the left, right, top, and bottom borders of the predicted box, respectively, as shown in Figure 3.
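As an illustration, the centrality test can be sketched as follows, assuming the FCOS-style center-ness formula and the 0.7 threshold given above:

```python
import math

def centrality(l, r, t, b):
    """Centrality of a point inside a box from its distances to the
    left (l), right (r), top (t), and bottom (b) borders.
    Assumes the FCOS-style center-ness definition."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

def in_central_region(l, r, t, b, threshold=0.7):
    """Keep a candidate only if its center keypoint is central enough."""
    return centrality(l, r, t, b) > threshold
```

A point at the exact center of a box gets centrality 1.0, while a point near a border approaches 0, so the 0.7 threshold scales naturally with the box size.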
2.1.3 Center Pooling
Figure 3. Centrality calculation
Corner pooling follows CornerNet's two modules, top-left corner pooling and bottom-right corner pooling, which are used to predict the top-left and bottom-right keypoints, respectively. Each module takes two input feature maps, with width and height denoted W and H. To apply top-left corner pooling at point (i, j), the maximum value is taken from (i, j) to (i, H) and, simultaneously, from (i, j) to (W, j); the two maxima are then added to give the value at (i, j). Bottom-right corner pooling is analogous, except that the maxima are taken from (0, j) to (i, j) and from (i, 0) to (i, j).
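A plain-Python sketch of top-left corner pooling on a single-channel map may clarify the operation; in the real network this runs on GPU tensors, and the row/column indexing convention here is an assumption for illustration:

```python
def top_left_corner_pool(fmap):
    """Top-left corner pooling on a 2D feature map (list of rows).

    Each cell becomes (max over cells to its right in the same row)
    + (max over cells below it in the same column), so corners collect
    evidence from the object extending to the bottom-right of them.
    """
    h, w = len(fmap), len(fmap[0])
    right_max = [[0.0] * w for _ in range(h)]
    down_max = [[0.0] * w for _ in range(h)]
    # Scan each row right-to-left, carrying the running maximum.
    for i in range(h):
        running = float("-inf")
        for j in range(w - 1, -1, -1):
            running = max(running, fmap[i][j])
            right_max[i][j] = running
    # Scan each column bottom-to-top, carrying the running maximum.
    for j in range(w):
        running = float("-inf")
        for i in range(h - 1, -1, -1):
            running = max(running, fmap[i][j])
            down_max[i][j] = running
    return [[right_max[i][j] + down_max[i][j] for j in range(w)]
            for i in range(h)]
```

Bottom-right corner pooling is the mirror image: left-to-right and top-to-bottom running maxima.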
The geometric center of an object does not necessarily have obvious visual features. For example, the human head contains strong visual features, but the central key point is often located in the middle of the body. To address this issue, this paper employs center pooling to capture richer and more recognizable visual features. Figure 4 illustrates the principle of center pooling: the feature extraction network outputs a feature map (width and height are represented by W and H, respectively). Center pooling can be achieved through a combination of corner pooling in different directions. Specifically, the operation of maximizing the value in the horizontal direction can be achieved by concatenating left pooling and right pooling. Similarly, the operation of maximizing the value in the vertical direction can be achieved by concatenating top pooling and bottom pooling.
Figure 4 Schematic diagram of central pooling structure
Table 1. Accuracy comparison of the proposed method and state-of-the-art detection frameworks on COCO test-2017.
Note: AP50 and AP75 represent the accuracy at single IoU thresholds of 0.50 and 0.75, respectively; APs, APm, and APl represent the detection accuracy for small, medium, and large targets, respectively. (The same applies to Tables 2 and 4 below.)
To determine whether a pixel in the feature map is a center keypoint, center pooling finds the maximum value in its horizontal and vertical directions and sums the two; this aids the detection of center keypoints. Specifically, each of the two branches of the feature map passes through a 3×3 convolutional layer, a Batch Normalization (BN) layer, and a ReLU activation, performing directional pooling horizontally or vertically, after which the results are summed. For example, right pooling at point (i, j) takes the maximum value from (i, j) to (W, j); left pooling is computed similarly, and the two values are added to give the horizontal response at (i, j). The vertical response is obtained in the same way, and the horizontal and vertical responses are summed to give the final value at (i, j).
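Center pooling can similarly be sketched as row-wise plus column-wise maxima on a single-channel map; this sketch omits the 3×3 convolution, BN, and ReLU branches described above:

```python
def center_pool(fmap):
    """Center pooling on a 2D feature map (list of rows).

    Each cell becomes (max over its whole row) + (max over its whole
    column), i.e. left+right pooling chained horizontally and
    top+bottom pooling chained vertically.
    """
    h, w = len(fmap), len(fmap[0])
    row_max = [max(row) for row in fmap]
    col_max = [max(fmap[i][j] for i in range(h)) for j in range(w)]
    return [[row_max[i] + col_max[j] for j in range(w)] for i in range(h)]
```

The effect is that a center keypoint can respond to strong features anywhere along its row and column (e.g., a person's head), even when the geometric center itself is visually plain.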
2.2 Classification
While extracting candidate regions via keypoint detection removes the need to manually set hyperparameters such as anchor box sizes and aspect ratios, greatly improving detection flexibility, it introduces two issues: a large number of false positive candidate regions, and the high computational cost of filtering them out. The solution proposed in this paper therefore comprises two steps:
(1) First determine whether a corner point and the center point belong to the same category, then filter out a large number of erroneous candidate regions by checking whether the centrality of the center point is greater than 0.7.
(2) The candidate regions remaining after the first screening step are sent to the subsequent multi-class classifier to score the targets that still belong to multiple categories. Specifically, RoIAlign extracts a 256×7×7 feature from each candidate region, and a convolutional layer maps it to a vector representing the category, establishing a classifier for each surviving candidate region. The loss function L_class is a Focal Loss:
$$
L_{class} = -\frac{1}{N}\sum_{n=1}^{M}\sum_{c=1}^{C}\left[\mathbb{1}\left(\mathrm{IoU}_{nc} \ge \tau\right)\left(1 - s_{nc}\right)^{\partial}\log\left(s_{nc}\right) + \mathbb{1}\left(\mathrm{IoU}_{nc} < \tau\right)\left(s_{nc}\right)^{\partial}\log\left(1 - s_{nc}\right)\right]
\tag{5}
$$
where $M$ and $N$ are the number of retained candidate regions and the number of positive samples among them, respectively; $C$ is the number of categories in the dataset; $\mathrm{IoU}_{nc}$ is the maximum IoU between the $n$-th candidate region and all ground-truth boxes of the $c$-th category; $\tau$ is the IoU threshold (set to 0.7); $s_{nc}$ is the classification score of the $c$-th category for the $n$-th candidate region; and $\partial$ is the focusing hyperparameter of the loss (set to 2).
3 Experiments
3.1 Dataset and Evaluation Metrics
MS-COCO is one of the most popular benchmark datasets for object detection, containing 120,000 images and over 1.5 million bounding boxes covering 80 object categories, making it a very challenging dataset. This paper uses `trainval35k` to train the two-stage keypoint-based detection network and evaluates it on the MS-COCO dataset. `trainval35k` is the union of the 80,000 training images and a 35,000-image subset of the validation images.
This paper uses Average Precision (AP) as defined in MS-COCO to characterize the performance of the network model and its competitors. The AP is recorded at every 0.05 increment of the IoU threshold from 0.5 to 0.95 and then averaged (i.e., 0.5:0.05:0.95). Other important metrics are also recorded: AP50 and AP75 are the precision at single IoU thresholds of 0.50 and 0.75, respectively; APs, APm, and APl are the precision at different target scales (small objects with an area less than 32×32, medium objects with an area between 32×32 and 96×96, and large objects with an area greater than 96×96). All metrics allow at most 100 candidate regions per test image.
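The COCO-style averaging amounts to a simple mean over ten per-threshold AP values; a minimal sketch, assuming those values have already been computed:

```python
# The ten IoU thresholds 0.50:0.05:0.95 used by the COCO AP metric.
IOU_THRESHOLDS = [0.50 + 0.05 * i for i in range(10)]

def coco_style_ap(ap_per_threshold):
    """Mean AP over the ten IoU thresholds (the headline COCO AP)."""
    assert len(ap_per_threshold) == len(IOU_THRESHOLDS)
    return sum(ap_per_threshold) / len(ap_per_threshold)
```

AP50 and AP75 are then simply the first and sixth entries of the per-threshold list.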
3.2 Training and Testing of the Network
This paper uses CornerNet as the baseline and partially references the code of CornerNet and FCOS. The feature extraction network is the 52- or 104-layer Hourglass network used in CornerNet, and the algorithm is implemented in PyTorch.
The network is trained from scratch, with an input image resolution of 511×511 and an output heatmap resolution of 128×128. Adam is used to optimize the training loss, and the overall network loss function L is:
$$
L = L_{det}^{co} + L_{det}^{ce} + L_{off}^{co} + L_{off}^{ce} + L_{class}
\tag{6}
$$
Focal Loss was used for training the network to detect corner points and center keypoints, respectively; Smooth L1 Loss was used for training the network to predict the offsets of corner points and center keypoints, respectively. The models were trained on eight NVIDIA 2080-Ti GPUs with a batch size of 48 (6 samples per GPU). The learning rate was set to 2.5 × 10⁻⁴ for the first 250k iterations, and then decreased to 2.5 × 10⁻⁵ for the next 50k iterations. The training times for Hourglass-104 and Hourglass-52 were 9 days and 5 days, respectively.
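The step learning-rate schedule described above amounts to:

```python
def learning_rate(iteration):
    """Step schedule from the training setup: 2.5e-4 for the first
    250k iterations, then 2.5e-5 for the remaining 50k."""
    return 2.5e-4 if iteration < 250_000 else 2.5e-5
```

With a batch size of 48 this corresponds to roughly 300k total iterations over `trainval35k`.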
4 Results and Discussion
This paper tests the accuracy of commonly used anchor-based and anchor-free keypoint detection frameworks on the general detection dataset COCO test-2017; the results are shown in Table 1. The proposed two-stage anchor-free keypoint method achieves a 3.2% accuracy improvement over the anchor-based method YOLOv4; improvements of 5.2% and 1.8% over the anchor-free one-stage methods FCOS and CenterNet, respectively; and a 6.2% improvement over CornerNet. The improvement is more pronounced for objects of unusual sizes and aspect ratios, indicating that the anchor-free method is more advantageous for extracting candidate regions.

In single-scale testing, the original-resolution image and its horizontally flipped copy are fed into the network. In multi-scale testing, the original image is scaled by factors of 0.6, 1, 1.2, 1.5, and 1.8, and a flipped variant is added in both single- and multi-scale evaluation. In the multi-scale evaluation, the predictions at all scales (including the flipped variants) are merged into the final result; soft-NMS then suppresses redundant bounding boxes, and the 100 highest-scoring boxes are retained for the final evaluation. The results are shown in Table 2.
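A minimal Gaussian soft-NMS sketch in plain Python; `sigma` and the score threshold are illustrative defaults, not values from the paper:

```python
import math

def _iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian soft-NMS: instead of deleting boxes that overlap the
    current best, decay their scores by exp(-IoU^2 / sigma).
    Returns (box, decayed_score) pairs in selection order."""
    boxes, scores = list(boxes), list(scores)
    keep = []
    while boxes:
        # Select the highest-scoring remaining box.
        m = max(range(len(scores)), key=scores.__getitem__)
        best_box, best_score = boxes.pop(m), scores.pop(m)
        if best_score < score_thresh:
            break
        keep.append((best_box, best_score))
        # Decay the scores of the remaining boxes by their overlap.
        for i, b in enumerate(boxes):
            o = _iou(best_box, b)
            scores[i] *= math.exp(-(o * o) / sigma)
    return keep
```

Truncating the returned list to its 100 highest-scoring entries reproduces the per-image cap used in the evaluation above.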
Table 2 Multiscale Testing
The recall of three different detection frameworks and the detection method of this study was evaluated on the COCO dataset. The average recall (AR) of targets with different aspect ratios and sizes was recorded. The results are shown in Table 3.
Note: X denotes ResNeXt[29]; AR1+, AR2+, AR3+, and AR4+ represent the recall rates when the bounding-box area is in (96², 200²], (200², 300²], (300², 400²], and (400², ∞), respectively; AR5:1, AR6:1, AR7:1, and AR8:1 represent the recall rates when the aspect ratio of the object is 5:1, 6:1, 7:1, and 8:1, respectively.
Table 3. Comparison of Average Recall (AR) between Anchor-Frame-Based and Anchor-Free Detection Methods
Generally, very large objects, such as those with an area in (400², ∞), are easier to detect. Compared with the anchor-free methods, the anchor-based Faster R-CNN did not achieve the expected higher recall on them. Moreover, when an object's aspect ratio is unusual (e.g., 5:1 or 8:1), the anchor-free detection methods outperform the anchor-based one, because they are free from the constraint of manually set anchor aspect ratios. The method presented in this paper inherits the advantages of FCOS and CornerNet, making object localization more flexible, especially for objects with extreme aspect ratios.
This paper conducts ablation experiments comparing the original CornerNet algorithm with the version that adds a center keypoint detection branch. The feature extraction network is Hourglass-52, and the results are shown in Table 4. The data show that accuracy improved by 3% after introducing the center keypoint branch, with a 5.8% improvement for small targets and a 3.6% improvement for large targets. This indicates that the center keypoint branch removes more false positive candidate regions for small targets: because a small candidate box covers a small area, its central region is small as well, so a false positive is unlikely to contain a detected center keypoint and is therefore filtered out.
Table 4 Ablation experiments with added central key point branches
Figure 5 shows a visual comparison of the detection results between Faster R-CNN based on anchor boxes and methods based on anchor box-less keypoint detection. It can be seen that the method presented in this paper does not require manual setting of anchor box size and aspect ratio, and exhibits better detection performance for small targets and objects with unique shapes.
Figure 5. Visual comparison of detection results from anchor-based Faster R-CNN and the proposed anchor-free keypoint detection method.
5 Conclusion
This paper proposes a two-stage object detection framework without anchor boxes, which extracts corner keypoints and object center keypoints separately and combines them into candidate regions. By determining whether the object's center point falls within the central region, a large number of false positive candidate regions are filtered out. At the same time, it abandons the corner keypoint combination method used in CornerNet and adopts a two-stage approach to feed the retained candidate regions into a multi-class classifier for classification and regression.
Through the two stages described above, the recall and precision of the proposed network are significantly improved, and it outperforms most existing object detection methods. Most importantly, the anchor-free approach extracts candidate regions more flexibly, overcoming the drawback of anchor-based methods that require manually set anchor hyperparameters.
Authors: Wang Hongren1,2, Chen Shifeng1
1. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
2. Reprinted from "Integrated Technology" by the Shenzhen Institute of Advanced Technology, University of Chinese Academy of Sciences