Candidate region box extraction method for pedestrian detection

2026-04-06

Introduction

The goal of pedestrian detection is to detect pedestrians in images and determine their locations. With the development of artificial intelligence technology, the task has attracted extensive research attention. Accurate pedestrian detection can be applied in many fields, such as intelligent assisted driving, intelligent video surveillance, and intelligent robots.

In recent years, Region-based Convolutional Neural Network (R-CNN) models have been widely applied to general object detection. One such model, Fast R-CNN, has achieved significant results detecting 21 classes (20 object classes plus background). It first uses the Selective Search method to predict the possible locations of objects, and then uses a convolutional neural network to refine the classification and localization of these candidate regions. Inspired by its success in general object detection, we attempted to apply this approach to pedestrian detection. However, Selective Search is not a class-specific candidate region extraction method; it predicts the possible locations of all object types, including vehicles and buildings. The generated candidate regions therefore contain a great deal of redundancy, which lowers the quality of the trained classifier. Redundant candidate regions also consume significant computational resources, slowing down both training and testing of the convolutional neural network. In pedestrian detection, generating candidate regions only for the pedestrian category, and using these regions to train and test the convolutional neural network, should in principle yield better detection results.

Candidate region extraction can be viewed as a coarse detection step. We can extract features from an image, train a simple classifier to identify pedestrians, and then use that classifier to generate candidate region boxes, so that boxes are extracted only for the pedestrian category. Based on this idea, this paper proposes a candidate region extraction method suited to pedestrian detection. We combine this extraction method with a convolutional neural network model and apply it to pedestrian detection. The detection method consists of two main steps: 1) generating candidate region boxes for each image using the candidate region extraction method; 2) feeding the image and its candidate region boxes into a convolutional neural network. The network contains two output layers: one outputs a probability estimate of the pedestrian category, and the other outputs four real numbers representing the position of the pedestrian's bounding box.

Our model achieves superior detection performance compared with other pedestrian detection methods, reaching miss rates of 14.1%, 15.3%, and 45.6% on the INRIA, PKU, and ETH datasets, respectively. Experimental results demonstrate that our candidate bounding box extraction method is more efficient than Selective Search for pedestrian detection. Furthermore, by removing redundant candidate bounding boxes, our method improves the training and testing speed of the convolutional neural network.

Background

1. Classification of existing pedestrian detection algorithms

Existing pedestrian detection algorithms are generally divided into two categories. The first comprises traditional algorithms, which extract hand-designed features from images and train a support vector machine (SVM) or boosted classifier. These hand-designed features include Haar-like features, histograms of oriented gradients, and local binary patterns, all of which have shown good performance in pedestrian detection. The deformable part model (DPM) accounts for local region features and the deformations between regions. Some work incorporates contextual information into the model. In addition, aggregated channel features fuse gradient histograms with LUV color-space features for pedestrian detection, and an effective feature transformation has been proposed to remove the correlation between local features.

The second category comprises deep-model-based methods. Deep models learn features directly from the raw image, greatly improving the performance of pedestrian detection algorithms. Some methods learn features for different body parts of the pedestrian to handle occlusion; others pre-train convolutional neural networks in an unsupervised fashion using convolutional sparse coding, improving pedestrian detection through the learned semantic features.

2. Candidate Box Extraction Method

Since objects can be of arbitrary size and can appear anywhere in an image, the entire image must be searched for classification and localization. The sliding-window method covers all possible object locations, but its computational cost is very high. Researchers have therefore proposed several other candidate box extraction methods, such as Selective Search, BING, and EdgeBoxes. Selective Search extracts candidate regions through segmentation and similarity computation, yielding high-quality regions but at slow speed. BING generates candidate regions from normed gradient features with a binarized approximation, which is fast but of lower quality. EdgeBoxes strikes a balance between quality and speed.

These methods generate candidate region boxes covering all categories, making them suitable for general object detection, but they cannot extract candidate boxes for a single class. Redundant candidate region boxes degrade the performance of convolutional neural networks and consume additional computational resources. Pedestrian detection only requires candidate region boxes for the pedestrian category, without redundant information about other objects. This paper implements a candidate region box extraction method tailored to pedestrian detection. We combine this optimized extraction method with a convolutional neural network and apply it to pedestrian detection.

The Proposed Method

1. Method Overview

The proposed method consists of two parts: a candidate region box extraction stage and a convolutional neural network model. The candidate box extraction method is based on Aggregated Channel Features (ACF), and the convolutional neural network model follows a deep network structure from previous literature. The network input consists of the original image and its candidate region boxes. The model first extracts convolutional features from the image through convolution and pooling. These features are then mapped to fixed-length feature vectors by a Region of Interest (RoI) pooling layer and fed into fully connected layers. The fully connected layers are followed by two parallel output layers that output the confidence scores and the coordinates of the pedestrian detection boxes.
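The RoI pooling step can be illustrated with a short numpy sketch. The 6×6 bin grid and the helper name are our own choices for illustration, not details taken from the paper:

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_h=6, out_w=6):
    """Max-pool one RoI of an (H, W, C) feature map to (out_h, out_w, C).

    roi = (x0, y0, x1, y1) in feature-map coordinates, end-exclusive.
    """
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1, :]
    h, w, c = region.shape
    # Bin boundaries covering the region as evenly as possible.
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    out = np.zeros((out_h, out_w, c), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            ys0, ys1 = ys[i], max(ys[i + 1], ys[i] + 1)  # never an empty bin
            xs0, xs1 = xs[j], max(xs[j + 1], xs[j] + 1)
            out[i, j] = region[ys0:ys1, xs0:xs1].max(axis=(0, 1))
    return out

# A 16x16 feature map with 3 channels; one candidate box.
fmap = np.random.rand(16, 16, 3)
pooled = roi_max_pool(fmap, (2, 2, 14, 14))
print(pooled.shape)  # (6, 6, 3)
```

Whatever the size of the candidate box, the output is a fixed-length vector (here 6·6·C values), which is what allows arbitrary-size regions to feed fixed-size fully connected layers.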

2. Candidate Region Box Extraction

The candidate bounding box algorithm extracts hand-crafted features from 10 channels of the image and trains an AdaBoost classifier. The channel features comprise the normalized gradient magnitude (1 channel), a gradient orientation histogram (6 channels), and the LUV color channels (3 channels). The algorithm builds a feature pyramid by computing channel features at different scales. Features at most scales are not computed directly; instead, they are approximated from features computed at nearby scales, as described below.
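As an illustration of the gradient-based channels, the following numpy sketch computes a magnitude channel and a 6-bin orientation histogram. ACF's local normalization of the magnitude and the three LUV color channels are omitted for brevity, and the function name is our own:

```python
import numpy as np

def gradient_channels(gray, n_bins=6):
    """Compute 1 gradient-magnitude channel and n_bins orientation channels.

    gray: 2-D float array (grayscale image). Returns (H, W, 1 + n_bins).
    The 3 LUV color channels of the full 10-channel ACF set are omitted.
    """
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    # Unsigned gradient orientation in [0, pi), as in HOG-style channels.
    ori = np.arctan2(gy, gx) % np.pi
    channels = np.zeros(gray.shape + (1 + n_bins,))
    channels[..., 0] = mag
    bin_idx = np.minimum((ori / np.pi * n_bins).astype(int), n_bins - 1)
    for b in range(n_bins):
        # Each pixel's magnitude is assigned to exactly one orientation bin.
        channels[..., 1 + b] = mag * (bin_idx == b)
    return channels

img = np.random.rand(32, 32)
ch = gradient_channels(img)
print(ch.shape)  # (32, 32, 7)
```

Summing the six orientation channels recovers the magnitude channel, since each pixel contributes to exactly one bin.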

For an image I, consider any low-level, shift-invariant feature computation. A channel of the image is obtained as follows: the channel C represents pixel-level features, where each pixel of C is computed from a corresponding patch of image I. Let I_s = R(I, s) denote the resampling of image I to scale s, where R is the sampling function. When computing multi-scale features, the channel at scale s would in principle be obtained by resampling I to scale s and recomputing the features; instead, it is obtained by the approximation

C_s ≈ R(C, s) · s^(−λ),

that is, the channel computed once at the original scale is itself resampled and corrected by a power-law scaling factor.

Here λ is the scale transformation factor between different sizes, one per channel; each feature type corresponds to one λ. Conventional feature pyramid methods recompute the features exactly at every scale, which slows down bounding box extraction; the approximation avoids this cost. In the candidate region box extraction process, this paper first extracts the 10-channel features from the image, then uses the approximation to obtain features at other image scales and build the feature pyramid. Finally, an AdaBoost classifier consisting of 2048 depth-2 decision trees is trained to generate candidate region boxes. To obtain a sufficient number of candidate boxes, we lower the detection threshold.
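The power-law approximation can be sketched in a few lines of numpy. The exponent value used here is illustrative (in practice it is estimated once per channel type from training images), and nearest-neighbor resampling stands in for the actual sampling function R:

```python
import numpy as np

def resample(channel, s):
    """Nearest-neighbor resampling of a 2-D channel by scale factor s <= 1."""
    h, w = channel.shape
    rows = (np.arange(int(h * s)) / s).astype(int)
    cols = (np.arange(int(w * s)) / s).astype(int)
    return channel[np.ix_(rows, cols)]

def approx_channel(channel, s, lam):
    """Approximate the channel at scale s: C_s ~= R(C, s) * s**(-lam).

    lam is the per-channel-type power-law exponent (illustrative value
    below; it is normally fit from data rather than chosen by hand).
    """
    return resample(channel, s) * s ** (-lam)

C = np.random.rand(64, 64)             # channel computed at the original scale
C_half = approx_channel(C, 0.5, lam=0.1)
print(C_half.shape)  # (32, 32)
```

Only one exact feature computation per image is needed; every other pyramid level is a cheap resample-and-rescale of it.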

Figure 1. Convolutional Neural Network Structure

3. Network Structure

In this section, we first introduce the structure of the deep network model used, and then explain the model's loss function.

The network structure used in this paper is shown in Figure 1. The network contains five convolutional layers with 96, 256, 384, 384, and 256 convolution kernels, respectively, and uses the Rectified Linear Unit (ReLU) as the activation function. Each convolutional layer is followed by a spatial max-pooling layer. The network accepts images of arbitrary size as input. After convolution and pooling, the convolutional features of the image are obtained. Before these features are passed to the fully connected layers, the region-of-interest (RoI) pooling layer maps them into fixed-length feature vectors. The weights of the fully connected layers used for classification and bounding box regression are initialized from Gaussian distributions with standard deviations of 0.01 and 0.001, respectively, and the biases are initialized to 0. The learning rate is 0.001 for each layer's weights and 0.002 for the biases.

Two parallel output layers follow the fully connected layers. The first outputs the probability distribution over the pedestrian and background classes, p = (p_0, p_1), where p_0 and p_1 are the probabilities that the object is background and pedestrian, respectively; p is obtained by applying a softmax to the two outputs of the fully connected layer. The second output layer outputs the bounding box regression offsets for the pedestrian class, denoted t. Each training candidate region box carries a class label u and a bounding box regression target v. We use a multi-task loss function L to train classification and bounding box regression jointly:

L(p, u, t, v) = L_cls(p, u) + λ [u ≥ 1] L_loc(t, v)

Here L_cls(p, u) = −log p_u is the log loss for the true class u. The second term is defined on the bounding box of class u: the Iverson bracket [u ≥ 1] evaluates to 1 when u ≥ 1 and to 0 otherwise. By convention, the background class is labeled u = 0; since background candidate boxes have no box labels, they are ignored in the regression loss. For pedestrian bounding box regression, the following loss function is used:

L_loc(t, v) = Σ_i smooth_L1(t_i − v_i), where smooth_L1(x) = 0.5·x² if |x| < 1 and |x| − 0.5 otherwise,

with i ranging over the four box coordinates. The parameter λ controls the balance between the losses of the two tasks. The regression targets v are normalized to zero mean and unit variance. This setting is used in all experiments. The paper employs stochastic gradient descent to minimize the loss function.
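The multi-task loss can be written out as a minimal numpy sketch, assuming the standard Fast R-CNN form (softmax log loss plus a smooth-L1 regression term gated by the Iverson bracket); the function names are our own:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5*x**2 where |x| < 1, else |x| - 0.5 (elementwise)."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

def multitask_loss(scores, u, t, v, lam=1.0):
    """L = L_cls(p, u) + lam * [u >= 1] * L_loc(t, v).

    scores: raw class scores (background, pedestrian); u: true label (0 or 1);
    t: predicted box offsets; v: regression targets.
    """
    p = np.exp(scores - scores.max())
    p /= p.sum()                       # softmax probabilities
    l_cls = -np.log(p[u])              # log loss for the true class
    l_loc = smooth_l1(np.asarray(t) - np.asarray(v)).sum()
    return l_cls + lam * (u >= 1) * l_loc   # Iverson bracket gates the box loss

# A pedestrian example (u=1): both terms contribute.
loss = multitask_loss(np.array([0.2, 1.5]), u=1,
                      t=[0.1, 0.2, 0.0, 0.3], v=[0.0, 0.0, 0.0, 0.0])
```

For a background box (u = 0) the regression term vanishes, so the missing box labels for background candidates never enter the loss.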

Conclusion

This paper proposes a model combining a single-class candidate box extraction method with a convolutional neural network. The proposed candidate box extraction algorithm extracts hand-designed features from images and trains an AdaBoost classifier. Unlike general candidate box extraction methods, the proposed method can generate candidate region boxes only for the pedestrian category. The paper also elaborates on the specific details of the candidate box extraction algorithm and the network structure. Experimental results show that the proposed method improves the quality of candidate box extraction, achieves excellent results in pedestrian detection, and shortens the network training and testing time.
