Abstract: This paper proposes a deep learning-based image semantic segmentation method. By using a network model trained with relative depth point pairs, depth image prediction based on color images is achieved, and this prediction, along with the original color image, is input into a fully convolutional neural network containing dilated convolutions. Considering that color and depth images represent different object attributes, a merge connection operation is used on the feature maps instead of the traditional addition operation to fuse them, preserving the difference between the two representations when providing feature map input to subsequent convolutional layers. Experimental results on two datasets demonstrate that this method effectively improves the performance of semantic segmentation.
Keywords: Semantic segmentation; Deep learning; Depth image
CLC number: TG 156    Document code: A
1. Introduction
Semantic segmentation of images is a fundamental problem in computer vision. As a crucial component of image understanding, it plays a key role in practical applications such as autonomous driving systems, geographic information systems (GIS), medical image analysis, and robotic object grasping. In GIS, satellite remote sensing images can be used to automatically identify roads, rivers, buildings, plants, and other features through semantic segmentation. In autonomous driving systems, images acquired by onboard cameras and LiDAR can be semantically segmented to detect pedestrians and vehicles ahead, assisting driving and obstacle avoidance. In medical image analysis, semantic segmentation is primarily used for tumor image segmentation and dental caries diagnosis.
Image semantic segmentation is a task that assigns a semantic category to each pixel of an input image to achieve pixel-level classification. Traditional semantic segmentation mainly uses hand-designed features and methods such as support vector machines and probabilistic graphical models. With deep convolutional neural networks breaking records in computer vision tasks, including image classification [1-3] and object detection [4-6], deep learning methods are also widely used in semantic segmentation tasks [7-9].
Convolutional neural networks have a certain degree of invariance to local image transformations, which serves image classification well. However, semantic segmentation requires classification together with accurate localization, which conflicts with this invariance to local transformations. In typical image classification models, the multi-layer network forms a local-to-global pyramid structure in which the top-layer feature map has the lowest resolution; although it contains global semantic information, it cannot provide accurate localization. Fully convolutional networks [7] are trained end-to-end, pixel-to-pixel. To address the poor localization of the top-layer feature map, they adopt a skip structure that integrates shallow, fine appearance information with deep, coarse semantic information.
Chen et al. [8] took another approach, reducing the downsampling operations in the network structure to obtain higher resolution, and utilized dilated convolution to enlarge the receptive field of the convolution kernel without increasing the number of network parameters, thereby capturing more contextual information around image pixels. In signal processing, similar methods were originally used for efficient computation of the undecimated wavelet transform [10]. In addition, a fully connected conditional random field [11] was used to post-process the output of the convolutional neural network to obtain more refined segmentation results.
Zhao et al. [12] proposed a pyramid pooling module based on the network model with dilated convolution. This study uses the result of global average pooling (GAP) as a global contextual information representation, which is connected with the previous feature map so that the combined feature map contains both global contextual information and local information. It is one of the best methods for segmentation results on the Pascal VOC 2012 dataset [13].
Estimating physical properties of images (such as depth, surface normals, and reflectivity) is a mid-level visual task and can be helpful for high-level visual tasks. Many data-driven depth estimation methods have been proposed [14-17], but these methods are limited by the image datasets acquired with depth sensors. Although consumer-grade depth cameras such as Microsoft Kinect, ASUS Xtion Pro, and Intel RealSense have become widespread in recent years, they remain largely limited to indoor scenes, and for specular, transparent, or dark objects they often fail. It is therefore difficult to obtain reliable depth images with depth sensors in unrestricted scenes. For semantic segmentation tasks, clear and distinct edges matter more than accurate depth measurements themselves. There is empirical evidence that humans are better at judging the depth order of two points in a scene than at estimating the absolute depth of a single point [18]. For the depths of two points in an image, the three order relationships "equal", "deeper", and "shallower" are invariant to monotonic transformations and can be annotated by humans, so scene restrictions do not apply. Chen et al. [19] constructed a human-annotated "relative depth" point-pair dataset and proposed a method to train an end-to-end convolutional neural network with these annotations to predict depth images from color images, significantly improving single-image depth perception in unrestricted scenes. This paper proposes to integrate depth images predicted from color images into a convolutional neural network for semantic segmentation, and to improve segmentation performance by exploiting the characteristics of depth images.
The main innovations of this paper are: (1) using depth images predicted from color images as input to the semantic segmentation network; and (2) improving semantic segmentation performance by using a multi-branch input and feature map merging connection to fuse depth image features. Experimental results show that fusing depth image features can significantly improve semantic segmentation performance.
2. Semantic segmentation based on fused depth images
2.1 Convolutional Neural Networks for Semantic Segmentation
A typical convolutional neural network (CNN) for classification tasks mainly consists of convolutional layers, activation functions, pooling layers, and fully connected layers. An input image passes through the network, and the fully connected layers output a one-dimensional vector, which is then normalized using the Softmax function to serve as the object classification score. For semantic segmentation, the CNN utilizes weight parameters pre-trained from the classification network, employing a fully convolutional network structure to directly train end-to-end on the input three-channel color image and pixel-level labeled masks. Because the fully connected layers are eliminated, it can adapt to input images of any size and output segmentation results of the same size.
The region of the original image that influences a given position in a layer's output feature map is called that position's "receptive field." Due to the downsampling operations of pooling or strided convolutional layers, the resolution of the feature map output by the last convolutional layer is often very low. Simply removing downsampling operations to raise that resolution would shrink the receptive field of the convolution kernels and incur greater computational cost. Dilated convolution, however, can enlarge the receptive field without changing the number of network weight parameters. Figure 1(a) shows an ordinary convolution with a kernel size of 3. Figure 1(b) shows a dilated convolution with a dilation rate r of 2, which processes and outputs a higher-resolution feature map with the same number of parameters as Figure 1(a).
Given a one-dimensional input signal x[i] and a convolution kernel w[k] of length K, the output y[i] of dilated convolution with dilation rate r is defined as

y[i] = Σ_{k=1}^{K} x[i + r·k] · w[k]
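As a concreteness check, the one-dimensional dilated convolution can be sketched directly in NumPy (indexing from 0 here rather than 1; with r = 1 it reduces to ordinary convolution in correlation form):

```python
import numpy as np

def dilated_conv1d(x, w, r):
    """1-D dilated convolution: y[i] = sum_k x[i + r*k] * w[k].

    Only valid positions are produced; the dilation rate r is the
    sampling stride over the input, and r = 1 gives ordinary convolution.
    """
    K = len(w)
    span = r * (K - 1)              # input distance covered by the dilated kernel
    out_len = len(x) - span
    y = np.empty(out_len)
    for i in range(out_len):
        y[i] = sum(x[i + r * k] * w[k] for k in range(K))
    return y

x = np.arange(8, dtype=float)       # [0, 1, ..., 7]
w = np.array([1.0, 1.0, 1.0])
print(dilated_conv1d(x, w, 1))      # sums x[i], x[i+1], x[i+2]
print(dilated_conv1d(x, w, 2))      # sums x[i], x[i+2], x[i+4]: wider receptive field
```

Note that both calls use the same three kernel weights; only the sampling stride differs, which is exactly how dilation enlarges the receptive field at constant parameter count.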
The dilation rate r represents the sampling stride over the input signal; ordinary convolution is the special case r = 1. The semantic segmentation network used in this paper adds global average pooling (GAP) on top of dilated convolution. Pooling first condenses the entire feature map into a single multi-channel point, forming a global contextual prior; this prior is then scaled back to the original feature map size and concatenated with the original feature map, doubling the number of channels, and the segmentation result is output after several convolutional layers. Because the feature map now carries global contextual information, the segmentation result improves significantly [12]. Figure 2 shows the network structure of the semantic segmentation model used in this paper. The "color image network" branch uses VGG-16 [2] as its base model, replaces conv5 with three dilated convolutional layers with a dilation rate of 2, uses a dilated convolutional layer with a dilation rate of 12 as conv6, and finally outputs a 256-channel feature map. The "depth image network" branch contains only three ordinary convolutional layers with a kernel size of 3 and 64, 128, and 256 channels, respectively. Each branch performs global average pooling, rescaling to the original size, and merge connection, yielding a 512-channel feature map per branch. The roles of the other parts of the network are described in the following sections.
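The GAP-and-concatenate step can be sketched with plain NumPy shape manipulation. This is a shape-level illustration only (the branch outputs are random placeholders with the channel counts stated in the text, not real network features):

```python
import numpy as np

def gap_context(fmap):
    """Append a global-average-pooled context map to a feature map.

    fmap: (C, H, W). GAP collapses each channel to a single value,
    which is broadcast back to H x W and concatenated channel-wise,
    doubling the channel count (C -> 2C), as described in the text.
    """
    c, h, w = fmap.shape
    context = fmap.mean(axis=(1, 2), keepdims=True)     # (C, 1, 1) global prior
    context = np.broadcast_to(context, (c, h, w))       # scale back to map size
    return np.concatenate([fmap, context], axis=0)      # (2C, H, W)

# Placeholder branch outputs with the channel counts from the text:
color_feat = np.random.rand(256, 63, 63)   # dilated-VGG color branch output
depth_feat = np.random.rand(256, 63, 63)   # 3-layer depth branch output

color_ctx = gap_context(color_feat)        # 512 channels
depth_ctx = gap_context(depth_feat)        # 512 channels
fused = np.concatenate([color_ctx, depth_ctx], axis=0)
print(fused.shape)                         # (1024, 63, 63)
```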
2.2 Predicting Depth Images from Color Images
There are currently two main methods for learning to predict dense depth images from sparse "relative depth" annotations, proposed by Zoran et al. [20] and Chen et al. [19], respectively. Zoran et al. [20] first train a classifier that predicts the depth order between superpixel centers of the image, then recover the overall depth with an energy-minimization method that makes these order relationships consistent, and finally interpolate within the superpixels to obtain a pixel-level depth image. Chen et al. [19] directly used a fully convolutional neural network for end-to-end training from color images to depth images and proposed a method to train the network with relative depth annotations. For relative depth annotations, a suitable loss function must be designed, based on the principle that when the true depth order is "equal" the difference in predicted depth values should be as small as possible, and otherwise as large as possible. Assume a training image is I and the K queries on it are R = {(i_k, j_k, r_k)}, k = 1, …, K, where i_k and j_k are the positions of the two points in the k-th query and r_k ∈ {+1, −1, 0} is a label representing the depth order of the two points. If the predicted depth image is z, the depth values corresponding to i_k and j_k are z_{i_k} and z_{j_k}, respectively. The loss function is defined as

L(I, R, z) = Σ_{k=1}^{K} φ_k(I, i_k, j_k, r_k, z)

where φ_k(I, i_k, j_k, r_k, z) is the loss for the k-th query:

φ_k = log(1 + exp(−r_k (z_{i_k} − z_{j_k}))) when r_k ≠ 0, and φ_k = (z_{i_k} − z_{j_k})² when r_k = 0.
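A minimal NumPy sketch of this ranking loss (following the principle above: a squared penalty for "equal" pairs and a logistic ranking penalty for ordered pairs; the sign convention for r_k is an assumption of the sketch):

```python
import numpy as np

def relative_depth_loss(z, queries):
    """Ranking loss over relative-depth queries (after Chen et al. [19]).

    z: predicted depth map, shape (H, W).
    queries: list of (i_k, j_k, r_k), where i_k and j_k are (row, col)
    points and r_k in {+1, -1, 0} is the annotated depth order.
    Assumed convention: r_k = +1 means i_k should be predicted deeper.
    """
    loss = 0.0
    for (ik, jk, rk) in queries:
        d = z[ik] - z[jk]
        if rk == 0:
            loss += d ** 2                    # push "equal" depths together
        else:
            loss += np.log1p(np.exp(-rk * d)) # push ordered depths apart
    return loss

z = np.array([[0.0, 1.0],
              [2.0, 3.0]])
queries = [((0, 0), (1, 1), -1),  # (0,0) shallower than (1,1): order satisfied
           ((0, 0), (0, 1), 0)]   # annotated "equal": penalized by (0-1)^2
print(relative_depth_loss(z, queries))
```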
For human-annotated relative depth point pairs, this loss function can be used directly. For depth images acquired by depth sensors, random sampling of several point pairs can convert them into the same form. This paper uses the “relative depth” network model of Chen et al. [19] to predict depth images from color images. The model uses an hourglass-shaped network structure [21]. It is first pre-trained on a depth image dataset acquired by a depth sensor, and then fine-tuned on a relative depth point pair dataset. The predicted depth image is shown in Figure 3(b).
The selection of relative depth annotation points greatly affects the results of network training. Randomly selecting two points in a two-dimensional plane can cause a serious bias problem [19]: if an algorithm simply assumes that the bottom point is closer in depth than the top point, there is an 85.8% probability that it will be the same as the result of human annotation. A better sampling method is to randomly select two points on the same horizontal line, but this will also result in an algorithm that simply assumes that the center point is closer in depth having a 71.4% probability of being the same as the result of human annotation. Therefore, a suitable sampling strategy is to randomly select two points on a horizontal line that are symmetrical to the center of the horizontal line, so that the probability that the left point is closer in depth than the right point is 50.03%.
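The center-symmetric sampling strategy above can be sketched as follows (a minimal version: the pair is mirrored about the center of a random horizontal line; function name and the use of Python's `random` module are illustrative):

```python
import random

def sample_symmetric_pair(width, height, rng=random):
    """Sample a relative-depth query pair that is symmetric about the
    center of a random horizontal line, the strategy described in the
    text that keeps the left/right depth-order prior near 50%.

    Returns two (row, col) points on the same row, mirrored about
    the horizontal center of the image.
    """
    y = rng.randrange(height)           # pick a random horizontal line
    x_left = rng.randrange(width // 2)  # a point in the left half
    x_right = width - 1 - x_left        # its mirror about the center
    return (y, x_left), (y, x_right)

p, q = sample_symmetric_pair(640, 480)
print(p, q)   # same row, columns summing to width - 1
```

The mirrored construction removes both biases named in the text: neither point is systematically lower in the image, and neither is systematically nearer the center.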
2.3 Fusion of Color and Depth Image Features
After obtaining the estimated depth image, how to fuse the features of the depth image and the color image is also an important problem. A simple method is to stack the three channels of the color image with the one channel of the depth image to form a four-channel input. However, the geometric meaning the depth image carries about objects differs from the optical meaning represented by the color image, and the experiments of Long et al. [7] also showed that this method does not significantly improve performance. Gupta et al. [22] proposed a representation derived from depth information called HHA, consisting of horizontal disparity, height above the ground, and the angle between the local surface normal and the direction of gravity, and achieved better results. However, this representation is complex and contains no more information than the depth image itself [23]. The fusion method proposed in this paper is: first, two network branches separately process the color image and the depth image to obtain feature maps with a and b channels; then, a merge connection operation similar to that in the pyramid pooling module of PSPNet [12] concatenates the feature maps of the two branches into an (a+b)-channel feature map; finally, the segmentation result is output after several convolutional layers. Compared with the addition operation commonly used for feature map fusion, the merge connection keeps the features output by the two branches independent, rather than forcing a single shared representation onto the subsequent convolutional layers. As shown in Figure 2, the two 512-channel feature maps output by the color image and depth image branches are merge-connected to obtain a 1024-channel feature map.
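The contrast between the two fusion operations is easy to see in NumPy (placeholder feature maps with the channel counts from Figure 2):

```python
import numpy as np

# Two branch outputs with a and b channels (here a = b = 512, as in Figure 2).
color_feat = np.random.rand(512, 63, 63)
depth_feat = np.random.rand(512, 63, 63)

# Addition: requires a == b and blends the two representations into one map,
# so later layers cannot tell color features from depth features.
added = color_feat + depth_feat                            # still 512 channels

# Merge connection: stacks the maps along the channel axis, keeping each
# branch's features intact for later convolutions to weight independently.
merged = np.concatenate([color_feat, depth_feat], axis=0)  # 1024 channels

print(added.shape, merged.shape)
```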
Preliminary experiments revealed that using a lower-resolution depth image of the same size as the final convolutional layer output and a small number of convolutional layers yielded better results than using a higher-resolution depth image and more convolutional and pooling layers. This is because the depth image itself has a lower resolution in the network's prediction output, and the higher-resolution depth image is simply obtained through scaling. Furthermore, omitting pooling layers is more beneficial for the positional correspondence between the network's input and output pixels.
3. Experiment
3.1 Dataset
This paper conducts experiments on the Pascal VOC 2012 dataset and the SUN RGB-D dataset [24]. The Pascal VOC 2012 dataset contains images of 20 categories of objects and one background category. The semantic segmentation dataset is divided into three parts: training set (1,464 images), validation set (1,449 images), and test set (1,456 images). The validation set and test set do not contain images from the training set. We follow the convention of using an additional labeled dataset containing 10,582 training images [25] to perform validation on 1,449 images. The SUN RGB-D dataset is a dataset suitable for scene understanding. It contains color and depth images acquired by four different sensors, including NYU Depth v2[26], Berkeley B3DO[27] and SUN3D[28], etc. It contains 10,335 RGB-D images and their pixel-level semantic segmentation annotations, including 5,285 training images and 5,050 test images.
3.2 Dataset Processing
This paper applies common data augmentation methods for natural images to both datasets: random scaling, mirroring, and crop/pad. Specifically, (1) random scaling: the image is randomly scaled to 0.5 to 1.5 times its original size; (2) mirroring: the image is horizontally flipped with 50% probability; (3) crop/pad: the image is cropped or padded to a fixed size of 500×500 (padded with gray if smaller).
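A minimal sketch of this augmentation pipeline (nearest-neighbor resizing and top-left placement on the canvas are simplifications of the sketch, not details from the paper):

```python
import random
import numpy as np

def augment(img, out_size=500, fill=128):
    """Augmentation from the text: random scale in [0.5, 1.5],
    50% horizontal flip, then crop/pad to a fixed 500x500 (gray fill).
    """
    h, w, c = img.shape
    s = random.uniform(0.5, 1.5)
    nh, nw = max(1, int(h * s)), max(1, int(w * s))
    rows = (np.arange(nh) * h // nh).astype(int)   # nearest-neighbor resize
    cols = (np.arange(nw) * w // nw).astype(int)
    img = img[rows][:, cols]
    if random.random() < 0.5:
        img = img[:, ::-1]                         # horizontal mirror
    canvas = np.full((out_size, out_size, c), fill, dtype=img.dtype)
    ch, cw = min(nh, out_size), min(nw, out_size)
    y0 = random.randrange(nh - ch + 1)             # random crop offset
    x0 = random.randrange(nw - cw + 1)
    canvas[:ch, :cw] = img[y0:y0 + ch, x0:x0 + cw]
    return canvas

out = augment(np.zeros((400, 600, 3), dtype=np.uint8))
print(out.shape)   # (500, 500, 3)
```

In practice the same geometric transform must be applied to the label mask (and depth image) so that pixels stay aligned with their annotations.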
The network input includes color images and depth images. Since the Pascal VOC 2012 dataset does not contain depth images acquired by a depth sensor, this paper uses depth images predicted from color images as input. For the SUN RGB-D dataset, this paper conducted experiments using both depth images acquired by a depth sensor and depth images predicted from color images as input.
3.3 Experimental Procedure and Parameters
This paper uses the network structure shown in Figure 2. First, a depth prediction network is used to predict the depth image from the color image. Then, the color image and the depth image are respectively input into two convolutional neural network branches. The color image branch is a network with dilated convolutions based on the VGG-16 model. The weights are initialized by the weights of VGG-16[2] pre-trained on ImageNet[29]. The other convolutional layers are randomly initialized by Xavier[30]. After the two network branches are merged and connected, the segmentation result is output through two convolutional layers.
The batch size for network training was 10. The input color images were 500×500, and the depth images and the grayscale images used as a control were 63×63. The initial learning rate was 0.0001 (0.001 for the last layer), decaying according to a polynomial function, and training stopped after 20,000 iterations. The momentum was 0.9 and the weight decay was 0.0005. All experiments were conducted on an NVIDIA GeForce TITAN X GPU. Segmentation performance was evaluated using the mean per-class intersection-over-union (mIoU) score. This paper designed five experiments on the two datasets, with the following inputs:
(1) VOC dataset, color images and predicted depth images;
(2) VOC dataset, color images and grayscale images;
(3) SUN dataset, color images and predicted depth images;
(4) SUN dataset, color images and depth images acquired by depth sensors;
(5) SUN dataset, color images and grayscale images.
In these experiments, grayscale images converted from the color images replace the depth images as network input, serving as a control.
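The polynomial learning-rate schedule mentioned in the training parameters can be sketched as follows. The decay exponent 0.9 is an assumption (a common choice with this schedule); the text specifies only "polynomial" decay with a base rate of 0.0001 and 20,000 iterations:

```python
def poly_lr(base_lr, it, max_iter, power=0.9):
    """Polynomial decay: lr = base_lr * (1 - it / max_iter) ** power.

    The power of 0.9 is assumed, not stated in the paper.
    """
    return base_lr * (1.0 - it / float(max_iter)) ** power

print(poly_lr(1e-4, 0, 20000))       # full rate at the start
print(poly_lr(1e-4, 10000, 20000))   # roughly half-way decayed
print(poly_lr(1e-4, 20000, 20000))   # decayed to zero at the end
```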
4. Experimental Results
4.1 Comparison of experiments using the Pascal VOC dataset
To compare the effect of depth image information, we compared the per-category segmentation performance of experiments (1) and (2); the results are shown in Table 1. Table 1 shows that for most categories, fusing predicted depth image features effectively improves segmentation performance. Only potted plants, a category with distinctive color features but small size in the image, showed a 0.1% decrease; this is because the output resolution of the depth prediction model is low, so depth prediction for small objects is poor. Objects with distinct structural features and large size in the image, such as airplanes, boats, and sofas, showed significant improvement, which aligns with the physical meaning of the depth image and confirms the effectiveness of the method. The segmentation results on the Pascal VOC dataset are shown in Figure 3. Figure 3 shows that even for outdoor scenes, the predicted depth image still captures clear object outlines. With a depth image as input, segmentation at object boundaries also improves, owing to the clearer edges of the depth image.
4.2 Experimental Comparison with SUN RGB-D Dataset
Table 2 compares the segmentation results in three cases on the SUN RGB-D dataset: predicted depth images, depth images acquired by sensors, and no depth information, i.e., experiments (3), (4), and (5). Table 2 shows that the results using depth images are better, and the results using predicted depth images are slightly better than those using sensor-acquired depth images. This indicates that for semantic segmentation tasks, predicted depth images can substitute for sensor-acquired depth images.
The segmentation results on the SUN RGB-D dataset are shown in Figure 4. As can be seen from Figure 4, the first row of depth images clearly distinguishes the chair legs, indicating that the experiment using depth images achieves good segmentation results for the chair legs. The second and third rows of sensor depth images contain some regions with missing pixel values and noise, while the predicted depth images, although the depth measurements are not as precise, maintain a relatively complete object shape. This is one reason why the predicted depth images achieve slightly better segmentation results.
5. Discussion
The semantics and depth of objects in an image are closely related. Acquiring and utilizing depth images can greatly assist semantic segmentation tasks. However, acquiring depth images in unrestricted environments is a challenge. Depth image datasets acquired by depth sensors are limited to indoor environments and fixed scenes (such as highways), and there are still many shortcomings in the current methods for utilizing depth information in semantic segmentation tasks [22,23]. This paper uses convolutional neural networks to predict depth images from color images. Based on a semantic segmentation network with dilated convolution, a multi-branch network is designed, and the features of color images and depth images are fused by feature map merging and connection to perform semantic segmentation. Dilated convolution increases the receptive field of the convolution kernel without increasing the number of network parameters, allowing it to contain more image context information, thereby improving segmentation performance [8].
Under the same conditions, the network proposed in this paper, with depth image information and the merge connection operation, improves the mean intersection-over-union (mIoU) on the Pascal VOC dataset by 1.1% over the network without depth image information (using grayscale images as a substitute). The segmentation results on the SUN RGB-D dataset show that the network trained with predicted depth images performs similarly to the network trained with sensor-acquired depth images, and both outperform the network without depth images. This shows that predicted depth images can replace sensor-acquired depth images to improve semantic segmentation results. However, the current scheme uses only a small number of labeled relative depth point pairs, and the network model still has much room for improvement [19]. How best to use depth images in convolutional neural networks remains a research problem well worth studying.
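The mIoU metric used throughout these comparisons is a simple per-class computation; a minimal NumPy sketch:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean per-class intersection-over-union between two label maps.

    pred, gt: integer class-label maps of the same shape.
    Classes absent from both prediction and ground truth are skipped.
    """
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 0], [1, 1]])
gt   = np.array([[0, 1], [1, 1]])
print(mean_iou(pred, gt, 2))   # average of 1/2 (class 0) and 2/3 (class 1)
```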
6. Conclusion
This paper proposes a multi-branch network and feature map concatenation method to fuse depth image features, using depth images predicted from color images to address the difficulty of acquiring depth images in unconstrained scenes. The method utilizes the merge connection operation used in the pyramid pooling module to connect the feature maps of the color and depth images, making the two types of features complementary yet maintaining independent representations. Segmentation results on two datasets demonstrate that this method can refine object edges using depth images, improving semantic segmentation performance. Currently, there is still no good method to fully utilize depth images in convolutional neural networks. Future work will attempt to improve the loss function and network structure of semantic segmentation models.
References
[1]Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks [C] // Proceedings of the 25th International Conference on Neural Information Processing Systems, 2012: 1097-1105.
[2]Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition [J]. arXiv preprint arXiv:1409.1556, 2014.
[3]He KM, Zhang XY, Ren SQ, et al. Deep residual learning for image recognition [C] // IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778.
[4]Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation [C] // IEEE Conference on Computer Vision and Pattern Recognition, 2014: 580-587.
[5]Girshick R. Fast R-CNN [C] // IEEE International Conference on Computer Vision, 2015: 1440-1448.
[6]Ren S, He K, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 39(6): 1137-1149.
[7]Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation [C] // IEEE Conference on Computer Vision and Pattern Recognition, 2015: 3431-3440.
[8]Chen LC, Papandreou G, Kokkinos I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 40(4): 834-848.
[9]Zheng S, Jayasumana S, Romera-Paredes B, et al. Conditional random fields as recurrent neural networks [C] // IEEE International Conference on Computer Vision, 2015: 1529-1537.
[10]Holschneider M, Kronland-Martinet R, Morlet J, et al. A real-time algorithm for signal analysis with the help of the wavelet transform [M] // Wavelets. Springer Berlin Heidelberg, 1990: 286-297.
[11]Krähenbühl P, Koltun V. Efficient inference in fully connected CRFs with gaussian edge potentials [C] // Advances in Neural Information Processing Systems, 2011: 109-117.
[12]Zhao HS, Shi JP, Qi XJ, et al. Pyramid scene parsing network [C] // IEEE Conference on Computer Vision and Pattern Recognition, 2017: 6230-6239.
[13]Everingham M, Gool LV, Williams CKI, et al. The pascal visual object classes (VOC) challenge [J]. International Journal of Computer Vision, 2010, 88(2): 303-338.
[14]Karsch K, Liu C, Kang SB. Depth transfer: depth extraction from videos using nonparametric sampling [M] // Dense Image Correspondences for Computer Vision. Springer International Publishing, 2016: 775-788.
[15]Saxena A, Sun M, Ng AY. Make3D: learning 3D scene structure from a single still image [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(5): 824-840.
[16]Eigen D, Fergus R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture [C] // IEEE International Conference on Computer Vision, 2015: 2650-2658.
[17]Li B, Shen CH, Dai YC, et al. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs [C] // IEEE Conference on Computer Vision and Pattern Recognition, 2015: 1119-1127.
[18]Todd JT, Norman JF. The visual perception of 3-D shape from multiple cues: are observers capable of perceiving metric structure? [J]. Perception & Psychophysics, 2003, 65(1): 31-47.
[19]Chen WF, Fu Z, Yang DW, et al. Single-image depth perception in the wild [C] // Advances in Neural Information Processing Systems, 2016: 730-738.
[20]Zoran D, Isola P, Krishnan D, et al. Learning ordinal relationships for mid-level vision [C] // IEEE International Conference on Computer Vision (ICCV), 2015: 388-396.
[21]Newell A, Yang K, Deng J. Stacked hourglass networks for human pose estimation [C] // European Conference on Computer Vision, 2016: 483-499.
[22]Gupta S, Girshick R, Arbeláez P, et al. Learning rich features from RGB-D images for object detection and segmentation [C] // European Conference on Computer Vision, 2014: 345-360.
[23]Hazirbas C, Ma L, Domokos C, et al. Fusenet: incorporating depth into semantic segmentation via fusion-based cnn architecture [C] // Asian Conference on Computer Vision, 2016: 213-228.
[24]Song SR, Lichtenberg SP, Xiao JX. SUN RGB-D: a RGB-D scene understanding benchmark suite [C] // IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[25]Hariharan B, Arbeláez P, Bourdev L, et al. Semantic contours from inverse detectors [C] // IEEE International Conference on Computer Vision (ICCV), 2011: 991-998.
[26]Silberman N, Hoiem D, Kohli P, et al. Indoor segmentation and support inference from rgbd images [C] // European Conference on Computer Vision, 2012: 746-760.
[27]Janoch A, Karayev S, Jia Y, et al. A category-level 3D object dataset: putting the kinect to work [C] // IEEE International Conference on Computer Vision Workshops, 2011: 1168-1174.
[28]Xiao JX, Owens A, Torralba A. SUN3D: a database of big spaces reconstructed using SfM and object labels [C] // IEEE International Conference on Computer Vision, 2013: 1625-1632.
[29]Russakovsky O, Deng J, Su H, et al. Imagenet large scale visual recognition challenge [J]. International Journal of Computer Vision, 2015, 115(3): 211-252.
[30]Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks [C] // International Conference on Artificial Intelligence and Statistics, 2010: 249-256.