In recent years, due to the development of high-performance computers and the expansion of datasets, deep learning models have been widely used in the field of medical image classification and detection. Training automated diagnostic models from large-scale medical image databases using deep learning is attracting widespread research interest.
Method
1. Convolutional Neural Networks
Inspired by biological neural systems, Convolutional Neural Networks (CNNs) have achieved great success in object recognition and detection. Unlike traditional neural networks, CNNs combine local connectivity and weight-sharing strategies, significantly reducing the number of parameters and making it possible to build deeper convolutional networks. The main component of a CNN is the convolutional layer, which contains many neurons, each with a set of learnable weights and a bias term. These weights are updated continuously during network training. Each neuron perceives a local region of the previous layer, using that local region as its input. Assuming $x_j^l$ is the output of the j-th neuron in the l-th convolutional layer, $x_m^{l-1}$ is the output of the m-th neuron in the (l-1)-th layer, and M represents the size of the local input region of the current neuron, then $x_j^l$ can be expressed as:

$$x_j^l = \sigma\Big(\sum_{m=1}^{M} w_m\, x_m^{l-1} + b_j^l\Big)$$

Here, $w_m$ represents the weight connecting the output of the m-th neuron in the previous layer, $b_j^l$ is the bias term, and $\sigma(\cdot)$ represents the neuron activation function (typically the ReLU nonlinearity). Pooling layers and fully connected layers are the other major components of CNNs.
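As an illustration of the neuron response just described, the following NumPy sketch computes each output as a ReLU over a shared weight kernel applied to a local window of the previous layer's output (shown in 1-D for brevity; the function name and shapes are our own, not from the paper):

```python
import numpy as np

def conv_layer_forward(x_prev, weights, bias):
    """Local connectivity with weight sharing: the j-th output
    neuron sees only an M-sized window of the previous layer,
    multiplies it by the shared weights, adds a bias, and applies
    the ReLU activation."""
    M = weights.shape[0]
    N = x_prev.shape[0]
    out = np.empty(N - M + 1)
    for j in range(N - M + 1):                # j-th neuron of layer l
        local = x_prev[j:j + M]               # its local input region
        out[j] = np.maximum(0.0, local @ weights + bias)  # ReLU
    return out
```

Because the same `weights` vector is reused at every position, the parameter count is independent of the input size, which is what allows deeper networks to remain trainable.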
In this paper, we treat the classification softmax layer as an adjunct to the fully connected layer. Pooling layers are generally inserted between convolutional layers. They have no parameters of their own; their function is to reduce the spatial size of the convolutional outputs, thereby significantly reducing the number of parameters in the entire network and making the output features more robust to small spatial shifts. Pooling layers can therefore, to some extent, prevent overfitting. The fully connected (FC) layer is similar to a convolutional layer in that it is composed of many neurons, but here each neuron is connected to all inputs from the previous layer. The softmax layer is the last layer of the CNN, and its function is to classify the features extracted by the network. To evaluate the consistency between the network's predicted output and the ground-truth label of the input image, a loss function is used. Specifically, assuming the input image is $I_i$ and its corresponding ground-truth label is $T_i$, the loss function can be expressed as:

$$L = -\sum_i \sum_{k=1}^{K} \mathbb{1}(C_k = T_i)\, \log p_k(I_i), \qquad p_k(I_i) = \frac{e^{f_k}}{\sum_j e^{f_j}}$$

Here, $p_k(I_i)$ represents the network's predicted probability of class $C_k$ for the input image $I_i$. $\mathbb{1}(\cdot)$ is an indicator function whose value is 1 when $C_k = T_i$ and 0 otherwise, and $f_j$ is the output of the j-th neuron in the layer preceding the softmax layer for image $I_i$. The purpose of CNN training is to obtain appropriate weight parameters so that the entire network automatically learns suitable feature representations for the target data, thereby achieving better predictions on unseen samples.
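The softmax cross-entropy loss described above can be sketched for a single image as follows (a minimal NumPy version; the max-subtraction is a standard numerical-stability trick, not part of the paper's formulation):

```python
import numpy as np

def softmax_cross_entropy(f, label):
    """f: raw outputs f_j of the layer preceding softmax, shape (K,).
    label: index of the ground-truth class T_i.
    The indicator 1{C_k = T_i} selects the true-class term, so the
    per-image loss reduces to -log p_label."""
    f = f - f.max()                      # stabilize the exponentials
    p = np.exp(f) / np.exp(f).sum()      # predicted class probabilities
    return -np.log(p[label])
```

Summing this quantity over all training images yields the total loss that gradient descent minimizes during CNN training.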
2. CNN Structure Setup
For deep networks like CNN-16, training from randomly initialized parameters converges extremely slowly and suffers from vanishing gradients during backpropagation. Therefore, we use transfer learning to initialize the network directly, and the corresponding result is denoted CNN-16-TR. Table 1 lists the specific structural details of the CNNs used in the experiments.
3. Data Augmentation
As deep learning models, CNNs have extremely high requirements for training data. To a certain extent, the size of the dataset directly determines the network's scale and trainability. However, in clinical practice, collecting a large number of representative medical images is already quite difficult, and the data also require manual annotation, so constructing a high-quality, large-scale medical image dataset is extremely challenging. Increasing the dataset size by applying various transformations to the images while keeping their labels unchanged is a feasible and effective data augmentation method. Through such augmentation, we can significantly increase the dataset size, alleviating the problem that insufficient data would otherwise prevent training CNN models on medical images.
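A minimal sketch of such label-preserving transforms, assuming a random horizontal flip plus a random crop to 90% of each side (the specific transforms and crop ratio here are illustrative, not the paper's exact augmentation recipe):

```python
import numpy as np

def augment(image, rng):
    """Produce one label-preserving variant of a 2-D image:
    a coin-flip horizontal mirror followed by a random crop.
    The class label of the result is unchanged, so each source
    image yields many distinct training samples."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1]                   # horizontal flip
    h, w = out.shape[:2]
    ch, cw = int(h * 0.9), int(w * 0.9)      # crop size (assumed 90%)
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    return out[top:top + ch, left:left + cw]
```

Applying `augment` repeatedly to each training image multiplies the effective dataset size while leaving the annotation effort unchanged.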
4. Transfer learning
Even though CNNs possess extremely strong feature representation capabilities and have been successfully applied to many medical imaging tasks, the amount of training data remains their biggest limitation. Overfitting is therefore a problem that supervised deep models can never fully avoid. In this case, pre-training a CNN on a large-scale dataset and then copying its parameters into the target network is an effective network initialization method. This can significantly reduce training time and avoid the overfitting caused by insufficient training data.
Currently, the most common transfer learning method is to first train a base network on another dataset, copy the parameters of its first n layers to the corresponding layers of the target network, and randomly initialize the parameters of the remaining layers. Depending on the training method, transfer learning can be divided into two types: one keeps the parameters of the transferred layers fixed, updating only the randomly initialized layers during training; the other fine-tunes the transferred parameters during training. In our experience, because of the significant differences between the ImageNet dataset and our FFSP dataset, the former approach of fixing the transferred parameters is not suitable when many layers are transferred. Therefore, in this study we adopt the fine-tuning transfer learning approach.
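The copy-then-initialize scheme above can be sketched framework-agnostically as follows, with each network represented simply as a list of per-layer weight arrays (the function name, the 0.01-scaled Gaussian initialization, and the boolean trainability flags are our own illustrative choices):

```python
import numpy as np

def transfer_init(base_params, target_params, n, fine_tune=True):
    """Initialize a target network from a pre-trained base network:
    copy the first n layers' parameters, randomly initialize the
    rest, and mark each layer as trainable or frozen. With
    fine_tune=True the transferred layers keep updating during
    training; with fine_tune=False they stay fixed and only the
    newly initialized layers learn."""
    init, trainable = [], []
    for i, w in enumerate(target_params):
        if i < n:
            init.append(base_params[i].copy())             # transferred layer
            trainable.append(fine_tune)
        else:
            init.append(np.random.randn(*w.shape) * 0.01)  # random init
            trainable.append(True)
    return init, trainable
```

The two transfer-learning variants in the text correspond exactly to the two settings of `fine_tune`; this study's choice corresponds to `fine_tune=True`.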
Experiments and Results
1. Dataset and System Settings
2. Qualitative Analysis and Evaluation
In Figure 4, (c) shows the training set features extracted by CNN-8-TR, where the four types of cross-sectional features are clearly distinguished. (d) shows the training set features extracted by CNN-8-RI, where there is still a small amount of overlap among the four types of cross-sectional features. Correspondingly, (g) and (h) show the test set features of CNN-8-TR and CNN-8-RI, respectively, with results similar to those of the training set.
3. Quantitative analysis and evaluation
Currently, the most mainstream classification and recognition techniques combine hand-crafted features with classifiers. The basic idea behind these methods is to first extract features from the image, encode them, and then train a classifier for classification and recognition. Examples include DSIFT-based feature encoding methods such as bag-of-visual-words (BoVW) histogram encoding, Vector of Locally Aggregated Descriptors (VLAD) encoding, and Fisher Vector (FV) encoding. Our previous research applied these methods to the automatic recognition of fetal facial standard planes (FFSPs).
Figure 5 shows the ROC curves and confusion matrices of the classification performance of each CNN network. As can be seen from Figure 5 and Table 2, CNN-16-TR has a higher recognition accuracy than CNN-8-TR, indicating that increasing the depth of the CNN model can significantly improve the final classification performance. Furthermore, CNN-8-TR has a higher classification accuracy than CNN-8-RI, suggesting that pre-training the base network on other datasets and fine-tuning the transferred parameters is also an effective way to improve CNN recognition performance. The experimental results show that all CNN models perform well and outperform our previous hand-crafted feature classification results. Although CNNs have extremely strong classification performance, we also observed some noteworthy details in the experimental results. First, in the testing phase, each image combines the prediction results of its 10 sub-images; this 10-crop testing improves the result by about 3% compared to directly testing a single image. Additionally, when the transfer learning strategy is used, network convergence is significantly accelerated, cutting training time by more than half compared to networks with randomly initialized parameters.
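The 10-crop testing mentioned above can be sketched as averaging the per-crop softmax outputs of one test image before taking the arg-max (the function name and the shape convention `(10, K)` are our own; the paper does not specify how its sub-image predictions are combined beyond averaging-style fusion):

```python
import numpy as np

def ten_crop_predict(crop_logits):
    """crop_logits: raw network outputs for the 10 sub-images of one
    test image, shape (10, K). Each row is converted to a softmax
    distribution; the rows are averaged, and the arg-max of the
    average is the image-level prediction."""
    z = crop_logits - crop_logits.max(axis=1, keepdims=True)   # stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)       # per-crop softmax
    return int(p.mean(axis=0).argmax())                        # fused prediction
```

Averaging over crops smooths out prediction noise from any single view of the image, which is consistent with the roughly 3% improvement reported over single-image testing.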
Discussion
Deep networks, as a representational learning method, combine and iterate features at different levels to form high-level abstract features. Compared to traditional hand-crafted features, these features are more robust or invariant in terms of conceptual representation. Furthermore, deep networks can learn corresponding features from given data, thus exhibiting stronger generalization ability and can be applied to various image domains. However, deep learning models generally require a sufficiently large amount of training data; otherwise, overfitting can occur during network training. In different image domains, the difficulty of data acquisition varies significantly, and the size of natural image datasets is often far greater than that of medical image datasets. Therefore, the biggest challenge for deep network applications in the medical image field lies in the limitation of dataset size.
Training a base network on natural image datasets and then applying transfer learning is an effective way to address the insufficient data volume faced by deep networks in various image domains. Therefore, this study combines transfer learning and data augmentation to comprehensively improve the classification performance of deep networks. The final analysis also shows that the resulting FFSP classification performance is significantly better than that of our previous research, which combined hand-crafted features with a classifier.
However, this study still has some limitations. First, the test set is small, with only 2418 test images. While this reflects the classification performance of the CNN models to some extent, a larger dataset would be more informative; this is one direction for future improvement. Second, there is still room for improvement in the test results. Many non-standard cross-sections that closely resemble FFSPs were misclassified as standard cross-sections, largely due to image noise and the small differences between the images themselves. Future research could improve the robustness of the network's recognition by randomly adding noise to the training images. Furthermore, clinicians consider the contextual information of preceding and following frames when searching for FFSPs. Incorporating this image context during network training could therefore reduce the interference caused by the small inter-class differences between FFSPs and non-FFSPs.
Conclusion
In this study, we proposed using deep convolutional neural networks (CNNs) to identify fetal facial ultrasound images, and analyzed the classification results of CNN models of different depths on fetal facial standard plane (FFSP) images. To prevent overfitting due to the insufficient training dataset, we employed data augmentation combined with transfer learning to improve the classification results. The final results show that deep networks can effectively identify standard FFSP sections and that deeper networks deliver better classification performance. Therefore, the combination of deep networks and transfer learning holds great promise for clinical application and warrants further exploration.