Deep learning has become the mainstream approach to super-resolution reconstruction. However, images generated by deep learning-based super-resolution methods often have poor subjective quality, mainly because such networks tend to learn the low-frequency components of an image, while the distinguishing features of an image are concentrated in its high-frequency components. How to exploit the high-frequency components of an image remains an urgent problem.
1. Introduction
Images, as a medium for conveying information, are widely used in daily life, especially super-resolution images, which possess a much stronger ability to express information than low-resolution images. With the development of image technology, the demand for super-resolution images is increasing, and their application in machine vision is widespread. Since the 1970s, the generation of super-resolution images from low-resolution images has been under research. With the rapid development of deep learning, hardware, and convolutional neural network technology, the reconstruction of super-resolution images from low-resolution images has progressed rapidly over the past decade.
Super-resolution applications primarily encompass scenarios in the military, meteorological remote sensing, and medical image processing—scenarios where obtaining super-resolution images is crucial yet challenging. In the military, they are mainly used for high-altitude observation, nighttime observation, and battlefield surveillance. In meteorological remote sensing, super-resolution images are often difficult to obtain due to limitations imposed by weather conditions and imaging systems. In medical imaging, a large number of high-resolution images are needed to understand a patient's condition; various medical imaging techniques and endoscopic images all require super-resolution reconstruction.
Super-resolution reconstruction methods are mainly divided into traditional methods and deep learning-based methods. Traditional methods include interpolation, nonlocal mean algorithms, convex set projection, and machine learning-based reconstruction methods. Dong et al. first applied deep learning to super-resolution reconstruction, proposing the Super-Resolution Convolutional Neural Network (SRCNN), which significantly outperformed almost all traditional methods and established deep learning's dominance in super-resolution reconstruction. As shown in Figure 1, images reconstructed with bicubic interpolation and with SRCNN yield completely different results. Subsequently, Ledig et al. proposed introducing the GAN model to improve the subjective perception of generated images, achieving good results. However, their loss function metric was still based on MSE, which cannot effectively measure subjective perception. This paper improves the loss function metric, building on deep learning and GAN models, to achieve better subjective perception in super-resolution reconstructed images.
Figure 1. Super-resolution reconstructed image
In Single Image Super-Resolution (SISR) research, deep learning-based methods have gradually become mainstream and achieved excellent results. Among these algorithms, the loss function is the most critical component, and most use metrics such as PSNR or SSIM. Although these methods can yield good objective results, they operate at the pixel level: the final image scores well on PSNR and related metrics yet suffers from poor subjective perception.
To address this issue, reference [3] proposed the SRGAN algorithm, which utilizes the image generation capabilities of the GAN model and improves the subjective perception quality of SR images by introducing adversarial loss. In the adversarial generation stage, the corresponding content and adversarial loss are calculated using the features extracted by the pre-trained VGG19 network to replace the PSNR metric, and a very good subjective perception quality is achieved.
However, SRGAN uses a pre-trained VGG19 model when calculating the corresponding loss function. On the one hand, this does not fully exploit the capability of the generative network, because the discriminating network is fixed rather than learned jointly; on the other hand, although the VGG19 network has strong discriminative power for object classification and recognition, its discriminative power in super-resolution applications is limited.
The loss function of super-resolution generator networks is generally measured by the mean squared error (MSE). Recent research (reference [1]) shows that the MSE metric cannot effectively capture people's subjective perception. Although using MSE as the loss metric yields high PSNR and other quality evaluation indicators, it is not optimal for subjective perception.
To address the above issues, this paper proposes an improved super-resolution generative model based on an adversarial network with an improved loss function metric, as shown in Figure 2. The model is built upon the SRGAN model, but the loss function metric is improved by incorporating an adversarial network. The goal of this network is to ensure that super-resolution images are generated while maintaining good subjective perception through adversarial interaction between the two network loss metrics. This novel adversarial model ensures that the generated images utilize the high-frequency components of the image as much as possible, thereby guaranteeing the best generation of all high and low-frequency parts of the image, improving image accuracy and subjective perception.
2. Super-resolution generation algorithm based on adversarial loss
2.1 Algorithm Principle
The algorithm principle is shown in Figure 2:
Figure 2. Structure diagram of the super-resolution generation algorithm based on adversarial loss.
As shown in the figure above, a weight calculation network is added to the original super-resolution generation network. A simple super-resolution network is easy to train and generate the low-frequency parts of the original image, but difficult to generate the high-frequency parts. Adding a weight calculation network increases the weight of the high-frequency parts, thus balancing the generation of high- and low-frequency components in the image.
The choice of loss function has a significant impact on the results. Traditionally, super-resolution generative networks measure the loss with the MSE (Mean Squared Error) metric, as follows:

loss_MSE = (1/N) Σᵢ (T_H(i) − S_H(i))²

where T_H is the original high-resolution image, S_H is the reconstructed image, and N is the number of pixels.
The essence of the MSE metric is to accumulate the squared error at every pixel of the image with equal weight. In practical applications, since most of an image consists of smooth regions, the training process tends to favor those smooth regions.
In an image, most areas are smooth regions, with only a small number being high-frequency edge regions. However, edge regions have the greatest impact on people's subjective perception. As a result, the MSE algorithm learns mostly the reconstruction of smooth regions of the image, while the reconstruction of high-frequency regions is not so ideal.
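The imbalance described above can be illustrated with a small numerical sketch (not from the paper): even when the per-pixel error at edges is several times larger, the far more numerous smooth pixels dominate the total MSE, so gradient updates concentrate on smooth regions. The error magnitudes below are illustrative assumptions.

```python
import numpy as np

# Synthetic residual-error map: a 100x100 image whose pixels are mostly
# smooth (small error), with a thin edge band whose error is 4x larger.
h, w = 100, 100
err = np.full((h, w), 0.05)   # small residual error in smooth regions
err[:, 48:52] = 0.2           # thin edge band with 4x larger error

sq = err ** 2
edge_mask = np.zeros((h, w), dtype=bool)
edge_mask[:, 48:52] = True

# Share of the total squared error contributed by smooth regions.
smooth_share = sq[~edge_mask].sum() / sq.sum()
print(f"smooth regions contribute {smooth_share:.0%} of the total MSE")
# → smooth regions contribute 60% of the total MSE
```

Although each edge pixel contributes 16× more squared error than a smooth pixel, smooth pixels outnumber edge pixels 24:1, so they still dominate the loss.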
The two-network architecture proposed in this paper changes the loss function metric of the super-resolution generative network. The loss metric functions for the two networks are as follows:
In the formula, T_H denotes the pixels of the original high-resolution image, and S_H denotes the pixels of the image reconstructed by the super-resolution generation network. The losses of both networks depend on the weights w produced by the weight calculation network. During forward propagation, both networks are evaluated; the weight map w has the same dimensions as the high-resolution image S_H generated by the super-resolution generation network, ensuring matching dimensions when calculating the loss.

During backpropagation, only one network is updated at a time: after each forward pass, the parameters of the two networks are updated alternately. That is, when backpropagating loss_a, w_a is held fixed, and when backpropagating loss_b, S_H is held fixed. Through the adversarial relationship between the two losses, the weights of the regions of the original image that are difficult to generate are increased; the key quantities here are w_a and w_b in the two networks' loss metric functions. Because of the weight calculation network, the larger weights in the super-resolution generation network decrease while the smaller weights increase, and backpropagation then proceeds again. This process continually exploits the adversarial relationship between the two networks to update the parameters of the super-resolution generation network, increasing the weights of the image regions that are difficult to generate.
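The alternating update described above can be sketched in PyTorch. Since the exact loss formulas are not reproduced in this text, the weighted forms below are assumptions consistent with the described behavior: loss_a weights the squared error by w (with w treated as a constant), and loss_b weights it by 1 − w (with the generated image treated as a constant), which pushes w toward 1 in high-error regions.

```python
import torch

def loss_a(S_H, T_H, w):
    # Update target: the SR generator; the weight map w is held fixed.
    # With w = 1 everywhere this reduces to plain MSE, matching the
    # worst-case remark in the text.
    return (w.detach() * (T_H - S_H) ** 2).mean()

def loss_b(S_H, T_H, w):
    # Update target: the weight network; the generated image is held fixed.
    # Minimizing (1 - w) * err^2 increases w where reconstruction error is
    # large, i.e. on hard, high-frequency regions.
    return ((1.0 - w) * (T_H - S_H.detach()) ** 2).mean()

def train_step(gen, weight_net, opt_g, opt_w, lr_img, hr_img):
    S_H = gen(lr_img)        # super-resolution generator output
    w = weight_net(hr_img)   # weight map, same shape as hr_img

    # Alternating updates: first the generator, then the weight network.
    opt_g.zero_grad()
    la = loss_a(S_H, hr_img, w)
    la.backward()
    opt_g.step()

    opt_w.zero_grad()
    lb = loss_b(S_H, hr_img, w)
    lb.backward()
    opt_w.step()
    return la.item(), lb.item()
```

Detaching w in loss_a and S_H in loss_b implements "only one network weight is updated at a time" without needing two separate forward passes.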
Once the loss drops below the set threshold or the maximum number of iterations is reached, training is complete and the super-resolution generator network parameters are finalized. During testing, only the images generated by the super-resolution generator network are used, and their SSIM and PSNR values against the original ground-truth images are computed.
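The PSNR evaluation used at test time can be sketched as follows, assuming 8-bit images stored as NumPy arrays (SSIM would be computed analogously, e.g. with `skimage.metrics.structural_similarity`).

```python
import numpy as np

def psnr(ground_truth, generated, max_val=255.0):
    """Peak signal-to-noise ratio between two images, in dB."""
    diff = ground_truth.astype(np.float64) - generated.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```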
As can be seen from the expressions for w_a and w_b above, even when w takes its maximum value of 1 (after normalization), the network behaves identically to a super-resolution generative network without adversarial loss processing. In other words, in the worst case the proposed algorithm still matches the baseline super-resolution generative model.
3. Algorithm implementation steps
3.1 Weight Calculation Network
Figure 3. Network structure diagram for weight calculation
As shown in Figure 3, the weight calculation network consists of three convolutional layers, each followed by an activation layer. Convolutional layer 1 has 64 kernels of size 5x5; convolutional layer 2 has 128 kernels of size 3x3; and convolutional layer 3 has 3 kernels of size 3x3. The image size remains unchanged after each convolutional layer. The first two layers use the LeakyReLU activation function, while the last convolutional layer uses the Tanh function. The network employs Adam optimization, with a learning rate that decays as iterations increase to prevent small gradients from vanishing. Finally, the network outputs the weights w, which are used to regulate and balance the weights in the super-resolution network.
When training the weight generation network, real high-resolution images are used directly. This not only makes the best use of image features but also helps the network converge faster. The dimension of the generated w is the same as that of the real high-resolution image.
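A minimal PyTorch sketch of the weight calculation network described above: three convolutional layers (64 kernels of 5x5, 128 of 3x3, 3 of 3x3) with LeakyReLU, LeakyReLU, and Tanh activations, and padding chosen so the spatial size is unchanged. The 3-channel input/output and the LeakyReLU slope are assumptions for RGB images.

```python
import torch
import torch.nn as nn

class WeightNet(nn.Module):
    """Weight calculation network: maps a high-resolution image to a
    weight map w of the same dimensions."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=5, padding=2),   # layer 1: 64 x 5x5
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), # layer 2: 128 x 3x3
            nn.LeakyReLU(0.2),
            nn.Conv2d(128, 3, kernel_size=3, padding=1),  # layer 3: 3 x 3x3
            nn.Tanh(),  # bounded output; the paper normalizes w to at most 1
        )

    def forward(self, x):
        return self.net(x)
```

With "same" padding at every layer, the output w matches the input image's dimensions, as required for the element-wise weighted loss.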
3.2 Super-resolution generative network
Figure 4. Super-resolution generator network structure diagram
As shown in Figure 4, the super-resolution generator network also consists of three convolutional layers and corresponding activation functions. Convolutional layer 1 has 192 convolutional kernels with a kernel size of 5x5, convolutional layer 2 has 96 convolutional kernels with a kernel size of 3x3, and convolutional layer 3 has Dim convolutional kernels with a kernel size of 3x3.

Here Dim is determined by `upscale_factor`, the magnification factor of the generated super-resolution image, and by the factor 3, the number of channels in the training images (for sub-pixel upsampling, Dim = 3 × upscale_factor²). The activation functions are the same as in the weight calculation network. After the convolutional and activation layers, an upsampling step produces an image with the same matrix dimensions as the original real image.
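A sketch of the generator as described: three convolutional layers (192 kernels of 5x5, 96 of 3x3, then Dim of 3x3) with LeakyReLU activations as in the weight network, followed by sub-pixel upsampling. The use of `PixelShuffle` and the relation Dim = 3 × upscale_factor² are assumptions; the text only states that the output is upsampled to the original high-resolution dimensions.

```python
import torch
import torch.nn as nn

class SRGenerator(nn.Module):
    """Super-resolution generator: maps a low-resolution image to a
    high-resolution image enlarged by `upscale_factor`."""

    def __init__(self, upscale_factor=2):
        super().__init__()
        dim = 3 * upscale_factor ** 2  # 3 output channels per upscaled pixel
        self.net = nn.Sequential(
            nn.Conv2d(3, 192, kernel_size=5, padding=2),  # layer 1: 192 x 5x5
            nn.LeakyReLU(0.2),
            nn.Conv2d(192, 96, kernel_size=3, padding=1), # layer 2: 96 x 3x3
            nn.LeakyReLU(0.2),
            nn.Conv2d(96, dim, kernel_size=3, padding=1), # layer 3: Dim x 3x3
            nn.PixelShuffle(upscale_factor),  # rearrange channels into space
        )

    def forward(self, x):
        return self.net(x)
```

Sub-pixel convolution keeps all feature computation at low resolution and performs the enlargement only in the final rearrangement step, which keeps this shallow network cheap.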
The loss function of the super-resolution generator network is related to the weights w generated by the weight calculation network. Each time the weights of the super-resolution generator network are updated, the weights of the weight calculation network are not updated temporarily.
In this design, the loss function of the super-resolution generator network and that of the weight calculation network train at similar speeds. The learning rates of the two networks should therefore change at the same rate, to avoid making the algorithm difficult to converge.
4. Experimental Results
This paper trained the model on 1000 pairs of low-resolution images and their corresponding high-resolution counterparts; the trained model performed well in testing. The training results are shown in Figure 4, and comprehensive testing and comparison were conducted using the Comic, Baboon, Lenna, and Zebra images as examples.
The weights generated by the weight calculation network are shown in Figure 5; they represent the high-frequency information of the images as extracted by the algorithm. As the figure shows, the algorithm fully extracts this high-frequency information. Lenna has relatively few high-frequency regions, while the other three images contain more. Consequently, Lenna is easier to reconstruct and its generated image achieves a higher PSNR, while Comic, Baboon, and Zebra achieve relatively lower PSNR values in exchange for balanced subjective perception.
Figure 5. Weight calculation network generates weight graph.
As can be seen from the detail views in Figures 6 and 7, both the SRGAN algorithm and the algorithm in this paper significantly improve the clarity of the generated images after adding the adversarial network. For high-frequency regions, such as the hair in Figure 7, the result of the proposed algorithm appears more delicate. Compared with SRCNN, the proposed algorithm generates the high-frequency components of the image more faithfully and achieves better subjective perception.
The model test results are compared with those of the SRCNN and SRGAN models, as shown in Table 1:
Table 1. Quality Analysis Comparison of Algorithms with SRCNN and SRGAN Algorithms
The main evaluation metrics were PSNR and SSIM. As the table and figures above show, the super-resolution generation algorithm based on adversarial loss proposed in this paper improves on the SRCNN deep learning model in most super-resolution reconstruction metrics, while also achieving good subjective perception.
5. Conclusion
The most significant improvement in this algorithm is the addition of an adversarial weight network, and the key to this adversarial design lies in the choice of loss functions. The algorithm employs a shallow network, which is well suited to machine vision applications, especially on embedded devices. Its main strength is the improved generation of the high-frequency components of images, achieved through the adversarial interplay of two different loss functions, which enhances subjective perception. Since deepening the weight calculation and super-resolution generation networks does not significantly improve performance over other deep learning-based networks, future work could focus on further refining the loss function metrics, such as optimizing the loss function weights.