Image captioning aims to automatically generate natural language descriptions for input images, which can assist visually impaired individuals in perceiving their surroundings and help people process large amounts of unstructured visual information more conveniently. Current mainstream methods rely on end-to-end training and optimization of deep encoder-decoder frameworks. However, because the correspondence between visual concepts and semantic entities is imperfect, the recognition and understanding of fine-grained semantics in image captioning remains insufficient. This paper addresses this problem by proposing an attention mechanism based on detection features and Monte Carlo sampling, together with a sequence optimization method based on improved policy gradients, and integrates the two into a unified image captioning framework.
In our method, to better extract strongly semantic image features, we first replace the generic convolutional network with Faster R-CNN as the encoder. On this basis, we design a Reinforce Attention mechanism based on Monte Carlo sampling to select the visual concepts worth attending to at the current time step, achieving more accurate semantic guidance. In the sequence optimization stage, we improve the evaluation function of the policy gradient with a discount factor and a term frequency-inverse document frequency (TF-IDF) factor, so that words carrying stronger semantics receive larger reward values during caption generation, contribute more gradient information, and better guide sequence optimization. We train and evaluate mainly on the MS COCO dataset, where our model achieves significant improvements on all standard metrics. Taking CIDEr as an example, compared with the representative methods [5] and [7], our model improves the final score by 8.0% and 4.1%, respectively.
Image captioning aims to generate a matching natural language description for an input image, and its workflow is shown in Figure 1(a). Open-domain image captioning is a challenging task because it requires not only fine-grained semantic understanding of all local and global entities in the image, but also the generation of attributes and relationships between these entities. From an academic perspective, research in image captioning has greatly inspired discussions on how to better integrate computer vision (CV) and natural language processing (NLP). In terms of practical applications, advancements in image captioning are crucial for building better AI interaction systems, especially in assisting visually impaired individuals to better perceive the world and in more comprehensively and conveniently helping people organize and understand massive amounts of unstructured visual information.
Research in image captioning has progressed rapidly, with many landmark works emerging in recent years. Visual attention models built on deep encoder-decoder frameworks currently achieve good results on the standard image captioning datasets. These attention models primarily extract spatially salient regions so that they can be mapped more accurately to the words to be generated. Numerous improvements have followed, and recent research has focused on integrating bottom-up object detection and attribute prediction with attention mechanisms, yielding significant gains on the evaluation metrics. However, all of these works use word-level training and optimization, which causes two problems. The first is "exposure bias": during training, the model predicts the next word conditioned on the given ground-truth prefix, but at test time it must condition on its own previously generated words. The second is the inconsistency between the training and evaluation objectives: training minimizes the cross-entropy loss, while the generated captions are evaluated with non-differentiable NLP-specific metrics such as BLEU [11], ROUGE, METEOR, and CIDEr.
To address these issues, recent works have introduced reinforcement learning-based optimization. By leveraging policy gradients and baseline functions, they lift the original word-level training to a sequence-level model, largely compensating for the shortcomings of the original approach and improving captioning performance. However, these methods also have limitations. For example, in [5] and [10], a complete caption is generated by a single sequence sampling, yielding one reward value that all words are then assumed to share during gradient optimization. This is clearly unreasonable in most cases: different words have different parts of speech, different semantic emphases, and carry significantly different amounts of implicit information; they should be treated as distinct linguistic entities and correspond to different visual concepts during training. To address these issues, we propose an image captioning method that integrates a Reinforce Attention mechanism with sequence optimization.
In our method, we first replace the general convolutional network with Faster R-CNN as the encoder to extract strong semantic features based on object detection and attribute prediction from the input image. Then, we design a Reinforce Attention mechanism based on Monte Carlo sampling to filter out visual concepts worthy of attention at the current moment, achieving more accurate semantic entity guidance. In the Sequence Optimization stage, we use the policy gradient method to calculate the approximate gradient of the sequence. When calculating the reward value for each sampled word, we improve the original policy gradient function using a discount factor and a term frequency-inverse document frequency (TF-IDF) factor, so that words with stronger semantic meaning in generating captions receive larger reward values, thus contributing more gradient information to training and better guiding sequence optimization. In experiments, our performance on the MS COCO dataset outperforms current baseline methods across all performance metrics, demonstrating the effectiveness of our design.
Image Captioning Methods
Generally, image captioning methods fall into two main categories: template-based and neural network-based. The former fills a caption template using the outputs of object detection, attribute prediction, and scene understanding. The method proposed in this paper adopts the same framework as the latter, so below we mainly review work on neural network-based image captioning.
In recent years, a series of works on deep encoder-decoder architectures incorporating visual attention mechanisms have achieved excellent results on various standard datasets for image captioning tasks. The core mechanism of these methods lies in the fusion of convolutional and recurrent networks with visual attention mechanisms, which better mines implicit contextual visual information and fully integrates local and global entity information during end-to-end training, thus providing stronger generalization capabilities for caption generation. Many subsequent works have followed this path: on the one hand, they continue to strengthen and improve the effectiveness of attention mechanisms, proposing new computational modules or network architectures; on the other hand, some works focus on integrating feature extraction and representation methods based on detection frameworks with attention mechanisms to achieve better entity capture capabilities.
However, current visual attention-based methods trained purely at the word level with cross-entropy have two significant drawbacks: exposure bias and objective inconsistency. To address these issues, reinforcement learning-based optimization has been introduced into image captioning. A particularly representative work is [10], which recasts the problem as policy gradient optimization and applies the REINFORCE algorithm; to reduce variance and improve training stability, [10] also proposed a hybrid incremental training method. Subsequently, [5], [15], and other works made different improvements on this basis, mainly proposing better baseline functions to improve sequence optimization more efficiently. A significant limitation remains, however: when sampling and approximating the sequence gradient, all words in a sentence are assumed to share a common reward value, which is clearly unreasonable. To address this deficiency, we introduce two optimization strategies. First, starting from the computation of the evaluation function in reinforcement learning, we introduce a discount factor to compute more accurately the reward, and hence the backpropagated gradient, of each sampled word. Second, following the original intention of metric-driven learning, we introduce a TF-IDF factor into the reward computation to better exploit the driving role of strong linguistic entities in optimizing the whole sequence.
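The two reward adjustments described above can be sketched numerically. The following is a minimal NumPy sketch, not the paper's implementation: it assumes the sentence-level reward (e.g. a CIDEr score minus a baseline) arrives at the end of the caption, discounts words farther from the end more heavily by a factor gamma, and scales each word's reward by a precomputed TF-IDF weight. The function name, the baseline handling, and the discount convention are illustrative assumptions.

```python
import numpy as np

def per_word_rewards(sentence_reward, tfidf_weights, gamma=0.95, baseline=0.0):
    """Distribute one sentence-level reward over the sampled words.

    Instead of letting every word share the same reward, word t receives
    the terminal reward discounted by gamma**(T-1-t) (words farther from
    the end of the caption are discounted more, one plausible convention)
    and scaled by that word's TF-IDF weight, so semantically strong words
    contribute more gradient information.
    """
    tfidf = np.asarray(tfidf_weights, dtype=float)
    T = tfidf.shape[0]
    discounts = gamma ** np.arange(T - 1, -1, -1)  # [gamma^(T-1), ..., gamma^0]
    return (sentence_reward - baseline) * discounts * tfidf
```

These per-word values would then replace the single shared reward when weighting each word's log-probability gradient in the REINFORCE update.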
Method
The overall framework of our model is shown in Figure 1, where (a) is the forward computation process from input to output, and (b) is the sequence optimization process based on reinforcement learning. Below, we introduce the details of our method in three parts: semantic feature extraction, the caption generator, and sequence optimization.
Figure 1(a) Model forward computation flow
Figure 1(b) Sequence optimization process based on reinforcement learning
1. Semantic Feature Extraction
Unlike common practice, we do not extract plain convolutional feature vectors from the input image. Instead, we extract semantic feature vectors based on object detection and attribute prediction, which match the linguistic entities in real captions better during training. In this paper, we use Faster R-CNN as the visual encoder of the image captioning model. Given an input image, the semantic features to be output are denoted $V = \{v_1, \dots, v_k\}$. We apply non-maximum suppression to the final outputs of Faster R-CNN, and for each selected candidate region $i$ we define $v_i$ as the pooled convolutional feature of that region. We first initialize the encoder with ResNet-101 pre-trained on ImageNet, and then train it on the Visual Genome dataset, which provides the annotations used for attribute prediction. After this training, we concatenate each region's pooled convolutional feature with its attribute prediction output vector to obtain the final semantic feature vector.
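The region selection step above relies on non-maximum suppression. The following is a generic greedy NMS sketch in NumPy, not Faster R-CNN's internal implementation; the box format and IoU threshold are illustrative assumptions.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy non-maximum suppression.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns the indices of the boxes kept, highest score first.
    """
    order = np.argsort(scores)[::-1]  # process boxes from highest score down
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # intersection of box i with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # discard boxes that overlap box i too strongly
        order = rest[iou < iou_thresh]
    return keep
```

Each surviving region index would then be used to gather the pooled convolutional feature and attribute vector that are concatenated into $v_i$.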
2. Caption Generator
(1) Model Structure and Objective Function
Given an image and its corresponding semantic feature vector $V$, our model must generate a matching caption $Y = \{y_1, y_2, \dots, y_T\}$, $y_t \in \mathcal{D}$, where $\mathcal{D}$ is our extracted dictionary, whose size is denoted $|\mathcal{D}|$. Overall, our generator consists of two LSTM layers. The operation performed at each time step of caption generation can be formulated as follows:
$h_t, c_t = \mathrm{LSTM}(x_t, h_{t-1}, c_{t-1})$ (1)
Here, $\mathrm{LSTM}(\cdot)$ represents the internal computation graph of a standard LSTM, and $x_t$, $h_t$, and $c_t$ represent its input vector, output vector, and memory cell at time step $t$, respectively. The likelihood of each word is determined by its conditional probability, which is computed in each forward pass according to the following formula:
$p(y_t \mid y_{1:t-1}) = \mathrm{softmax}(W_p h_t^{(2)} + b_p)$ (2)
Here, $h_t^{(2)}$ is the output of the second LSTM layer, and $W_p$ and $b_p$ are the weights and biases to be learned, respectively. The probability of the currently generated sequence is the product of the conditional probabilities of all words: $p(y_{1:T}) = \prod_{t=1}^{T} p(y_t \mid y_{1:t-1})$.
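The per-word distribution and the sequence probability above can be illustrated with a small NumPy sketch; the weight matrix, bias, and dimensions here are random placeholders standing in for the learned parameters, not the model's actual values.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def word_distribution(h, W_p, b_p):
    # Conditional distribution over the dictionary:
    # p(y_t | y_1..y_{t-1}) = softmax(W_p h + b_p)
    return softmax(W_p @ h + b_p)

def sequence_probability(per_step_probs):
    # Probability of the whole caption: the product of the conditional
    # probabilities of the words actually chosen at each time step.
    return float(np.prod(per_step_probs))
```

In the real model, `h` is the second LSTM layer's output at the current step, and the probability of the chosen word at each step is read off the distribution before taking the product.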