
Fine-grained classification based on coarse label constraints

2026-04-06

Abstract: Fine-grained classification identifies species with high inter-class similarity. Taking bird classification as an example, different classes not only share strong morphological similarities but are also closely related in taxonomy, often belonging to the same order and family. Most current classification methods use individual class labels as supervision, which cannot express the taxonomic correlation between different classes. This paper instead exploits this correlation to improve fine-grained classification performance. To this end, we propose a novel coarse label representation and a corresponding cost function. The idea of the coarse label representation comes from class representation in multi-label learning. It expresses the structural information of different classes in taxonomy, and the coarse labels are obtained simply from the suffix of the class name or are pre-given in datasets such as CIFAR-100. The proposed cost function fully utilizes this coarse label supervision to improve fine-grained classification. Our method generalizes to any fine-tuning task; it neither increases the size of the original model nor adds extra training time. Extensive experiments show that coarse label constraints improve classification accuracy.

Keywords: image recognition; fine-grained classification; coarse label constraint

1. Introduction

Fine-grained classification aims to distinguish between very similar categories, such as birds [1,2], dogs [3], and flowers [4]. These tasks differ from traditional image classification [5] because they require expert-level knowledge to find subtle differences between categories. Fine-grained classification has wide applications in many fields, such as image search, image generation, and machine teaching [6].

Most existing fine-grained classification methods use class labels and pre-defined bounding boxes as supervision information. All of these methods use individual class labels as supervision, yet the categories in fine-grained classification are highly correlated in biological taxonomy, and individual class labels cannot express this taxonomic correlation.

Current biological taxonomy is hierarchical and structured. Taking bird classification as an example, birds are categorized into order, family, genus, and species. In fine-grained classification, we usually identify the "species" rather than the corresponding "family" or "genus." This leads us to two questions: First, can the structural information of biological taxonomy be used to facilitate fine-grained classification? Second, how can this be achieved?

This paper answers the two questions mentioned above at a very basic level. We propose a novel coarse label representation and a corresponding cost function to utilize this coarse label supervision information. We refer to the commonly used category labels as fine labels, each representing an individual category, while a coarse label is the common label of multiple independent categories. The idea of coarse label representation comes from multi-label learning [7,8]. This coarse label can represent the structural relationships between categories, including the parent-child relationship between coarse and fine labels and the sibling relationship between different fine labels belonging to the same coarse label. The cost function uses this coarse label supervision to constrain the classification error of fine labels to a smaller range, thereby improving classification accuracy. Through our new coarse label representation and cost function, we achieve a 1-7 percentage point improvement in classification performance on top of existing networks. The method neither changes the size of the original model nor increases training time.

The main contributions of this paper can be summarized as follows:

We propose a novel coarse label representation that can express the correlation between different categories in biological taxonomy.

We propose a novel cost function to utilize this coarse-labeled supervision information.

We conducted extensive experiments on three fine-grained classification databases (CUB [1], Stanford Dogs, NABirds) and a conventional image classification database (CIFAR-100), achieving performance improvements of 1-7 percentage points.

The remaining parts of this paper are organized as follows: Part II introduces related work; Part III introduces the method proposed in this paper; Part IV introduces the database and network architecture used; Part V introduces experimental results and analysis; and Part VI is the conclusion.

2. Related work

2.1 Fine-grained classification

Fine-grained classification methods mainly rely on local part localization and more discriminative feature extractors. The biggest difference between fine-grained and traditional classification tasks is that the differences between fine-grained categories are very subtle. Taking bird classification as an example, two categories may differ only in wing color, so local details become a decisive factor in classification. One line of work therefore uses local information of the image to aid classification, for example by additionally processing the bird's head and body [9-12], or by exploiting attention mechanisms [13,14]. Using more discriminative feature extractors is also crucial for fine-grained classification. Owing to the success of convolutional neural networks [15,16] in traditional image classification, fine-tuning pre-trained models has become a very effective approach. In addition, ensemble learning [17] and new feature encoding methods [18,19] have achieved good results in fine-grained classification; these works may be combined with our method in the future.

2.2 Transfer Learning

Network models pre-trained on the ImageNet [5] dataset have been widely used for transfer learning. A pre-trained model can serve as a fixed feature extractor, or the entire network can be fine-tuned. Compared with traditional image classification, fine-grained classification datasets are much smaller; moreover, when collecting biological data for fine-grained classification, some rare species are difficult to photograph, resulting in an uneven distribution of images across categories. These factors make it very difficult to train a fine-grained classification model from scratch. Recently, fine-tuning networks with large-scale noisy web data [20], or transferring from large-scale fine-grained datasets [6,21] to small ones, has yielded impressive experimental results.

2.3 Multi-label learning

In image classification, multi-label learning studies how a single image corresponds to a set of labels, while conventional image classification studies how a single image corresponds to a single label. To some extent, conventional image classification can be considered a special case of multi-label learning. There are two main differences between our method and multi-label learning. First, in multi-label learning, each dimension of the category vector represents whether that category appears. Assuming there are N categories, there are 2^N possible combinations of multi-label categories. We use a representation rule for multi-label categories to represent coarse labels, but the number of coarse labels is less than N. Second, in multi-label learning, the network output is a multi-label vector; our method uses coarse labels as supervision information, and the final output is a single label.

3. Methods

We create a novel coarse label representation that effectively represents the correlation between different categories in biological taxonomy. Simultaneously, we propose a novel cost function to leverage this coarse label supervision information and improve the network's classification performance.

3.1 Method for representing coarse labels

The concept of coarse labels is the opposite of fine labels. For an instance in an image, a fine label represents the specific category it belongs to, while a coarse label is usually a common label of several related fine labels. We typically use an additional label to describe the coarse label of an instance. This incurs additional storage overhead and makes it difficult to merge coarse and fine labels during training.

The CIFAR-100 dataset provides both the class and the superclass to which an image belongs. CIFAR-100 has 100 classes, each containing 600 images. These 100 classes are further divided into 20 superclasses. Each image has a "fine" label (its class) and a "coarse" label (its superclass). For example, the superclass "fish" has five subclasses: aquarium fish, flatfish, ray, shark, and trout. In this case, we use the additional label "fish" to represent the coarse label. Table 1 shows the fine and corresponding coarse labels for CIFAR-100.
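As an illustration, this superclass-to-class mapping can be stored as a plain dictionary and inverted to look up an image's coarse label from its fine label. A minimal sketch, showing only the "fish" superclass and following the official CIFAR-100 class names:

```python
# Minimal sketch: only the "fish" superclass of CIFAR-100 is shown here;
# the full dataset defines 20 superclasses over 100 classes.
superclass_to_classes = {
    "fish": ["aquarium fish", "flatfish", "ray", "shark", "trout"],
}

# Invert the mapping: fine label -> coarse label.
class_to_superclass = {
    fine: coarse
    for coarse, fines in superclass_to_classes.items()
    for fine in fines
}
```

With the full 20-superclass dictionary, the same inversion yields the coarse label of any CIFAR-100 image directly from its fine label.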

In multi-label learning, we use category vectors to represent instances. Multi-label learning studies the association of a single instance with multiple labels. Assuming there are N categories in total, the position i of the multi-label vector is 1, indicating that the instance belongs to class i. The N-dimensional multi-label vector representing an instance is shown below:

[0,0,1,0,0…1,0,0,1,0,0] (1)

In fine-grained classification, an instance is associated with a single label, and the class vector is in one-hot encoding. Assuming there are N classes in total, the position i of the class vector is 1, indicating that the instance belongs to class i. The N-dimensional single-label vector representing an instance is shown below:

[0,0,0…0,1,0,0,0,0] (2)

Each fine label has exactly one corresponding coarse label, and each coarse label corresponds to at least one fine label. We assume there are N fine labels in total. For a given coarse label, we assume there are n corresponding fine labels, denoted a1, a2, ..., an. We use single-label vectors to represent fine labels, where a 1 at position i of the vector indicates membership in class i. The coarse label vector is the union of the label vectors of all its corresponding fine labels. Therefore, the N-dimensional coarse label vector representing a given instance is as follows:

[1,1,0,0…0,0,1,0,0] (3)

The following are all the fine labels corresponding to this coarse label:

[1,0,0,0…0,0,0,0,0] (4)

[0,1,0,0…0,0,0,0,0]

[0,0,0,0…0,0,1,0,0]
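The construction above can be sketched in a few lines. Assuming NumPy and a toy dimension of N = 9 (matching the example vectors in (3) and (4), with fine labels 0, 1, and 6 sharing one coarse label), the coarse label vector is the element-wise union (logical OR) of the one-hot vectors of its fine labels:

```python
import numpy as np

def one_hot(i, n):
    # Single-label (one-hot) vector for fine label i among n classes.
    v = np.zeros(n, dtype=int)
    v[i] = 1
    return v

def coarse_vector(fine_indices, n):
    # Union (element-wise OR) of the one-hot vectors of all fine labels
    # that share this coarse label.
    v = np.zeros(n, dtype=int)
    for i in fine_indices:
        v[i] = 1
    return v

# Toy example with N = 9: fine labels 0, 1, and 6 belong to one coarse label.
z = coarse_vector([0, 1, 6], 9)
```

Note that the coarse vector has the same dimension N as the fine vectors, so no extra storage layout is needed to mix coarse and fine labels during training.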

In biological taxonomy, the relationships between biological categories are typically represented by parent-child nodes and sibling nodes, requiring a multi-level tree structure for storage. While tree structures can represent many relationships, the category information within this structure is difficult to effectively utilize in machine learning. In machine learning, the supervisory information used is usually simple category labels rather than complex data structures. In contrast, our proposed coarse label representation can represent the structural relationships between categories. Specifically, our proposed coarse label representation includes structural information between fine labels; this structural information includes not only the parent-child relationship between coarse and fine labels but also the sibling relationship between different fine labels.

3.2 Cost Function

This paper proposes a novel cost function to utilize this coarse label supervision information. This cost function combines the Sigmoid cross-entropy function with the Softmax cross-entropy function, effectively leveraging coarse labels to improve fine label classification. In deep network learning, the cost function is a crucial metric for evaluating training performance, and the goal of network parameter tuning is to minimize the cost function. Commonly used cost functions in convolutional neural network training include the Softmax cross-entropy function, the Sigmoid cross-entropy function, and others.

We assume a neural network with parameters θ, an input image x, a correct (one-hot) label y, and N possible classification categories. The conditional probability generated by the neural network for the input image x is pθ(x). We can then calculate the softmax cross-entropy between the correct label and the conditional probability:

L_softmax = −∑_{i=1}^{N} y_i · log(pθ(x)_i) (5)

Sigmoid cross-entropy is a commonly used metric in discrete classification tasks where the classes are independent and not mutually exclusive; for example, in multi-label classification an image can contain both houses and trees. In fine-grained classification based on coarse label constraints, assuming the input image x uses the newly proposed coarse label z as the correct label, we calculate the sigmoid cross-entropy between the network output and z, where each dimension of pθ(x) is obtained by a sigmoid rather than a softmax:

L_sigmoid = −∑_{i=1}^{N} [ z_i · log(pθ(x)_i) + (1 − z_i) · log(1 − pθ(x)_i) ] (6)

For an input image x, with fine label y and coarse label z as supervision information, the final cost function is:

L_final = a · L_softmax + b · L_sigmoid (7)

The final cost function consists of two parts: L_softmax and L_sigmoid. In traditional image classification, we typically use L_softmax alone as the cost function. Here, we minimize L_softmax using the fine labels and minimize L_sigmoid using the coarse labels. The coarse labels encode the sibling relationships between fine labels belonging to the same coarse label; during minimization, L_sigmoid constrains the classification errors of fine labels to fine labels under the same coarse label, while L_softmax teaches the model how to correctly classify fine labels. The parameters a and b are two hyperparameters that weigh the influence of L_softmax and L_sigmoid on L_final; we typically set a to 1 and vary b.
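A minimal NumPy sketch of the combined cost for a single image. The class count, logits, and example labels below are illustrative; the softmax term scores the one-hot fine label against the softmax output, while the sigmoid term scores each dimension independently against the multi-hot coarse label:

```python
import numpy as np

def softmax_ce(logits, y):
    # Softmax cross-entropy with the one-hot fine label y.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -np.sum(y * np.log(p))

def sigmoid_ce(logits, z):
    # Sigmoid cross-entropy with the multi-hot coarse label z;
    # each dimension is an independent binary decision.
    s = 1.0 / (1.0 + np.exp(-logits))
    return -np.sum(z * np.log(s) + (1 - z) * np.log(1 - s))

def final_loss(logits, y, z, a=1.0, b=1.0):
    # Weighted sum of the two cross-entropies: a * L_softmax + b * L_sigmoid.
    return a * softmax_ce(logits, y) + b * sigmoid_ce(logits, z)

# Toy example: 3 classes, fine label 0, coarse label covering classes 0 and 1.
logits = np.array([3.0, 0.0, -3.0])
y = np.array([1.0, 0.0, 0.0])
z = np.array([1.0, 1.0, 0.0])
loss = final_loss(logits, y, z, a=1.0, b=1.0)
```

In a deep learning framework, the same combination is obtained by summing the framework's softmax cross-entropy and element-wise sigmoid cross-entropy losses over the shared logits.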

4. Experimental Setup

We use the open-source TensorFlow[22] and PyTorch frameworks to train all models on multiple NVIDIA TITAN X GPUs. We will briefly introduce the three fine-grained classification datasets and one standard image classification dataset used in this paper, and we will also briefly introduce the neural networks used for fine-tuning in this paper.

4.1 Dataset

This paper selects three mainstream fine-grained classification databases, CUB, NABirds, and Stanford Dogs, and a conventional image classification database, CIFAR-100, as benchmarks.

The CUB dataset contains 5,994 training images and 5,794 test images across 200 classes. We simply group classes whose class-name suffixes are identical, yielding 70 superclasses; therefore, for the CUB dataset there are 200 fine labels and 70 coarse labels. The NABirds dataset contains 23,929 training images and 24,633 test images across 555 classes, which the same procedure divides into 156 superclasses. The Stanford Dogs dataset contains 12,000 training images and 8,580 test images across 120 classes, divided into 72 superclasses in the same way.
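A sketch of this suffix-based grouping, assuming CUB-style class names in which the last underscore-separated word names the species group (the four class names below are illustrative):

```python
from collections import defaultdict

# Illustrative CUB-style class names; the real dataset has 200 of them.
names = [
    "Black_footed_Albatross",
    "Laysan_Albatross",
    "Crested_Auklet",
    "Least_Auklet",
]

def coarse_of(name):
    # The class-name suffix (last underscore-separated word) is the coarse label.
    return name.split("_")[-1]

groups = defaultdict(list)
for n in names:
    groups[coarse_of(n)].append(n)
```

Each resulting group becomes one coarse label; its members are the sibling fine labels whose one-hot vectors are unioned into the coarse label vector.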

We also used the standard image classification dataset CIFAR-100 for our research. The CIFAR-100 dataset has 100 classes, each containing 600 images: 500 training images and 100 test images. The 100 classes in CIFAR-100 are divided into 20 superclasses. Each image has a "fine" label (its class) and a "coarse" label (its superclass). We used the official partitioning as our classification standard. The four datasets mentioned above are listed in Table 2:

4.2 Network Framework

We fine-tuned three network architectures on the three fine-grained classification datasets: VGG19 [23], ResNet50 [15], and Inception-V3 [16]. For the standard image classification dataset, we fine-tuned VGG19 and WideResidualNetwork [24].

VGG is a common network in fine-grained classification, such as Bilinear-CNN[18] which uses VGG as a feature extractor. VGG employs a deeper network structure than AlexNet[25], and it won first and second place in localization and classification in the 2014 ILSVRC competition. VGG networks typically have 16-19 layers and a kernel size of 3x3. This paper uses a 19-layer VGG network.

Residual networks effectively mitigate gradient vanishing and allow for deeper network structures. In our experiments, we used ResNet50 as a representative residual network.

Inception-V3. The Inception module was originally proposed in GoogLeNet; it was later optimized by introducing Batch Normalization, residual connections, and other improvements. In our experiments, we use the Inception-V3 network as a representative of the Inception series.

WideResidualNetwork addresses the observation that high-performing residual networks are typically very deep, yet many residual units contribute only a small amount of useful information. The authors argue that the performance of residual networks comes primarily from the residual units themselves, with increased depth playing only a secondary role. They therefore reduced the depth and increased the width of the residual network; the resulting 16-layer WideResidualNetwork performs comparably to a 1000-layer residual network on standard image classification datasets.

5. Results Analysis

5.1 Fine-grained classification dataset

We first conduct experiments on three fine-grained datasets, fine-tuning three pre-trained network models on the ImageNet database. Our experiments consist of two steps: the first step uses only fine labels as supervision information, and the second step uses coarse labels as new supervision information. In the second step, we set two parameters of the cost function to a=1 and b=1, while the remaining hyperparameters are the same as in the first step. Experimental results show that our method can improve performance on any dataset and any pre-trained network. The results are shown in Tables 3, 4, and 5.

Taking the CUB dataset as an example, with the VGG19 pre-trained model the accuracy improves by nearly 7 percentage points after applying coarse label constraints, and by 2 percentage points with ResNet50 or Inception-V3. On the ImageNet database, VGG19 performs worse than ResNet50 and Inception-V3, indicating that VGG's feature extraction ability is weaker. Introducing coarse label supervision largely closes this gap; with our method, VGG19 can match ResNet50. In (7), the cost function consists of a·L_softmax and b·L_sigmoid, and the parameters a and b weigh their relative contributions during backpropagation. We usually set a to 1 and then vary b; if b is greater than a, the sigmoid cross-entropy has a greater influence. In our experiments, we found that setting b greater than a usually yields better results, adding up to nearly one more percentage point. For example, when fine-tuning Inception-V3 on the CUB dataset with coarse label constraints, setting b=2 gave a final result 0.6% higher than b=1. However, the values of a and b still need to be tuned manually, and the best values differ across datasets and models; therefore, we set a=1 and b=1 in the experiments above.

5.2 Standard Image Classification Dataset

We evaluated our method on the standard image classification database CIFAR-100 using two convolutional neural networks, VGG19 and WideResidualNetwork. CIFAR-100 has 100 subclasses and 20 superclasses, each superclass containing five more refined subclasses. Experimental results are shown in Table 6.

As shown in Figure 1, after introducing the coarse label constraint mechanism, the accuracy of the network on the test set is consistently higher than that of the original WRN, indicating that the constraint mechanism does indeed improve the performance of the original WRN.

Figure 1. Test set accuracy after 100 epochs using WRN and WRN with coarse label constraints.

In our experiments, we used the same learning rate and number of iterations for the network with coarse label constraints and the original network. The two accuracy curves track each other closely, indicating that introducing the constraint mechanism does not significantly change the convergence trend of the network. Furthermore, in the initial training phase, the network with coarse label constraints converges faster and its accuracy rises more rapidly, suggesting that coarse label constraints accelerate convergence and steer the network in the correct direction. Table 7 shows a comparison with existing methods.

6. Summary

In this work, we propose a novel coarse label representation that effectively expresses the structural information between categories. We also propose a corresponding cost function that leverages this coarse label supervision information to guide the convergence of fine labels through coarse label constraints. Extensive experiments were conducted on three fine-grained classification datasets and one standard image classification dataset. Experimental results demonstrate that this method accelerates network convergence and consistently improves the performance of the original network.

Using coarse label constraints is easy to implement and can be generalized to any fine-tuning task; it does not increase the size of the original model or add extra training time. Therefore, our method should be beneficial for a large number of models. In the future, we plan to combine our method with existing methods to reduce classification errors.
