1. Abstract
This paper uses adversarial examples in the training of deep neural network acoustic models for speech recognition and keyword spotting to improve the robustness of these models. During training, adversarial examples are generated as augmentations of the original training samples using the fast gradient sign method. Unlike traditional data augmentation methods based on data transformation, the proposed method is model- and data-dependent: it dynamically generates adversarial examples from the current model parameters and the current training data. In speech recognition experiments on the Aurora-4 database, the proposed method significantly improves the model's robustness to noise and channel interference. Furthermore, by combining the proposed data augmentation method with a teacher/student learning strategy, we achieve a 23% relative reduction in word error rate on Aurora-4. In the keyword spotting task, the proposed method also significantly reduces the false alarm and false rejection rates of attention-based wake-up models.
Keywords: Robust speech recognition, keyword spotting, adversarial examples, fast gradient sign method, data augmentation
2. Introduction
In recent years, with the rise of deep learning (DL) and the successful application of deep neural networks (DNNs) to acoustic models, automatic speech recognition (ASR) [1][2] and keyword spotting (KWS) [3][4] have developed rapidly. Various network structures, such as CNNs, RNNs and LSTMs, have been successfully applied to acoustic modeling. In practical applications, DNN-based acoustic models have shown good noise robustness, because their structure and multi-layer nonlinear transformations give them strong modeling capability. Even so, DNN-based ASR and KWS systems are still affected by noise, reverberation and channel factors [6], resulting in degraded recognition performance. To address these issues, a great deal of work has been done in areas such as data augmentation [7], single/multi-channel speech enhancement, feature transformation, and effective learning strategies such as teacher/student (T/S) learning [8] and adversarial training [9]. In this paper, we focus on data augmentation methods to improve the robustness of ASR and KWS systems.
When there is a distribution mismatch between training and test data, the performance of acoustic models degrades significantly. To compensate for this mismatch, data augmentation is a very effective and widely adopted method. Data augmentation creates noisy copies of clean data by adding noise, reverberation, and other interference to simulate real noisy conditions, thereby increasing the diversity of the training data; the augmented data is then used for model training. This training scheme is called multi-scenario training. In addition, T/S learning is also a commonly used method to improve model robustness. It can be used in both supervised and unsupervised scenarios. T/S learning requires parallel clean/noisy data to train the T model and the S model separately.
In order to improve the robustness of the model to noise, this paper proposes a method to augment data using adversarial examples. The concept of adversarial examples was first proposed in computer vision in [10]. Researchers found that for a fully trained image recognition network, if some very subtle pixel-level perturbations are applied to an image that is correctly classified, the model will misclassify the perturbed image, even though the perturbation is imperceptible to the human eye. Such misclassified samples are called adversarial examples. Their existence shows that existing models are very sensitive to certain very small perturbations. In computer vision, adversarial examples have attracted widespread interest from researchers, and recently the research has been extended to speech signals. [12] proposed a targeted attack on end-to-end speech recognition models: given an utterance, a perturbation imperceptible to the human ear is generated such that the perturbed speech is recognized as any chosen target text. Similarly, in a KWS system, we can naturally regard false alarmed (FA) or false rejected (FR) samples as adversarial examples: the system may wake up on samples completely unrelated to the keyword, or incorrectly reject inputs that clearly contain the keyword. Due to the complex acoustic environment and many other unpredictable factors, samples that trigger FA and FR are often unreproducible, and this makes further improvements to KWS performance very difficult.
Previous work on adversarial examples and model robustness mainly aimed to make models robust to the adversarial examples themselves. In our work, the goal is instead to improve the robustness of the model to ordinary noisy data by using adversarial example-based data augmentation. During training, the fast gradient sign method (FGSM) [11] is used to dynamically generate adversarial examples; compared with other methods, FGSM is more efficient. For each mini-batch of training data, after the adversarial examples are generated, the model parameters are updated using those examples. In addition, in the ASR task, we combined the proposed adversarial example-based data augmentation with T/S learning and found that the gains of the two methods are additive.
The remainder of this paper is organized as follows: Section 3 details the generation of adversarial examples using FGSM; Section 4 introduces the application of adversarial examples in acoustic model training; Section 5 presents the experimental setup and results; and Section 6 concludes the paper.
3. Adversarial Examples
Adversarial example definition
The purpose of adversarial examples is to successfully disrupt a well-trained neural network model. Even a very good model is particularly vulnerable to adversarial attacks, meaning that the model's predictions are easily interfered with by artificial perturbations at the input, even if these perturbations are imperceptible to the human ear. These artificial perturbations are called adversarial perturbations, and the samples that are interfered with by adversarial perturbations are called adversarial examples. The existence of adversarial examples indicates that the network's output is not smooth with respect to the input; that is, a very small change at the input can cause a large jump in the output.
Generally, a machine learning model such as a neural network can be represented as a parameterized function f(x; θ), where x is the input feature vector and θ denotes the model parameters. Given an input sample x and its corresponding label y, a trained model is used to predict the label of the sample. Adversarial examples can be constructed using the following formula:

x_adv = x + δ_adv (1)

such that

f(x_adv; θ) ≠ y (2)

subject to

‖δ_adv‖_∞ ≤ ε (3)

where δ_adv is called the adversarial perturbation. For a pre-trained neural network, ordinary random perturbations generally will not affect the network's output. Therefore, the key to generating adversarial examples is the design and generation of the adversarial perturbation. Once adversarial perturbations can be generated, the resulting adversarial examples can be used as training data, thereby improving the smoothness and robustness of the model.
Generation of adversarial examples
In this paper, we use the fast gradient sign method (FGSM) to generate adversarial examples. FGSM uses the current model parameters and training data to generate the adversarial perturbation δ_adv in formula (1). Given the model parameters θ, input x, and label y, training minimizes a loss function L(θ, x, y); for common classification tasks, and in this paper, the loss is cross-entropy. After the network parameters have been optimized and the network has converged, in order to find a perturbation direction in the input space that increases the loss, i.e., a direction that can cause the network to misclassify the input, FGSM computes the perturbation as:

δ_adv = ε · sign(∇_x L(θ, x, y)) (4)

Here, ε is a very small constant. Note that FGSM uses the sign function to take the sign of the gradient of the loss with respect to the input, instead of using the gradient value directly. This satisfies the maximum-norm constraint on the perturbation and makes it easy to control its magnitude, thereby satisfying the constraint of formula (3). In the experiments below, we show that a small ε is sufficient to generate adversarial examples that enhance the robustness of the model.
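As a concrete illustration, the FGSM update of formula (4) can be sketched for a toy single-layer softmax classifier with cross-entropy loss; this toy model and the function names are illustrative stand-ins, not the paper's actual acoustic model:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fgsm_perturbation(x, y, W, b, eps):
    """delta = eps * sign(dL/dx) for a linear-softmax classifier with
    cross-entropy loss; W and b play the role of the frozen model
    parameters theta in formula (4)."""
    p = softmax(W @ x + b)
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    grad_x = W.T @ (p - onehot)   # closed-form dCE/dx for softmax + CE
    return eps * np.sign(grad_x)

# toy usage: a 4-dimensional "feature" and 3 classes
rng = np.random.default_rng(0)
W, b = rng.standard_normal((3, 4)), np.zeros(3)
x = rng.standard_normal(4)
delta = fgsm_perturbation(x, y=1, W=W, b=b, eps=0.01)
x_adv = x + delta   # formula (1)
```

By construction every component of delta is +eps, -eps, or 0, so the max-norm constraint of formula (3) holds exactly.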
4. Using adversarial examples for acoustic model training
Unlike other data augmentation methods based on data simulation, such as adding noise and reverberation, adversarial example-based data augmentation is a model- and data-dependent approach: it explicitly links the generated examples to the loss function, producing samples that increase the loss. This makes the method more efficient. Once generated, these adversarial examples are used to train the network, thereby enhancing its robustness to interference. In this work, the FGSM method dynamically generates adversarial examples for each mini-batch of training data. Algorithm 1 describes the procedure used in training the acoustic model.
Algorithm 1: Training the acoustic model using adversarial examples
In acoustic model training, the input features are typically MFCC features, and the targets are tied hidden Markov model (HMM) states. In Algorithm 1 above, each mini-batch of training data is processed in four steps: (1) update the model parameters on the original training data, then fix the parameters and generate adversarial perturbations for the current data; because FGSM uses the sign function, each dimension of the perturbation is +ε or −ε; (2) use the generated perturbations to produce adversarial examples; (3) pair the adversarial examples with the targets of the original data to form new training data; (4) train the model on the newly generated data and update the parameters. We emphasize that the adversarial examples are paired with the original labels because the perturbation in our experiments is very small, and we want the network to output the same predicted category as for the original samples. The adversarial examples generated by FGSM can significantly increase the model's loss, indicating that these samples are "blind spots" of the current model: the model fails to cover these regions, resulting in unpredictable errors.
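The four steps above can be sketched per mini-batch as follows, again with a toy linear-softmax model standing in for the CNN acoustic model; the function names, shapes, and learning rate are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def grads(X, Y, W, b):
    """Cross-entropy gradients of a linear-softmax model w.r.t. the
    parameters (toy stand-in for backprop through the acoustic model)."""
    P = softmax(X @ W.T + b)                       # (batch, classes)
    dZ = (P - np.eye(W.shape[0])[Y]) / len(X)
    return dZ.T @ X, dZ.sum(axis=0)

def train_step(X, Y, W, b, lr=0.1, eps=0.01):
    # (1) update on the original mini-batch, then freeze the parameters
    gW, gb = grads(X, Y, W, b)
    W, b = W - lr * gW, b - lr * gb
    # gradient of the loss w.r.t. the INPUT, as needed by formula (4)
    P = softmax(X @ W.T + b)
    grad_X = (P - np.eye(W.shape[0])[Y]) @ W
    # (2) adversarial examples; (3) paired with the ORIGINAL labels Y
    X_adv = X + eps * np.sign(grad_X)
    # (4) update the parameters again on the adversarial mini-batch
    gW, gb = grads(X_adv, Y, W, b)
    return W - lr * gW, b - lr * gb

# toy usage: random "features" and labels in place of MFCC mini-batches
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 40))
Y = rng.integers(0, 3, size=8)
W, b = np.zeros((3, 40)), np.zeros(3)
for _ in range(5):
    W, b = train_step(X, Y, W, b)
```

Note that the adversarial mini-batch reuses the labels Y of the original data, exactly as in step (3).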
5. Experiment
Database and system description
Aurora-4 database
The Aurora-4 database is a medium-vocabulary continuous speech recognition corpus for noise-robustness research, generated by adding noise to the Wall Street Journal (WSJ0) database. In Aurora-4, two microphones are used for recording: a primary microphone and a secondary microphone; several different models of secondary microphone are included, and both microphones simultaneously record the 7138 training sentences. The Aurora-4 training data can be divided into two parts: clean training data and multi-scenario noisy training data. The clean training data is recorded entirely with the primary microphone and contains no noise. The multi-scenario training data also consists of 7138 sentences, covering recordings from both the primary and secondary microphones as well as both clean and noisy conditions; it therefore covers more noise and channel (microphone) distortion. The Aurora-4 test set comprises four parts: a clean test set (A), a noisy test set (B), a channel-distortion test set (C), and a noisy plus channel-distortion test set (D). Set A contains 330 clean utterances recorded with the primary microphone; Set B contains 6 noisy copies of Set A, totaling 330*6 = 1980 sentences; Set C contains the 330 clean utterances recorded with the secondary microphone; Set D contains 6 noisy copies of Set C.
Wake-up database
We validated our method using wake-up data collected from the Mobvoi TicKasaFox2 smart speaker. The wake word consists of three Mandarin syllables ("Hi Xiaowen"). The dataset covers 523 different speakers, including 303 children and 220 adults. Each speaker's recordings include positive samples (containing the wake word) and negative samples, recorded at different microphone distances and signal-to-noise ratios, with noise from a typical home environment. A total of 20K positive samples (approximately 10 hours) and 54K negative samples (approximately 57 hours) were used as training data. The validation set includes 2.3K positive samples (approximately 1.1 hours) and 5.5K negative samples (approximately 6.2 hours), while the test set includes 2K positive samples (approximately 1 hour) and 5.9K negative samples (approximately 6 hours).
System Description
In speech recognition, we used a CNN as the acoustic model; CNNs have shown strong robustness to noise in many studies. We adopted the same model structure as in [15]. For the Aurora-4 experiment, 40-dimensional FBANK features with 11 frames of context were used to train the neural network. For the CHiME-4 experiment, we used Kaldi's fMLLR features for network training. All feature extraction and the training of the Gaussian mixture model acoustic models were based on Kaldi [13]. The neural network training and the generation of adversarial examples were implemented in TensorFlow [14]. In both experiments, the development set was used to select the optimal model parameters, including the adversarial perturbation weight ε; the optimal model was then applied directly to the test set.
In the keyword spotting work, we followed the attention-based end-to-end model structure used in [5]. The encoder is a 1-layer GRU. Since negative samples are longer than positive samples, we segmented positive samples during training, with a segment length of 200 frames (about 2 seconds). During testing, a 200-frame window is slid over the input, shifted 1 frame at a time; if the score of at least one window exceeds a preset threshold, the KWS system is triggered. Our experiments were conducted in TensorFlow, with ADAM as the optimizer.
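The test-time sliding-window protocol above can be sketched as follows; `score_fn` is a hypothetical stand-in for the attention-based KWS model, and the energy-based toy score is only for illustration:

```python
import numpy as np

def detect(feats, score_fn, win=200, shift=1, threshold=0.5):
    """Slide a `win`-frame window over the utterance, advancing `shift`
    frames at a time; trigger as soon as any window's score exceeds
    the threshold."""
    for start in range(0, len(feats) - win + 1, shift):
        if score_fn(feats[start:start + win]) > threshold:
            return True
    return False

# toy usage: mean frame energy stands in for the model's keyword score
score_fn = lambda w: float(w.mean())
silence = np.zeros((300, 40))
keyword = silence.copy()
keyword[100:250] = 1.0   # pretend these frames contain the wake word
```

With a 1-frame shift the trigger decision approximates frame-synchronous detection at the cost of one model evaluation per frame.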
Experimental results
Aurora-4 speech recognition experiment
Figure 1. WER (%) as a function of the adversarial weight ε on the Aurora-4 development set.
Figure 1 illustrates the relationship between word error rate (WER) and the adversarial weight ε on the Aurora-4 development set. Based on these results, we selected the model that achieved the best performance on the development set and evaluated it on the test sets. Table 1 presents the results on the four Aurora-4 test sets; the baseline model was trained on the multi-scenario training data, and the adversarial-example model was trained using the procedure in Algorithm 1. From Table 1, after using adversarial examples we achieve an average relative WER reduction of 14.1%. The adversarial-example model improves performance on all three distorted test sets, especially set D, where the proposed method achieves an 18.6% relative WER improvement. Although recognition performance degrades on the clean test set A, this is mainly because the augmentation introduces more noisy data into training; it can be compensated for by adding more clean data.
Table 1. Comparison of WER (%) between the baseline model and the model using adversarial examples on the Aurora-4 test set.
Furthermore, the data augmentation method proposed in this paper can be combined with other learning and training strategies. To verify this, we combined it with T/S learning, and the experimental results show that the benefits of the two strategies are additive. The Aurora-4 database contains parallel clean and noisy speech; we can therefore use the clean data to train the T model and the noisy data to train the S model. When training the S model, the following loss function is used:
L(θ_S) = (1 − λ) · CE(f(x_noisy; θ_S), y) + λ · CE(f(x_noisy; θ_S), p_T) (5)

where λ ∈ [0, 1] is an interpolation weight, CE is the cross-entropy loss function, θ_S are the parameters of the S model, x_noisy are the features of the noisy data, y is the original supervision information, and p_T is the output probability distribution of the teacher model, obtained by feeding the parallel clean speech into the T model:

p_T = f(x_clean; θ_T)

where θ_T represents the trained parameters of the T model. Table 2 shows the experimental results of T/S learning combined with adversarial examples. As can be seen from Table 2, T/S learning significantly reduces WER, and combining it with adversarial examples yields the best recognition result of 8.50% WER. To verify that the gain comes from the adversarial examples rather than from the increased data volume, we replaced the adversarial perturbation with a random perturbation; the random perturbation brought only a small gain, confirming the effectiveness of adversarial examples. More details can be found in [16].
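A minimal sketch of the interpolated loss of formula (5), assuming per-frame posteriors from student and teacher; the array shapes and function name are illustrative:

```python
import numpy as np

def ts_loss(student_logp, teacher_prob, onehot, lam):
    """Formula (5): interpolate the cross-entropy against the hard
    labels with the cross-entropy against the teacher posterior."""
    ce_hard = -(onehot * student_logp).sum(axis=-1).mean()
    ce_soft = -(teacher_prob * student_logp).sum(axis=-1).mean()
    return (1.0 - lam) * ce_hard + lam * ce_soft

# toy usage: one frame, two classes
logp = np.log(np.array([[0.9, 0.1]]))     # student log-posterior
onehot = np.array([[1.0, 0.0]])           # hard label y
teacher = np.array([[0.6, 0.4]])          # teacher posterior p_T
hard = ts_loss(logp, teacher, onehot, lam=0.0)
mixed = ts_loss(logp, teacher, onehot, lam=0.5)
```

With lam = 0 the loss reduces to ordinary hard-label cross-entropy; with lam = 1 the student is trained purely against the teacher distribution.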
Table 2. Experimental results of adversarial examples and T/S combinations on the Aurora-4 test set.
Wake-up experiment
To verify the impact of the FGSM method on the model, we generated adversarial examples on the test set using FGSM. For positive-example perturbation (Pos-FGSM), the perturbation is added only to the keyword part; for negative-example perturbation (Neg-FGSM), the perturbation is added to the entire example. Our test results show that the false rejection rate (FRR) of the KWS model increases sharply when facing adversarial examples. As shown in Figure 3, we analyzed the attention-layer weights before and after adding the adversarial perturbation. The attention weights shift significantly, indicating that the attention mechanism is disrupted: the model attends to the wrong keyword position, which easily leads to incorrect outputs.
Figure 3. Attention weights before and after adversarial perturbation: (1) positive sample; (2) negative sample.
This observation confirms that the current model is indeed highly sensitive to adversarial perturbation examples. To improve the model's robustness, we further expanded the training data using adversarial examples. Specifically, we retrained the model using adversarial examples. During the training phase, adversarial examples (including positive and negative examples) were generated for the training data at each step. These examples were then used to retrain an already trained KWS model. In our experiments, we also tried different training strategies, including using only positive adversarial examples, only negative adversarial examples, and both positive and negative adversarial examples. As a control, we also included random perturbation examples.
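The difference between the Pos-FGSM and Neg-FGSM strategies comes down to where the perturbation is applied; a minimal sketch, assuming a hypothetical (start, end) keyword frame annotation:

```python
import numpy as np

def apply_fgsm(feats, grad, eps, keyword_span=None):
    """Add eps * sign(grad) to the whole example (Neg-FGSM) or, when a
    (start, end) frame span is given, only to the keyword frames
    (Pos-FGSM). `keyword_span` is a hypothetical frame annotation."""
    delta = eps * np.sign(grad)
    if keyword_span is not None:                 # Pos-FGSM
        mask = np.zeros((len(feats), 1))
        mask[keyword_span[0]:keyword_span[1]] = 1.0
        delta = delta * mask
    return feats + delta                         # Neg-FGSM if span is None

# toy usage: 10 frames of 4-dimensional features, keyword in frames 2-4
feats = np.zeros((10, 4))
grad = np.ones((10, 4))
pos = apply_fgsm(feats, grad, eps=0.01, keyword_span=(2, 5))
neg = apply_fgsm(feats, grad, eps=0.01)
```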
Table 4 shows the false rejection rate when there is one false wake-up per hour.
Figure 4. ROC curves for different training strategies
Figure 4 shows the ROC curves for the various methods, all using the same adversarial weight ε. Pos-FGSM and Neg-FGSM denote augmentation with positive and negative adversarial examples, respectively, while ALL-FGSM denotes augmentation with both. Random denotes adding a random sign perturbation to all training data instead of an adversarial perturbation. Table 4 shows the FRR on the test set when the FAR is 1.0 false alarm per hour. Adversarial-example augmentation based on Pos-FGSM and Neg-FGSM significantly reduces the FRR, by 45.6% and 24.8%, respectively; in comparison, random-perturbation augmentation also slightly improves model performance. In summary, augmenting training data with adversarial examples is an effective way to improve model robustness. More details can be found in [17].
6. Conclusion
This paper proposes a data augmentation method based on adversarial examples and applies it to robust ASR and KWS tasks. During model training, the FGSM method dynamically generates adversarial examples. On the Aurora-4 robust speech recognition task, the proposed method achieves a 14.1% relative WER reduction. Furthermore, experimental results show that combining this method with other learning techniques, such as T/S learning, yields further improvements: on Aurora-4, combining it with T/S learning achieves a 23% relative WER reduction. In the KWS task, we apply the augmentation in different ways, and the proposed method effectively reduces the FAR and FRR of attention-based KWS models.
7. References