Speech-based speaker authentication technology (voiceprint recognition technology) falls under the research scope of biometric identification and has significant application value in the Internet/IoT era. Currently, speaker authentication technology under certain conditions is relatively mature and has been widely applied in scenarios such as smartphones, banking services, access control, and intelligent customer service. However, as a key technology for realizing natural human-computer interaction based on speech in intelligent control scenarios, short-voice speaker authentication technology currently cannot meet application needs. This paper will focus on providing a technical overview of short-voice speaker verification technology. First, it outlines the basic concepts and mainstream technical routes of speaker verification technology; second, it analyzes the challenges faced by short-voice speaker verification technology; then, it reviews deep learning-based speaker verification technology; finally, it looks forward to the development trends and application prospects of short-voice speaker verification technology.
1 Introduction
Humans can identify a person by voice because each speaker has a different speaking style, different vocabulary habits, and slightly different physiological structures of the vocal organs. Together, these factors give each speaker unique voice characteristics and distinct voiceprint information. Speech-based speaker identification is a technology that uses computers to analyze and extract the speaker information contained in speech in order to automatically authenticate the speaker. It is one of the important technologies for natural human-computer interaction and a key technology for intelligent robots, and it possesses significant research value.
Speaker verification technology is currently widely used in various fields with identity authentication needs. For example, in the smart home sector, it helps smart devices verify the speaker's identity, enabling the system to provide customized services and content for different speakers. In the financial sector, it can be used for remote identity authentication in online transactions, thereby improving the security of financial accounts and reducing the success rate of internet-based financial crimes. In the public security and judicial fields, it can be used for the identification of telecommunications fraud suspects, helping the police effectively curb and combat crime. Specifically, police officers can use speaker verification technology to first extract the target speaker's voice data from telephone recordings, then match it against a speaker database, and ultimately identify the suspect. Using advanced speaker verification technology can reduce investigation costs and increase the crime-solving rate.
Research on speaker identification technology began in the 1930s, with early researchers focusing on human hearing recognition and template matching. With the development of statistics and computer science, speaker identification work began to shift towards methods such as speech feature extraction and pattern matching. In recent years, with the rise of artificial intelligence and the improvement of computing power, speaker identification technology based on machine learning and deep learning has gradually become mainstream.
This paper will first introduce the basic concept of speaker verification, then briefly review the development history of short speech speaker verification technology in intelligent voice control scenarios from the perspectives of feature extraction and short speech modeling, then analyze several types of short speech speaker verification technologies using deep learning, and finally summarize and look forward to the development trend.
2. Overview of Speaker Verification Technology
Speaker verification technology, as shown in Figure 1, is used to determine whether the speech to be identified comes from the claimed target speaker [1][2], which is a "one-to-one" decision problem. Specifically, the speaker verification task can be divided into three stages: training, registration, and verification. In the training stage: a general speaker model is trained using a large amount of data; in the registration stage: a small amount of speech data of registered speakers is collected and the voiceprint model of the registered speakers is obtained through an algorithm; in the verification stage: the test speech claimed to be the target speaker is input and the corresponding speaker model is calculated, and then matched with the registered target speaker model to finally determine whether the speaker is the registered target speaker.
Figure 1. Schematic diagram of speaker confirmation concept
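As a minimal illustration of the registration and verification stages described above, the following sketch (a generic assumption of this overview, not any specific system) averages a few utterance-level embeddings into a voiceprint model and accepts or rejects a test embedding by thresholding a cosine similarity score; the function names and threshold value are illustrative:

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def enroll(embeddings):
    """Registration stage: average a few utterance embeddings into one voiceprint model."""
    return np.mean(embeddings, axis=0)

def verify(model, test_embedding, threshold=0.5):
    """Verification stage: accept the claimed identity if the score clears the threshold."""
    return cosine_score(model, test_embedding) >= threshold
```

In a real system the embeddings would come from a trained model (the training stage); here they are simply vectors.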
2.1 Text-dependent and text-independent
Based on whether the text content of the speech is restricted, speaker verification technology can be divided into text-independent and text-dependent technologies.
Text-independent speaker verification technology: the text content of the speech data used in model training is unrestricted, and the text content of the training speech and the test speech need not be the same; that is, the speaker may say any sentence.
Text-dependent speaker verification technology: the text content of the speech data used in model training is fixed in advance within a specific range, and the text content of the training speech and the test speech must be consistent.
2.2 Performance Evaluation of Speaker Verification Technology
The two basic metrics for measuring the performance of speaker verification technology are the False Acceptance Rate (FAR) and the False Rejection Rate (FRR), defined as follows:

FAR = (number of non-target trials accepted) / (total number of non-target trials)
FRR = (number of target trials rejected) / (total number of target trials)
Here, FAR represents the error rate at which a non-target speaker's voice, after passing through the speaker verification system, has a similarity score greater than a given threshold and is thus identified as the target speaker. A lower FAR value indicates a lower probability of the system misidentifying a non-target speaker as the target speaker, and thus better system performance. In everyday situations where fast access is required and accuracy is not a primary concern, a slightly higher FAR value can be set for the speaker verification system.
On the other hand, FRR represents the error rate at which a target speaker's voice score, as determined by the speaker verification system, falls below a set threshold and is thus misidentified as a non-target speaker. It can be seen that the smaller the FRR value, the lower the probability of the system misidentifying the target speaker as a non-target speaker, and the better the system's performance. In commercial scenarios requiring high security, the speaker verification system can be set to a slightly higher FRR value, sacrificing access speed for greater system security.
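A minimal sketch of how FAR and FRR are computed from trial scores at a given threshold (the function name and toy scores are illustrative):

```python
import numpy as np

def far_frr(target_scores, nontarget_scores, threshold):
    """FAR: fraction of impostor (non-target) trials scored at or above the threshold.
       FRR: fraction of genuine (target) trials scored below the threshold."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    far = float(np.mean(nontarget_scores >= threshold))
    frr = float(np.mean(target_scores < threshold))
    return far, frr
```

Raising the threshold lowers FAR at the cost of a higher FRR, which is the trade-off discussed above.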
Based on FAR and FRR, three commonly used performance evaluation metrics for speaker verification systems can be derived.
(1) Equal error rate (EER)
In speaker verification system performance evaluation, FAR decreases as the decision threshold increases, while FRR increases. The most commonly used evaluation metric in international evaluations unifies FAR and FRR into a single value: the error rate at the threshold where FAR and FRR are equal, known as the Equal Error Rate (EER). In the EER metric, FAR and FRR are assigned equal weights, meaning they are considered to have the same impact on the system.
For different speaker verification algorithms, a lower EER value means that both the FAR and FRR curves shift downward, indicating better algorithm performance.
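The EER can be located with a simple threshold sweep; this sketch returns the operating point where FAR and FRR are closest, together with the threshold that achieves it (a finite score list makes an exact crossing unlikely, so the midpoint of the closest pair is used):

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Sweep the decision threshold over all observed scores and return the
    operating point where FAR and FRR are closest (the EER), plus that threshold."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    best_gap, best_eer, best_thr = np.inf, None, None
    for t in np.sort(np.concatenate([target_scores, nontarget_scores])):
        far = np.mean(nontarget_scores >= t)   # impostors accepted
        frr = np.mean(target_scores < t)       # genuine speakers rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer, best_thr = gap, (far + frr) / 2.0, float(t)
    return float(best_eer), best_thr
```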
(2) Minimum detection cost
The National Institute of Standards and Technology (NIST) defined a metric for evaluating the performance of speaker recognition systems using a weighted sum of FAR and FRR, namely the Detection Cost Function (DCF) [3], in its Speaker Recognition Evaluation (SRE) competition.
DCF = C_FR × FRR × P_target + C_FA × FAR × (1 − P_target)

Here, C_FR represents the weighting coefficient for false rejection, C_FA represents the weighting coefficient for false acceptance, and P_target is the prior probability that a trial comes from the target speaker. These parameter values are provided by NIST for each evaluation and vary across evaluations and tasks. In practical applications, they can be set according to the specific application scenario. The DCF value depends on the decision threshold; choosing the threshold that minimizes the DCF yields the minimum detection cost (minDCF).
Compared to EER, minDCF considers the different costs associated with the two error rates, making it more reasonable in practical applications and allowing a better evaluation of the performance of speaker verification systems.
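A hedged sketch of minDCF under the weighted-cost definition above; the default cost weights and target prior are illustrative placeholders rather than the official values of any particular NIST evaluation:

```python
import numpy as np

def min_dcf(target_scores, nontarget_scores, c_fr=10.0, c_fa=1.0, p_target=0.01):
    """Sweep the decision threshold over all observed scores and return the
    minimum of DCF(t) = C_FR * FRR(t) * P_target + C_FA * FAR(t) * (1 - P_target)."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    best = np.inf
    for t in np.sort(np.concatenate([target_scores, nontarget_scores])):
        far = np.mean(nontarget_scores >= t)
        frr = np.mean(target_scores < t)
        dcf = c_fr * frr * p_target + c_fa * far * (1 - p_target)
        best = min(best, dcf)
    return float(best)
```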
(3) DET curve
In a speaker verification system, different thresholds can be set according to different application scenarios to trade off FAR against FRR. In practical applications, the Detection Error Trade-off (DET) curve is generally used to represent the relationship between FAR, FRR, and the threshold. Figure 2 shows the DET curves corresponding to different back-end scoring models of the i-vector system [4]. From the DET curve, the performance differences of a speaker verification algorithm under different back-end scoring functions can be seen intuitively. The closer the DET curve is to the origin, the better the system performance. In addition, the DET curve is actually a step function; given a sufficiently large test set, it approaches a smooth curve.
Figure 2. DET curve
3. Overview of Mainstream Short Speech Speaker Verification Technologies
After nearly 80 years of development, speaker verification technology has achieved remarkable results in recognition ability, robustness, and model expressiveness. Speaker verification on long speech under quiet conditions can already meet commercial needs. In practical applications, however, researchers have found that the duration of the speaker's speech has a significant impact on the system [5][6]. The performance of mainstream technology fluctuates greatly when the test speech is short (less than 3 seconds). Figure 3 shows the changes in EER of the Gaussian Mixture Model-Universal Background Model (GMM-UBM) [7] system and the i-vector-GPLDA [8] system when the speech duration is shortened from 150 seconds to 2 seconds [9]. It can be seen that the performance of the speaker verification system drops sharply as the speech in the training and test data becomes shorter. In response, researchers have shifted their focus to speaker verification under short speech conditions.
Figure 3. Performance variation of the speaker verification system with different speech durations.
3.1 Challenges of Short Speech Speaker Verification Technology in Voice Control Scenarios
Generally, short-voice speaker verification is commonly found in smart home and smart robot voice control scenarios. For short-voice speaker verification technology in smart voice control scenarios, "short voice" refers to the speaker's registration and verification voice content consisting of short words, such as "open the door" or "close the door," with a duration of less than 3 seconds. Considering specific application scenarios, the collected voice signal is mixed with interference information such as other speakers, environmental noise, and channel mismatch. The challenges of short-voice speaker verification technology are summarized as follows:
(1) Short duration: The speech used for speaker registration and testing is short, usually containing only a few words, such as "open the window" or "turn off the light". These sentences contain little effective speech information and insufficient speaker information [10], which may reduce the degree of matching between training and testing and thus degrade the performance of the speaker verification system.
(2) Noise interference: In practical applications, environmental background noise greatly interferes with the speaker verification result. Noise mixes a large amount of uncertain information into the target speaker's speech, making it difficult for the parametric model to estimate accurate statistics and ultimately seriously degrading the performance of the speaker verification system [11].
(3) Invalid recordings: When collecting speech data in real-world scenarios, invalid speech inevitably gets mixed into the speech in the test set and the training set. This further shortens the time of useful speech, which is insufficient to provide enough information to train the model. For traditional speaker statistical models, this will increase the posterior covariance of the model [12][13], and increase the uncertainty of system estimation.
3.2 Short Speech Speaker Verification Technology
Since short speech contains limited information, the traditional approach to speaker identification for long speech cannot be followed. It is necessary to find more suitable feature representations for short speech and to perform reasonable modeling or compensation for short speech.
3.2.1 Feature Extraction
Traditional long-speech speaker verification methods often use Mel-frequency cepstral coefficients (MFCC) as input features. However, for short speech, the uncertainty in the speech signal is often not negligible, so it is difficult for MFCC-based traditional i-vector methods to estimate an accurate speaker representation, resulting in poor recognition rates [14]. To overcome this problem, some researchers have proposed multi-feature fusion methods, which exploit the fact that different features carry different information to compensate for the shortcomings of short speech. For text-independent speaker verification with limited data, features that are insensitive to changes in speech content are selected for fusion [15]. In the early stages, researchers tried fusing short-time spectral features such as LPCC, LSF, PLP, and PARCOR (partial correlation coefficients) [16][17][18] to improve the performance of short speech speaker verification systems. In recent years, Todisco [19] proposed a new feature that better represents speaker information, called constant Q cepstral coefficients (CQCC). By simulating the human auditory perception system and introducing a constant Q factor, the generated spectrogram has high resolution at both high and low frequencies; compared with MFCC features, it is more suitable for short speech speaker verification tasks. In addition, Leung et al. [20] proposed a short speech speaker verification method based on an N-gram language model that exploits the correlation of speech context. Penny et al. [21] proposed converting phoneme posterior probability information into features, using speech recognition to obtain the phoneme posteriors to assist in training the UBM.
Fu [22] used the tandem feature method, concatenating short-time spectral features with features from a speech recognition deep network, and achieved a high recognition rate under the GMM-UBM framework. Sainath [23] adopted an autoencoder structure, set one hidden layer of the network as a bottleneck layer, and concatenated the output of the bottleneck layer with other features. Experiments showed that this method helps improve the performance of short speech speaker verification systems.
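The tandem approach amounts to frame-wise concatenation of feature streams; a minimal numpy sketch (the feature dimensions are illustrative, e.g. 13 MFCCs plus a 40-dimensional bottleneck feature):

```python
import numpy as np

def tandem_features(spectral_feats, bottleneck_feats):
    """Frame-wise concatenation of two feature streams (frames x dims), e.g.
    MFCCs and bottleneck-layer outputs; both streams must be frame-aligned."""
    assert spectral_feats.shape[0] == bottleneck_feats.shape[0], "streams must have equal frame counts"
    return np.concatenate([spectral_feats, bottleneck_feats], axis=1)
```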
3.2.2 Short Speech Modeling
In recent years, as the i-vector framework has become the benchmark for speaker identification, researchers have also begun to study speaker identification of short speech based on the i-vector framework. Since the PLDA framework can be applied to speaker identification of speech of any length[24], many researchers have begun to explore speaker identification technology for short speech based on the i-vector-PLDA framework. Among them, pattern matching and normalization are the research hotspots in recent years.
Jelil et al. [25] proposed a method to use phoneme sequence information implied in speech for speaker identification of text-related short speech. They constructed speaker-related GMMs and Gaussian posterior probability maps of specific phrases. During the testing phase, on the one hand, it is necessary to compare the GMM of the target speaker, and on the other hand, it is necessary to use the dynamic time warping (DTW) method to match the posterior map of the specific phrase template. Dey et al. [26] attempted to improve the performance of speaker identification of text-related short speech by referencing sequence information through DTW in a DNN and i-vector framework.
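The DTW template matching mentioned above can be sketched as the classic dynamic-programming recursion over two feature sequences of possibly different lengths:

```python
import numpy as np

def dtw_distance(x, y):
    """Classic dynamic time warping distance between two feature sequences
    (frames x dims), using Euclidean distance between frames."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])
            # extend the cheapest of the three allowed alignment moves
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])
```

Because frames may be matched many-to-one, two utterances of the same phrase spoken at different speeds can still align with low cost.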
Normalization methods mainly compensate for the impact of speech duration mismatch among training, registration, and testing. Hautamäki et al. [12] proposed extracting i-vectors based on a minimax strategy to represent speakers: when using the EM algorithm to extract Baum-Welch statistics, the minimax method is introduced to help the model obtain more robust i-vectors. In 2014, Kanagasundaram et al. [27][28] found that the i-vectors estimated from multiple short utterances of the same speaker differed significantly. They assumed this difference came from the inconsistent phoneme information contained in the i-vectors: because short speech contains fewer words and fewer phonemes, the speaker information it carries is limited. Based on this assumption, they proposed the Short Utterance Variance Normalization (SUVN) method to compensate for the missing phoneme content. Hasan et al. [29] found that as speech duration shortens, the number of detectable phonemes in a sentence decreases exponentially. Based on this finding, they treated the duration difference as noise in the i-vector space and modeled it, improving the performance of the speaker verification system under short speech conditions.
After 2013, deep learning-based methods were also introduced. Under the DNN framework, Snyder et al. [30] used a temporal pooling layer to handle variable-length speech input. Since i-vectors estimated from long speech also reflect the phoneme content differences that arise in short speech [29], Hong et al. [31] introduced transfer learning into the short speech speaker verification system: speaker-discriminative information is learned from a model trained on long speech, and a KL regularization term is added to the back-end PLDA objective function to measure the similarity between the source and target domains. Experimental results show that this method helps improve short speech speaker verification performance under the i-vector-PLDA framework.
3.3 Speaker Verification Algorithm Based on i-vector and PLDA
In 2011, Dehak found experimentally that although the JFA algorithm [32] assumes the eigenchannel space is estimated from speaker-independent channel information, in practice some speaker-related information also leaks into the eigenchannel space. That is, although JFA uses the eigenvoice space and the eigenchannel space to separate speaker information from channel information, it cannot effectively separate the two spaces. Dehak therefore did not divide the two spaces, instead using a single total variability space to describe both speaker and channel information, and a total variability factor (the i-vector) to describe the speaker and channel factors jointly [4]. In the i-vector speaker verification system, the speaker supervector is decomposed as:
M = m + Tw

In the formula, M is the speaker supervector; m represents the speaker- and channel-independent supervector; T is the low-rank total variability matrix; and w is the total variability factor, also known as the identity vector, i.e., the i-vector.
In the i-vector method, both speaker information and channel information are contained in the total variability space. To improve the accuracy with which the i-vector represents the speaker, channel compensation must be introduced to further eliminate the influence of channel factors. Therefore, PLDA [33] is introduced to perform further factor analysis on the i-vector; that is, the i-vector space is further decomposed into a speaker space and a channel space, as follows:
ω = μ + Fh + Gx + ε

where ω represents the i-vector of a speech sample; μ is the mean of all training i-vectors; F is the speaker space matrix, describing speaker-related features; h is the speaker factor; G is the channel space matrix, describing the differences between different speech samples from the same speaker; x is the channel factor; and ε is the residual noise term. Furthermore, h and x follow the standard normal distribution N(0, I). During the testing phase, the log-likelihood ratio is used to determine whether two speech samples were generated by the same speaker factor, and the formula is as follows:

score = ln [ p(ω1, ω2 | H1) / ( p(ω1 | H0) p(ω2 | H0) ) ]
Where ω1 and ω2 are the speaker i-vectors during the registration and testing phases, respectively. H0 assumes that the two speech segments belong to different speakers, and H1 assumes that the two speech segments belong to the same speaker.
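For illustration, the verification log-likelihood ratio can be computed in the simplified two-covariance formulation of PLDA, a related but not identical parameterization to the F/G decomposition above; it assumes centered i-vectors and that the between-speaker and within-speaker covariances are already estimated:

```python
import numpy as np

def _gauss_logpdf(x, cov):
    """Log-density of a zero-mean multivariate Gaussian N(0, cov) at x."""
    d = len(x)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + x @ np.linalg.solve(cov, x))

def plda_llr(w1, w2, phi_b, phi_w):
    """Two-covariance PLDA log-likelihood ratio: H1 (same speaker) vs H0
    (different speakers), for centered i-vectors w1 and w2.
    phi_b: between-speaker covariance; phi_w: within-speaker covariance."""
    tot = phi_b + phi_w
    z = np.concatenate([w1, w2])
    zero = np.zeros_like(phi_b)
    # Under H1 the pair shares one speaker factor, coupling the two vectors;
    # under H0 they are independent draws.
    cov_same = np.block([[tot, phi_b], [phi_b, tot]])
    cov_diff = np.block([[tot, zero], [zero, tot]])
    return float(_gauss_logpdf(z, cov_same) - _gauss_logpdf(z, cov_diff))
```

A positive score favors the same-speaker hypothesis; the decision threshold on the score plays the role of the operating point discussed in Section 2.2.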
4. Mainstream Deep Learning-Based Speaker Verification Technologies
Before 2013, mainstream speaker identification techniques were based on statistical models. With the breakthroughs achieved by deep learning methods in speech recognition, image processing, and other fields, researchers began to study deep learning-based speaker identification techniques. The main branches include: speaker identification methods based on deep network feature learning, speaker identification methods based on metric learning, speaker identification methods based on multi-task learning, and end-to-end speaker identification methods.
4.1 Deep Network-Based Feature Extraction Methods
Deep network-based feature learning methods leverage the feature extraction capabilities provided by complex nonlinear structures to automatically analyze the features of input speech signals and extract higher-level, more abstract speaker representations.
In 2014, Google researchers Ehsan et al. [34] proposed a speaker verification algorithm based on a Deep Neural Network (DNN) structure, selecting the activated output of the last hidden layer as the frame-level speaker features; the average of all frame-level features of a speech segment gives the utterance-level feature of that segment, called the d-vector. In 2015, Chen et al. [35] found that the weight matrix between the input layer and the first hidden layer of the DNN was excessively large. After visualizing it, they found it contained a large number of zero values and that the non-zero weights exhibited a clustering effect. To address this, they proposed replacing the fully connected network with locally connected and convolutional neural networks (CNN). The number of parameters in the new network decreased by 30%, with a performance loss of only 4%; moreover, with the same number of parameters, the EER of the new networks improved by 8% and 10%, respectively. In 2017, Wang of Tsinghua University [36] proposed a feature extraction network combining CNN and TDNN, whose input is a spectrogram and whose output is a separable speaker representation; since sentence-level features can be obtained directly from the spectrogram, the network's performance is greatly improved. In 2018, Li et al. [37] found that in the traditional DNN-based feature extraction structure, using a parameterized softmax layer may cause some speaker information to "leak" into the weights connecting the hidden layer and the softmax layer, so that the deep features represented by the last layer's nodes are incomplete, which in turn lowers accuracy. The authors therefore improved the loss function so that it contains no additional parameters and all speaker information is represented in the last layer of the network.
In the same year, Snyder et al. [30] of Johns Hopkins University proposed the DNN-based x-vector speaker verification system, which divides speech feature extraction into a frame level and a segment level and uses a statistics pooling layer to connect the two. Also that year, Snyder et al. [38] found that with data augmentation, in which noise, reverberation, babble, and other interference are added to the original speech data in certain proportions, the network can extract effective information from noisy data, thereby improving the performance of the speaker verification system.
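The statistics pooling layer that bridges the frame level and the segment level can be sketched as concatenating the per-dimension mean and standard deviation of a variable-length feature sequence:

```python
import numpy as np

def statistics_pooling(frame_features):
    """x-vector style statistics pooling: map a variable-length sequence of
    frame-level features (frames x dims) to one fixed-length segment-level
    vector by concatenating the per-dimension mean and standard deviation."""
    mean = frame_features.mean(axis=0)
    std = frame_features.std(axis=0)
    return np.concatenate([mean, std])
```

Whatever the number of frames, the output dimension is fixed (twice the frame-feature dimension), which is what lets the segment-level layers accept speech of any duration.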
4.2 Metric Learning-Based Methods
Metric learning-based methods aim to design objective functions better suited to the speaker verification task, enabling the feature extraction network to learn features with smaller intra-class distances and larger inter-class distances.
In 2017, Baidu proposed the Deep Speaker [39] system, which uses the triplet loss widely used in face recognition as its loss function. During training, the model first extracts representations of two speech segments from the same speaker, then a representation of a speech segment from a different speaker; the goal is to make the cosine similarity between representations of the same speaker higher than that between different speakers. On text-independent data, the EER is reduced by 50% compared with the DNN-based method. In 2018, Salehghaffari [40] of New York University proposed using a Siamese structure with a contrastive loss as the loss function; the CNN-based speaker verification network he designed reduced the EER by nearly 10% compared with the i-vector system.
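A minimal sketch of the cosine-based triplet loss idea described above (the margin value is an illustrative assumption, not the published setting):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Push the anchor-positive similarity above the anchor-negative
    similarity by at least the margin; zero loss once the margin is met."""
    return float(max(0.0, cosine(anchor, negative) - cosine(anchor, positive) + margin))
```

In training, anchor and positive come from the same speaker and negative from a different speaker; gradients of this loss pull same-speaker embeddings together and push different-speaker embeddings apart.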
4.3 Multi-task learning-based methods
Because different speech tasks share certain similarities (such as keyword detection and speaker verification, or speech separation and speech enhancement), researchers have tried to share information across related tasks, giving the model better generalization ability on the speaker verification task.
In 2018, Ding et al. [41] of Tsinghua University transferred TripletGAN from image generation to speaker verification. Using the idea of multi-task learning, the network performs two tasks simultaneously: speaker verification and speech synthesis. A Generative Adversarial Network (GAN) serves as the data generator, producing additional speech data as input to the speaker verification network so that it can learn a more generalized speaker representation; compared with a triplet loss network, performance improved greatly. In the same year, Novoselov et al. [42] combined the speaker verification task with a digit recognition task, so that the last layer of the network outputs both the speaker verification and spoken digit recognition results; on the RSR2015 database, this improved on the benchmark algorithm by nearly 50%. Dey et al. [43] used multi-task learning of digit recognition and speaker verification to let the network jointly optimize both problems; using triplet loss as the objective function, they improved on the i-vector method by 43% on the RSR2015 database.
4.4 End-to-end speaker verification
An end-to-end speaker verification system takes as input speech signals from different speakers and outputs the speaker verification result. End-to-end networks typically contain a large number of parameters, requiring more training and testing data compared to other deep learning-based speaker verification methods.
In 2016, Heigold et al. [44] of Google proposed an end-to-end speaker verification system comprising two networks: a pre-trained feature extraction network and a decision network for scoring. In the training phase, the pre-trained feature extraction network obtains frame-level features of the speech, which are averaged into sentence-level features; the cosine similarity with features extracted from other sentences is then computed and fed into a logistic regression layer containing only two scalar parameters, a weight and a bias, which finally outputs whether the speakers are the same. In the registration phase, the features of the input speech are obtained and the network is trained again, changing only the bias parameter of the logistic regression layer while keeping the other parameters fixed. In the verification phase, the speech to be verified is input, and the logistic regression layer directly outputs the decision. In 2016, Zhang of Microsoft [45] found that the redundant contribution of silent frames to the sentence-level features weakens their representation ability. He therefore proposed an attention mechanism with two pre-trained networks: one obtains the phoneme features of each speech frame, and the other determines whether the current frame belongs to a given triphone group. Their outputs are combined to give each frame a different weight, and the sentence-level features are synthesized by weighted averaging. In 2017, Chowdhury of Google improved the attention mechanism [46]: the weights no longer depend on a pre-trained auxiliary network but are learned by applying a nonlinear transformation directly to the frame-level features, which greatly reduces the complexity of the network.
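The attention pooling idea, in which frame weights are derived from the frame-level features themselves rather than from an auxiliary network, can be sketched as follows; the scoring here is a single linear projection followed by a softmax, a deliberate simplification of the published nonlinear variants, and the parameter vector v is an assumed learned parameter:

```python
import numpy as np

def attention_pooling(frame_features, v):
    """Attentive pooling over frame-level features (frames x dims): score each
    frame with a learned vector v, softmax-normalise the scores over frames,
    and return the weighted average as the sentence-level feature."""
    scores = frame_features @ v                      # one scalar score per frame
    scores = scores - scores.max()                   # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over frames
    return weights @ frame_features
```

With all scores equal the result reduces to plain mean pooling; informative frames (e.g. voiced speech rather than silence) receive larger weights once v is trained.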
Li of Google [47] proposed a domain adaptation method that uses a large corpus to assist a small corpus in the end-to-end speaker verification task. In addition, two different loss functions were designed for the text-dependent and text-independent scenarios, reducing network training time by 60% and improving accuracy by 10%.
5. Summary and Outlook
This paper focuses on short speech speaker verification technology for intelligent voice control scenarios. It reviews the basic concepts, analyzes the challenges faced by short speech speaker verification technology, introduces mainstream methods from the perspectives of feature extraction and short speech modeling, and finally introduces the current development status of deep learning-based speaker verification technology.
Compared to traditional machine learning-based speaker identification techniques, deep learning-based short-speech speaker identification techniques offer superior performance, thanks to the powerful feature extraction capabilities of deep networks. However, we also observe that deep learning methods require a large amount of labeled training speech data for model training, which limits the generalization and application of deep learning-based speaker identification models. Therefore, future research directions include using transfer learning methods to transfer speaker models trained on large corpora to those trained on small corpora, effectively extracting more discriminative features from short speech, and designing objective functions more suitable for short-speech speaker identification tasks.
References
[1] Hansen J H L, Hasan T. Speaker Recognition by Machines and Humans: A tutorial review [J]. IEEE Signal Processing Magazine, 2015, 32(6): 74-99.
[2] Zheng Fang, Li Lantian, Zhang Hui, et al. Voiceprint recognition technology and its current application status [J]. Information Security Research, 2016, 2(1): 44-57.
[3] Scheffer N, Ferrer L, Graciarena M, et al. The SRI NIST 2010 speaker recognition evaluation system [C]// IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2011: 5292-5295.
[4] Dehak N, Kenny P J, Dehak R, et al. Front-End Factor Analysis for Speaker Verification [J]. IEEE Transactions on Audio, Speech & Language Processing, 2011, 19(4): 788-798.
[5] Markel J, Oshika B, Gray A. Long-term feature averaging for speaker recognition [J]. IEEE Transactions on Acoustics, Speech & Signal Processing, 1977, 25(4): 330-337.
[6] Li K, Wrench E. An approach to text-independent speaker recognition with short utterances [C]// IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 1983: 555-558.
[7] Reynolds D A, Quatieri T F, Dunn R B. Speaker Verification Using Adapted Gaussian Mixture Models [J]. Digital Signal Processing, 2000: 19-41.
[8] Kenny P. Bayesian speaker verification with heavy-tailed priors [C]// Proc. Odyssey Speaker and Language Recognition Workshop, Brno, Czech Republic. 2010.
[9] Poddar A, Sahidullah M, Saha G. Speaker verification with short utterances: a review of challenges, trends and opportunities [J]. IET Biometrics, 2018, 7(2): 91-101.
[10] Larcher A, Kong A L, Ma B, et al. Text-dependent speaker verification: Classifiers, databases and RSR2015 [J]. Speech Communication, 2014, 60(3): 56-77.
[11] Das R K, Prasanna S R M. Speaker Verification from Short Utterance Perspective: A Review [J]. IETE Technical Review, 2017(1): 1-19.
[12] Hautamäki V, Cheng Y-C, Rajan P, et al. Minimax i-vector extractor for short duration speaker verification [J]. 2013.
[13] Poorjam A H, Saeidi R, Kinnunen T, et al. Incorporating Uncertainty as a Quality Measure in I-Vector Based Language Recognition [C]// The Speaker and Language Recognition Workshop. 2016.
[14] Kanagasundaram A, Vogt R, Dean D, et al. i-vector Based Speaker Recognition on Short Utterances [C]// INTERSPEECH. DBLP, 2011.
[15] Hosseinzadeh D, Krishnan S. On the Use of Complementary Spectral Features for Speaker Recognition [J]. EURASIP Journal on Advances in Signal Processing, 2007, 2008(1): 1-10.
[16] Makhoul J. Linear prediction: A tutorial review [J]. Proceedings of the IEEE, 1975, 63(4): 561-580.
[17] Hermansky H. Perceptual linear predictive (PLP) analysis of speech [J]. Journal of the Acoustical Society of America, 1990, 87(4): 1738-1752.
[18] Huang X, Acero A. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development [M]. Prentice Hall PTR, 2001.
[19]TodiscoM,DelgadoH,EvansN.ArticulationratefilteringofCQCCfeaturesforautomaticspeakerverification[C]//INTERSPEECH.2018.
[20]LeungKY,MakMW,SiuMH,etal.Adaptive articulatoryfeature-basedconditionalpronunciationmodelingforspeakerverification[J].SpeechCommunication,2006,48(1):71-84.
[21]KennyP,GuptaV,StafylakisT,etal.DeepneuralnetworksforextractingBaum-Welchstatisticsforspeakerrecognition[C]//Odyssey.2014.
[22] FuT, QianY, LiuY, etal.Tandemdeepfeaturesfortext-dependentspeakerverification[C]//ConferenceoftheInternationalSpeechCommunicationAssociation.InternationalSpeechCommunicationAssociation(ISCA), 2014:747-753.
[23]SainathTN,KingsburyB,RamabhadranB.Auto-encoderbottleneckfeaturesusingdeepbeliefnetworks[C]//IEEEInternationalConferenceonAcoustics,SpeechandSignalProcessing.IEEE,2012:4153-4156.
[24]KennyP,StafylakisT,OuelletP,etal.PLDAforspeakerverificationwithutterancesofarbitraryduration[C]//IEEEInternationalConferenceonAcoustics,SpeechandSignalProcessing.IEEE,2013:7649-7653.
[25]JelilS,DasRK,SinhaR,etal.SpeakerVerificationUsingGaussianPosteriorgramsonFixedPhraseShortUtterances[C]//INTERSPEECH.2015.
[26] DeyS, MotlicekP, MadikeriS, et al. Exploitingsequenceinformationfortext-dependentSpeakerVerification[C]//IEEEInternationalConferenceonAcoustics,SpeechandSignalProcessing.IEEE,2017:5370-5374.
[27]KanagasundaramA,DeanD,Gonzalez-DominguezJ,etal.ImprovingShortUtterancebasedI-vectorSpeakerRecognitionusingSourceandUtterance-DurationNormalizationTechniques[C]//Proceed.ofINTERSPEECH.2013:3395-3400.
[28]KanagasundaramA,DeanD,SridharanS,etal.Improvingshortutterancei-vectorspeakerverificationusingutterancevariancemodellingandcompensationtechniques[J].SpeechCommunication,2014,59(2):69-82.
[29]HasanT,SaeidiR,HansenJHL,etal.Durationmismatchcompensationfori-vectorbasedspeakerrecognitionsystems[J].2013:7663-7667.
[30]SnyderD,GhahremaniP,PoveyD,etal.Deepneuralnetwork-basedspeakerembeddingsforend-to-endspeakerverification[C]//SpokenLanguageTechnologyWorkshop.IEEE,2017:165-170.
[31]HongQ,LiL,WanL,etal.TransferLearningforSpeakerVerificationonShortUtterances[C]//INTERSPEECH.2016:1848-1852.
[32]KennyP.Jointfactoranalysisofspeakerandsessionvariability:Theoryandalgorithms[J].2005.
[33]SenoussaouiM,KennyP,BrümmerN,etal.MixtureofPLDAModelsini-vectorSpaceforGender-IndependentSpeakerRecognition[C]//INTERSPEECH2011,ConferenceoftheInternationalSpeechCommunicationAssociation,Florence,Italy,August.DBLP,2011:25-28.
[34]VarianiE, LeiX, McdermottE, etal.Deepneuralnetworksforsmallfootprinttext-dependentspeakerverification[C]//IEEEInternationalConferenceonAcoustics,SpeechandSignalProcessing.IEEE,2014:4052-4056.
[35]ChenY,Lopez-MorenoI,SainathTN,etal.Locally-connectedandconvolutionalneuralnetworksforsmallfootprintspeakerrecognition[C]//SixteenthAnnualConferenceoftheInternationalSpeechCommunicationAssociation.2015.
[36]LiL,ChenY,ShiY,etal.DeepSpeakerFeatureLearningforText-independentSpeakerVerification[J].2017:1542-1546.
[37]LiL,TangZ,WangD,etal.Full-infoTrainingforDeepSpeakerFeatureLearning[J].2018.
[38]SnyderD,Garcia-RomeroD,SellG,etal.X-vectors:RobustDNNembeddingsforspeakerrecognition[J].ICASSP,Calgary,2018.
[39]LiC,MaX,JiangB,etal.DeepSpeaker:anEnd-to-EndNeuralSpeakerEmbeddingSystem[J].2017.
[40]HosseinSalehghaffari,etal.SpeakerVerificationusingConvolutionalNeuralNetworks[J].2018
[41]DingW,HeL.MTGAN:SpeakerVerificationthroughMultitaskingTripletGenerativeAdversarialNetworks[J].2018.
[42]NovoselovS,KudashevO,SchemelininV,etal.DeepCNNbasedfeatureextractorfortext-promptedspeakerrecognition[J].2018.
[43]SDey,TKoshinaka,PMotlicek,SMadikeri,etal,DNNbasedspeakerembeddingusingcontentinformationfortext-dependentspeakerverification[J].2018
[44]HeigoldG,MorenoI,BengioS,etal.End-to-endtext-dependentspeakerverification[C]//Acoustics,SpeechandSignalProcessing(ICASSP),2016IEEEInternationalConferenceon.IEEE,2016:5115-5119.
[45]ZhangSX,ChenZ,ZhaoY,etal.End-to-endattentionbasedtext-dependentspeakerverification[C]//SpokenLanguageTechnologyWorkshop(SLT),2016IEEE.IEEE,2016:171-178.
[46]ChowdhuryFA,WangQ,MorenoIL,etal.Attention-BasedModelsforText-DependentSpeakerVerification[J].arXivpreprintarXiv:1710.10470,2017.
[47]WanL,WangQ,PapirA,etal.Generalizedend-to-endlossforspeakerverification[J].arXivpreprintarXiv:1710.10467,2017.