In recent years, artificial intelligence (AI) has sparked a new wave of technological innovation in China and abroad, and is becoming a new frontier of industrial transformation. According to the BBC, the global AI market is projected to reach 119 billion yuan by 2020; iResearch Consulting predicts that the Chinese AI market alone will reach approximately 9.1 billion yuan by 2020.
At present, policy, the economy, talent, and technology all provide excellent conditions for artificial intelligence:
(1) Policy: Artificial intelligence has been elevated to the level of national strategy. The State Council raised artificial intelligence to the national strategic level in the "Guiding Opinions on Actively Promoting the 'Internet+' Action", and the "Science and Technology Innovation 2030" projects of the 13th Five-Year Plan list intelligent manufacturing and robotics as major development projects. In 2016, to accelerate the development of the artificial intelligence industry, the National Development and Reform Commission, the Ministry of Science and Technology, the Ministry of Industry and Information Technology, and the Cyberspace Administration of China jointly formulated the "Three-Year Action Plan for 'Internet+' Artificial Intelligence". Also in 2016, well-known companies such as iFlytek and Huawei jointly released the "Shenzhen Declaration on Artificial Intelligence" in Shenzhen, actively exploring cooperation mechanisms among government, industry, academia, research, and application to promote the coordinated development of the artificial intelligence industry. At the 2017 session of the 12th National People's Congress, Premier Li Keqiang proposed deepening cooperation between the mainland and Hong Kong and Macao, studying and formulating a development plan for the Guangdong-Hong Kong-Macao Greater Bay Area city cluster, and enhancing its status and function in national economic development and opening up.
(2) Economy: The internet economy is developing rapidly. According to iResearch Consulting, China's internet economy grew by about 33% in 2015, with a market size exceeding 100 billion yuan, and the trend is continuing. Since 2014, investment institutions have significantly increased both the value and the number of their investments in artificial intelligence. According to statistics from the consulting firm Venture Scanner, by 2016 the number of artificial intelligence companies worldwide exceeded 1,000, with financing as high as US$4.8 billion.
(3) Talent: China's R&D investment accounts for 20% of the global total, second only to the United States, and information technology and the internet are among the country's key investment areas. Over the past five years, the number of researchers has grown continuously at about 20% per year, providing ample talent support for the development of artificial intelligence.
(4) Technology: China has achieved technological breakthroughs in fields such as computer vision and intelligent speech, reaching an internationally leading level. Representative domestic companies and institutions include Baidu, Alibaba, Tencent, iFlytek, Megvii Technology, and SenseTime.
Within artificial intelligence, intelligent speech technology is an important branch and an indispensable part of human-computer interaction. Speech technology includes speech recognition, speech synthesis, voiceprint recognition, speech assessment, and voice conversion; among these, pronunciation error detection and correction and voice conversion have recently achieved new breakthroughs.
Pronunciation error detection and correction
English is the most widely used language in the world today, and its importance is widely recognized. Owing to its dominant position, it is taught as the primary foreign language in more than 100 countries. Chinese people spend hundreds of billions of yuan on English training every year, yet the results remain unsatisfactory: EF Education First's "2015 English Proficiency Index" ranked mainland China 47th out of 70 countries and regions worldwide. Although English proficiency remains low, it has trended clearly upward in recent years. According to the China Residents' Consumption Survey Report published by the China Social Survey Institute, China has become the fastest-growing market in the global English-training sector, with an annual growth rate of 12%. Within this vast market, offline English training faces numerous problems, including a shortage of teachers, uneven oral proficiency among teachers, "cramming" and exam-oriented teaching methods, and an inability to effectively improve speaking and listening skills.
Furthermore, the articulation methods and positions of Chinese Pinyin differ from those of English phonetic symbols, yet many Chinese students, when first learning English, habitually use the familiar Pinyin to mark and memorize the pronunciation of English words, which over time fosters poor pronunciation habits. Factors such as the general shyness of Chinese students, insufficient class time for oral practice, the lack of feedback on oral practice after class, and the imperfect pronunciation of many English teachers further contribute to non-standard pronunciation. Because pronunciation has always been a hurdle for Chinese students learning English, many are willing to pay high tuition fees for foreign teachers to correct their pronunciation. The rise of mobile online language learning has spurred the development of AI speech assessment and, with it, AI pronunciation-correction technology.
Many online English-learning applications are available, but most simply play audio and video materials, have students repeat after them, and then play back the recordings. Only a few offer scoring and evaluation functions, and the accuracy of those evaluations has long been criticized by students. The market therefore urgently needs highly reliable scoring and evaluation technology.
Figure 1. Related products currently on the market
In addition to highly reliable scoring and assessment, students urgently need specific diagnostic feedback and suggestions. Scoring alone can only tell a student that the pronunciation is not good enough; it does not show where the errors lie or how to improve. For example, if mispronounced words are merely highlighted in red, the learner must repeatedly compare the recording with the original audio to work out the details of the error. This is feasible for obvious mispronunciations, such as steak /steɪk/ read as /stiːk/.
However, it becomes very difficult in cases like the following, especially when learners are unfamiliar with English pronunciation rules and grammar (a toy check for case (2) is sketched after the list).
(1) For example, records /ˈrekɔːdz/ is misread as /ˈrekɔːds/ (devoicing the final consonant).
(2) For example, the in the apple should be /ði/ but is pronounced /ðə/ (the is pronounced /ðə/ before a consonant sound and /ði/ before a vowel sound).
(3) Long and short vowels, for example book /bʊk/ mispronounced as /buːk/, or Lily /ˈlɪliː/ mispronounced as /ˈliːliː/.
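To make case (2) concrete, below is a toy Python check, not any product's actual algorithm. It assumes a word-level phonetic transcription is already available (for instance from a forced aligner; the data here is invented) and flags "the" realized as /ðə/ before a vowel sound. Real systems work from audio, but the rule itself is this simple.

```python
# Toy rule-based check for the "the" example above; transcription data is invented.
VOWEL_PHONEMES = {"i:", "ɪ", "e", "æ", "ɑː", "ɒ", "ɔː", "ʊ", "uː",
                  "ʌ", "ə", "ɜː", "eɪ", "aɪ", "ɔɪ", "aʊ", "əʊ"}

def check_the(words):
    """words: list of (word, pronounced_phonemes) pairs in spoken order."""
    feedback = []
    for i, (word, phones) in enumerate(words):
        if word.lower() != "the" or i + 1 >= len(words):
            continue
        # The rule: /ði/ before a vowel sound, /ðə/ before a consonant sound.
        next_starts_with_vowel = words[i + 1][1][0] in VOWEL_PHONEMES
        expected = "ði" if next_starts_with_vowel else "ðə"
        actual = "".join(phones)
        if actual != expected:
            feedback.append(f"'the' before '{words[i+1][0]}' should be /{expected}/, heard /{actual}/")
    return feedback

print(check_the([("the", ["ð", "ə"]), ("apple", ["æ", "p", "ə", "l"])]))
# -> ["'the' before 'apple' should be /ði/, heard /ðə/"]
```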
If learners cannot identify their specific errors during repeated practice, their efficiency and interest decline, and they may even reinforce mispronunciations into incorrect muscle memory. In academic research this problem is known as "mispronunciation detection and diagnosis." Over the past decade, many of the world's top research institutions have invested significant resources in it, including the Chinese University of Hong Kong, Tsinghua University, National Taiwan University, MIT, Singapore's Institute for Infocomm Research, Microsoft Research Asia, and IBM.
Figure 2. Acoustic phoneme model for multi-task learning
The challenge of mispronunciation detection and diagnosis lies in its difference from general speech recognition: it places stricter requirements on models and training data, and learners with different mother tongues tend to make different errors when learning English. It is therefore necessary to collect a large number of English recordings from native Chinese speakers and have professionals annotate them manually. With the development of deep learning and years of technical accumulation, Dr. Li Kun and his team at SoundSound Technology have achieved a major breakthrough in this field, using deep neural networks to model acoustic features against the standard pronunciation and output posterior probabilities (as shown in Figure 2). This enables not only error detection and diagnosis but also the evaluation of stress, intonation, fluency, and more (as shown in Figure 3).
Figure 3. Demonstration of pronunciation, stress, and pitch error detection and correction technology.
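The article does not disclose SoundSound's exact scoring formula, but a widely used posterior-based formulation in the mispronunciation-detection literature is the "Goodness of Pronunciation" (GOP) score. The sketch below, assuming frame-level phoneme posteriors from an acoustic model like the one in Figure 2 plus a forced alignment, averages each expected phoneme's log-posterior over its frames and flags phonemes below a tunable threshold.

```python
import numpy as np

# A minimal GOP-style sketch, not SoundSound's actual system.
def gop_scores(posteriors, alignment, phone_to_index):
    """posteriors: (T, P) frame-level phoneme posteriors from an acoustic model.
    alignment: list of (phone, start_frame, end_frame) from forced alignment."""
    scores = {}
    for phone, start, end in alignment:
        segment = posteriors[start:end, phone_to_index[phone]]
        scores[(phone, start)] = float(np.mean(np.log(segment + 1e-10)))
    return scores

def detect_errors(scores, threshold=-2.0):
    # Phones whose average log-posterior falls below the (tunable)
    # threshold are candidate mispronunciations.
    return [key for key, s in scores.items() if s < threshold]

# Example with made-up numbers: two phones over six frames, three classes.
post = np.array([[0.8, 0.1, 0.1]] * 3 + [[0.05, 0.05, 0.9]] * 3)
align = [("s", 0, 3), ("t", 3, 6)]
s = gop_scores(post, align, {"s": 0, "t": 1})
print(detect_errors(s))  # [('t', 3)]: the /t/ frames mostly matched another phone
```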
Breakthroughs in speech assessment technology make it possible for AI systems to act as personal pronunciation tutors. Once the system identifies a student's specific errors, it can automatically match them with corresponding teaching content and practice questions, enabling more accurate adaptive recommendations. If this technology becomes widespread, it will greatly improve Chinese students' pronunciation, especially by easing the shortage of educational resources and language environment in rural towns and villages.
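As a toy illustration of that adaptive-recommendation step, the mapping below pairs detected phoneme-level confusions with practice content; the error labels and exercises are invented placeholders, not actual product data.

```python
# Invented error-to-exercise mapping, purely to show the matching step.
EXERCISES = {
    ("ɪ", "i:"): "Minimal-pair drill: ship/sheep, live/leave",
    ("ʊ", "u:"): "Minimal-pair drill: full/fool, pull/pool",
    ("ðə", "ði"): "Rule review: 'the' before vowel sounds",
}

def recommend(detected_errors):
    """detected_errors: list of (expected, actual) pronunciation pairs."""
    return [EXERCISES[e] for e in detected_errors if e in EXERCISES]

print(recommend([("ʊ", "u:")]))  # the book -> /buːk/ confusion from earlier
```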
Furthermore, with China's rising international status, and especially driven by the Belt and Road Initiative, more and more foreigners are learning Chinese; data suggests their number worldwide has exceeded 100 million. Pronunciation, however, is a major hurdle in learning Chinese. The new pronunciation assessment technology can target the characteristic errors of foreign learners of Chinese and, with reliable automatic assessment, correct pronunciation errors promptly and accurately, greatly improving the effectiveness of computer-assisted Chinese pronunciation teaching systems.
Voice conversion
With the development of speech signal processing (including speech recognition and speech synthesis), speech has become one of the most natural and convenient modes of human-computer interaction. Speech conveys not only information but also emotion, attitude, and the speaker's personal characteristics. These personal characteristics play a crucial role in daily communication, allowing us to distinguish speakers on the telephone, in radio programs, and in films. Meanwhile, intelligent voice assistants such as Apple's Siri, Microsoft's Cortana, and Amazon's Alexa are becoming increasingly popular, and most people have strong preferences about an assistant's voice timbre; generating speech with a distinctive timbre is therefore very important in human-computer interaction.
Figure 4. Diagram of speech conversion
The scenarios above can be summarized as the voice conversion problem: modify the voice timbre of a non-target (NT) speaker so that it sounds like the target (T) speaker, while keeping the linguistic content unchanged (as shown in Figure 4).
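As a data-flow sketch of this problem (not any specific system): real converters operate on frame-level vocoder features rather than raw waveforms, and below a random linear map stands in for the learned conversion function, purely to show that the frame sequence, and hence the content and timing, is preserved while each frame's "timbre" is transformed.

```python
import numpy as np

def convert_utterance(source_feats, conversion_fn):
    """source_feats: (T, D) feature frames from the non-target (NT) speaker.
    Returns (T, D) frames intended to sound like the target (T) speaker."""
    return np.stack([conversion_fn(frame) for frame in source_feats])

rng = np.random.default_rng(0)
W = rng.normal(size=(40, 40))        # stand-in for a trained conversion model
source = rng.normal(size=(120, 40))  # 120 frames of 40-dim features (dummy)
converted = convert_utterance(source, lambda f: W @ f)
print(converted.shape)  # (120, 40): same frame count (content), new "timbre"
```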
Deep learning has been a revolutionary technology in artificial intelligence, and its application has greatly improved the naturalness and fluency of synthesized and converted speech. In speech synthesis, Google DeepMind's WaveNet model, proposed in 2016, improved naturalness by 50%. In 2017, Yoshua Bengio and colleagues proposed an end-to-end synthesis model that generates speech directly from text without front-end preprocessing. In China, Kang Shiyin and colleagues applied Deep Belief Networks to speech synthesis in 2013, among the earliest such work in academia and industry. In 2017, Baidu Research proposed the Deep Voice model, which raised synthesis speed; experiments showed it can be used for real-time synthesis.
The earliest attempt at voice conversion was made by Abe et al. in 1988. Between 1988 and 2013, most algorithms were based on codebook mapping, frequency warping, unit selection, or Gaussian mixture models. From 2013 onward, deep learning was applied to voice conversion: Nakashika et al. used deep neural networks to map non-target speakers' speech to the target speaker's speech in a high-dimensional space. Although voice conversion has improved greatly, there is still significant room for improvement in naturalness and timbre similarity, and its practicality remains limited: for example, a system may support conversion only from one specific person to one specific target (one-to-one), and its data requirements are stringent, needing thousands of recorded sentences from the target speaker.
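The frame-wise deep-learning approach described above can be sketched as a simple regression, in the spirit of (not identical to) Nakashika et al.: given time-aligned parallel recordings, a feed-forward network is trained to map each source-speaker feature frame onto the corresponding target-speaker frame. The PyTorch code below uses dummy data and illustrative dimensions.

```python
import torch
import torch.nn as nn

# Frame-wise regression sketch: one source frame in, one target frame out.
class FrameMapper(nn.Module):
    def __init__(self, dim=40, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x):  # x: (batch, dim) single frames
        return self.net(x)

model = FrameMapper()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
src = torch.randn(32, 40)  # time-aligned source frames (dummy data)
tgt = torch.randn(32, 40)  # parallel target frames (dummy data)
loss = nn.functional.mse_loss(model(src), tgt)
loss.backward()
optimizer.step()
```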
Figure 5. Schematic diagram of deep bidirectional LSTM recurrent neural networks (DBLSTM-RNNs)
In 2015, Dr. Sun Lifang, co-founder of SoundSound Technology, and his team used deep bidirectional LSTM recurrent neural networks (DBLSTM-RNNs) to improve the naturalness and fluency of converted speech (as shown in Figure 5). Traditional deep neural networks (DNNs) map only single frames and ignore the correlation between consecutive frames of the speech signal; DBLSTM-RNNs address this problem well, improving naturalness and fluency.
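A minimal PyTorch sketch of the DBLSTM idea follows; layer sizes are illustrative, not the paper's configuration. Unlike the frame-wise network above, a stacked bidirectional LSTM consumes the whole utterance, so every output frame can draw on context from both earlier and later frames.

```python
import torch
import torch.nn as nn

# Sequence-to-sequence conversion: the whole utterance is modeled at once.
class DBLSTMConverter(nn.Module):
    def __init__(self, dim=40, hidden=256, layers=3):
        super().__init__()
        self.rnn = nn.LSTM(dim, hidden, num_layers=layers,
                           bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, dim)  # 2x: forward + backward states

    def forward(self, x):  # x: (batch, frames, dim)
        h, _ = self.rnn(x)
        return self.out(h)

model = DBLSTMConverter()
utterance = torch.randn(1, 200, 40)  # one 200-frame utterance (dummy data)
print(model(utterance).shape)        # torch.Size([1, 200, 40])
```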
Figure 6. Framework of non-parallel, many-to-one voice conversion
In 2016, Dr. Sun Lifang and colleagues used posterior probabilities extracted from a speech recognition model to map arbitrary non-target speakers to the target speaker (as shown in Figure 6). This enabled many-to-one conversion and reduced the amount of training data required, greatly improving the practicality of voice conversion; a sketch of this idea follows the application list below. Voice conversion has a wide range of applications:
(1) Personalized speech synthesis. Combining voice conversion with existing speech synthesis systems to generate the voice the user desires.
(2) Personalized speech feedback in computer-assisted language learning. Learners currently imitate standard recordings; a voice conversion system can synthesize standard pronunciations in the user's own timbre, helping users practice and compare their pronunciation.
(3) Machine translation. Machine translation renders a sentence from one language into another; voice conversion can complement it so that the translated speech retains the original speaker's timbre.
(4) Personalized assistance for patients with speech disorders. Speech disorders are a common sequela of diseases such as stroke and Parkinson's disease, affecting patients' daily communication and their expression of personality and emotion. Voice conversion combined with speech synthesis can help such patients communicate normally and regain their own voice timbre.
(5) Entertainment. Potential applications include film and television dubbing, game voice-overs, navigation voices, and more.
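Returning to the Figure 6 approach, here is a sketch of how the posterior-probability pipeline (often called phonetic posteriorgrams, PPGs, in the literature) achieves many-to-one conversion without parallel data; dimensions and the ASR stand-in are placeholders. A speaker-independent recognizer maps any speaker's audio to largely speaker-independent phoneme posteriors, and a decoder trained only on the target speaker's own recordings maps those posteriors back to acoustic features.

```python
import torch
import torch.nn as nn

# Decoder from phoneme posteriors to target-speaker acoustic features.
class PPGToAcoustic(nn.Module):
    def __init__(self, n_phones=100, hidden=256, dim=40):
        super().__init__()
        self.rnn = nn.LSTM(n_phones, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, dim)

    def forward(self, ppg):  # ppg: (batch, frames, n_phones)
        h, _ = self.rnn(ppg)
        return self.out(h)

def convert(source_frames, asr_model, target_decoder):
    ppg = asr_model(source_frames)  # speaker-independent posteriors
    return target_decoder(ppg)      # target-speaker acoustics

# Dummy ASR stand-in: random posteriors, just to exercise the data flow.
dummy_asr = lambda x: torch.softmax(torch.randn(1, 200, 100), dim=-1)
decoder = PPGToAcoustic()
print(convert(torch.randn(1, 200, 40), dummy_asr, decoder).shape)  # [1, 200, 40]
```

Because the posterior extraction step strips away speaker identity, the decoder needs no parallel data and accepts any source speaker, which is what makes the conversion many-to-one.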
Dr. Sun Lifang
Dr. Sun holds a PhD from the Chinese University of Hong Kong and a Bachelor's degree from the University of Science and Technology of China, and previously interned at Apple in Silicon Valley. His research focuses on voice conversion and speech synthesis; in 2016 he received the sole Best Paper Award at the top international conference ICME 2016. He has served as a council member of the Hong Kong China Innovation and Entrepreneurship Association, co-chair of the Global Youth Leadership Alliance (GYL) Hong Kong community, and president of the Mainland Students and Scholars Association (CSSA) of the Chinese University of Hong Kong.
He is currently co-founder and CEO of SoundSound Technology, focusing on artificial intelligence + online education. In 2017, SoundSound Technology placed fifth in the China Innovation and Entrepreneurship Competition, the only Guangdong startup and the only intelligent-speech company to reach the finals.
Dr. Li Kun
Dr. Li was previously a research assistant and postdoctoral researcher at the Chinese University of Hong Kong. He holds five domestic and international patents and has published more than 15 academic papers; his work won the ICME 2016 Best Paper Award and was featured on the cover of an IEEE/ACM Transactions journal in 2017. He also reviews for several top international speech journals, including IEEE/ACM Transactions, Computer Speech & Language, and the Journal of the Acoustical Society of America.
Dr. Li founded SoundSound Technology in 2016 and leads the development of AI-based language learning systems. His team has received numerous grants from government and professional organizations, including startup grants from the Shenzhen Science and Technology Innovation Commission and Hong Kong Cyberport totaling over RMB 1 million, and has won major entrepreneurship awards, including third prize in the Shenzhen Innovation and Entrepreneurship Competition (Internet Industry) and third prize in the China Innovation and Entrepreneurship Competition (Internet Industry). His technology has been successfully applied in the products of several education companies, including Baicizhan, Colorful English, and Wantong.