Chinese Pause Fillers Prediction Module: History

The prediction of pause fillers plays a crucial role in enhancing the naturalness of synthesized speech. Neural network models such as LSTM, BERT, and XLNet have been employed in pause fillers prediction modules.

  • naturalness of speech
  • speech synthesis
  • Chinese pause fillers

1. Introduction

Pause fillers, also known as filled pauses, are brief pauses or meaningless interjections inserted in speech to simulate pauses and thinking processes in human speech expression. Pause fillers are widely used in the fields of speech synthesis and natural language processing with the aim of improving the naturalness and fluency of synthesized speech. The purpose of pause fillers is to mimic the natural habits of human speech during conversations and make synthesized speech more akin to real human speech expression. In human communication, people employ pauses of varying lengths to convey meaning, engage in thinking, and regulate speech pace, among other reasons. For instance, common pause fillers in English include “uh”, “um”, “well”, and “you know”, while in Chinese, common pause fillers include “啊”, “呃”, “嗯”, and “那个”. These pause fillers are inserted between sentences or phrases to serve as separators and connectors in speech.
In speech synthesis, the appropriate insertion of pause fillers can enhance the naturalness and fluency of synthesized speech. This provides better speech rhythm and intonation, making it easier for listeners to understand and accept the synthesized speech. The prediction of pause fillers plays a crucial role in speech synthesis systems, requiring the identification of suitable positions to insert pause fillers based on input text and context.
Methods for predicting pause fillers include rule-based approaches, statistical models, and deep learning methods. In recent years, with the advancement of deep learning technologies, neural network models such as LSTM, BERT, and XLNet have been widely employed in pause filler prediction tasks, achieving favorable results. However, the accuracy of existing pause filler prediction modules still has room for improvement, necessitating the adoption of more effective strategies and more efficient natural language processing models for further advancements.
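As an illustration of the deep learning approach, the sketch below frames pause filler prediction as a token-classification task on top of a pre-trained Chinese BERT model. It is a minimal, hypothetical example: the Hugging Face transformers library, the bert-base-chinese checkpoint, the label set, and the predict_fillers helper are illustrative assumptions rather than details taken from the works discussed here, and in practice the classification head would first be fine-tuned on dialog text annotated with filler positions.

```python
import torch
from transformers import BertForTokenClassification, BertTokenizerFast

# Hypothetical label set: "O" (no filler) plus one label per filler that would
# be inserted after the corresponding character.
LABELS = ["O", "INS_啊", "INS_呃", "INS_嗯", "INS_那个"]

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForTokenClassification.from_pretrained(
    "bert-base-chinese", num_labels=len(LABELS)
)  # the classification head is randomly initialized here; fine-tune before use
model.eval()

def predict_fillers(text: str) -> str:
    """Return the text with a predicted filler appended after each character
    whose predicted label is not 'O'."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits                  # (1, seq_len, num_labels)
    label_ids = logits.argmax(dim=-1)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    out = []
    for tok, lid in zip(tokens, label_ids):
        if tok in ("[CLS]", "[SEP]"):                 # skip special tokens
            continue
        out.append(tok)
        if LABELS[lid] != "O":
            out.append(LABELS[lid].split("_", 1)[1])  # append the filler itself
    return "".join(out)

print(predict_fillers("我觉得这个方案还需要再讨论一下"))
```

The same token-level framing carries over to LSTM or XLNet encoders by swapping the backbone model while keeping the label scheme unchanged.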

2. Prediction of Pause Fillers

Over the years, numerous scholars have extensively researched the prediction of pause fillers using various methods, resulting in more natural and authentic generated text. For instance, Nakanishi et al. proposed a method based on analyzing human–robot interaction data and machine learning models to predict the occurrence and appropriate forms of pause fillers, aiming to generate them at the beginning of system utterances in humanoid robot spoken dialog systems to indicate turn-taking or turn-holding intentions [1]. Balagopalan et al. compared two common methods for Alzheimer’s disease (AD) detection on a matched dataset, assessing the advantages of domain knowledge and pre-trained BERT transfer models in predicting pauses and interruptions [2]. Mielke et al. obtained a dialog agent with greatly improved linguistic calibration by incorporating metacognitive features, which signal the model’s confidence in its own answers, into the training of a controllable generation model [3]. Boyd et al. discussed the trajectory of interdisciplinary research on language and the challenges of integrating analysis methods across paradigms, recommending promising future directions for the field [4]. Arnold et al. found that in online language processing, disfluent expressions affect listeners’ interpretation of subsequent nouns, making listeners more inclined to associate them with objects that have not previously been mentioned, revealing that the fundamental process of decoding language input is influenced by disfluency [5]. This body of linguistic research highlights the importance of pauses and motivates the training of a well-designed pause fillers model to enhance linguistic fluency.
However, due to individual speaking habits, it is challenging to train a universal pause filler prediction module. To address personalized needs, Matsunaga et al. proposed a personalized pause filler generation method based on a group prediction model and explored an alternative group prediction approach [6]. It should be noted that there are significant differences between the Chinese and Japanese languages, making the aforementioned models unsuitable for Chinese. To train a Chinese-specific model, new datasets must be sought, and the grouping conditions for pause fillers need to be reexamined. Furthermore, many pause filler prediction models, including those in Japanese, have not considered integration with mainstream speech synthesis systems and overall performance, necessitating further exploration and improvement in practicality [7].
An accurate and appropriate pause fillers prediction model can automatically predict suitable pause fillers in Text-to-Speech (TTS) systems, simulating the fluency and coherence of natural human speech. This makes the synthesized speech sound more natural and reduces the artificiality of the machine-generated voice. A pause fillers prediction model helps TTS systems better simulate human communication and expression in dialogs [8]. By inserting pause fillers at appropriate positions, TTS systems can better mimic human-to-human conversations, enhancing user experience and making interactions more natural and friendly [9]. Accurately predicting pause fillers can avoid unnecessary pauses and redundancies, thereby improving the efficiency of TTS systems. Speech synthesis can proceed more smoothly, reducing unnecessary delays and waiting times. The pause fillers prediction model can be customized based on individuals’ speaking habits and speech characteristics, making the speech synthesis more personalized and adaptable. Different speakers’ pause habits can be incorporated into the model, making the synthesized speech more in line with individuals’ styles and traits. Combining the pause fillers prediction model with TTS systems significantly enhances the quality, naturalness, and personalization of speech synthesis, making the synthesized speech more closely resemble authentic human expression, ultimately improving user experience and satisfaction.

3. TTS

TTS is a technology that converts text into speech. It takes input text and transforms it into audible speech output using speech synthesis techniques [10][11][12]. TTS systems find wide applications in speech synthesis, accessibility technologies, education, and various other fields. A typical TTS system consists of two main components: the front-end and the back-end. The front-end is responsible for analyzing the input text and extracting linguistic information and speech features, and then generating an intermediate representation in the form of a phoneme sequence. The back-end utilizes these phoneme sequences and acoustic models to generate the final speech output using speech synthesis algorithms.
In a TTS system, the pause fillers prediction module plays a crucial role [13]. It automatically determines and inserts appropriate pause fillers based on the input text and contextual information, aiming to enhance the naturalness and fluency of the synthesized speech. By incorporating the pause fillers prediction module, the TTS system can better simulate the pauses and thinking processes in human speech expression, making the synthesized speech more realistic and easier to comprehend.
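A highly simplified sketch of this integration is given below, under the assumption that a filler prediction function (such as the hypothetical predict_fillers above) runs on the input text before the TTS front-end; text_frontend and acoustic_backend are placeholder stand-ins for a real system’s front-end and back-end, not actual library calls.

```python
from typing import List

def text_frontend(text: str) -> List[str]:
    """Placeholder front-end: text analysis yielding an intermediate
    phoneme-like sequence (here, simply one unit per character)."""
    return list(text)

def acoustic_backend(units: List[str]) -> bytes:
    """Placeholder back-end: acoustic model plus vocoder producing a waveform
    (here, an empty byte string)."""
    return b""

def synthesize_with_fillers(text: str, predict_fillers) -> bytes:
    enriched = predict_fillers(text)   # insert pause fillers before synthesis
    units = text_frontend(enriched)    # front-end: linguistic analysis
    return acoustic_backend(units)     # back-end: waveform generation

# audio = synthesize_with_fillers("我觉得这个方案还需要再讨论一下", predict_fillers)
```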
In recent years, an increasing number of TTS systems have been applied to Chinese, such as the Tacotron 2, Parallel WaveNet, and FastSpeech systems. Qin et al. proposed a Myanmar TTS synthesis method using an end-to-end model [14]. Luz et al. used text context embeddings, computed by BERT (a pre-trained language representation model), to directly predict the prosodic features of reference audio; the improved system could generate speech with richer prosody at inference time even with limited training data [15]. Zhang and Ling proposed a speech synthesis model based on a fine-grained style representation called word-level style variation (WSV) [16]; to improve the accuracy of WSV prediction and the naturalness of synthesized speech, they used a pretrained BERT model together with speech information to derive semantic descriptions. Liu et al. proposed a speech synthesis method based on LPCNet [17]. Qiu et al. proposed an end-to-end speech synthesis method based on WaveNet [18]. Zhang and Ling also designed two context encoders, a sentence-window context encoder and a paragraph-level context encoder [19], in which the context representation is extracted by BERT from multiple sentences through an additional attention module, and they further proposed a deep learning method using BERT to provide wide-ranging contextual representations for statistical parametric speech synthesis (SPSS) [20][21][22]. To address zero-shot TTS, Casanova et al. proposed a speaker-conditional architecture that includes a flow-based decoder [23]; it achieved state-of-the-art similarity for unseen speakers while being trained on a dataset of only eleven speakers. Wu et al. developed AdaSpeech 4 [24], a zero-shot adaptive TTS system for high-quality speech synthesis, which achieved better voice quality and similarity than baselines on multiple datasets without any fine-tuning. Kumar et al. presented a novel zero-shot multi-speaker speech synthesis approach (ZSM-SS) [25]; compared with the baseline normalization architecture, ZSM-SS adds non-autoregressive multi-head attention within the encoder–decoder architecture [26][27][28].

This entry is adapted from the peer-reviewed paper 10.3390/app131910652

References

  1. Nakanishi, R.; Inoue, K.; Nakamura, S. Generating fillers based on dialog act pairs for smooth turn-taking by humanoid robot. In Proceedings of the 9th International Workshop on Spoken Dialogue System Technology (IWSDS 2019), Singapore, 24–26 April 2019; pp. 91–101.
  2. Balagopalan, A.; Eyre, B.; Robin, J.; Rudzicz, F.; Novikova, J. Comparing pre-trained and feature-based models for prediction of Alzheimer’s disease based on speech. Front. Aging Neurosci. 2021, 13, 635945.
  3. Mielke, S.J.; Szlam, A.; Dinan, E.; Boureau, Y.-L. Reducing conversational agents’ overconfidence through linguistic calibration. Trans. Assoc. Comput. Linguist. 2022, 10, 857–872.
  4. Boyd, R.L.; Schwartz, H.A. Natural language analysis and the psychology of verbal behavior: The past, present, and future states of the field. J. Lang. Soc. Psychol. 2021, 40, 21–41.
  5. Arnold, J.E.; Tanenhaus, M.K.; Altmann, R.J.; Fagnano, M. The old and thee, uh, new: Disfluency and reference resolution. Psychol. Sci. 2004, 15, 578–582.
  6. Matsunaga, Y.; Saeki, T.; Takamichi, S.; Saruwatari, H. Empirical study incorporating linguistic knowledge on filled pauses for personalized spontaneous speech synthesis. arXiv 2022, arXiv:2210.07559.
  7. Maekawa, K.; Koiso, H.; Furui, S.; Isahara, H. Spontaneous speech corpus of Japanese. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), Athens, Greece, 31 May–2 June 2000; pp. 1–5.
  8. Dinkar, T.; Vasilescu, I.; Pelachaud, C. How confident are you? Exploring the role of fillers in the automatic prediction of a speaker’s confidence. In Proceedings of the 45th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020), Barcelona, Spain, 4–8 May 2020; pp. 8104–8108.
  9. Zhao, W.; Yang, Z. An Emotion Speech Synthesis Method Based on VITS. Appl. Sci. 2023, 13, 2225.
  10. Huang, W.C.; Hayashi, T.; Wu, Y.C.; Kameoka, H.; Toda, T. Pretraining techniques for sequence-to-sequence voice conversion. IEEE ACM Trans. Audio Speech Lang. Process. 2021, 29, 745–755.
  11. Chen, L.; Ren, J.; Chen, P.; Mao, X.; Zhao, Q. Limited text speech synthesis with electroglottograph based on Bi-LSTM and modified Tacotron-2. Appl. Intell. 2022, 52, 15193–15209.
  12. Zhang, Q.; Qian, X.; Ni, Z.; Aaron, N.; Eliathamby, A.; Haizhou, L. A time-frequency attention module for neural speech enhancement. IEEE ACM Trans. Audio Speech Lang. Process. 2022, 31, 462–475.
  13. Block, A.; Predeck, K.; Zellou, G. German Word-Final Devoicing in Naturally-Produced and TTS Speech. Languages 2022, 7, 270.
  14. Qin, Q.; Yang, J.; Li, P. Myanmar Text-to-Speech Synthesis Using End-to-End Model. In Proceedings of the 4th International Conference on Natural Language Processing and Information Retrieval (NLPIR 2020), Seoul, Republic of Korea, 18–20 December 2020; pp. 6–11.
  15. De la Fuente Garcia, S.; Ritchie, C.W.; Luz, S. Artificial Intelligence, Speech, and Language Processing Approaches to Monitoring Alzheimer’s Disease: A Systematic Review. J. Alzheimers Dis. 2020, 78, 1547–1574.
  16. Zhang, Y.J.; Ling, Z.H. Extracting and predicting word-level style variations for speech synthesis. IEEE ACM Trans. Audio Speech Lang. Process. 2021, 29, 1582–1593.
  17. Liu, Z.; Yi, X.; Zhao, X.; Yang, Y. Content-Aware Robust JPEG Steganography for Lossy Channels Using LPCNet. IEEE Signal Process. Lett. 2022, 29, 2253–2257.
  18. Qiu, Z.; Qu, D.; Zhang, L. End-to-end speech synthesis method based on WaveNet. J. Comput. Appl. 2019, 39, 1325–1329.
  19. Zhang, Y.J.; Ling, Z.H. Learning deep and wide contextual representations using BERT for statistical parametric speech synthesis. In Proceedings of the 5th International Conference on Digital Signal Processing (CISAI2021), Kunming, China, 17–19 September 2021; pp. 146–150.
  20. Bai, Y.; Yi, J.; Tao, J.; Tian, Z.; Wen, Z.; Zhang, S. Fast end-to-end speech recognition via non-autoregressive models and cross-modal knowledge transferring from BERT. IEEE ACM Trans. Audio Speech Lang. Process. 2021, 29, 1897–1911.
  21. Yasuda, Y.; Toda, T. Investigation of Japanese PnG BERT language model in text-to-speech synthesis for pitch accent language. IEEE J. Sel. Top. Signal Process. 2022, 16, 1319–1328.
  22. Yi, J.; Tao, J.; Fu, R.; Rui, F.; Tao, W.; Chu, Y.Z.; Cheng, W. Adversarial Multi-Task Learning for Mandarin Prosodic Boundary Prediction with Multi-Modal Embeddings. IEEE ACM Trans. Audio Speech Lang. Process. 2023, 31, 2963–2973.
  23. Casanova, E.; Shulby, C.; Gölge, E.; Müller, N.M.; de Oliveira, F.S.; Cândido Júnior, A.; da Silva Soares, A.; Aluísio, S.M.; Ponti, M.A.; et al. SC-GlowTTS: An efficient zero-shot multi-speaker text-to-speech model. arXiv 2021, arXiv:2104.05557.
  24. Wu, Y.; Xu, T.; Li, B.; He, L.; Zhao, S.; Song, R.; Qin, T.; Liu, T. Adaspeech 4: Adaptive text to speech in zero-shot scenarios. arXiv 2022, arXiv:2204.00436.
  25. Kumar, N.; Narang, A.; Lall, B. Zero-Shot Normalization Driven Multi-Speaker Text to Speech Synthesis. IEEE ACM Trans. Audio Speech Lang. Process. 2022, 30, 1679–1693.
  26. Li, S.; Yang, H.; Wang, S.; Mei, D.; Wang, J.; Fan, W. Facile synthesis of hierarchical macro/microporous ZSM-5 zeolite with high catalytic stability in methanol to olefins. Micropor. Mesopor. 2022, 329, 111538.
  27. Ngoc, P.P.; Quang, C.T.; Chi, M.L. Adapt-Tts: High-Quality Zero-Shot Multi-Speaker Text-to-Speech Adaptive-Based for Vietnamese. J. Comput. Sci. Cybern. 2023, 39, 159–173.
  28. Schnell, B.; Garner, P.N. Investigating a neural all pass warp in modern TTS applications. Speech Commun. 2022, 138, 26–37.