The prediction of pause fillers plays a crucial role in enhancing the naturalness of synthesized speech. Neural networks, including LSTM, BERT, and XLNet, have been employed in pause filler prediction modules.
1. Introduction
Pause fillers, also known as filled pauses, are brief pauses or semantically empty interjections inserted into speech to simulate the pauses and thinking processes of human speech. They are widely used in speech synthesis and natural language processing to improve the naturalness and fluency of synthesized speech by mimicking the natural habits of human speakers in conversation, making synthesized speech more akin to real human expression. In human communication, people employ pauses of varying lengths to convey meaning, allow time for thinking, and regulate speech pace, among other reasons. For instance, common pause fillers in English include “uh”, “um”, “well”, and “you know”, while common pause fillers in Chinese include “啊”, “呃”, “嗯”, and “那个”. These pause fillers are inserted between sentences or phrases to serve as separators and connectors in speech.
In speech synthesis, the appropriate insertion of pause fillers can enhance the naturalness and fluency of synthesized speech. This provides better speech rhythm and intonation, making it easier for listeners to understand and accept the synthesized speech. The prediction of pause fillers plays a crucial role in speech synthesis systems, requiring the identification of suitable positions to insert pause fillers based on input text and context.
Methods for predicting pause fillers include rule-based approaches, statistical models, and deep learning methods. In recent years, with the advancement of deep learning technologies, neural network models such as LSTM, BERT, and XLNet have been widely employed in pause filler prediction tasks, achieving favorable results. However, the accuracy of existing pause filler prediction modules still has room for improvement, necessitating the adoption of more effective strategies and more efficient natural language processing models for further advancements.
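As a concrete illustration of the deep learning approaches mentioned above, the following is a minimal sketch (not any specific published model) that frames pause filler prediction as token-level classification with a bidirectional LSTM; the label set, vocabulary size, and layer dimensions are hypothetical choices for illustration only.

```python
# Minimal sketch: pause filler prediction as per-token classification.
# Label set, vocabulary size, and dimensions are made up for illustration.
import torch
import torch.nn as nn

class FillerTagger(nn.Module):
    def __init__(self, vocab_size, num_labels, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # A bidirectional LSTM reads the whole sentence so each position sees
        # both left and right context when deciding whether a filler fits there.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, token_ids):
        x = self.embed(token_ids)       # (batch, seq_len, embed_dim)
        h, _ = self.lstm(x)             # (batch, seq_len, 2 * hidden_dim)
        return self.classifier(h)       # per-token logits over filler labels

# Toy usage with a hypothetical label set {0: no filler, 1: "uh", 2: "um"}.
model = FillerTagger(vocab_size=10000, num_labels=3)
tokens = torch.randint(0, 10000, (1, 12))   # one sentence of 12 token ids
logits = model(tokens)
predicted = logits.argmax(dim=-1)           # one filler label per token position
print(predicted.shape)                      # torch.Size([1, 12])
```

A BERT- or XLNet-based variant would replace the embedding and LSTM layers with a pretrained transformer encoder while keeping the same per-token classification head.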
2. Prediction of Pause Fillers
Over the years, numerous scholars have extensively researched the prediction of pause fillers using various methods, which has resulted in more natural and authentic text generation. For instance, Nakanishi R. et al. proposed a method based on analyzing human–robot interaction data and machine learning models to predict the occurrence and appropriate forms of pause fillers, aiming to generate them at the beginning of system utterances in humanoid robot spoken dialog systems to indicate turn-taking or turn-holding intentions [1]. Balagopalan A. et al. compared two common methods for Alzheimer’s disease (AD) detection on a matched dataset, assessing the advantages of domain knowledge and BERT pre-trained transfer models in predicting pauses and interruptions [2]. Sabrina J. M. et al. obtained a dialog agent with greatly improved linguistic calibration by incorporating metacognitive features into the training of a controllable generation model [3]. Ryan L. B. et al. discussed the trajectory of interdisciplinary research on language and the challenges of integrating analysis methods across paradigms, recommending promising future directions for the field along the way [4]. Jennifer E. A. et al. found that, in online language processing, disfluent expressions affect listeners’ understanding of subsequent nouns, making listeners more inclined to associate them with objects that have not been previously mentioned; this reveals that the fundamental process of decoding language input is influenced by disfluencies [5]. These studies highlight the importance of pauses in linguistics and motivate training a well-performing pause filler model to enhance linguistic fluency.
However, due to individual speaking habits, it is challenging to train a universal pause filler prediction module. To address personalized needs, Matsunaga et al. proposed a personalized pause filler generation method based on a group prediction model and explored an alternative group prediction approach [6]. It should be noted that there are significant differences between the Chinese and Japanese languages, making the aforementioned models unsuitable for Chinese. To train a Chinese-specific model, new datasets must be sought, and the grouping conditions for pause fillers need to be reexamined. Furthermore, many pause filler prediction models, including those for Japanese, have not considered integration with mainstream speech synthesis systems or overall performance, necessitating further exploration and improvement in practicality [7].
An accurate and appropriate pause filler prediction model can automatically predict suitable pause fillers in Text-to-Speech (TTS) systems, simulating the fluency and coherence of natural human speech. This makes the synthesized speech sound more natural and reduces the artificiality of the machine-generated voice. A pause filler prediction model helps TTS systems better simulate human communication and expression in dialogs [8]. By inserting pause fillers at appropriate positions, TTS systems can better mimic human-to-human conversations, enhancing user experience and making interactions more natural and friendly [9]. Accurately predicting pause fillers can also avoid unnecessary pauses and redundancies, thereby improving the efficiency of TTS systems: synthesis can proceed more smoothly, reducing unnecessary delays and waiting times. The prediction model can further be customized to individuals’ speaking habits and speech characteristics, making the synthesized speech more personalized and adaptable; different speakers’ pause habits can be incorporated into the model so that the synthesized speech better matches their styles and traits. Combining a pause filler prediction model with a TTS system therefore enhances the quality, naturalness, and personalization of speech synthesis, making the synthesized speech more closely resemble authentic human expression and ultimately improving user experience and satisfaction.
3. TTS
TTS is a technology that converts text into speech. It takes input text and transforms it into audible speech output using speech synthesis techniques [10][11][12]. TTS systems find wide applications in speech synthesis, accessibility technologies, education, and various other fields. A typical TTS system consists of two main components: the front-end and the back-end. The front-end is responsible for analyzing the input text, extracting linguistic information and speech features, and then generating an intermediate representation in the form of a phoneme sequence. The back-end utilizes these phoneme sequences and acoustic models to generate the final speech output using speech synthesis algorithms.
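To make the front-end/back-end split concrete, here is an illustrative sketch in Python; the function names and dummy return values are placeholders rather than any particular TTS toolkit’s API.

```python
# Illustrative sketch of a two-stage TTS pipeline; all names and return values
# are placeholders, not a real library's interface.
def analyze_text(text: str) -> list[str]:
    """Front-end: text normalization and grapheme-to-phoneme conversion."""
    # A real front-end would normalize numbers, predict prosody, etc.
    return list(text)  # stand-in "phoneme" sequence

def acoustic_model(phonemes: list[str]) -> list[list[float]]:
    """Back-end stage 1: phonemes -> acoustic features (e.g., mel frames)."""
    return [[0.0] * 80 for _ in phonemes]  # dummy 80-dimensional frames

def vocoder(features: list[list[float]]) -> list[float]:
    """Back-end stage 2: acoustic features -> waveform samples."""
    return [0.0] * (len(features) * 256)  # dummy audio samples

def synthesize(text: str) -> list[float]:
    """Full pipeline: text -> phonemes -> acoustic features -> waveform."""
    return vocoder(acoustic_model(analyze_text(text)))
```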
In a TTS system, the pause filler prediction module plays a crucial role [13]. It automatically determines and inserts appropriate pause fillers based on the input text and contextual information, aiming to enhance the naturalness and fluency of the synthesized speech. By incorporating the pause filler prediction module, the TTS system can better simulate the pauses and thinking processes of human speech, making the synthesized speech more realistic and easier to comprehend.
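The sketch below shows one way such a module could be wired into the text processing stage of a TTS front-end: the predictor labels token positions, fillers are spliced into the token sequence, and the augmented text is then handed to the synthesizer. The label-to-filler mapping and the example labels are hypothetical.

```python
# Sketch of integrating a filler predictor with a TTS front-end.
# The label-to-filler mapping and example labels are made up for illustration;
# 0 means "no filler after this token".
FILLERS = {1: "uh", 2: "um"}

def insert_fillers(tokens: list[str], labels: list[int]) -> str:
    """Insert the predicted filler after each token whose label is non-zero."""
    out: list[str] = []
    for tok, lab in zip(tokens, labels):
        out.append(tok)
        if lab in FILLERS:
            out.append(FILLERS[lab])
    return " ".join(out)

# Example: a (hypothetical) predictor flags a filler after "think".
tokens = ["I", "think", "we", "should", "wait"]
labels = [0, 2, 0, 0, 0]
augmented = insert_fillers(tokens, labels)
print(augmented)  # "I think um we should wait"
# The augmented text would then be passed to the TTS front-end for synthesis.
```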
In recent years, more and more TTS systems have been applied to Chinese, such as the Tacotron 2, Parallel WaveNet, and FastSpeech systems. Qin et al. proposed a Myanmar TTS synthesis approach using an end-to-end model [14]. Luz et al. used text context embeddings, computed by BERT (a pre-trained language representation model), to directly predict the prosodic features of reference audio; their improved system could generate speech with richer prosody during the inference stage with limited training data [15]. Zhang and Ling proposed a speech synthesis model based on fine-grained style representation, called word-level style variation (WSV) [16]. To improve the accuracy of WSV prediction and the naturalness of synthesized speech, Zhang and Ling used a pretrained BERT model and speech information to derive semantic descriptions. Liu et al. proposed a speech synthesis method based on LPCNet [17]. Qiu et al. proposed an end-to-end speech synthesis method based on WaveNet [18]. Zhang and Ling designed two context encoders [19], a sentence-window context encoder and a paragraph-level context encoder, in which the context representation is extracted by BERT from multiple sentences through an additional attention module. They also proposed a deep learning method using BERT to provide a wide range of contextual representations for statistical parametric speech synthesis (SPSS) [20][21][22]. To explore the problem of zero-shot TTS, Casanova E et al. proposed a speaker-conditional architecture that includes a flow-based decoder [23]; it achieved state-of-the-art similarity to new speakers while being trained on only eleven speakers. Wu Y et al. developed AdaSpeech 4 [24], a zero-shot adaptive TTS system for high-quality speech synthesis, which achieved better voice quality and similarity than baselines on multiple datasets without any fine-tuning. Kumar N et al. presented a novel zero-shot multi-speaker speech synthesis approach (ZSM-SS) [25]; compared with the normalization architecture, ZSM-SS adds non-autoregressive multi-head attention within the encoder–decoder architecture [26][27][28].
This entry is adapted from the peer-reviewed paper 10.3390/app131910652