Speech disfluency, particularly stuttering, can have a significant impact on effective communication. Stuttering is a speech disorder characterized by repetitions, prolongations, and blocks in the flow of speech, which can result in communication difficulties, social isolation, and low self-esteem. Stuttering can also lead to negative reactions from listeners, such as impatience or frustration, which can further exacerbate communication difficulties.
1. Background: Stuttering and Its Types
Stuttering is not a disease but a disorder that can be treated through proper consultation
[1]. It has several types, distinguished by how they hinder fluent speech. A summary of these stuttering types is given in
Table 1 and further explained below. Block (BL) refers to a sudden pause in vocal utterances in the middle of speech. For example—I want [...] pizza—where there is a distinct gap in the speech. This pause is involuntary and hard to detect from audio signals alone.
Prolongation (PR) happens when the speaker elongates a syllable/phoneme of a word while speaking. The duration of such elongations varies with the severity of the disfluency, and they are often accompanied by high pitch. An example of this is—Have a ni[iiii]ce day.
In stuttering disfluency, Repetition refers to the quick repetition of a part of speech. It is further classified into different categories. Sound Repetition (SR) happens when only a small sound is repeated. For example—I am re[re-re-re]ady, where the sound ‘re’ is repeated more than once. In Word Repetition (WR), the speaker repeats a complete word, as in—I am [am] fine. Phrase Repetition (PhR), as the name suggests, is the repetition of a phrase while speaking. An example of this is—He was [he was] there.
The last stuttering type is Interjection (IJ), in which the speaker utters filler words/exclamations that do not belong to the spoken phrase. Some common filler words are ‘um’, ‘uh’, ‘like’, ‘you know’, etc. The No Dysfluency (ND) entry in Table 1 does not refer to a stuttering type; it denotes a speaker or audio clip without any stuttering. This research focuses on detecting the following stuttering types: BL, PR, SR, WR, and IJ. These five stuttering types are the most common and are addressed in most of the research work.
2. Stutter Classification Using Classic Machine Learning
The paper
[2] focused on the use of Linear Predictive Cepstral Coefficients (LPCC) to identify prolongations and repetitions in speech signals. The authors of the paper manually annotated 10 audio clips from University College London’s Archive of Stuttered Speech Dataset (UCLASS)—a single clip from each of the 8 male and 2 female speakers. They then extracted LPCC features from the clips by representing the Linear Predictive Coefficients (LPC) in the cepstrum domain
[3] using auto-correlation analysis. Linear Discriminant Analysis (LDA) and k-Nearest Neighbors (k-NN) algorithms were used to classify the clips. The authors obtained 89.77% accuracy while using k-NN with k = 4 and 87.5% accuracy using the LDA approach.
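To make this pipeline concrete, the following is a minimal sketch of LPCC extraction and classification, assuming librosa for LPC estimation and scikit-learn for the classifiers; the sampling rate, frame size, LPC order, mean pooling over frames, and the `clip_paths`/`labels` variables are illustrative assumptions, and only k = 4 is taken from the paper.

```python
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lpc_to_cepstrum(a, n_ceps):
    """Convert an LPC polynomial [1, a1, ..., ap] (as returned by librosa.lpc)
    to cepstral coefficients via the recursion c_n = -a_n - sum_k (k/n) c_k a_{n-k}."""
    p = len(a) - 1
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = -a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc -= (k / n) * c[k - 1] * a[n - k]
        c[n - 1] = acc
    return c

def lpcc_features(path, order=12, n_ceps=12):
    """One LPCC vector per clip: frame the signal, compute LPCC per frame,
    then average over frames (the pooling choice is an assumption)."""
    y, sr = librosa.load(path, sr=16000)
    frames = librosa.util.frame(y, frame_length=512, hop_length=256)
    return np.mean([lpc_to_cepstrum(librosa.lpc(np.ascontiguousarray(f), order=order), n_ceps)
                    for f in frames.T], axis=0)

# clip_paths / labels are placeholders for the annotated clips and their
# prolongation/repetition labels.
X = np.array([lpcc_features(p) for p in clip_paths])
knn = KNeighborsClassifier(n_neighbors=4).fit(X, labels)  # k = 4, as in [2]
lda = LinearDiscriminantAnalysis().fit(X, labels)
```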
Mel-Frequency Cepstral Coefficients (MFCCs) are used as the speech feature in
[4] to determine whether an audio clip contains repetition. The authors employed the Support Vector Machine (SVM) algorithm as a classifier in an attempt to identify disfluent speech from 15 audio samples. Their approach resulted in 94.35% average accuracy. The paper
[5] also emphasized using MFCCs and obtained an average of 87% accuracy using Euclidean Distance as the classification algorithm.
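A comparable MFCC-based setup can be sketched as follows, assuming librosa and scikit-learn; the 13-coefficient setting, mean pooling, and the `clip_paths`/`labels` placeholders are assumptions, and the nearest-template Euclidean classifier is only one plausible reading of the approach in [5].

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_features(path, n_mfcc=13):
    """Mean MFCC vector over a clip (mean pooling is an assumption)."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

X = np.array([mfcc_features(p) for p in clip_paths])   # clip_paths: your files
labels = np.asarray(labels)                            # e.g., 1 = repetition, 0 = fluent
svm = SVC().fit(X, labels)                             # SVM classifier, as in [4]

# Nearest-template Euclidean-distance classifier in the spirit of [5]:
# assign a clip to the class whose mean feature vector is closest.
templates = {c: X[labels == c].mean(axis=0) for c in np.unique(labels)}
def classify(x):
    return min(templates, key=lambda c: np.linalg.norm(x - templates[c]))
```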
The work undertaken in
[6] explored the applicability of the Gaussian Mixture Model (GMM) for stuttering disfluency recognition. The authors curated a dataset containing 200 audio clips from 40 male and 10 female speakers and annotated each clip with one of the following stuttering types—SR, WR, PR, and IJ. They extracted MFCCs from each of the clips and trained the model, achieving the highest average accuracy of 96.43% when using 39 MFCC parameters and 64 mixture components.
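A minimal sketch of this per-class GMM scheme, assuming librosa and scikit-learn, is shown below; `clips_by_type` is a hypothetical mapping from stuttering type to file paths, while the 39-dimensional features (13 MFCCs plus deltas and delta-deltas) and the 64 mixture components follow the reported configuration.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc39(path):
    """13 MFCCs plus delta and delta-delta coefficients: 39 parameters per frame."""
    y, sr = librosa.load(path, sr=16000)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.vstack([m, librosa.feature.delta(m),
                      librosa.feature.delta(m, order=2)]).T   # (frames, 39)

# Train one GMM per stuttering type (SR, WR, PR, IJ) with 64 components.
gmms = {}
for stype, paths in clips_by_type.items():
    frames = np.vstack([mfcc39(p) for p in paths])
    gmms[stype] = GaussianMixture(n_components=64, covariance_type='diag').fit(frames)

def classify(path):
    """Pick the type whose GMM assigns the highest average log-likelihood."""
    frames = mfcc39(path)
    return max(gmms, key=lambda t: gmms[t].score(frames))
```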
The work
[7] suggested that speech therapy has a significant effect on treating stuttered speech. In this research, the authors introduced the Kassel State of Fluency Dataset (KSoF), which contains audio clips from People Who Stutter (PWS) who underwent speech therapy. KSoF contains 5500 audio clips covering 6 different stuttering events—BL, PR, SR, WR, IJ, and therapy-specific speech modifications. The authors extracted ComParE 2016
[8] features using OpenSMILE
[9] and wav2vec 2.0 (W2V2)
[10] and then trained an SVM classifier with a Gaussian kernel. The model produced a 48.17% average F1 Score.
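The feature-extraction side of this setup can be reproduced with the openSMILE Python wrapper, as sketched below; the RBF (Gaussian) kernel matches the paper, while C, gamma, and the `clip_paths`/`labels` placeholders are assumptions, and the wav2vec 2.0 branch is omitted.

```python
import numpy as np
import opensmile                      # pip install opensmile
from sklearn.svm import SVC

# ComParE 2016 functionals (6373 features per clip) via openSMILE.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)
X = np.vstack([smile.process_file(p).to_numpy() for p in clip_paths])

# SVM with a Gaussian (RBF) kernel, as in [7].
clf = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X, labels)
```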
Table 2 provides a summary of the different ML methods used for stutter classification. Most of the works that utilize classical ML methods have used a small number of audio clips, often curated by the authors themselves. Given the variability of stuttering disfluency, these small datasets neither represent a wide range of speakers nor provide enough data for the ML models to be trained properly. This might cause the models to be biased.
Table 2. Summary of Prior Machine Learning Approaches for Stuttered Speech Classification.

| Paper | Dataset | Feature | Model/Method | Results |
|-------|---------|---------|--------------|---------|
| [2] | UCLASS | LPCC | k-NN and LDA | Acc. 89.77% for k-NN and 87.5% for LDA |
| [4] | Custom | MFCC | SVM | Avg. Acc. 94.35% |
| [5] | Custom | MFCC | Euclidean Distance | Avg. Acc. 87% |
| [6] | Custom | MFCC | GMM | Avg. Acc. 96.43% |
| [7] | KSoF | OpenSMILE and wav2vec 2.0 | SVM | Avg. F1 48.17% |
3. Stutter Classification Using Deep Learning
The work performed in
[11] explores the usage of respiratory bio-signals to differentiate between BL and non-BL speech. The authors conducted a study in which a total of 68 speakers (36 Adults Who Stutter (AWS) and 33 Adults Who Do Not Stutter (AWNS)) performed a speech-related task while their respiratory patterns and pulse were recorded. Various features were extracted from the bio-signals, and a Multi-Layer Perceptron (MLP) was trained to classify them. Their approach resulted in 82.6% accuracy.
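A minimal sketch of such a bio-signal classifier, assuming scikit-learn, is given below; the hidden-layer sizes and the named input features are hypothetical stand-ins for the features engineered in [11].

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X_train rows hold hand-crafted respiratory/pulse features per speech task
# (e.g., breathing rate, inhalation/exhalation durations, heart rate);
# y_train marks 1 = block (BL), 0 = non-block speech. All placeholders.
model = make_pipeline(
    StandardScaler(),                                  # bio-signals need scaling
    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500),
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))                     # held-out accuracy
```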
In the paper
[12], the authors explored Residual Networks (ResNet)
[13] and Bidirectional Long Short-Term Memory (Bi-LSTM)
Long Short-Term Memory (LSTM) is used in speech processing and Natural Language Processing (NLP), as it is effective for classifying sequential data
[15]. The authors manually annotated a total of 800 audio clips from UCLASS
[16] to train the model and obtained a 91.15% average accuracy.
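The overall shape of such an architecture can be sketched in PyTorch as follows; the channel widths, depth, and pooling are assumptions rather than the exact configuration of [12], and the idea is simply a residual CNN over spectrograms followed by a Bi-LSTM over time.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(ch)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(ch)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)                    # skip connection

class ResNetBiLSTM(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        self.stem = nn.Conv2d(1, 32, 3, padding=1)
        self.res = nn.Sequential(ResidualBlock(32), ResidualBlock(32))
        self.pool = nn.AdaptiveAvgPool2d((1, None))   # collapse the frequency axis
        self.lstm = nn.LSTM(32, 64, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 64, n_classes)

    def forward(self, spec):                          # spec: (batch, 1, freq, time)
        h = self.res(torch.relu(self.stem(spec)))     # (batch, 32, freq, time)
        h = self.pool(h).squeeze(2).transpose(1, 2)   # (batch, time, 32)
        out, _ = self.lstm(h)                         # (batch, time, 128)
        return self.fc(out[:, -1])                    # last step -> class logits
```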
The FluentNet architecture suggested in
[17] is a successor of the previous work: the authors upgraded the plain ResNet to a Squeeze-and-Excitation Residual Network (SE-ResNet)
[18] and added an attention mechanism to focus on the important parts of speech. The experiments were performed using UCLASS and LibriStutter—a synthetic dataset built using clips from the LibriSpeech ASR Corpus
[19]. They obtained average accuracies of 91.75% and 86.7% after training FluentNet on spectrograms (Spec) obtained from UCLASS and LibriStutter, respectively.
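The squeeze-and-excitation mechanism that distinguishes SE-ResNet from a plain ResNet can be sketched in PyTorch as follows; the reduction ratio of 16 is the default from [18], and the block would be inserted into each residual block of the network above.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: reweight feature channels using global context."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                   # x: (batch, channels, freq, time)
        w = x.mean(dim=(2, 3))              # squeeze: global average pooling
        w = self.fc(w)                      # excitation: channel weights in (0, 1)
        return x * w[:, :, None, None]      # rescale each channel map
```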
The study
[20] used a Controllable Time-delay Transformer (CT-Transformer) to detect speech disfluencies and correct punctuation in real time. In this research, the authors first created transcripts for each audio clip
[21], and then word and positional embeddings were generated from each transcript. A CT-Transformer was then trained on the IWSLT 2011
[22] dataset and an in-house Chinese dataset. The model obtained an overall 70.5% F1 Score for disfluency detection on the in-house Chinese corpus.
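A simplified PyTorch sketch of the transcript-side modeling is shown below: word and positional embeddings are summed and fed to a Transformer encoder that tags each token. The controllable time-delay (restricted look-ahead) attention that gives the CT-Transformer its name is omitted, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class DisfluencyTagger(nn.Module):
    """Word + positional embeddings feeding a Transformer encoder that
    assigns a tag (e.g., disfluent vs. fluent) to every token."""
    def __init__(self, vocab_size, max_len=512, d_model=256, n_tags=2):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.tagger = nn.Linear(d_model, n_tags)

    def forward(self, tokens):              # tokens: (batch, seq_len) of word ids
        pos = torch.arange(tokens.size(1), device=tokens.device)
        h = self.word_emb(tokens) + self.pos_emb(pos)   # (batch, seq, d_model)
        return self.tagger(self.encoder(h))             # per-token tag logits
```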
One of the recent deep learning (DL) models for stutter classification is StutterNet
[23]. The authors used the Time-Delay Neural Network (TDNN) model and trained it using MFCC input obtained from UCLASS. The optimized StutterNet resulted in 50.79% total accuracy while classifying stutter types—BL, PR, ND, and Repetition.
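StutterNet's TDNN backbone can be approximated with dilated 1-D convolutions over MFCC frames followed by statistics pooling, as in the sketch below; the layer sizes and the 20-coefficient MFCC input are assumptions, not the exact published configuration.

```python
import torch
import torch.nn as nn

class TDNNClassifier(nn.Module):
    """TDNN-style stutter classifier over MFCC frames (StutterNet-like sketch)."""
    def __init__(self, n_mfcc=20, n_classes=4):
        super().__init__()
        self.tdnn = nn.Sequential(                 # dilated 1-D convs = time delays
            nn.Conv1d(n_mfcc, 64, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, dilation=3), nn.ReLU(),
        )
        self.fc = nn.Linear(2 * 64, n_classes)     # mean + std statistics

    def forward(self, x):                          # x: (batch, n_mfcc, time)
        h = self.tdnn(x)                           # (batch, 64, time')
        stats = torch.cat([h.mean(dim=-1), h.std(dim=-1)], dim=1)
        return self.fc(stats)                      # logits for BL/PR/ND/Repetition
```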
Table 3 provides a summary of existing DL models for stuttered speech classification. Also, the paper
[24] conducted a comprehensive examination of the various techniques used for stuttering classification, including acoustic features, statistical methods, and DL methods. Additionally, the authors highlighted some of the challenges associated with these methods and suggested potential avenues for future research.
Table 3. A Summary of Previous Deep Learning Methods for Stuttered Speech Classification.

| Paper | Dataset | Feature | Model/Method | Results |
|-------|---------|---------|--------------|---------|
| [11] | Custom | Respiratory Bio-signals | MLP | Acc. 82.6% |
| [12] | UCLASS | Spectrogram | ResNet + Bi-LSTM | Avg. Acc. 91.15% |
| [17] | UCLASS + LibriStutter | Spectrogram | FluentNet | Avg. Acc. 91.75% and 86.7% |
| [20] | In-house Chinese Corpus | Word and Position Embedding | CT-Transformer | F1 70.5% |
| [23] | UCLASS | MFCC | StutterNet | Acc. 50.79% |
This entry is adapted from the peer-reviewed paper 10.3390/s23198033