The rapid momentum of deep neural networks (DNNs) has yielded state-of-the-art performance in various machine-learning tasks, including speaker identification. Speaker identification is based on speech signals and the features that can be extracted from them.
1. Introduction
Speaker identification is the process of determining who a person is by comparing their voice against a set of registered, known users; it is distinct from authentication, also known as speaker verification. Speaker recognition, in the sense of identifying and verifying human identity, relies on many features that the human voice signal can provide, such as acoustic information (short-time features) and prosody (speech pace, pitch period, stress, etc.)
[1]. The speech signal’s acoustic features are commonly used when developing a speaker recognition system. These features should be effective, free of redundancy, and compact relative to the original speech waveform. Much speaker recognition research has used DNNs as the classification model. Various models are used in speaker recognition: vector quantization (VQ)
[2], hidden Markov model (HMM)
[3], and deep neural networks (DNNs)
[4]. The VQ technique is typically applied in text-independent speaker recognition systems. It maps a large vector space into a small finite one: the reduced space is partitioned into clusters, each cluster’s centroid is a codeword, and the set of codewords forms a codebook. Classification is then performed by choosing the codebook vector closest to each input vector
[2]. For text-dependent systems, VQ shows unreliable and poor results
[3]. Recently, the success of DNNs in speaker identification has spurred researchers to adopt this approach. DNNs can both learn features and classify them, although in some work feature extraction is carried out separately before the DNN is trained.
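To make the VQ procedure above concrete, the following sketch (a minimal illustration, not a reproduction of any cited system) trains one codebook per enrolled speaker with k-means and identifies a test utterance by the lowest average quantization distortion; the feature dimensionality, codebook size, speaker names, and random data are hypothetical.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

def train_codebook(features, codebook_size=32):
    """Cluster a speaker's training frames (n_frames x n_dims) into a codebook of centroids."""
    codebook, _ = kmeans(features.astype(float), codebook_size)
    return codebook

def identify(test_features, codebooks):
    """Pick the enrolled speaker whose codebook quantizes the test frames with the lowest mean distortion."""
    scores = {}
    for speaker, codebook in codebooks.items():
        _, dists = vq(test_features.astype(float), codebook)
        scores[speaker] = dists.mean()
    return min(scores, key=scores.get)

# Hypothetical usage: random frames stand in for real acoustic feature vectors.
rng = np.random.default_rng(0)
codebooks = {spk: train_codebook(rng.normal(size=(500, 13)))
             for spk in ("spk_a", "spk_b", "spk_c")}
print(identify(rng.normal(size=(200, 13)), codebooks))
```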
2. Deep Learning Networks and YOHO English Speech Dataset
Researchers have proposed many speaker recognition methods, including text-dependent and text-independent systems and speaker identification and speaker verification systems that use pitch or other features as inputs. The authors in
[5] developed a voice recognition methodology that identifies a person from the acoustic signal of the vocal cords’ vibration. The methodology works through pre-processing and feature extraction steps based on a new measurement technique for the vocal cords’ vibration frequency, unlike existing strategies that operate on the speaker’s voice signal directly. The methodology achieves 91.00% accuracy in identifying the desired speaker.
Nasr et al.
[6] proposed a speaker identification system in which a normalized pitch frequency (NPF) is incorporated with mel frequency cepstral coefficients (MFCCs) as the feature input to improve identification performance. The MFCCs were extracted as system features, and a normalization step was then applied to the detected pitch. After feature extraction, the proposed system was evaluated with an artificial neural network (ANN) using a multi-layer perceptron (MLP) model, classifying the speaker based on the newly proposed NPF feature incorporated with MFCCs. The experiment covers seven feature-extraction cases. Simulation results show that the proposed system improves the recognition rate.
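As an illustration of this kind of MFCC-plus-pitch fusion, the sketch below extracts MFCCs and a pitch track with librosa and stacks them frame by frame; the min-max pitch scaling, frame settings, and function name are assumptions for illustration and do not reproduce the normalization proposed in [6].

```python
import numpy as np
import librosa

def mfcc_pitch_features(path, sr=16000, n_mfcc=13, hop=256):
    """Return a (frames, n_mfcc + 1) matrix of MFCCs stacked with a normalized pitch track."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
    # Frame-level fundamental frequency via the YIN estimator.
    f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr, hop_length=hop)
    # Illustrative normalization: scale the pitch track into [0, 1] per utterance.
    f0_norm = (f0 - f0.min()) / (f0.max() - f0.min() + 1e-8)
    n = min(mfcc.shape[1], len(f0_norm))       # guard against off-by-one frame counts
    return np.vstack([mfcc[:, :n], f0_norm[None, :n]]).T
```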
An et al.
[7] used two prototypical deep convolutional neural networks (CNNs), the visual geometry group network (VGG-CNN) and residual networks (ResNet), and selected the VoxCeleb dataset to evaluate the proposed system. The authors detailed the network training parameters, used a varying learning rate to warm up training, and fixed the length of the input sequence at 3 s. To assess the proposed approach’s effectiveness, they compared it against state-of-the-art methods such as traditional i-vector-based methods, ResNet with and without the proposed self-attention layer, and a VGG-like CNN with and without the proposed self-attention layer. For speaker identification, the two proposed methods reached 88.20% and 90.80% top-1 accuracy, respectively.
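The self-attention layer mentioned above pools variable-length frame embeddings into a single utterance-level vector. Below is a minimal PyTorch sketch of such attentive pooling; the embedding size, hidden width, and single-head form are assumptions rather than the exact layer used in [7].

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Weight each frame embedding by a learned attention score, then sum into one utterance vector."""
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 128), nn.Tanh(), nn.Linear(128, 1))

    def forward(self, frames):                               # frames: (batch, time, dim)
        weights = torch.softmax(self.score(frames), dim=1)   # (batch, time, 1)
        return (weights * frames).sum(dim=1)                 # (batch, dim)

# Hypothetical usage on CNN frame embeddings from a 3 s input.
pooled = AttentivePooling(dim=512)(torch.randn(8, 300, 512))
print(pooled.shape)  # torch.Size([8, 512])
```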
Moreover, Meftah et al.
[8] proposed a speaker identification system for emotional speech using a CNN and long short-term memory (LSTM) combined into a convolutional recurrent neural network (CRNN). They selected the King Saud University Emotion (KSUEmotion) corpus for Modern Standard Arabic (MSA) and the Emotional Prosody Speech and Transcripts (EPST) corpus for English. The experiment investigated the impact of language variation and the speaker’s emotional state on system performance. In the pre-processing step, the unvoiced and silent parts of the speech signal were discarded. The proposed system recognizes speakers regardless of their emotional state. The experiment shows good performance, with accuracies of 97.46% for Arabic, 97.18% for English, and 91.83% cross-language, which is highly promising given that only one input feature, the spectrogram, was used. Jahangir et al.
[9] evaluated a novel fusion of MFCCs and time-domain features (MFCCT), combined to improve the accuracy of the speaker identification system. The system was built on a feedforward neural network (FFNN), with a customized FFNN used as the classifier. To train the FFNN, the MFCCT features were fed in to identify speakers based on each utterance’s unique pattern, and different training functions were used to avoid overfitting. The experimental results for the proposed MFCCT features showed overall accuracy between 83.50% and 92.90%.
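As a rough sketch of the classifier side of such a pipeline, the snippet below builds a small feedforward network over fused feature vectors in PyTorch; the layer sizes, dropout rate, and input dimensionality are illustrative assumptions, not the configuration of [9].

```python
import torch
import torch.nn as nn

def build_ffnn(n_features=40, n_speakers=100):
    """Small feedforward speaker classifier; dropout is one simple way to curb overfitting."""
    return nn.Sequential(
        nn.Linear(n_features, 256), nn.ReLU(), nn.Dropout(0.3),
        nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.3),
        nn.Linear(128, n_speakers),
    )

model = build_ffnn()
logits = model(torch.randn(32, 40))  # a batch of fused feature vectors
print(logits.shape)                  # torch.Size([32, 100])
```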
Jakubec et al.
[10] recently evaluated a text-independent speaker recognition system using deep CNNs. They experimented with two comprehensive architectures, ResNet and VGG, and selected the VoxCeleb1 dataset for the proposed system. Two speaker recognition (SR) approaches were covered: speaker identification (SI) and speaker verification (SV). According to the experimental results, the best accuracy in both SI and SV was achieved by the network based on ResNet-34. Singh
[11] used a CNN to develop a speaker recognition model. The system experimented with a novel combination of low-level and high-level features based on MFCCs. They used two classifiers, support vector machine (SVM) and k-nearest neighbors (KNN), to enhance the system’s performance, and reported accuracy at different SNR levels. To sum up, Table 1 compares the reviewed works and their reported results.
Vandyke et al.
[12] developed voice source waveforms for utterance-level speaker identification using SVM: each speech signal in the YOHO dataset is processed in a feature extraction phase that goes through several steps to determine the pitch period of each frame. The source-frame features are then used as input to the SVM model. They evaluated both multi-class SVM classification and single-class SVM regression, obtaining 85.30% and 72.50% accuracy, respectively. Shah et al.
[13] proposed a two-branch network that extracts features from face and voice signals in a multimodal system, tested with an SVM. The results indicate an overlap between face and voice information. Furthermore, they demonstrated that facial data enhance speaker recognition performance.
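Both [12] and [13] ultimately score fixed-length feature vectors with an SVM. The sketch below trains a multi-class SVM speaker classifier with scikit-learn on synthetic utterance-level vectors; the RBF kernel, regularization constant, and random data are assumptions for illustration only.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical utterance-level feature vectors and speaker labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 64))
y_train = rng.integers(0, 10, size=300)

# Multi-class SVM (one-vs-one under the hood) with feature scaling.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
clf.fit(X_train, y_train)
print(clf.predict(rng.normal(size=(5, 64))))
```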
Hamsa et al.
[14] evaluated an end-to-end framework for speaker identification under challenging conditions, including emotion and interference, using learned voice segregation and a speech VGG. The presented model outperformed recent work on emotional speech data in English and Arabic, reporting average speaker identification rates of 85.2%, 87.0%, and 86.6% on the Ryerson audio-visual dataset (RAVDESS), the speech under simulated and actual stress (SUSAS) dataset, and the Emirati-accented speech dataset (ESD), respectively. Nevertheless, since speech carries a wealth of information, obtaining salient features that can identify speakers remains a difficult problem in speech recognition systems
[15][16].
Table 1. Related works comparison with results.
| Reference | Year | Dataset | SI | SV | Features | Technique | Accuracy (%) |
|---|---|---|---|---|---|---|---|
| [12] | 2013 | YOHO | ✓ | | Pitch | SVM | Multi-class: 85.30; single-class: 72.50 |
| [5] | 2017 | - | ✓ | | Spectrogram | Correlation | 91.00 |
| [7] | 2019 | VoxCeleb | ✓ | | MFCCs | VGG, ResNet | 88.20, 90.80 |
| [8] | 2020 | KSU-Emotions, EPST | ✓ | | Spectrogram | CRNN | Arabic: 97.46; English: 97.18 |
| [9] | 2020 | LibriSpeech | ✓ | | MFCCT | FFNN | 92.90 |
| [17] | 2020 | - | ✓ | | MFCCs, pitch | Correlation | 92.00 |
| [10] | 2021 | VoxCeleb1 | ✓ | ✓ | Spectrogram | VGG, ResNet | SI: 93.80; SV: 5.33 EER |
| [11] | 2023 | - | ✓ | | MFCCs | CNN | 92.46 |
| [13] | 2023 | VoxCeleb1 | ✓ | | Two-branch network | SVM | 97.20 |
| [14] | 2023 | RAVDESS, SUSAS, ESD | ✓ | | Voice segregation, speech VGG | New pipeline | 85.20, 87.00, 86.60 |
| [18] | 2023 | VoxCeleb1 | ✓ | | Pre-feed-forward feature extractor | SANs | 94.38 |
| [19] | 2023 | DEMoS | ✓ | | Spectrogram, MFCCs | CNN | 90.15 |

SI = speaker identification; SV = speaker verification.
This entry is adapted from the peer-reviewed paper 10.3390/app13179567