Deep Learning Networks and YOHO English Speech Dataset

The rapid progress of deep neural networks (DNNs) has yielded state-of-the-art performance in many machine-learning tasks, including speaker identification. Speaker identification is based on speech signals and the features that can be extracted from them.

  • classification
  • DNNs
  • English
  • speaker identification

1. Introduction

Speaker identification is the process of determining who a person is by comparing their voice against a set of registered, known users. It differs from authentication (also known as speaker verification), which confirms a claimed identity. Speaker recognition, in the sense of identifying and verifying human identity, relies on the many features that the human voice signal can provide, such as acoustic information (short-time features) and prosody (speech pace, pitch period, stress, etc.) [1]. The acoustic features of the speech signal are commonly used when developing a speaker recognition system; they should be effective, free of redundancy, and compact relative to the original speech waveform. Much speaker recognition research has used DNNs as the classification model. Several types of models are used in speaker recognition: vector quantization (VQ) [2], hidden Markov models (HMMs) [3], and deep neural networks (DNNs) [4]. VQ is typically applied in text-independent systems. It maps a large vector space into a small, finite one: the reduced space is divided into clusters, and the set of cluster centers forms a codebook. Classification is then performed by choosing the codebook vector closest to the input [2]. For text-dependent systems, VQ shows unreliable and poor results [3]. Recently, the success of DNNs for speaker identification has spurred researchers to adopt this approach: DNNs can be used both to generate and to classify features, or, in some studies, feature extraction is carried out separately before the DNN is evaluated.
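
As a concrete illustration of the VQ approach outlined above, the following minimal sketch trains one codebook per enrolled speaker and identifies a test utterance by the codebook with the lowest average quantization distortion. It is written in Python with scikit-learn's KMeans; the feature matrices, speaker labels, and codebook size are hypothetical placeholders, and the code is an assumed simplification rather than the implementation used in the cited works.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebooks(enrollment_features, codebook_size=32):
    """Train one VQ codebook (a set of cluster centroids) per enrolled speaker.

    enrollment_features: dict mapping a speaker id to an (n_frames, n_dims)
    array of acoustic feature vectors (e.g., MFCC frames) from enrollment speech.
    """
    codebooks = {}
    for speaker, feats in enrollment_features.items():
        km = KMeans(n_clusters=codebook_size, n_init=10, random_state=0).fit(feats)
        codebooks[speaker] = km.cluster_centers_
    return codebooks

def identify(test_features, codebooks):
    """Return the speaker whose codebook gives the lowest average distortion.

    Each test frame is quantized to its nearest codebook vector; the mean
    Euclidean distance over all frames is the utterance-level distortion.
    """
    best_speaker, best_distortion = None, np.inf
    for speaker, codebook in codebooks.items():
        # Distances from every test frame to every codebook vector: (n_frames, codebook_size).
        dists = np.linalg.norm(test_features[:, None, :] - codebook[None, :, :], axis=-1)
        distortion = dists.min(axis=1).mean()
        if distortion < best_distortion:
            best_speaker, best_distortion = speaker, distortion
    return best_speaker

# Hypothetical usage with random stand-ins for real feature frames:
# enrollment = {"spk1": np.random.randn(500, 13), "spk2": np.random.randn(500, 13)}
# codebooks = train_codebooks(enrollment)
# print(identify(np.random.randn(200, 13), codebooks))
```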

2. Deep Learning Networks and YOHO English Speech Dataset

Researchers have proposed many speaker recognition methods, including text-dependent and text-independent systems and speaker identification or speaker verification systems that use pitch or other features as inputs. The authors in [5] developed a voice recognition methodology that identifies a person from the acoustic signal of the vocal cords' vibration. The methodology applies pre-processing and feature extraction steps based on a new measurement technique for the vocal cords' vibration frequency, unlike existing strategies that work directly on the speaker's voice, and it achieves 91.00% accuracy in identifying the desired speaker. Nasr et al. [6] proposed a speaker identification system in which normalized pitch frequency (NPF) is combined with mel-frequency cepstral coefficients (MFCCs) as the input features to improve identification performance. The MFCCs were extracted as system features, and a normalization step was then applied to the pitch detection approach. After feature extraction, the system was evaluated with an artificial neural network (ANN) using a multi-layer perceptron (MLP) model, classifying the speaker based on the newly proposed NPF feature combined with MFCCs. The experiment covered seven feature extraction cases, and simulation results show that the proposed system improves the recognition rate. An et al. [7] used two prototypical deep convolutional neural networks (CNNs), the visual geometry group network (VGG-CNN) and residual networks (ResNet), and selected the VoxCeleb dataset to evaluate the proposed system. The authors detailed the network training parameters, varying the learning rate to warm up the learning process, and fixed the input sequence length at 3 s. To assess the proposed approach's effectiveness, they compared it against state-of-the-art methods such as traditional i-vector-based methods, ResNet with and without the proposed self-attention layer, and a VGG-like CNN with and without the proposed self-attention layer. For speaker identification, the two proposed methods reached 88.20% and 90.80% top-1 accuracy, respectively. Moreover, Meftah et al. [8] proposed a speaker identification system for emotional speech, combining a CNN and long short-term memory (LSTM) into a convolutional recurrent neural network (CRNN). They selected the King Saud University Emotion (KSUEmotion) corpus for Modern Standard Arabic (MSA) and the Emotional Prosody Speech and Transcripts (EPST) corpus for English. The experiment investigated the impact of language variation and the speaker's emotional state on system performance. In the pre-processing step, the unvoiced and silent parts of the speech signal were discarded. The proposed system recognizes speakers regardless of their emotional state, with accuracies of 97.46% for Arabic, 97.18% for English, and 91.83% for the cross-language setting, which is highly promising given that the system uses only one input feature, the spectrogram. Jahangir et al. [9] evaluated a novel fusion of MFCCs and time-based features (MFCCT) to improve the accuracy of a speaker identification system built on a feedforward neural network (FFNN). A customized FFNN was used as the classifier to identify the speaker. To train the FFNN, the MFCCT features were fed to the network to identify speakers based on each utterance's unique pattern, and different training functions were used to avoid overfitting. The experimental results for the proposed MFCCT features showed overall accuracy between 83.50% and 92.90%.

Jakubec et al. [10] recently evaluated a text-independent speaker recognition system using deep CNNs. They experimented with two well-established architectures, ResNet and VGG, and selected the VoxCeleb1 dataset for the proposed system. Both speaker recognition (SR) tasks, speaker identification (SI) and speaker verification (SV), were considered, and according to the experimental results, the best accuracy in both SI and SV was achieved by the ResNet-34-based network. Singh [11] used a CNN to develop a speaker recognition model based on a novel combination of low-level and high-level features derived from MFCCs, employing two classifiers, a support vector machine (SVM) and k-nearest neighbors (KNN), to enhance performance; Table 1 summarizes the reported accuracies, which were evaluated at different SNR levels. Vandyke et al. [12] developed voice-source waveforms for utterance-level speaker identification using SVMs: each speech signal of the YOHO dataset passes through a feature extraction phase that determines the pitch period for each frame, and the resulting source-frame feature is used as input to the SVM model. In the experiment, they evaluated both multi-class SVM and single-class SVM regression, obtaining 85.30% and 72.50% accuracy, respectively. Shah et al. [13] proposed a two-branch network that extracts features from face images and voice signals in a multimodal system, tested with an SVM. The results indicate an overlap between face and voice, and they further demonstrated that facial data enhance speaker recognition performance. Hamsa et al. [14] evaluated an end-to-end framework for speaker identification under challenging conditions, including emotion and interference, using learned voice segregation and speech VGG. The presented model outperformed recent work on emotional speech data in English and Arabic, reporting average speaker identification rates of 85.2%, 87.0%, and 86.6% on the Ryerson audio-visual dataset (RAVDESS), the speech under simulated and actual stress (SUSAS) dataset, and the Emirati-accented speech dataset (ESD), respectively. Nevertheless, since speech carries a wealth of information, obtaining salient features that can identify speakers remains a difficult problem in speech recognition systems [15,16].
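
To make the feature-plus-classifier pipelines surveyed above more concrete, the sketch below assembles utterance-level features in the spirit of the MFCC-plus-normalized-pitch approach of Nasr et al. [6] and feeds them to a small MLP. It uses Python with librosa and scikit-learn; the file paths, normalization scheme, feature statistics, and network sizes are illustrative assumptions and do not reproduce the exact configurations of the cited studies.

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def utterance_features(wav_path, n_mfcc=13):
    """Build a fixed-length feature vector from one utterance: MFCC means and
    standard deviations plus simple pitch statistics (a crude stand-in for the
    normalized pitch frequency used in the cited work)."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, n_frames)
    f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)            # frame-wise pitch estimate (Hz)
    f0_norm = f0 / (np.median(f0) + 1e-8)                    # crude normalization; the published scheme differs
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                           [np.median(f0), f0_norm.std()]])

# Hypothetical training data: lists of wav paths and their speaker labels.
# train_paths, train_labels = [...], [...]
# X = np.stack([utterance_features(p) for p in train_paths])
# clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500).fit(X, train_labels)
# predicted_speaker = clf.predict(utterance_features("test.wav")[None, :])
```

Spectrogram-based CNN and CRNN systems such as those in [7], [8], and [10] instead take a time-frequency representation as input and learn the discriminative features inside the network.
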
Table 1. Related works comparison with results.
| Reference | Year | Dataset | Features | Technique | Accuracy (%) |
|---|---|---|---|---|---|
| [12] | 2013 | YOHO | Pitch | SVM | Multi-class: 85.30; single-class: 72.50 |
| [5] | 2017 | - | Spectrogram | Correlation | 91.00 |
| [7] | 2019 | VoxCeleb | MFCCs | VGG, ResNet | VGG: 88.20; ResNet: 90.80 |
| [8] | 2020 | KSUEmotion, EPST | Spectrogram | CRNN | Arabic: 97.46; English: 97.18 |
| [9] | 2020 | LibriSpeech | MFCCT | FFNN | 92.90 |
| [17] | 2020 | - | MFCCs, pitch | Correlation | 92.00 |
| [10] | 2021 | VoxCeleb1 | Spectrogram | VGG, ResNet | SI: 93.80; SV: 5.33 EER |
| [11] | 2023 | - | MFCCs | CNN | 92.46 |
| [13] | 2023 | VoxCeleb1 | Two-branch network | SVM | 97.20 |
| [14] | 2023 | RAVDESS, SUSAS, ESD | Voice segregation, speech VGG | New pipeline | RAVDESS: 85.20; SUSAS: 87.00; ESD: 86.60 |
| [18] | 2023 | VoxCeleb1 | Pre-feed-forward feature extractor | SANs | 94.38 |
| [19] | 2023 | DEMoS | Spectrogram, MFCCs | CNN | 90.15 |

References

  1. Kacur, J.; Truchly, P. Acoustic and auxiliary speech features for speaker identification system. In Proceedings of the 2015 57th International Symposium ELMAR (ELMAR), Zadar, Croatia, 28–30 September 2015; IEEE: Piscataway, NJ, USA; pp. 109–112.
  2. Bharali, S.S.; Kalita, S.K. Speaker identification using vector quantization and I-vector with reference to Assamese language. In Proceedings of the 2017 International Conference on Wireless Communications, Signal Processing and Networking, WiSPNET 2017, Chennai, India, 22–24 March 2017; pp. 164–168.
  3. Zeinali, H.; Sameti, H.; Burget, L. HMM-based phrase-independent i-vector extractor for text-dependent speaker verification. IEEE/ACM Trans. Audio Speech Lang. Process 2017, 25, 1421–1435.
  4. Chang, J.; Wang, D. Robust speaker recognition based on DNN/i-vectors and speech separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 5415–5419.
  5. Ishac, D.; Abche, A.; Karam, E.; Nassar, G.; Callens, D. A text-dependent speaker-recognition system. In Proceedings of the 2017 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), Turin, Italy, 22–25 May 2017; IEEE: Piscataway, NJ, USA; pp. 1–6.
  6. Nasr, M.A.; Abd-Elnaby, M.; El-Fishawy, A.S.; El-Rabaie, S.; El-Samie, F.E.A. Speaker identification based on normalized pitch frequency and Mel Frequency Cepstral Coefficients. Int. J. Speech Technol. 2018, 21, 941–951.
  7. An, N.N.; Thanh, N.Q.; Liu, Y. Deep CNNs With Self-Attention for Speaker Identification. IEEE Access 2019, 7, 85327–85337.
  8. Meftah, A.H.; Mathkour, H.; Kerrache, S.; Alotaibi, Y.A. Speaker Identification in Different Emotional States in Arabic and English. IEEE Access 2020, 8, 60070–60083.
  9. Jahangir, R.; Teh, Y.W.; Memon, N.A.; Mujtaba, G.; Zareei, M.; Ishtiaq, U.; Akhtar, M.Z.; Ali, I. Text-Independent Speaker Identification Through Feature Fusion and Deep Neural Network. IEEE Access 2020, 8, 32187–32202.
  10. Jakubec, M.; Lieskovska, E.; Jarina, R. Speaker Recognition with ResNet and VGG Networks. In Proceedings of the 2021 31st International Conference Radioelektronika (RADIOELEKTRONIKA), Brno, Czech Republic, 19–21 April 2021; IEEE: Piscataway, NJ, USA; pp. 1–5.
  11. Singh, M.K. Robust Speaker Recognition Utilizing Lexical, MFCC Feature Extraction and Classification Technique. 2023. Available online: https://www.researchgate.net/publication/366857924_Robust_Speaker_Recognition_Utilizing_Lexical_MFCC_Feature_Extraction_and_Classification_Technique (accessed on 18 July 2023).
  12. Vandyke, D.; Wagner, M.; Goecke, R. Voice source waveforms for utterance level speaker identification using support vector machines. In Proceedings of the 2013 8th International Conference on Information Technology in Asia (CITA), Kota Samarahan, Malaysia, 1–4 July 2013; IEEE: Piscataway, NJ, USA; pp. 1–7.
  13. Shah, S.H.; Saeed, M.S.; Nawaz, S.; Yousaf, M.H. Speaker Recognition in Realistic Scenario Using Multimodal Data. In Proceedings of the 3rd IEEE International Conference on Artificial Intelligence, ICAI 2023, Islamabad, Pakistan, 22–23 February 2023; pp. 209–213.
  14. Hamsa, S.; Shahin, I.; Iraqi, Y.; Damiani, E.; Nassif, A.B.; Werghi, N. Speaker identification from emotional and noisy speech using learned voice segregation and speech VGG. Expert Syst. Appl. 2023, 224, 119871.
  15. Zailan, M.K.N.; Ali, Y.M.; Noorsal, E.; Abdullah, M.H.; Saad, Z.; Leh, A.M. Comparative analysis of LPC and MFCC for male speaker recognition in text-independent context. ESTEEM Acad. J. 2023, 19, 101–112.
  16. Kao, Y.C.; Chueh, H.E. Voice Response Questionnaire System for Speaker Recognition Using Biometric Authentication Interface. Intell. Autom. Soft Comput. 2022, 35, 913–924.
  17. Gupte, R.; Hawa, S.; Sonkusare, R. Speech Recognition Using Cross Correlation and Feature Analysis Using Mel-Frequency Cepstral Coefficients and Pitch. In Proceedings of the 2020 IEEE International Conference for Innovation in Technology (INOCON), Bengaluru, India, 6–8 November 2020; IEEE: Piscataway, NJ, USA; pp. 1–5.
  18. Safari, P.; India, M.; Hernando, J. Self Attention Networks in Speaker Recognition. Appl. Sci. 2023, 13, 6410.
  19. Costantini, G.; Cesarini, V.; Brenna, E. High-Level CNN and Machine Learning Methods for Speaker Recognition. Sensors 2023, 23, 3461.