Automatic Speech Recognition in Portuguese Language

Automatic Speech Recognition in Portuguese Language: Comparison

Please note this is a comparison between Version 1 by Luis Rato and Version 2 by Peter Tang.

Automatic speech recognition (ASR), commonly known as speech-to-text, is the process of transcribing audio recordings into text, i.e., transforming speech into the respective sequence of words.

machine learning
deep learning
deep neural networks
speech-to-text
automatic speech recognition

1. Introduction

Automatic speech recognition (ASR), commonly known as speech-to-text, is the process of converting sounds into text, i.e., transforming speech into the respective sequence of words. In today’s world, most of the writing process is done on computers, tablets, or smartphones. Although typing is predominant, dictation has shown to be faster ^[1], which opens the opportunity for ASR to become the primary way for writing text. Despite this scenario, some challenges still persist in the ASR field, especially regarding its efficiency in less favorable conditions, e.g., noisy environments, or its application in low-resource languages.

In recent years, the field of artificial intelligence (AI) and related fields, such as machine learning (ML), have seen major growth in popularity and in technological advancements. Deep learning (DL), a subset of ML, makes use of deep neural networks (DNNs), a special type of artificial neural network (ANN), with a large number of layers. This large amount of layers allows features to be extracted from the raw input data without the need for any pre-processing ^[2]. Alongside a very rapid growth of available data, DNNs have started to be used in a variety of fields, such as computer vision, natural language processing, and speech recognition.

The Portuguese language is one of the most spoken languages in the world and it is divided into different variants, such as Brazilian and European Portuguese ^[3]. Among other variants, European Portuguese represents a small fraction of Portuguese speakers, and hence has fewer available resources and research. Developing deep neural models requires large datasets, which are scarce or not available for the European variant of the Portuguese language. Thus, transfer learning can be used as a means of creating ASR systems for languages with limited amounts of data. Such can be achieved by transferring the knowledge from models developed for languages with a larger amount of accessible data, such as the English language. The pre-trained models are then tuned for languages with less available data, such as the Portuguese language ^[4]

2. Deep Learning in ASR

Recent years have seen a growth in knowledge about the application of ANNs in a variety of research areas, including ASR, computer vision (image processing and recognition), and natural language processing. End-to-end (E2E) methods for ASR systems have increased in popularity due to these developments and the exponential growth in the amount of data available ^[5]. Approaches to automatic speech recognition systems are largely based on four types of deep neural networks: convolutional neural networks (CNNs) ^[6]; recurrent neural networks (RNNs) ^[7]; time-delay neural networks (TDNNs) ^[8]; and most recently, transformers ^[9]. These architectures can employ E2E mechanisms such as attention-based encoders and decoders (AED), recurrent neural network transducer (RNN-T), and connectionist temporal classification (CTC) ^[5]. The CTC algorithm is commonly used in ASR, as well as handwriting and other problems, the output of which consists of a sequence. When speech data only contain the audio and the transcript, there is no direct correlation between them. The correct alignment is possible by applying the CTC loss function on a DNN. This decoding process of aligning speech to words makes use of a tokenizer that contains the vocabulary (or alphabet) of the target audio language. The Jasper model architecture is a good example of this ^[10]. It uses a TDNN architecture trained with CTC loss. This model achieved best performance of 2.95% WER on a test clean LibriSpeech dataset and improved over various domains compared to other state-of-the-art models. Many applications in ASR infer pre-trained models from APIs due to their performance for the English language, as shown in Table 1.

Table 1.

API architecture and respective training hours.

	Architecture	Train Hours
Voxfoge WER
	Facebook wav2vec 2.0	1041
	Facebook wav2vec 2.0	12.29%	11.44%
	Google LAS	2025
	Google LAS	12.58%	10.49%	Microsoft ASR System	12,500

These models can also be fine-tuned to other languages if given new vocabulary and speech data. This process is also known as transfer learning (TL). These types of DNNs require large quantities of data to be trained and achieve good performance. When the available data are scarce, the transfer learning technique can be applied to these networks to improve their performance. Transferring knowledge on ANN-based systems is the equivalent of reusing layers from previously trained models. This is accomplished by using previously calculated weights to initialize specified layers of new models, followed by training the remaining layers. The reused layers can either be fixed, in which case the pre-calculated weights will not be updated, or flexible, in which case the pre-calculated weights will be able to be updated according to the new data ^[11]. The weights of the remaining layers are randomly initialized as in a normal ANN. Sampaio et al. ^[12] evaluated the APIs from Table 1 using two collaborative and public Portuguese datasets, Mozilla Common Voice (https://commonvoice.mozilla.org/pt) (accessed on 11 November 2021) and the Voxforge (http://www.voxforge.org/pt) (accessed on 11 November 2021). Each had a domain, the former being common words and utterances, while the latter was audiobooks. The result of each API over each dataset can be seen in Table 2.

Table 2.

APIs results on Mozilla Common Voice (MCV) Corpus and Voxforge Corpus datasets.

	Architecture			MCV WER

Microsoft ASR System			9.56%			7.25%

The model can adapt to new languages or domains if given enough training data for it to transfer to new vocabularies. Transferring knowledge from a high-resource language to a low-resource language, such as Portuguese, has been shown to improve the low-resource ASR model ^[13][14].

3. ASR in the Portuguese Language

The Portuguese language is one of the most spoken in the world, not due to the size of the Portuguese population (10 million), which gives the language its name, but thanks to countries with a much larger number of inhabitants, such as Brazil (214 million), Angola (34 million), and Mozambique (32 million). Despite speaking the same language, the speech varies from country to country and even from region to region, not only in accent but also in vocabulary. The goal of this work is to use European Portuguese (EP), i.e., from Portugal, to develop an ASR system. As already mentioned, EP has fewer speakers than other variants such as Brazilian Portuguese (BP). However, some research has already been developed in the field of ASR with the goal of transcribing EP speech. Pellegrini et al. ^[15] and Hämäläinen et al. ^[16] aimed to transcribe speech from elder and young people since in these age groups people have more difficulties expressing themselves. The goal was to improve the understatement of their speech through the use of ASR systems. Other research aimed to create a speech recognizer for EP based on a corpus obtained from broadcast news and newspapers. The AUDIMUS.media ^[17] speech recognizer is a hybrid system composed of an ANN, a multilayer perceptron (MLP), which classifies phones according to features extracted by a Perceptual Linear Prediction (PLP), a log-RelAtiveSpecTrAl (Log-RASTA), and Modulation Spectrogram (MSG). These components are then combined and used in an HMM for temporal modeling ^[18]. In variants with a larger amount of speakers, such as Brazilian Portuguese, there is also a lack of results related to the development of ASR systems. This shortage is mostly due to the lack of data quantity, quality, or detail in public datasets, or lack of public datasets to begin with, though they are desperately needed, especially when creating models based on DNNs. Lima et al. ^[19] provided a list of 24 Portuguese language datasets alongside some of the available features, such as size, quality, rate, amount of speakers, speaker’s age, and availability (public or private). Of the twenty-four datasets, only six are public, which leads Lima et al. to state that the amount of datasets available is acceptable to build ASR systems for the Portuguese language. Lima et al. also conclude that the types of data are diverse (noisy, wide age range, medical, commands), but the overall quantity, quality, and standardization are poor. Nevertheless, some research has shown that it is possible to create models for ASR systems for the Portuguese language using reduced amounts of data, as little as 1 h, and still achieve considerable results in word error rate (WER), as high as 34% ^[20]. Works regarding ASR systems for Portuguese using DNNs worth mentioning include Gris et al. ^[21], who make use of Wav2vec 2.0 and pre-trained models in other languages (which are then fine-tuned to BP) and achieve an average WER of 12.4% on seven datasets; and Quintanilha et al. ^[22][23], who make use of four datasets (three of which are open) and use models based on DeepSpeech 2 ^[24] with convolutional and bidirectional recurrent layers, making it possible to achieve values of 25.45% WER. Additional work is available regarding ASR systems for the Portuguese language. The app TipTopTalk! ^[25] by Tejedor-García et al. uses Google’s ASR systems to implement a pronunciation training application for various languages including both European and Brazilian variants of Portuguese.

References

Ruan, S.; Wobbrock, J.O.; Liou, K.; Ng, A.; Landay, J. Speech Is 3x Faster than Typing for English and Mandarin Text Entry on Mobile Devices. arXiv 2016, arXiv:1608.07323.
Amigo, J.M. Data Mining, Machine Learning, Deep Learning, Chemometrics Definitions, Common Points and Trends (Spoiler Alert: VALIDATE your models!). Braz. J. Anal. Chem. 2021, 8, 45–61.
Eberhard, D.M.; Simons, G.F.; Fennig, C.D. Ethnologue: Languages of the World, 26th ed.; SIL International: Dallas, TX, USA, 2023.
Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
Li, J. Recent Advances in End-to-End Automatic Speech Recognition, APSIPA Transactions on Signal and Information Processing, 2022, Vol. 11: No. 1, e8. http://dx.doi.org/10.1561/116.00000050
Zhang, Y.; Pezeshki, M.; Brakel, P.; Zhang, S.; Bengio, C.L.Y.; Courville, A. Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks, Proceedings of INTERSPEECH 2016, San Francisco, USA, 2016.
Graves, A.; Jaitly, N. Towards End-To-End Speech Recognition with Recurrent Neural Networks. In Proceedings of Machine Learning Research, Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014; Xing, E.P., Jebara, T., Eds.; PMLR: Bejing, China, 2014; Volume 32, pp. 1764–1772.
Peddinti, V.; Povey, D.; Khudanpur, S. A time delay neural network architecture for efficient modeling of long temporal contexts. In Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015; pp. 3214–3218.
Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision, arXiv, 2022. Available online: https://doi.org/10.48550/arXiv.2212.04356 (accessed on 16 April 2023).
Li, J.; Lavrukhin, V.; Ginsburg, B.; Leary, R.; Kuchaiev, O.; Cohen, J.M.; Nguyen, H.; Gadde, R.T. Jasper: An End-to-End Convolutional Neural Acoustic Model. arXiv 2019, arXiv:1904.03288.
Buijs, R.; Koch, T.; Dugundji, E. Applying transfer learning and various ANN architectures to predict transportation mode choice in Amsterdam. Procedia Comput. Sci. 2021, 184, 532–540.
Xavier Sampaio, M.; Pires Magalhães, R.; Linhares Coelho da Silva, T.; Almada Cruz, L.; Romero de Vasconcelos, D.; Antônio Fernandes de Macêdo, J.; Gonçalves Fontenele Ferreira, M. Evaluation of Automatic Speech Recognition Systems. In Anais do XXXVI Simpósio Brasileiro de Bancos de Dados; Technical Report; SBC: Porto Alegre, Brazil, 2021.
Dalmia, S.; Sanabria, R.; Metze, F.; Black, A.W. Sequence-based Multi-lingual Low Resource Speech Recognition. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP, 2018, Pages 4909–4913. https://doi.org/10.1109/ICASSP.2018.8461802 (accessed on 16 April 2023).
Cho, J.; Baskar, M.K.; Li, R.; Wiesner, M.; Mallidi, S.; Yalta, N.; Karafiát, M.; Watanabe, S.; Hori, T. Multilingual Sequence-to-Sequence Speech Recognition: Architecture, Transfer Learning, and Language Modeling. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018; pp. 521–527.
Pellegrini, T.; Hämäläinen, A.; de Mareüil, P.B.; Tjalve, M.; Trancoso, I.; Candeias, S.; Dias, M.S.; Braga, D. A corpus-based study of elderly and young speakers of European Portuguese: Acoustic correlates and their impact on speech recognition performance. In Proceedings of the Proc. Interspeech 2013, Lyon, France, 25–29 August 2013; pp. 852–856.
Hämäläinen, A.; Cho, H.; Candeias, S.; Pellegrini, T.; Abad, A.; Tjalve, M.; Trancoso, I.; Dias, M.S. Automatically Recognising European Portuguese Children’s Speech. In Computational Processing of the Portuguese Language. PROPOR 2014. Lecture Notes in Computer Science; Baptista, J., Mamede, N., Candeias, S., Paraboni, I., Pardo, T.A.S., Volpe Nunes, M.d.G., Eds.; Springer: Cham, Switzerland, 2014; pp. 1–11.
Meinedo, H.; Caseiro, D.; Neto, J.; Trancoso, I. AUDIMUS.MEDIA: A Broadcast News Speech Recognition System for the European Portuguese Language. In Proceedings of the Computational Processing of the Portuguese Language; Mamede, N.J., Trancoso, I., Baptista, J., das Graças Volpe Nunes, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2003; pp. 9–17.
Meinedo, H.; Abad, A.; Pellegrini, T.; Neto, J.; Trancoso, I. The L2F Broadcast News Speech Recognition System. In Proceedings of the Fala 2010 Conference, Vigo, Spain, 10–12 November 2010.
Aguiar de Lima, T.; Da Costa-Abreu, M. A survey on automatic speech recognition systems for Portuguese language and its variations. Comput. Speech Lang. 2020, 62, 101055.
Gris, L.R.S.; Casanova, E.; Oliveira, F.S.d.; Soares, A.d.S.; Candido-Junior, A. Desenvolvimento de um modelo de reconhecimento de voz para o Português Brasileiro com poucos dados utilizando o Wav2vec 2.0. In Anais do Brazilian e-Science Workshop (BreSci); SBC: Porto Alegre, Brazil, 2021; pp. 129–136.
Stefanel Gris, L.R.; Casanova, E.; de Oliveira, F.S.; da Silva Soares, A.; Candido Junior, A. Brazilian Portuguese Speech Recognition Using Wav2vec 2.0. In Proceedings of the Computational Processing of the Portuguese Language; Pinheiro, V., Gamallo, P., Amaro, R., Scarton, C., Batista, F., Silva, D., Magro, C., Pinto, H., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 333–343.
Macedo Quintanilha, I. End-to-End Speech Recognition Applied to Brazilian Portuguese Using Deep Learning. Master’s Thesis, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil, 2017.
Quintanilha, I.M.; Netto, S.L.; Biscainho, L.W.P. An open-source end-to-end ASR system for Brazilian Portuguese using DNNs built from newly assembled corpora. J. Commun. Inf. Syst. 2020, 35, 230–242.
Amodei, D.; Anubhai, R.; Battenberg, E.; Case, C.; Casper, J.; Catanzaro, B.; Chen, J.; Chrzanowski, M.; Coates, A.; Diamos, G.; et al. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. In International Conference on Machine Learning; PMLR: London, UK, 2015; pp. 173–182.
Tejedor-García, C.; Escudero-Mancebo, D.; González-Ferreras, C.; Cámara-Arenas, E.; Cardeñoso-Payo, V. TipTopTalk! Mobile Application for Speech Training Using Minimal Pairs and Gamification. 2016. Available online: https://uvadoc.uva.es/handle/10324/27857 (accessed on 13 March 2023).