A Rule-Based Grapheme-to-Phoneme Conversion System: Comparison
Please note this is a comparison between versions V1 by Piotr Kłosowski and V2 by Dean Liu.

Natural language processing often requires grapheme-to-phoneme (G2P) conversion of orthographic text. G2P conversion maps strings of graphemes directly to the corresponding sequences of phonetic transcription characters, and it is crucial for many applications in speech and language processing.

  • grapheme-to-phoneme conversion
  • speech recognition
  • language corpus

1. Problem Formulation

The process of converting graphemes to phonemes in orthographic text involves converting a string of orthographic characters into a corresponding string of phonetic transcription characters (representing phonemes or allophones) [1][2]. A ‘grapheme’ is any of the units of any writing system for any language, a term coined by analogy with the ‘phoneme’ of a spoken language [2][26]. Graphemes include alphabetic letters, typographic ligatures, numerical digits, punctuation marks, and other individual symbols of writing systems. Since the orthographic text is the only source of pronunciation information in the process of converting graphemes into phonemes, this process must be based on appropriate formal rules, depicting the correct pronunciation of orthographic strings in a given language [3][27].
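The idea of formal conversion rules can be made concrete with a minimal sketch. The rule table below is a hypothetical, heavily reduced example and is not the rule set of the system discussed here; the names `RULES` and `g2p` are assumptions introduced for illustration. Each rule pairs a grapheme string and an optional right-context pattern with a SAMPA output, and rules are tried in order so that multi-letter graphemes and context-specific rules take precedence.

```python
import re

# Illustrative rules only: (grapheme, right-context regex, SAMPA output).
# Order matters: digraphs and context-specific rules come before defaults.
RULES = [
    ("sz", "", "S"),
    ("cz", "", "tS"),
    ("ch", "", "x"),
    ("rz", "", "Z"),
    ("ż", "", "Z"),
    ("ł", "", "w"),
    ("ó", "", "u"),
    ("y", "", "I"),
    ("c", "", "ts"),
    ("w", r"$", "f"),   # word-final devoicing, e.g. "lew"
    ("b", r"$", "p"),   # word-final devoicing, e.g. "chleb"
    ("w", "", "v"),
]


def g2p(word):
    """Convert one orthographic word to a list of SAMPA symbols."""
    word = word.lower()
    phones = []
    i = 0
    while i < len(word):
        for graph, right, phone in RULES:
            rest = word[i + len(graph):]
            if word.startswith(graph, i) and re.match(right, rest):
                phones.append(phone)
                i += len(graph)
                break
        else:
            # No rule applies: assume the letter maps to itself.
            phones.append(word[i])
            i += 1
    return phones


print(g2p("szyszka"))  # ['S', 'I', 'S', 'k', 'a']
print(g2p("lew"))      # ['l', 'e', 'f']
```

A practical rule set needs many more rules and richer left and right contexts (for example for palatalisation and voicing assimilation), which this sketch deliberately omits.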
Phonemes are usually written in specially designed alphabets. The most widely used alphabet is the International Phonetic Alphabet (IPA) [4][28]. For the Polish language, as with other Slavic languages, a special transcription system, called the Slavistic Phonetic Alphabet (SPA), is most frequently used [5][29]. The other very commonly used phonetic alphabet is the Speech Assessment Methods Phonetic Alphabet (SAMPA) [6][30]. SAMPA is a machine-readable phonetic alphabet based on the IPA that uses 7-bit printable ASCII characters. Table 1 presents the phonemic inventory of Polish used in this study, with examples, in the SPA, IPA, and SAMPA phonetic alphabets.
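Because SAMPA is plain ASCII, converting between SAMPA and IPA reduces to a table lookup. The sketch below covers only a subset of the rows of Table 1, and the names `SAMPA_TO_IPA` and `sampa_to_ipa` are hypothetical. It also assumes the palatal mark is written as an ASCII apostrophe, in keeping with SAMPA's 7-bit design.

```python
# Symbols that Table 1 lists identically in SAMPA and IPA.
SAME = "t n j i r s v p u m k d l f z b g".split()

# Partial SAMPA -> IPA lookup built from rows of Table 1 (subset only).
SAMPA_TO_IPA = {s: s for s in SAME}
SAMPA_TO_IPA.update({
    "e": "ɛ", "a": "ɑ", "o": "ɔ", "I": "ɨ",
    "S": "ʃ", "Z": "ʒ",
    "n'": "ɲ", "s'": "ɕ", "z'": "ʑ",
})


def sampa_to_ipa(symbols):
    """Map a list of SAMPA symbols to IPA; raises KeyError outside the subset."""
    return [SAMPA_TO_IPA[s] for s in symbols]


# "syty" is transcribed s I t I in SAMPA.
print(sampa_to_ipa("s I t I".split()))  # ['s', 'ɨ', 't', 'ɨ']
```

Raising `KeyError` for unlisted symbols, rather than passing them through unchanged, keeps an incomplete table from silently producing wrong IPA.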
Table 1. The set of Polish phonemes with examples, written in the SPA, IPA, and SAMPA phonetic alphabets, which corresponds to the set of phonemes used for the purpose of this study.
No.  SPA    IPA    SAMPA   Example of occurrence in Polish
1    [e]    [ɛ]    [e]     serce
2    [a]    [ɑ]    [a]     baba
3    [o]    [ɔ]    [o]     oko
4    [t]    [t]    [t]     trawa
5    [n]    [n]    [n]     noc
6    [y]    [ɨ]    [I]     syty
7    [i̯]    [j]    [j]     jajo
8    [i]    [i]    [i]     wici
9    [r]    [r]    [r]     rok
10   [s]    [s]    [s]     sok
11   [v]    [v]    [v]     wada
12   [p]    [p]    [p]     praca
13   [u]    [u]    [u]     buk
14   [m]    [m]    [m]     mama
15   [k]    [k]    [k]     kot
16   [ń]    [ɲ]    [n’]    koń
17   [d]    [d]    [d]     dudek
18   [l]    [l]    [l]     lato
19   [u̯]    [ɫ]    [w]     łysy
20   [š]    [ʃ]    [S]     szyszka
21   [f]    [f]    [f]     fala
22   [z]    [z]    [z]     koza
23   [c]    [ʦ͡]    [ts]    cacko
24   [b]    [b]    [b]     baba
25   [g]    [g]    [g]     godło
26   [ś]    [ɕ]    [s’]    siano
27   [ć]    [ʨ͡]    [ts’]   ciasto
28   []     [ʝ]    [x]     higiena
29   [č]    [ʧ͡]    [tS]    czarny
30   [ž]    [ʒ]    [Z]     każdy
31   []     []     [e ]    ręka
32   [ḱ]    [c]    [k’]    kino
33   []     [ʥ͡]    [dz’]   dziedzic
34   [ʒ]    [ʣ͡]    [dz]    nadzy
35   [ź]    [ʑ]    [z’]    ziarno
36   [ǵ]    [ɟ]    [g’]    magiczny
37   []     [ʤ͡]    [dZ]    droże
Automatic grapheme-to-phoneme conversion is not a new problem. The first linguist who noted it, and tried to provide a solution for a particular language (Czech), was H. Kučera [7][31]. Research on solutions to the automatic grapheme-to-phoneme conversion problem has also been initiated for other languages [8][9][10][32,33,34]. In Poland, the first linguist who wrote about the possibility of phonetic interpretation of text by machines was W. Doroszewski in 1969 [11][35]. The largest contributions to solving the problem of automatic grapheme-to-phoneme conversion for Polish were the publications of Maria Steffen-Batóg [12][13][36,37]. The first implementation of a grapheme-to-phoneme conversion algorithm for Polish, designed for the ODRA 1204 computer, was made in 1971 by M. Warmus [14][38].

2. Grapheme-to-phoneme Conversion

  • Automatic conversion of graphemes into phonemes in orthographic texts is not only a technical issue, consisting of developing appropriate conversion algorithms, but also a serious linguistic problem. Only specialists in the linguistics and phonetics of a given language are able to formulate appropriate rules for converting graphemes into phonemes for speech [15][51];
  • An additional complication is that automatic conversion of graphemes to phonemes is a language-specific problem, with different spelling and pronunciation conventions within the same language [16][17][18][19][55,68,69,70];
  • Effective solutions for automatic grapheme-to-phoneme conversion in one language may not help solve the same problems for a different language. There is not a single linguistic and technical problem of automatic grapheme-to-phoneme conversion to be solved, but many different problems, with different levels of difficulty, that should be solved for each language separately [15][51];
  • Automatic grapheme-to-phoneme conversion is widely used not only in speech synthesis, but also in speech recognition [20][21][3,53];
  • A separate, but very important, problem is the evaluation of grapheme-to-phoneme conversion processes [21][22][53,71]. Evaluation and validation of grapheme-to-phoneme conversion implementations is a laborious and time-consuming process. All problems registered for the G2P implementation discussed in this paper were successfully resolved;
  • The G2P implementation developed for this research is not the only one for Polish [3][23][24][25][26][27,39,41,43,45]; however, only one of the others is available for free use [24][41];
  • For comparison, the author of the paper analysed the only available application for the Polish language, named Transcriber [24][41]. The application was implemented in the C++ programming language and uses a dictionary of 5018 words and 767 defined conversion rules. The software presented by the author in this paper, TransFon, was implemented in the Python programming language with 975 conversion rules; its dictionary is very limited and plays only a supporting role. TransFon therefore implements 208 more transcription rules than Transcriber, which is over 27% more. Transcriber failed to compile because the libraries used by its programmer were not included with the source code. This made it impossible to evaluate the correctness of the application and seriously hindered the comparison with the software created by the author of the paper. However, analysis of the application’s source code shows that Transcriber is also rule-based, but that its author tried to refine and improve its performance by adding new words (exceptions) to the dictionary. The author of TransFon, on the other hand, tried to add and supplement transcription rules in a way similar to that known from the literature. This is evidenced by the dictionary sizes used in the two applications;
  • The G2P system presented here could be used for Polish corpus development;
  • The G2P implementation presented here did not exploit any similar pre-existing tools [27][48];
  • It is worth noting that the solutions presented here for the development of language and speech corpora in Polish are not the only ones, and publications on this subject are available [28][29][72,73];
  • Of particular interest are the results presented in publications by Grażyna Demenko et al. [23][30][31][32][33][34][35][39,62,63,64,65,66,67].
The grapheme-to-phoneme conversion system developed, and its ability to create phonemic language corpora for Polish, open up further opportunities for research on improving automatic speech recognition in Polish. The plan for further research towards achieving this goal, using the phonemic language corpus developed, includes:
  • Performing a better and more detailed statistical analysis of the Polish language based on the phonemic language corpus developed [36][37][17,19];
  • Developing more efficient word-based and phoneme-based statistical language models for speech recognition applications in Polish [38][37][18,19];
  • Applying deep learning methods to language modelling and speech recognition [39][40][20,21].
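As a rough illustration of the statistical analysis mentioned in the plan above, phoneme and phoneme-pair frequencies can be counted directly from a phonemic corpus. The sketch below is an assumption-laden toy: the function name `phoneme_stats` is hypothetical, and the three-word corpus (SAMPA transcriptions of "baba", "mama", and "oko") stands in for a real corpus.

```python
from collections import Counter


def phoneme_stats(corpus):
    """Return relative phoneme frequencies and bigram counts.

    `corpus` is an iterable of words, each a list of phoneme symbols.
    """
    unigrams = Counter()
    bigrams = Counter()
    for word in corpus:
        unigrams.update(word)
        # Adjacent phoneme pairs within a word.
        bigrams.update(zip(word, word[1:]))
    total = sum(unigrams.values())
    freqs = {p: n / total for p, n in unigrams.items()}
    return freqs, bigrams


# Toy corpus: SAMPA transcriptions of "baba", "mama", "oko".
corpus = [["b", "a", "b", "a"], ["m", "a", "m", "a"], ["o", "k", "o"]]
freqs, bigrams = phoneme_stats(corpus)
print(round(freqs["a"], 3))   # 0.364 (4 of 11 phonemes)
print(bigrams[("b", "a")])    # 2
```

Bigram counts of this kind are also the raw material for a simple phoneme-based statistical language model, although a usable model would need smoothing and a far larger corpus.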