Implementation of Grapheme-to-Phoneme Rules and Extended SAMPA Alphabet in Polish Text-to-Speech Synthesis (footnote 1)

Grażyna Demenko
Institute of Linguistics, Adam Mickiewicz University
lin@amu.edu.pl

Mikołaj Wypych
Institute of Computing Science, Poznań University of Technology
wypych@amu.edu.pl

Emilia Baranowska
Institute of Linguistics, Adam Mickiewicz University
emiszal@amu.edu.pl

Speech and Language Technology. Volume 7

Footnote 1: This work was partially supported by research grant 7T11C00921 from the Polish Committee of Scientific Research.

SUMMARY

The presented work concerns the implementation of phonemic and phonetic transcription rules of Polish for the needs of speech technology (in a module of a speech synthesis system currently being developed at Adam Mickiewicz University and Poznań University of Technology). The report presents a proposed modification of the SAMPA alphabet for Polish, a procedure for verifying the existing transcription rules, and an additional dictionary of exceptions. The operation of the individual parts of the module is also described.

ABSTRACT

The paper concerns the automatic generation of broad and narrow transcription of Polish in a module of a concatenative TTS system under development at Adam Mickiewicz University and Poznań University of Technology, Poland. First, we propose a modified SAMPA extension for Polish. Then, the verification procedure for the existing Polish grapheme-to-phoneme rules is presented. The additional dictionary of exceptions is described in detail. Finally, the article focuses on a rule compiler, a rule applier and a dedicated development environment.

1. Introduction

Automatic grapheme-to-phoneme (GTP) conversion modules, which assign a phonemic/phonetic transcription to the graphemic word forms of a given language, serve many purposes in speech processing systems.
Rapidly developing automatic services that rely on automatic speech recognition (ASR) need real-time access to knowledge about pronunciation: a phonetic transcription enables the system to select the proper acoustic models for evaluating the input utterance. In text-to-speech (TTS) systems, a linguistic transcription of the input text is required to select the appropriate units (e.g. diphones), which are concatenated to generate the intended signal in the final step of synthesis. Automatic speech segmentation algorithms, needed for various linguistic purposes, operate on automatic broad or narrow transcriptions provided along with the speech signals.

The relation between the orthographic layer and the transcription layer may be defined by rules and/or by a dictionary. Quite frequently both methods are used concurrently in one module: if a dictionary search fails, the GTP rules are activated as the default strategy. Since Polish is characterized by a relatively regular relationship between orthographic text and its phonemic/phonetic transcription, the rule-based approach appears more favourable in its case. Nevertheless, a dictionary is still needed to cope with exceptions to the rules.

Transcription rules and dictionaries may be formulated manually by human experts ([1], [2], [3]) or derived automatically from a corpus of texts and their transcriptions by machine learning algorithms, e.g. feedforward neural networks ([4], [5]). In 1973, Steffen-Batogowa published in [1] a set of expert GTP conversion rules for Polish, which were first implemented by Warmus [6]. Since then, many different GTP modules based on Steffen-Batogowa's rules have been presented (cf. [7], [8] for examples and a survey of solutions, [9]). The goal of the present project was to build a practical, market-oriented GTP module starting from the rules proposed in [2] and [3].

2. Expressing GTP relation

2.1. Transcription tables

Let $X^*$ denote the set of all finite strings of elements of a set $X$, including the empty string (the string of length 0), and let $X^+$ denote the set of all finite strings over $X$ excluding the empty string. Let $G$ be the set of elements acceptable in input strings (e.g. a set of Latin letters). Similarly, let $P$ be the set of elements acceptable in output strings (e.g. a set of phoneme symbols).

A transcription table $T[1..m][1..n]$ is a matrix of $m$ rows and $n$ columns whose cells meet the following requirements: $T[1][1] \in G$; $T[i][1] \subseteq G^+$ for $i \in [2..m]$; $T[1][j] \subseteq G^+$ for $j \in [2..n]$; $T[i][j] \in P^*$ for $(i,j) \in [2..m] \times [2..n]$. Each cell indexed by $(i,j) \in [2..m] \times [2..n]$ states that if $T[1][1]$ in an input string is preceded by a string that belongs to $T[i][1]$ (the left context set) and is followed by a string that belongs to $T[1][j]$ (the right context set), then $T[1][1]$ should be transcribed as $T[i][j]$.

The transcription tables proposed by Steffen-Batogowa and Nowakowski (cf. [2], [3]) do not contain overlapping contexts. This solution increases the entropy of the context sets. In the present solution, if several contexts can be recognized simultaneously, priority is given to the rules formed by longer contexts. As a consequence, a zero-length context can be used to specify default columns or rows. Similarly to [7], we allow more than one transcription variant for a grapheme in a given context (which results in $T[i][j] \subseteq P^*$ for $(i,j) \in [2..m] \times [2..n]$).

Context sets are defined by means of the expressions described in [3]. For any $A, B \subseteq G^*$, we define the sum of the sets $A$ and $B$, denoted $A+B$, and the concatenation of the sets $A$ and $B$, denoted $A*B$, which contains every string of the form $ab \in G^*$, where $a \in A$, $b \in B$ and $ab$ is the result of concatenating $a$ and $b$. We use the digit 1 to denote the empty string. For example, {1,con}*{1,cat} produces {1, con, cat, concat}.
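The two set operations above can be illustrated with a minimal Python sketch. The function names (`parse_set`, `set_sum`, `set_concat`, `show`) are illustrative assumptions and are not taken from the described module; only the notation (the digit 1 for the empty string, + for the sum, * for the concatenation) comes from the text.

```python
EMPTY = ""  # the paper writes the empty string as the digit 1


def parse_set(text):
    """Turn '{1,con}' into {'', 'con'}; the digit 1 stands for the empty string."""
    items = text.strip().strip("{}").split(",")
    return {EMPTY if item.strip() == "1" else item.strip() for item in items}


def set_sum(a, b):
    """A+B: every string that belongs to A or to B."""
    return a | b


def set_concat(a, b):
    """A*B: every string ab with a taken from A and b taken from B."""
    return {x + y for x in a for y in b}


def show(strings):
    """Render a set in the paper's notation, shortest strings first."""
    ordered = sorted(strings, key=lambda s: (len(s), s))
    return "{" + ", ".join(s if s else "1" for s in ordered) + "}"


result = set_concat(parse_set("{1,con}"), parse_set("{1,cat}"))
print(show(result))  # {1, cat, con, concat}
```

Because the empty string is an ordinary member of a set here, a context set containing only 1 matches at every position, which is how a zero-length context can act as a default row or column.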
The rule tables are stored in a UTF-8 encoded file, either in an XML format or in a plain text format. The plain text format uses a semicolon as the column delimiter and a newline character as the row delimiter. Elements of $G$ and $P$ are represented by strings of Unicode characters. Examples of transcription tables can be found in 3.2.

2.2. Dictionary

An exception is a matrix $E[1..n][1..3]$ with $n$ rows and 3 columns, where $E[i][1] \in G$, $E[i][2] \in \{0,1\}$ and $E[i][3] \in P^*$ for $i \in [1..n]$. An exception of the form $E[1..n][1..3]$ means that if the string $E[1..n][1]$ matches an input substring then, for each $i \in [1..n]$, if $E[i][2]$ is equal to 1, $E[i][1]$ should be transcribed as $E[i][3]$. Note that transcriptions need not be specified for all the graphemes in an exception, but only for the locations where errors occurred. The remaining graphemes are transcribed according to the rules (see 4.1).

The dictionary is a set of exceptions. Exceptions are stored in a UTF-8 encoded file, either in an XML format or in a plain text format. The plain text file consists of lines, each containing a single exception. The rows of an exception are separated by spaces and saved in one of two forms: E[i][1]:E[i][3] or E[i][1], depending on whether E[i][2] is equal to 1 or 0.

3. GTP relation for Polish

3.1 Input and output character sets

Two sets of characters were precisely defined for the exact GTP mapping for Polish: an input set of characters and an output phonetic/phonemic alphabet.

The input set of symbols for Polish was defined as X = {a, ą, b, c, ć, d, e, ę, f, g, h, i, j, k, l, ł, m, n, ń, o, ó, p, q, r, s, ś, t, u, v, w, x, y, z, ź, ż, #, ##} (footnote 2). One hash substitutes for interword spaces in a string of orthographic symbols and two hashes denote sentence-final punctuation marks.
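Returning to the plain text dictionary format of 2.2, the following Python sketch shows one way a dictionary line could be parsed into the rows of E and applied to a word. It is only an illustration under stated assumptions: the function names are hypothetical, each grapheme is taken to be a single character, "1" or nothing after the colon is read as the empty string, and `toy_rule` is a stand-in for the rule-based transcriber rather than a rule from the paper's tables. The entry used is the yuppie example quoted in 3.3.

```python
def parse_exception(line):
    """Parse one dictionary line into rows (grapheme, flag, transcription).

    'u:a' becomes ('u', 1, 'a'); a bare 'p' becomes ('p', 0, None).
    Assumption: '1' or nothing after the colon means the empty string.
    """
    rows = []
    for token in line.split():
        if ":" in token:
            grapheme, phones = token.split(":", 1)
            rows.append((grapheme, 1, "" if phones in ("", "1") else phones))
        else:
            rows.append((token, 0, None))
    return rows


def apply_exception(rows, word, rule_based):
    """Transcribe `word`, taking only the flagged 'repairs' from the exception.

    Graphemes outside the match, or with flag 0, go to `rule_based`.
    Only the first occurrence of the exception string is handled here.
    """
    key = "".join(grapheme for grapheme, _, _ in rows)
    start = word.find(key)
    output = []
    for index, char in enumerate(word):
        inside = start != -1 and start <= index < start + len(key)
        if inside and rows[index - start][1] == 1:
            output.append(rows[index - start][2])
        else:
            output.append(rule_based(char))
    return "".join(output)


def toy_rule(char):
    """Hypothetical stand-in for the rule tables: y -> j, everything else copied."""
    return {"y": "j"}.get(char, char)


entry = parse_exception("y u:a p:1 p i:i e:1")
print(apply_exception(entry, "yuppie", toy_rule))  # japi
```

Storing only the repaired positions keeps each entry short, which matches the remark that the remaining graphemes are left to the rules.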
The modified computer-readable SAMPA alphabet, which codes IPA symbols as ASCII characters and seems to be a good solution for technical applications, was adopted as the output phonetic alphabet of the module. An inventory of 39 phonemes was employed for broad transcription and a set of 87 allophones was established for narrow transcription of Polish (cf. Table 3.1 and Table 3.2). The authors verified the existing phonetic notations and transcription rules for Polish on the basis of: (1) the literature describing the Polish phonological system and the Polish rules of pronunciation (cf. e.g. [10], [11], [12], [13], [14], [15], [16]) and (2) the results of the acoustic segmentation of a few hundred utterances produced by 50 speakers brought up in two Polish cities (Poznań and Kraków; cf. [17]).

Due to certain assumptions concerning the Polish phonological system and for some technical reasons, the following modifications were made to the original Polish version of SAMPA:

0 The grapheme “y” is transcribed as /y/, not as /I/. The symbol /I/ was judged to be less legible (its form differs from the other symbols for Polish vowels) despite being more similar to the original IPA symbol (cf. [18]).

1 Affricates are represented by two symbols joined by a ^ symbol, the equivalent of the IPA tie bar. This linking character is essential in a string of phonemes to differentiate affricates from consonant clusters consisting of two phonemes, e.g. nadzieja /nad^z’eja/ 'hope' vs. nadziemny /nadz’emny/ 'overhead'. The guidelines to SAMPA transcription [19] state that “(…) it is designed to be uniquely parsable. As with the ordinary IPA, a string of SAMPA symbols does not require spaces between successive symbols.” Nevertheless, the lack of a tie bar in affricates in the Polish version of the alphabet forced a transcriber to separate the individual phonemes in a string of characters by single spaces.
2 On the basis of present knowledge of Polish phonetics, our own detailed acoustic analyses and the experience gained from the annotation of Polish speech databases, it was established that the phonemic counterparts of the graphemes ę and ą are two separate phonemes not only when followed by plosives or affricates, but also before fricatives and (partially) in word-final position. Thus, the original SAMPA symbols /e~/ and /o~/ were abandoned before the fricative consonants /f v s z S Z x/ and in word-final position, in favour of the biphonemic notation /ew~/, /ow~/ before these consonants and /e/, /ow~/ word-finally. The monophonemic notation /e~/ and /o~/ before the palatal fricatives /s’ z’/ was changed into the two-phoneme transcription /ej~/ or /oj~/ respectively, although the alternative transcription with the ‘hard’ nasal resonance /w~/ is also acceptable. Figure 1 shows a spectrogram of /e~/ before the palatal fricatives /s’/ and /z’/ (in kęsia, kęzia). The higher parts of the spectrogram illustrate the biphonemic realization as /ej~/. Figure 2 shows a spectrogram of /e~/ before the fricatives /s/ and /z/ (in kęsa, kęza). The lower parts of the spectrogram illustrate the biphonemic realization as /ew~/.

Figure 1. A spectrogram illustrating kęsia, kęzia (segment labels: k e j~ s’ a; k e j~ z’ a).

Figure 2. A spectrogram illustrating kęsa, kęza (segment labels: k e w~ s a; k e w~ z a).

Footnote 2: Notice that {q, v, x} are not standard Polish graphemes; they appear only in loanwords. It seemed more economical to handle them in the rules than to place the words containing them in the module’s dictionary.

3 The adoption of the above-mentioned solution assumes, for practical reasons, that /w~/ and /j~/ should be treated as individual phonemes in Polish, although their phonological status remains unclear [12].
These two phonemes were not taken into consideration in the original version of Polish SAMPA.

4 Phonemes /c/ and /J/ were also added to the proposed SAMPA extension as phonemes essential to the description of the acoustic differences between the velar consonants /k g/ and the palatal consonants /c J/, respectively, as in the word pairs kat /kat/ vs. kit /cit/ or drogę /droge/ vs. drogie /droJje/.

MAPPING OF PHONETIC ALPHABETS FOR POLISH

SAMPA | ORTHOGRAPHY | TRANSCRIPTION | SAMPA EXTENSION | ORTHOGRAPHY | TRANSCRIPTION
i | pit | pit | i | pit | pit
I | typ | tIp | y | typ | typ
e | test | test | e | test | test
a | pat | pat | a | pat | pat
o | pot | pot | o | pot | pot
u | puk | puk | u | puk | puk
e~ | gęś | g e~ s’ | – | – | –
o~ | wąs | v o~ s | – | – | –
p | pik | pik | p | pik | pik
b | byt | bIt | b | byt | byt
t | test | test | t | test | test
d | dym | dIm | d | dym | dym
k | kit | kIt | k | kat | kat
g | gen | gen | g | gen | gen
– | – | – | c | kiedy | cjedy
– | – | – | J | giełda | Jjewda
f | fan | fan | f | fan | fan
v | wilk | vilk | v | wilk | vilk
s | syk | sIk | s | syk | syk
z | zbir | zbir | z | zbir | zbir
S | szyk | SIk | S | szyk | Syk
Z | żyto | ZIto | Z | żyto | Zyto
s’ | świt | s’ v i t | s’ | świt | s’fit
z’ | źle | z’ l e | z’ | źle | z’le
x | hymn | xImn | x | hymn | xymn
ts | cyk | ts I k | t^s | cyk | t^syk
dz | dzwon | dz v o n | d^z | dzwon | d^zvon
tS | czyn | tS I n | t^S | czyn | t^Syn
dZ | dżem | dZ e m | d^Z | dżem | d^Zem
ts’ | ćma | ts’ m a | t^s’ | ćma | t^s’ma
dz’ | dźwig | dz’ v i k | d^z’ | dźwig | d^z’vik
m | mysz | mIS | m | mysz | myS
n | nasz | naS | n | nasz | naS
n’ | koń | k o n’ | n’ | koń | kon’
N | pęk | peNk | N | pęk | peNk
l | luk | luk | l | luk | luk
r | ryk | rIk | r | ryk | ryk
w | łyk | wIk | w | łyk | wyk
j | jak | jak | j | jak | jak
– | – | – | w~ | ciąża | t^s’ow~Za
– | – | – | j~ | więź | vjej~s’

Table 3.1. Computer coding of Polish phonemes. The IPA symbols are mapped into ASCII characters: SAMPA and extended-SAMPA symbols.
PHONEME | ALLOPHONES
i | i i~ (footnote 3)
y | y y~
e | e e] e~ e]~
a | a a] a~ a]~
o | o o] o~ o]~
u | u u] u~ u]~
j | j
j~ | j~
w | w w/ w,
w~ | w~
l | l l/ l,
r | r r/ r,
m | m m/ m,
n | n n/ n, n–
n’ | n’ n’/
N | N N/ N, N,/
x | x x/ x, x,/
v | v v,
f | f f,
z | z z,
z’ | z’
s | s s,
s’ | s’
Z | Z Z,
S | S S,
d^z | d^z d^z,
t^s | t^s t^s,
d^z’ | d^z’
t^s’ | t^s’
d^Z | d^Z d^Z,
t^S | t^S t^S,
b | b b,
p | p p,
d | d d, d–
t | t t, t–
g | g
J | J
k | k
c | c

Table 3.2. A list of allophones of the Polish language (the modified-SAMPA symbols).

Footnote 3: DIACRITICS: ~ nasalisation or nasality (e~, w~); ] raised articulation (e]); / devoicing (l/); , palatalization (l,); – retracted, here: alveolarized (t–).

3.2. Transcription rules

An editor for rule editing and testing was created. The editor transcribes example texts by means of the defined rules and edits them according to the predefined rules of transcription and the amendments made by experts. The following iterative verification procedure was applied to the transcription rules:

0 Choose an orthographic text to be used in the process of rule testing.
1 Transcribe the text by means of the currently available rules.
2 Erase those substrings, both in the text and in its transcription, which are known to be transcribed properly on the basis of the transcribed text corpus.
3 Correct the remaining part of the transcription. Append the resulting correct transcription to the corpus.
4 Reformulate the rules according to the corrections made in the previous point.
5 Test the new rules on the corpus.

Steps 3 and 4 were performed by an expert while the others were performed automatically. Two corpora of test material were gathered for the above purpose.
One of them included a few dozen written texts downloaded from the electronic versions of Polish nationwide newspapers; the other contained a few thousand single words and phrases obtained from exercises in textbooks on phonetic transcription, dictionaries, and a set of linguistic 'traps' and exceptions for Polish (cf. [3], [9], [14]). The input texts had not undergone any earlier processing. The detected errors were classified according to their causes and removed in turn.

The performance of the rules was evaluated with the ‘words correct’ and ‘symbols correct’ scoring metrics. The percentage of words transcribed accurately in a given set of texts was measured before and after every modification of the rules. Error analysis at the symbol level enabled us to eliminate whole categories of errors, especially in the initial phase of verification (mainly costly misprints in the transcription tables). We also observed how the transcription accuracy varies with the type of texts in the test set (e.g. national news vs. international news with many foreign words, both from the same Polish newspaper). We could not compare the performance of our GTP module with the accuracy figures for previous implementations of Steffen-Batogowa’s rules, as the authors use different evaluation techniques: different input corpora and scoring, as well as incommensurate output phoneme/phone inventories. To sum up, the applied assessment strategy can be described as subjective testing, but it appeared to be sufficient for our purposes.

Although the resulting rule-based transcription was evaluated as high-quality, it turned out that an exception dictionary is highly desirable, mostly to deal with words of foreign origin that have not been adapted to Polish pronunciation. Moreover, the tables of rewriting rules presented in [2] and [3] were extended to enable a transcription that reflects two types of Polish pronunciation: Warsaw (W) and Kraków-Poznań (K-P).
The main difference between these two pronunciation varieties concerns voicing in interword assimilations. The alternative transcription was obtained by supplementary rewriting rules and by modifying some subsets of graphemes. The following is an example of such an extension and reorganisation.

The original table:

    %ć1 Table 29, part 1
    ć ; (1+M+D)K+T-{s,c} ; (1+M+D)(G+E)+DV
    X ; t^s' ; d^z'

The present modification:

    %ć1 Table 29, part 1
    ć ; (1+M+D)K9+T-{s,c} ; (M+D)(G9+E)+DV+G9 ; (1+M+D)H9
    X ; t^s' ; d^z' ; d^z'|t^s'

Capital letters (optionally followed by numbers) represent certain subsets of graphemes here.

When there is a morphological border between the elements of the two-grapheme cluster <nk>, an appropriate automatic transcription reflecting the two plausible pronunciations (/nk/ in W or /Nk/ in K-P) is much harder to obtain. In such cases, the correct transcription for W is extracted from an additional set of minimal contexts needed to recognize words containing the grapheme n before a suffix starting with -k- (||-ek-).

Additionally, GTP rules for the graphemes v, q and x were formulated and added to the set of tables, e.g.:

    %x Table 37, part 1
    x ; i ; D ; X - (i+D)
    X ; ks, ; gz ; ks

In short, the verification procedure helped to (1) remove typographical errors in the original rules, (2) formulate new rules covering the differences in standard Polish pronunciation (W and K-P), (3) add rules for additional graphemes, and (4) identify phenomena too irregular to be embedded in the rule tables. At every stage, we took into account the main principle of GTP rule formulation, i.e. striving for maximum productivity and a manageable organisation of the rule set.

3.3. Exceptions

On the basis of the error detection procedure, the authors compiled a preliminary lexicon of exceptions.
It contained (a) Polish loanwords that kept their original foreign pronunciation and spelling and (b) words of Polish spelling featuring an exceptional pronunciation of some grapheme combinations (e.g. irregular dielektryk /dielektryk/ vs. regular diecezja /djet^sezja/; marzła /marzwa/ vs. marzyła /maZywa/). The criteria for entering a word, along with its correct transcription, into the preliminary version of the dictionary were as follows:

0 Words that could not be (efficiently) transcribed using the set of GTP rules, because new rules would violate the existing ones. However, as a result of adding {v, q, x} to the original set of graphemes and formulating GTP rules for them, words such as votum, video, VAT, boutique, taxi, matrix, etc. were excluded from the initial lexicon.

1 Foreign words that were well-grounded in the Polish language. The majority of the listed words are also entries in the newest dictionary of loanwords in Polish [20]. Some words that were not covered by Sobol’s dictionary were included in the lexicon only on the basis of their usage in newspapers, e.g. e-mail vs. airmail, copywriter vs. copyright, tour (operator) vs. tournee or outsourcing vs. outsider.

2 Common nouns. Proper nouns were put on the list only if they appeared in expressions such as choroba Alzheimera ‘Alzheimer’s disease’, skala Beauforta ‘Beaufort scale’ or eau de Cologne.

Most loanwords that have retained their original spelling and pronunciation in Polish come from English, particularly those relatively new in Polish, e.g. dubbing, hardware, factoring, cartridge, pub, etc.
Although many borrowings have been adapted to Polish orthography and pronunciation rules, their original spelling had to be included in the lexicon because it functions in the language alongside the adapted one, e.g. diler – dealer, interfejs – interface, recykling – recycling, kolaż – collage, dżip – jeep, blef – bluff.

Polish vs. loan homographs are notorious “false friends” of automatic transcription. The following homographic but not homophonic pairs (English loan vs. Polish word) are classic examples: lady /lejdi/ (footnote 4) ‘woman’ vs. /lady/ pl ‘counter in a shop’, baby /bejbi/ ‘child’ vs. /baby/ pl coll ‘woman’, band /bent/ ‘musicians’ vs. /bant/ pl gen ‘gang’. Currently, such loanwords appear in the additional lexicon only as parts of longer borrowed expressions, e.g. first lady, baby-sitter, jazz band. Syntactic or semantic processing of inter-word contexts would be needed to generate the correct transcription of such mixed expressions as żelazna lady ‘iron lady’, band Jarka ‘Jarek’s music band/gen. Jarek’s gangs’ or moje baby ‘my baby/my women’. An alternative solution is to generate automatically all plausible pronunciations of a given item and then to select the best matching variant using context-dependent models.

Footnote 4: Polish pronunciation of a loanword (SAMPA representation).

Proper names in Polish that show variant pronunciation due to, e.g., the insertion of foreign expressions, as is often the case with company names and product names, were beyond the scope of the present work. Moreover, it seems that any string composed of graphemes and figures can become a proper name. Such names can be correctly transcribed only with the aid of a large data corpus, by a lexicon lookup technique (cf. [21], [22]).

The final version of the dictionary codes each entry as a minimal string of symbols sufficient for the recognition and correct GTP conversion of a given exception. On the other hand, the length of an entry must not interfere with the rule-based transcription of regular words. Therefore, some entries could not be maximally simplified, e.g. copy /kopy/ without hashes was not derived from {#hardcopy#, #copywriter#, #copyright#} because it would violate the regular transcription of the word obcopylność /opt^sopylnos’t^s’/. As mentioned above, the program takes only ‘repairs’ from the dictionary; the remaining context is generated from the rules. This solution reduces the dictionary storage requirements. The exceptions can also be understood as a kind of rewriting rules; however, they are too specialised to be included in the ‘baseform’ rule set. For example, the word yuppie, incorrectly transcribed as /juppie/ according to the main rules, gets the correct form /japi/ thanks to the dictionary entry {y u:a p:1 p i:i e:1}. Consider also the following examples of entries:

    c:k l o: u                    % clou
    d:d, e:i s:z i:aj g: n        % design
    l a u:u r k                   % laurk
    m a r:r z:z l                 % marzl
    m y s ł:w/ #                  % mysł#
    n a d:d z:z m                 % nadzm
    p o l i:i a k r               % poliakr
    w:w a:o t e r p r o:u o: f    % waterproof
    w:w, i n d s u:e r f          % windsurf
    z o m b i:i e:                % zombie

The entry { m a r:r z:z l } gives the correct transcription of such words as zamarzli, przymarzliście, odmarzliśmy, etc. The entry {m y s ł:w/ #} provides the correct GTP conversion of the nouns pomysł, zamysł, przemysł, etc. and does not violate the rule-driven conversion of the verbs przyniósł, poszedł, etc. Entries such as { n a d:d z:z m } or { p o l i:i a k r } are stored to avoid the need for pre-editing the text and inserting a separator (e.g. | in [3]) between exceptionally pronounced grapheme combinations, e.g. in compound words such as nad|zmysłowy, poli|akrylany (compare them with the regularly transcribed nadzy, folia).

4. Software

The resulting GTP module implementation for Polish, called PolPhone, is a set of portable ISO C++ (footnote 5) source code files. They can be split into two classes: the GTP module skeleton files and the data files that drive the transcription process for Polish.
The data files are based on sets of rules and dictionaries. The source code can be compiled using a C++ compiler, yielding an executable file or a dynamic library.

Footnote 5: The source code was successfully compiled with Borland C++ Builder 5.0 Update 1 and Microsoft Visual C++ 6.0 under Windows XP, and with GNU C++ 3.2 and Borland Kylix 3 under RedHat Linux 7.3.

4.1. GTP module skeleton

The term filter denotes an object that gets data from a source stream and puts data into a target stream. A few useful filter types were defined in the described GTP skeleton. A graph of filters connected by streams yields the GTP module skeleton. We use the following types of filters:

0 The identity filter copies the input stream to the output stream without any changes to the data.
1 The transcoding filter converts the codepage encoding of streams.
2 The preformatting filter brings an input text into the form expected by the transcription rules. This includes downcasing the letters, replacing sequences of blank spaces and digits with a single hash sign, and replacing punctuation marks with a double hash sign.
3 The transcribing filter consists of three transcribers.

A transcriber works on the basis of the rules defined in the transcription tables (cf. 2.1 and 4.2). It consists of two deterministic finite state transducers (FSTs) and a two-dimensional decoding matrix. For a given character in the input stream, it applies one FST to the preceding characters (the left context) and the other to the following characters (the right context). An FST produces a single integer if the context is recognized. The cells of the decoding matrix contain the characters that are placed in the output stream, which happens only if neither FST has failed. The transcriber fails if at least one of the transducers fails or if the selected cell of the decoding matrix contains null.
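The preformatting step and a single transcriber, as described above, can be sketched in Python as follows. This is a simplified illustration and not the PolPhone code: plain string matching stands in for the two finite state transducers, the set of sentence-final punctuation marks (., !, ?) is an assumption, the table `TOY_N` is a hypothetical example rather than one of the paper's tables, and copying the character when the transcriber fails is an assumed default behaviour.

```python
import re


def preformat(text):
    """Downcase; blanks and digits -> '#'; sentence-final punctuation -> '##'."""
    text = text.lower()
    # Assumption: '.', '!' and '?' are the punctuation marks of interest.
    text = re.sub(r"\s*[.!?]+\s*", "\x00", text)  # temporary placeholder
    text = re.sub(r"[\s\d]+", "#", text)
    return text.replace("\x00", "##")


def recognize(context_sets, text, left):
    """Return the index of the context set with the longest match, else None.

    A left context must end `text`; a right context must start `text`.
    This function stands in for one finite state transducer.
    """
    best_index, best_length = None, -1
    for index, contexts in enumerate(context_sets):
        for context in contexts:
            hit = text.endswith(context) if left else text.startswith(context)
            if hit and len(context) > best_length:
                best_index, best_length = index, len(context)
    return best_index


class Transcriber:
    """One grapheme, two context recognizers and a decoding matrix."""

    def __init__(self, grapheme, left_sets, right_sets, matrix):
        self.grapheme = grapheme
        self.left_sets = left_sets
        self.right_sets = right_sets
        self.matrix = matrix  # matrix[row][column] is a string or None

    def transcribe(self, text, position):
        """Return the output for text[position], or None on failure."""
        if text[position] != self.grapheme:
            return None
        row = recognize(self.left_sets, text[:position], left=True)
        column = recognize(self.right_sets, text[position + 1:], left=False)
        if row is None or column is None:
            return None
        return self.matrix[row][column]


# Hypothetical toy table: 'n' becomes 'N' before k or g, plain 'n' otherwise.
# The empty string plays the role of the zero-length (default) context.
TOY_N = Transcriber("n", [{""}], [{""}, {"k", "g"}], [["n", "N"]])


def transcribe(text, transcribers=(TOY_N,)):
    """Preformat the text, then try each transcriber at every position."""
    prepared = preformat(text)
    output = []
    for position, char in enumerate(prepared):
        result = None
        for transcriber in transcribers:
            result = transcriber.transcribe(prepared, position)
            if result is not None:
                break
        # Assumed default: copy the character when every transcriber fails.
        output.append(char if result is None else result)
    return "".join(output)


print(preformat("Ala ma kota. Kot ma Alę."))  # ala#ma#kota##kot#ma#alę##
print(transcribe("Bank."))  # baNk##
```

Each recognizer returns a single integer (a row or column index), so the lookup in the decoding matrix is a plain two-dimensional array access; the longest-match rule in `recognize` is one possible reading of the priority given to longer contexts in 2.1.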
Such a design of the transcriber is enforced by the trade-off between its size and its time efficiency. Although it is possible to build a single deterministic FST that would take over the transcriber’s role for the set of rules in the presented project, it would require an enormous amount of memory.

The transcribing filter can be defined as a chain of three transcribers: an exception transcriber, a rule transcriber and a default transcriber. They are applied to each character of the input stream in the order in which they are listed here. A subsequent transcriber is executed only if the preceding one fails.

The presented design allows the resulting GTP module to read no more characters from the input stream than is necessary to obtain the desired number of phonemes (which may require reading a few characters ahead). Conversely, it is possible to supply a given number of characters as input and read all the resulting phonemes up to the first one that would depend on characters that are not yet available. Since the linguistic knowledge incorporated in the skeleton of the GTP module is kept to a minimum, the module is language universal.

Figure. A diagram of the GTP skeleton. Blocks shown: 8-bit input stream; 16-bit input stream; ISO8859-2 to Windows 1250 transcoding filter; identity filter; Unicode to Windows 1250 transcoding filter; input selector; preformatting filter; transcribing filter (grapheme to phone); identity filter; transcribing filter (phone to phoneme); broad/narrow transcription selector; identity filter; 8-bit output stream; Windows 1250 to Unicode transcoding filter; 16-bit output stream.

4.2. GTP module data compilation

The data compilation unit converts a set of files containing the transcription tables and dictionaries into source files that supplement the GTP module skeleton. The sets of transcription tables are translated into regular expressions and decoding matrices, expressed as C++ arrays, by means of the program by Wypych [8].
The FSA6 Finite State Automaton Utilities [23] are used to build, determinize and optimize the transducers from the regular expressions. The post-processing stage yields a set of C++ source code files containing the arcs of the transducers and the decoding matrix in a form suitable for the GTP module skeleton.

5. Conclusions

The paper has presented the implementation of Steffen-Batóg and Nowakowski’s rule-driven approach to transcription in the GTP module of speech processing systems for the Polish language. In our module, the traditional set of manually derived rules is aided by a moderate-sized exception dictionary. The dictionary seems to be indispensable for the system to compensate for the weaknesses of the rules when dealing with the growing usage of loanwords in Polish. The presented implementation aimed at maximum effectiveness, i.e. at producing a computationally efficient “knowledge base” that gives a high-quality transcription at low time and memory cost in a market-ready application.

We decided to modify the SAMPA alphabet for Polish. This computer-readable alphabet seems to be a good solution, but we extended it to match our approach to the inventory of Polish phonemes and allophones.

The program was thoroughly tested, but we do not consider the automatic transcription of Polish a solved issue. Possible future refinements should concentrate on the following points: (1) the irregular transcription of proper names, which remains a major weakness of the present module, (2) dictionary extension (now approximately 500 entries), (3) the generation of the phonetic transcription of other plausible pronunciation variants resulting from, e.g., various speech tempi (here we assumed a moderate tempo; a faster one could mean a considerable reduction of consonant clusters) or less attention paid to pronunciation correctness, (4) the addition of prosodic transcription.
The third point is of great importance for automatic speech recognition purposes.

REFERENCES

[1] Steffen-Batóg, M., (1973) The problem of automatic phonemic transcription of written Polish, in: „Biuletyn Fonograficzny XIV”, Warszawa – Poznań.
[2] Steffen-Batogowa, M., (1975) Automatyzacja transkrypcji fonematycznej tekstów polskich, Warszawa: PWN.
[3] Steffen-Batóg, M. & Nowakowski, P., (1993) An algorithm for phonetic transcription of orthographic texts in Polish, in: “Studia Phonetica Posnaniensia. Vol. 3”, Steffen-Batóg, M., Awedyk, W., (Eds.), Poznań: Wydawnictwo Naukowe UAM.
[4] Daelemans, W., van den Bosch, A., (1997) Language-Independent Data-Oriented Grapheme-to-Phoneme Conversion, in: “Progress in Speech Synthesis”, New York: Springer-Verlag.
[5] Black, A. W., Lenzo, K., Pagel, V., (1998) Issues in Building General Letter to Sound Rules, in: The Third ESCA Workshop in Speech Synthesis.
[6] Warmus, M., (1973) Program na maszynę ODRA 1204 dla automatycznej transkrypcji fonematycznej tekstów języka polskiego, in: „Zastosowanie maszyn matematycznych do badań nad językiem naturalnym”, Bolc, L. (Ed.), Warszawa: Wydawnictwo Uniwersytetu Warszawskiego.
[7] Jassem, K., (1996) A phonemic transcription & syllable division rule engine. Onomastica-Copernicus Research Colloquium, Edinburgh’96.
[8] Wypych, M., (1999) Implementacja algorytmu transkrypcji fonematycznej, in: “Speech and Language Technology. Vol. 3”, Poznań: Polskie Towarzystwo Fonetyczne.
[9] Nowak, I., (1991) Automatyczna transkrypcja polszczyzny nieregionalnej (odmiana północno-wschodnia i południowo-zachodnia), Warszawa: Prace IPPT PAN, 21/1991.
[10] Dukiewicz, L., (1978) Polskie głoski nosowe, Warszawa: PWN.
[11] Dukiewicz, L., Sawicka, J., (1995) Fonetyka i fonologia, in: „Gramatyka współczesnego języka polskiego”, Wróbel, H., (Ed.), Warszawa: PWN.
[12] Jassem, W., (1973) Podstawy Fonetyki Akustycznej, Warszawa: PWN.
[13] Jassem, W., (1998) An Acoustical Linear-Predictive and Statistical Discriminant Analysis of Polish Fricatives and Affricates, in: „Speech and Language Technology”, vol. 2, Poznań: Polskie Towarzystwo Fonetyczne.
[14] Ostaszewska, D., Tambor, J., (2001) Fonetyka i fonologia współczesnego języka polskiego, Warszawa: Wydawnictwo Naukowe PWN.
[15] Karaś, M., Madejowa, M., (Eds.), (1977) Słownik wymowy polskiej, Warszawa, Kraków: PWN.
[16] Madejowa, M., (1989) Zasady współczesnej wymowy polskiej, in: „Biuletyn Audiofonologii. Vol. 2-4”, Warszawa: Polski Komitet Audiofonologii.
[17] Demenko, G., Grocholewski, S., (2001) Text independent speaker verification based on segmental and suprasegmental features, in: “Proceedings of the 5th World Multiconference on Systemics, Cybernetics and Informatics”, SCI: Orlando.
[18] The IPA homepage, http://www.arts.gla.ac.uk/IPA/ipa.html.
[19] Wells, J., The SAMPA homepage, http://www.phon.ucl.ac.uk/home/sampa/home.htm.
[20] Sobol, E., (Ed.), (2002) Słownik wyrazów obcych, Warszawa: Wydawnictwo Naukowe PWN.
[21] Łobacz, P., (1999) Problemy automatyzacji transkrypcji fonematycznej nazw własnych współczesnej polszczyzny, in: Linguam amicabilem facere. Ludvico Zabrocki in memoriam, Bańczerowski, J. and Zgółka, T. (Eds.), “Seria Językoznawcza nr 22”, Poznań: Wydawnictwo Naukowe UAM.
[22] Demuynck, K., Laureys, T., (2002) Automatic generation of phonetic transcriptions for large speech corpora, www.cnts.uia.ac.be/cnts/pdf/20021001.2492.ICSLP2002paper.pdf.
[23] van Noord, G., The FSA6 homepage, http://odur.let.rug.nl/~vannoord/Fsa/.
[24] Allen, J., Hunnicutt, M. Sh., Klatt, D., (1987) From text to speech. The MITalk system. Cambridge University Press.