Specifications of Building Polish Lexica for Application in ASR and TTS Systems

Agnieszka Wagner
Department of Phonetics, Institute of Linguistics, Adam Mickiewicz University in Poznań, Poland
wagner@amu.edu.pl

The goal

The goal is to create large lexica with phonetic and linguistic information for efficient use in speech technology, especially in speech recognition.

General specifications of Polish lexica

The current standards for creating lexica for speech technology applications, developed within the LC-STAR project (http://www.lc-star.com), are adopted. To cover a broad range of lexical domains, three lexica should be collected: common words (at least 50,000 entries), proper names (45,000) and special application words (5,000). Because Polish is a highly inflected language, it is proposed to extend the common word lexicon so that it provides enough phonetic and linguistic knowledge for a speech recognition system. Additionally, a small lexicon of foreign words is added. It contains vocabulary characteristic of legal texts, which makes the lexica more usable in the Jurisdic system designed for dictation of legal texts and police reports.

The common words (CW) lexicon

The CW lexicon is derived from a text corpus covering five lexical domains: sports & games, news, finance, culture & entertainment, and consumer information. The corpus is built from electronically available text sources and media: daily newspapers, weekly magazines, specialist journals, instruction or operating manuals, and data from the internet, especially online magazines. The text corpus must include no fewer than 10 million tokens across the lexical domains, and each domain must be represented by at least 1 million tokens. The graph below shows the distribution of tokens and distinct tokens (in brackets) in the five lexical domains in the Polish text corpus.
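The corpus size requirements stated above (at least 10 million tokens overall and at least 1 million per domain) can be checked mechanically. The following is a minimal sketch only: the function name, the constant names and the dictionary layout are assumptions made for illustration and are not part of the LC-STAR specification.

```python
# Thresholds taken from the text; the names are hypothetical.
MIN_TOTAL_TOKENS = 10_000_000
MIN_DOMAIN_TOKENS = 1_000_000

DOMAINS = (
    "sports & games",
    "news",
    "finance",
    "culture & entertainment",
    "consumer information",
)


def check_corpus_size(tokens_by_domain,
                      min_total=MIN_TOTAL_TOKENS,
                      min_domain=MIN_DOMAIN_TOKENS):
    """Return (ok, problems) for a mapping of domain name -> token count."""
    problems = []
    # Every one of the five domains must reach the per-domain minimum.
    for domain in DOMAINS:
        count = tokens_by_domain.get(domain, 0)
        if count < min_domain:
            problems.append(f"{domain}: {count} tokens, fewer than {min_domain}")
    # The corpus as a whole must reach the overall minimum.
    total = sum(tokens_by_domain.get(domain, 0) for domain in DOMAINS)
    if total < min_total:
        problems.append(f"whole corpus: {total} tokens, fewer than {min_total}")
    return (not problems, problems)
```

A corpus in which every domain passes the per-domain minimum can still fail the overall target, which is why the two conditions are tested separately.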
Linguistic content

The linguistic content refers to:
- phonetic representation: phonetic transcription (multiple pronunciations are coded), syllable & word boundaries, lexical stress
- morphological & grammatical information: inflected POS (+attributes) and invariant POS derived from a large inflectional dictionary (Lubaszewski et al. 2001)
- lexical representation available via the lexical domains represented in the CW and SAP lexica

Structure of the lexica

Lexica consist of a set of entry group elements. An entry group refers to a generic spelling (word form) and is characterized by:
- one orthography,
- zero or more alternative spelling elements,
- one or more entry, compound entry or abbreviation elements.

From the CW text corpus a word list is extracted, which must contain at least 50,000 entries representing the most frequent words. To this end, the word list specific to each domain must reach a target of 96% self coverage on the common words of the corpora used for that domain, and the final list covering all five domains must reach a target of 96% self coverage on the common words of the whole corpus (cf. Ziegenhein et al. 2004). The resulting CW list contains 92,608 entries.

The additional common words lexicon

The additional CW lexicon was created to ensure high lexical coverage and to provide enough phonetic and linguistic information for a speech recognition system. It is derived using the same procedure as the LC-STAR-based CW lexicon, but instead of a text corpus, a word list generated from lemmas selected from universal and frequency dictionaries of Polish was used. After checking the coverage of the two CW lists and removing the duplicates, the list generated from dictionaries contains 126,526 entries.
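The 96% self coverage target and the duplicate removal described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure of Ziegenhein et al. (2004): self coverage is taken here to mean the share of corpus tokens accounted for by the selected word list, the corpus is assumed to be already tokenized and restricted to common words, and both function names are hypothetical.

```python
from collections import Counter


def select_by_self_coverage(tokens, target=0.96):
    """Pick the most frequent words until they cover `target` of all tokens.

    Returns (word_list, achieved_coverage).
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    selected = []
    covered = 0
    # Descending frequency; ties are broken alphabetically so the result
    # is deterministic.
    for word, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if covered / total >= target:
            break
        selected.append(word)
        covered += freq
    return selected, covered / total


def drop_duplicates(additional_words, existing_words):
    """Keep words from the additional list that are not in the existing one."""
    existing = set(existing_words)
    seen = set()
    result = []
    for word in additional_words:
        if word not in existing and word not in seen:
            seen.add(word)
            result.append(word)
    return result
```

In this sketch the per-domain requirement would be met by calling `select_by_self_coverage` once per domain corpus and once more on the whole corpus, and `drop_duplicates` would be applied to the dictionary-derived list against the corpus-derived one.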
The foreign words lexicon

The foreign word lexicon was created to ensure that words and phrases of foreign origin that may not occur frequently in the CW text corpus, but are common in legal texts, are present in the final lexica. Altogether, 1,246 entries were collected from a dictionary of foreign words and a large corpus of legal texts.

The special application words (SAP) lexicon

The SAP lexicon ensures that entries which are unlikely to occur frequently in large text corpora, or which were deleted from the word lists during tokenization of the CW text corpus, are included in the final lexica. The SAP lexicon is derived partly from the CW text corpus (e.g., abbreviations, acronyms) and partly from other text sources and media such as thematic dictionaries, technical documents and web portals. The vocabulary collected in the SAP lexicon (5,177 entries in total) is representative of eight major lexical domains divided into smaller sub-domains (examples are given in brackets).
entry
  definition: specific grammatical meaning of a vocabulary entry
  features: one POS + attributes, one lemma, one phonetic transcription, lexical domain for SAP entries

compound entry
  definition: a multiple-token entry, e.g., zapalenie płuc (EN: pneumonia); occurs only in the proper names and SAP lexica and must be broken into its components
  features: one phonetic transcription, two or more entry elements

abbreviation
  definition: occurs only in the SAP lexicon
  features: the actual expansion with at least one fully specified entry or compound entry

<ENTRY_GROUP orthography="akademik">
  <ENTRY>
    <NOM Type="common" Gender="masculine" Type="personal" Number="singular" Case="nominative"/>
    <LEMMA>akademik</LEMMA>
    <PHONETIC>a-ka-d"e-mik</PHONETIC>
  </ENTRY>
  <ENTRY>
    <NOM Type="common" Gender="masculine" Type="impersonal" Number="singular" Case="nominative"/>
    <LEMMA>akademik</LEMMA>
    <PHONETIC>a-ka-d"e-mik</PHONETIC>
  </ENTRY>
</ENTRY_GROUP>

<ENTRY_GROUP orthography="godz.">
  <ALT_SPELL orthography="g."/>
  <ALT_SPELL orthography="h"/>
  <ABB>
    <EXP expansion="godzina">
      <ENTRY>
        <NOM Gender="feminine" Number="singular" Case="nominative"/>
        <LEMMA>godzina</LEMMA>
        <PHONETIC> g o - d^z' "i - n a </PHONETIC>
      </ENTRY>
    </EXP>
  </ABB>
</ENTRY_GROUP>
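An entry group of the kind shown above can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on the abbreviation example; it assumes the markup is well-formed XML (empty elements such as ALT_SPELL and NOM are self-closed) and that the POS element is the child of ENTRY that is neither LEMMA nor PHONETIC. The function name and the returned dictionary layout are hypothetical.

```python
import xml.etree.ElementTree as ET

# The abbreviation example from the text, written as well-formed XML.
SAMPLE = """\
<ENTRY_GROUP orthography="godz.">
  <ALT_SPELL orthography="g."/>
  <ALT_SPELL orthography="h"/>
  <ABB>
    <EXP expansion="godzina">
      <ENTRY>
        <NOM Gender="feminine" Number="singular" Case="nominative"/>
        <LEMMA>godzina</LEMMA>
        <PHONETIC> g o - d^z' "i - n a </PHONETIC>
      </ENTRY>
    </EXP>
  </ABB>
</ENTRY_GROUP>
"""


def read_entry_group(xml_text):
    """Turn one ENTRY_GROUP element into a plain dictionary."""
    group = ET.fromstring(xml_text)
    entries = []
    # iter() also finds ENTRY elements nested inside ABB/EXP.
    for entry in group.iter("ENTRY"):
        # Assumption: the POS element is the first child that is neither
        # LEMMA nor PHONETIC (NOM in the examples).
        pos = next((child for child in entry
                    if child.tag not in ("LEMMA", "PHONETIC")), None)
        entries.append({
            "pos": pos.tag if pos is not None else None,
            "attributes": dict(pos.attrib) if pos is not None else {},
            "lemma": (entry.findtext("LEMMA") or "").strip(),
            "phonetic": (entry.findtext("PHONETIC") or "").strip(),
        })
    return {
        "orthography": group.get("orthography"),
        "alt_spellings": [alt.get("orthography")
                          for alt in group.findall("ALT_SPELL")],
        "expansions": [exp.get("expansion") for exp in group.iter("EXP")],
        "entries": entries,
    }


group = read_entry_group(SAMPLE)
```

A conforming XML parser rejects an element that repeats an attribute name, so the two `Type` attributes on NOM in the first example would have to be given distinct names before that entry group could be read this way.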