Specifications of Building Polish Lexica for Application in ASR and TTS Systems
Agnieszka Wagner
Department of Phonetics, Institute of Linguistics, Adam Mickiewicz University in Poznań, Poland
wagner@amu.edu.pl
The goal
The goal is to create large lexica with phonetic and linguistic information for
efficient use in speech technology, especially in speech recognition.
General specifications of Polish lexica
Current standards for creating lexica for speech technology applications,
developed within the LC-STAR project (http://www.lc-star.com), are adopted.
To cover a broad range of lexical domains, three lexica should be collected:
common words (at least 50,000 entries), proper names (45,000) and special
application words (5,000). Since Polish is a highly inflected language, the
common word lexicon is extended to provide enough phonetic and linguistic
knowledge for a speech recognition system. Additionally, a small lexicon of
foreign words is added. It contains vocabulary characteristic of legal texts,
which makes the lexica more usable in the Jurisdic system designed for
dictation of legal texts and police reports.
The common words (CW) lexicon
The CW lexicon is derived from a text corpus covering five lexical domains:
sports & games, news, finance, culture & entertainment, and consumer information.
The corpus is built from electronically available text sources and media:
daily newspapers, weekly magazines, specialist journals, instruction and
operating manuals, and data from the internet, especially online magazines.
The text corpus must include at least 10 million tokens across the lexical
domains, and each domain must be represented by at least 1 million tokens.
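The two size requirements can be checked mechanically once tokens have been counted per domain. The following is a minimal sketch under that assumption; the function name and the example counts are hypothetical and are not figures from the Polish corpus.

```python
MIN_TOTAL_TOKENS = 10_000_000   # whole corpus: at least 10 million tokens
MIN_DOMAIN_TOKENS = 1_000_000   # each domain: at least 1 million tokens


def check_corpus_size(domain_tokens):
    """Return a list of violations; the list is empty if both thresholds are met."""
    problems = []
    total = sum(domain_tokens.values())
    if total < MIN_TOTAL_TOKENS:
        problems.append(f"corpus has {total} tokens, needs {MIN_TOTAL_TOKENS}")
    for domain, count in sorted(domain_tokens.items()):
        if count < MIN_DOMAIN_TOKENS:
            problems.append(
                f"domain '{domain}' has {count} tokens, needs {MIN_DOMAIN_TOKENS}"
            )
    return problems


# Hypothetical counts, for illustration only
example_counts = {
    "sports & games": 2_500_000,
    "news": 3_000_000,
    "finance": 2_000_000,
    "culture & entertainment": 2_000_000,
    "consumer information": 900_000,
}
problems = check_corpus_size(example_counts)
```

With these made-up counts the total threshold is met, but one domain falls short, so the returned list contains a single message naming that domain.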
The graph below shows the distribution of tokens and distinct tokens (in
brackets) across the five lexical domains in the Polish text corpus.
Linguistic content
The linguistic content refers to:
- phonetic representation: phonetic transcription (multiple pronunciations are
coded), syllable & word boundaries, lexical stress
- morphological & grammatical information: inflected POS (+attributes) and
invariant POS derived from a large inflectional dictionary (Lubaszewski et
al. 2001)
- lexical representation available via the lexical domains represented in the
CW and SAP lexica
Structure of the lexica
Lexica consist of a set of entry group elements. An entry group refers to a generic
spelling (word form) and is characterized by:
- one orthography,
- zero or more alternative spelling elements,
- one or more entry, compound entry or abbreviation elements.
From the CW text corpus a word list is extracted which must contain at least
50,000 entries representing the most frequent words.
To this end, the word list specific to each domain must reach a target of 96%
self coverage on the common words of the corpora used for that domain, and the
final list covering all five domains must reach a target of 96% self coverage
on the common words of the whole corpus (cf. Ziegenhein et al. 2004). The
resulting CW list contains 92,608 entries.
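The coverage-driven selection can be sketched as follows. This is an illustration only: it assumes that "self coverage" means the share of corpus tokens accounted for by the selected word forms, and that the token list has already been restricted to common words. The function name and parameters are hypothetical and do not reproduce the LC-STAR tooling.

```python
from collections import Counter


def select_by_coverage(tokens, target=0.96, min_entries=0):
    """Pick the most frequent word forms until they cover `target` of all tokens.

    Selection continues past the coverage target if fewer than `min_entries`
    word forms have been chosen. Returns (word_list, achieved_coverage).
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    selected = []
    covered = 0
    for word, freq in counts.most_common():
        # Stop once both the coverage target and the minimum size are reached
        if covered / total >= target and len(selected) >= min_entries:
            break
        selected.append(word)
        covered += freq
    return selected, covered / total


# Toy corpus, for illustration only
toy_tokens = ["a"] * 90 + ["b"] * 7 + ["c"] * 2 + ["d"]
word_list, coverage = select_by_coverage(toy_tokens, target=0.96)
```

In the toy corpus the two most frequent forms already account for 97 of 100 tokens, so the selection stops after them. For the lexicon described here, `min_entries` would play the role of the 50,000-entry minimum.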
The additional common words lexicon
The additional CW lexicon was created to ensure high lexical coverage and to
provide enough phonetic and linguistic information for a speech recognition
system.
The lexicon is derived using the same procedure as the LC-STAR-based CW
lexicon, but instead of a text corpus, a word list generated from lemmas
selected from universal and frequency dictionaries of Polish was used. After
the coverage of the two CW lists was checked and duplicates were removed, the
list generated from dictionaries contains 126,526 entries.
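Duplicate removal between the two lists can be sketched as below. The sketch assumes that two entries count as duplicates when their orthography matches exactly (case-sensitive); the source does not state the matching criterion, and the function name is hypothetical.

```python
def remove_duplicates(dictionary_words, corpus_words):
    """Keep dictionary-derived words that are not already in the corpus-derived list.

    Order of the dictionary-derived list is preserved, and repeated items
    within that list are also collapsed to their first occurrence.
    """
    seen = set(corpus_words)
    result = []
    for word in dictionary_words:
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result


# Toy lists, for illustration only
corpus_list = ["dom", "godzina", "akademik"]
dictionary_list = ["akademik", "zapalenie", "dom", "zapalenie", "płuco"]
additional_list = remove_duplicates(dictionary_list, corpus_list)
```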
The foreign words lexicon
The foreign words lexicon was created to ensure that words and phrases of
foreign origin that may not appear frequently in the CW text corpus, but occur
commonly in legal texts, are present in the final lexica.
Altogether, 1,246 entries were collected from a dictionary of foreign words and
a large corpus of legal texts.
The special application words (SAP) lexicon
The SAP lexicon ensures that entries which are unlikely to occur frequently in
large text corpora, or which were deleted from the word lists during
tokenization of the CW text corpus, are included in the final lexica.
The SAP lexicon is derived partly from the CW text corpus (e.g., abbreviations,
acronyms) and partly from other text sources and media such as thematic
dictionaries, technical documents and web portals.
The vocabulary collected in the SAP lexicon (5,177 entries in total) is
representative of eight major lexical domains divided into smaller sub-domains
(examples are given in brackets).
element: entry
  definition: specific grammatical meaning of a vocabulary entry
  features: one POS + attributes, one lemma, one phonetic transcription,
    lexical domain for SAP entries
element: compound entry
  definition: a multiple-token entry, e.g., zapalenie płuc (EN: pneumonia);
    occurs only in the proper names and SAP lexica and must be broken into
    its components
  features: one phonetic transcription, two or more entry elements
element: abbreviation
  definition: occurs only in the SAP lexicon
  features: the actual expansion with at least one fully specified entry or
    compound entry
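The cardinality constraints on entry groups and their elements can be expressed as a small data model. The sketch below is one possible in-memory representation, not the LC-STAR schema itself: all class and field names are hypothetical, and only the "one", "two or more" and "at least one" rules stated in this section are enforced.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class Entry:
    pos: str                       # one POS tag, e.g. "NOM"
    attributes: Dict[str, str]     # POS attributes (gender, number, case, ...)
    lemma: str                     # one lemma
    phonetic: str                  # one phonetic transcription
    domain: Optional[str] = None   # lexical domain, used for SAP entries only


@dataclass
class CompoundEntry:
    phonetic: str                  # one transcription for the whole compound
    components: List[Entry]        # two or more entry elements

    def __post_init__(self):
        if len(self.components) < 2:
            raise ValueError("a compound entry needs two or more entries")


@dataclass
class Abbreviation:
    expansion: str
    expanded: List[Union[Entry, CompoundEntry]]  # at least one fully specified item

    def __post_init__(self):
        if not self.expanded:
            raise ValueError("an abbreviation needs at least one entry or compound entry")


@dataclass
class EntryGroup:
    orthography: str                                          # exactly one
    elements: List[Union[Entry, CompoundEntry, Abbreviation]]  # one or more
    alt_spellings: List[str] = field(default_factory=list)    # zero or more

    def __post_init__(self):
        if not self.elements:
            raise ValueError("an entry group needs at least one element")


# Illustrative instance: the abbreviation "godz." expanding to "godzina"
godzina = Entry(
    pos="NOM",
    attributes={"Gender": "feminine", "Number": "singular", "Case": "nominative"},
    lemma="godzina",
    phonetic="g o - d^z' \"i - n a",
)
group = EntryGroup(
    orthography="godz.",
    alt_spellings=["g.", "h"],
    elements=[Abbreviation(expansion="godzina", expanded=[godzina])],
)
```

Putting the checks in `__post_init__` means an ill-formed group cannot be constructed at all, which may be convenient when converting word lists into lexicon entries.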
<ENTRY_GROUP orthography="akademik">
  <ENTRY>
    <NOM Type="common" Gender="masculine" Type="personal"
         Number="singular" Case="nominative"/>
    <LEMMA>akademik</LEMMA>
    <PHONETIC>a-ka-d"e-mik</PHONETIC>
  </ENTRY>
  <ENTRY>
    <NOM Type="common" Gender="masculine" Type="impersonal"
         Number="singular" Case="nominative"/>
    <LEMMA>akademik</LEMMA>
    <PHONETIC>a-ka-d"e-mik</PHONETIC>
  </ENTRY>
</ENTRY_GROUP>
<ENTRY_GROUP orthography="godz.">
  <ALT_SPELL orthography="g."/>
  <ALT_SPELL orthography="h"/>
  <ABB>
    <EXP expansion="godzina">
      <ENTRY>
        <NOM Gender="feminine" Number="singular" Case="nominative"/>
        <LEMMA>godzina</LEMMA>
        <PHONETIC> g o - d^z' "i - n a </PHONETIC>
      </ENTRY>
    </EXP>
  </ABB>
</ENTRY_GROUP>
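An entry group in this format can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on the "godz." example, with straight quotes and self-closed `ALT_SPELL` elements so that the input is well-formed. The function name and the shape of the returned dictionary are assumptions made for illustration. The POS element is taken to be the first child of `ENTRY` that is neither `LEMMA` nor `PHONETIC`.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<ENTRY_GROUP orthography="godz.">
  <ALT_SPELL orthography="g."/>
  <ALT_SPELL orthography="h"/>
  <ABB>
    <EXP expansion="godzina">
      <ENTRY>
        <NOM Gender="feminine" Number="singular" Case="nominative"/>
        <LEMMA>godzina</LEMMA>
        <PHONETIC> g o - d^z' "i - n a </PHONETIC>
      </ENTRY>
    </EXP>
  </ABB>
</ENTRY_GROUP>"""


def read_entry_group(xml_text):
    """Parse one ENTRY_GROUP and return its spellings and entries as plain dicts."""
    group = ET.fromstring(xml_text)
    result = {
        "orthography": group.get("orthography"),
        "alt_spellings": [a.get("orthography") for a in group.findall("ALT_SPELL")],
        "entries": [],
    }
    # iter() also finds ENTRY elements nested inside ABB/EXP
    for entry in group.iter("ENTRY"):
        pos_elem = next(
            (child for child in entry if child.tag not in ("LEMMA", "PHONETIC")),
            None,
        )
        result["entries"].append({
            "pos": pos_elem.tag if pos_elem is not None else None,
            "attributes": dict(pos_elem.attrib) if pos_elem is not None else {},
            "lemma": (entry.findtext("LEMMA") or "").strip(),
            "phonetic": (entry.findtext("PHONETIC") or "").strip(),
        })
    return result


parsed = read_entry_group(SAMPLE)
```

Note that a conforming XML parser rejects an element that repeats an attribute name, so an entry whose POS element carries two `Type` attributes would need distinct attribute names before it could be read this way.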