Implementation of Grapheme-to-Phoneme Rules
and Extended SAMPA Alphabet
in Polish Text-to-Speech Synthesis1

1 This work was partially supported by research grant 7T11C00921 from the Polish Committee for Scientific Research.
Grażyna Demenko
Institute of Linguistics, Adam Mickiewicz University
lin@amu.edu.pl
Mikołaj Wypych
Institute of Computing Science, Poznań University of Technology
wypych@amu.edu.pl
Emilia Baranowska
Institute of Linguistics, Adam Mickiewicz University
emiszal@amu.edu.pl
STRESZCZENIE
The present work concerns the implementation of phonemic and phonetic
transcription rules for Polish for the needs of speech technology (in a
module of the speech synthesis system currently being developed at Adam
Mickiewicz University and Poznań University of Technology). The report
presents a proposed modification of the SAMPA alphabet for Polish, a
procedure for verifying the existing transcription rules, and an additional
dictionary of exceptions. The operation of the individual parts of the
module is also described.
ABSTRACT
The paper concerns the automatic generation of broad and narrow
transcription of the Polish language in the module of a concatenative
TTS system under development at Adam Mickiewicz University and
Poznań University of Technology, Poland. First, we propose a modified
SAMPA extension for the Polish language. Then, the procedure used to
verify the existing Polish grapheme-to-phoneme rules is presented.
The additional dictionary of exceptions is thoroughly described. Finally,
the article focuses on a rule compiler, a rule applier and dedicated
development environment.
1. Introduction
Automatic grapheme-to-phoneme (GTP) conversion modules, which
assign phonemic/phonetic transcription to graphemic word forms of a given
language, serve many purposes in speech processing systems. Rapidly
developing automatic services which rely on automatic speech recognition
(ASR) need real-time access to knowledge about pronunciation – phonetic
transcription enables the system to select the proper acoustic models for the
evaluation of the input utterance. In text-to-speech (TTS) systems, a linguistic
transcription of the input text is required to select the appropriate units (e.g.
diphones) which are concatenated to generate the intended signal in the
final step of synthesis. Automatic speech segmentation algorithms, needed
for various linguistic purposes, operate on automatic broad/narrow
transcriptions provided along with the speech signals.
The definition of the relation between the orthographic and
transcription layer may be obtained on the basis of rules and/or a dictionary.
It happens quite frequently that both these methods are used concurrently in
one module: if a dictionary search fails, GTP rules are activated as a default
strategy. Since Polish is characterized by a relatively regular relationship
between the orthographic text and its phonemic/phonetic transcription, the
rule-based approach seems more favorable in this case.
Nevertheless, the dictionary part is still needed to cope with exceptions to
the rules.
The transcription rules and dictionaries may be formulated manually by
human experts ([1], [2], [3]) or derived automatically from a corpus of texts
and their transcriptions by machine learning algorithms, e.g. feedforward
neural networks ([4], [5]).
In 1973, Steffen-Batogowa published in [1] a set of expert GTP
conversion rules for Polish which were implemented for the first time by
Warmus [6]. Since then, many different GTP modules based on
Steffen-Batogowa's rules have been presented (cf. [7] and [8] for examples and a
survey of solutions, and [9]).
The goal of the present project was to build a practical market-oriented
GTP module starting from the rules proposed in [2] and [3].
2. Expressing GTP relation
2.1. Transcription tables
Let X* denote the set of all finite strings of elements from the set X,
including the empty string – the string of length 0. Let X+ denote the set of
all finite strings over X, excluding the empty string. Let G be a set of
elements acceptable in input strings (e.g., a set of Latin letters). Similarly, let
P be a set of elements acceptable in output strings (e.g. a set of phoneme
symbols).
A transcription table T[1..m][1..n] is a matrix of m rows and n columns
in which the cells meet the following requirements: T[1][1] ∈ G,
T[i][1] ⊆ G+ for i ∈ [2..m], T[1][j] ⊆ G+ for j ∈ [2..n], and T[i][j] ∈ P* for
(i,j) ∈ [2..m]×[2..n]. Each cell indexed as (i,j) ∈ [2..m]×[2..n] states that if T[1][1] in an input string is
preceded by a string that belongs to T[i][1] (the left context set) and is
followed by a string that belongs to T[1][j] (the right context set), then
T[1][1] should be transcribed as T[i][j].
The transcription tables proposed by Steffen-Batogowa and
Nowakowski (cf. [2], [3]) do not contain overlapping contexts. This solution
increases the entropy of the context sets. In the present solution, if several
contexts can be recognized simultaneously, priority is given to the rules
formed by the longer contexts. As a consequence, a zero-length context
can be used to specify default columns or rows. Similarly to [7], we enable
more than one variant of transcription for a grapheme in a given context
(which results in T[i][j] ⊆ P* for i ∈ [2..m], j ∈ [2..n]).
Context sets are defined by means of expressions described in [3]. For
any A, B ⊆ G*, we define the sum of the sets A and B, denoted by A+B, and the
concatenation of the sets A and B, denoted by A*B, which contains every string
of the form ab ∈ G*, where a ∈ A, b ∈ B and ab is the result of concatenating a
and b. We use the digit 1 to denote the empty string. For example,
{1,con}*{1,cat} produces {1, con, cat, concat}.
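For illustration, the sketch below (our own addition, not part of the PolPhone sources) implements these two operations on small sets of strings in C++, with the empty string "" playing the role of the symbol 1:

#include <iostream>
#include <set>
#include <string>

using StringSet = std::set<std::string>;   // a context set; "" stands for the empty string (1)

// Sum of sets: A + B is the ordinary set union.
StringSet sum(const StringSet& a, const StringSet& b) {
    StringSet r = a;
    r.insert(b.begin(), b.end());
    return r;
}

// Concatenation of sets: A * B contains every string ab with a taken from A and b from B.
StringSet concat(const StringSet& a, const StringSet& b) {
    StringSet r;
    for (const auto& x : a)
        for (const auto& y : b)
            r.insert(x + y);
    return r;
}

int main() {
    const StringSet a = {"", "con"};   // {1, con}
    const StringSet b = {"", "cat"};   // {1, cat}
    for (const auto& s : concat(a, b))
        std::cout << "{" << s << "} "; // prints {} {cat} {con} {concat}
    std::cout << "\n";
}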
The rule tables are stored in a UTF8 encoded file in an XML format or
in a plain text format. The plain text format uses a semicolon as a column
delimiter and a next-line character as a row delimiter. Elements from G and
P are represented by strings of Unicode characters. Examples of
transcription tables can be found in 3.2.
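As a further illustration of the table semantics, the following sketch (our own simplification, not the PolPhone code) replaces the context sets with single literal strings and applies the longest-context priority described above:

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// A toy transcription table for a single grapheme: left-context strings (row
// headers), right-context strings (column headers) and one output cell per
// (row, column) pair. Real tables use context *sets* built from the expressions
// of [3]; here every context is one literal string and "" is the default context.
struct ToyTable {
    char grapheme;
    std::vector<std::string> left;
    std::vector<std::string> right;
    std::vector<std::vector<std::string>> cell;   // empty cell = rule does not apply
};

// Transcribes text[pos] (assumed to hold the table's grapheme), giving priority
// to the rule with the longest matched contexts.
std::string transcribe(const ToyTable& t, const std::string& text, std::size_t pos) {
    std::string best;
    std::size_t bestLen = 0;
    bool found = false;
    for (std::size_t i = 0; i < t.left.size(); ++i) {
        const std::string& L = t.left[i];
        if (pos < L.size() || text.compare(pos - L.size(), L.size(), L) != 0) continue;
        for (std::size_t j = 0; j < t.right.size(); ++j) {
            const std::string& R = t.right[j];
            if (text.compare(pos + 1, R.size(), R) != 0) continue;
            if (t.cell[i][j].empty()) continue;
            const std::size_t len = L.size() + R.size();
            if (!found || len > bestLen) { best = t.cell[i][j]; bestLen = len; found = true; }
        }
    }
    return best;   // empty result = the table does not apply at this position
}

int main() {
    // Hypothetical fragment: n is transcribed as N before k or g, otherwise as n.
    const ToyTable t{'n', {""}, {"", "k", "g"}, {{"n", "N", "N"}}};
    std::cout << transcribe(t, "bank", 2) << "\n";   // n before k -> prints N
}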
2.2. Dictionary
An exception is a matrix E[1..n][1..3] with n rows and 3 columns,
where E[i][1] ∈ G, E[i][2] ∈ {0,1} and E[i][3] ∈ P* for i ∈ [1..n]. An exception
of the form E[1..n][1..3] means that if the string E[1..n][1] matches an
input substring then, for each i ∈ [1..n], if E[i][2] is equal to 1, E[i][1] should
be transcribed as E[i][3]. Note that there is no need to specify the
transcriptions for all the graphemes in an exception, but only
for the locations where errors occurred. The remaining graphemes are
transcribed according to the rules (see 4.1).
The dictionary is a set of exceptions. Exceptions are stored in a UTF8
encoded file in an XML format or in a plain text format. The plain text
formatted file consists of lines, each containing a single exception. The rows of
an exception are separated by spaces and saved in one of two forms:
E[i][1]:E[i][3] or E[i][1], depending on whether E[i][2] is equal to 1 or 0.
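A minimal sketch of how such an exception could be represented and applied is given below (our own simplification, not the PolPhone data structures); the rules parameter stands in for the rule transcriber that handles the graphemes whose flag E[i][2] is 0:

#include <functional>
#include <iostream>
#include <string>
#include <vector>

// One row of an exception matrix E[1..n][1..3]: the grapheme E[i][1], the flag
// E[i][2] and the replacement transcription E[i][3].
struct ExceptionRow {
    std::string grapheme;
    bool repair;               // 1 = use the transcription below, 0 = fall back to the rules
    std::string transcription;
};
using Exception = std::vector<ExceptionRow>;

// Applies an exception to the substring it has matched. Rows with repair == 0 are
// delegated to `rules`, a stand-in for the rule transcriber of section 4.1.
std::string applyException(const Exception& e,
                           const std::function<std::string(const std::string&)>& rules) {
    std::string out;
    for (const auto& row : e)
        out += row.repair ? row.transcription : rules(row.grapheme);
    return out;
}

int main() {
    // The plain-text entry "m a r:r z:z l": only r and z carry repairs, which keep the
    // real rules from reading the cluster rz as /Z/ (cf. marzła vs. marzyła in 3.3).
    const Exception marzl = {{"m", false, ""}, {"a", false, ""},
                             {"r", true, "r"}, {"z", true, "z"}, {"l", false, ""}};
    const auto rules = [](const std::string& g) { return g; };   // toy rules: identity
    std::cout << applyException(marzl, rules) << "\n";           // prints marzl
}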
3. GTP relation for Polish
3.1 Input and output character sets
Two sets of characters were precisely defined for the exact GTP
mapping for the Polish language – an input set of characters and an output
phonetic/phonemic alphabet.
The input set of symbols for Polish was defined here as a set of the
following symbols: X={a, ą, b, c, ć, d, e, ę, f, g, h, i, j, k, l, ł, m, n, ń, o, ó, p,
q, r, s, ś, t, u, v, w, x, y, z, ź, ż, #, ##}2. One hash substitutes interword
spaces in a string of orthographic symbols and two hashes denote sentence-final
punctuation marks.

2 Notice that {q, v, x} are not standard Polish graphemes; they appear only in loanwords. It seemed more economical to operate on them in the rules rather than to place the words containing them in the module's dictionary.
The modified computer-readable SAMPA alphabet, which codes the
IPA symbols into ASCII characters and seems to be a good solution for
technical applications, was adopted as the output phonetic alphabet of the
module. An inventory of 39 phonemes was employed for the broad transcription
and a set of 87 allophones was established for the narrow transcription of Polish
(cf. Table 3.1 and Table 3.2).
The authors verified the existing phonetic notations and transcription
rules for Polish on the basis of: (1) the literature describing the Polish
phonological system and Polish rules of pronunciation (cf. e.g. [10], [11],
[12], [13], [14], [15], [16]) and (2) the results of the acoustic segmentation
for a few hundred utterances produced by 50 speakers brought up in two
Polish cities (Poznań and Kraków; cf. [17]).
Due to certain assumptions concerning the Polish phonological system
and some technical reasons, the following modifications were made to the
original Polish version of SAMPA:
0 A grapheme “y” is transcribed as /y/, not as /I/. It was judged that the
symbol /I/ is less legible (its form differs from other symbols for Polish
vowels) despite being more similar to an original IPA symbol (cf. [18]).
1 Affricates are represented by two symbols joined by a ^ symbol, the
equivalent of an IPA tie bar. This linking character is essential
in a string of phonemes to differentiate affricates from consonant
clusters consisting of two phonemes, e.g. nadzieja /nad^z’eja/ 'hope'
vs. nadziemny /nadz’emny/ 'overhead'. The guidelines to SAMPA
transcription [19] suggest that “(…) it is designed to be uniquely
parsable. As with the ordinary IPA, a string of SAMPA symbols does
not require spaces between successive symbols.” Nevertheless, the lack of
a tie bar in the affricates of the Polish version of the alphabet forces a
transcriber to separate individual phonemes in a string of characters by
single spaces.
2 On the basis of present knowledge of Polish phonetics, our own
detailed acoustic analyses and experience derived from the annotation of
Polish speech databases, it was established that the graphemes ę and ą
each correspond to two separate phonemes not only when followed by
plosives or affricates, but also before fricatives and (partially) in
word-final position.
Thus, the original SAMPA symbols /e~/ and /o~/ were abandoned
before the fricative consonants /f v s z S Z x/ and in word-final
position, in favour of the biphonemic notation /ew~/, /ow~/ before these
consonants and /e/, /ow~/ word-finally. The monophonemic
notation /e~/ and /o~/ before the palatal fricatives /s’ z’/ was changed into
the two-phoneme transcriptions /ej~/ and /oj~/ respectively, although the
alternative transcription with the ‘hard’ nasal resonance /w~/ is also
acceptable.
Figure 1 shows a spectrogram of /e~/ before palatal fricatives /s’/ and
/z’/ (in kęsia, kęzia). The higher parts of the spectrogram illustrate
biphonemic realization as /ej~/. Figure 2 shows a spectrogram of /e~/
before fricatives /s/ and /z/ (in kęsa, kęza). The lower parts of the
spectrogram illustrate biphonemic realization as /ew~/.
Figure 1. A spectrogram illustrating kęsia, kęzia.
Figure 2. A spectrogram illustrating kęsa, kęza.
3 The adoption of the above-mentioned solution assumes, for practical
reasons, that /w~/ and /j~/ should be treated as individual phonemes in
Polish, although their phonological status remains unclear [12]. These
two phonemes were not taken into consideration in the original version
of Polish SAMPA.
4 The phonemes /c/ and /J/ were also added to the proposed SAMPA
extension as phonemes essential to the description of the acoustic
differences between the velar consonants /k g/ and, respectively, the palatal
consonants /c J/, as in the word pairs kat /kat/ vs. kit /cit/ or drogę /droge/
vs. drogie /droJje/.
MAPPING OF PHONETIC ALPHABETS FOR POLISH

SAMPA   ORTHOGRAPHY   TRANSCRIPTION   |   SAMPA EXT.   ORTHOGRAPHY   TRANSCRIPTION
i       pit           pit             |   i            pit           pit
I       typ           tIp             |   y            typ           typ
e       test          test            |   e            test          test
a       pat           pat             |   a            pat           pat
o       pot           pot             |   o            pot           pot
u       puk           puk             |   u            puk           puk
e~      gęś           g e~ s'         |   –            –             –
o~      wąs           v o~ s          |   –            –             –
p       pik           pik             |   p            pik           pik
b       byt           bIt             |   b            byt           byt
t       test          test            |   t            test          test
d       dym           dIm             |   d            dym           dym
k       kit           kIt             |   k            kat           kat
g       gen           gen             |   g            gen           gen
–       –             –               |   c            kiedy         cjedy
–       –             –               |   J            giełda        Jjewda
f       fan           fan             |   f            fan           fan
v       wilk          vilk            |   v            wilk          vilk
s       syk           sIk             |   s            syk           syk
z       zbir          zbir            |   z            zbir          zbir
S       szyk          SIk             |   S            szyk          Syk
Z       żyto          ZIto            |   Z            żyto          Zyto
s'      świt          s' v i t        |   s'           świt          s'fit
z'      źle           z' l e          |   z'           źle           z'le
x       hymn          xImn            |   x            hymn          xymn
ts      cyk           ts I k          |   t^s          cyk           t^syk
dz      dzwon         dz v o n        |   d^z          dzwon         d^zvon
tS      czyn          tS I n          |   t^S          czyn          t^Syn
dZ      dżem          dZ e m          |   d^Z          dżem          d^Zem
ts'     ćma           ts' m a         |   t^s'         ćma           t^s'ma
dz'     dźwig         dz' v i k       |   d^z'         dźwig         d^z'vik
m       mysz          mIS             |   m            mysz          myS
n       nasz          naS             |   n            nasz          naS
n'      koń           k o n'          |   n'           koń           kon'
N       pęk           peNk            |   N            pęk           peNk
l       luk           luk             |   l            luk           luk
r       ryk           rIk             |   r            ryk           ryk
w       łyk           wIk             |   w            łyk           wyk
j       jak           jak             |   j            jak           jak
–       –             –               |   w~           ciąża         t^s'ow~Za
–       –             –               |   j~           więź          vjej~s'
Table 3.1. Computer coding of Polish phonemes.
The IPA symbols are mapped into the ASCII characters: SAMPA and
extended-SAMPA symbols.
PHONEME   ALLOPHONES
i         i i~3
y         y y~
e         e e] e~ e]~
a         a a] a~ a]~
o         o o] o~ o]~
u         u u] u~ u]~
j         j
j~        j~
w         w w/ w,
w~        w~
l         l l/ l,
r         r r/ r,
m         m m/ m,
n         n n/ n, n–
n'        n' n'/
N         N N/ N, N,/
x         x x/ x, x,/
v         v v,
f         f f,
z         z z,
z'        z'
s         s s,
s'        s'
Z         Z Z,
S         S S,
d^z       d^z d^z,
t^s       t^s t^s,
d^z'      d^z'
t^s'      t^s'
d^Z       d^Z d^Z,
t^S       t^S t^S,
b         b b,
p         p p,
d         d d, d–
t         t t, t–
g         g
J         J
k         k
c         c

Table 3.2. A list of allophones of the Polish language
(the modified-SAMPA symbols).

3 DIACRITICS: ~ nasalisation or nasality (e~, w~); ] raised articulation (e]); / devoicing (l/); , palatalization (l,); – retracted (here: alveolarized) (t–).
3.2. Transcription rules
An editor for rule editing and testing was created. The editor transcribes
example texts by means of the defined rules and edits them according to the
predefined transcription rules and the amendments made by experts. The
following iterative verification procedure was applied to the transcription
rules:
0 Choose an orthographic text to be used in the process of rule testing.
1 Transcribe the text by means of the currently available rules.
2 Erase those substrings, both in the text and its transcription, which are
known to be translated properly on the basis of the transcribed text
corpus.
3 Correct the remaining part of the transcription. Append the resulting
correct transcription to the corpus.
4 Reformulate the rules according to point 3.
5 Test the new rules on the corpus.
Steps 3 and 4 were performed by an expert while the others were
performed automatically.
Two corpora of test material were gathered for the above purpose. One
of them included a few dozen written texts downloaded from electronic
versions of Polish nationwide newspapers and the other contained a few
thousand single words and phrases obtained from: exercises in textbooks on
phonetic transcription, dictionaries, and a set of linguistic 'traps' and
exceptions for Polish (cf. [3], [9], [14]). The input texts had not gone
through any earlier processing. The detected errors were classified according to
their causes and removed in turn.
The performance of the rules was evaluated with ‘words correct’ and
‘symbols correct’ scoring metrics. The percentage of words transcribed
accurately in a given set of texts was measured before and after any
modifications to the rules. Error analysis at the symbol level enabled us to
eliminate whole categories of errors, especially in the initial phase of
verification – mainly costly misprints in the transcription tables.
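The scoring itself is straightforward. The sketch below (our reconstruction, not the original evaluation code) computes both measures for a list of reference/automatic transcription pairs; the symbol-level figure is obtained here from a standard Levenshtein alignment, which is one plausible reading of the ‘symbols correct’ measure:

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Levenshtein distance between two transcriptions; every char is treated as one
// symbol here (multi-character symbols such as t^s would first need tokenizing).
std::size_t editDistance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t subst = prev[j - 1] + (a[i - 1] != b[j - 1] ? 1 : 0);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, subst});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

struct Scores { double wordsCorrect; double symbolsCorrect; };   // both in percent

// reference[i] is the hand-corrected transcription of the i-th word,
// generated[i] the rule-based one.
Scores score(const std::vector<std::string>& reference,
             const std::vector<std::string>& generated) {
    std::size_t okWords = 0, refSymbols = 0, errSymbols = 0;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        okWords += (reference[i] == generated[i]) ? 1 : 0;
        refSymbols += reference[i].size();
        errSymbols += editDistance(reference[i], generated[i]);
    }
    return {100.0 * okWords / reference.size(),
            100.0 * (1.0 - static_cast<double>(errSymbols) / refSymbols)};
}

int main() {
    const Scores s = score({"peNk", "luk"}, {"penk", "luk"});
    std::cout << s.wordsCorrect << "% words correct, "
              << s.symbolsCorrect << "% symbols correct\n";   // 50% and about 85.7%
}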
We also observed how transcription accuracy varies depending on the
type of texts in the test set (e.g. national news vs. international news with
many foreign words from the same Polish newspaper). We could not
compare the performance of our GTP module with accuracy figures for
previous implementations of Steffen-Batogowa’s rules, as all authors use
different evaluation techniques: different input corpora and scoring as well as
incommensurate output phoneme/phone inventories. To sum up, the applied
assessment strategy can be described as subjective testing, but it appeared to
be sufficient for our purposes. Although the resulting rule-based
transcription was evaluated as being of high quality, it appeared that an exception
dictionary is highly desirable, mostly to deal with words of foreign origin not
adapted to Polish pronunciation.
Moreover, the tables of rewriting rules presented in [2] and [3] were
extended to enable transcription which reflects two types of Polish
pronunciation: Warsaw (W) and Krakow-Poznan (K-P) ones. The main
difference between these two pronunciation varieties concerns voicing in
interword assimilations. Alternative transcription was obtained by
supplementary rewriting rules and the modification of some subsets of
graphemes. This is an example of such an extension and reorganisation:
The original table:
%ć1 Table 29, part 1
ć ; (1+M+D)K+T-{s,c} ; (1+M+D)(G+E)+DV
X ; t^s' ; d^z'

Present modification:
%ć1 Table 29, part 1
ć ; (1+M+D)K9+T-{s,c} ; (M+D)(G9+E)+DV+G9 ; (1+M+D)H9
X ; t^s' ; d^z' ; d^z'|t^s'
Capital letters (also followed by numbers) represent here certain subsets of
graphemes.
When there is a morphological border between the elements of a two-grapheme
cluster <nk>, appropriate automatic transcription reflecting the two
plausible pronunciations (/nk/ in W or /Nk/ in K-P) is much harder to obtain.
In such cases, the correct transcription for W is extracted from an additional
set of minimal contexts needed to recognize words containing a grapheme n
before a suffix starting with -k- (||-ek-).
Additionally, the GTP rules for graphemes v, q and x were formulated
and added to the set of tables, e.g.:
%x Table 37, part 1
x ; i ; D ; X - (i+D)
X ; ks, ; gz ; ks
In short, the verification procedure helped to (1) remove typographical
errors in the original rules, (2) formulate new rules covering the differences in
standard Polish pronunciation (W and K-P), (3) add rules for additional graphemes,
and (4) determine phenomena too irregular to be embedded in the rule tables. At
every stage, we took into account the main principle of GTP rule
formulation, i.e. striving for maximum productivity and a manageable
organisation of the rule set.
3.3. Exceptions
On the basis of the error detection procedure, the authors compiled a
preliminary lexicon of exceptions. It contained (a) Polish loanwords that
kept their original foreign pronunciation and spelling, and (b) words with
Polish spelling featuring exceptional pronunciation of some grapheme
combinations (e.g. irregular dielektryk /dielektryk/ vs. regular diecezja
/djet^sezja/; marzła /marzwa/ vs. marzyła /maZywa/). The criteria for
entering a word along with its correct transcription into the preliminary
version of the dictionary were as follows:
0 Words that could not be (efficiently) transcribed using the set of GTP
rules, where new rules would violate the existing ones. However, as a result
of adding {v, q, x} to the original set of graphemes and formulating
GTP rules for them, words such as votum, video, VAT, boutique,
taxi, matrix, etc. were excluded from the initial lexicon.
1 Foreign words that were well-grounded in the Polish language. The
majority of the listed words are also entries in the newest dictionary
of loanwords in Polish [20]. Some words that were not covered by
Sobol’s dictionary were included in the lexicon only on the basis of
their usage in newspapers, e.g., e-mail vs. airmail, copywriter vs.
copyright, tour (operator) vs. tournee or outsourcing vs. outsider.
2 Common nouns. Proper nouns were put in the list only if they appeared
in expressions such as choroba Alzheimera ‘Alzheimer’s disease’, skala
Beauforta ‘Beaufort scale’ or eau de Cologne.
Most loanwords which retained their original spelling and
pronunciation in Polish come from English, particularly those relatively new
in Polish, e.g., dubbing, hardware, factoring, cartridge, pub, etc. Although
many borrowings have adapted to Polish orthography and pronunciation rules,
their original spelling had to be included in the lexicon because it
functions in our language alongside the adapted form, e.g., diler – dealer,
interfejs – interface, recykling – recycling, kolaż – collage, dżip – jeep, blef – bluff.
Polish vs. loan homographs are notorious “false friends” of automatic
transcription. The following homographic but not homophonic pairs are
classic examples:
Eng. lady /lejdi/4 'woman'     –   Pol. /lady/ pl 'counter in a shop',
Eng. baby /bejbi/ 'child'      –   Pol. /baby/ pl coll 'woman',
Eng. band /bent/ 'musicians'   –   Pol. /bant/ pl gen 'gang'.

4 Polish pronunciation of a loanword (SAMPA representation).
Currently, such loan words appear in the additional lexicon only as
parts of longer borrowed expressions, e.g. first lady, baby-sitter, jazz band.
Syntactic or semantic processing of inter-word contexts would be needed to
generate correct transcription of such mixed expressions as żelazna lady
‘iron lady’, band Jarka ‘Jarek’s music band/gen. Jarek’s gangs’ or moje
baby ‘my baby/my women’. An alternative solution is to automatically
generate all plausible pronunciations of a given item and then to select the
best matching variant using context dependent models.
Proper names in Polish showing variant pronunciation, due e.g. to the
insertion of foreign expressions, as is often the case with company names
and product names, were beyond the scope of the present work. Moreover, it
seems that any string composed of graphemes and figures can become a
proper name. Such names can be correctly transcribed only with the aid of a large
data corpus and a lexicon lookup technique (cf. [21] & [22]).
The dictionary’s final version codes each entry as a minimal string of
symbols sufficient for the recognition and correct GTP conversion of a given
exception. On the other hand, the entry’s length must not interfere with the
rule-based transcription of regular words. Therefore, some entries could not
be maximally simplified, e.g. copy /kopy/ without hashes was not derived
from {#hardcopy#, #copywriter# & #copyright#} because it would violate
regular transcription of a word obcopylność /opt^sopylnos’t^s’/.
As mentioned above, the program takes only ‘repairs’ from the
dictionary; the remaining context is generated from the rules. This solution
reduces dictionary storage requirements. The exceptions can also be understood
as a kind of rewriting rules; however, they are too specialised to be
included into the ‘baseform’ rule set. For example, the word yuppie,
incorrectly transcribed as /juppie/ according to the main rules, gets correct
form /japi/ thanks to a dictionary entry {y u:a p:1 p i:i e:1}. Consider also the
following examples of entries:
c:k l o: u                     % clou
d:d, e:i s:z i:aj g: n         % design
l a u:u r k                    % laurk
m a r:r z:z l                  % marzl
m y s ł:w/ #                   % mysł#
n a d:d z:z m                  % nadzm
p o l i:i a k r                % poliakr
w:w a:o t e r p r o:u o: f     % waterproof
w:w, i n d s u:e r f           % windsurf
z o m b i:i e:                 % zombie
The entry { m a r:r z:z l } gives correct transcription of such words as
zamarzli, przymarzliście, odmarzliśmy, etc. The entry {m y s ł:w/ #}
provides correct GTP conversion of nouns pomysł, zamysł, przemysł, etc.
and does not violate rule-driven conversion of verbs przyniósł, poszedł, etc.
The entries such as { n a d:d z:z m } or { p o l i:i a k r } are stored to avoid
the necessity of text pre-editing and the insertion of a separator (e.g. | in [3])
between exceptionally pronounced grapheme combinations, e.g. in
compound words such as nad|zmysłowy, poli|akrylany (compare them with
the regularly transcribed nadzy, folia).
4. Software
The resulting GTP module implementation for Polish, called PolPhone,
is a set of portable ISO C++5 source code files. They can be split into two
classes: the GTP module skeleton files and the data files that drive the
transcription process for Polish. The data files are based on sets of rules and
dictionaries. The source code can be compiled using a C++ compiler; as a
result, an executable file or a dynamic library is obtained.

5 The source code was successfully compiled with Borland C++ Builder 5.0 Update 1 and Microsoft Visual C++ 6.0 under the Windows XP system, and with GNU C++ 3.2 and Borland Kylix 3 under RedHat Linux 7.3.
4.1. GTP module skeleton
The term filter denotes an object that gets data from a source stream
and puts data into a target stream. A few useful filter types were defined in
the described GTP skeleton. A graph of filters connected by streams yields
the GTP module skeleton.
We use the following types of filters:
0 identity filter copies the input stream to the output stream without any
changes to the data,
1 transcoding filter converts the codepage encoding of streams,
2 preformatting filter prepares an input text into the form expected by the
transcription rules. It includes downcasing the letters, replacing
sequences of blank spaces and digits with a single hash sign, and replacing
punctuation marks with a double hash sign,
3 transcribing filter consists of three transcribers.
The transcriber works on the basis of rules defined in the transcription
tables (cf. 2.1 and 4.2). The transcriber consists of two deterministic finite
state transducers (FSTs) and a 2-dimensional decoding matrix. For a given
character in the input stream, it applies one FST to the preceding characters (its
left context) and the other one to the following characters (its right context).
Each FST produces a single integer if the context is recognized. The decoding
matrix cells contain characters that are placed in the output stream only if
neither FST has failed. The transcriber fails if at least one of the
transducers fails or if the determined decoding matrix cell contains null. Such
a design of the transcriber is enforced by the trade-off between its size and
its time efficiency: although building a single deterministic FST that would
take over the transcriber’s role for the set of rules in the presented project is
possible, it would require an enormous amount of memory. The transcribing
filter can be defined as a chain of three transcribers: an exception
transcriber, a rule transcriber and a default transcriber. They are applied to
each character of the input stream in that order, and the next transcriber is
executed only if the preceding one fails.
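A minimal sketch of this fallback chain (our own illustration; the actual PolPhone interfaces are not given in the paper):

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// A transcriber tries to handle the character at `pos`; on success it appends phoneme
// symbols to `out` and returns true, otherwise it reports failure.
struct Transcriber {
    virtual ~Transcriber() = default;
    virtual bool transcribe(const std::string& in, std::size_t pos, std::string& out) const = 0;
};

// The default transcriber never fails; here it simply copies the character through.
struct DefaultTranscriber : Transcriber {
    bool transcribe(const std::string& in, std::size_t pos, std::string& out) const override {
        out += in[pos];
        return true;
    }
};

// The transcribing filter: an exception transcriber, a rule transcriber and a default
// transcriber tried in that order; the next one runs only if the previous one fails.
struct TranscribingFilter {
    std::vector<const Transcriber*> chain;
    std::string run(const std::string& in) const {
        std::string out;
        for (std::size_t pos = 0; pos < in.size(); ++pos)
            for (const Transcriber* t : chain)
                if (t->transcribe(in, pos, out)) break;
        return out;
    }
};

int main() {
    const DefaultTranscriber fallback;
    const TranscribingFilter filter{{&fallback}};   // exception and rule transcribers omitted
    std::cout << filter.run("#kat#") << "\n";       // prints #kat#
}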
The presented design allows the resulting GTP module to get no more
characters from the input stream than is necessary to obtain the desired
number of phonemes (which may require reading a few characters ahead).
Conversely, it is possible to put a given number of characters on the input
and read all the resulting phonemes up to the first one that would depend on
characters that are not yet available.

Since the linguistic knowledge incorporated in the skeleton of the GTP
module is kept at a minimum level, the module is language-universal.
Figure. A diagram of the GTP skeleton. The 8-bit and 16-bit input streams pass through transcoding or identity filters and an input selector into the preformatting filter and the grapheme-to-phone transcribing filter; a broad/narrow transcription selector then routes the result through a phone-to-phoneme transcribing filter or an identity filter, and transcoding or identity filters produce the 8-bit and 16-bit output streams.
4.2. GTP module data compilation
The data compilation unit converts a set of files containing the
transcription tables and dictionaries into source files supplementing the GTP
module skeleton.
The sets of transcription tables are translated into regular expressions
and decoding matrices expressed as C++ arrays by means of the program by
Wypych [8]. The FSA6 Finite State Automaton Utilities [23] are used to
build, determinize and optimize the transducers from the regular expressions.
The post-processing stage yields a set of C++ source code files containing the
arcs of the transducers and the decoding matrices, suitable for the GTP module
skeleton.
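Purely for illustration, a generated data file might look roughly as follows; the names and the exact layout are our assumption, not the actual output of the compiler:

#include <iostream>

// Hypothetical excerpt of a generated data file.
struct Arc { int from; char input; int to; };            // one transition of a transducer

const Arc kLeftArcs[]  = { {0, '#', 1}, {0, 'a', 2} };   // left-context transducer (toy values)
const Arc kRightArcs[] = { {0, 'k', 1}, {0, 'g', 1} };   // right-context transducer (toy values)

const int kLeftRow[]     = { -1, 1, 2 };   // accepting state -> decoding-matrix row (-1 = reject)
const int kRightColumn[] = { -1, 1 };      // accepting state -> decoding-matrix column

// Decoding matrix: a null cell makes the transcriber fail for that context pair.
const char* const kDecode[3][2] = {
    { nullptr, nullptr },
    { nullptr, "N" },
    { nullptr, "N" },
};

int main() {
    // Suppose both transducers accepted and mapped the contexts to row 1, column 1.
    std::cout << kDecode[kLeftRow[1]][kRightColumn[1]] << "\n";   // prints N
}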
5. Conclusions
The paper has presented the implementation of Steffen-Batóg and
Nowakowski’s rule-driven approach to transcription in the GTP module of
speech processing systems for the Polish language. In our module, the
traditional set of manually-derived rules is aided by a moderate-sized
exception dictionary. The dictionary seems to be indispensable for the
system to eliminate the weaknesses of the rules when dealing with a growing
usage of loanwords in Polish. The presented implementation aimed at
achieving maximum effectiveness, i.e. producing a computationally efficient
“knowledge base” giving high quality transcription at low time and memory
consumption in a market-ready application. We decided to modify the
SAMPA alphabet for Polish. This computer-readable alphabet seems to be a
good solution but we extended it to meet our approach to the inventory of
Polish phonemes and allophones. The program was thoroughly tested, but we
do not consider the automatic transcription of Polish a solved issue.
Possible future refinements should concentrate on the following
points: (1) the irregular transcription of proper names, which remains a
major weakness of the present module, (2) dictionary extension (now
approximately 500 entries), (3) generation of the phonetic transcription of
other plausible pronunciation variants resulting from, e.g., various speech
tempi (here we assumed a moderate tempo; a faster one could mean
considerable reduction of consonant clusters) or less attention paid to
pronunciation correctness, (4) the addition of prosodic transcription. The
third point is of great importance for automatic speech recognition purposes.
REFERENCES
[1]
Steffen-Batóg, M., (1973) The problem of automatic phonemic
transcription of written Polish, in: „Biuletyn Fonograficzny XIV”,
Warszawa – Poznań.
[2]
Steffen-Batogowa, M., (1975) Automatyzacja transkrypcji
fonematycznej tekstów polskich, Warszawa: PWN.
[3]
Steffen-Batóg, M. & Nowakowski, P. (1993) An algorithm for
phonetic transcription of orthographic texts in Polish, in: “Studia
Phonetica Posnaniensia. Vol. 3”, Steffen-Batóg, M., Awedyk, W.,
(Eds.), Poznań: Wydawnictwo Naukowe UAM.
[4]
Daelemans, W., van den Bosch, A., (1997) Language-Independent
Data-Oriented Grapheme-to-Phoneme Conversion, in:
“Progress in Speech Synthesis”, New York: Springer-Verlag.
[5]
Black, A. W., Lenzo, K., Pagel, V., (1998) Issues in Building General
Letter to Sound Rules, in: The Third ESCA Workshop on Speech
Synthesis.
[6]
Warmus, M., (1973) Program na maszynę ODRA 1204 dla
automatycznej transkrypcji fonematycznej tekstów języka polskiego,
in: „Zastosowanie maszyn matematycznych do badań nad językiem
naturalnym”, Bolc, L. (Ed.), Warszawa: Wydawnictwo Uniwersytetu
Warszawskiego.
[7]
Jassem, K. (1996) A phonemic transcription & syllable division rule
engine. Onomastica-Copernicus Research Colloquium, Edinburgh’96.
[8]
Wypych, M., (1999) Implementacja algorytmu transkrypcji
fonematycznej, in: “Speech and Language Technology. Vol. 3”,
Poznań: Polskie Towarzystwo Fonetyczne.
[9]
Nowak, I., (1991) Automatyczna transkrypcja polszczyzny
nieregionalnej (odmiana północno-wschodnia i południowo-zachodnia),
Warszawa: Prace IPPT PAN, 21/1991.
[10] Dukiewicz, L., (1978) Polskie głoski nosowe, Warszawa: PWN.
[11] Dukiewicz, L., Sawicka, J., (1995) Fonetyka i fonologia, in:
„Gramatyka współczesnego języka polskiego”, Wróbel, H. (Ed.),
Warszawa: PWN.
[12] Jassem, W., (1973) Podstawy Fonetyki Akustycznej, Warszawa: PWN.
[13] Jassem, W., (1998) An Acoustical Linear-Predictive and Statistical
Discriminant Analysis of Polish Fricatives and Affricates, in: „Speech
and Language Technology”, vol.2, Poznań: Polskie Towarzystwo
Fonetyczne.
[14] Ostaszewska, D., Tambor, J., (2001) Fonetyka i fonologia
współczesnego języka polskiego, Warszawa: Wydawnictwo Naukowe
PWN.
[15] Karaś, M., Madejowa, M., (Eds.), (1977) Słownik wymowy polskiej,
Warszawa, Kraków: PWN.
[16] Madejowa, M., (1989) Zasady współczesnej wymowy polskiej, in:
„Biuletyn Audiofonologii. Vol. 2-4”, Warszawa: Polski Komitet
Audiofonologii.
[17] Demenko, G., Grocholewski, S., (2001) Text independent speaker
verification based on segmental and suprasegmental features, in:
“Proceedings of the 5th World Multiconference on Systemics,
Cybernetics and Informatics”, SCI: Orlando.
[18] The IPA homepage, http://www.arts.gla.ac.uk/IPA/ipa.html.
[19] Wells, J., The SAMPA homepage.
http://www.phon.ucl.ac.uk/home/sampa/home.htm.
[20] Sobol, E., (Ed.), (2002) Słownik wyrazów obcych, Warszawa:
Wydawnictwo Naukowe PWN.
[21] Łobacz, P., (1999) Problemy automatyzacji transkrypcji
fonematycznej nazw własnych współczesnej polszczyzny, in: Linguam
amicabilem facere. Ludvico Zabrocki in memoriam, Bańczerowski, J.
and Zgółka, T. (Eds.), “Seria Językoznawcza nr 22”, Poznań:
Wydawnictwo Naukowe UAM.
[22] Demuynck, K., Laureys, T., (2002) Automatic generation of phonetic
transcriptions for large speech corpora,
www.cnts.uia.ac.be/cnts/pdf/20021001.2492.ICSLP2002paper.pdf.
[23] van Noord, G., The FSA6 homepage.
http://odur.let.rug.nl/~vannoord/Fsa/
[24] Allen, J., Hunnicutt, M. Sh., Klatt, D., (1987) From text to speech.
The MITalk system. Cambridge University Press.