Lecture 4 CS4705 Sound Systems and Text-to-Speech

Sound Systems of Language
• Phonetics
– The sounds (phones) of the world’s languages, the
phonemes they map to, and how they are produced
• Phonology
– Rules that govern how phones are realized differently in
different contexts
• Technologies:
– Automatic Speech Recognition (ASR) systems take
sounds as input and output word hypotheses
– Text-to-Speech (TTS) systems take text as input and
produce speech
Letters and Sounds
• same spelling = different sounds
o comb, tomb, bomb
c court, center, cheese
oo blood, food, good
s reason, surreal, shy
• same sound = different spellings
[i] sea, see, scene, receive, thief
[s] cereal, same, miss
[u] true, few, choose, lieu, do
[ay] prime, buy, rhyme, lie
• combination of letters = single sound
ch child, beach
oo good, foot
th that, bathe
gh laugh
• single letter = combination of sounds
x exit, Texas
u use, music
• ‘silent’ letters
k knife, know
e moose, bone
p psycho, pterodactyl
gh through
Articulators
(Figure labels: lips, teeth, alveolar ridge, palate, velum, uvula, pharynx, larynx, vocal folds/glottis, trachea)
Articulators in action
(Sample from the Queen’s University / ATR Labs
X-ray Film Database)
“Why did Ken set the soggy net on top of his deck?”
Vocal fold vibration
[UCLA Phonetics Lab demo]
Places of articulation
(Figure labels, front to back: labial, dental, alveolar, post-alveolar/palatal, velar, uvular, pharyngeal, laryngeal/glottal)
http://www.chass.utoronto.ca/~danhall/phonetics/sammy.html
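Articulatory parameters for English consonants (in ARPAbet)
Rows: MANNER OF ARTICULATION. Columns: PLACE OF ARTICULATION. A "-" marks an empty cell.
MANNER   bilabial  labiodental  interdental  alveolar  palatal  velar  glottal
stop     p b       -            -            t d       -        k g    q
fric.    -         f v          th dh        s z       sh zh    -      h
affric.  -         -            -            -         ch jh    -      -
nasal    m         -            -            n         -        ng     -
approx   w         -            -            l/r       y        -      -
flap     -         -            -            dx        -        -      -
VOICING: in each pair, the first symbol is voiceless and the second is voiced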
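The chart describes each consonant by three parameters: place, manner, and voicing. As a minimal sketch (not part of the original slides), the same information can be held in a small lookup table. The names `Consonant`, `CONSONANTS`, and `differing_features` are hypothetical, and only the stops, fricatives, affricates, and nasals from the chart are included.

```python
from typing import NamedTuple


class Consonant(NamedTuple):
    place: str
    manner: str
    voiced: bool


# ARPAbet symbol -> articulatory parameters (illustrative subset of the chart).
CONSONANTS = {
    "p": Consonant("bilabial", "stop", False),
    "b": Consonant("bilabial", "stop", True),
    "t": Consonant("alveolar", "stop", False),
    "d": Consonant("alveolar", "stop", True),
    "k": Consonant("velar", "stop", False),
    "g": Consonant("velar", "stop", True),
    "f": Consonant("labiodental", "fricative", False),
    "v": Consonant("labiodental", "fricative", True),
    "th": Consonant("interdental", "fricative", False),
    "dh": Consonant("interdental", "fricative", True),
    "s": Consonant("alveolar", "fricative", False),
    "z": Consonant("alveolar", "fricative", True),
    "sh": Consonant("palatal", "fricative", False),
    "zh": Consonant("palatal", "fricative", True),
    "ch": Consonant("palatal", "affricate", False),
    "jh": Consonant("palatal", "affricate", True),
    "m": Consonant("bilabial", "nasal", True),
    "n": Consonant("alveolar", "nasal", True),
    "ng": Consonant("velar", "nasal", True),
}


def differing_features(a, b):
    """Return the parameter names on which two consonants differ."""
    first, second = CONSONANTS[a], CONSONANTS[b]
    return [f for f in Consonant._fields if getattr(first, f) != getattr(second, f)]


print(differing_features("p", "b"))   # ['voiced']
print(differing_features("s", "sh"))  # ['place']
print(differing_features("t", "s"))   # ['manner']
```

Pairs such as p/b that differ in a single parameter are the ones an ASR system has to separate on a single acoustic cue.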
American English vowel space
(Vowel chart: height runs from HIGH to LOW, backness from FRONT to BACK. Vowels plotted, in ARPAbet: iy, ih, ix, ux, uw, uh, eh, ax, ah, ao, ae, aa)
Acoustic landmarks
[p][ix][t] [ih][sh] [ax][n][p] [ae] [t][s] [iy][n] [s] [ae] [l][iy]
“Patricia and Patsy and Sally”
(Figure segment labels: [p] [ix] [t] [ih])
Syllables
• Syllabification important for
– pronunciation: deny/denim
– speaking rate calculation: syllables per second
– word recognition in ASR
• (onset) + nucleus + (coda):
– cat
– a
– at
– to
• Lexical stress: primary, secondary, tertiary
– telephone
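The (onset) + nucleus + (coda) template and the syllables-per-second rate can be sketched as below. This is an illustration only: `split_syllable` and `speaking_rate` are hypothetical names, the vowel set is the ARPAbet inventory from the vowel-space slide, and the transcriptions of cat, a, at, and to are assumed.

```python
# ARPAbet vowels from the vowel-space slide; any of them can be a nucleus.
VOWELS = {"iy", "ih", "ix", "ux", "uw", "uh", "eh", "ax", "ah", "ao", "ae", "aa"}


def split_syllable(phones):
    """Split one syllable (a list of ARPAbet phones) into (onset, nucleus, coda)."""
    vowel_positions = [i for i, p in enumerate(phones) if p in VOWELS]
    if len(vowel_positions) != 1:
        raise ValueError("expected exactly one vowel nucleus")
    i = vowel_positions[0]
    return phones[:i], phones[i], phones[i + 1:]


def speaking_rate(num_syllables, seconds):
    """Speaking rate in syllables per second."""
    return num_syllables / seconds


# Assumed transcriptions of the four slide examples.
print(split_syllable(["k", "ae", "t"]))  # (['k'], 'ae', ['t'])  cat
print(split_syllable(["ax"]))            # ([], 'ax', [])        a
print(split_syllable(["ae", "t"]))       # ([], 'ae', ['t'])     at
print(split_syllable(["t", "uw"]))       # (['t'], 'uw', [])     to
print(speaking_rate(12, 3.0))            # 4.0
```

The four examples cover every combination of optional onset and optional coda around an obligatory nucleus.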
Phonological Rules
• Not all instances of a given phone [x] sound/look
alike
• Phoneme /x/ may have many allophones
• Phonological rules map phonemes in context to
allophones, e.g.
– simple rules: /{t,d}/ --> [dx] / V’ _ V
– FSA’s, FST’s
– declarative constraints: t:   V’ _ V
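A context-dependent rule of this kind can be sketched as a single pass over a phone sequence. The sketch assumes the rule is the familiar flapping rule (/t/ or /d/ becomes the flap, written dx in ARPAbet, between a stressed vowel and another vowel). The trailing apostrophe as a stress mark, the function names, and the example transcriptions are all hypothetical conventions chosen for illustration.

```python
VOWELS = {"iy", "ih", "ix", "ux", "uw", "uh", "eh", "ax", "ah", "ao", "ae", "aa"}


def is_vowel(phone):
    # A trailing apostrophe marks stress in this sketch, e.g. "ih'".
    return phone.rstrip("'") in VOWELS


def is_stressed_vowel(phone):
    return phone.endswith("'") and is_vowel(phone)


def apply_flapping(phones):
    """Rewrite t or d as the flap dx when it sits between V' and V."""
    out = list(phones)
    for i in range(1, len(phones) - 1):
        if (phones[i] in ("t", "d")
                and is_stressed_vowel(phones[i - 1])
                and is_vowel(phones[i + 1])):
            out[i] = "dx"
    return out


# "city": the t follows a stressed vowel and precedes a vowel, so it flaps.
print(apply_flapping(["s", "ih'", "t", "iy"]))   # ['s', "ih'", 'dx', 'iy']
# "attack": the vowel before t is unstressed, so the rule does not apply.
print(apply_flapping(["ax", "t", "ae'", "k"]))   # ['ax', 't', "ae'", 'k']
```

Contexts are always tested against the input sequence, so one application of the rule cannot feed another; an FST encoding of the same rule would make that choice explicit.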
Allophones of /t/
• What we would consider a single ‘sound’ can be
pronounced differently depending on the phonetic
context. For example, the phoneme /t/:
Figure 4.8: Jurafsky & Martin (2000), page 104.
Application: Word Pronunciation for TTS
• Pronouncing dictionaries (the: [‘dhax],[‘dhiy])
• Problems:
– Homographs (bass/bass, wind/wind, desert/desert)
– Abbreviation (dr., st.)
– Numbers (2125551212)
– Acronyms (NAACL, IDIAP)
– Morphological variation (unrelentingly)
– Proper names and unknown words
• rules + dictionaries/dictionaries + rules
• Hybrid model:
– FSTs model individual word pronunciation in lexicon
(e.g. reg-noun-stem entry c:k a:ae t:t)
– FSAs model morphology (e.g. reg-noun-stem + s)
– FSTs for pronunciation rules (e.g. s--> z)
– special rules to model name and acronym pronunciation
– default letter2sound rules for other words
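The layering in the hybrid model (lexicon first, then morphology with a pronunciation rule, then default letter-to-sound rules) can be sketched with plain dictionaries standing in for the FSTs and FSAs. Everything here is an assumption made for illustration: the tiny `LEXICON`, the one-letter-at-a-time `LETTER_TO_SOUND` table, and the simplified plural rule, which ignores stems ending in sibilants.

```python
# Toy lexicon: word -> ARPAbet phones (illustrative entries only).
LEXICON = {
    "cat": ["k", "ae", "t"],
    "dog": ["d", "ao", "g"],
    "the": ["dh", "ax"],
}

# Voiceless consonants; after any other final phone, plural s surfaces as z.
VOICELESS = {"p", "t", "k", "f", "th", "s", "sh", "ch"}

# Crude default letter-to-sound table; x maps to two phones.
LETTER_TO_SOUND = {
    letter: phones.split()
    for letter, phones in [
        ("a", "ae"), ("b", "b"), ("c", "k"), ("d", "d"), ("e", "eh"),
        ("f", "f"), ("g", "g"), ("h", "h"), ("i", "ih"), ("j", "jh"),
        ("k", "k"), ("l", "l"), ("m", "m"), ("n", "n"), ("o", "aa"),
        ("p", "p"), ("q", "k"), ("r", "r"), ("s", "s"), ("t", "t"),
        ("u", "ah"), ("v", "v"), ("w", "w"), ("x", "k s"), ("y", "y"),
        ("z", "z"),
    ]
}


def pronounce(word):
    """Look up a word, then try stem + s, then fall back to letter rules."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])
    if word.endswith("s") and word[:-1] in LEXICON:
        stem = LEXICON[word[:-1]]
        suffix = "s" if stem[-1] in VOICELESS else "z"  # the s --> z rule
        return stem + [suffix]
    return [p for ch in word for p in LETTER_TO_SOUND.get(ch, [])]


print(pronounce("cats"))  # ['k', 'ae', 't', 's']
print(pronounce("dogs"))  # ['d', 'ao', 'g', 'z']
print(pronounce("fax"))   # ['f', 'ae', 'k', 's']
```

The fallback is deliberately naive; it shows where default rules sit in the cascade rather than how good letter-to-sound rules should be written.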
Inventive (and sometimes useful) Approaches
for Pronouncing Unknown Words
• Rhyming analogy: varoom/room, todo/dodo
• Linguistic origin: Infiniti, vingt, Perez
• Abbreviation expansion:
– spacious living/dining rm w/frplc --> dining room with fireplace
– pls?
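Abbreviation expansion for the classified-ad example can be sketched as a table lookup after a small tokenization step. The table `ABBREVIATIONS` and the function `expand` are hypothetical. Ambiguous forms such as pls are deliberately left out of the table, so they pass through unchanged.

```python
# Hypothetical expansion table for the classified-ad example.
ABBREVIATIONS = {"rm": "room", "w/": "with", "frplc": "fireplace"}


def expand(text):
    """Expand known abbreviations; unknown tokens are returned unchanged."""
    tokens = []
    for tok in text.split():
        # Split a fused prefix such as "w/frplc" into "w/" and "frplc".
        if tok.startswith("w/") and len(tok) > 2:
            tokens.extend(["w/", tok[2:]])
        else:
            tokens.append(tok)
    return " ".join(ABBREVIATIONS.get(t.lower(), t) for t in tokens)


print(expand("spacious living/dining rm w/frplc"))
# spacious living/dining room with fireplace
print(expand("pls"))
# pls
```

A table like this only works within one text genre: rm is room in a real-estate ad but could mean something else elsewhere, which is why pls gets a question mark on the slide.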
Summary
• Phones realize phonemes in different contexts
– Different places and manners of articulation result in
acoustic differences that can be detected by ASR
systems as well as people
• Versatile FSTs can model phonological as well as
morphological and spelling systems
• Many creative approaches toward pronunciation
modeling for TTS
• Next time: Read Ch 5