Lecture 4
CS4705
Sound Systems and Text-to-Speech

Sound Systems of Language
• Phonetics
  – The sounds (phones) of the world’s languages, the phonemes they map to, and how they are produced
• Phonology
  – Rules that govern how phones are realized differently in different contexts
• Technologies:
  – Automatic Speech Recognition (ASR) systems take sounds as input and output word hypotheses
  – Text-to-Speech (TTS) systems take text as input and produce speech

Letters and Sounds
• same spelling = different sounds
  o     comb, tomb, bomb
  c     court, center, cheese
  oo    blood, food, good
  s     reason, surreal, shy
• same sound = different spellings
  [i]   sea, see, scene, receive, thief
  [s]   cereal, same, miss
  [u]   true, few, choose, lieu, do
  [ay]  prime, buy, rhyme, lie
• combination of letters = single sound
  ch    child, beach
  oo    good, foot
  th    that, bathe
  gh    laugh
• single letter = combination of sounds
  x     exit, Texas
  u     use, music
• ‘silent’ letters
  k     knife, know
  e     moose, bone
  p     psycho, pterodactyl
  gh    through

Articulators
[Figure: vocal tract diagram. Labels: teeth, lips, alveolar ridge, palate, velum, uvula, pharyngeal, larynx, vocal folds: glottis, trachea]

Articulators in action
(Sample from the Queen’s University / ATR Labs X-ray Film Database)
“Why did Ken set the soggy net on top of his deck?”

Vocal fold vibration
[UCLA Phonetics Lab demo]

Places of articulation
[Figure labels: dental, labial, alveolar, post-alveolar/palatal, velar, uvular, pharyngeal, laryngeal/glottal]
http://www.chass.utoronto.ca/~danhall/phonetics/sammy.html

Articulatory parameters for English consonants (in ARPAbet)

                              PLACE OF ARTICULATION
MANNER OF      bilabial  labio-   inter-   alveolar  palatal  velar  glottal
ARTICULATION             dental   dental
stop           p b                         t d                k g    q
fric.                    f v      th dh    s z       sh zh           h
affric.                                              ch jh
nasal          m                           n                  ng
approx         w                           l/r       y
flap                                       dx

VOICING: within each pair, the first symbol is voiceless and the second is voiced

American English vowel space
[Figure: vowel chart with axes HIGH to LOW and FRONT to BACK. Vowels shown: iy, uw, ix, ih, ux, ax, eh, ah, ae, uh, ao, aa]

Acoustic landmarks
[Figure: “Patricia and Patsy and Sally”, labeled [p][ix][t] [ih][sh] [ax][n][p] [ae] [t][s] [iy][n] [s] [ae] [l][iy]; second panel labeled [p] [ix] [t] [ih]]

Syllables
• Syllabification important for
  – pronunciation: deny/denim
  – speaking rate calculation: syllables per second
  – word recognition in ASR
• (onset) + nucleus + (coda):
  – cat
  – a
  – at
  – to
• Lexical stress: primary, secondary, tertiary
  – telephone

Phonological Rules
• Not all instances of a given phone [x] sound/look alike
• Phoneme /x/ may have many allophones
• Phonological rules map phonemes in context to allophones, e.g.
  – simple rules: /{t,d}/ --> [dx] / V’ _ V
  – FSAs, FSTs
  – declarative constraints: t: V’ _ V

Allophones of /t/
• What we would consider a single ‘sound’ can be pronounced differently depending on the phonetic context. For example, the phoneme /t/:
Figure 4.8: Jurafsky & Martin (2000), page 104.

Application: Word Pronunciation for TTS
• Pronouncing dictionaries (the: [‘dhax],[‘dhiy])
• Problems:
  – Homographs (bass/bass, wind/wind, desert/desert)
  – Abbreviations (dr., st.)
  – Numbers (2125551212)
  – Acronyms (NAACL, IDIAP)
  – Morphological variation (unrelentingly)
  – Proper names and unknown words
• rules + dictionaries/dictionaries + rules
• Hybrid model:
  – FSTs model individual word pronunciation in lexicon (e.g. reg-noun-stem entry c:k a:ae t:t)
  – FSAs model morphology (e.g. reg-noun-stem + s)
  – FSTs for pronunciation rules (e.g. s --> z)
  – special rules to model name and acronym pronunciation
  – default letter2sound rules for other words

Inventive (and sometimes useful) Approaches for Pronouncing Unknown Words
• Rhyming analogy: varoom/room, todo/dodo
• Linguistic origin: Infiniti, vingt, Perez
• Abbreviation expansion:
  – spacious living/dining rm w/frplc/dining room with fireplace
  – pls?

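Sketch: the hybrid pronunciation model listed under "Application: Word Pronunciation for TTS" can be approximated with plain dictionaries in place of real FSTs and FSAs. Everything named here is an assumption for illustration: the toy LEXICON, the VOICELESS set, the one-letter-per-sound LETTER_TO_PHONE table, and the pronounce function. Only the lookup order (lexicon entry, then stem + s with the s --> z rule, then default letter-to-sound) follows the slide.

```python
# Toy lexicon (hypothetical): spelling -> ARPAbet phones, lowercase as in the slides.
LEXICON = {
    "cat": ["k", "ae", "t"],
    "dog": ["d", "ao", "g"],
}

# Voiceless consonants from the ARPAbet consonant chart.
VOICELESS = {"p", "t", "k", "f", "th", "s", "sh", "ch", "h"}

# Crude default letter-to-sound table (an assumption, far from real English).
LETTER_TO_PHONE = {
    "a": "ae", "b": "b", "c": "k", "d": "d", "e": "eh", "f": "f",
    "g": "g", "h": "h", "i": "ih", "j": "jh", "k": "k", "l": "l",
    "m": "m", "n": "n", "o": "aa", "p": "p", "q": "k", "r": "r",
    "s": "s", "t": "t", "u": "ah", "v": "v", "w": "w", "x": "k s",
    "y": "y", "z": "z",
}


def pronounce(word):
    """Return a list of ARPAbet phones for word, trying each model in turn."""
    word = word.lower()

    # 1. Lexicon lookup (stands in for the per-word FST, e.g. c:k a:ae t:t).
    if word in LEXICON:
        return list(LEXICON[word])

    # 2. Morphology: reg-noun-stem + s, with the rule s --> z after a voiced sound.
    #    (The [ix z] plural after sibilants is left out of this sketch.)
    if word.endswith("s") and word[:-1] in LEXICON:
        stem = list(LEXICON[word[:-1]])
        suffix = "s" if stem[-1] in VOICELESS else "z"
        return stem + [suffix]

    # 3. Default letter-to-sound rules for everything else.
    phones = []
    for letter in word:
        phones.extend(LETTER_TO_PHONE.get(letter, "").split())
    return phones


if __name__ == "__main__":
    for w in ["cat", "cats", "dogs", "zap"]:
        print(w, pronounce(w))
```

Homographs, numbers, abbreviations, and names would each need their own handling before step 3; the sketch skips them.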
Summary
• Phones realize phonemes in different contexts
  – Different places and manners of articulation result in acoustic differences that can be detected by ASR systems as well as people
• Versatile FSTs can model phonological as well as morphological and spelling systems
• Many creative approaches toward pronunciation modeling for TTS
• Next time: Read Ch 5
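
Sketch: the flapping rule from the Phonological Rules slide, where /t/ or /d/ in the context V’ _ V is realized as a flap, written as a direct scan over a phone list rather than as a compiled FST. Assumptions, none of them from the slides: the flap is written with the ARPAbet symbol dx (the flap entry in the consonant chart), a stressed vowel is marked with a trailing "1", an unmarked vowel counts as unstressed, and the names VOWELS, is_vowel, is_stressed, and flap are hypothetical.

```python
# Vowel symbols taken from the American English vowel space slide.
VOWELS = {"iy", "ih", "ix", "eh", "ae", "ax", "ah", "aa", "ao", "uh", "uw", "ux"}


def is_stressed(phone):
    """True for a stressed vowel, written here with a trailing '1' (assumed convention)."""
    return phone.endswith("1") and phone[:-1] in VOWELS


def is_unstressed_vowel(phone):
    """True for a vowel that carries no stress mark."""
    return phone in VOWELS


def flap(phones):
    """Apply /{t,d}/ --> [dx] / V' _ V to a list of phones and return a new list."""
    out = list(phones)
    for i in range(1, len(phones) - 1):
        if (
            phones[i] in ("t", "d")
            and is_stressed(phones[i - 1])
            and is_unstressed_vowel(phones[i + 1])
        ):
            out[i] = "dx"
    return out


if __name__ == "__main__":
    print(flap(["s", "ih1", "t", "iy"]))   # "city": t sits between V' and V
    print(flap(["ax", "t", "ae1", "k"]))   # "attack": t precedes the stressed vowel
```

A transducer version would encode the same left and right contexts as states; the scan above only shows what the rule accepts and rewrites.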