presentation_v00

Mandarin Chinese Speech Recognition

Mandarin Chinese
- Tonal language (inflection matters!)
  - 1st tone: high, constant pitch (like saying "aaah")
  - 2nd tone: rising pitch ("Huh?")
  - 3rd tone: low pitch ("ugh")
  - 4th tone: high pitch with a rapid descent ("No!")
  - "5th tone": neutral, used for de-emphasized syllables
- Monosyllabic language
  - Each character represents a single base syllable and tone
  - Most words consist of 1, 2, or 4 characters
- Heavily contextual language

Mandarin Chinese and Speech Processing
- Acoustic representations of Chinese syllables
- Structural form: (consonant) + vowel + (consonant)

Mandarin Chinese and Speech Processing
- Phone sets
  - Initial/final phones [1], e.g. Shi, ge, zi = (shi + ib), (ge + e), (z + if)
  - Initial phones: unvoiced
    - 1 phone
  - Final phones: voiced (tone 1-5)
    - Can consist of multiple phones

Mandarin Chinese and Speech Processing
- Strong tonal recognition is crucial to distinguish between homonyms [3] (especially without context)
- Creating tone models is difficult
  - Discontinuities exist in the F0 contour between voiced and unvoiced regions

Prosody
- Prosody: "the rhythmic and intonational aspect of language" [2]
- Embedded Tone Modeling [4]
- Explicit Tone Modeling [4]

Tone Modeling
- Embedded Tone Modeling
  - Tonal acoustic units are joined with spectral features at each frame [4]
- Explicit Tone Modeling
  - Tone recognition is completed independently and combined after post-processing [4]

Tone Modeling
- Pitch, energy, and duration (prosody) combined with lexical and syntactic features improve tonal labeling
- Coarticulation
  - Variations in syllables can cause variations in tone:
    - Bu4 + Dui4 = Bu2 Dui4 (wrong)
    - Ni3 + Hao3 = Ni2 Hao3 (hello)

Embedded Tone Modeling: Two Stream Modeling
Ni, Liu, Xu
- Spectral stream: MFCCs (Mel-frequency cepstral coefficients)
  - Describe vocal tract information
  - Distinctive for phones (short time duration)
- Pitch/tone stream: requires smoothing
  - Describes vibrations of the vocal cords
  - Independent of spectral features
  - d/dt(pitch) (aka tone) and d2/dt2(pitch) are added
  - Embedded in an entire syllable
  - Affected by coarticulation (requires a longer time window), i.e. tone sandhi (context dependency)

Embedded Tone Modeling: Two Stream Modeling [4]
- Tonal identification features
  - F0
  - Energy
  - Duration
  - Coarticulation (continuous speech)
- Initially use a 2-stream embedded model, followed by explicit modeling during lattice rescoring (alignment?)
- Explicit tone modeling uses a maximum entropy framework [4] (discriminative model)

Explicit Tone Modeling [4]

  Feature description                                                          # of features
  Duration of current, previous, and following syllables                       3
  Previous syllable is or is not sp                                            1
  Statistical parameters of pitch and log-energy of current syllable
    (i.e. max, min, mean, etc.)                                                10
  Normalized max and mean of pitch and energy in each syllable
    in the context window                                                      12
  Location of current syllable within word                                     1
  Slope and intercept of F0 contour of current syllable, its delta,
    and delta-delta                                                            6
  Tones of preceding and following syllables                                   2

Other Work
Chang, Zhou, Di, Huang, & Lee [1]
- 3 methods
  - Powerful language model (no tone modeling)
    - CER = 7.32%
  - Embedded 2 stream
    - Tone stream + feature stream
    - CER = 6.43%
  - Embedded 1 stream
    - Developed pitch extractor
    - Pitch track added to feature vector
    - CER = 6.03%

Other Work
Qian, Soong [3]
- F0 contour smoothing
- Multi-Space Distribution (MSD)
  - Models 2 probability spaces
    - Unvoiced: discrete
    - Voiced (F0 contour): continuous

Other Work
Lamel, Gauvain, Le, Oparin, Meng [6]
- Multi-layer perceptron features
  - Combined with MFCCs and pitch features
- Compare language models
  - N-gram: back-off language model
  - Neural network language model
  - Language model adaptation

Other Work
O. Kalinli [7]
- Replace prosodic features with biologically inspired auditory attention cues
  - Cochlear filtering, inner hair cell, etc.
- Other features are extracted from the auditory spectrum
  - Intensity
  - Frequency contrast
  - Temporal contrast
  - Orientation (phase)

Other Work
Qian, Xu, Soong [8]
- Cross-lingual voice transformation
  - Phonetic mapping between languages
- Difficult for Mandarin and English
  - Very different prosodic features

References
[1] Eric Chang, Jianlai Zhou, Shuo Di, Chao Huang, & Kai-Fu Lee, "Large Vocabulary Mandarin Speech Recognition with Different Approaches in Modeling Tones"
[2] Merriam-Webster Dictionary, http://www.merriam-webster.com/
[3] Yao Qian & Frank Soong, "A Multispace Distribution (MSD) and Two Stream Tone Modeling Approach to Mandarin Speech Recognition", Science Direct, 2009
[4] Chongjia Ni, Wenju Liu, & Bo Xu, "Improved Large Vocabulary Mandarin Speech Recognition using Prosodic and Lexical Information in Maximum Entropy Framework"
[5] Yi Liu & Pascale Fung, "Pronunciation Modeling for Spontaneous Mandarin Speech Recognition", International Journal of Speech Technology, 2004
[6] Lori Lamel, J.L. Gauvain, V.B. Le, I. Oparin, S. Meng, "Improved Models For Mandarin Speech to Text Transcription", ICASSP, 2011
[7] O. Kalinli, "Tone and Pitch Accent Classification Using Auditory Attention Cues", ICASSP, 2011
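
Sketch: Pitch Stream Features

The two-stream slides note that the pitch stream requires smoothing, because the F0 contour is discontinuous across unvoiced regions, and that d/dt(pitch) and d2/dt2(pitch) are added. A minimal Python sketch of that idea follows. It is an illustration under stated assumptions, not the method of any cited paper: unvoiced frames are assumed to be marked with 0, linear interpolation stands in for whatever smoothing the papers use, and central differences stand in for the regression-window deltas common in recognizers. All function names are hypothetical.

```python
def interpolate_unvoiced(f0):
    """Fill unvoiced frames (assumed to be marked 0) by linear interpolation
    between the nearest voiced frames; edges hold the nearest voiced value."""
    voiced = [i for i, v in enumerate(f0) if v > 0]
    if not voiced:
        return [float(v) for v in f0]
    out = []
    for i in range(len(f0)):
        if i <= voiced[0]:
            out.append(float(f0[voiced[0]]))
        elif i >= voiced[-1]:
            out.append(float(f0[voiced[-1]]))
        else:
            left = max(j for j in voiced if j <= i)
            right = min(j for j in voiced if j >= i)
            if left == right:  # frame i is itself voiced
                out.append(float(f0[i]))
            else:
                w = (i - left) / (right - left)
                out.append((1 - w) * f0[left] + w * f0[right])
    return out


def delta(x):
    """Per-frame slope: central differences inside, one-sided at the edges."""
    n = len(x)
    if n < 2:
        return [0.0] * n
    d = [x[1] - x[0]]
    d += [(x[i + 1] - x[i - 1]) / 2.0 for i in range(1, n - 1)]
    d.append(x[-1] - x[-2])
    return d


def pitch_stream(f0):
    """Return one (pitch, delta, delta-delta) tuple per frame."""
    smooth = interpolate_unvoiced(f0)
    d1 = delta(smooth)
    d2 = delta(d1)
    return list(zip(smooth, d1, d2))


# Made-up contour in Hz: frames 0, 2, 3, and 5 are unvoiced.
features = pitch_stream([0, 100, 0, 0, 130, 0])
```

Each tuple could then be appended to, or modeled alongside, the spectral features for the same frame.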

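
Sketch: Tone Sandhi Rules

The Coarticulation slide gives two examples of context changing the surface tone: Bu4 + Dui4 = Bu2 Dui4 and Ni3 + Hao3 = Ni2 Hao3. The short Python sketch below encodes only those two rules (a 3rd tone before another 3rd tone becomes 2nd tone; "bu" in 4th tone before another 4th tone becomes 2nd tone). It is a simplification for illustration: other sandhi rules are omitted, and longer runs of 3rd tones depend on prosodic grouping that is not modeled here. The function name and the (syllable, tone) pair representation are assumptions.

```python
def apply_sandhi(syllables):
    """Map canonical tones to surface tones for a list of (pinyin, tone) pairs.

    Only two rules are covered, both taken from the slide examples:
      - tone 3 followed by tone 3 becomes tone 2
      - "bu" with tone 4 followed by tone 4 becomes tone 2
    The following syllable's canonical (not surface) tone is used as context.
    """
    out = list(syllables)
    for i in range(len(syllables) - 1):
        base, tone = syllables[i]
        next_tone = syllables[i + 1][1]
        if tone == 3 and next_tone == 3:
            out[i] = (base, 2)
        elif base == "bu" and tone == 4 and next_tone == 4:
            out[i] = (base, 2)
    return out


hello = apply_sandhi([("ni", 3), ("hao", 3)])
wrong = apply_sandhi([("bu", 4), ("dui", 4)])
unchanged = apply_sandhi([("bu", 4), ("hao", 3)])
```

A tone recognizer that ignores this context would be scored against canonical tones that the speaker never produced, which is one reason the slides call for a longer time window in the pitch stream.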