Mandarin Chinese Speech Recognition Mandarin Chinese Tonal language (inflection matters!) Monosyllabic language 1st tone – High, constant pitch (Like saying “aaah”) 2nd tone – Rising pitch (“Huh?”) 3rd tone – Low pitch (“ugh”) 4th tone – High pitch with a rapid descent (“No!”) “5th tone” – Neutral used for de-emphasized syllables Each character represents a single base syllable and tone Most words consist of 1, 2, or 4 characters Heavily contextual language Mandarin Chinese and Speech Processing Accoustic representations of Chinese syllables Structural Form (consonant) + vowel + (consonant) Mandarin Chinese and Speech Processing Phone Sets Initial/final phones [1] e.g. Shi, ge, zi = (shi + ib), (ge + e), (z + if) Initial phones: unvoiced 1 phone Final phones: voiced (tone 1-5) Can consist of multiple phones Mandarin Chinese and Speech Processing Strong tonal recognition is crucial to distinguish between homonyms [3] (especially w/o context) Creating tone models is difficult Discontinuities exist in the F0 contour between voiced and unvoiced regions Prosody Prosody: “the rhythmic and intonational aspect of language” [2] Embedded Tone Modeling[4] Explicit Tone Modeling[4] Tone Modeling Embedded Tone Modeling Tonal acoustic units are joined with spectral features at each frame [4] Explicit Tone Modeling Tone recognition is completed independently and combined after post-processing [4] Tone Modeling Pitch, energy, and duration (Prosody) combined with lexical and syntactic features improves tonal labeling Coarticulation Variations in syllables can cause variations in tone: Bu4 + Dui4 = Bu2 Dui4 (wrong) Ni3 + Hao3 = Ni2 Hao3 (hello) Emebedded Tone Modeling: Two Stream Modeling Ni, Liu, Xu Spectral Stream –MFCC’s (Mel frequency cepstral coefficients) Describe vocal tract information Distinctive for phones (short time duration) Pitch/Tone Stream – requires smoothing Describe vibrations of the vocal chords Independent of Spectral features d/dt(pitch) aka tone and d2/dt2(pitch) are added Embedded in an entire syllable Affected by coarticulation (requires a longer time window) – i.e. Sandhi Tone – context dependency Embedded Tone Modeling: Two Stream Modeling [4] Tonal Identification Features F0 Energy Duration Coarticulation (cont. speech) Initially use 2 stream embedded model followed by explicit modeling during lattice rescoring (alignment?) Explicit tone modeling uses max. entropy framework [4] (discriminative model) Explicit Tone Modeling [4] No. 1 Feature Description Duration of current, previous, and following syllables # of Features 3 2 3 Previous syllable is or is not sp 4 Statistical Parameters of pitch and log-energy of current syllable (i.e. max, min, mean, etc.) 10 5 Normalized max and mean of pitch and energy in each syllable in the context window 12 6 7 Location of current syllable within word Slope and intercept of F0 contour of current syllable, its delta, and delta-delta Tones of preceding and proceding syllables 1 6 1 2 Other Work Chang, Zhou, Di, Huang, & Lee [1] 3 Methods Powerful Language Model (no tone modeling) Embedded 2 Stream CER = 7.32% Tone Stream + Feature Stream CER = 6.43% Embedded 1 Stream Developed Pitch extractor pitch track added to feature vector CER = 6.03% Other Work Qian, Soong [3] F0 contour smoothing Multi-Space Distribution (MSD) Models 2 prob. Spaces Unvoiced: Discrete Voiced (F0 Contour): Continuous Other Work Lamel, Gauvain, Le, Oparin, Meng [6] Multi-Layer Perceptron Features Compare Language Models Combined with MFCC’s and Pitch features N-Gram: Back-off Language Model Neural Network Language Model Language Model Adaptation Other Work O. Kalinli [7] Replace prosodic features with biologically inspired auditory attention cues Cochlear filtering, inner hair cell, etc. Other features are extracted from the auditory spectrum Intensity Frequency contrast Temporal contrast Orientation (phase) Other Work Qian, Xu, Soong [8] Cross-Lingual Voice Transformation Phonetic mapping between languages Difficult for Mandarin and English Very different prosodic features References [1] Eric Chang, Jianlai Zhou, Shuo Di, Chao Huang, & Kai-fu Li, “Large Vocabulary Mandarin Speech Recognition with different Approached in Modeling Tones” [2] Meriam-Webster Dictionary, http://www.merriam-webster.com/ [3] Yao Qian & Frank Soong, “A Multispace Distribution (MSD) and Two Stream Tone Modeling Approach to Mandarin Speech Recognition”, Science Direct, 2009 [4]Chongjia Ni, Wenju Liu, & Bo Xu, “Improved Large vocabulary Mandarin Speech Recognition using Prosodic and Lexical Information in Maximum Entropy Framework” [5] Yi Liu & Pascale Fung, “Pronunciation Modeling for Spontaneous Mandarin Speech Recognition”, International Journal of Speech Technology, 2004 [6] Lori Lamel, J.L. Gauvain, V.B. Le, I. Oparin, S. Meng, “Improved Models For Mandarin Speech to Text Transcription, ICASSP, 2011 [7] O. Kalinli, “Tone and Pitch Accent Classification Using Auditory Attention Cues”, ICASSP, 2011