Prosodic Modeling for Detecting Edit Disfluencies in Transcribing Spontaneous Mandarin Speech
Che-Kuang Lin, Shu-Chuan Tseng* & Lin-Shan Lee
College of EECS, National Taiwan University; Institute of Linguistics, Academia Sinica*, Taipei, Taiwan

Outline
• Introduction
• Prosodic features
• IP detection models
• Latent prosodic modeling (LPM)
• LPM-based detection models
• Experiment results & further analysis
• Conclusion

Examples of disfluency considered in this paper (1/2)
(In each example the disfluency interruption point (IP) is marked with *, and the parts are labeled reparandum / optional editing term / resumption.)
• Overt repair
  是(shi4) 進口(jin4kou3) 嗯(EN) 出口(chu1kou3) 嗎(ma1)
  is import [discourse particle] export [interrogative particle]
  "Do you import * uhn export products?"
• Abandoned utterances
  它(ta1) 有(you3) 一個(yi2ge5) 呃(E) 有個(you3ge5) 度假村(du4jia4cun1) 那邊(ne4bian1) 嘛(MA)
  it has one [discourse particle] has a resort there [discourse particle]
  "It has a * eh there is a resort there."

Examples of disfluency considered in this paper (2/2)
(The IP is marked with *; the parts are labeled reparandum / resumption.)
• Direct repetition
  因為(yin1wei4) 因為(yin1wei4) 它(ta1) 有(you3) 健身(jian4shen1) 中心(zhong1xin1)
  because because it has fitness center
  "Because * because it has a fitness center."
• Partial repetition
  看(kan4) 電(dian4) 看(kan4) 電視(dian4shi4) 最近(zui4jin4) 有(you3) 新(xin1) 電影(dian4ying3)
  watch electricity watch television recently has new movie
  "On the tele- * on the television, there is a new film recently."

Introduction
• One of the primary problems in spontaneous speech recognition is the presence of disfluencies
• Accurate identification of the various types of disfluencies can
  – help the recognition process
  – provide structural information about the utterances
• Purpose of this study
  – To identify useful and important features for interruption point (IP) detection
  – To analyze how these features are helpful for spontaneous Mandarin speech

Defining a whole set of prosodic features for spontaneous Mandarin speech (1/2)
• A set of prosodic features has been proposed for English (Shriberg, 2000)
• Spontaneous Mandarin is quite different from western languages
  – Tonal language: the same syllable with different tones represents different characters
    → motivates PCA-based pitch contour smoothing and pitch-related features
  – Mono-syllabic structure: every character has its own meaning and is pronounced as a monosyllable,
    and a word is composed of one to several characters (or syllables)
    → mono-character (mono-syllabic) and bi-character (bi-syllabic) words, with word boundaries and syllable boundaries

Defining a whole set of prosodic features for spontaneous Mandarin speech (2/2)
• Every syllable boundary (rather than word boundary) is considered a candidate for an IP
• A whole set of prosodic features is defined for each syllable boundary and used to detect the IPs

Syllable-wise pitch contour smoothing
• We first propose to use Principal Component Analysis (PCA) for efficient pitch contour smoothing
• Each syllable-wise pitch contour (fundamental frequency in Hz over frames; the figure shows contours for Tones 2, 3 and 4)
  is converted to a vector (e.g., v = c1·x + c2·y), projected onto the leading principal components (e.g., v' = w1·PC1),
  and reconstructed as a smoothed contour (a sketch of this step follows below)
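As a rough illustration of this PCA-based smoothing step, the sketch below (Python/NumPy) projects fixed-length, per-syllable F0 vectors onto the leading principal components and reconstructs them. The fixed-length resampling assumption, the function name and the two-component choice are illustrative, not details taken from the slides.

```python
# Minimal sketch of PCA-based syllable-wise pitch contour smoothing (assumption:
# each syllable's F0 contour has already been resampled to a fixed number of frames).
import numpy as np

def pca_smooth_contours(f0_matrix: np.ndarray, n_components: int = 2) -> np.ndarray:
    """f0_matrix: (n_syllables, n_frames) F0 values in Hz, one row per syllable.
    Returns the contours reconstructed from the top principal components."""
    mean = f0_matrix.mean(axis=0)                      # average contour
    centered = f0_matrix - mean
    # Principal directions from the SVD of the centered data
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]                     # (n_components, n_frames)
    weights = centered @ components.T                  # project each contour
    return weights @ components + mean                 # reconstruction = smoothing

# Hypothetical usage: three syllables, 40 frames each, noisy rising/falling contours
rng = np.random.default_rng(0)
frames = np.linspace(0, 1, 40)
contours = np.stack([200 + 50 * frames, 250 - 60 * frames, 180 + 20 * np.sin(3 * frames)])
noisy = contours + rng.normal(0, 5, contours.shape)
smoothed = pca_smooth_contours(noisy, n_components=2)
```

Keeping only a few components discards frame-level jitter while preserving the overall tone shape, which is what the pitch-related features below are computed on.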
Feature Definitions – Pitch-related Prosodic Features (1/2)
• The average pitch value within the syllable
• The maximum difference of pitch values within the syllable
• The average of the absolute values of pitch variations within the syllable
• The magnitude of pitch reset at boundaries
• The differences of such feature values at adjacent syllable boundaries (P1−P2, d1−d2, etc.)
  (illustrated in the figure on smoothed fundamental-frequency contours of Tones 2, 3 and 4)
• A total of 54 pitch-related features were obtained

Feature Definitions – Duration-related Prosodic Features (2/2)
• Deviation from the normal speaking rhythmic structure is important
  (figure: consecutive syllables A, B, C, D, E with pauses a and b around the syllable boundaries of interest,
  from the beginning to the end of the utterance)
• Pause duration: b
• Average syllable duration: (B+C+D+E)/4 or ((D+E)/2 + C)/2
• Syllable duration ratio: (D+E)/(B+C) or ((D+E)/2)/C
• Lengthening: C / ((A+B)/2)
• Combinations of pause & syllable features (ratio or product): C*b, D*b, C/b, D/b
• Standard deviation of the feature values
• A total of 38 duration-related features were obtained

Detection Models

Decision Tree
• Trees are grown on the training data based on a maximum entropy reduction criterion
• Illustrative decision tree for IP detection: internal nodes ask questions such as "pitch offset < 12.99?",
  "have pause?" and "syl_dur_ratio < 4.5?"; each leaf holds a class distribution [P(IP), P(non-IP)]
  such as (0.2, 0.8) or (0.8, 0.2)
• The probability of an IP is found by traversing the tree down to a certain leaf node

Maximum Entropy Model (1/2)
• Various problem-specific knowledge can be incorporated into the model through many properly designed feature functions
• Feature functions f_i(x, y) can be binary or real-valued; taking a binary feature function as an example,
  f_i(x, y) = 1 if some condition on x and y is satisfied, and 0 otherwise
  – x: the prosodic features at the syllable boundary; y: IP or non-IP; i indexes the set of features
  – Example: f_i(x, y) = 1 if there is a pause at the boundary (x) and the boundary is an IP (y)

Maximum Entropy Model (2/2)
• Known statistics to be modeled: the expectation of each feature function f_i with respect to the desired model
  is constrained to equal its expectation obtained from the training data
• Among all the distributions that satisfy this set of constraints, choose the one with the highest entropy

Integrating DT & Maxent (DT-ME)
• Decision trees built with the training data are used to derive the feature functions for the maximum entropy model
• First, grow deep and bushy trees from the training data
• Then, for each sample (training or testing):
  – Traverse the trees down to certain leaves
  – Each leaf serves as a single binary feature function, i.e. whether the sample falls in this leaf (1) or not (0)
    (see the sketch below)
  – The sample is thus represented by a binary feature vector such as 0 1 0 0 1 1 0 0 0 0 1 0
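A minimal sketch of the DT-ME idea, under these assumptions: scikit-learn interfaces are used, a single decision tree stands in for the deep, bushy trees grown in the slides, and a logistic regression model plays the role of the maximum entropy classifier over the binary leaf-membership feature functions. All names are illustrative, not the authors' implementation.

```python
# Minimal DT-ME sketch: decision-tree leaves become binary feature functions
# for a maximum-entropy-style classifier (logistic regression here).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

def fit_dt_me(X_train: np.ndarray, y_train: np.ndarray):
    """X_train: prosodic feature vectors per syllable boundary; y_train: 1 = IP, 0 = non-IP."""
    tree = DecisionTreeClassifier(max_depth=12, min_samples_leaf=5).fit(X_train, y_train)
    leaves = tree.apply(X_train).reshape(-1, 1)          # leaf index for each sample
    encoder = OneHotEncoder(handle_unknown="ignore").fit(leaves)
    # Each leaf is one binary feature; logistic regression acts as the maxent model
    maxent = LogisticRegression(max_iter=1000).fit(encoder.transform(leaves), y_train)
    return tree, encoder, maxent

def predict_ip_prob(model, X: np.ndarray) -> np.ndarray:
    """Return P(IP) for each syllable boundary in X."""
    tree, encoder, maxent = model
    leaves = tree.apply(X).reshape(-1, 1)
    return maxent.predict_proba(encoder.transform(leaves))[:, 1]
```

The point of the construction is that the tree supplies data-driven, non-linear partitions of the prosodic feature space, while the maxent stage learns calibrated weights over those partitions.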
Latent Prosodic Modeling

Latent prosodic modeling (LPM)
• Model the probabilistic behavior of prosodic features in terms of latent factors
• Prosodic characters and terms are derived from the prosodic features and the syllables recognized in a first-pass speech recognition:
  – The prosodic features x are vector-quantized (VQ) into prosodic characters (e.g., 23, 5, 0, 14, 31, …)
  – N-grams of prosodic characters form the prosodic terms (e.g., (23,5), (5,0), (23,5,0), …)

Latent prosodic modeling (LPM)
• Prosodic documents are defined at three levels — segment (d_seg,i), utterance (d_utt,i) and speaker (d_spk,i) —
  each being a collection of prosodic terms
• The segments are obtained from the best-fitting piecewise linear function for the pitch contour

Latent prosodic modeling (LPM)
• The relationship between the prosodic terms t_k and the prosodic documents d_i is modeled via latent prosodic
  states z_l within the probabilistic framework of Probabilistic Latent Semantic Analysis (PLSA):
    P(t_k | d_i) = Σ_{l=1}^{L} P(t_k | z_l) P(z_l | d_i),  for all i, k

Latent prosodic modeling (LPM)
• The probabilities are trained with the EM algorithm by maximizing the total log-likelihood
    L_T = Σ_{i=1}^{N} Σ_{k=1}^{N'} n(t_k, d_i) log P(t_k | d_i),
  where n(t_k, d_i) is the count of prosodic term t_k in prosodic document d_i
  (an EM sketch is given below, after the LPM-based detection slides)
• The complicated behavior of the prosodic features can then be analyzed based on these probabilities,
  for instance with the similarity measure
    Sim_LPM(d_i, d_j) = Σ_l P(z_l | d_i) P(z_l | d_j) / ( sqrt(Σ_l [P(z_l | d_i)]²) · sqrt(Σ_l [P(z_l | d_j)]²) )

LPM-based Detection Models

LPM for IP detection
• LPM-based model adaptation
  – The LPM is trained on the raw data and used for actively selecting relevant training data for a specific testing condition
  – Both the training and the testing prosodic documents are projected into the latent prosodic space
    (coordinates P(z_1 | d_i), P(z_2 | d_i), P(z_3 | d_i), …); the training documents closest to the testing documents
    are compared and selected (kNN-based or HAC-based selection) and used to train the LPM-adapted detection models

LPM for IP detection
• Anchor model training
  – The prosodic documents associated with the different classes of disfluency IPs are merged into super-documents,
    which are used to train the anchor models (class 1, …, class c) in the latent prosodic space
  – The prosody of each IP candidate is then compared against these anchors for classification
  – Anchor models can be used together with the training data selection mentioned above
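The following is a small sketch of the PLSA-style EM training behind LPM and of the similarity measure above, assuming the prosodic-term counts n(t_k, d_i) have already been collected into a term-by-document matrix; the array layout, variable names and number of latent states are illustrative assumptions, not taken from the slides.

```python
# PLSA-style EM sketch for the latent prosodic model: learn P(t|z) and P(z|d)
# from a term-by-document count matrix counts[k, i] = n(t_k, d_i).
import numpy as np

def train_plsa(counts: np.ndarray, n_states: int = 8, n_iters: int = 50, seed: int = 0):
    """Maximizes L_T = sum_{i,k} n(t_k, d_i) log sum_l P(t_k|z_l) P(z_l|d_i)."""
    rng = np.random.default_rng(seed)
    n_terms, n_docs = counts.shape
    p_t_z = rng.random((n_terms, n_states)); p_t_z /= p_t_z.sum(axis=0, keepdims=True)
    p_z_d = rng.random((n_states, n_docs)); p_z_d /= p_z_d.sum(axis=0, keepdims=True)
    for _ in range(n_iters):
        # E-step: posterior P(z | t, d) for every (term, document) pair
        joint = p_t_z[:, :, None] * p_z_d[None, :, :]          # (terms, states, docs)
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        weighted = counts[:, None, :] * post                   # n(t, d) * P(z | t, d)
        # M-step: re-estimate P(t|z) and P(z|d)
        p_t_z = weighted.sum(axis=2)
        p_t_z /= p_t_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=0)
        p_z_d /= p_z_d.sum(axis=0, keepdims=True) + 1e-12
    return p_t_z, p_z_d

def sim_lpm(p_z_di: np.ndarray, p_z_dj: np.ndarray) -> float:
    """Cosine-style similarity of two prosodic documents in the latent prosodic space."""
    num = float(np.dot(p_z_di, p_z_dj))
    return num / (np.linalg.norm(p_z_di) * np.linalg.norm(p_z_dj) + 1e-12)
```

In this form, a column of p_z_d is the latent-space coordinate vector of one prosodic document, which is exactly what the kNN/HAC selection and the anchor-model comparison operate on.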
LPM for IP detection
• Integration of the LPM-adapted classification models with an SVM
  – Two classification models: DT-ME or anchor model
  – Adapted at the segment, utterance, or speaker level
  – An SVM combines the scores of the segment-type, utterance-type and speaker-type models,
    together with the models without LPM, to make the final decision

LPM for IP detection
• LPM-based feature expansion for DT-ME: two additional sets of features can be used
  – (F1) The probabilities of each prosodic state given the prosodic document: P(z_l | d_i), for all l
  – (F2) The likelihoods of the prosodic terms given the prosodic document:
    P(t_k | d_i) = Σ_{l=1}^{L} P(t_k | z_l) P(z_l | d_i), for t_k ∈ d_i

Experiment Results

Corpus Used in the Research
• Mandarin Conversational Dialogue Corpus (MCDC)
• 30 conversational dialogues (27 hours in total)
• 8 of the 30 dialogues were annotated with disfluencies (8.2 hours in total, 9 female & 7 male speakers)
• Summary of the experiment data:
                        train     test
  Data length           7.1 hr    1.1 hr
  Number of non-IPs     92189     14231
  Number of IPs         3569      536
  Chance of non-IPs     96.3%     96.4%

IP detection results
                        Recall    Precision
  Decision Tree         73.15     73.03
  Integrated approach   56.38     81.95
• The decision tree achieved moderate and balanced recall and precision rates
• The integrated approach trades degraded recall for significantly better precision
• For the purpose of speech recognition, the integrated approach is more appropriate:
  – An incorrectly detected IP may cause recognition errors
  – A missed IP can be processed as usual

Analysis of the Importance and Roles of Different Features

Identify important features for IP detection
• Exclude each single feature from the full set and then perform the complete IP detection process
  (a leave-one-out sketch is given after the feature-importance tables below)
• The detection performance degradation due to the missing feature is obtained
  by comparing the degraded performance with the original performance

Investigation of how the two feature categories are related to IP detection
• The most serious performance degradation caused by removing one single feature from each of the two categories
  (pitch-related vs. duration-related) is compared across the four disfluency types
  (abandoned utterances, overt repairs, direct repetitions, partial repetitions)
• For overt repairs and partial repetitions, pitch-related features play a relatively more important role in IP detection
• For direct repetition IP detection, the duration-related features are more important
• For abandoned utterance IP detection, both feature categories have an equally important impact

Importance of each individual pitch-related feature
• The most serious performance degradation caused by removing one single pitch-related feature:
  Disfluency type          Most important feature / (recall degradation)    Second most important / (recall degradation)
  1. abandoned utterances  (a) / (-17.25)                                   (b) / (-14.97)
  2. overt repairs         (c) / (-26.67)                                   (a) / (-20.00)
  3. direct repetitions    (d) / (-5.40)                                    (e) / (-5.40)
  4. partial repetitions   (b) / (-18.21)                                   (f) / (-18.21)
  ((a)–(f) denote specific pitch-related features)
• Average pitch value within a syllable: (b), (d) — types 1, 3, 4
• Maximum difference of pitch values within a syllable: (e), (f) — types 3, 4
• Magnitude of pitch reset for boundaries: (a) — types 1, 2

Importance of each individual duration-related feature
• The most serious performance degradation caused by removing one single duration-related feature:
  Disfluency type          Most important feature / (recall degradation)    Second most important / (recall degradation)
  1. abandoned utterances  (g) / (-17.25)                                   (h) / (-14.97)
  2. overt repairs         (i) / (-13.33)                                   (j) / (-13.33)
  3. direct repetitions    (k) / (-8.10)                                    (l) / (-8.10)
  4. partial repetitions   (h) / (-16.33)                                   (m) / (-16.33)
  ((g)–(m) denote specific duration-related features)
• Jointly considering both the syllable duration and the pause duration is useful:
  – The ratio of syllable duration to pause duration: (g), (h), (k) — types 1, 3, 4
  – The product of syllable duration and pause duration: (i), (j), (m) — types 2, 4
• The character duration ratio across the boundary: (l) — type 3
• The standard deviation of the product of syllable and pause duration: (m) — type 4
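The leave-one-feature-out procedure behind these tables can be sketched as below; `run_detection` is a placeholder for the complete IP-detection pipeline (training plus per-type recall evaluation) and is not a function from the paper.

```python
# Minimal leave-one-feature-out ablation sketch for gauging feature importance.
from typing import Callable, Dict, Sequence

def feature_ablation(all_features: Sequence[str],
                     run_detection: Callable[[Sequence[str]], Dict[str, float]]
                     ) -> Dict[str, Dict[str, float]]:
    """Return, for each held-out feature, the per-disfluency-type recall change."""
    baseline = run_detection(all_features)                 # full feature set
    degradation: Dict[str, Dict[str, float]] = {}
    for feat in all_features:
        reduced = [f for f in all_features if f != feat]   # drop exactly one feature
        scores = run_detection(reduced)                    # rerun the whole pipeline
        degradation[feat] = {t: scores[t] - baseline[t] for t in baseline}
    return degradation
```

A large negative value for a held-out feature (as in the tables above) indicates that the feature is important for detecting IPs of that disfluency type.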
Results for IP Detection

IP detection experiment
• Three different feature sets were tested (using the decision tree), both on known transcriptions (ktr)
  and on recognition results with errors (rec):
  – [feature set 1] The same features as used in previous work
  – [feature set 2] The same features as above, but extracted at syllable boundaries
  – [proposed feature set] The proposed features, extracted at syllable boundaries
  (bar chart: detection accuracy (%) of the three feature sets under the ktr and rec conditions)

IP detection experiment
• Comparison of different IP detection approaches using the new feature set proposed here
  (bar chart: detection accuracy (%) of DT, Maxent and the integrated DT-ME under the rec condition)

LPM for IP detection
• IP detection accuracy using the LPM-based DT-ME, maxent and anchor models
  (two chart panels, each reporting accuracy for the DT-ME, maxent and anchor models, roughly 78–85%:
  (a) HAC-based vs. kNN-based training data selection (plain / seg / utt / spk / all);
  (b) segment-, utterance- & speaker-type models vs. the plain and utterance-type models)

LPM for IP detection
• Methods compared by IP detection accuracy (%), roughly 82–85 in the chart:
  – (a) maxent with the SVM combiner
  – (b) = (a) plus the P(t|d) features (F2)
  – (c) = (a) plus the P(z|d) features (F1)
  – (d) = (a) plus both (F1) and (F2)
  – (e) = (d) plus the anchor models
• (e), i.e. the finally enhanced DT-ME model combined with the anchor model by the SVM, yielded the best result

Conclusion
• A whole set of prosodic features for disfluency IP detection was developed, tested and analyzed
• The most important features for each disfluency type were identified and discussed

Conclusion
• A new disfluency IP detection model that incorporates decision trees into a maximum entropy model was developed
• Latent prosodic modeling, which analyzes speech prosody within a probabilistic framework of latent prosodic states,
  is proposed to adapt the IP detection models

Thanks for your attention