Phone Boundary Detection using Sample-based Acoustic Parameters
Yih-Ru Wang
Institute of Communication Engineering, National Chiao Tung University, Hsinchu, Taiwan, ROC
2011/7/12, NGASR Workshop

Outline
• Motivation, Background
• Why sample-based?
• Sample-based Acoustic Parameters & Phone Boundary Detector
• Experimental results
• Conclusions and Future work

Motivation
• Find the synchronous "clock" for detection-based ASR and Computer-Aided Language Learning (CALL) systems
[Diagram: speech signal → speech attribute detectors and phone boundary detector → segment-based system (detection-based ASR, CALL system); the phone boundary detector supplies the synchronous "clock" for the system]

Background
• Tasks of phonetic segmentation
  – Phone alignment: 87% inclusion rate at a 10 ms tolerance for human experts
  – Phone boundary detection
• Phone alignment uses model-based methods
  – HMM, MBE-HMM (Minimum Boundary Error HMM), HMM + fine tuning using SVM, …
• Phone boundary detection uses metric-based methods
  – a measure of speech signal change
  – norm of the delta-MFCC feature vector (Rabiner, 2006)
  – KL distance or BIC of the speech signal
• Frame-based features, such as MFCC, were used in these methods

Why sample-based?
• Transient vs. stationary
• Accuracy and precision
  – especially for 'short' phones, e.g. plosives
• Acoustic features with high frequency resolution, like MFCC, are used to 'recognize' phones in speech
• To detect changes of pronunciation manner/position (acoustics) in the speech signal, increase the time resolution and decrease the frequency resolution of the features
• To find useful measures of speech signal change in a sample-based system
  – Sample-based acoustic parameters were proposed
• Pros of the sample-based method
  – Better accuracy and precision
  – Properly detects the boundaries of short phones
• Cons of the sample-based method
  – Complexity of the system?
  – Higher false alarm rate?

Sample-based Acoustic Parameters & Phone Boundary Detector
• Sub-band signal envelope
  – Six sub-bands used for landmark detection (Liu, 1996)
[Diagram: speech signal → bank of bandpass filters → envelope detectors → sub-band signal envelopes]
  – Bandpass frequencies:
      5.0 – 8.0 kHz
      3.5 – 5.0 kHz
      2.0 – 3.5 kHz
      1.5 – 2.0 kHz
      0.8 – 1.2 kHz
      0.0 – 0.4 kHz
• ROR (rate of rise) of the sub-band signal envelope
  – The delta term of a feature

[Figure: waveform, envelope, and sub-band signal envelopes (5.0 – 8.0 kHz down to 0.0 – 0.4 kHz) of TIMIT utterance FDRW0/sx293, "Please take this dirty table cloth to the cleaners for me", with manner labels (stop, glide, vowel, nasal, fricative, silence) along the time axis]

[Figure: ROR of the signal envelope and ROR of the sub-band signal envelopes for "Please take this dirty cloth…"; time scale marker ~20 ms]

• The norm of the sub-band signal envelopes can be a useful measure of signal change
• Sample-based spectral entropy can be defined as
$$H_s[n] = -\sum_{i} E_i[n] \log E_i[n]$$
  where $E_i[n]$ is the i-th normalized sub-band signal envelope
• Sample-based spectral KL distance between the speech signals at two adjacent times [n, n+1] can be defined as
$$d_{KL}[n] = \sum_{i=1}^{6} \left(E_i[n] - E_i[n+1]\right) \log \frac{E_i[n]}{E_i[n+1]}$$

• An example of sample-based spectral entropy and its ROR
[Figure: sample-based spectral entropy and ROR of spectral entropy]

• An example of sample-based spectral KL distance
[Figure: sample-based spectral KL distance]
  It can be used to find the signal change points more accurately and precisely.
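The sub-band envelopes, the spectral entropy and the spectral KL distance defined above can be sketched in Python. This is only an illustration under stated assumptions: the slides do not specify the bandpass filters or the envelope detector, so the FFT-domain filtering, the rectify-and-smooth envelope, the 5 ms smoothing length and all function names here are hypothetical choices, not the author's implementation.

```python
import numpy as np

# Band edges in Hz as listed on the slide, lowest band first.
BANDS_HZ = [(0.0, 400.0), (800.0, 1200.0), (1500.0, 2000.0),
            (2000.0, 3500.0), (3500.0, 5000.0), (5000.0, 8000.0)]

def subband_envelopes(x, fs, bands=BANDS_HZ, smooth_ms=5.0):
    """Per-sample envelopes, shape (len(bands), len(x)).

    Assumption: ideal FFT-domain bandpass, then rectification and a
    moving-average smoother (the slides do not specify either stage).
    """
    x = np.asarray(x, dtype=float)
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    win = max(1, int(round(smooth_ms * 1e-3 * fs)))
    kernel = np.ones(win) / win
    envs = []
    for lo, hi in bands:
        mask = (freqs >= lo) & (freqs < hi)
        band = np.fft.irfft(spectrum * mask, n=len(x))
        envs.append(np.convolve(np.abs(band), kernel, mode="same"))
    return np.vstack(envs)

def normalize(envs, eps=1e-10):
    """Make the envelopes sum to 1 at every sample (eps avoids log(0))."""
    envs = envs + eps
    return envs / envs.sum(axis=0, keepdims=True)

def spectral_entropy(e_norm):
    """H_s[n] = -sum_i E_i[n] log E_i[n]."""
    return -(e_norm * np.log(e_norm)).sum(axis=0)

def spectral_kl(e_norm):
    """d_KL[n] = sum_i (E_i[n] - E_i[n+1]) log(E_i[n] / E_i[n+1])."""
    a, b = e_norm[:, :-1], e_norm[:, 1:]
    return ((a - b) * np.log(a / b)).sum(axis=0)

# Synthetic check: a 1 kHz tone that switches to 4 kHz at sample 1600.
fs = 16000
n = np.arange(3200)
x = np.where(n < 1600,
             np.sin(2 * np.pi * 1000 * n / fs),
             np.sin(2 * np.pi * 4000 * n / fs))
envs = subband_envelopes(x, fs)
e_norm = normalize(envs)
h = spectral_entropy(e_norm)
d_kl = spectral_kl(e_norm)
# FFT filtering is circular, so the first and last samples are ignored here.
peak = int(np.argmax(d_kl[400:2800])) + 400
print("largest interior d_KL at sample", peak)
```

On this synthetic signal the KL distance should be much larger around the switch point than in the stationary parts, which is the behaviour the slides rely on for locating signal change points.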
• An MLP was used as the phone boundary detector
• Block diagram of the proposed training/test procedure
[Diagram: training and test speech signals → feature extraction → sample-based acoustic features → candidate pre-selection → MLP-based boundary detector → refined boundary; during training, candidate target labeling uses the initial boundary]

• Candidate pre-selection
  – find all speech samples, with index n, that satisfy
$$d_{KL}[n-1] < d_{KL}[n], \quad d_{KL}[n] > d_{KL}[n+1], \quad d_{KL}[n] > Thd$$
• Pre-selection reduces the complexity and the false alarms (FA) of the sample-based system.
• After candidate pre-selection, an MLP was used as the boundary detector

• The AP features used for the MLP detector
[Figure: candidates c_{k-1}, c_k, c_{k+1} on the time axis, delimiting segment k-1 and segment k, each with a stable part; spectral KL distance measure at candidate k; D_KL[c_k]: KL distance between 2 normal pdfs]
  A 27-dim acoustic parameter vector for the k-th candidate, at time $c_k$, contains
$$\{E_i[c_k];\ i = 0,\dots,6\},\ d_{KL}[c_k],\ H_s[c_k],\ \Delta H_s[c_k],\ D_{KL}[c_k],$$
$$\{ES_l(c_{k-1}, c_k),\ ES_l(c_k, c_{k+1});\ l = 0,\dots,6\},\ (c_k - c_{k-1}),\ (c_{k+1} - c_k)$$

• Iterative training procedure
[Diagram: candidates from pre-selection → acoustic parameters of candidates → MLP-based boundary detectors → detector output → refined boundary, used as the target of the MLP; the initial boundary comes from manual labeling]

• 2nd stage:
  – use a similarity measure of segmental acoustic signals
[Figure: candidates C'_{k-1}, C'_k, C'_{k+1} on the time axis, delimiting segment k-1 and segment k, each with a stable part; spectral KL distance measure at candidate k; similarity measure between speech segments?]
  – Use a GMM to model the pdf of a speech segment
  – The KL-1 distance of CCGMM (Wang, 2004): a common GMM represents the pdfs of the two segments

• Similarity measure of two speech segments:
  – Discrete KL-1 distance of CCGMM coefficients
$$p_k(o[n]) = \sum_{l=1}^{L} c_{kl}\, N(o[n]; \mu_l, \Sigma_l)$$
$$D_1(O_1 \mid O_2) = \int p_1(o) \ln \frac{p_1(o)}{p_2(o)}\, do = \int p_1(o)\, P_{1\to2}(o)\, do, \qquad P_{1\to2}(o) = \ln \frac{p_1(o)}{p_2(o)}$$
$$D_{1\to2}(O_1 \mid O_2) = E_1\left[P_{1\to2}(o)\right] \approx \sum_i c_{1i} \ln \frac{c_{1i}}{c_{2i}}$$
  – Discrete KL-2 distance using CCGMM coefficients
$$D_2(O_1, O_2) = \frac{1}{2}\left\{E_1\left[P_{1\to2}(o)\right] + E_2\left[P_{2\to1}(o)\right]\right\} \approx \frac{1}{2}\sum_i \left(c_{1i} - c_{2i}\right) \ln \frac{c_{1i}}{c_{2i}}$$

  – The discrete KL-1 distance is the mean of the log-likelihood ratio of the two pdfs
  – It measures the similarity of the two pdfs
  – Find higher-order statistics of the log-likelihood pdfs (Wang, 2008)
  – Variance and skewness of the log-likelihood pdfs, with $\mu_{1\to2} = D_{1\to2}(O_1 \mid O_2)$:
$$\sigma_{1\to2} = \left[\sum_{i=1}^{N} c_{1i}\left(\ln \frac{c_{1i}}{c_{2i}} - \mu_{1\to2}\right)^{2}\right]^{1/2}; \qquad S_{1\to2} = \left[\sum_{i=1}^{N} c_{1i}\left(\ln \frac{c_{1i}}{c_{2i}} - \mu_{1\to2}\right)^{3}\right]^{1/3}$$

  – use segmental similarity
[Figure: candidates C'_{k-1}, C'_k, C'_{k+1} on the time axis, delimiting segment k-1 and segment k, each with a stable part; spectral KL distance measure at candidate k; similarity measure between speech segments?]
  A 30-dim acoustic parameter vector for the k-th candidate, at time $c_k$, contains
    – the envelopes $E_i[c_k]$, $i = 0,\dots,6$
    – the log-likelihood statistics $\mu$, $\sigma$, $S$ of the adjacent segments
    – $ES_l(c_{k-1}, c_k)$ and $ES_l(c_k, c_{k+1})$, $l = 0,\dots,6$
    – the segment durations $(c_k - c_{k-1})$ and $(c_{k+1} - c_k)$
    – the output of the 1st stage

Experimental Results
• Database: TIMIT.
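Two computations from the detector description can be sketched in Python: the candidate pre-selection rule (a local maximum of d_KL that also exceeds Thd) and the discrete KL statistics computed from CCGMM mixture weights. This is a hedged illustration, not the author's code. The function names, the `eps` smoothing and the example numbers are hypothetical, and the variance and cube-root skewness follow one plausible reading of the garbled slide formulas.

```python
import numpy as np

def preselect_candidates(d_kl, thd):
    """Indices n with d_kl[n-1] < d_kl[n], d_kl[n] > d_kl[n+1], d_kl[n] > thd."""
    d = np.asarray(d_kl, dtype=float)
    mid = d[1:-1]
    is_peak = (d[:-2] < mid) & (mid > d[2:]) & (mid > thd)
    return np.flatnonzero(is_peak) + 1  # +1 restores the original indexing

def ccgmm_stats(c1, c2, eps=1e-12):
    """Mean, standard deviation and cube-root third central moment of
    ln(c1_i / c2_i) under the weights c1 (assumed reading of the slides).

    c1 and c2 are the mixture weights of two segments on a common GMM.
    The mean is the discrete KL-1 distance from segment 1 to segment 2.
    """
    c1 = np.asarray(c1, dtype=float) + eps
    c2 = np.asarray(c2, dtype=float) + eps
    c1, c2 = c1 / c1.sum(), c2 / c2.sum()
    llr = np.log(c1 / c2)
    mu = np.sum(c1 * llr)
    sigma = np.sqrt(np.sum(c1 * (llr - mu) ** 2))
    skew = np.cbrt(np.sum(c1 * (llr - mu) ** 3))
    return mu, sigma, skew

def discrete_kl2(c1, c2):
    """Symmetric distance: average of the two directed KL-1 distances."""
    return 0.5 * (ccgmm_stats(c1, c2)[0] + ccgmm_stats(c2, c1)[0])

# Made-up KL distance track with three local maxima above the threshold.
d_kl = np.array([0.0, 0.2, 0.1, 0.05, 0.6, 0.3, 0.3, 0.9, 0.1])
candidates = preselect_candidates(d_kl, thd=0.15)
print(candidates)  # [1 4 7]

# Made-up mixture weights for two segments sharing a 2-component GMM.
mu, sigma, skew = ccgmm_stats([0.5, 0.5], [0.25, 0.75])
print(mu, sigma, skew, discrete_kl2([0.5, 0.5], [0.25, 0.75]))
```

Raising `thd` trades candidates for complexity: with `thd=0.5` only samples 4 and 7 survive in this toy track.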
• After candidate pre-selection,
  – 1 in 116 samples was selected
  – 0.9% MD due to candidate pre-selection
• Performance of the MLP boundary detector:

    TIMIT corpus    Samples      Phone boundaries
    Training set    226727341    172461
    Test set        82786737     62466

• Performance of the sample-based boundary detector
[Figure: FA rate vs. MD rate (both axes 0 to 0.25) for 1-stage (MLP), 1-stage (RNN), 2-stage system, Rabiner [2006], and HMM]

• An example of the proposed phone boundary detector
[Figure]

• Accuracy of the sample-based boundary detector
[Figure: inclusion rate (0 to 1) vs. absolute boundary error (< 5 ms to < 50 ms in 5 ms steps) for HMM, 1-stage RNN (EER), and 2-stage system]

• Comparison with Dr. Rabiner's work [2006]:

    Absolute error              5 ms     10 ms    15 ms    same frame   ±1 frame
    Inclusion rate (1-stage)    41.5%    69.7%    81.1%    37.3%        77.0%
    Inclusion rate (2-stage)    42.1%    70.3%    81.9%    37.8%        77.8%

  Dr. Rabiner's result: (22.8%, 59.2%)

• Error analysis
  – MAE of detected boundary, sample-based/HMM system (unit: ms); * number of samples less than 100
  – Manner classes: Affricate, Fricative, Stop, Glide, Vowel, Nasal, Silence

    Affricate  -
    Fricative  2.3/17.0 6.4/6.5* 10.1/6.9* 7.3/10.0* 6.8/13.7 4.9/15.3* 6.1/12.8 7.2/7.0 13.6/13.1* 9.5/14.9 7.9/13.3 7.1/12.5 6.5/11.7
    Stop       - 6.1/7.3 12.4/12.0* 11.2/15.0 7.5/13.1 7.6/9.6 7.1/14.4
    Glide      - 7.0/9.5 10.4/12.8 11.0/21.2 7.9/13.6 6.4/11.2 6.3/12.7
    Vowel      - 6.3/9.8 7.9/11.8 6.8/11.5 6.9/13.6
    Nasal      7.6/11.3* 6.2/8.2 11.1/13.2 11.6/15.3 7.2/13.3 5.6/11.2* 6.9/12.1
    Silence    6.3/12.5 6.0/7.5 7.3/8.2 9.9/15.9 8.8/17.6 11.7/14.1 7.4/12.1 5.2/9.9 7.0/18.9
    Overall:   7.6/12.4

• Accuracy of the proposed method

    System          MAE/RMSE (ms)   MAE/RMSE (frame)   MAE/RMSE (normalized to phone duration)
    HMM             12.4/17.0       1.22/1.84          0.204/0.322
    1-stage (RNN)   7.6/11.5        0.96/1.82          0.127/0.197

• Error analysis (1 stage)
  – MDR and FAR: deletion rate by pronunciation manner (rows) and manner of the next phone (columns), plus insertion rate

    Manner      System   Affricate  Fricative  Stop    Glide   Vowel   Nasal   Silence   Insertion
    Affricate   HMM      -          0.0%       25.0%   11.8%   4.4%    0.0%    2.9%      9.2%
                RNN      -          0.0%       0.0%    17.6%   5.7%    7.7%    8.0%      11.1%
    Fricative   HMM      0.0%       2.3%       13.3%   10.6%   4.8%    5.0%    6.2%      7.5%
                RNN      0.0%       3.3%       16.3%   20.5%   8.2%    7.0%    7.4%      10.5%
    Stop        HMM      -          1.6%       12.6%   14.1%   5.7%    4.1%    2.0%      3.8%
                RNN      -          2.3%       14.9%   22.7%   8.2%    10.3%   2.6%      9.0%
    Glide       HMM      -          2.8%       16.1%   28.2%   5.6%    4.5%    5.2%      3.8%
                RNN      -          5.8%       6.3%    6.5%    6.6%    8.4%    7.3%      9.4%
    Vowel       HMM      -          2.9%       6.7%    6.6%    6.5%    10.3%   4.4%      7.8%
                RNN      -          6.2%       9.2%    6.9%    6.4%    10.0%   7.3%      10.1%
    Nasal       HMM      7.1%       3.5%       17.7%   7.8%    5.4%    2.5%    18.4%     5.7%
                RNN      7.1%       9.8%       16.1%   18.8%   8.3%    2.5%    8.7%      9.4%
    Silence     HMM      2.1%       0.9%       6.2%    6.6%    4.3%    4.8%    3.0%      5.7%
                RNN      5.0%       3.9%       10.2%   10.0%   7.6%    5.0%    3.0%      6.5%

  Overall: HMM 6.4% (EER); sample-based 8.7% (EER)

Conclusions & Future work
• Several sample-based acoustic parameters, which can properly model speech signal change, were proposed
• Using the sample-based APs in the phone boundary detector, better precision and accuracy were achieved
• Segment-based speech attribute detectors

Segment-based Attribute Detector
• Segment-based attribute recognizer
  – Operation point: 3% MDR, 20% FAR
  – Coding each contour using Legendre polynomials
  – Set the operation point to a low MD, high FA rate: 80123 segments / 62465 phones.
  – Feature extraction using the Legendre coefficients of the AP contours
[Figure: segment k-1, segment k and segment k+1 around candidate k on the time axis, each with a stable part; Legendre coefficients (dim 3*7) are extracted for each segment]
  – Preliminary results

    Pronunciation manner   Segment-based recog. rate (%)   Frame-based recog. rate (%)
    Fricative              75.6                            85.2
    Stop                   76.7                            72.5
    Glide                  64.3                            56.5
    Vowel                  90.3                            89.0
    Nasal                  73.6                            77.5
    Silence                89.1                            92.2
    Overall                81.9                            82.1

    The frame-based system uses features from 9 frames.
  – Converted into accuracy over time: 81.2%
  – Only 6 band-pass envelopes were used
  – Phone alignment
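The Legendre coding of AP contours mentioned above can be sketched with NumPy's `numpy.polynomial.legendre.legfit`. The sketch assumes that "dim 3*7" means three coefficients (orders 0 to 2) for each of seven contours in a segment and that segment time is mapped linearly to [-1, 1]; the function name `legendre_features` and the toy contours are hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_features(contours, order=2):
    """Fit each contour of one segment with Legendre polynomials.

    contours: array of shape (n_contours, n_samples), n_samples > order.
    Returns a flat vector of (order + 1) coefficients per contour.
    Assumption: time within the segment is mapped linearly to [-1, 1].
    """
    contours = np.atleast_2d(np.asarray(contours, dtype=float))
    t = np.linspace(-1.0, 1.0, contours.shape[1])
    # legfit accepts several data sets as columns of y and returns
    # coefficients of shape (order + 1, n_contours).
    coefs = legendre.legfit(t, contours.T, order)
    return coefs.T.ravel()

# Toy segment with 7 contours of 50 samples each.
t = np.linspace(-1.0, 1.0, 50)
contours = np.vstack([2 + 3 * t, t ** 2] + [np.zeros(50)] * 5)
feats = legendre_features(contours)
print(feats.shape)  # (21,)
print(np.round(feats[:6], 3))
```

Because the coefficient count does not depend on the segment length, segments of different durations map to feature vectors of the same size, which is convenient for a segment-based recognizer.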