National Chiao Tung University (國立交通大學)
Institute of Communication Engineering (電信工程研究所)
Phone Boundary Detection using
Sample-based Acoustic Parameters
Yih-Ru Wang
Institute of Communication Engineering,
National Chiao Tung University, Hsinchu, Taiwan, ROC
2011/7/12 NGASR Workshop
Outline
• Motivation, Background
• Why sample-based?
• Sample-based Acoustic Parameters & Phone
Boundary Detector
• Experimental results
• Conclusions and Future works
Motivation
• Find a synchronous “clock” for detection-based ASR and
Computer-Aided Language Learning (CALL) systems
Speech signal
Speech Attribution
Detectors
Phone Boundary
Detector
Segment-based system
Synchronous “clock”
for the system
Detection-based ASR, CALL system
Background
• Tasks of Phonetic Segmentation
– Phone alignment: experts reach an 87% inclusion rate at a 10 ms tolerance
– Phone boundary detection
• Phone alignment : using Model-based method
– HMM, MBE-HMM (Minimum Boundary Error HMM), HMM + fine tuning
using SVM, …
• Phone boundary detection : using Metric-based method
– a measure of speech signal change
– norm of delta MFCC feature vector (Rabiner, 2006)
– KL distance or BIC of speech signal
• Frame-based features, such as MFCC, were used in these methods
Why sample-based?
• Transient vs. Stationary
• Accuracy and precision
– especially for ‘short’ phones, e.g. plosives
• Acoustic features with high frequency resolution, like MFCC, are used
→ to ‘recognize’ phones in speech
• To detect changes of pronunciation manner/position (acoustics)
in the speech signal
→ increase the time resolution and decrease the frequency
resolution of the features
• To find the useful measures of speech signal change
in sample-based system
– Sample-based Acoustic Parameters were proposed
• PROs of sample-based method
– Better accuracy and precision
– Properly detect the boundary of short phones
• CONs of sample-based method
– Complexity of system?
– Higher false alarm?
Sample-based Acoustic Parameters &
Phone boundary detector
• Sub-band signal envelope
– Six sub-bands used for landmark detection (Liu, 1996)
[Block diagram: Speech Signal → Bandpass Filter → Envelope detector → band-signal envelope, one branch per sub-band]
Bandpass freq.: 5.0 – 8.0 kHz, 3.5 – 5.0 kHz, 2.0 – 3.5 kHz, 1.5 – 2.0 kHz, 0.8 – 1.2 kHz, 0.0 – 0.4 kHz
• ROR (rate of rise) of the sub-band signal envelope
– The delta term of a feature
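The envelope and ROR steps on this slide can be illustrated with a small sketch. It assumes the band-pass filtering has already been done and only shows one possible envelope detector (full-wave rectification followed by a one-pole low-pass filter) and one possible ROR (the difference of the log-envelope over a fixed lag). The smoothing factor `alpha`, the lag, and the use of dB are illustrative assumptions, not values taken from the slides.

```python
import math

def envelope(band_signal, alpha=0.99):
    """Full-wave rectify, then smooth with a one-pole low-pass filter.

    alpha is a hypothetical smoothing factor, not a value from the slides.
    """
    env, state = [], 0.0
    for x in band_signal:
        state = alpha * state + (1.0 - alpha) * abs(x)
        env.append(state)
    return env

def rate_of_rise(env, lag=1, floor=1e-10):
    """Delta of the log-envelope (in dB) over `lag` samples; 0 before `lag`."""
    log_env = [20.0 * math.log10(max(e, floor)) for e in env]
    return [log_env[n] - log_env[n - lag] if n >= lag else 0.0
            for n in range(len(log_env))]

# Toy band signal: 50 samples of silence followed by a tone burst.
signal = [0.0] * 50 + [math.sin(0.3 * n) for n in range(200)]
env = envelope(signal)
ror = rate_of_rise(env, lag=10)

# The largest ROR value should fall close to the onset of the burst.
onset = max(range(len(ror)), key=lambda n: ror[n])
```

In a full system this would be run once per sub-band, giving one envelope and one ROR contour for each band.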
[Figure: waveform envelope and sub-band signal envelopes (5.0 – 8.0 kHz down to 0.0 – 0.4 kHz) for TIMIT: FDRW0/sx293, "Please take this dirty table cloth to the cleaners for me", with manner labels (Stop, Glide, Vowel, Nasal, Fricative, Silence)]
[Figure: ROR of the signal envelope and ROR of the sub-band signal envelopes for "Please take this dirty cloth…"; a span of ~20 ms is marked]
• Norm of sub-band signal envelopes can be a useful measure
of signal change
• Sample-based spectral entropy can be defined as
$H_s[n] = -\sum_i E_i[n] \log E_i[n]$
where $E_i[n]$ is the i-th normalized sub-band signal envelope
• Sample-based spectral KL distance between speech signals at
two adjacent times [n, n+1] can be defined as
$d_{KL}[n] = \sum_{i=1}^{6} \left( E_i[n] - E_i[n+1] \right) \log \frac{E_i[n]}{E_i[n+1]}$
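The two measures on this slide can be sketched in a few lines. The sketch assumes the entropy is the usual −Σ E log E over the normalized sub-band envelopes and that the KL distance is the symmetric form Σ (E_i[n] − E_i[n+1]) log(E_i[n] / E_i[n+1]). The `floor` guard against zero envelopes and the helper names are illustrative assumptions.

```python
import math

def normalize(envelopes, floor=1e-12):
    """Scale the sub-band envelope values at one sample so they sum to 1."""
    clipped = [max(e, floor) for e in envelopes]
    total = sum(clipped)
    return [e / total for e in clipped]

def spectral_entropy(e_n):
    """Entropy of the normalized sub-band envelopes at one sample."""
    return -sum(p * math.log(p) for p in e_n)

def spectral_kl(e_n, e_next):
    """Symmetric KL distance between the envelopes at samples n and n + 1."""
    return sum((p - q) * math.log(p / q) for p, q in zip(e_n, e_next))

# Toy six-band envelopes: a flat spectrum and one dominated by a single band.
flat = normalize([1, 1, 1, 1, 1, 1])
peaked = normalize([10, 1, 1, 1, 1, 1])

h_flat = spectral_entropy(flat)      # maximal entropy, log(6)
h_peaked = spectral_entropy(peaked)  # lower entropy
d_same = spectral_kl(flat, flat)     # no change between samples
d_change = spectral_kl(flat, peaked) # positive when the spectrum changes
```

A flat spectrum gives the largest entropy, and the distance is zero only when the normalized envelopes do not change between the two samples.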
• An example of sample-based spectral entropy
and its ROR
Sample-based Spectral entropy
ROR of Spectral entropy
• An Example of sample-based spectral KL distance
Sample-based spectral KL distance
It can be used to find the signal change points
more accurately and precisely.
• An MLP was used as the phone boundary detector
• The block diagram of the proposed training/test procedure
[Block diagram, training and test paths: Speech Signal → Feature Extraction (Sample-based Acoustic Features) → Candidate Pre-selection → MLP-based Boundary Detector → Refined boundary. In training, Candidate Target Labeling uses the Initial boundary.]
• Candidate pre-selection –
find all the speech samples, with index n, which
satisfy
$d_{KL}[n-1] < d_{KL}[n],\quad d_{KL}[n] > d_{KL}[n+1],\quad d_{KL}[n] > Th_d$
• Pre-selection can be used to reduce the complexity
and FA of the sample-based system.
• After candidate pre-selection, an MLP was used as the
boundary detector
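A minimal sketch of the pre-selection rule, assuming it keeps the local maxima of the spectral KL distance contour that also exceed the threshold Th_d. The function name and the toy contour are illustrative.

```python
def preselect_candidates(d_kl, threshold):
    """Return indices n where d_kl has a local maximum above `threshold`."""
    return [n for n in range(1, len(d_kl) - 1)
            if d_kl[n - 1] < d_kl[n] > d_kl[n + 1] and d_kl[n] > threshold]

# Toy KL-distance contour with three local peaks.
d_kl = [0.0, 0.2, 0.1, 0.05, 0.9, 0.3, 0.31, 0.3]

loose = preselect_candidates(d_kl, threshold=0.15)   # [1, 4, 6]
strict = preselect_candidates(d_kl, threshold=0.5)   # [4]
```

Raising the threshold removes weak peaks, which lowers the number of candidates passed to the MLP at the cost of more missed boundaries.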
• The AP features used for MLP detector
$D_{KL}[c_k]$: KL distance for 2 normal pdfs
[Figure: timeline with candidates c_{k-1}, c_k, c_{k+1}, segments k-1 and k with their stable parts, and the spectral KL distance measure; candidate k lies at c_k]
A 27-dim acoustic parameter vector for the kth candidate, at time $c_k$, contains
$\{E_i[c_k];\ i = 0,\dots,6\},\ d_{KL}[c_k],\ H_s[c_k],\ \Delta H_s[c_k],\ D_{KL}[c_k],$
$\{ES_l[c_{k-1}, c_k],\ ES_l[c_k, c_{k+1}];\ l = 0,\dots,6\},\ c_k - c_{k-1},\ c_{k+1} - c_k$
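The slide describes D_KL[c_k] only as a KL distance for two normal pdfs. As a hedged illustration, the sketch below uses the standard closed form of the KL divergence between two univariate normal densities, fitted to the samples on each side of a candidate. Which features are modelled, their dimension, and whether a symmetric form is used are not stated in the slides, so those choices are assumptions here.

```python
import math

def mean_var(samples):
    """Maximum-likelihood mean and variance of a list of values."""
    m = sum(samples) / len(samples)
    v = sum((x - m) ** 2 for x in samples) / len(samples)
    return m, v

def kl_normal(mu0, var0, mu1, var1):
    """KL( N(mu0, var0) || N(mu1, var1) ) for univariate normal pdfs."""
    return 0.5 * (math.log(var1 / var0)
                  + (var0 + (mu0 - mu1) ** 2) / var1
                  - 1.0)

# Toy feature values in the stable parts left and right of a candidate.
left = [1.0, 1.2, 0.8, 1.1, 0.9]
right = [3.0, 3.3, 2.7, 3.1, 2.9]

mu_l, var_l = mean_var(left)
mu_r, var_r = mean_var(right)
d_boundary = kl_normal(mu_l, var_l, mu_r, var_r)   # large: segments differ
d_none = kl_normal(mu_l, var_l, mu_l, var_l)       # zero: identical pdfs
```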
• Iterative training procedure
[Block diagram: Candidates from pre-selection → Acoustic parameters of candidates → MLP-based boundary detectors → Detector output → Refined boundary / Target of MLP, fed back for the next training iteration. Manual Labeling provides the Initial boundary.]
• 2nd stage :
– use similarity measure of segmental acoustic signals
[Figure: timeline with candidates C'k-1, C'k, C'k+1, segments k-1 and k with their stable parts, and the spectral KL distance measure; it poses the question of a similarity measure between speech segments]
– Using GMM to model the pdf of a speech segment
– The KL1 distance of CCGMM (Wang, 2004)
Using a common GMM to represent the pdfs of two segments
• Similarity measure of two speech segments:
– Discrete KL-1 distance of CCGMM coefficient
$p_k(o[n]) = \sum_{l=1}^{L} c_{lk}\, N(o[n];\ \mu_{kl}, \Sigma)$
$D_1(O_1 \mid O_2) = \int p_1(o) \ln \frac{p_1(o)}{p_2(o)}\, do = \int p_1(o)\, P_{1\to2}(o)\, do$
$D_{1\to2}(O_1 \mid O_2) \approx E_1\left[ P_{1\to2}(o) \right] = \sum_i c_{1i} \ln \frac{c_{1i}}{c_{2i}} = \mu_{1\to2}$
– Discrete KL-2 distance using CCGMM coefficient
$D_2(O_1, O_2) = \frac{1}{2} \left\{ E_1\left[ P_{1\to2}(o) \right] + E_2\left[ P_{2\to1}(o) \right] \right\} = \frac{1}{2} \sum_i \left( c_{1i} - c_{2i} \right) \ln \frac{c_{1i}}{c_{2i}}$
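The discrete distances on this slide depend only on the mixture weights of the two segments under the common GMM. The sketch below assumes the weights have already been estimated and are strictly positive. It computes the KL-1 distance as Σ c1i ln(c1i/c2i) and the KL-2 distance as the average of the two directions. The function names and toy weights are illustrative.

```python
import math

def kl1(c1, c2):
    """Discrete KL-1 distance between two sets of CCGMM weights."""
    return sum(a * math.log(a / b) for a, b in zip(c1, c2))

def kl2(c1, c2):
    """Discrete KL-2 distance: the average of both KL-1 directions."""
    return 0.5 * (kl1(c1, c2) + kl1(c2, c1))

# Toy mixture weights of two segments over three common components.
seg_a = [0.7, 0.2, 0.1]
seg_b = [0.1, 0.2, 0.7]

d_ab = kl1(seg_a, seg_b)
d_sym = kl2(seg_a, seg_b)

# Equivalent closed form of KL-2: 0.5 * sum((c1i - c2i) * ln(c1i / c2i)).
d_sym_closed = 0.5 * sum((a - b) * math.log(a / b)
                         for a, b in zip(seg_a, seg_b))
```

KL-1 is directional, while KL-2 gives the same value when the two segments are swapped.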
– Discrete KL-1 distance is the mean of the log-likelihood ratio of
two pdfs
– It measures the similarity of two pdfs
– Find higher-order statistics of the
log-likelihood ratio (Wang, 2008)
– Variance, skewness of the
log-likelihood ratio
$\sigma_{1\to2} = \left[ \sum_{i=1}^{N} c_{1i} \left( \ln \frac{c_{1i}}{c_{2i}} - \mu_{1\to2} \right)^2 \right]^{1/2};$
$S_{1\to2} = \left[ \sum_{i=1}^{N} c_{1i} \left( \ln \frac{c_{1i}}{c_{2i}} - \mu_{1\to2} \right)^3 \right]^{1/3} \sigma_{1\to2}^{-1};$
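A hedged sketch of the higher-order statistics named on this slide. It assumes the statistics are taken over the per-component log ratio ln(c1i/c2i) weighted by c1i: the mean (the KL-1 distance), the standard deviation, and a skewness term taken as the signed cube root of the third central moment divided by the standard deviation. The exact normalization used by the author is an assumption here.

```python
import math

def llr_statistics(c1, c2):
    """Mean, standard deviation and skewness term of ln(c1i / c2i) under c1."""
    llr = [math.log(a / b) for a, b in zip(c1, c2)]
    mu = sum(a * r for a, r in zip(c1, llr))
    var = sum(a * (r - mu) ** 2 for a, r in zip(c1, llr))
    third = sum(a * (r - mu) ** 3 for a, r in zip(c1, llr))
    sigma = math.sqrt(var)
    # Signed cube root, so a negative third moment stays negative.
    cube_root = math.copysign(abs(third) ** (1.0 / 3.0), third)
    skew = cube_root / sigma if sigma > 0.0 else 0.0
    return mu, sigma, skew

# Toy mixture weights of two segments over three common components.
seg_a = [0.7, 0.2, 0.1]
seg_b = [0.1, 0.2, 0.7]

mu_ab, sigma_ab, skew_ab = llr_statistics(seg_a, seg_b)
mu_aa, sigma_aa, skew_aa = llr_statistics(seg_a, seg_a)
```

Running the function in both directions yields six values per candidate, three for each ordering of the two segments.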
– use segmental similarity
[Figure: timeline with candidates C'k-1, C'k, C'k+1, segments k-1 and k with their stable parts, and the spectral KL distance measure; the similarity is measured between speech segments]
A 30-dim acoustic parameter vector for the kth candidate, at time $c_k$, contains
$\{E_i[c_k];\ i = 0,\dots,6\},\ \mu_{k-1\to k},\ \sigma_{k-1\to k},\ S_{k-1\to k},\ \mu_{k\to k-1},\ \sigma_{k\to k-1},\ S_{k\to k-1},$ output of 1st stage,
$\{ES_l[c_{k-1}, c_k],\ ES_l[c_k, c_{k+1}];\ l = 0,\dots,6\},\ c_k - c_{k-1},\ c_{k+1} - c_k$
where $\mu$, $\sigma$ and $S$ are computed between segment k-1 and segment k, in both directions
Experimental Results
• Database : TIMIT.
• After candidate pre-selection,
– 1 in 116 samples was selected as a candidate
– 0.9% missed detections (MD) were caused by candidate pre-selection
• Performance of MLP boundary detector:
TIMIT corpus | Samples | Phone boundaries
Training set | 226727341 | 172461
Test set | 82786737 | 62466
• Performance of the sample-based boundary detector
[Figure: FA rate versus MD rate (both axes from 0 to 0.25) for 1-stage (MLP), 1-stage (RNN), 2-stage system, Rabiner [2006], and HMM]
• An example of proposed phone boundary detector
• Accuracy of the sample-based boundary detector
[Figure: inclusion rate (0 to 1) versus absolute boundary error (< 5 ms to < 50 ms, in 5 ms steps) for HMM, 1-stage RNN (EER), and 2-stage system]
• Comparison with Dr. Rabiner’s work [2006]:
Absolute error | 5 ms | 10 ms | 15 ms | same frame | ±1 frame
Inclusion rate (1-stage) | 41.5% | 69.7% | 81.1% | 37.3% | 77.0%
Inclusion rate (2-stage) | 42.1% | 70.3% | 81.9% | 37.8% | 77.8%
Dr. Rabiner’s result | | | | (22.8% | 59.2%)
• Error analysis: mean absolute error (MAE) of the detected boundaries
Affricate Fricative
Affricate
-
Fricative 2.3/17.0
6.4/6.5*
Stop
Glide
Vowel
Nasal
Silence
10.1/6.9* 7.3/10.0* 6.8/13.7 4.9/15.3* 6.1/12.8
7.2/7.0 13.6/13.1* 9.5/14.9
7.9/13.3
7.1/12.5
6.5/11.7
Stop
-
6.1/7.3 12.4/12.0* 11.2/15.0 7.5/13.1
7.6/9.6
7.1/14.4
Glide
-
7.0/9.5
10.4/12.8 11.0/21.2 7.9/13.6
6.4/11.2
6.3/12.7
Vowel
-
6.3/9.8
7.9/11.8
6.8/11.5
6.9/13.6
Nasal
7.6/11.3*
6.2/8.2
11.1/13.2 11.6/15.3 7.2/13.3 5.6/11.2* 6.9/12.1
Silence
6.3/12.5
6.0/7.5
7.3/8.2
9.9/15.9
8.8/17.6
11.7/14.1 7.4/12.1
5.2/9.9
7.0/18.9
Overall : 7.6/12.4
• Each entry: sample-based / HMM system (unit: ms)
* fewer than 100 samples
• Accuracy of the proposed method
Systems | MAE/RMSE (ms) | MAE/RMSE (frame) | MAE/RMSE (normalized to phone duration)
HMM | 12.4/17.0 | 1.22/1.84 | 0.204/0.322
1-stage (RNN) | 7.6/11.5 | 0.96/1.82 | 0.127/0.197
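The accuracy figures reported in this part of the talk (MAE, RMSE and the inclusion rate within a tolerance) can be computed as in the sketch below. It assumes each reference boundary is scored against its nearest detected boundary and that inclusion uses a strict "less than" tolerance, as in the "< 5 ms" axis labels. The pairing rule and the toy boundary times are assumptions, not the author's scoring procedure.

```python
import math

def boundary_errors(detected_ms, reference_ms):
    """Absolute error (ms) from each reference boundary to its nearest detection."""
    return [min(abs(d - r) for d in detected_ms) for r in reference_ms]

def mae(errors):
    """Mean absolute error."""
    return sum(errors) / len(errors)

def rmse(errors):
    """Root mean square error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def inclusion_rate(errors, tolerance_ms):
    """Fraction of boundaries whose error is below the tolerance."""
    return sum(e < tolerance_ms for e in errors) / len(errors)

# Toy boundary times in milliseconds.
reference = [100, 250, 400]
detected = [103, 262, 396]

errors = boundary_errors(detected, reference)   # [3, 12, 4]
overall_mae = mae(errors)
overall_rmse = rmse(errors)
within_5 = inclusion_rate(errors, 5)
within_15 = inclusion_rate(errors, 15)
```

Dividing each error by the frame shift or by the phone duration before averaging would give the frame-based and duration-normalized variants listed in the table.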
• Error analysis (1 stage) – MDR and FAR
Pronunciation manners (rows) by next phone (columns, Deletion); last column: Insertion
Manner | System | Affricate | Fricative | Stop | Glide | Vowel | Nasal | Silence | Insertion
Affricate | HMM | - | 0.0% | 25.0% | 11.8% | 4.4% | 0.0% | 2.9% | 9.2%
Affricate | RNN | - | 0.0% | 0.0% | 17.6% | 5.7% | 7.7% | 8.0% | 11.1%
Fricative | HMM | 0.0% | 2.3% | 13.3% | 10.6% | 4.8% | 5.0% | 6.2% | 7.5%
Fricative | RNN | 0.0% | 3.3% | 16.3% | 20.5% | 8.2% | 7.0% | 7.4% | 10.5%
Stop | HMM | - | 1.6% | 12.6% | 14.1% | 5.7% | 4.1% | 2.0% | 3.8%
Stop | RNN | - | 2.3% | 14.9% | 22.7% | 8.2% | 10.3% | 2.6% | 9.0%
Glide | HMM | - | 2.8% | 16.1% | 28.2% | 5.6% | 4.5% | 5.2% | 3.8%
Glide | RNN | - | 5.8% | 6.3% | 6.5% | 6.6% | 8.4% | 7.3% | 9.4%
Vowel | HMM | - | 2.9% | 6.7% | 6.6% | 6.5% | 10.3% | 4.4% | 7.8%
Vowel | RNN | - | 6.2% | 9.2% | 6.9% | 6.4% | 10.0% | 7.3% | 10.1%
Nasal | HMM | 7.1% | 3.5% | 17.7% | 7.8% | 5.4% | 2.5% | 18.4% | 5.7%
Nasal | RNN | 7.1% | 9.8% | 16.1% | 18.8% | 8.3% | 2.5% | 8.7% | 9.4%
Silence | HMM | 2.1% | 0.9% | 6.2% | 6.6% | 4.3% | 4.8% | 3.0% | 5.7%
Silence | RNN | 5.0% | 3.9% | 10.2% | 10.0% | 7.6% | 5.0% | 3.0% | 6.5%
overall: HMM : 6.4% (EER); Sample-based : 8.7% (EER)
Conclusions & Future works
• Several sample-based acoustic parameters, which can
properly model the speech signal change, were proposed
• Using the sample-based APs in the phone boundary detector,
better precision and accuracy were achieved
• Future work: segment-based speech attribution detectors
Segment-based Attribution detector
• Segment based Attribution Recognizer
Operation point :
3% MDR, 20% FAR
Coding each contour using Legendre polynomials
– Set the operation point to low MD, high FA rate.
80123 segments / 62465 phones.
– Feature extraction
using the Legendre coefficients of the AP contours
[Figure: timeline with segment k-1, candidate k, segment k and segment k+1; the AP contours over the stable part of each segment are coded by Legendre coefficients (dim 3*7)]
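Coding a contour with Legendre coefficients can be sketched as a projection of the contour onto the first few Legendre polynomials over the interval [-1, 1]. The sketch below uses three coefficients per contour, which is one possible reading of "(dim 3*7)" as three coefficients for each of seven AP contours. That reading, the midpoint sampling and the function names are assumptions.

```python
def legendre_basis(x):
    """First three Legendre polynomials P0, P1, P2 evaluated at x."""
    return [1.0, x, 0.5 * (3.0 * x * x - 1.0)]

def legendre_coefficients(contour):
    """Project a contour onto P0..P2, mapping its samples onto [-1, 1]."""
    n = len(contour)
    coeffs = [0.0, 0.0, 0.0]
    for i, y in enumerate(contour):
        x = -1.0 + (2.0 * i + 1.0) / n          # midpoint of the i-th cell
        for j, p in enumerate(legendre_basis(x)):
            # c_j = (2j + 1) / 2 * integral(f * P_j), midpoint rule
            coeffs[j] += (2 * j + 1) * y * p / n
    return coeffs

# Toy contour: a level of 2.0 with a linear slope of 0.5 over the segment.
n_samples = 100
xs = [-1.0 + (2.0 * i + 1.0) / n_samples for i in range(n_samples)]
contour = [2.0 + 0.5 * x for x in xs]

level, slope, curvature = legendre_coefficients(contour)
```

The three coefficients describe the mean level, slope and curvature of the contour, so segments of different lengths map to feature vectors of the same size.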
– Preliminary results
Pronunciation manner | Segment-based Recog. Rate (%) | Frame-based Recog. Rate (%)
Fricative | 75.6 | 85.2
Stop | 76.7 | 72.5
Glide | 64.3 | 56.5
Vowel | 90.3 | 89.0
Nasal | 73.6 | 77.5
Silence | 89.1 | 92.2
Overall | 81.9 | 82.1
The frame-based system uses features from 9 frames.
– Converted to accuracy over time: 81.2%
Only 6 band-pass envelopes were used
→ phone alignment