Dual-domain Hierarchical Classification of Phonetic Time Series
Hossein Hamooni, Abdullah Mueen
University of New Mexico
Department of Computer Science
What is a Phoneme?
Phonemes are very small units of intelligible sound (usually less than 200 ms).
Phonetic spelling is the sequence of phonemes that a word comprises.
Example:
• Coat ([kōt] /K OW T/)
• From ([frəm] /F R AH M/)
• Impressive ([imˈpresiv] /IH M P R EH S IH V/)
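As a small illustration of phonetic spelling as a sequence of phonemes, the three examples above can be stored in a lookup table. This is only a sketch: the table name `PHONETIC_SPELLING` and the helper `spell` are hypothetical, and the table holds nothing beyond the slash notation shown above.

```python
# Hypothetical lookup table built only from the three examples above.
PHONETIC_SPELLING = {
    "coat": ["K", "OW", "T"],
    "from": ["F", "R", "AH", "M"],
    "impressive": ["IH", "M", "P", "R", "EH", "S", "IH", "V"],
}


def spell(word):
    """Return the phoneme sequence for a word, or None if it is not in the table."""
    return PHONETIC_SPELLING.get(word.lower())


if __name__ == "__main__":
    for w in ("Coat", "From", "impressive"):
        print(w, "->", " ".join(spell(w)))
```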
Phoneme Classification
What is phoneme classification?
• Input: A short segment of audio signal.
• Output: The phoneme that the segment contains.
Phoneme classification is a complex task:
• More than 100 classes (based on the International Phonetic Alphabet)
• Variation in speakers, dialects, accents, noise in the environment, etc.
Phoneme classification can be used in:
• Robust speech recognition
• Accent/dialect detection
• Speech quality scoring
Related Work
Different methods for phoneme classification have been used in the literature:
• Hidden Markov model [Lee, 1989]
• Neural network [Schwarz, 2009]
• Deep belief network [Mohamed, 2012]
• Support vector machine [Salomon, 2001]
• Hierarchical methods [Dekel, 2005] [C. Lopes, F. Perdigao, 2011]
• Boltzmann machine [Mohamed, 2010]
Although the data mining community has shown that k-NN classifiers can work well on time series data, they have not yet been tried on phonemes.
Our Dual-domain Approach
Time Domain:
• Using k-NN
• Dynamic Time Warping (DTW)
• Expensive
• Speed up by lower bounding techniques
Frequency Domain:
• Using k-NN
• Euclidean distance between Mel-frequency cepstral coefficients (MFCC)
• Fast
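The frequency-domain side can be sketched as a plain k-NN classifier over feature vectors. This is a minimal sketch, not the authors' implementation: it assumes the MFCC vectors are already computed and have equal length, and the names `euclidean` and `knn_classify` are hypothetical. The `dist` parameter is pluggable, so a time-domain distance such as DTW could be passed instead.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(query, train, k=3, dist=euclidean):
    """Return the majority label among the k training items nearest to query.

    train is a list of (features, label) pairs. Assumption: features stand in
    for precomputed MFCC vectors; no audio processing happens here.
    """
    neighbors = sorted(train, key=lambda item: dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy 2-D vectors standing in for MFCC features (made-up values).
    train = [([0.0, 0.0], "AH"), ([0.1, 0.0], "AH"),
             ([1.0, 1.0], "T"), ([0.9, 1.1], "T")]
    print(knn_classify([0.05, 0.0], train, k=3))
```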
Real Example
Challenge
DTW is expensive (quadratic in both time and space).
We need a speed-up technique.
• Solution: Lower bounding techniques
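To make the quadratic cost concrete, here is a minimal sketch of DTW with a warping window `w`. It fills an (n+1) x (m+1) table, which is where the quadratic time and space come from. The function name `dtw`, the squared point-wise cost, and the final square root are assumptions for illustration; the slides do not specify these details.

```python
import math


def dtw(a, b, w=None):
    """DTW distance between 1-D sequences a and b.

    w is the warping window (maximum |i - j| allowed on the path).
    Assumption: squared point-wise cost with a square root at the end.
    """
    n, m = len(a), len(b)
    if w is None:
        w = max(n, m)              # no constraint
    w = max(w, abs(n - m))         # the window must cover the length difference
    inf = math.inf
    # Full cost table: this is the quadratic space usage.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],       # step in a only
                                 cost[i][j - 1],       # step in b only
                                 cost[i - 1][j - 1])   # step in both
    return math.sqrt(cost[n][m])


if __name__ == "__main__":
    print(dtw([1, 2, 3], [1, 2, 2, 3]))
```

A lower bound that is cheaper than this table fill lets a k-NN search skip the full computation whenever the bound already exceeds the best distance found so far.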
DTW Lower bounding
Resampling to equal length doesn't always work!
DTW Lower bounding
• We use the prefix of the longer signal (Prefixed LB_Keogh).
• We show that Prefixed LB_Keogh is a lower bound if:
w > difference between the lengths of the two signals
• We set w = c * length of the longer signal.
• We ignore all pairs of signals that don't satisfy the above condition.
[Figure: plots with axes Accuracy (%) and Speedup against Window Size (c%) and Training Set Size (x10^4); c = 30% is marked.]
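The Prefixed LB_Keogh idea can be sketched as follows. The slides do not give the exact formula, so this is one possible reading rather than the authors' code: the envelope is built around the shorter signal, and only the prefix of the longer signal (cut to the shorter length) is compared against it. The function name, the rounding of `w`, and the squared-cost convention are assumptions; the default `c = 0.3` follows the value marked on the slide.

```python
import math


def prefixed_lb_keogh(x, y, c=0.3):
    """Lower bound on window-constrained DTW(x, y), or None if the pair is skipped.

    Assumptions: DTW uses squared point-wise costs under a square root, and
    w = c * len(longer) is rounded down to an integer.
    """
    shorter, longer = (x, y) if len(x) <= len(y) else (y, x)
    n = len(shorter)
    w = int(c * len(longer))       # w = c * length of the longer signal
    if w <= len(longer) - n:       # condition: w > difference between lengths
        return None                # such pairs are ignored
    total = 0.0
    for i in range(n):             # only the prefix of the longer signal is used
        lo, hi = max(0, i - w), min(n, i + w + 1)
        upper = max(shorter[lo:hi])    # envelope of the shorter signal
        lower = min(shorter[lo:hi])
        v = longer[i]
        if v > upper:
            total += (v - upper) ** 2
        elif v < lower:
            total += (lower - v) ** 2
    return math.sqrt(total)


if __name__ == "__main__":
    print(prefixed_lb_keogh([0, 0, 0], [0, 5, 0, 0], c=0.5))
```

Under this reading, every point in the prefix of the longer signal must be matched to some point of the shorter signal within the window, so its distance to the envelope cannot exceed its true matching cost.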
Data Collection
• Data is publicly available.
[Figure: number of samples per phoneme (axis from 0 to 45,000). Phonemes in plotted order: AH, N, T, L, S, R, IH, K, IY, D, M, ER, EH, P, AE, B, AA, EY, F, AY, OW, SH, V, G, AO, Z, UW, NG, W, JH, HH, Y, CH, TH, AW, UH, OY, DH, ZH.]
• 370,000 phonemes are segmented from:
Phoneme Segmentation
The Penn Phonetics Lab Forced Aligner (p2fa)
• Takes a signal and a transcript
• Produces timing segmentations (word level and phoneme level)
Accuracy (All layers)
• 10-fold cross validation
• 100 random phonemes in each fold
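A generic k-fold split can be sketched as below. The slides do not say how the 100 phonemes per fold are drawn, so this only shows the standard procedure: shuffle the items, cut them into k folds, and test on each fold in turn. The function name `k_fold_indices` and the `seed` parameter are hypothetical.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)          # random assignment to folds
    folds = [idx[i::k] for i in range(k)]     # k folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


if __name__ == "__main__":
    # With 1,000 items and k = 10, every test fold holds 100 items.
    for train, test in k_fold_indices(1000, k=10):
        print(len(train), len(test))
```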
Accented Phoneme Classification
• British vs. American accent
• Using Oxford test set
• 2-class classification problem
• No hierarchy
[Figure: accuracy (0.65 to 0.95) versus training set size (0 to 3.5 x 10^4) for MFCC and DTW.]
Conclusion
• We present a dual-domain hierarchical method for phoneme classification.
• We generate a novel dataset of 370,000 phonemes.
• We achieve up to 73% accuracy for 39 classes.
• Our lower bounding technique gives us up to a 3x speedup.
Thank You
Data and code available at:
http://cs.unm.edu/~hamooni/papers/Dual_2014