Text Independent Speaker Identification Using Gaussian Mixture Model

Chee-Ming Ting, Sh-Hussain Salleh, Tian-Swee Tan, A. K. Ariff
International Conference on Intelligent and Advanced Systems 2007
Presented by Jain-De Lee

INTRODUCTION

GMM SPEAKER IDENTIFICATION SYSTEM

EXPERIMENTAL EVALUATION

CONCLUSION

Speaker recognition is generally divided into two tasks
◦ Speaker Verification (SV)
◦ Speaker Identification (SI)

Speaker models
◦ Text-Dependent (TD)
◦ Text-Independent (TI)

Many approaches have been proposed for TI speaker recognition
◦ VQ-based methods
◦ Hidden Markov Models (HMM)
◦ Gaussian Mixture Models (GMM)

VQ-based method
◦ Models each speaker with a codebook of representative feature vectors

Hidden Markov Models
◦ State Probability
◦ Transition Probability

In the TI task, each speaker is characterized by classifying acoustic
events corresponding to HMM states

TI performance is unaffected by discarding transition
probabilities in HMM models

Gaussian Mixture Model
◦ Corresponds to a single-state continuous ergodic HMM
◦ Equivalent to discarding the transition probabilities in the HMM

The use of GMM for speaker identity modeling is motivated by
◦ The Gaussian components represent general speaker-dependent spectral shapes
◦ The capability of Gaussian mixtures to model arbitrary densities

The GMM speaker identification system consists of the
following elements
◦ Speech processing
◦ Gaussian mixture model
◦ Parameter estimation
◦ Identification

Mel-scale frequency cepstral coefficient (MFCC) extraction is used in
front-end processing
[Block diagram, Mel-scale cepstral feature analysis: input speech signal → pre-emphasis → framing → Hamming window → FFT → triangular Mel-scale band-pass filters → logarithm → DCT → MFCC features]
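To make the front-end concrete, below is a minimal Python sketch of this MFCC pipeline using NumPy and SciPy. The sampling rate, frame size, filter count, and number of kept coefficients are illustrative assumptions, not values taken from the paper.

import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular band-pass filters spaced evenly on the Mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_ceps=12):
    # Pre-emphasis boosts the high frequencies of the speech spectrum.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Split into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # Magnitude spectrum -> Mel filterbank energies -> log -> DCT.
    spec = np.abs(np.fft.rfft(frames, n_fft))
    energies = np.maximum(spec @ mel_filterbank(n_filters, n_fft, sr).T, 1e-10)
    return dct(np.log(energies), type=2, axis=1, norm='ortho')[:, :n_ceps]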

The Gaussian mixture density is a weighted linear combination of M
unimodal Gaussian component densities

$$p(\vec{x}\,|\,\lambda) = \sum_{i=1}^{M} w_i\, b_i(\vec{x})$$

where $\vec{x}$ is a D-dimensional vector,
$b_i(\vec{x}),\ i = 1,\dots,M$ are the component densities, and
$w_i,\ i = 1,\dots,M$ are the mixture weights

The mixture weights satisfy the constraint $\sum_{i=1}^{M} w_i = 1$

Each component density is a D-variate Gaussian
function of the form

$$b_i(\vec{x}) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_i|^{1/2}} \exp\left\{-\frac{1}{2}(\vec{x} - \vec{\mu}_i)^T \Sigma_i^{-1} (\vec{x} - \vec{\mu}_i)\right\}$$

where $\vec{\mu}_i$ is the mean vector and $\Sigma_i$ is the covariance matrix

The Gaussian mixture density model is denoted as

$$\lambda = \{w_i, \vec{\mu}_i, \Sigma_i\},\quad i = 1,\dots,M$$
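As an illustration, here is a minimal NumPy sketch of evaluating $p(\vec{x}\,|\,\lambda)$, assuming diagonal covariance matrices (consistent with the scalar variance update shown later); all names are illustrative.

import numpy as np

def gmm_log_density(x, weights, means, variances):
    """Log of p(x|lambda) = sum_i w_i b_i(x) for one D-dim vector x.

    weights:   (M,)   mixture weights, summing to 1
    means:     (M, D) component mean vectors mu_i
    variances: (M, D) diagonals of the covariance matrices Sigma_i
    """
    D = x.shape[0]
    # log b_i(x) for a D-variate Gaussian with diagonal Sigma_i
    log_b = (-0.5 * D * np.log(2 * np.pi)
             - 0.5 * np.sum(np.log(variances), axis=1)
             - 0.5 * np.sum((x - means) ** 2 / variances, axis=1))
    # log-sum-exp for a numerically stable log of the weighted sum
    a = np.log(weights) + log_b
    m = a.max()
    return m + np.log(np.exp(a - m).sum())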

Conventional GMM training process
[Flowchart 1: input training vectors → LBG algorithm → EM algorithm → if not converged, repeat EM → end]
[Flowchart 2, LBG algorithm: input training vectors → overall average as initial centroid → while m < M: split each centroid, then iterate {clustering → cluster averages → calculate distortion D} until (D − D′)/D < δ → end]
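A rough Python sketch of the LBG binary-split initialization in the flowchart above; the 1% split perturbation and the squared-error distortion measure are common choices assumed here, not stated on the slides.

import numpy as np

def lbg(X, M, delta=0.01, eps=0.01):
    """LBG: grow a codebook from 1 to M centroids by binary splitting."""
    centroids = X.mean(axis=0, keepdims=True)   # overall average
    while centroids.shape[0] < M:
        # Split: perturb each centroid into a +/- pair
        centroids = np.vstack([centroids * (1 + eps),
                               centroids * (1 - eps)])
        D_prev = None
        while True:
            # Clustering: assign each vector to its nearest centroid
            d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            D_cur = d[np.arange(len(X)), labels].mean()   # distortion D
            # Cluster averages become the new centroids
            for k in range(len(centroids)):
                if np.any(labels == k):
                    centroids[k] = X[labels == k].mean(axis=0)
            # Stopping test (D - D')/D < delta from the flowchart
            if D_prev is not None and (D_prev - D_cur) / D_prev < delta:
                break
            D_prev = D_cur
    return centroids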

Speaker model training estimates the GMM parameters via
maximum likelihood (ML) estimation

$$p(X\,|\,\lambda) = \prod_{t=1}^{T} p(\vec{x}_t\,|\,\lambda)$$

The expectation-maximization (EM) algorithm re-estimates the parameters as

Mixture weights
$$w_i = \frac{1}{T}\sum_{t=1}^{T} p(i\,|\,\vec{x}_t, \lambda)$$

Means
$$\vec{\mu}_i = \frac{\sum_{t=1}^{T} p(i\,|\,\vec{x}_t, \lambda)\,\vec{x}_t}{\sum_{t=1}^{T} p(i\,|\,\vec{x}_t, \lambda)}$$

Variances
$$\sigma_i^2 = \frac{\sum_{t=1}^{T} p(i\,|\,\vec{x}_t, \lambda)\,x_t^2}{\sum_{t=1}^{T} p(i\,|\,\vec{x}_t, \lambda)} - \mu_i^2$$
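The three updates map directly to code. Below is a hedged NumPy sketch of one EM pass for a diagonal-covariance GMM; the small epsilon and variance floor are added numerical safeguards, not part of the slides.

import numpy as np

def em_step(X, weights, means, variances):
    T, D = X.shape
    # E-step: responsibilities p(i|x_t, lambda) for every frame/component
    log_b = (-0.5 * D * np.log(2 * np.pi)
             - 0.5 * np.sum(np.log(variances), axis=1)[None, :]
             - 0.5 * ((X[:, None, :] - means[None, :, :]) ** 2
                      / variances[None, :, :]).sum(axis=2))
    log_p = np.log(weights)[None, :] + log_b
    log_p -= log_p.max(axis=1, keepdims=True)
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)          # (T, M)

    # M-step: the three re-estimation formulas above
    n_i = post.sum(axis=0) + 1e-10                   # guard vs. empty components
    new_weights = n_i / T                            # w_i
    new_means = (post.T @ X) / n_i[:, None]          # mu_i
    new_vars = (post.T @ (X ** 2)) / n_i[:, None] - new_means ** 2  # sigma_i^2
    return new_weights, new_means, np.maximum(new_vars, 1e-6)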

This paper proposes a training algorithm consisting of two steps

Step 1: Cluster each training vector to the mixture component
with the highest likelihood

$$C_i = \arg\max_{1 \le i \le M} b_i(\vec{x})$$

Step 2: Re-estimate the parameters of each component
◦ $w_i$ = number of vectors classified in cluster i / total number of training vectors
◦ $\vec{\mu}_i$ = sample mean of vectors classified in cluster i
◦ $\Sigma_i$ = sample covariance matrix of vectors classified in cluster i
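A minimal sketch of the proposed two-step pass, under the same diagonal-covariance assumption as the earlier sketches; the handling of empty clusters is an assumption the slides do not specify.

import numpy as np

def highest_likelihood_step(X, weights, means, variances):
    T, D = X.shape
    # Step 1: classify each vector by arg max_i b_i(x_t);
    # the constant -(D/2) log(2*pi) is omitted, it does not affect the argmax
    log_b = (-0.5 * np.sum(np.log(variances), axis=1)[None, :]
             - 0.5 * ((X[:, None, :] - means[None, :, :]) ** 2
                      / variances[None, :, :]).sum(axis=2))
    labels = log_b.argmax(axis=1)
    # Step 2: re-estimate each component from its own cluster
    for i in range(len(weights)):
        cluster = X[labels == i]
        if len(cluster) == 0:
            continue  # leave empty components unchanged (assumption)
        weights[i] = len(cluster) / T              # fraction of vectors
        means[i] = cluster.mean(axis=0)            # sample mean
        variances[i] = cluster.var(axis=0) + 1e-6  # sample (diagonal) covariance
    return weights, means, variances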

The test feature vectors are classified to the speaker $\hat{S}$ whose model
likelihood is the highest

$$\hat{S} = \arg\max_{1 \le k \le S} p(X\,|\,\lambda_k)$$

The above can be formulated in logarithmic terms

$$\hat{S} = \arg\max_{1 \le k \le S} \sum_{t=1}^{T} \log p(\vec{x}_t\,|\,\lambda_k)$$

Database and Experiment Conditions
◦ 7 male and 3 female speakers
◦ 40 sentence utterances per speaker, each with different text
◦ The average sentence duration is approximately 3.5 s

Performance Comparison between EM and Highest
Mixture Likelihood Clustering Training
◦ Number of Gaussian components: 16
◦ 16-dimensional MFCCs
◦ 20 utterances are used for training

Convergence condition: $|\,p(X\,|\,\lambda^{(k+1)}) - p(X\,|\,\lambda^{(k)})\,| < 0.03$

The comparison between EM and highest likelihood
clustering training on identification rate
◦ 10 sentences were used for training
◦ 25 sentences were used for testing
◦ 4 Gaussian components
◦ 8 iterations

Effect of Different Number of Gaussian Mixture
Components and Amount of Training Data
◦ MFCC feature dimension is fixed to 12
◦ 25 sentences are used for testing

Effect of Feature Set on Performance for Different
Number of Gaussian Mixture Components
◦ Combinations with first- and second-order difference coefficients
were tested
◦ 10 sentences are used for training
◦ 30 sentences are used for testing

The proposed highest likelihood clustering training performs comparably
to conventional EM training but with less computation time

First-order difference coefficients are sufficient to
capture the transitional information with reasonable
dimensional complexity

A 16-component GMM with 12-dimensional features and 5
training sentences achieved a 98.4% identification rate