International Journal of Engineering Trends and Technology (IJETT) – Volume 31 Number 4- January 2016
A Comparison of Feature Extraction Methods
for Language Identification using GMM
Dr. A. Nagesh
Professor & HOD
Mahatma Gandhi Institute of Technology
ISSN: 2231-5381, http://www.ijettjournal.org

Abstract: This paper presents the feature extraction techniques used for the language identification (LID) task and evaluates their performance. Language identification is the task of identifying the language of a short speech utterance. Feature extraction is an important stage of the LID task, because LID performance depends mainly on the features used. The features compared here are Linear Predictive Coding (LPC), Mel-Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Predictive (PLP) coefficients.

Keywords: Language Identification (LID), Linear Predictive Coding (LPC), Mel-frequency cepstral coefficients (MFCC), Perceptual Linear Predictive (PLP).

1. Introduction
Over the past three decades there has been tremendous development in the area of speech processing. Applications of speech processing include speech and speaker recognition, language identification, etc. A language identification system automatically identifies the specific language of a spoken utterance. The primary objective of an automatic language identification (LID) system is to determine the language identity from the uttered speech. Because automatic LID systems have several real-life applications, such as speech-to-speech translation systems, information retrieval from multilingual audio databases and multilingual speech recognition systems, LID has become an active research problem in India and abroad. Our main focus here is on features that capture LID cues for a language identification system.

Automatic language identification is an application of pattern recognition, and an automatic language identification system is like any other pattern recognition system. The language identification task involves three phases, namely the feature extraction phase, the training phase and the testing phase. Feature extraction is the process of obtaining features from a speech signal. Training is the process of familiarizing the system with the characteristics of a language, whereas testing is the actual identification task.

2. Feature Extraction Method
The main purpose of the feature extraction process is to extract the most relevant information from the speech waveform and to discard as much of the redundant information as possible. In language recognition, feature extraction is given a lot of importance because recognition performance depends heavily on this step. The main goal of the feature extraction step is to compute a set of feature vectors that provide a compact representation of the given input signal. Through decades of research, many different feature representations of the speech signal have been suggested and tried. This paper reviews the LPC, MFCC and PLP feature extraction methods and evaluates their language identification performance with GMM modeling.

2.1 Linear Predictive Coding
Linear Predictive Coding (LPC) is one of the most powerful speech analysis techniques. The glottis (the space between the vocal cords) produces the sound, which is characterized by its intensity (loudness) and frequency (pitch). The vocal tract (the throat, the mouth and the nasal cavity) forms a tube, which is characterized by its resonance frequencies, called formants. The basic problem of the LPC system is to determine the formants from the speech signal. The solution to this problem is a difference equation, which expresses each sample of the signal as a linear combination of previous samples. Such an equation is called a linear predictor, hence the name Linear Predictive Coding. The coefficients of the difference equation (the predictor coefficients) characterize the formants. Therefore, the LPC system needs to estimate these coefficients. The estimation is made by minimizing the mean square error between the predicted signal and the actual signal.

Linear prediction models the output $s(n)$ as a linear function of past outputs and of present and past inputs. Assuming an all-pole model for the vocal tract, the signal $s(n)$ can be expressed as a linear combination of past values and some input $u(n)$, as shown below:

$$s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + G\,u(n)$$

where
$G$ is a gain factor. Assuming now that the input $u(n)$ is unknown, the signal $s(n)$ can be predicted only approximately from a linearly weighted sum of past samples. Let this approximation of $s(n)$ be $\hat{s}(n)$, where

$$\hat{s}(n) = \sum_{k=1}^{p} a_k\, s(n-k)$$

Then the error between the actual value $s(n)$ and the predicted value $\hat{s}(n)$ is given by

$$e(n) = s(n) - \hat{s}(n) = G\,u(n)$$

The error transfer function is

$$A(z) = \frac{E(z)}{S(z)} = 1 - \sum_{k=1}^{p} a_k\, z^{-k}$$

The basic problem of linear prediction analysis is to determine the set of predictor coefficients from the speech signal so that the spectral properties of the digital filter match those of the speech waveform within the analysis window. A cepstrally weighted feature vector is obtained for each frame by block processing of the continuous speech signal. To spectrally flatten the signal, the speech signal is subjected to pre-emphasis through a first-order digital filter whose transfer function is given by

$$H(z) = 1 - \alpha\, z^{-1}$$

where $\alpha$ is the pre-emphasis coefficient.

The linear prediction method provides a robust, reliable and accurate way of estimating the parameters of this model. The computation involved in LPC processing is considerably less than that of cepstrum analysis.
Fig.1: Block diagram illustrating the steps involved in the computation of LPC
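The coefficient estimation described in Section 2.1 can be sketched in a few lines. Minimizing the mean square error of $e(n) = s(n) - \hat{s}(n)$ over one frame leads to the normal equations $\sum_{k=1}^{p} a_k\, r(|i-k|) = r(i)$, where $r$ is the short-time autocorrelation of the frame. The sketch below solves them with the Levinson-Durbin recursion. This is one common choice and is not necessarily the procedure used for the experiments in this paper; the function names and the toy frame are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Return [r(0), ..., r(max_lag)] for one analysis frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc(frame, order):
    """Estimate predictor coefficients with the Levinson-Durbin recursion.

    Returns (coeffs, error) where coeffs[k - 1] is a_k in
    s_hat(n) = sum_{k=1..p} a_k * s(n - k) and error is the residual
    energy left after prediction. The frame must not be all zeros.
    """
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)      # a[0] is unused; a[k] holds a_k
    error = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for model order i.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:], error


def predict(frame, coeffs, n):
    """Predicted sample s_hat(n) from the p samples before position n."""
    return sum(a_k * frame[n - k] for k, a_k in enumerate(coeffs, start=1))


# Toy frame (an assumption for illustration) that obeys s(n) = 0.5 * s(n - 1),
# so a_1 should come out close to 0.5 and a_2 close to 0.
frame = [0.5 ** n for n in range(40)]
coeffs, error = lpc(frame, order=2)
residual = frame[10] - predict(frame, coeffs, 10)   # e(n) = s(n) - s_hat(n)
print(coeffs, error, residual)
```

In practice the frame would be a pre-emphasized, windowed segment of speech, and the order $p$ is typically chosen somewhat larger than twice the number of formants expected in the signal bandwidth.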
2.2 Mel Frequency Cepstral Coefficients
One of the important LID cues is the phonemic difference between languages. The ability of Mel-frequency cepstral coefficients (MFCCs) to capture the phonetically important characteristics of speech has motivated the use of these feature vectors for the LID task, and MFCCs are the most widely used feature vectors in language identification systems. Stevens and Volkmann developed the Mel scale based on human auditory perception. Davis and Mermelstein used the Mel scale to extract features from the speech signal for improved recognition performance. This section describes how the input speech signal is transformed into a sequence of MFCC feature vectors. The steps followed in the computation of the MFCC are as follows:
1. The speech signal is sampled at 16 kHz.
2. A 25 ms segment of the speech signal is taken as one frame (frame size), with a frame shift of 10 ms.
3. The speech signal is analyzed over a short-time analysis window. From each window, the spectrum is obtained using the Discrete Fourier Transform (DFT).
4. The spectrum is passed through the Mel filters to obtain the Mel spectrum.
5. The log of the Mel spectrum is taken.
6. Cepstral analysis of the log Mel spectrum, using the discrete cosine transform shown in Fig. 2, yields the MFCCs.
The entire signal is now represented as a sequence of feature vectors, which are given as input to the LID system. The block diagram for the computation of the MFCC is shown in Fig. 2.
[Block diagram: Pre-Emphasis & Framing → DFT → Power Spectrum → Mel Scale Filter Banks → Compute Filter Bank Energies → Logarithm → Discrete Cosine Transform → Mel Frequency Cepstral Coefficients]
Fig.2 : Block diagram illustrating the steps involved in the computation of the Mel Frequency Cepstral Coefficients (MFCC).
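The MFCC steps of Section 2.2 can be sketched end to end as follows. This is a minimal illustration using only the standard library, not the implementation used for the experiments. Several values are assumptions that the paper does not state: the pre-emphasis coefficient 0.97, the Hamming window, the 512-point DFT, 26 Mel filters, 13 cepstral coefficients, and the common formula mel = 2595 log10(1 + f/700). The DFT is computed naively for clarity; a real system would use an FFT.

```python
import cmath
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def pre_emphasis(signal, alpha=0.97):
    """First-order filter y(n) = x(n) - alpha * x(n - 1); alpha is an assumed value."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))]


def split_frames(signal, size, shift):
    """Overlapping frames; an incomplete final frame is dropped."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, shift)]


def power_spectrum(frame, n_fft):
    """Hamming-window the frame and return |DFT|^2 for bins 0..n_fft/2 (naive DFT)."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n_fft)
                    for t in range(n))) ** 2
            for k in range(n_fft // 2 + 1)]


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * f / sample_rate)) for f in edges_hz]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):          # rising slope
            filt[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):          # falling slope
            filt[k] = (hi - k) / (hi - mid)
        bank.append(filt)
    return bank


def dct(values, n_out):
    """Unnormalized DCT-II, keeping the first n_out coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(n_out)]


def mfcc(signal, sample_rate=16000, n_filters=26, n_ceps=13, n_fft=512):
    size = round(0.025 * sample_rate)     # 25 ms frame
    shift = round(0.010 * sample_rate)    # 10 ms frame shift
    bank = mel_filterbank(n_filters, n_fft, sample_rate)
    features = []
    for frame in split_frames(pre_emphasis(signal), size, shift):
        spectrum = power_spectrum(frame, n_fft)
        energies = [sum(f * p for f, p in zip(filt, spectrum)) for filt in bank]
        log_energies = [math.log(max(e, 1e-10)) for e in energies]  # floor avoids log(0)
        features.append(dct(log_energies, n_ceps))
    return features


# Hypothetical input: 35 ms of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
tone = [math.sin(2.0 * math.pi * 440.0 * n / sample_rate) for n in range(560)]
features = mfcc(tone, sample_rate)
print(len(features), len(features[0]))
```

Each inner list is the feature vector of one 25 ms frame, so an utterance becomes a sequence of vectors that can be passed to the modeling stage.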
2.3 Perceptual Linear Predictive
Perceptual Linear Predictive (PLP) features are among the most frequently used features in speech processing. The main objective of the original PLP technique is to model the psychophysics of human hearing more accurately in the feature extraction process. The perceptual linear predictive speech analysis technique is based on the short-term spectrum of speech. The idea is to use knowledge about human perception to emphasize the important speech information in the spectrum while minimizing the differences between speakers. PLP consists of the following steps:
1. The power spectrum is computed from the windowed speech signal.
2. A frequency warping onto the Bark scale is applied.
3. The auditorily warped spectrum is convolved with the power spectrum of the simulated critical-band masking curve, which models the critical-band integration of human hearing.
4. The smoothed spectrum is down-sampled at intervals of ≈1 Bark. The three steps of frequency warping, smoothing and sampling (2-4) are integrated into a single filter bank called the Bark filter bank.
5. An equal-loudness pre-emphasis weights the filter-bank outputs to simulate the sensitivity of hearing.
6. The equalized values are transformed according to Stevens' power law by raising each to the power of 0.33.
7. The resulting auditorily warped line spectrum is further processed by linear prediction (LP).
Fig.3 : Block diagram illustrating the steps involved in the computation of the PLP.
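The perceptual operations in steps 2, 5 and 6 above can be illustrated with a short sketch. The Bark warping and the equal-loudness approximation used here are the formulas commonly associated with the original PLP technique; the paper itself does not print them, so they are stated as assumptions, and the band energies and centre frequencies in the example are made up. The critical-band filtering (steps 3 and 4) and the final linear prediction (step 7) are omitted.

```python
import math


def hz_to_bark(f_hz):
    """Bark warping commonly used in PLP: 6 * asinh(f / 600)."""
    return 6.0 * math.asinh(f_hz / 600.0)


def equal_loudness(f_hz):
    """Approximate equal-loudness weight at frequency f_hz (assumed formula)."""
    w2 = (2.0 * math.pi * f_hz) ** 2          # squared angular frequency
    return ((w2 + 56.8e6) * w2 ** 2) / ((w2 + 6.3e6) ** 2 * (w2 + 0.38e9))


def compress(band_energies, exponent=0.33):
    """Intensity-to-loudness conversion (Stevens' power law)."""
    return [e ** exponent for e in band_energies]


def auditory_spectrum(band_energies, centre_freqs_hz):
    """Apply equal-loudness weighting, then power-law compression, per band."""
    weighted = [equal_loudness(f) * e for f, e in zip(centre_freqs_hz, band_energies)]
    return compress(weighted)


# Hypothetical Bark filter-bank outputs and their centre frequencies in Hz.
centres = [250.0, 500.0, 1000.0, 2000.0, 4000.0]
energies = [4.0, 9.0, 16.0, 9.0, 4.0]
print([round(hz_to_bark(f), 2) for f in centres])
print(auditory_spectrum(energies, centres))
```

The compressed values form the auditory spectrum that step 7 models with an all-pole (LP) fit.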
3. Gaussian Mixture Model Training
In a GMM, the probability distribution of the observed data takes the form given by the following equation,

$$p(x \mid \lambda) = \sum_{i=1}^{M} p_i\, b_i(x)$$

where $M$ is the number of component densities, $x$ is a $D$-dimensional observed data vector, $b_i(x)$ is the $i$-th component density and $p_i$ is the mixture weight, for $i = 1, \ldots, M$. Each component density $b_i(x)$ denotes a $D$-dimensional normal distribution with mean vector $\mu_i$ and covariance matrix $\Sigma_i$:

$$b_i(x) = \frac{1}{(2\pi)^{D/2}\, |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_i)'\, \Sigma_i^{-1}\, (x - \mu_i) \right\}$$

The mixture weights are positive scalar values that satisfy the condition

$$\sum_{i=1}^{M} p_i = 1$$

These parameters can be collectively represented as $\lambda = \{p_i, \mu_i, \Sigma_i\}$ for $i = 1, \ldots, M$. Each language in a language identification system can be represented by one distinct GMM, referred to as the language model $\lambda_i$, for $i = 1, 2, 3, \ldots, N$, where $N$ is the number of languages.

3.1 Training the Model
Clusters are formed within the training data. Each cluster is then represented by a multivariate Gaussian probability density function (pdf). The union of many such Gaussian pdfs is a GMM. The most common approach to estimating the GMM parameters is maximum likelihood estimation, where $P(X \mid \lambda)$ is maximized with respect to $\lambda$. Here $P(X \mid \lambda)$ is the conditional probability and $X = \{x_1, x_2, \ldots, x_t\}$ is the set of all feature vectors belonging to a particular acoustic class. Since there is no closed-form solution to the maximum likelihood estimation, an iterative approach based on the Expectation-Maximization (EM) algorithm is followed for computing the GMM parameters, and a sufficiently large amount of training data is needed for reliable estimates. The aim of training is to obtain the mean, variance and weight of each Gaussian distribution ($\lambda$).

The feature vectors are clustered using the K-means clustering algorithm. The number of clusters chosen is 64, which approximates the union of the phonemes in all the languages under consideration. The GMM is then re-estimated using the EM algorithm. For each language $L_i$, one GMM is created. This procedure is repeated for all the languages under consideration, so that separate GMMs are created.

Table: LID performance for the three-language task using GMM.

                                 Identification Performance (%)
                                 No. of Gaussians
Language  Time (sec.)  Method      4     6     8    12    16    32
English   1            LPC        45    49    52    55    59    60
English   1            MFCC       50    54    57    58    60    65
English   1            PLP        43    46    50    54    55    57
English   2            LPC        49    51    54    57    61    64
English   2            MFCC       55    56    59    62    66    69
English   2            PLP        48    52    55    60    61    64
English   3            LPC        55    57    60    64    66    69
English   3            MFCC       60    63    67    70    72    74
English   3            PLP        52    55    60    61    63    66
French    1            LPC        40    42    45    49    52    57
French    1            MFCC       50    54    57    58    60    65
French    1            PLP        40    43    45    50    51    53
French    2            LPC        44    46    49    50    55    59
French    2            MFCC       50    52    55    56    60    62
French    2            PLP        46    50    51    54    57    60
French    3            LPC        51    52    54    57    58    61
French    3            MFCC       57    60    64    66    68    70
French    3            PLP        59    56    57    59    60    64
German    1            LPC        48    51    53    57    58    60
German    1            MFCC       44    46    47    50    53    57
German    1            PLP        42    45    48    50    53    56
German    2            LPC        45    48    51    53    55    58
German    2            MFCC       52    54    56    58    62    63
German    2            PLP        49    50    53    53    58    61
German    3            LPC        52    53    55    58    60    62
German    3            MFCC       59    62    66    68    70    72
German    3            PLP        60    63    64    65    66    68
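The GMM training and identification procedure of Section 3 can be sketched as below. This is a simplified illustration under stated assumptions, not the system evaluated in the table: it uses diagonal covariance matrices instead of the full $\Sigma_i$ of the density formula, it initializes the means from evenly spaced training vectors instead of K-means, and the two-dimensional "feature vectors" and language names are synthetic. The identification rule assumed here is to pick the language model with the highest log-likelihood for the test feature vectors.

```python
import math
import random


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def log_sum_exp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


class GMM:
    def __init__(self, data, n_components, n_iter=20, var_floor=1e-3):
        dim = len(data[0])
        step = max(1, len(data) // n_components)
        # Assumption: means start at evenly spaced training vectors (the paper uses K-means).
        self.means = [list(data[i * step]) for i in range(n_components)]
        self.variances = [[1.0] * dim for _ in range(n_components)]
        self.weights = [1.0 / n_components] * n_components
        for _ in range(n_iter):
            self._em_step(data, var_floor)

    def _component_logs(self, x):
        """log(p_i * b_i(x)) for every mixture component i."""
        return [math.log(w) + log_gauss(x, m, v)
                for w, m, v in zip(self.weights, self.means, self.variances)]

    def _em_step(self, data, var_floor):
        dim = len(data[0])
        # E-step: posterior probability of each component for each vector.
        resp = []
        for x in data:
            logs = self._component_logs(x)
            norm = log_sum_exp(logs)
            resp.append([math.exp(value - norm) for value in logs])
        # M-step: re-estimate weight, mean and variance of each component.
        for i in range(len(self.weights)):
            n_i = sum(r[i] for r in resp) + 1e-10   # guard against empty components
            mean = [sum(r[i] * x[d] for r, x in zip(resp, data)) / n_i
                    for d in range(dim)]
            var = [max(var_floor,
                       sum(r[i] * (x[d] - mean[d]) ** 2 for r, x in zip(resp, data)) / n_i)
                   for d in range(dim)]
            self.weights[i] = n_i / len(data)
            self.means[i], self.variances[i] = mean, var

    def log_likelihood(self, data):
        """log P(X | lambda), assuming independent feature vectors."""
        return sum(log_sum_exp(self._component_logs(x)) for x in data)


def identify(models, features):
    """Return the name of the language model with the highest log-likelihood."""
    return max(models, key=lambda name: models[name].log_likelihood(features))


# Synthetic 2-D "feature vectors" for two hypothetical languages.
rng = random.Random(0)


def cloud(cx, cy, n=60):
    return [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)) for _ in range(n)]


training = {"lang_a": cloud(0, 0) + cloud(2, 2), "lang_b": cloud(6, 6) + cloud(8, 4)}
models = {name: GMM(data, n_components=2) for name, data in training.items()}
test_utterance = cloud(2, 2, n=20)
print(identify(models, test_utterance))
```

In a real system the training data for each language would be the LPC, MFCC or PLP vectors of its training utterances, and the number of components would be varied as in the table (4 to 32 Gaussians).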
Conclusion
LPC, MFCC and PLP are the most commonly proposed acoustic features for language identification. The performance of an LID system depends mainly on the type of feature vectors used for the task. In this paper we compared the language identification performance of different feature extraction methods using GMM. The LID performance can be improved by combining more features of the speech signal.