International Journal of Engineering Trends and Technology (IJETT) – Volume 12 Number 7 - Jun 2014
Isolated Digit Recognition Using In-Ear Microphone Data Using MFCC, VQ and HMM
Mahesh K. Patil 1, Prof. Dr. (Mrs.) L. S. Admuthe 2, Prashant P. Zirmite 3
1,3 PG Student, Department of Electronics Engineering, Textile & Engineering Institute, Ichalkaranji
2 Head, Department of Electronics Engineering, Textile & Engineering Institute, Ichalkaranji
Abstract – This paper implements an isolated digit recognizer in three steps. The first step performs endpoint detection and speech segmentation by short-term analysis. The second step performs speech feature extraction using Mel Frequency Cepstral Coefficient (MFCC) parameters. Finally, the third step applies a Vector Quantization (VQ) – Hidden Markov Model (HMM) based classifier for isolated digit recognition. An English spoken digit database containing a number of speakers is used for the testing and validation modules.
Keywords— Speaker recognition, MFCC, Vector Quantization, Hidden Markov Model.
I. INTRODUCTION
Speech is the natural and fundamental mode of communication for humans. Technically speaking, speech recognition refers to a mechanism that stores representations of the distinguishing characteristics of speech captured from an input device, such as a microphone, and further processes these representations to match them against incoming speech, in an effort to interact with machines, computers, and/or human users.
In an isolated digit recognition system, the digits are spoken in isolation with a distinct pause between them. This is the simplest recognition strategy, since an endpoint detector can easily be used to locate the boundaries of the digits. The pauses between digits are typically on the order of 200 ms, so that they are not confused with weak fricatives and gaps within digits. Furthermore, the user must be cooperative for the isolated digit recognition system to succeed.
There are two major stages within isolated word recognition: a training stage and a testing stage. The training stage involves teaching the system by building its dictionary (database), i.e. an acoustic model for each word that the system needs to recognize. The testing stage involves matching an input test speech signal against the reference database and recognizing it. During this testing phase, similar feature vectors are extracted from the input test speech signal, and the degree of their match with the reference is obtained using some matching technique.
II. BLOCK DIAGRAM
[Fig. 1 block diagram: in the training stage, reference speech samples pass through MFCC feature extraction into a trained database; in the testing stage, input from the user passes through MFCC feature extraction, feature matching using VQ, and HMM classification.]
Fig. 1 System Block Diagram
Fig. 1 shows the proposed flow diagram of the system. In the training stage, several samples of each word are taken from a speaker, and a number of different words are considered in this way. Feature vectors (sets of MFCCs) are calculated for these input reference samples, and for each word a Hidden Markov Model is formed from the observed feature vectors.
In the testing stage, real-time input can be given to the system. Its feature vectors are matched against the feature vectors from the database using vector quantization, and the word is classified using the Hidden Markov Model as the recognition algorithm.
III. FEATURE EXTRACTION
Fig. 2 shows the block diagram of MFCC computation. MFCC is the most evident example of a feature set that is used extensively in speech recognition. Because its frequency bands are positioned logarithmically, MFCC approximates the human auditory response more closely than other feature sets. MFCCs are computed by short-term analysis, with one MFCC vector computed from each frame. To extract the coefficients, the speech sample is taken as input and a Hamming window is applied to minimize the discontinuities of the signal. The DFT is then used to obtain the spectrum, which is passed through a Mel filter bank; Mel frequency warping positions the filters, and the log of the total energy in each filter is taken. Finally, the Discrete Cosine Transform is applied to the log Mel energies as if they were a signal; the MFCCs are the amplitudes of the resulting spectrum.
[Fig. 2 block diagram: Input Speech Signal → Framing and Windowing → DFT → Mel Frequency Warping → LOG → IDFT → Mel Cepstrum → MFCC Feature Vector]
Fig. 2 Steps involved in MFCC Feature Extraction
The steps involved in MFCC calculation are described below.
1. Pre-emphasis:
Pre-emphasis means boosting the energy in the high frequencies. To enhance the accuracy and efficiency of the feature extraction process, speech samples are normally pre-processed before features are extracted. Pre-processing includes digital filtering and speech signal detection; filtering includes a pre-emphasis filter and the removal of noise using digital filtering algorithms. A one-coefficient FIR filter is normally used as the pre-emphasis filter.
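As a hedged illustration of this step, the one-coefficient FIR filter can be sketched in Python/NumPy as follows; the coefficient value 0.97 is a common convention, not a value given in the paper.

```python
import numpy as np

def pre_emphasis(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """One-coefficient FIR pre-emphasis filter: y[n] = x[n] - alpha * x[n-1].

    Boosts high-frequency energy before feature extraction. The value
    alpha = 0.97 is a common convention assumed here; the paper does not
    specify the coefficient.
    """
    return np.append(x[0], x[1:] - alpha * x[:-1])
```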
2. Framing and windowing:
Framing: The continuous speech signal is blocked into frames of N samples, with adjacent frames separated by M samples (M < N). This continues until all the speech is accounted for within one or more frames. Typical values are N = 256 (equivalent to about 30 ms of windowing) and M = 100.
Windowing: Each individual frame from the above step is windowed in order to minimize signal discontinuities and spectral distortion. The Hamming window is used; the frame size is typically 10–25 ms and the frame shift 5–10 ms. The Hamming window is given by

$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right)$, where 0 ≤ n ≤ N − 1.   …(1)

3. Discrete Fourier Transform:
The DFT converts each frame of N samples from the time domain into the frequency domain, giving the magnitude frequency response of each frame. The N-point DFT is given by

$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j2\pi kn/N}$, where k = 0, 1, 2, …, N − 1.   …(2)
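To make steps 2 and 3 concrete, the sketch below frames a signal with N = 256 and M = 100 as in the text, applies the Hamming window of eq. (1), and computes the magnitude spectrum of eq. (2) per frame; the function names are illustrative only.

```python
import numpy as np

def frame_signal(x: np.ndarray, N: int = 256, M: int = 100) -> np.ndarray:
    """Block the signal into overlapping frames of N samples shifted by M (M < N)."""
    num_frames = 1 + (len(x) - N) // M  # assumes len(x) >= N
    return np.stack([x[i * M : i * M + N] for i in range(num_frames)])

def frame_spectra(frames: np.ndarray) -> np.ndarray:
    """Apply the Hamming window of eq. (1), then the N-point DFT of eq. (2),
    returning the magnitude frequency response of each frame."""
    N = frames.shape[1]
    n = np.arange(N)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))  # Hamming window, eq. (1)
    return np.abs(np.fft.fft(frames * w, axis=1))      # |X[k]|, eq. (2)
```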
4. Mel Frequency Warping:
Human perception of the frequency content of speech sounds does not follow a linear scale. For each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on the mel scale. The formula to compute mels for a given frequency f in Hz is

$\mathrm{mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$   …(3)
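Eq. (3) and its inverse in code; the inverse is not stated in the paper, but is implied wherever Mel filter bank edges must be mapped back to Hz.

```python
import numpy as np

def hz_to_mel(f):
    """Subjective pitch in mels for a frequency f in Hz, eq. (3)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of eq. (3); convenient when placing Mel filter bank edges."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)
```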
5. Log energy computation:
Since the human ear's response to speech is logarithmic in nature, the logarithm of each Mel filter bank energy value is computed.
6. Inverse Discrete Fourier Transform (IDFT):
This step converts each frame from the frequency domain back to the time domain. Here the Discrete Cosine Transform is applied: it expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies.
7. Mel Cepstrum:
The cepstral representation of the speech spectrum provides a good characterization of the local spectral properties of the signal for the given frame of analysis, because the Mel spectrum coefficients are real numbers. In this step the Mel spectrum is converted back to the standard frequency scale.
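Putting steps 4–7 together, one hedged sketch of the Mel filter bank, log compression, and DCT stage is given below. It reuses hz_to_mel and mel_to_hz from the sketch above, and the filter and coefficient counts (26 and 13) are conventional assumptions, not values from the paper.

```python
import numpy as np
from scipy.fftpack import dct  # the DCT plays the IDFT role of step 6

def mel_filterbank(num_filters: int, N: int, fs: float) -> np.ndarray:
    """Triangular filters uniformly spaced on the mel scale (assumed design)."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), num_filters + 2))
    bins = np.floor((N + 1) * edges_hz / fs).astype(int)
    fb = np.zeros((num_filters, N // 2 + 1))
    for i in range(1, num_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fb

def mfcc_from_spectra(mag: np.ndarray, fs: float, num_filters=26, num_ceps=13):
    """Steps 4-7: Mel warping, log energies, DCT -> one MFCC vector per frame."""
    N = mag.shape[1]                          # DFT length from step 3
    half = mag[:, : N // 2 + 1]               # one-sided magnitude spectrum
    energies = half @ mel_filterbank(num_filters, N, fs).T  # step 4
    log_energies = np.log(energies + 1e-10)   # step 5: logarithmic compression
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :num_ceps]  # steps 6-7
```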
IV. FEATURE MATCHING – VECTOR QUANTIZATION
A feature matching technique based on Vector Quantization is proposed here. The acoustic vectors formed by the sets of MFCCs are clustered together, and for each cluster a center, called a codeword, is calculated. The collection of all codewords is used to generate a codebook.
Thus, in the training phase a codebook is generated for each input speech sample by clustering its training acoustic vectors. In the testing phase, an input utterance of unknown speech is vector quantized using each trained codebook and the vector quantization distortion is computed. The unknown speech sample is identified as the one corresponding to the VQ codebook with the smallest distortion. Clustering can be implemented using either the K-means algorithm or the LBG algorithm.
The algorithm is implemented as follows (a code sketch follows the list):
1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors.
2. Double the size of the codebook by splitting each current codeword $y_n$ according to the rule
$y_n^{+} = y_n(1 + c)$ and $y_n^{-} = y_n(1 - c)$,
where n varies from 1 to the current size of the codebook and c is the splitting parameter (0.01 ≤ c ≤ 0.05).
3. Use the best-suited iterative algorithm (K-means or LBG) to obtain the best set of centroids for the split codebook.
4. Iterate steps 2 and 3 until the desired codebook size is reached.
The output of VQ is the index of the
codebook vector nearest to the input vector.
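A minimal sketch of this split-and-refine (LBG-style) training and of the quantization step, assuming Euclidean distortion; the function names are illustrative and the default c = 0.01 follows the range quoted above.

```python
import numpy as np

def train_codebook(vectors: np.ndarray, size: int, c: float = 0.01, iters: int = 20):
    """Grow a VQ codebook by binary splitting (steps 1-4 above)."""
    codebook = vectors.mean(axis=0, keepdims=True)       # step 1: 1-vector codebook
    while codebook.shape[0] < size:
        codebook = np.vstack([codebook * (1 + c),        # step 2: split every
                              codebook * (1 - c)])       # codeword y -> y(1 +/- c)
        for _ in range(iters):                           # step 3: K-means refinement
            dists = np.linalg.norm(vectors[:, None] - codebook[None], axis=2)
            nearest = dists.argmin(axis=1)
            for k in range(codebook.shape[0]):
                members = vectors[nearest == k]
                if len(members):
                    codebook[k] = members.mean(axis=0)
    return codebook

def quantize(vectors: np.ndarray, codebook: np.ndarray):
    """Return the nearest codeword index per vector and the mean distortion."""
    dists = np.linalg.norm(vectors[:, None] - codebook[None], axis=2)
    return dists.argmin(axis=1), dists.min(axis=1).mean()
```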
V. SPEECH RECOGNITION CLASSIFIER USING HIDDEN MARKOV MODEL
An HMM is a finite state machine with a fixed number of states. It is a statistical method for characterizing the spectral properties of the frames of a speech signal. In an HMM the states of the model cannot be observed directly; only the output symbols of each state can be observed. Speech is described by parameterized statistical models, and recognition is performed by selecting the model most likely to have produced the given sequence of observations.
An HMM is characterized by three matrices, λ = (A, B, Π), where:
A is the transition probability matrix (N×N);
B is the observation symbol probability distribution matrix (N×M);
Π is the initial state distribution matrix (N×1);
N is the number of states in the HMM; and
M is the number of distinct observation symbols per state per word.
The complete parameter set λ = (A, B, Π) can be calculated for each word model.
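As a hedged sketch of how λ = (A, B, Π) yields a likelihood, the scaled forward algorithm below computes log P(O | λ) for a sequence of VQ codebook indices; this is the standard construction, not code from the paper.

```python
import numpy as np

def log_forward(A: np.ndarray, B: np.ndarray, pi: np.ndarray, obs) -> float:
    """Scaled forward algorithm: log P(obs | lambda = (A, B, pi)).

    A  : (N, N) state transition probability matrix
    B  : (N, M) observation symbol probability matrix
    pi : (N,)   initial state distribution
    obs: sequence of integer VQ codebook indices in [0, M)
    """
    alpha = pi * B[:, obs[0]]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = (alpha @ A) * B[:, obs[t]]  # propagate and absorb symbol
        scale = alpha.sum()
        log_p += np.log(scale)
        alpha = alpha / scale                   # rescale to prevent underflow
    return log_p
```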
[Fig. 3 block diagram: the vector quantized MFCCs of the spoken digit input are fed to the HMMs for each digit (λ1, λ2, …, λv); a probability computation is performed per model, and the model with the maximum probability gives the recognized digit.]
Fig. 3 HMM based speech recognition.
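Recognition as depicted in Fig. 3 then reduces to scoring the quantized MFCC sequence against each digit's model and taking the maximum; the models dictionary below is hypothetical, and log_forward is the sketch from the section above.

```python
def recognize(obs, models):
    """models: hypothetical dict such as {"one": (A, B, pi), ...}, trained per digit."""
    scores = {digit: log_forward(A, B, pi, obs) for digit, (A, B, pi) in models.items()}
    return max(scores, key=scores.get)  # digit whose HMM gives maximum likelihood
```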
VI. EXPERIMENTAL RESULTS
For the implementation of the digit recognition system, the spoken digit 'ONE' is considered. The results obtained for 'ONE' are as follows.
Fig. 4 Original and filtered cropped speech signal

Fig. 5 Short-term energy and zero-crossing rate in endpoint detection

TABLE 1
PARAMETERS FOR SPEAKER 1

Signal (Speaker 1)   MFCC Mean   MFCC Variance   HMM Likelihood Value
One                  38.0097     422.344         -182.6
Two                  37.8462     395.1422        -150.01
Three                34.2821     288.7179        -165.87
Four                 48.9759     493.6744        -173.94
Five                 41.9466     406.5802        -226.03
Six                  28.8048     250.0655        -190.68
Seven                56.242      557.4576        -182.35
Eight                36.4913     314.8482        -202.32
Nine                 38.3651     308.0966        -216.32
Ten                  25.1202     235.3668        -200.17

TABLE 2
PARAMETERS FOR SPEAKER 2

Signal (Speaker 2)   MFCC Mean   MFCC Variance   HMM Likelihood Value
One                  182.1683    1517.8          136.2
Two                  163.7754    948.2908        110.17
Three                163.2248    977.5702        34.07
Four                 186.9938    1748.8          123.42
Five                 187.3901    1739.8          158.27
Six                  150.4516    972.5804        74.59
Seven                185.1616    1728.4          70.18
Eight                175.9932    1316            51.79
Nine                 209.5781    1831.9          100.844
Ten                  184.8656    1326.8          52.01

VII. CONCLUSION
In this paper we have implemented isolated digit recognition using MFCC as the speech feature extraction method, Vector Quantization (VQ) as the speech feature matching technique, and a Hidden Markov Model as the classifier. With this model it is possible to make the recognizer more robust and efficient, and hence the performance of the recognition system is improved. Although the methods proposed in this paper achieve good performance, some issues remain to be investigated further.