International Conference on Global Trends in Engineering, Technology and Management (ICGTETM-2016)
Generation of Real Time Database for Speech Recognition System

Vivek Khalane#1, Umesh Bhadade#2
#1 Assistant Professor, Department of Instrumentation Engineering, R.A.I.T., Nerul, India
#2 Professor, Department of Information Technology, S.S.B.T. COE, Jalgaon, India
Abstract—In real time speech recognition environments, high quality speech segmentation is required. In current systems, speaker identification gives satisfactory results when training is done offline using a good quality database. Most existing work concentrates on offline training structures. However, in some cases offline training is not possible due to a lack of prior information. To generate a dynamic feature database, a runtime training approach is required. In this paper we present a speaker recognition system using polynomial and GMM (Gaussian Mixture Model) classifiers with MFCC (Mel Frequency Cepstral Coefficients) features. The proposed system gives 97% accuracy on a five-speaker database.

Keywords—Speech recognition, Dynamic database creation, Speaker recognition.
I. INTRODUCTION
A whole conversation can be followed easily by a person, but for a machine, picking out an individual speaker's speech is a tough task, and very difficult without prior training. Real time speaker recognition therefore requires training and the creation of a speaker database. One application is a smart media player, which must identify the particular speaker present in the content during playback. For real time speaker recognition, the first stage is to segment the speaker's speech for training, which is then applied to automatic speaker identification. Different types of algorithms are used in speaker recognition [6], [7], [8], [9], [10]. Mel Frequency Cepstral Coefficients (MFCCs) are a widely accepted feature [1]. We apply Gaussian mixture models (GMM) for classification [2].
The rest of the paper is organized as follows. Section II explains the required basic preliminaries. Section III discusses the proposed algorithm for run time database generation. Sections IV and V describe the experimental set-up and results respectively, and Section VI concludes the paper.
II. PRELIMINARIES
The basic preliminaries for speaker recognition are as follows. A speaker model is built from features extracted from the speech. Since each speaker corresponds to a different class, recognition can be solved like any other classification problem; the classifier used to find the class is explained below.
A. MFCC features
MFCC features are computed through a filter bank processing adapted to the specificities of speech, with a deconvolution technique used to modify the conventional cepstrum [1]. The speech signal is split into short time windows, and the Discrete Fourier Transform (DFT) of each window of the discrete time signal x(n) with length N is computed as

X(k) = \sum_{n=0}^{N-1} w(n) x(n) e^{-j 2\pi k n / N}   (1)

for k = 0, 1, ..., N-1, where k corresponds to the frequency f(k) = k f_s / N, f_s is the sampling frequency in Hertz, and w(n) is a time window. Here, we chose the popular Hamming window, given by

w(n) = 0.54 - 0.46 \cos\left( \frac{2\pi n}{N-1} \right)   (2)
The magnitude spectrum |X(k)| is now scaled in both frequency and magnitude. First, the frequency is scaled logarithmically using the so-called Mel filter bank H(k,m), and then the logarithm is taken, giving

X'(m) = \ln\left( \sum_{k=0}^{N-1} |X(k)| H(k,m) \right)   (3)

Finally, the MFCCs are obtained by computing the DCT:

c(I) = \sum_{m=1}^{M} X'(m) \cos\left( \frac{\pi I}{M} \left( m - \frac{1}{2} \right) \right)   (4)

for I = 1, 2, ..., M, where c(I) is the Ith MFCC and M is the number of Mel filters used.
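The pipeline of equations (1)–(4) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Mel-scale conversion formula, the triangular filter shapes, and the parameter values (20 filters, 12 coefficients) are common defaults assumed here.

```python
import numpy as np

def hz_to_mel(f):
    # Standard Mel-scale conversion (an assumption; the paper does not give the formula)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters H(k, m) spaced uniformly on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            H[m - 1, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            H[m - 1, k] = (r - k) / max(r - c, 1)   # falling slope
    return H

def mfcc_frame(x, fs, n_filters=20, n_ceps=12):
    N = len(x)
    w = np.hamming(N)                       # time window of Eq. (2)
    X = np.fft.rfft(w * x)                  # DFT of Eq. (1)
    H = mel_filterbank(n_filters, N, fs)
    Xp = np.log(H @ np.abs(X) + 1e-10)      # Mel filter bank + log, Eq. (3)
    I = np.arange(1, n_ceps + 1)[:, None]
    m = np.arange(1, n_filters + 1)[None, :]
    # DCT of Eq. (4): c(I) = sum_m X'(m) cos(pi*I/M * (m - 1/2))
    return (Xp[None, :] * np.cos(np.pi * I / n_filters * (m - 0.5))).sum(axis=1)

fs = 16000
t = np.arange(0, 0.02, 1.0 / fs)            # one 20 ms frame, as used in the paper
coeffs = mfcc_frame(np.sin(2 * np.pi * 440 * t), fs)
print(coeffs.shape)                         # (12,)
```

In practice the function would be applied frame by frame over the overlapping 20 ms windows described later in the paper.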
B. Classifiers
A classifier is an algorithm that implements a concrete classification. In this paper we consider two classifiers.

1) Polynomial Classifier
The polynomial classifier operates on feature vectors. For a given feature vector X = [x_1, x_2], the polynomial basis vector p(x) of order two is

p(x) = [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]^T   (5)

In general, the polynomial basis is formed from the monomials

x_{i_1} x_{i_2} \cdots x_{i_k}   (6)

where k is less than or equal to the order K of the polynomial. For each input feature vector, the matching score is calculated as the inner product with the speaker model w, and the scores are averaged to obtain the final output:

s = \frac{1}{N} \sum_{i=1}^{N} w^T p(x_i)   (7)
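The scoring of equations (5)–(7) can be sketched as below, assuming a second-order basis and a hypothetical, untrained weight vector w (a real system would learn w per speaker as in [5]):

```python
from itertools import combinations_with_replacement

def poly_basis(x, order=2):
    # Polynomial basis of Eqs. (5)/(6): all monomials x_i1 * ... * x_ik with k <= order
    terms = [1.0]
    for k in range(1, order + 1):
        for idx in combinations_with_replacement(range(len(x)), k):
            prod = 1.0
            for i in idx:
                prod *= x[i]
            terms.append(prod)
    return terms

def match_score(w, frames, order=2):
    # Eq. (7): the score is the average inner product w^T p(x_i) over N feature vectors
    scores = [sum(wi * pi for wi, pi in zip(w, poly_basis(x, order))) for x in frames]
    return sum(scores) / len(scores)

# Toy example with 2-dimensional features X = [x1, x2] as in Eq. (5);
# the speaker model w below is hypothetical, not trained on real data.
frames = [[0.5, 1.0], [0.6, 0.9]]
w = [0.1, 0.2, 0.3, 0.0, 0.1, 0.0]   # one weight per term [1, x1, x2, x1^2, x1*x2, x2^2]
print(match_score(w, frames))
```

At test time the score would be computed against each stored speaker model and the highest-scoring speaker selected.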
2) GMM Classifier
GMM parameters are derived using the Expectation-Maximization (EM) algorithm or by Maximum A Posteriori (MAP) estimation from a well trained prior model [3]. A Gaussian mixture model is a weighted sum of M component Gaussian densities, given by

p(X | \lambda) = \sum_{i=1}^{M} w_i \, g(X | \mu_i, \Sigma_i)   (8)

where X is a D-dimensional data vector (i.e. measurement or features), w_i, i = 1, 2, ..., M, are the mixture weights, and g(X | \mu_i, \Sigma_i), i = 1, 2, ..., M, are the component Gaussian densities. Each component density is a D-variate Gaussian function of the form

g(X | \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (X - \mu_i)' \Sigma_i^{-1} (X - \mu_i) \right)   (9)

with mean vector \mu_i and covariance matrix \Sigma_i. The mixture weights satisfy the constraint \sum_{i=1}^{M} w_i = 1. The complete Gaussian mixture model is parameterized by the mean vectors, covariance matrices and mixture weights of all component densities. These parameters are collectively represented by the notation

\lambda = \{ w_i, \mu_i, \Sigma_i \}, \quad i = 1, 2, ..., M   (10)

The covariance matrices \Sigma_i can be full rank or constrained to be diagonal.

III. PROPOSED ALGORITHM
In the proposed algorithm, speaker clusters are created by GMM based speaker segmentation and used for dynamic database creation. MFCC features and a polynomial based classifier are used for recognition.

A. Preprocessing
The input speech contains high frequency noise, which is removed by an FIR low pass filter. The average of the feature vectors over a given speech segment is used for training and testing of speakers. Silence within a segment may decrease the accuracy of speech recognition; it is removed using a voice activity detection algorithm. The following subsections explain the two parts of the algorithm: dynamic database creation and speaker recognition.

B. Dynamic database creation
For dynamic database creation, the speech is segmented online into clusters of speech, and the clustered speaker database is then used to train models for the individual speakers. Segmentation is carried out to detect speaker change points in the speech. A portion of the conversation coming from a single speaker is termed a segment [6], [7], [8], [9], [10]. Fig. 1 gives a visual illustration.

Fig. 1 Illustration of speaker change point detection.

Segmentation is performed by observing the MFCC vectors within a sliding window of prescribed length, with the candidate speaker change point taken at the center of the window. First, the MFCC vectors of the 1st and 2nd halves of the window are used to compute GMM models N1 and N2. A distance based on the BIC (Bayesian Information Criterion) is then measured between N1 and N2 [11]. The distance measure D is given by

D = BIC(N_1, N_2) + BIC(N_2, N_1)   (11)

The obtained value is saved and the window is shifted for the next iteration; repeating this for the remaining vectors yields an array of distance values. An adaptive threshold technique is used to avoid false change detections:

Th = \frac{W}{2w+1} \sum_{j=i-w}^{i+w} D(j)   (12)

where W is a weight, w is the number of difference values taken into account from the left and right local neighbor windows, and i is the frame under consideration.

The speech segments belonging to an individual speaker form a cluster, and these clusters are used to train the recognition algorithm as shown in Fig. 2.

C. Polynomial classifier for speaker recognition
For speaker recognition, a polynomial based classifier is used because of its computational simplicity [5]. For testing, we divide the online speech and calculate features of every sample, which are matched against the available database. The input speech sample is divided into 20 ms frames with 50% overlap; the sharp truncation of data at the edge of a frame window causes loss of information, which the overlap avoids. Each frame is then passed through a low pass filter to remove noise. The MFCC features are extracted as described above; the system also calculates the Inverted Mel Frequency Cepstral Coefficients (IMFCC) [4], [5]. While testing, we extract the features and apply them to each individual polynomial speaker model; from the MFCC and IMFCC models we get two scores [12]. The final matching score is calculated as

Score = \alpha \cdot MFCC_{score} + (1 - \alpha) \cdot IMFCC_{score}   (13)

We chose the value of \alpha empirically; the selected value is 0.56.
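The score fusion of equation (13) is straightforward. The sketch below assumes per-speaker pairs of MFCC and IMFCC scores and a maximum-score decision rule, which the paper implies but does not state explicitly:

```python
def fuse_scores(mfcc_score, imfcc_score, alpha=0.56):
    # Eq. (13): convex combination of the MFCC and IMFCC model scores;
    # alpha = 0.56 is the empirically chosen value reported in the paper
    return alpha * mfcc_score + (1.0 - alpha) * imfcc_score

def identify(speaker_scores):
    # Pick the speaker whose fused score is highest (a hypothetical decision
    # rule consistent with matching against each model in the database)
    return max(speaker_scores, key=lambda s: fuse_scores(*speaker_scores[s]))

# Hypothetical (MFCC score, IMFCC score) pairs for three enrolled speakers
scores = {"spk1": (0.62, 0.40), "spk2": (0.55, 0.71), "spk3": (0.30, 0.35)}
print(identify(scores))
```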
IV. EXPERIMENTAL SETUP
The recording is done using an MI Prime IV with a recording sampling frequency of 16 kHz. The database consists of 15 recordings of 5 speakers; while recording, each speaker speaks one after another with a pause of 3 sec. The total duration of a recording is approximately 20 sec. The sampled data consist of 20 to 30 test utterances from the same 5 speakers.
Fig. 3 Difference measure graph with local extremes detected.
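The local-extreme detection illustrated in Fig. 3 can be sketched as below, using the adaptive threshold of equation (12). The peak-picking rule and the values of W and w are assumptions for illustration, not values reported in the paper:

```python
def adaptive_threshold(D, i, w=5, W=1.2):
    # Eq. (12): threshold at frame i is a weighted mean of the 2w+1 distance
    # values in the local neighborhood (W and w chosen here for illustration)
    lo, hi = max(0, i - w), min(len(D), i + w + 1)
    return W / (2 * w + 1) * sum(D[lo:hi])

def change_points(D, w=5, W=1.2):
    # A frame is a candidate speaker change point when its distance value is a
    # local maximum exceeding the adaptive threshold (a common peak-picking
    # step; the paper does not spell out the exact decision rule)
    points = []
    for i in range(1, len(D) - 1):
        if D[i] > D[i - 1] and D[i] >= D[i + 1] and D[i] > adaptive_threshold(D, i, w, W):
            points.append(i)
    return points

# Synthetic distance curve with a single sharp peak at frame 10
D = [0.1] * 20
D[10] = 1.0
print(change_points(D))   # [10]
```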
Table I gives the recognition results, assuming the segmentation is accurate.

TABLE I
Experimentation Results for Recognition

Number of   Number of      Sampling    Accuracy of
Speakers    Test Samples   Frequency   Recognition (Prototype)
5           20             16 kHz      96.5%
5           30             16 kHz      91.5%
Fig. 2 Dynamic database creation using speech segmentation.

V. RESULTS AND DISCUSSIONS
The results obtained by the proposed speech segmentation algorithm are quite good. The proposed system generates the dynamic database without any prior knowledge of the number of speakers or their speaking characteristics. Hence the recognition accuracy cannot be considered to increase with the number of test samples. Fig. 3 shows the result of speaker segmentation.

VI. CONCLUSIONS
We presented a speech recognition system with real time database creation to realize dynamic speaker recognition. For the database creation part, we proposed a speech segmentation technique. The proposed system was analyzed on a real time dataset. From the experiments we conclude that the proposed method gives good results while removing the offline training requirement of existing systems.
In future work, the results of the recognition system can be studied under environmental or background noise, and the accuracy of the proposed algorithm can be evaluated with multiple competing speakers speaking at the same time.

REFERENCES
[1] S. Davis and P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," IEEE Trans. on ASSP, 1980.
[2] D. A. Reynolds, "Automatic speaker recognition using Gaussian mixture speaker models," Lincoln Lab. J., vol. 8, no. 2, 1995, pp. 173-192.
[3] A. S. Malegaonkar, A. M. Ariyaeeinia, and P. Sivakumaran, "Efficient Speaker Change Detection Using Adapted Gaussian Mixture Models," IEEE Trans. on Audio, Speech and Language Processing, vol. 15, no. 6, 2007.
[4] S. Chakroborty and G. Saha, "Improved Text-Independent Speaker Identification using Fused MFCC & IMFCC Feature Sets based on Gaussian Filter," International Journal of Information and Communication Engineering, 5:1, 2009.
[5] W. M. Campbell, K. T. Assaleh, and C. C. Broun, "Speaker recognition with polynomial classifiers," IEEE Trans. on Speech and Audio Processing, vol. 10, no. 4, 2002, pp. 205-212.
[6] H. Gish, M.-H. Siu, and R. Rohlicek, "Segregation of speakers for speech recognition and speaker identification," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'91), 1991, vol. 2, pp. 873-876.
[7] P. Delacourt and C. J. Wellekens, "DISTBIC: A speaker-based segmentation for audio data indexing," Speech Commun., vol. 32, no. 1-2, pp. 111-126, 2000.
[8] J. Ajmera et al., "Robust speaker change detection," IEEE Signal Process. Lett., vol. 11, no. 8, pp. 649-651, Aug. 2003.
[9] M. Roch and Y. Cheng, "Speaker segmentation using MAP-Adapted Bayesian Information Criterion," in Proc. Speaker Lang. Recognition Workshop (Odyssey), 2004, pp. 349-354.
[10] V. Karthik, D. S. Satish, and C. C. Sekhar, "Speaker change detection using support vector machine," in Proc. 3rd Int. Conf. Non-Linear Speech Process., Barcelona, Spain, Apr. 19-22, 2005, pp. 130-136.
[11] B. D. Patil, Y. Manav, and P. Sudheendra, "Dynamic Database Creation for Speaker Recognition System," in Proc. International Conference on Advances in Mobile Computing & Multimedia, ACM, 2013.
[12] Y. S. Angal, P. K. Ajmera, and R. S. Holambe, "Speech Recognition of Isolated Words in Noisy Conditions Using Radon Transform and Discrete Cosine Transform Based Features Derived from Speech Spectrogram," Digital Signal Processing, 4.5, 2012, pp. 178-183.