NEAREST NEIGHBOUR DECISION RULE FOR VOWEL AND DIGIT RECOGNITION

T.K. Raja and B. Yegnanarayana
Department of Electrical Communication Engineering
Indian Institute of Science
Bangalore-560012, India
ABSTRACT

In this paper, we discuss the performance of vowel and digit recognition schemes based on the autocorrelation as the feature and the nearest neighbour decision rule (NNDR) [7] for classification. Minimum distance to mean is usually used as a classification rule in speech and speaker recognition studies. In this paper it is shown that the nearest neighbour decision rule gives a significant improvement in classification score for vowel and digit recognition schemes. Autocorrelation coefficients of lags two to five sampling instants form the feature vector. Four samples per class have been used. The minimum squared Euclidean distance of the test vector from the nearest reference is chosen as the classification rule. For sustained vowels the recognition score is cent percent. For the same feature, the minimum distance to mean gives a 70% recognition score. When the reference samples of a given speaker are tested over the vowels spoken by different speakers (up to 10), this scheme gives a recognition score of about 95%. For digits, without any time warping, a recognition score of about 86% to 92% is obtained.

I. INTRODUCTION

The success of speech recognition by machine depends on the selection of an appropriate feature and the classification algorithm. Ideally, the feature selected should be widely separated in the feature space, while being least affected by the environmental conditions and by inter- and intra-speaker variations. Several studies on speech recognition, in particular vowel and digit recognition schemes, mainly use the spectral information. This spectral information may be the formants or the linear prediction coefficients (LPC's) [1,2]. Autocorrelation coefficients [3] of signals derived from filter banks and the zero crossing rate (ZCR) [4] have also been used. Most speech recognition schemes use the minimum Euclidean distance to the means as the criterion for classification. Although this criterion is computationally simple, it assumes an equal-variance, unimodal Gaussian distribution for the features. Several other classification rules based on the Itakura measure [5] and the log spectral distance [6] have also been reported. For more information on speech recognition systems, refer to D.R. Reddy's paper [9].

II. THE FEATURE AND THE CLASSIFICATION RULE

It is well known that the autocorrelation and the power spectrum of a signal form a Fourier transform pair. Since the speech information is characterised by the spectral envelope, it appears logical that the autocorrelation coefficients could as well be used as the feature. The LPC's are derived from the autocorrelation coefficients by solving the autocorrelation normal equations [8]. Hence, instead of using LPC's and other features derived from LP analysis, one can as well use the autocorrelation coefficients directly as a feature. In this study, we have used the normalized autocorrelation coefficients of lags two to five sampling instants. The normalization is necessary to normalize the gain variations in the speech signal. This feature is also found to be less susceptible to additive white noise and to inter- and intra-speaker variations. Figure 1 shows that the autocorrelation feature for the vowel /e/ approaches its true value when the signal to noise ratio (SNR) is greater than 12 dB, and this character remains the same for all the vowels.

Any general classification algorithm is required to determine the weight vectors of the discriminant functions [7]. The efficient method of estimating the weight vectors is based on the availability of a large number of training samples to obtain the convergence of the weight vectors. The procedures involved in estimating the weight vectors are computationally expensive. The Mahalanobis distance criterion is equally complex, as it requires estimating the sample mean and variance, which also needs a large number of training samples. On the other hand, the minimum Euclidean distance to mean criterion is simple, but it requires the samples to follow a unimodal, equal-variance Gaussian distribution.
CH1285-6/78/0000-0731$00.75 (c) 1978 IEEE
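The normalized autocorrelation feature of Section II can be sketched as follows (a minimal illustration, assuming the 10 kHz sampling rate and 25.6 msec frame of the text; the synthetic test frame and the function names are ours, not the paper's implementation):

```python
import math

def autocorr(frame, lag):
    """Autocorrelation of one frame at a given lag (biased estimate)."""
    n = len(frame)
    return sum(frame[i] * frame[i + lag] for i in range(n - lag))

def normalized_feature(frame, lags=(2, 3, 4, 5)):
    """Feature vector R(k)/R(0) for lags 2 to 5; dividing by R(0)
    removes the overall gain, as the normalization in the text does."""
    r0 = autocorr(frame, 0)
    return [autocorr(frame, k) / r0 for k in lags]

# At 10 kHz, a 25.6 msec frame holds 256 samples; here a 500 Hz sinusoid
# stands in for the stable portion of a vowel.
frame = [math.sin(2 * math.pi * 500 * t / 10000) for t in range(256)]
feat = normalized_feature(frame)
```

Because of the normalization by R(0), the same frame scaled by any gain factor yields the same feature vector, which is the gain invariance the text relies on.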
In all these cases, the features belonging to different classes are to be well separated in the feature space. The NNDR can be thought of as a compromise between these two extremes. This rule only assumes that the samples form well separated clusters in the feature space. In addition, the NNDR does not require the tedious computational procedures of estimating any of the parameters already discussed.

The basic principle of the NNDR is that it assigns to the test sample (or test set) the class to which its nearest neighbour sample (or samples) in the design set belongs. The mathematical description of this technique is as follows:

    d_ij^2 = || X_i - X_t ||^2                    (1)

where d_ij^2 is the squared Euclidean distance between the reference sample X_i and the test sample X_t, and X_i is the ith reference sample belonging to the jth class C_j. Both X_i and X_t are of the same dimension. If the minimum of d_ij^2 over all reference samples is attained for a sample X_k belonging to class C_j, i.e., if

    (d_ij^2)_min = || X_k - X_t ||^2              (2)

then X_t is assigned to class C_j.

III. EXPERIMENT

Experiments have been conducted to evaluate the autocorrelation feature with the NNDR as the classification algorithm, using the HP Fourier analyser system 5451B. The speech data is entered into the system through its built-in A/D conversion unit. A Shure microphone is used as the transducer and is kept at a distance of not more than 3 inches from the mouth. Both design and testing are done in the computer room environment. The background noise is about 60 dB. Throughout this study a sampling rate of 10 kHz and a Hanning window are used. The number of speakers who participated in this experiment is 10 (8 male + 2 female).

For sustained vowel (/a/, /i/, /u/, /e/ and /o/) recognition, the autocorrelation coefficients (R(2) to R(5)) are computed from the stable portion of the signal of duration 25.6 msec. Only four samples per class for all the vowels per speaker are computed and stored on the paper tape. The dimensionality of the feature space is 4.

When the diphthongs /AI/ and /AU/ (in Indian languages they are called vowels) are included in the vowel recognition scheme already discussed, the contour of the autocorrelation coefficients (R(2) to R(5)) over four successive frames, each of duration 25.6 msec, is used. The dimensionality of the feature space is 16 and the number of classes is seven. The diphthongs are articulated in sequence, first by /A/ and then by /I/ or /U/ as the case may be, and hence it is necessary to consider the initial portions of the utterances. Only four samples per class for all the vowels per speaker are computed and stored on the paper tape.

For digit recognition, the autocorrelation coefficients (R(2) to R(5)) are computed in each frame of duration 25.6 msec for 16 successive frames. Thus the dimensionality of the feature space is 64. In this case only 3 samples per class for all the 10 classes per speaker are stored on the paper tape.

In all these cases the dc power is removed from the overall power spectrum. The sequence of the feature extraction procedure is shown in the block diagram of Figure 2. The FFT and IFFT are used to remove the DC power, which varies from utterance to utterance. The testing of this recognition scheme was spread over a period of nine months.

Fig. 2: Block diagram of the feature extraction cycle. [Figure not reproduced.]

IV. RESULTS

A. Sustained Vowel Recognition:

Each speaker's reference data on the paper tape is transferred to the system's memory one at a time. 100 utterances of each vowel from all the speakers are tested. The test sequence procedure is shown in the block diagram of Figure 3. The results of the experiment are shown in Table-1. The score is cent percent if the reference and the test vectors are from the same speaker; otherwise it is 96.2% on an average taken over the speakers. The recognition score remains unaltered if the SNR is greater than 15 dB. The noise added to the signal is derived from a random noise source.

Fig. 3: Block diagram of the recognition cycle. [Figure not reproduced.]
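The decision rule of equations (1) and (2) can be sketched as follows (a toy illustration; the feature vectors and class labels below are invented, not data from the experiments):

```python
def sq_dist(x, y):
    """Squared Euclidean distance d^2 = ||x - y||^2, as in Eq. (1)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def nndr(references, test):
    """references: list of (class_label, feature_vector) pairs.
    Returns the label of the reference at minimum squared distance,
    as in Eq. (2)."""
    return min(references, key=lambda ref: sq_dist(ref[1], test))[0]

# Toy design set: two classes with a few reference samples each.
refs = [("/a/", [0.9, 0.2, -0.3, 0.1]),
        ("/a/", [0.8, 0.3, -0.2, 0.0]),
        ("/i/", [0.1, -0.6, 0.4, 0.5]),
        ("/i/", [0.2, -0.5, 0.5, 0.4])]
label = nndr(refs, [0.85, 0.25, -0.25, 0.05])
```

Note that no per-class mean or variance is estimated; the rule consults only the individual reference samples, which is why it needs no training beyond storing them.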
B. Vowel Recognition including Diphthongs:

The experimental procedure described in para IVA is repeated, and the results are shown in Table-2. The score is 100% if the reference and test vectors are from the same speaker; otherwise it is 89.9% on an average taken over the speakers.

C. Digit Recognition:

The experimental procedure described in para IVA is repeated, but the number of utterances per digit per speaker is 50. The results of the experiments are shown in Tables 3 to 6. The recognition scores for the languages Hindi, Telugu, English, Kannada and Tamil respectively are 86.8%, 87.6%, 88.1%, 90.7% and 92.2% if the reference and test vectors are from the same speaker; otherwise they are 80.6%, 81.1%, 81.9%, 82.3% and 83.6% on an average taken over the speakers. Tables 3 to 4 are for the Tamil language, which shows the highest recognition rate, and Tables 5 to 6 are for the Hindi language, which shows the lowest recognition rate.

The same experiment is repeated using Fortran IV programming and an IBM 360 computer for the language Tamil only. In this case the number of reference samples per class is 16. Four utterances per speaker per class from one female and three male speakers were taken for reference. The number of test utterances per class per speaker is 10. The recognition rate has gone up to 98.1% on an average taken over the speakers.

V. DISCUSSION ON RESULTS

In the case of sustained vowel recognition under the additive white noise condition, it is interesting to note that the distances between classes increase uniformly, while the minimum distance is still retained to the class that is correctly classified under the no-additive-white-noise condition.

In the case of digit recognition, the recognition score looks to be better for the languages English, Kannada and Tamil. This may be because the speakers are of Tamil origin and all the speakers are well conversant with English and Kannada. It may also be because, for the languages Kannada and Tamil, all the digit words end with the same character, namely /u/, and hence the last character is redundant. As there is a possibility of the duration of any digit word utterance exceeding 409.6 msec, the loss of information at the end may not affect the recognition capability for the Tamil and Kannada digit words. But this is not the case in the other languages.

In the above discussion, it is mentioned that the NNDR works better than the minimum distance to mean criterion. The reason is that the NNDR needs only well separated clusters in the feature space. The situation where the minimum distance to mean criterion fails, even though the clusters are well separated, can be explained by referring to Fig. 4. Let A1 and A2 be two different cluster areas belonging to classes C1 and C2, with unequal variances and unimodal Gaussian distributions, in a two dimensional feature space. The minimum distance to mean (centre of the cluster) criterion always misclassifies the samples falling in the shaded region. But the NNDR always classifies correctly, even if a sample falls in the shaded region, since it considers only the minimum distance to the individual samples in the clusters.

Figure 4: Cluster areas for classes C1 and C2 in a two dimensional feature space. [Figure not reproduced.]

VI. CONCLUSION

This study shows the importance of the NNDR together with the autocorrelation feature in speech recognition. If reference samples that fall only on the periphery of the clusters, but are well separated on the periphery, are selected, the recognition performance can be improved and the memory size needed to store the references can also be considerably reduced.

ACKNOWLEDGEMENT

The authors wish to thank Prof. B.S. Ramakrishna, Dr. V.V.S. Sarma and Mr. T.V. Ananthapadmanabha for many useful discussions.

Table-1: Vowel recognition (sustained), reference and test vectors from different speakers. [Table entries not reproduced.]
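The failure of the minimum distance to mean criterion described in Section V (Fig. 4) can be sketched numerically as follows (invented one-dimensional data; the cluster values are chosen only to reproduce the shaded-region situation with unequal variances):

```python
def mean(xs):
    return sum(xs) / len(xs)

def to_mean_label(clusters, x):
    """Assign x to the class whose cluster mean (centre) is nearest."""
    return min(clusters, key=lambda c: abs(mean(clusters[c]) - x))

def nn_label(clusters, x):
    """Assign x to the class of the nearest individual sample (NNDR)."""
    return min(clusters, key=lambda c: min(abs(s - x) for s in clusters[c]))

# C1 has a wide spread (large variance), C2 a narrow one.
clusters = {"C1": [0.0, 4.0, 10.0], "C2": [14.0, 16.0]}
x = 10.5  # lies on C1's periphery, i.e., in the shaded region of Fig. 4

mean_rule = to_mean_label(clusters, x)  # nearest centre is C2's (wrong)
nn_rule = nn_label(clusters, x)         # nearest sample is C1's (right)
```

Here x is only 0.5 away from a C1 sample but 5.8 away from C1's centre and 4.5 from C2's centre, so the distance-to-mean rule picks C2 while the NNDR picks C1, which is the behaviour the discussion describes.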
Table-2: Confusion matrix for vowels including diphthongs, different speaker reference. [Table entries not reproduced.]

Table-3: Confusion matrix for digits (Tamil), same speaker reference. [Table entries not reproduced.]

Table-4: Confusion matrix for digits (Tamil), different speaker reference. [Table entries not reproduced.]

Table-5: Confusion matrix for digits (Hindi), same speaker reference. [Table entries not reproduced.]

Table-6: Confusion matrix for digits (Hindi), different speaker reference. [Table entries not reproduced.]

Figure 1: Noise in dB vs normalized autocorrelation coefficients R(2) to R(5). [Figure not reproduced.]

REFERENCES

1. G.M. White and R.B. Neely, 'Speech recognition experiments with linear prediction, bandpass filtering and dynamic programming', IEEE Trans. Acoust., Speech, and Signal Proc., Vol. ASSP-24, No. 2, pp. 183-188, April 1976.
2. M.R. Sambur and L.R. Rabiner, 'A statistical decision approach to the recognition of connected digits', IEEE Trans. Acoust., Speech, and Signal Proc., Vol. ASSP-24, No. 6, Dec. 1976.
3. R.F. Purton, 'Speech recognition using autocorrelation analysis', IEEE Trans. Audio Electroacoust., Vol. AU-16, p. 235, 1968.
4. W. Bezdel and J.S. Bridle, 'Speech recognition using zero-crossing measurements and sequence information', Proc. Inst. Elec. Eng., Vol. 116, pp. 617-623, 1969.
5. F. Itakura, 'Minimum prediction residual principle applied to speech recognition', IEEE Trans. Acoust., Speech, and Signal Proc., Vol. ASSP-23, pp. 67-72, Feb. 1975.
6. A.H. Gray Jr. and J.D. Markel, 'Distance measures for speech processing', IEEE Trans. Acoust., Speech, and Signal Proc., Vol. ASSP-24, No. 5, pp. 380-391, Oct. 1976.
7. R.O. Duda and P.E. Hart, 'Pattern classification and scene analysis', John Wiley and Sons, New York, 1973.
8. J. Makhoul, 'Linear prediction: A tutorial review', Proc. IEEE, Vol. 63, No. 4, pp. 561-580, April 1975.
9. D.R. Reddy, 'Speech recognition by machine', Proc. IEEE, Vol. 64, p. 501, April 1976.