NEAREST NEIGHBOUR DECISION RULE FOR VOWEL AND DIGIT RECOGNITION

T.K. Raja and B. Yegnanarayana
Department of Electrical Communication Engineering
Indian Institute of Science
Bangalore-560012, India

ABSTRACT

Minimum distance to mean is usually used as a classification rule in speech and speaker recognition studies. In this paper it is shown that the nearest neighbour decision rule gives a significant improvement in classification score for vowel and digit recognition schemes. Autocorrelation coefficients of lags two to five sampling instants form the feature vector. Four samples per class have been used. Minimum squared Euclidean distance of the test vector from the nearest reference vector is chosen as the classification rule. For sustained vowels the recognition score is cent percent; for the same feature the minimum distance to mean gives a 70% recognition score. When the reference samples of a given speaker are tested over the vowels spoken by different speakers (up to 10), this scheme gives a recognition score of about 95%. For digits, without any time warping, recognition scores of about 86% to 92% are obtained.

I. INTRODUCTION

The success of speech recognition by machine depends on the selection of an appropriate feature and the classification algorithm. Ideally, the classes should be widely separated in the feature space, while the feature selected should be least affected by environmental conditions and by inter- and intra-speaker variations. Several studies on speech recognition, in particular vowel and digit recognition schemes, mainly use the spectral information. This spectral information may be formants or the linear prediction coefficients (LPC's) [1,2]. Autocorrelation coefficients [3] of signals derived from filter banks and the zero crossing rate (ZCR) [4] have also been used. Most of the speech recognition schemes use the minimum Euclidean distance to means as the criterion for classification. Although computationally simple, this criterion assumes an equal-variance, unimodal Gaussian distribution for the classes. In this paper, we shall discuss the performance of vowel and digit recognition schemes based on the autocorrelation coefficients as the feature and the nearest neighbour decision rule (NNDR) [7] for classification. For more information on speech recognition systems, refer to D.R. Reddy's paper [9].

II. THE FEATURE AND THE CLASSIFICATION RULE

It is well known that the autocorrelation and the power spectrum of a signal form a Fourier transform pair. Since the speech information is characterised by the spectral envelope, it appears logical that the autocorrelation coefficients could as well be used as a feature. The LPC's are derived from the autocorrelation coefficients by solving the autocorrelation normal equations [8]. Hence, instead of using the LPC's and other features derived from LP analysis, one can as well use the autocorrelation coefficients directly as the feature.

In this study we have used the normalized autocorrelation coefficients of lags two to five sampling instants. The normalization is necessary to normalize the gain variations in the speech signal. This feature is also found to be less susceptible to additive white noise and to inter- and intra-speaker variations. Figure 1 shows the autocorrelation feature for the vowel /e/ under additive noise: the feature approaches its true value when the signal-to-noise ratio (SNR) is greater than 12 dB, and this character remains the same for all the vowels.

Any general classification algorithm requires the weight vectors of the discriminant functions to be determined. The efficient methods of estimating the weight vectors are based on the availability of a large number of training samples to obtain the convergence of the weight vectors, and the procedures involved are computationally expensive. The Mahalanobis distance criterion is equally complex, as it requires estimates of the sample mean and variance of the feature, which also need a large number of training samples.
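As a concrete sketch, the normalized autocorrelation feature described above can be computed as follows. This is a minimal illustration, not the paper's implementation: it assumes one windowed 25.6 msec frame of 256 samples at 10 kHz, and the function name and the synthetic test tone are ours.

```python
import numpy as np

def autocorrelation_feature(frame, lags=(2, 3, 4, 5)):
    """Normalized autocorrelation coefficients R(k)/R(0) for the given lags.

    Dividing by R(0) (the frame energy) removes the gain variations in the
    speech signal, which is the normalization the text calls for.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    r0 = float(np.dot(frame, frame))  # R(0): energy of the frame
    return np.array([np.dot(frame[:n - k], frame[k:]) / r0 for k in lags])

# One 25.6 msec frame at a 10 kHz sampling rate is 256 samples.
t = np.arange(256) / 10000.0
frame = np.hanning(256) * np.sin(2 * np.pi * 300.0 * t)  # synthetic 300 Hz tone
feature = autocorrelation_feature(frame)                 # 4-dimensional vector
```

For the digit scheme described later, the same four coefficients would be computed for 16 successive frames and concatenated into a 64-dimensional vector.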
Several other classification rules, based on the Itakura measure [5] and the log spectral distance [6], have also been reported. On the other hand, the minimum Euclidean distance to mean criterion is simple, but it requires the samples to be drawn from unimodal, equal-variance Gaussian distributions. In all these cases the features belonging to different classes have to be well separated in the feature space. The NNDR can be thought of as a compromise between these two extremes: it only assumes that the samples form well separated clusters in the feature space. In addition, the NNDR does not require the tedious computational procedures for estimating any of the parameters already discussed. The basic principle is that it assigns the test sample (or test set) to the class to which its nearest neighbour among the design samples (or design set) belongs. The mathematical description of this technique is as follows. Let X_i be the ith reference sample, belonging to class C_j, and let X_t be the test sample, of the same dimension as X_i. The squared Euclidean distance between them is

    d_ij^2 = || X_i - X_t ||^2                                    (1)

If

    d_kl^2 = (d_ij^2)_min                                         (2)

then the test sample X_t is assigned to the class C_l containing the kth reference sample.

III. EXPERIMENT

The testing of this recognition scheme was spread over a period of nine months. Experiments have been conducted to evaluate the autocorrelation feature, with the NNDR as the classification algorithm, using the HP Fourier analyser system 5451B. The speech data is entered into the system through its built-in A/D conversion unit. The Shure microphone used as the transducer is kept at a distance of not more than 3 inches from the mouth. Both design and testing are done in the computer room environment; the background noise is about 60 dB. Throughout this study a sampling rate of 10 kHz and a Hanning window are used. The number of speakers who participated in this experiment is 10 (8 male + 2 female).

For sustained vowel (/a/, /i/, /u/, /e/ and /o/) recognition, the autocorrelation coefficients (R(2) to R(5)) are computed from the stable portion of the signal over a duration of 25.6 msec; the dimensionality of the feature space is thus four. Only four samples per class for all the vowels per speaker are computed and stored on paper tape. For digit recognition, the autocorrelation coefficients (R(2) to R(5)) are computed in each frame of duration 25.6 msec for 16 successive frames, so that the dimensionality of the feature space is 64. In this case only 3 samples per class for all the 10 classes per speaker are stored on paper tape. In all these cases the DC power, which varies from utterance to utterance, is removed from the overall power spectrum using the FFT and IFFT. The sequence of the feature extraction procedure is shown in the block diagram of Figure 2.

Fig. 2: Block diagram of feature extraction cycle.

IV. RESULTS

A. Sustained Vowel Recognition:

Each speaker's reference data on the paper tape is transferred to the system's memory at a time. 100 utterances of each vowel from all the speakers are tested. The test sequence procedure is shown in the block diagram of Figure 3.

Fig. 3: Block diagram of recognition cycle.

CH1285-6/78/0000-0731$00.75 (c) 1978 IEEE
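The decision rule of Eqs. (1) and (2), and its contrast with the minimum-distance-to-mean criterion, can be sketched as follows. This is a toy illustration with made-up two-dimensional clusters, not the paper's data; the function names are ours.

```python
import numpy as np

def nndr_classify(x_t, references, labels):
    """Nearest neighbour decision rule (Eqs. (1)-(2)): assign the test
    vector x_t to the class of the reference sample at minimum squared
    Euclidean distance."""
    refs = np.asarray(references, dtype=float)
    d2 = np.sum((refs - np.asarray(x_t, dtype=float)) ** 2, axis=1)
    return labels[int(np.argmin(d2))]

def mean_classify(x_t, references, labels):
    """Minimum squared distance to the class mean, for comparison."""
    x_t = np.asarray(x_t, dtype=float)
    refs = np.asarray(references, dtype=float)
    best = None
    for c in sorted(set(labels)):
        mean = refs[[i for i, l in enumerate(labels) if l == c]].mean(axis=0)
        d2 = float(np.sum((mean - x_t) ** 2))
        if best is None or d2 < best[1]:
            best = (c, d2)
    return best[0]

# A tight cluster C1 and a widely spread cluster C2 (cf. Fig. 4):
refs = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],    # C1
        [3.0, 0.0], [9.0, 0.0], [3.0, 3.0]]    # C2
labels = ["C1", "C1", "C1", "C2", "C2", "C2"]
x_t = [2.5, 0.0]  # test point drawn from the periphery of C2

print(mean_classify(x_t, refs, labels))  # "C1": misclassified by the mean rule
print(nndr_classify(x_t, refs, labels))  # "C2": correctly classified by NNDR
```

The toy clusters mirror the situation of Fig. 4 discussed below: the centre of the widely spread class lies far from its periphery, so the mean rule misassigns a peripheral sample that the NNDR handles correctly.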
The results of the experiment are shown in Table 1. The recognition score is cent percent if the reference and the test vectors are from the same speaker; otherwise it is 96.2% on an average taken over the speakers. The recognition score remains unaltered if the SNR is greater than 15 dB; the noise, derived from a random noise source, is added to the signal.

B. Vowel Recognition including Diphthongs:

When the diphthongs /AI/ and /AU/ (in Indian languages they are called vowels) are included in the vowel recognition scheme already discussed, the contours of the autocorrelation coefficients (R(2) to R(5)) over four successive frames, each of duration 25.6 msec, are used. The dimensionality of the feature space is 16 and the number of classes is seven. The diphthongs are articulated in sequence, first by /A/ and then by /I/ or /U/ as the case may be, and hence it is necessary to consider the initial portions of the utterances. Only four samples per class for all the vowels per speaker are computed and stored on paper tape. The experimental procedure described in para IVA is repeated, and the results are shown in Table 2: the score is 100% if the reference and test vectors are from the same speaker; otherwise it is 89.9% on an average taken over the speakers.

C. Digit Recognition:

The experimental procedure described in para IVA is repeated, but the number of utterances per digit per speaker is 50. The results of the experiments are shown in Tables 3 to 6. The recognition scores for the languages Hindi, Telugu, English, Kannada and Tamil are 86.8%, 87.6%, 88.1%, 90.7% and 92.2% respectively if the reference and test vectors are from the same speaker; otherwise they are 80.6%, 81.1%, 81.9%, 82.3% and 83.6% on an average taken over the speakers. Tables 3 and 4 are for the Tamil language, which shows the highest recognition rate, and Tables 5 and 6 are for the Hindi language, which shows the lowest recognition rate.

The same experiment is repeated using Fortran IV programming on an IBM 360 computer for the language Tamil only. In this case the number of reference samples per class is 16: four utterances per speaker per class from one female and three male speakers were taken for reference. The number of test utterances per class per speaker is 10. The recognition rate has gone up to 98.1% on an average taken over the speakers.

V. DISCUSSION ON RESULTS

In the case of sustained vowel recognition under the additive white noise condition, it is interesting to note that the distances between classes increase uniformly, while still retaining the minimum distance to the class that is correctly classified under the no-noise condition.

In the case of digit recognition, the recognition score looks better for the languages English, Kannada and Tamil. This may be due to the fact that the speakers are of Tamil origin and are also well conversant with English and Kannada. It may also be due to the fact that, for the languages Kannada and Tamil, all the digit words end with the same character, namely /u/, and hence the last character is redundant. Since there is a possibility of the duration of a digit word utterance exceeding 409.6 msec, the loss of information at the end of the words may not affect the recognition capability for Tamil and Kannada; but this is not the case for the other languages.

In the above discussion it is mentioned that the NNDR works better than the minimum distance to mean criterion. The reason is that the NNDR needs only well separated clusters in the feature space. The situation where the minimum distance to mean criterion fails even when the clusters are well separated can be explained by referring to Fig. 4. Let A1 and A2 be two different cluster areas belonging to classes C1 and C2. If the variances are unequal, with unimodal Gaussian distributions, the minimum distance to mean (centre of the cluster) criterion always misclassifies the samples falling in the shaded region. But the NNDR always classifies correctly even if a sample falls in the shaded region, since it considers only the minimum distance to the individual samples in the clusters.

Fig. 4: Cluster areas for classes C1 and C2 in two-dimensional feature space.

VI. CONCLUSION

This study shows the importance of the NNDR together with the autocorrelation feature in speech recognition. If reference samples that fall only on the periphery of the clusters, but are well separated there, are selected, the recognition performance can be improved and the memory size needed to store the references can also be considerably reduced.

ACKNOWLEDGEMENT

The authors wish to thank Prof. B.S. Ramakrishna, Dr. V.V.S. Sarma and Mr. T.V. Ananthapadmanabha for many useful discussions.
Table 1: Vowel recognition (sustained); reference and test vectors from different speakers.
Table 2: Confusion matrix for vowels including diphthongs; different-speaker reference.
Table 3: Confusion matrix for digits (Tamil); same-speaker reference.
Table 4: Confusion matrix for digits (Tamil); different-speaker reference.
Table 5: Confusion matrix for digits (Hindi); same-speaker reference.
Table 6: Confusion matrix for digits (Hindi); different-speaker reference.

Figure 1: Noise in dB vs. normalized autocorrelation coefficients R(2) to R(5).

REFERENCES

1. G.M. White and R.B. Neely, "Speech recognition experiments with linear prediction, bandpass filtering and dynamic programming," IEEE Trans. Acoust., Speech, Signal Proc., Vol. ASSP-24, No. 2, pp. 183-188, April 1976.
2. M.R. Sambur and L.R. Rabiner, "A statistical decision approach to the recognition of connected digits," IEEE Trans. Acoust., Speech, Signal Proc., Vol. ASSP-24, No. 6, Dec. 1976.
3. R.F. Purton, "Speech recognition using autocorrelation analysis," IEEE Trans. Audio Electroacoust., Vol. AU-16, p. 235, 1968.
4. W. Bezdel and J.S. Bridle, "Speech recognition using zero crossing measurements and sequence information," Proc. Inst. Elec. Eng., Vol. 116, pp. 617-623, 1969.
5. F. Itakura, "Minimum prediction residual principle applied to speech recognition," IEEE Trans. Acoust., Speech, Signal Proc., Vol. ASSP-23, pp. 67-72, Feb. 1975.
6. A.H. Gray, Jr. and J.D. Markel, "Distance measures for speech processing," IEEE Trans. Acoust., Speech, Signal Proc., Vol. ASSP-24, No. 5, pp. 380-391, Oct. 1976.
7. R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, John Wiley and Sons, New York, 1973.
8. J. Makhoul, "Linear prediction: A tutorial review," Proc. IEEE, Vol. 63, No. 4, pp. 561-580, April 1975.
9. D.R. Reddy, "Speech recognition by machine: A review," Proc. IEEE, Vol. 64, p. 501, April 1976.