2E2.8 International Conference on Information, Communications and Signal Processing, ICICS '97, Singapore, 9-12 September 1997

HMM BASED FAST KEYWORD SPOTTING ALGORITHM WITH NO GARBAGE MODELS

S. Sunil, Supriyo Palit and T.V. Sreenivas
CEDT, Indian Institute of Science, Bangalore 560012, India.
E-mail: ssunil@cedt.iisc.ernet.in

Abstract

The problem of discriminating keyword and non-keyword speech, which is important in wordspotting applications, is addressed here. We show that garbage models cannot reduce both rejection and false alarm rates simultaneously. To achieve this we propose a new scoring and search method for HMM based wordspotting without garbage models. This is a simple forward search method which incorporates duration modelling of the keyword for efficient discrimination of keyword and non-keyword speech. The method is computationally fast, which makes it suitable for real-time implementation. Results are reported on a speaker independent database containing 10 keywords embedded in 150 carrier sentences.

1 Introduction

Keyword spotting (KWS) using hidden Markov models (HMMs) is based on continuous speech recognition (CSR) algorithms [1,2] in which HMMs are used to model both keyword speech and non-keyword speech (comprising background speech and non-speech sounds). The performance of these methods relies heavily on the performance of the non-keyword speech models, also known as "garbage" models. It has been shown [1,2] that by explicitly training garbage HMMs to model the entire background environment, the performance of the keyword spotting system can be improved. In most cases, garbage and keyword HMMs are connected as a network of HMMs and a frame synchronous network (FSN) search method [3] is used to decode the test utterance into a sequence of garbage and keyword models. This approach is computationally complex and would be difficult for real-time implementation. The main difference between KWS algorithms is the way in which garbage models are defined and scoring methods are formulated.

2 Drawbacks of garbage modelling

The main aim of reducing the rejection rate and the false alarm rate simultaneously cannot be achieved with garbage models. For a keyword to be recognised when it occurs in an utterance, we require

    P(O^k \mid \lambda_i) > P(O^k \mid \lambda_g)    (1)

where \lambda_i, 1 \le i \le M, is the model of keyword i, O^k is an utterance of keyword i, \lambda_g is the garbage HMM and M is the number of keywords in the vocabulary. To prevent false alarms, we need

    P(O^g \mid \lambda_g) > P(O^g \mid \lambda_i), \quad 1 \le i \le M    (2)

where O^g is a garbage utterance. Since the set of non-keyword speech is very large compared to the keyword set, the scores of keyword and garbage utterances under the garbage model will be comparable, i.e.,

    P(O^k \mid \lambda_g) \approx P(O^g \mid \lambda_g)    (3)

Thus, the following relation becomes apparent from the above inequalities:

    P(O^k \mid \lambda_i) > P(O^k \mid \lambda_g) \approx P(O^g \mid \lambda_g) > P(O^g \mid \lambda_i)    (4)

The first inequality determines the number of rejections and the second determines the number of false alarms. The ML training of \lambda_i and \lambda_g [1,2] does not always assure the above inequalities. Even the on-line garbage modelling technique [4] could achieve low rejection rates only at the expense of increased false alarms. Relation (4) also shows that garbage models cannot reduce rejection and false alarm rates simultaneously: if we try to reduce P(O^k \mid \lambda_g) to achieve a lower rejection rate (inequality 1), this also reduces P(O^g \mid \lambda_g) (relation 3), which increases the false alarms (inequality 2), and vice versa. Thus both the rejection and false alarm rates can be improved only by increasing the separation between P(O^k \mid \lambda_i) and P(O^g \mid \lambda_i), which shows that garbage models are not crucial for improving the performance of keyword spotters. Based on these arguments we propose a new technique of keyword spotting without garbage models.

3 No garbage model technique

Let K_i, 1 \le i \le M, be the set of M distinct keywords and let \lambda_i be the corresponding keyword models. We denote the observation sequence corresponding to the utterance as O = O_1 O_2 \cdots O_T, T being the length of the utterance. Let S_i(t_1, t_2) be the score for a keyword i, 1 \le i \le M, to occur in the observation interval from t_1 to t_2, which is formulated as

    S_i(t_1, t_2) = \frac{1}{t_2 - t_1} \log P_v(O_{t_1}, \ldots, O_{t_2} \mid \lambda_i) + \log P_d(t_2 - t_1 \mid i)    (5)

where P_v is the Viterbi likelihood score for the keyword model \lambda_i and P_d is the duration probability of keyword i. For the observation sequence O_1, \ldots, O_T,

    [i^*, t_1^*, t_2^*] = \arg\max_{i, t_1, t_2} S_i(t_1, t_2)    (6)

is determined over 1 \le i \le M, 1 \le t_1 \le T, 1 \le t_2 \le T, such that d_i^{min} \le (t_2 - t_1) \le d_i^{max}, where d_i^{min} and d_i^{max} are the limits on the duration of keyword i, estimated a priori. The keyword i^* which maximizes S_i(t_1, t_2) is the recognised keyword, and t_1^* and t_2^* are its starting and ending locations.
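To make the scoring concrete, the following is a minimal Python sketch of equation (5), assuming a discrete HMM stored as plain log-probability tables; the names log_A, log_B, log_pi and the dictionary layout are illustrative, not from the paper. The duration term log_pd is the Gamma penalty of Section 3.2, which is sketched after that section.

```python
def viterbi_log_likelihood(log_A, log_B, log_pi, obs):
    """Log Viterbi probability log Pv(q, O | lambda) for a discrete HMM.

    log_A[j][k]: log transition probability from state j to state k,
    log_B[j][o]: log probability of emitting VQ symbol o in state j,
    log_pi[j]:   log initial state probability,
    obs:         list of VQ codebook indices.
    """
    n = len(log_pi)
    # Initialization: delta_1(j) = log pi_j + log b_j(O_1)
    delta = [log_pi[j] + log_B[j][obs[0]] for j in range(n)]
    # Recursion: delta_t(k) = max_j [delta_{t-1}(j) + log a_jk] + log b_k(O_t)
    for o in obs[1:]:
        delta = [max(delta[j] + log_A[j][k] for j in range(n)) + log_B[k][o]
                 for k in range(n)]
    # Termination: an LR model would usually be required to end in its last
    # state; taking the max over all states is a simplification of this sketch.
    return max(delta)


def score(hmm, log_pd, obs, t1, t2):
    """S_i(t1, t2) of equation (5): time-normalized log Viterbi probability
    plus the word duration penalty log Pd(t2 - t1 | i)."""
    d = t2 - t1                 # duration as defined in equation (5)
    segment = obs[t1 - 1:t2]    # O_t1 ... O_t2 (time indices are 1-based)
    log_pv = viterbi_log_likelihood(hmm["log_A"], hmm["log_B"], hmm["log_pi"],
                                    segment)
    return log_pv / d + log_pd(d)
```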
3.1 Time Normalization

The log Viterbi probability, \log P_v(q_{t_1}, \ldots, q_{t_2}, O_{t_1}, \ldots, O_{t_2} \mid \lambda_i), has a large value for short observation lengths and decreases monotonically for large observation lengths, as shown in Figure 1. The log Viterbi probability being negative, this bias can be reduced by dividing it by the length of the observation sequence, (t_2 - t_1). But experiments showed that although time normalization works well for large observation lengths (i.e., the score remains more or less unbiased), it is unable to properly unbias the log Viterbi probability for short observation lengths. This is because, for very short sequences, the log Viterbi probability is so high that even time normalization is ineffective.

3.2 Duration Modelling

To reduce the bias of the log Viterbi probability for short observation lengths, a word duration penalty is introduced. The mean m_i and variance \sigma_i^2, 1 \le i \le M, of the duration of each keyword K_i are obtained from the training data. We fit a probability distribution P_d using the mean and variance as parameters. The Gamma distribution is used as the probability density function, which has a range from d = 0 to \infty, d being the duration random variable. The Gamma density falls off quite sharply to 0 as d decreases to 0, whereas it goes asymptotically to 0 as d tends to \infty. The word duration penalty \log P_d(t_2 - t_1 \mid i) in equation (5) removes the bias of the log Viterbi probability for short sequences without significantly affecting the longer sequences. This can be seen from Figure 2, which plots (a) the time normalized log Viterbi probability and (b) the score S_i(t_1, t_2), for the same value of t_1 (= 43), the same keyword model and the same keyword utterance. In Figure 2(a), the time normalized log Viterbi probability has a local maximum at t_2 = 73 (which is the correct ending point), but this maximum is overshadowed by a maximum for a much shorter duration, at t_2 = 56 (the duration being 14 observation symbols). It can be seen from Figure 2(b) that the score S_i(t_1, t_2) suppresses the spurious short duration peak while the maximum at t_2 = 73 is kept unaltered. Thus the word duration penalty \log P_d(t_2 - t_1 \mid i) promotes more probable durations and penalizes the less probable ones.
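The duration penalty can be built directly from the training statistics. Below is a small sketch that moment-matches a Gamma density to the observed keyword durations; the paper states only that the mean and variance are used as parameters, so the shape/scale mapping k = m^2/\sigma^2, \theta = \sigma^2/m is one natural reading, and the function names are illustrative.

```python
import math

def fit_gamma_duration(durations):
    """Fit the Gamma duration density for one keyword from training durations.

    Moment matching: mean m = k * theta and variance sigma^2 = k * theta^2
    give shape k = m^2 / sigma^2 and scale theta = sigma^2 / m.
    Returns (log_pd, m, sigma), where log_pd(d) = log Pd(d | i).
    """
    m = sum(durations) / len(durations)
    var = sum((d - m) ** 2 for d in durations) / len(durations)
    k, theta = m * m / var, var / m

    def log_pd(d):
        # log Gamma pdf: (k - 1) ln d - d / theta - k ln theta - ln Gamma(k)
        return ((k - 1) * math.log(d) - d / theta
                - k * math.log(theta) - math.lgamma(k))

    return log_pd, m, math.sqrt(var)
```

The returned mean and standard deviation also supply the search limits of Section 3.3, d_i^{min} = m_i - 3\sigma_i and d_i^{max} = m_i + 3\sigma_i.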
3.3 Algorithm

The complete algorithm is described below; a sketch of the full search follows the steps.

1. Get the observation sequence O_1, \ldots, O_T for a particular speech utterance.
2. Start with i = 1, i being the keyword index.
3. Set t_1 = 1.
4. If t_1 + d_i^{min} - 1 > T, go to step 8.
5. Calculate S_i(t_1, t_2) from the observation sequence O_{t_1}, \ldots, O_{t_2}, where t_1 + d_i^{min} - 1 \le t_2 \le \min[T, t_1 + d_i^{max} - 1].
6. If S_i(t_1, t_2) is the maximum over all the keyword scores considered so far, store the keyword index i and the time instants t_1 and t_2.
7. Set t_1 = t_1 + 1 and go to step 4.
8. Set i = i + 1; if i \le M, go to step 3, M being the number of keywords.

After termination, the keyword i^* which gave the maximum score S_{i^*}(t_1^*, t_2^*), together with its starting (t_1^*) and ending (t_2^*) locations, is obtained. A threshold score S_i^T, 1 \le i \le M, for each keyword is estimated a priori using Bayes' minimum error criterion. If S_{i^*}(t_1^*, t_2^*) \ge S_{i^*}^T, then the recognized keyword is i^*; otherwise no keyword is present.

The above algorithm can be visualized as searching for a keyword within a window of the observation sequence for each keyword model. The window is slid across the entire observation sequence, as shown in Figure 3. The size of the window is determined by d_i^{min} and d_i^{max}, 1 \le i \le M. The search limits are set at d_i^{min} = m_i - 3\sigma_i and d_i^{max} = m_i + 3\sigma_i. By restricting the search to within d_i^{min} and d_i^{max}, we prevent all garbage sequences outside this range from causing false alarms. For sequences within the allowed limits, the contrast between keyword and garbage is achieved by the Viterbi probability and the a priori duration weightage incorporated in the score. Thus, this method improves the separation between P(O^k \mid \lambda_i) and P(O^g \mid \lambda_i).
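The following Python sketch ties steps 1-8 together, reusing score() and the per-keyword duration statistics from the earlier sketches; the dictionary fields and threshold handling are illustrative, and the Bayes threshold estimation itself is not shown.

```python
def spot_keyword(obs, keywords, thresholds):
    """Sliding-window keyword search (steps 1-8 of Section 3.3).

    obs:        VQ symbol sequence O_1 ... O_T,
    keywords:   one dict per keyword holding the HMM log tables ("log_A",
                "log_B", "log_pi"), plus "log_pd", "d_min" and "d_max",
    thresholds: per-keyword threshold scores S_i^T.
    Returns (score, keyword index, t1, t2), or None if no keyword is present.
    """
    T = len(obs)
    best = None                                   # best (S, i, t1, t2) so far
    for i, kw in enumerate(keywords):             # steps 2 and 8: next keyword
        for t1 in range(1, T + 1):                # steps 3 and 7: slide window
            if t1 + kw["d_min"] - 1 > T:          # step 4: window exceeds utterance
                break
            # step 5: score every admissible ending point t2
            for t2 in range(t1 + kw["d_min"] - 1,
                            min(T, t1 + kw["d_max"] - 1) + 1):
                s = score(kw, kw["log_pd"], obs, t1, t2)
                # step 6: keep the best-scoring hypothesis
                if best is None or s > best[0]:
                    best = (s, i, t1, t2)
    # Final decision: accept i* only if its score clears the threshold S_i^T
    if best is not None and best[0] >= thresholds[best[1]]:
        return best
    return None
```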
4 Experiment

Experiments are conducted on a ten keyword database (Indian city names, with at most one keyword per sentence), with each keyword modelled by a 5 state left-to-right (LR) discrete HMM. The feature vector includes 10 weighted cepstral coefficients, energy, the ratio of residual energy to the zeroth autocorrelation coefficient, zero crossing rate and delta energy. Vector quantization is done using a 256 size codebook designed using the LBG algorithm. The recognition results and duration statistics are shown in Table 1 and Table 2 respectively. It is found from the experiments that the insertions usually have shorter durations than the expected keyword durations. It can be seen from Table 3 that by restricting the lower search limit d_i^{min}, keeping the upper limit d_i^{max} and the thresholds fixed, the false alarms are reduced significantly. Figure 4 gives the system performance for two values of d_i^{min} as a function of threshold.

5 Conclusion

The major contribution of this paper has been to show that good keyword spotting performance can be obtained from an HMM system using no garbage model. From the results we can see that the no-garbage-model approach gives reasonably good recognition accuracy at a low false alarm rate. These results can be compared with those of Rose [2], who used very elaborate modelling of non-keyword speech. The technique is also computationally fast, which makes it possible to do wordspotting in real time. The performance can be further improved by using discriminative techniques of keyword training to increase the separation between P(O^k \mid \lambda_i) and P(O^g \mid \lambda_i).

References

[1] Wilson et al. (1990). Automatic recognition of keywords in unconstrained speech using hidden Markov models. IEEE Trans. on ASSP 38, 1870-1878.
[2] R. C. Rose (1995). Keyword detection in conversational speech utterances using hidden Markov model based continuous speech recognition. Comp. Speech and Lang. 9, 309-333.
[3] Lee et al. (1989). A frame synchronous network search algorithm for connected word recognition. IEEE Trans. on ASSP 37, 1649-1658.
[4] Bourlard (1994). Optimizing recognition and rejection performance in word spotting systems. Proc. of ICASSP, Vol. 1, 373-376.

Figure 1. Variation of log Viterbi probability with time.
Figure 2. Scoring scheme: (a) time normalized log Viterbi probability vs. time; (b) time normalized log Viterbi probability and duration penalty vs. time.
Figure 3. Search method.
Figure 4. System performance for two values of d_i^{min} as a function of threshold.

Table 1. Results of the no garbage model keyword spotting technique.

Data Set | No. of sentences | Detection Rate (%) | False Alarm Rate (FA/kw/hr)
Train    | 250              | 96.80              | 2.66
Test     | 150              | 69.33              | 4.80

Table 2. Duration statistics.

Data Set | No. of sentences | Mean error in start pt. | Error variance in start pt. | Mean error in end pt. | Error variance in end pt.
Train    | 250              | 2                       | 5.52                        | 2                     | 11.33
Test     | 150              | 2                       | 7.19                        | 2                     | 10.88

Table 3. Effect of restricting the lower search limit d_i^{min} (upper limit and thresholds fixed).

d_i^{min}       | Recognition rate (%) | False Alarm Rate (FA/kw/hr)
m_i             | 65.33                | 4.20
m_i - \sigma_i  | 68.00                | 4.80
m_i - 2\sigma_i | 68.67                | 5.40
m_i - 3\sigma_i | 68.00                | 9.00
m_i - 4\sigma_i | 68.00                | 9.00
m_i - 5\sigma_i | 68.00                | 9.00