2E2.8
International Conference on
Information, Communications and Signal Processing
ICICS '97
Singapore, 9-12 September 1997
HMM BASED FAST KEYWORD SPOTTING ALGORITHM
WITH NO GARBAGE MODELS
S. Sunil, Supriyo Palit and T.V. Sreenivas
CEDT, Indian Institute of Science, Bangalore 560012, India.
E-mail: ssunil@cedt.iisc.ernet.in
Abstract: The problem of discriminating keyword and non-keyword speech, which is important in wordspotting applications, is addressed here. We have shown that garbage models cannot reduce both rejection and false alarm rates simultaneously. To achieve this, we have proposed a new scoring and search method for HMM based wordspotting without garbage models. This is a simple forward search method which incorporates duration modelling of the keyword for efficient discrimination of keyword and non-keyword speech. The method is computationally fast, which makes it suitable for real-time implementation. Results are reported on a speaker independent database containing 10 keywords embedded in 150 carrier sentences.
1 Introduction
Keyword spotting (KWS) using hidden Markov models (HMM) is based on continuous speech recognition (CSR) algorithms [1,2] where HMMs are used to model both keyword speech and non-keyword speech (comprising background speech and non-speech sounds). The performance of these methods relies heavily on the performance of the non-keyword speech models, also known as "garbage" models. It has been shown [1,2] that by explicitly training garbage HMMs to model the entire background environment, the performance of the keyword spotting system can be improved. In most cases, garbage and keyword HMMs are connected as a network of HMMs and a frame synchronous network (FSN) search method [3] is used to decode the test utterance into a sequence of garbage and keyword models. This approach is computationally complex and would be difficult for real-time implementation. The main difference between KWS algorithms is the way in which garbage models are defined and scoring methods are formulated.
2 Drawbacks of garbage modelling

The main aim of reducing the rejection rate and the false alarm rate simultaneously cannot be achieved with garbage models. For a keyword to be recognised when it occurs in an utterance, we require

    $P(O^k \mid \lambda_i) > P(O^k \mid \lambda_g)$    (1)

where $\lambda_i$, $1 \le i \le M$, is the HMM of keyword $i$, $O^k$ is an utterance of keyword $i$, $\lambda_g$ is the garbage HMM and $M$ is the number of keywords in the vocabulary. To prevent false alarms, we need

    $P(O^g \mid \lambda_g) > P(O^g \mid \lambda_i), \quad 1 \le i \le M$    (2)

where $O^g$ is a garbage utterance. Since the non-keyword speech set is very large compared to the keyword set, the scores of keyword and garbage utterances under the garbage model will be comparable, i.e.,

    $P(O^k \mid \lambda_g) \approx P(O^g \mid \lambda_g)$    (3)

Thus, the following relation becomes apparent from the above inequalities:

    $P(O^k \mid \lambda_i) > P(O^k \mid \lambda_g) \approx P(O^g \mid \lambda_g) > P(O^g \mid \lambda_i)$    (4)

The first inequality determines the number of rejections and the second determines the number of false alarms. The ML training of $\lambda_i$ and $\lambda_g$ [1,2] does not always assure the above inequalities. Even the on-line garbage modelling technique [4] could achieve low rejection rates only at the expense of increased false alarms. Relation (4) also shows that garbage models cannot reduce rejection and false alarm rates simultaneously, because if we try to reduce $P(O^k \mid \lambda_g)$ to achieve a lower rejection rate (inequality 1), this would also reduce $P(O^g \mid \lambda_g)$ (relation 3), which would increase the false alarms (inequality 2), and vice-versa. Thus both the rejection and false alarm rates can be improved only by increasing the separation between $P(O^k \mid \lambda_i)$ and $P(O^g \mid \lambda_i)$, which shows that garbage models are not crucial for improving the performance of keyword spotters. Based on these arguments we propose a new technique of keyword spotting without garbage models.
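To see the trade-off in relation (4) concretely, consider an illustrative set of log-likelihoods (the numbers are ours, not from the paper). Suppose a spoken keyword scores $\log P(O^k \mid \lambda_i) = -205$ and $\log P(O^k \mid \lambda_g) = -200$: the keyword is rejected, since inequality (1) fails. Retraining $\lambda_g$ to score keywords lower, say $\log P(O^k \mid \lambda_g) = -220$, removes the rejection; but by relation (3) the garbage likelihoods drop comparably, say from $\log P(O^g \mid \lambda_g) = -210$ to $-228$, which now falls below $\log P(O^g \mid \lambda_i) = -215$ and turns a correctly rejected garbage segment into a false alarm.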
3 No garbage model technique

Let $K_i$, $1 \le i \le M$, be a set of $M$ distinct keywords and let $\lambda_i$ be the corresponding keyword models. We denote the observation sequence corresponding to the utterance as $O = O_1 O_2 \cdots O_T$, $T$ being the length of the utterance. Let $S_i(t_1, t_2)$ be the score for a keyword $i$, $1 \le i \le M$, to occur in the observation interval from $t_1$ to $t_2$, which is formulated as

    $S_i(t_1, t_2) = \log P_v(O_{t_1}, \ldots, O_{t_2} \mid \lambda_i) + \log P_d(t_2 - t_1 \mid i)$    (5)

where $P_v$ is the Viterbi likelihood score and $P_d$ is the duration probability for the keyword model $\lambda_i$. For the observation sequence $O_1, \ldots, O_T$,

    $[i^*, t_1^*, t_2^*] = \arg\max S_i(t_1, t_2)$    (6)

is determined for $1 \le i \le M$, $1 \le t_1 \le T$, $1 \le t_2 \le T$ such that $d_i^{min} \le (t_2 - t_1) \le d_i^{max}$, where $d_i^{min}$ and $d_i^{max}$ are the limits of the duration of keyword $i$, estimated a priori. The keyword $i^*$ which maximizes $S_i(t_1, t_2)$ is the recognised keyword, and $t_1^*$ and $t_2^*$ are its starting and ending locations.
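As a concrete illustration of equation (5), here is a minimal Python sketch; the paper gives no implementation, so the log-domain array representation of the HMM, the helper names, and the use of discrete (vector-quantized) observation symbols as in Section 4 are our assumptions.

```python
import numpy as np
from scipy.stats import gamma

def viterbi_loglik(symbols, log_pi, log_A, log_B):
    """Log-likelihood of the best state path of a discrete-observation HMM.
    log_pi: (N,) initial log-probs, log_A: (N, N) transition log-probs,
    log_B: (N, K) emission log-probs over K VQ symbols."""
    delta = log_pi + log_B[:, symbols[0]]
    for s in symbols[1:]:
        # max-product recursion: best predecessor state, then emit symbol s
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[:, s]
    return float(np.max(delta))

def keyword_score(symbols, t1, t2, model, dur_shape, dur_scale):
    """Equation (5): log Viterbi likelihood of the segment O_{t1}..O_{t2}
    plus the log Gamma duration probability P_d(t2 - t1 | i)."""
    log_pv = viterbi_loglik(symbols[t1:t2 + 1], *model)
    log_pd = gamma.logpdf(t2 - t1, a=dur_shape, scale=dur_scale)
    return log_pv + log_pd
```

The Gamma parameters consumed here are those fitted from the training durations, as described in Section 3.2 below.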
3.1 Time Normalization

The log Viterbi probability, $\log P_v(q_{t_1}, \ldots, q_{t_2}, O_{t_1}, \ldots, O_{t_2} \mid \lambda_i)$, has a large value for short observation lengths and decreases monotonically for large observation lengths, as shown in Figure 1. The log Viterbi probability being negative, this bias can be reduced by dividing it by the length of the observation sequence, $(t_2 - t_1)$. But experiments showed that although time normalization works well for large observation lengths (i.e., the score remains more or less unbiased), it is unable to properly unbias the log Viterbi probability for short lengths. This is because, for very short length sequences, the log Viterbi probability is so high that even duration normalization is ineffective.
3.2 Duration Modelling

To reduce the bias of the log Viterbi probability for short observation lengths, a word duration penalty is introduced. The mean $m_i$ and variance $\sigma_i^2$, $1 \le i \le M$, of the duration of each keyword $K_i$ are obtained from the training data. We fit a probability distribution $P_d$ using the mean and variance as parameters. The Gamma distribution is used as the probability density function, which has a range from $d = 0$ to $\infty$, $d$ being the duration random variable. The Gamma density falls off quite sharply to 0 as $d$ decreases to 0, whereas it goes asymptotically to 0 as $d$ tends to $\infty$. The word duration penalty $\log P_d(t_2 - t_1 \mid i)$ in equation (5) removes the bias of the log Viterbi probability for short length sequences without affecting the longer sequences significantly. This can be seen from Figure 2, which plots the time normalized log Viterbi probability (a) and the score $S_i(t_1, t_2)$ (b) for the same values of $t_1$ (= 43) and $t_2$, for the same keyword model and the same keyword utterance. In Figure 2(a), the time normalized log Viterbi probability has a local maximum at $t_2 = 73$ (which is the correct ending point), but this maximum is overshadowed by a maximum at a much shorter duration, $t_2 = 56$ (a duration of 14 observation symbols). It can be seen from Figure 2(b) that the score $S_i(t_1, t_2)$ suppresses the spurious short duration peak while the maximum at $t_2 = 73$ is kept unaltered. Thus the word duration penalty $\log P_d(t_2 - t_1 \mid i)$ promotes more probable durations and penalizes the less probable ones.
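A minimal sketch of the duration fit, using the standard method-of-moments identities for the Gamma distribution (mean $= k\theta$, variance $= k\theta^2$); the function name and packaging are ours, not the paper's.

```python
import numpy as np

def fit_gamma_duration(durations):
    """Fit the Gamma duration density of Section 3.2 from training
    durations via the method of moments: shape k = m^2 / var and
    scale theta = var / m reproduce the sample mean and variance."""
    d = np.asarray(durations, dtype=float)
    m, var = d.mean(), d.var()
    return m * m / var, var / m   # (shape, scale)
```

The fitted (shape, scale) pair is what the `keyword_score` sketch above consumes in its `gamma.logpdf` call.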
3.3 Algorithm

The complete algorithm is described below.

1. Get the observation sequence $O_1, \ldots, O_T$ for a particular speech utterance.
2. Start with $i = 1$, $i$ being the keyword index.
3. Set $t_1 = 1$.
4. If $t_1 + d_i^{min} - 1 > T$, go to step 8.
5. Calculate $S_i(t_1, t_2)$ from the observation sequence $O_{t_1}, \ldots, O_{t_2}$, where $t_1 + d_i^{min} - 1 \le t_2 \le \min[T, t_1 + d_i^{max} - 1]$.
6. If $S_i(t_1, t_2)$ is maximum over all the keyword scores considered so far, store the keyword index $i$ and the time instants $t_1$ and $t_2$.
7. Set $t_1 = t_1 + 1$ and go to step 4.
8. Set $i = i + 1$ and go to step 3, for $2 \le i \le M$, $M$ being the number of keywords.

After termination, the keyword which gave the maximum score, $S_{i^*}(t_1^*, t_2^*)$, and its starting ($t_1^*$) and ending ($t_2^*$) locations are obtained. A threshold score $S_i^T$, $1 \le i \le M$, for each keyword is estimated a priori using Bayes' minimum error criterion. If $S_{i^*}(t_1^*, t_2^*) \ge S_{i^*}^T$, then the recognized keyword is $i^*$; else no keyword is present.

The above algorithm can be visualized as searching for a keyword within a window of the observation sequence for each keyword model. The window is slid across the entire observation sequence, as shown in Figure 3 and in the sketch below. The size of the window is determined by $d_i^{min}$ and $d_i^{max}$, $1 \le i \le M$. The search limits are set at $d_i^{min} = m_i - 3\sigma_i$ and $d_i^{max} = m_i + 3\sigma_i$.

By restricting the search to within $d_i^{min}$ and $d_i^{max}$ we prevent all garbage sequences outside this range from causing false alarms. For sequences within the allowed limits, the contrast between keyword and garbage is achieved by the Viterbi probability and the a priori duration weightage incorporated in the score. Thus, this method improves the separation between $P(O^k \mid \lambda_i)$ and $P(O^g \mid \lambda_i)$.
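The window search maps directly onto nested loops. Below is a minimal Python sketch of steps 1-8 plus the final threshold test, using 0-based indexing; `keyword_score` and the per-keyword model packaging come from the earlier sketches and are our assumptions, not the paper's.

```python
import numpy as np

def spot_keyword(symbols, models, thresholds):
    """Sliding-window search of Section 3.3. models[i] is a tuple
    (hmm, dur_shape, dur_scale, d_min, d_max) for keyword i;
    thresholds[i] is the a priori threshold S_i^T."""
    T = len(symbols)
    best_score, best = -np.inf, None
    for i, (hmm, shape, scale, d_min, d_max) in enumerate(models):
        for t1 in range(T):                      # steps 3 and 7: slide the window start
            if t1 + d_min > T:                   # step 4: window no longer fits
                break
            for dur in range(d_min, min(d_max, T - t1) + 1):   # step 5
                t2 = t1 + dur - 1
                s = keyword_score(symbols, t1, t2, hmm, shape, scale)
                if s > best_score:               # step 6: keep the running maximum
                    best_score, best = s, (i, t1, t2)
    # threshold test with the per-keyword threshold S_i^T
    if best is not None and best_score >= thresholds[best[0]]:
        return best                              # (i*, t1*, t2*)
    return None                                  # no keyword present
```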
4 Experiment

Experiments are conducted on a ten keyword database (Indian city names, with at most one keyword per sentence), with each keyword modelled by a 5 state left-to-right discrete HMM. The feature vector includes 10 weighted cepstral coefficients, energy, the ratio of the residual energy to the zeroeth autocorrelation coefficient, zero crossing rate and delta energy. Vector quantization is done using a 256 size codebook designed using the LBG algorithm. The recognition results and the duration statistics are shown in Table 1 and Table 2 respectively. It is found from the experiments that the insertions usually have shorter durations than the expected keyword durations. It can be seen from Table 3 that by restricting the lower search limit $d_i^{min}$, keeping the upper limit $d_i^{max}$ and the thresholds fixed, the false alarms are reduced significantly. Figure 4 gives the system performance for two values of $d_i^{min}$ as a function of threshold.
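For illustration, a compact sketch of LBG codebook design (binary splitting of the codewords followed by Lloyd refinement); the perturbation factor, iteration count and the placeholder data are our choices, not the paper's.

```python
import numpy as np
from scipy.spatial.distance import cdist

def lbg_codebook(X, size=256, eps=0.01, iters=10):
    """Grow a VQ codebook by LBG: split every codeword into a
    (1+eps)/(1-eps) pair, then refine with Lloyd (k-means) steps."""
    code = X.mean(axis=0, keepdims=True)
    while len(code) < size:
        code = np.vstack([code * (1 + eps), code * (1 - eps)])  # binary split
        for _ in range(iters):                                  # Lloyd refinement
            labels = cdist(X, code, "sqeuclidean").argmin(axis=1)
            for k in range(len(code)):
                pts = X[labels == k]
                if len(pts) > 0:
                    code[k] = pts.mean(axis=0)
    return code

# placeholder frames standing in for the 14-dimensional feature vectors
rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 14))
codebook = lbg_codebook(X, size=256)
symbols = cdist(X, codebook, "sqeuclidean").argmin(axis=1)  # discrete O_t
```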
5 Conclusion

The major contribution of this paper has been to show that good keyword spotting performance can be obtained from an HMM system using no garbage model. From the results we can see that the no-garbage-model approach gives reasonably good recognition accuracy at a low false alarm rate. These results can be compared with those of Rose [2], who used very elaborate modelling of non-keyword speech. This technique is also computationally fast, which makes it possible to do word-spotting in real time. The performance can be further improved by using discriminative techniques of keyword training to increase the separation between $P(O^k \mid \lambda_i)$ and $P(O^g \mid \lambda_i)$.

6 References

[1] Wilson et al. (1990). Automatic recognition of keywords in unconstrained speech using hidden Markov models. IEEE Trans. on ASSP 38, 1870-1878.
[2] R. C. Rose (1995). Keyword detection in conversational speech utterances using hidden Markov model based continuous speech recognition. Comp. Speech and Lang. 9, 309-333.
[3] Lee et al. (1989). A frame synchronous network search algorithm for connected word recognition. IEEE Trans. on ASSP 37, 1649-1658.
[4] Bourlard (1994). Optimizing recognition and rejection performance in word spotting systems. Proc. of ICASSP, Vol. 1, 373-376.
Figure 1. Variation of log Viterbi probability with time.

Figure 2. Scoring scheme: (a) time normalized log Viterbi probability vs. time; (b) time normalized log Viterbi probability with duration penalty vs. time.
Figure 3. Search method: the search window slid across the observation sequence.
Figure 4. System performance for two values of $d_i^{min}$ as a function of threshold.
Table 1. Results of the no garbage model keyword spotting technique.

Data Set | No. of sentences | Detection Rate (%) | False Alarm Rate (FA/kw/hr)
Train    | 250              | 96.80              | 2.66
Test     | 150              | 69.33              | 4.80

Table 2. Duration statistics: errors in keyword start and end points.

Data Set | No. of sentences | Mean error in start pt. | Error variance in start pt. | Mean error in end pt. | Error variance in end pt.
Train    | 250              | 2                       | 5.52                        | 2                     | 11.33
Test     | 150              | 2                       | 7.19                        | 2                     | 10.88

Table 3. Recognition rate and false alarm rate as a function of the lower search limit $d_i^{min}$ (upper limit and thresholds fixed).

$d_i^{min}$           | Recognition rate (%) | False Alarm Rate (FA/kw/hr)
$m_i - 0.25\sigma_i$  | 65.33                | 4.20
$m_i - 0.50\sigma_i$  | 68.00                | 4.80
$m_i - 0.75\sigma_i$  | 68.67                | 4.80
$m_i - \sigma_i$      | 69.33                | 5.40
$m_i - 2\sigma_i$     | 68.00                | 9.00
$m_i - 3\sigma_i$     | 68.00                | 9.00
$m_i - 4\sigma_i$     | 68.00                | 9.00
$m_i - 5\sigma_i$     | 68.00                | 9.00