VOCAL INTERFACE WITH A SPEECH MEMORY FOR
DEPENDENT PEOPLE
Stéphane RENOUARD, Maurice CHARBIT and Gérard CHOLLET.
Département de Traitement du Signal et des Images,
Ecole Nationale Supérieure des Télécommunications, CNRS-LTCI
{renouard, charbit, chollet}@tsi.enst.fr
ABSTRACT
We present a new approach to automatic speech recognition based on a speech memory. With this
approach, statistical acoustic models are continuously adapted and a meaningful confidence measure is
computed. Tests are conducted on the isolated-utterance database POLYVAR, recorded over a one-year
period. An autonomous vocal interface, a 'majordome' for the smart house that assists dependent people
in their everyday life, is under development.
1. INTRODUCTION
Speech is regarded as the most natural, flexible and effective means of human
communication. Progress in automatic speech recognition, oral dialogue understanding and
speech synthesis raises the hope of one day being able to converse naturally with computers
in any environment. However, the field of automatic speech recognition (ASR) still faces
many scientific and technological obstacles. For example, uncontrolled factors such as
distorted or disturbed speech and particular classes of speakers (children, elderly people,
non-native speakers, disabled people) remain blocking factors.
Within the framework of the GET project 'Smart Home and Independent living for
persons with disabilities and elderly people', our research team built a vocal interface (VI)
for physically disabled people. This VI provides an alternative means of interacting with a
WIMP (Windows, Icons, Menus and Pointers) environment, offering further autonomy
to the user. ASR systems usually comprise three distinct phases: training,
recognition and adaptation. The training phase is the most time-consuming because it
requires large audio databases, especially when building speaker-independent models.
Even though this phase can be shortened in the speaker-dependent case, the user must still
devote considerable time to it.
We present here an original system based on a memory of speech [1,2]. Every
utterance pronounced by the user is recorded so that post-processing applications such
as model adaptation, confidence measurement or speaker verification can be performed. For
instance, we use the memory of speech for speaker model adaptation and for the rejection
of out-of-vocabulary words. Section 2 gives an overview of the system and a description of
the memory of speech. Section 3 presents experiments in which the memory of speech is
used for speaker model adaptation in order to shorten the ASR training phase. Section 4
presents results on the rejection of out-of-vocabulary words using a confidence measure
computed from utterances stored in the memory of speech.
2. SYSTEM OVERVIEW
2.1 Speech module and interface
Figure 1 below shows a block diagram of our speech interface. It was
developed in two parts, the interface and the speech module, which allows us to build the
speech module independently of the interface. Another possibility is to run the
interface on a client (PDAs, wearable computers…) and the speech module on a server,
to improve the portability and performance of the system.
The overall speech module was built using HTK 3.2 [3]. Acoustic analysis is based
on the extraction of 12 Mel-Frequency Cepstral Coefficients (MFCC) plus energy,
together with their first- and second-order derivatives. Frames of the speech signal are
analysed using a 20 ms Hamming window with a 10 ms frame shift. The recognizer uses
Hidden Markov Models (HMM) with 25 phone models and 1 silence model. Each phone
model has 3 states and 16 Gaussians per state. The baseline speech recognizer contains
phone HMMs trained on the French read-speech corpus BREF, which contains over
100 hours of speech material from 120 speakers (55m/65f) [4]. The text materials were
selected from the French newspaper Le Monde, so as to provide a large vocabulary (over
20,000 words) and a wide range of phonetic environments.
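As an illustration, this front-end can be sketched as follows. This is a reconstruction, not the HTK 3.2 configuration actually used: it relies on the librosa library, assumes a 16 kHz sampling rate, and uses 13 cepstral coefficients (with the first standing in for the energy term), so that, with first- and second-order derivatives, each frame yields a 39-dimensional vector.

```python
# Sketch of a front-end comparable to the one described above (the paper
# used HTK 3.2; this illustrative version uses librosa instead).
import numpy as np
import librosa

def analyse(wav_path: str) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16000)   # assumed sampling rate
    n_fft = int(0.020 * sr)                    # 20 ms analysis window
    hop = int(0.010 * sr)                      # 10 ms frame shift
    # 13 cepstral coefficients; c0 roughly plays the role of the energy term
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop,
                                window="hamming")
    d1 = librosa.feature.delta(mfcc)           # first-order derivatives
    d2 = librosa.feature.delta(mfcc, order=2)  # second-order derivatives
    return np.vstack([mfcc, d1, d2])           # 39 x n_frames feature matrix
```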
The interface was built using Borland Delphi 6.0 and offers the classical
functions of WIMP interfaces.
[Figure 1 omitted: block diagram of the system. The interface (microphone input, order sending, confirmation by the user) exchanges the speech signal and recognition patterns with the speech module (acoustic analysis, grammar constraints, pattern matching and recognition decision, with post-processing backed by the speech memory).]
Figure 1. System overview
2.2 Memory of speech
The speech memory is a voice database made up of the utterances pronounced
by users. It can be viewed as a 'live' recorded speech database. We use the data present in
the memory for two tasks: re-estimating the user's statistical model and training a rejection
model for out-of-vocabulary words.
At each new login to the interface, a speaker-independent model is loaded. A short
training phase requiring user cooperation is then performed. The recorded utterances are
used to adapt the speaker model and improve recognition performance (see section 3).
Adapted models are saved when the interface is closed.
The data present in the memory allow us to compute a confidence measure (CM)
for each pronounced utterance using the method described in section 4. The threshold for
the CM may be continuously adapted with this technique.
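As a rough sketch of how such a memory might be organized, the following data structure stores each utterance together with its recognition hypothesis, confidence score and user confirmation. All names are illustrative, not taken from the actual implementation.

```python
# Minimal sketch of the speech-memory store described above.
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class Utterance:
    features: np.ndarray   # acoustic analysis of the recording
    hypothesis: str        # word recognized by the decoder
    confidence: float      # confidence measure F(W), see section 4
    confirmed: bool        # did the user confirm the recognition?

@dataclass
class SpeechMemory:
    user_id: str
    utterances: List[Utterance] = field(default_factory=list)

    def record(self, utt: Utterance) -> None:
        self.utterances.append(utt)

    def adaptation_data(self) -> List[Utterance]:
        # confirmed utterances feed speaker-model re-estimation (section 3)
        return [u for u in self.utterances if u.confirmed]

    def rejection_scores(self) -> List[float]:
        # confidence scores feed the continuously adapted threshold (section 4)
        return [u.confidence for u in self.utterances]
```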
3. MODEL ADAPTATION
In this section we describe speaker model adaptation using limited training data.
Phone models were trained using POLYPHONE data, while the POLYVAR database was
used to simulate the content of the speech memory.
3.1 HMM Adaptation with a small quantity of data
It is commonly agreed that, for a given speech recognition task, a speaker-dependent
(SD) system usually outperforms a speaker-independent (SI) system, as long as a sufficient
amount of training data is available. When the amount of speaker-specific training data is
limited, the gain from using such a model is not guaranteed. One way to improve
performance is to make use of existing knowledge, in the form of a large multispeaker
database, so that a minimum amount of training data suffices to model the new
speaker [5]. Such a training procedure is often referred to as speaker adaptation when the a
priori knowledge is derived from a speaker-independent database, and as speaker
conversion when the knowledge is derived from a different speaker.
Using data contained in the memory of speech, speaker adaptation was
performed with the maximum likelihood method using the expectation-maximization (EM)
algorithm [3,6]. For HMM parameter estimation, this algorithm is also called the Baum-Welch
algorithm. It exploits the fact that the complete-data likelihood of the model is simpler to
maximize than the likelihood of the adaptation (or incomplete) data, as is the case when the
complete data have sufficient statistics.
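As a simplified illustration of the EM principle, the sketch below performs a single re-estimation of the Gaussian means of one state's mixture from adaptation frames already assigned to that state. The full Baum-Welch algorithm additionally accumulates state-occupation statistics over the whole HMM; this only shows the E-step/M-step idea for one mixture.

```python
# Simplified sketch of one EM re-estimation step for the Gaussian means of a
# single HMM state's mixture (illustrative, not the HTK implementation).
import numpy as np
from scipy.stats import multivariate_normal

def em_update_means(frames, weights, means, covs):
    """frames: (T, D) adaptation frames assigned to this state.
    weights (M,), means (M, D), covs (M, D): current mixture parameters,
    with diagonal covariances."""
    # E-step: posterior probability of each mixture component for each frame
    lik = np.stack([w * multivariate_normal.pdf(frames, m, np.diag(c))
                    for w, m, c in zip(weights, means, covs)], axis=1)  # (T, M)
    gamma = lik / lik.sum(axis=1, keepdims=True)
    # M-step: re-estimate the means as posterior-weighted frame averages
    return (gamma.T @ frames) / gamma.sum(axis=0)[:, None]
```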
3.2 Databases and experimental setup
Speaker adaptation experiments were carried out on a corpus consisting of two
speech databases: Swiss French POLYPHONE and POLYVAR [4]. The first, which
contains 4,293 information-service calls (2,407 female, 1,886 male speakers) collected over
the Swiss telephone network, was used to train a speaker-independent model for each
phone. Tests were made using a part of the POLYVAR database consisting of repetitions
of 17 French isolated words by 5 speakers. The native model from POLYPHONE was
trained using iterative training as described in 3.1. We used tokens 00 to 30 (510 isolated
words) from speakers M00 to M04. Tokens 00 to 03 were used for training and
tokens 04 to 30 for testing. After each re-estimation, recognition was performed on the test
data using the Viterbi algorithm.
3.3 Results
Table 1 shows the Word Error Rate (WER, i.e. the number of substitutions + insertions
divided by the total number of recognized words) for isolated-word recognition with
speakers M00 to M04. The baseline model without retraining (T0) gives good overall
performance (mean WER ~ 10.6%). This baseline WER is compared to the WER after
re-estimation on 17 words (T1), 34 words (T2), 51 words (T3) and 68 words (T4). For
speakers M00 to M04, the WER decreases from T1 to T4. Using 4 tokens (T4) for training
does not drastically improve recognition performance further; two-token (T2) adaptation is
sufficient in our case to obtain very good recognition performance (mean WER ~ 1.32%).
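A minimal sketch of the WER formula as defined above; the counts used here are illustrative, chosen only to reproduce the ~10.6% baseline figure.

```python
# WER as defined above: (substitutions + insertions) / total recognized words.
def word_error_rate(substitutions: int, insertions: int, total_recognized: int) -> float:
    return (substitutions + insertions) / total_recognized

# Illustrative counts over the 510 test words
print(f"{word_error_rate(substitutions=45, insertions=9, total_recognized=510):.1%}")  # ~10.6 %
```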
TABLE 1
WER Recognition Results on 447 isolated words.

Sp      T0        T1       T2       T3       T4
M00     11.6 %    6.8 %    3.1 %    1.2 %    0.4 %
M01     10.5 %    4.0 %    0.5 %    0.2 %    0.2 %
M02      8.8 %    3.8 %    1.7 %    1.2 %    0.2 %
M03      9.6 %    6.2 %    0.7 %    0.5 %    0.4 %
M04     11.2 %    7.0 %    0.6 %    0.5 %    0 %
Table 1 shows that even a small quantity of data used to adapt a speaker-independent
model improves on the baseline. Only two sets of words were necessary to obtain good
overall performance for such an application.
4. REJECTING ‘OUT OF VOCABULARY’ WORDS
In this section, we test a rejection method for out-of-vocabulary words.
Data from the POLYVAR database were used to simulate the content of the speech memory.
4.1 Rejecting ‘out of vocabulary’ words
One of the major problems that most interactive vocal services have to cope with is
the rejection of incorrect data. Indeed, the users of such services are generally not aware of
the constraints of the systems and may be talking in a noisy environment. Thus,
automatic speech recognition systems must be able to reject incorrect utterances such as
out-of-vocabulary words, speaker hesitations or noise tokens. The problem is then to find
an acceptable trade-off between the different possible types of errors: substitution errors,
false rejection errors (when a vocabulary word is rejected) and false alarm errors (when an
incorrect utterance is recognized as part of the vocabulary).
The idea is to post-process the hypotheses delivered by the recognizer by
computing a confidence measure for each recognition hypothesis. Different solutions are
possible, such as a segmental approach based on a phonetic-level likelihood ratio
estimated from a set of phonetic or prosodic parameters [5, 9]. We choose here to estimate a
confidence measure based on word-level likelihood ratios, which does not require the
extraction of any additional parameters [9].
The confidence measure used in this work is the normalized log-likelihood ratio
(LR) between the first and second best hypotheses of an N-best decoding approach [3]. The
formalism of this strategy was developed in [5]. To estimate the confidence of an output
word W, for a given utterance X of T frames, we need to compute:

$$F(W) = \frac{1}{T}\,LR(X/W) \qquad (1)$$

under hypothesis H0, "the word belongs to the dictionary", and hypothesis H1, "the word
does not belong to the dictionary". P(F(W)|H0,1) is approximated by considering:

$$LR(X/W) \approx \log(\#_1/\#_2) \text{ under } H_0, \qquad LR(X/W) \approx \log(\#_2/\#_3) \text{ under } H_1 \qquad (2)$$

where #1, #2 and #3 are the scores of the first, second and third best hypotheses from the
Viterbi decoder. We consider that log(#2/#3) is close to a model of $\overline{W}$ (the
complement of W). In order to make accept (H0) or reject (H1) decisions, we need to
determine a threshold $\theta$ against which the confidence score is compared. The decision
then becomes:

$$F(W) \ge \theta \Rightarrow \text{accept } W, \qquad F(W) < \theta \Rightarrow \text{reject } W \qquad (3)$$
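A minimal sketch of this decision rule, assuming the Viterbi decoder returns log scores (so that the log-likelihood ratio of Eq. (2) reduces to a difference of log scores); all numerical values are illustrative.

```python
# Minimal sketch of Eqs. (1)-(3), assuming the decoder returns log scores.
def confidence(log_best: float, log_second: float, n_frames: int) -> float:
    """F(W): normalized log-likelihood ratio between the two best hypotheses."""
    return (log_best - log_second) / n_frames

def decide(f_w: float, theta: float) -> str:
    # Eq. (3): accept W if the confidence reaches the threshold, reject otherwise
    return "accept" if f_w >= theta else "reject"

# Illustrative usage: made-up log scores for a 120-frame utterance
f_w = confidence(log_best=-5400.0, log_second=-5460.0, n_frames=120)
print(f_w, decide(f_w, theta=0.3))   # 0.5 accept
```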
4.2 Database and experimental setup
Confidence measurement experiments were carried out using locutor m00 of the
POLYVAR database. Silences were removed from each sound file. The reference model
was trained on repetitions 30 to 220 (Tt). Model trained on Tt gives a WER score of ~
0.1%. Others models were trained using one repetition token (T1), two repetition tokens
(T2) and three repetition tokens (T3). T0 is the native model from POLYPHONE (no
retraining). WER scores are the same as in Table 1. Tests were done using repetions 03 to
30. Results are representated using DET curve [8] and considerated at False rejection
probability equal to False Alarm probability (this point is known as EER, Equal Error
Rate).
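The EER summarizing a DET curve can be sketched as follows: sweep a threshold over the confidence scores and locate the point where the false rejection and false alarm rates cross. The score distributions below are synthetic, for illustration only.

```python
# Minimal sketch of the EER computation used to summarize the DET curves.
import numpy as np

def equal_error_rate(in_vocab_scores, out_vocab_scores):
    """Scores are confidence values F(W) for in-vocabulary words (to accept)
    and out-of-vocabulary words (to reject)."""
    thresholds = np.sort(np.concatenate([in_vocab_scores, out_vocab_scores]))
    best = (1.0, 0.0)
    for t in thresholds:
        fr = np.mean(in_vocab_scores < t)     # false rejection rate
        fa = np.mean(out_vocab_scores >= t)   # false alarm rate
        if abs(fr - fa) < abs(best[0] - best[1]):
            best = (fr, fa)
    return (best[0] + best[1]) / 2            # EER, where the two rates cross

# Illustrative usage with synthetic score distributions
rng = np.random.default_rng(0)
iv = rng.normal(0.6, 0.2, 200)   # in-vocabulary confidences (higher)
ov = rng.normal(0.2, 0.2, 200)   # out-of-vocabulary confidences (lower)
print(f"EER ~ {equal_error_rate(iv, ov):.1%}")
```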
4.3 Results
[Figure 2 omitted: DET curves; the legend lists the training set used for model adaptation: T0, T1, T2, T3 and Tt.]
Figure 2. DET curves of the confidence measure for 'out of vocabulary' words, using training tokens Tt and T0 to T3.
Figure 2 shows that the reference model trained with Tt has an EER of ~10%. The
native model, with WER = 11.6%, has an EER of ~30%. The EER of the other models
decreases with the quantity of adaptation data used. In fact, our test did not prove robust to
degraded recognition performance: as the WER decreases (depending on the number of
repetitions used for training), the EER decreases drastically. The robustness of our test
needs to be improved.
5. CONCLUSION
Our results indicate that only few words are necessary to adapt a locutor dependant
model from a well trained locutor independent model. Adapted models show overall good
performances with only 3 repetition tokens. The use of a speech memory is quite relevant in
that case. The model depicted for ‘out of vocabulary’ words rejection gives results that may
be improved, in particular the robusteness of the model when speech recognition
performances decrease. In a close futur, we plain to develop a speaker verification task to
complete the interface using data from the speech memory.
At terms, the vocal interface with a speech memory will facilitate interactions with
a ‘majordome’ capable of interfacing with telephone, interphone, or system requiring a
remote control (television, HiFi equipement, etc.). The vocal interface with a speech
memory should be considerated as a central element of the smart house which thus
becomes a real communicating house.
6. REFERENCES
[1] D. Petrovska-Delacretaz and G. Chollet, "Searching Through a Speech Memory for Efficient Coding, Recognition and
Synthesis", in Phonetics and its Applications, edited by Angelika Braun and Herbert R. Masthoff, Franz Steiner Verlag, 2002.
[2] T.C. Ervin II and J.H. Kim, "Speaker Independent Speech Recognition Using an Associative Memory Model",
Southeastcon '93 Proceedings, 1993, p. 4-7.
[3] S. Young et al., "The HTK Book", rev. Dec. 2002, Cambridge University Engineering Dept., 2001-2002.
[4] L.F. Lamel, J.L. Gauvain and M. Eskénazi, "BREF, a Large Vocabulary Spoken Corpus for French", EUROSPEECH-91.
[5] C.H. Lee and B.H. Juang, "A Study on Speaker Adaptation of the Parameters of Continuous Density Hidden Markov
Models", IEEE Transactions on Signal Processing, vol. 39, no. 4, 1991, p. 806-814.
[6] J.-L. Gauvain and C.-H. Lee, "Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of
Markov Chains", IEEE Transactions on Speech and Audio Processing, vol. 2, no. 2, 1994, p. 291-300.
[7] D. Genoud, G. Gravier, F. Bimbot and G. Chollet, "Combining Methods to Improve Speaker Verification Decision",
Proc. ICSLP '96, 1996.
[8] A. Martin, G. Doddington, T. Kamm, M. Ordowski and M. Przybocki, "The DET Curve in Assessment of Detection
Task Performance", EuroSpeech 1997 Proceedings, vol. 4, p. 1895-1898.
[9] N. Moreau, D. Charlet and D. Jouvet, "Confidence Measure and Incremental Adaptation for the Rejection of Incorrect
Data", Proc. ICASSP '00, 2000, vol. 3, p. 1807-1810.
The authors would like to thank the foundation "Louis Leprince-Ringuet" for its financial contribution to the
GET project 'Smart Home and Independent living for persons with disabilities and elderly people'.