Discrete Speech Recognition Using a Hausdorff-based Metric
TUDOR BARBU
Institute for Computer Science
Romanian Academy
Carol I 22A, Iaşi 6600
ROMANIA
Abstract: - In this work we provide an automatic, speaker-independent, word-based discrete speech recognition approach. Our proposed method consists of several processing levels. First, a word-based audio segmentation is performed, then feature extraction is applied to the previously obtained segments. The speech feature vectors are computed using a mel-cepstral vocal sound analysis. They must then be compared with the feature vectors of the vocal utterances from a recorded training set, for a supervised classification. Because these vectors have different dimensions, we propose a Hausdorff-based nonlinear distance for vector comparison. We also provide a variant of minimum distance feature vector classification, based on the proposed metric.
Key-Words: - Speech recognition, Discrete speech, Vocal sound, Mel-cepstral analysis, Hausdorff metric,
Feature vectors, Supervised classification, Training set
1 Introduction
In this paper we are interested in developing a
system for automatic speech recognition. As we
know, a speech recognition system receives a vocal
sound as an input and provides as an output the text
representing the transcript of the spoken utterance.
There are many known speech recognition approaches [1]. They can be separated into several different classes, depending on the linguistic units used in recognition. Thus, there exist phoneme-based, morpheme-based, word-based and phrase-based recognition techniques. We propose a word-based speech recognition approach.
Depending on the types of the vocal utterances that
the system is able to recognize, it can be either a
discrete speech recognition system or a continuous
speech recognition system [4]. Discrete speech
contains slight pauses between the spoken words.
Continuous speech represents natural speech, whose words are not separated by pauses. The word-based segmentation of a vocal sound is easier in the discrete speech case. To segment continuous speech, special methods must be utilized to determine the word boundaries. We choose to pursue a discrete speech recognition approach.
Also, depending on the population of users the
recognition systems can handle, they can be either
speaker-dependent or speaker-independent speech
recognition systems [4]. The systems in the first class
must be trained on a specific user, while those in the
second class are trained on a set of speakers.
We focus on a speaker-independent system. Its
modelling consists of the following processing
stages:
1. Developing the system training feature set
2. Audio preprocessing of the input vocal sound
3. Word-based segmentation of the input utterance
4. Feature vector extraction for the resulting signal segments
5. Supervised classification and labeling of the speech segments
6. Obtaining the final transcript of the speech from the previously determined labels
We present each of these development stages in the next sections. We also offer a practical example to illustrate the functioning of the pattern recognition approach.
The main contributions of this paper are the development of a nonlinear metric based on the Hausdorff distance for sets [5] and the creation of a supervised classifier which utilizes the proposed distance.
2 Training Feature Vector Extraction
We propose a supervised vocal pattern recognition method, which therefore first requires developing a training set of speech patterns. In this section we focus on this training set.
A word-based approach requires a very large training set for optimal speech recognition. The set dimension depends on the chosen vocabulary. An ideal recognition system utilizes the whole vocabulary of the speech language, which could contain thousands of words. Most speech recognition systems use low-size vocabularies of tens of words, or medium-size vocabularies of hundreds of words.
The large vocabulary size represents a disadvantage of word-based speech recognition techniques. The vocabulary size is equal to the number of classes, because each word corresponds to a class in the recognition process. It is obvious that a phoneme-based recognition system utilizes a much smaller number of classes, because the number of words of a language (tens of thousands) is much greater than the number of its phonemes (usually 30-40, depending on the language).
We consider creating a vocabulary containing only several words initially and extending it over time. Let N be our vocabulary size. For each word from the vocabulary we consider a set of speakers, each of them recording a spoken utterance of that word.
Thus, we obtain a set of digital audio signals for each word. All the recorded sounds represent the word prototypes of the system. For each i representing a vocabulary word number, we then have a set of signal prototypes $\{S_1^i, \dots, S_{n_i}^i\}$, where $i = \overline{1,N}$, $n_i$ is the number of users who spoke the word referred by i, and $S_j^i$ represents the audio signal of that spoken word recorded by the speaker indicated by $j = \overline{1,n_i}$. The set of all prototype signals, $\{S_1^1, \dots, S_{n_1}^1, \dots, S_1^N, \dots, S_{n_N}^N\}$, represents the training set of our system.
Also, we introduce class labels for all these signals. The label of a signal of a spoken word will be its transcript, the written word. Thus, for each $i = \overline{1,N}$ and $j = \overline{1,n_i}$, we set a signal label $l(S_j^i)$, which is a string representing a vocabulary word. It is obvious that we have:

$l(i) = l(S_1^i) = \dots = l(S_{n_i}^i), \quad \forall i = \overline{1,N}$,   (1)

where $l(i)$ represents the label of the word-related class referred by i.
The next step is the computation of the prototype vectors of the recognition system. They represent the feature vectors of the training set. We perform the training feature extraction by applying a mel cepstral analysis to the $S_j^i$ sounds, the Mel Frequency Cepstral Coefficients (MFCC) being the dominant features used for speech recognition [1,2,3].
A short-time signal analysis is performed for each of these vocal sounds. We choose to divide each signal into overlapping segments of length 256 samples, with overlaps of 128 samples. Then, each resulting signal segment is windowed, by multiplying it with a Hamming window of length 256. We then compute the spectrum of each windowed sequence, applying the FFT (Fast Fourier Transform) to it, obtaining the acoustic vectors of the current signal $S_j^i$. The mel spectrum of these vectors is computed by converting them on the melodic scale, which is described as:

$mel(f) = 2595 \cdot \log_{10}(1 + f/700)$,   (2)

where f represents the physical frequencies and mel(f) the mel frequencies. The mel cepstral acoustic vectors are obtained by applying first the logarithm, then the DCT (Discrete Cosine Transform), to the mel spectral acoustic vectors.
Then we can compute the delta mel cepstral coefficients (DMFCC), as the first order derivatives of the MFCC, and the delta delta mel frequency cepstral coefficients (DDMFCC), as the second order derivatives of the MFCC. We prefer to use the delta delta mel cepstral acoustic vectors for describing speech content. These acoustic vectors have a dimension of 256. To reduce this size, we truncate each acoustic vector to its first 12 coefficients, which we consider to be sufficient for speech featuring. Then we construct a 12-row matrix by positioning these truncated delta delta mel cepstral vectors on the columns. The resulting DDMFCC-based matrix represents the chosen speech feature vector.
Therefore, for each $i = \overline{1,N}$ and $j = \overline{1,n_i}$, a feature vector $V(S_j^i)$ is computed as a matrix with 12 rows and a number of columns depending on the length of $S_j^i$. The computed vectors represent the training feature vectors of the recognition system. The set $\{V(S_1^1), \dots, V(S_{n_1}^1), \dots, V(S_1^N), \dots, V(S_{n_N}^N)\}$ represents the training feature set of the speech recognition system.
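To make the feature pipeline concrete, the following is a minimal NumPy/SciPy sketch of the DDMFCC matrix computation described above. The triangular mel filterbank, the sampling rate and the number of filters are our own assumptions for illustration: the paper only fixes the 256-sample Hamming frames, the 128-sample overlap, the mel conversion (2) and the truncation to 12 coefficients, and it truncates the 256-dimensional cepstral vectors directly, whereas this sketch uses the common filterbank formulation.

import numpy as np
from scipy.fft import dct

def mel(f):
    # Mel scale conversion from equation (2)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filters, frame_len, sample_rate):
    # Triangular filters spaced evenly on the mel scale (assumed construction)
    pts = np.linspace(mel(0.0), mel(sample_rate / 2.0), n_filters + 2)
    hz = 700.0 * (10.0 ** (pts / 2595.0) - 1.0)          # inverse of mel()
    bins = np.floor((frame_len + 1) * hz / sample_rate).astype(int)
    fbank = np.zeros((n_filters, frame_len // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def ddmfcc_matrix(signal, sample_rate=8000, frame_len=256, hop=128,
                  n_filters=26, n_coeffs=12):
    # 1. Overlapping 256-sample frames, each multiplied by a Hamming window
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # 2. Power spectrum of each windowed frame via FFT
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 3. Log mel spectrum, then DCT, truncated to the first 12 coefficients
    fbank = mel_filterbank(n_filters, frame_len, sample_rate)
    mfcc = dct(np.log(power @ fbank.T + 1e-10), axis=1, norm='ortho')[:, :n_coeffs]
    # 4. DMFCC and DDMFCC as first and second order time derivatives
    delta = np.gradient(mfcc, axis=0)
    ddelta = np.gradient(delta, axis=0)
    return ddelta.T                      # 12 rows, one column per frame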
3 Input Speech Analysis
In this section we focus on the input vocal sound analysis. As specified in the introduction, we consider only discrete speech sounds to be recognized by our system. Also, we impose the condition that the vocal sounds introduced as input contain only spoken words belonging to the given vocabulary.
Let S be the signal of a vocal sound we want to
recognize by identifying its transcript. First, several
pre-processing actions can be performed on this
audio signal. If an amount of noise is still present in
the sound, some filtering techniques must be applied
to smooth the signal.
Also, in the pre-processing stage an important audio special effect, often used before speech recognition, can be applied to the signal. The pre-emphasis effect is applied as follows:

$S[a] \leftarrow S[a] - \alpha \cdot S[a-1]$,   (3)

where we set the control parameter $\alpha = 0.5$.
The next stage of the speech analysis consists of a word-based vocal segmentation. We must extract from S the signal segments corresponding to the spoken words. This task is performed by detecting the pauses that separate the words.
A pause segment is characterized by its length and by its low amplitudes. Thus, we use two threshold values for the pause identification purpose. Let them be T and t. A pause represents a signal sequence $\{S[a], \dots, S[a+t]\}$ having the property $|S[a]|, \dots, |S[a+t]| \le T$. We choose the length-related parameter t = 500 and the amplitude-related threshold T to be small enough.
As a result of the pause identification, the word segment extraction process becomes quite simple to fulfil. Let $s_1, \dots, s_n$ be the word-related speech signals extracted from S. Obviously, the speech recognition of S consists of determining the written words corresponding to these signals.
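As an illustration, here is a minimal sketch of the pre-emphasis step (3) and of the pause-based word segmentation, under our own reading that a pause is a run of at least t consecutive samples with absolute amplitude below T; the thresholds follow the text, but the implementation details are assumptions.

import numpy as np

def preemphasize(s, alpha=0.5):
    # Pre-emphasis from equation (3): S[a] <- S[a] - alpha * S[a-1]
    out = s.copy()
    out[1:] -= alpha * s[:-1]
    return out

def segment_words(s, T=0.01, t=500):
    # A pause is a run of at least t samples with |S[a]| <= T;
    # everything between two pauses is returned as one word segment.
    quiet = np.abs(s) < T
    segments, start, run = [], None, 0
    for idx, q in enumerate(quiet):
        if q:
            run += 1
            if run == t and start is not None:
                # pause confirmed: close the current word segment
                segments.append(s[start:idx - t + 1])
                start = None
        else:
            if start is None:
                start = idx
            run = 0
    if start is not None:
        segments.append(s[start:])
    return segments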
For each $i = \overline{1,n}$, the recognition system has to find the word represented by $s_i$. An automatic recognition is performed by comparing these speech signals with the signals from the training set.
A feature extraction process is applied to each sequence $s_i$. The same mel cepstral analysis is utilized for the feature extraction. For each $i = \overline{1,n}$, a feature vector $V(s_i)$ is computed as the delta delta mel cepstral matrix, truncated to 12 coefficients per column.
Each obtained vector has 12 rows and a number of columns depending on the length of $s_i$. Having different dimensions, these feature vectors and the training feature vectors cannot be compared to each other using linear metrics, like the Euclidean distance or the Minkowski metric.
A solution for finding the distance between a feature vector $V(s_k)$ and a training vector $V(S_j^i)$ could be to transform them into matrices of the same size. These feature matrices already have the same number of rows; only the number of columns can differ. The matrix with the greater number of columns (thus corresponding to the longer signal) can be resampled to get the same number of columns as the other. The two vectors, now having the same dimensions, can be compared with a Euclidean distance for matrices.
This vector resampling solution is not optimal, because the resampling process often produces a loss of speech information in the transformed feature vector. If there is a great size difference between the vectors to be compared, the resampling could result in important classification errors. For this reason we propose a new type of distance measure in the next section.
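For reference, a minimal sketch of this rejected baseline could look as follows; the linear interpolation used to resample the wider matrix is our own assumption, since the text does not fix a resampling method.

import numpy as np

def resampled_euclidean(A, B):
    # Resample the matrix with more columns down to the other's width,
    # then compare with the Euclidean (Frobenius) matrix distance.
    if A.shape[1] > B.shape[1]:
        A, B = B, A
    m, p = A.shape[1], B.shape[1]
    idx = np.linspace(0.0, p - 1.0, m)
    B_resampled = np.stack([np.interp(idx, np.arange(p), row) for row in B])
    return np.linalg.norm(A - B_resampled)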
4 A Hausdorff-based Nonlinear Metric
The classification stage of our recognition process requires a distance measure between vectors of different dimensions. Therefore, we introduce a nonlinear metric which works for matrices and is based upon the Hausdorff distance for sets [5].
First, let us present some general theory regarding the Hausdorff metric. If A and B are two point sets, possibly of different sizes, the Hausdorff metric is defined as the maximum distance of a set to the nearest point in the other set. Therefore, we have the following relation:

$h(A, B) = \max_{a \in A} \{ \min_{b \in B} \{ dist(a, b) \} \}$,   (4)

where h represents the Hausdorff distance between sets and dist is any metric between points (for example the Euclidean distance).
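A direct transcription of (4) for finite point sets, using the Euclidean point metric mentioned above, might read:

import numpy as np

def hausdorff(A, B):
    # Directed Hausdorff distance (4): for every point of A, take the
    # distance to its nearest point of B, then keep the largest value.
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return D.min(axis=1).max()

SciPy's scipy.spatial.distance.directed_hausdorff computes the same directed quantity for point arrays.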
In our case we must compare two matrices, of different sizes but with one common dimension, instead of two sets of points. So let us consider $A = (a_{ij})_{n \times m}$ and $B = (b_{ij})_{n \times p}$, the two matrices having the same first dimension. We use the notation n for the number of rows, although we already used it in the last section. Let us assume that $m \le p$.
We introduce two more vectors, $y = (y_i)_{p \times 1}$ and $z = (z_i)_{m \times 1}$, then compute $\|y\|_p = \max_{0 \le i \le p} y_i$ and $\|z\|_m = \max_{0 \le i \le m} z_i$. With these notations we create a new nonlinear metric d having the following form:

$d(A, B) = \max \left\{ \sup_{\|z\|_m \le 1} \inf_{\|y\|_p \le 1} \|By - Az\|_n, \ \sup_{\|y\|_p \le 1} \inf_{\|z\|_m \le 1} \|By - Az\|_n \right\}$   (5)
This restriction-based distance represents the Hausdorff distance between the sets $B\{y : \|y\|_p \le 1\}$ and $A\{z : \|z\|_m \le 1\}$ in the metric space $R^n$. Therefore, we have the following relation:

$d(A, B) = h(B\{y : \|y\|_p \le 1\}, A\{z : \|z\|_m \le 1\})$   (6)
As results from (6), the metric d depends on y and z. Trying to eliminate these terms, we have found a new form for d. It does not depend on these vectors and is not a Hausdorff distance anymore. The newly obtained Hausdorff-based metric is given as follows:

$d(A, B) = \max \left\{ \sup_{1 \le j \le m} \inf_{1 \le k \le p} \sup_{1 \le i \le n} |b_{ik} - a_{ij}|, \ \sup_{1 \le k \le p} \inf_{1 \le j \le m} \sup_{1 \le i \le n} |b_{ik} - a_{ij}| \right\}$   (7)

This nonlinear function d verifies the main properties of a distance: $d(A, B) \ge 0$, $d(A, B) = d(B, A)$, and $d(A, B) + d(B, C) \ge d(A, C)$.
Using the metric given by (7), the distance between any two matrices having a common dimension can be determined. For our speech recognition task, the matrices A and B are sound feature vectors.
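In code, (7) amounts to a Hausdorff-style distance between the column sets of the two matrices, with the Chebyshev (maximum) norm as the point metric; the column-set reading is our interpretation of the formula. A minimal NumPy sketch:

import numpy as np

def hausdorff_matrix_distance(A, B):
    # Distance (7) between two matrices sharing the row dimension n.
    # diff[j, k] = max_i |b_ik - a_ij|, the Chebyshev distance between
    # column j of A and column k of B.
    diff = np.abs(B[:, None, :] - A[:, :, None]).max(axis=0)
    term1 = diff.min(axis=1).max()      # sup_j inf_k sup_i |b_ik - a_ij|
    term2 = diff.min(axis=0).max()      # sup_k inf_j sup_i |b_ik - a_ij|
    return max(term1, term2)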
The provided distance constitutes a very good discriminator between these vectors in the classification process. Our tests show that if two speeches are similar, then the distance $d(v_1, v_2)$ between their feature vectors becomes quite small. In the next section we utilize this newly provided metric in the context of a minimum distance type classifier.
5 A Supervised Speech Classification Approach
The next stage of the automatic speech recognition process is the speech classification. Our pattern recognition system uses a supervised classifier [6].
As we know, the patterns to be classified are the sound signals $s_1, \dots, s_n$. We know that each of them represents a vocabulary word, but we do not know which word it is. The word can be identified by inserting that signal in a class and labelling it with the class label, which represents a written word. To be inserted in a word-related class, a sound $s_k$ has to be similar enough to the speech signals from the training set which are related to that class. This means that its feature vector $V(s_k)$ is close enough, in terms of distance, to a training feature vector set $\{V(S_1^i), \dots, V(S_{n_i}^i)\}$.
We propose a variant of the minimum distance classifier as a sound classification approach [6]. The classical variant of this classifier consists of a set of prototypes and an appropriate metric. A pattern to be recognized is inserted in the class corresponding to the closest prototype in terms of the chosen distance.
Let us now extend the minimum distance classification technique to work for our speech classification task. For each word-related class we have not only one but several prototypes. The nonlinear metric introduced in the last section is used as the distance measure between feature vectors.
So, for each speech sound, each class is considered and the mean value of the distances between its feature vector and the training vectors of that class is computed. The speech signal is inserted in the class corresponding to the smallest mean value and receives the label of that class.
Therefore, if $s_k$ is the current speech signal, it must be placed in the class indicated by $\arg\min_i \frac{1}{n_i} \sum_{j=1}^{n_i} d(V(s_k), V(S_j^i))$, where d represents the distance given by relation (7). Thus, the speech recognition process is formally described by:

$l(s_k) = l\left( \arg\min_i \frac{1}{n_i} \sum_{j=1}^{n_i} d(V(s_k), V(S_j^i)) \right), \quad \forall k = \overline{1,n}, \ i = \overline{1,N}$   (8)

For each $k = \overline{1,n}$, the written word related to the speech sound $s_k$ is identified as $l(s_k)$. The transcript of the entire speech S results as the concatenation of these labels. Thus, the final result of the recognition process is the label $l(S)$, computed as follows:

$l(S) = l(s_1) \oplus \dots \oplus l(s_n)$,   (9)

where the operator $\oplus$ represents here the string concatenation.
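A compact sketch of the classification rule (8) and the concatenation (9), assuming the hausdorff_matrix_distance function above and a featurize function such as the ddmfcc_matrix sketch from Section 2 (all these names are ours, not the paper's):

import numpy as np

def classify_word(V_s, class_prototypes, class_labels):
    # Rule (8): mean distance from V(s_k) to each class's training
    # vectors; the word receives the label of the closest class.
    means = [np.mean([hausdorff_matrix_distance(V_s, V_p) for V_p in protos])
             for protos in class_prototypes]
    return class_labels[int(np.argmin(means))]

def transcribe(word_signals, class_prototypes, class_labels, featurize):
    # Rule (9): concatenate the recognized word labels into l(S).
    words = [classify_word(featurize(s), class_prototypes, class_labels)
             for s in word_signals]
    return ' '.join(words)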
6 Experiments
In this section we present some practical results of our experiments. We have tested the described recognition system for the English language and obtained satisfactory results.
So, let us present now one of our many tests in the word-based speech recognition field. For space reasons we present a simple experiment in this paper. Thus, we consider a discrete vocal sound whose speech has to be recognized. Its signal S is the one represented in the next figure.
Fig. 1

We perform a segmentation on S, using T = 0.01, and obtain three speech signals, $s_1, s_2, s_3$, represented in Fig. 2.

Fig. 2

For each $s_i$, a delta delta mel cepstral feature vector $V(s_i)$ is computed. The feature vectors are displayed as color images in the next figure.

Fig. 3

For recognizing speech transcripts we use a small English vocabulary which contains the words: {car, John, apple, help, hurry, needs}. The spoken words of S belong to this vocabulary. We consider three speakers for the third and the last word, and only two speakers for each of the others. Thus, the following training set is obtained: $S_1^1, S_2^1, S_1^2, S_2^2, S_1^3, S_2^3, S_3^3, S_1^4, S_2^4, S_1^5, S_2^5, S_1^6, S_2^6$ and $S_3^6$.
All these prototype signals are displayed in Fig. 4. The signals related to the word (class) referred by i are represented on row i.

Fig. 4

Also, for these classes we have the following labels: l(1) = 'car', l(2) = 'John', l(3) = 'apple', l(4) = 'help', l(5) = 'hurry', l(6) = 'needs'. The training feature vectors $V(S_j^i)$, resulting from the delta delta mel cepstral analysis, are displayed in Fig. 5.

Fig. 5

Then, we compute the distances, given by (7), between the feature vectors displayed in Fig. 3 and the training vectors displayed in Fig. 5. For each $s_i$, the mean of the distances to each class is then calculated, and the resulting mean values are registered in the next table.
Table 1

            1      2      3      4      5      6
V(s1)     4.10   3.02   4.36   4.27   3.99   5.19
V(s2)     6.41   4.24   5.46   5.79   6.59   3.01
V(s3)     3.36   3.03   2.91   2.76   3.53   4.47
As we can see from this table, the minimum mean distance from $V(s_1)$ to the training feature vectors of a word-related class is 3.02, and it corresponds to the second class. Therefore $s_1$ must be inserted in that class, and $l(s_1) = l(2) = $ 'John'.
From the table row corresponding to $V(s_2)$ we see that the closest class to this vector is the one referred by number 6, the corresponding mean distance value being 3.01. Therefore, we have $l(s_2) = l(6) = $ 'needs'. The third feature vector, $V(s_3)$, is at a minimum mean distance of 2.76 from the fourth class. Thus, we obtain the label $l(s_3) = l(4) = $ 'help'.
The final speech recognition result, the label of S, is obtained as $l(S) = l(s_1) \oplus l(s_2) \oplus l(s_3)$. Thus, the initial vocal sound's speech transcript is $l(S) = $ 'John needs help'.
This experiment does not describe a real speech recognition system, because of its very limited vocabulary, which consists of only several words. Our intention was to prove the discriminating power of our proposed metric by presenting this practical experiment.
7 Conclusions
We have proposed a formal model for an automatic, speaker-independent, word-based discrete speech recognition system. The novelty brought by this work is a nonlinear metric which works properly in discriminating between speech features of different dimensions.
There exist Hausdorff-based metrics utilized in the image processing and recognition domain. We have created such a distance for the speech recognition field.
Also, we tested our recognition approach and obtained good results. In our experiments, low-size vocabularies were used. Our idea is to extend such a vocabulary over time, adding more and more words, so that it reaches a considerable size.
Obviously, the Hausdorff-based distance we have provided will also work properly with the other types of speech recognition approaches mentioned in the introduction. Thus, our future research will focus on continuous speech recognition and phoneme-based speech recognition. In both cases we want to keep using our metric in the feature classification stage.
8 References
[1] L. Rabiner, B.H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ, 1993.
[2] Minh N. Do, An Automatic Speaker Recognition System, Swiss Federal Institute of Technology, Lausanne, 2000.
[3] Beth Logan, Mel Frequency Cepstral Coefficients for Music Modelling, Compaq Computer Corporation, Cambridge, 2000.
[4] Gary D. Robson, Speech Recognition Technology, The Circuit Rider, 1995.
[5] Normand Gregoire, Mikael Bouillot, Hausdorff Distance Between Convex Polygons, web project for CS 507 Computational Geometry, McGill University, 1998.
[6] Richard O. Duda, Pattern Recognition for HCI, Department of Electrical Engineering, San Jose State University, 1997.