Discrete Speech Recognition Using a Hausdorff-based Metric

TUDOR BARBU
Institute for Computer Science, Romanian Academy
Carol I 22A, Iaşi 6600
ROMANIA

Abstract: - In this work we provide an automatic speaker-independent, word-based, discrete speech recognition approach. Our proposed method consists of several processing levels. First, a word-based audio segmentation is performed; then a feature extraction is applied to the resulting segments. The speech feature vectors are computed using a mel-cepstral vocal sound analysis. They must then be compared with the feature vectors of the vocal utterances from a recorded training set, for a supervised classification. Because these vectors have different dimensions, we propose a Hausdorff-based nonlinear distance for comparing them. We also provide a variant of minimum distance feature vector classification, based on the proposed metric.

Key-Words: - Speech recognition, Discrete speech, Vocal sound, Mel-cepstral analysis, Hausdorff metric, Feature vectors, Supervised classification, Training set

1 Introduction
In this paper we are interested in developing a system for automatic speech recognition. A speech recognition system receives a vocal sound as input and provides as output the text representing the transcript of the spoken utterance. There are many known speech recognition approaches [1]. They can be separated into several classes, depending on the linguistic units used in recognition: there exist phoneme-based, morpheme-based, word-based and phrase-based recognition techniques. We propose a word-based speech recognition approach.

Depending on the types of vocal utterances the system is able to recognize, it can be either a discrete speech recognition system or a continuous speech recognition system [4]. Discrete speech contains slight pauses between the spoken words. Continuous speech represents natural speech, its words not being separated by pauses. The word-based segmentation of a vocal sound is easier in the discrete speech case; to segment continuous speech, special methods must be utilized to determine the word boundaries. We choose a discrete speech recognition approach.

Also, depending on the population of users a recognition system can handle, it can be either speaker-dependent or speaker-independent [4]. Systems in the first class must be trained on a specific user, while those in the second class are trained on a set of speakers. We focus on a speaker-independent system. Its modelling consists of the following processing stages:

1. Developing the system training feature set
2. Audio preprocessing of the input vocal sound
3. Word-based segmentation of the input utterance
4. Feature vector extraction for the resulting signal segments
5. Supervised classification and labeling of the speech segments
6. Obtaining the final transcript of the speech from the previously determined labels

We present each of these stages in the next sections. We also offer a practical example to illustrate the functioning of the pattern recognition approach. The main contributions of this paper are the development of a nonlinear metric based on the Hausdorff distance for sets [5] and the creation of a supervised classifier which utilizes the proposed distance.

2 Training Feature Vector Extraction
We propose a supervised vocal pattern recognition method; therefore, it first requires developing a training set of speech patterns.
In this section we focus on this training set. A word-based approach requires a very large training set for optimal speech recognition. The set dimension depends on the chosen vocabulary. An ideal recognition system utilizes the whole vocabulary of the speech language, which could contain tens of thousands of words. Most speech recognition systems use small vocabularies of tens of words, or medium-size vocabularies with hundreds of words. The large vocabulary size represents a disadvantage of word-based speech recognition techniques. The vocabulary size equals the number of classes, because each word corresponds to a class in the recognition process. Obviously, a phoneme-based recognition system utilizes a much smaller number of classes, because the number of words of a language (tens of thousands) is much greater than the number of its phonemes (usually 30-40, depending on the language). We consider creating a vocabulary containing only several words initially and extending it over time.

Let N be our vocabulary size. For each word from the vocabulary we consider a set of speakers, each of them recording a spoken utterance of that word. Thus, we obtain a set of digital audio signals for each word. All the recorded sounds represent the word prototypes of the system. For each i representing a vocabulary word number, we then have a set of signal prototypes $\{S_1^i, \dots, S_{n_i}^i\}$, where $i = \overline{1, N}$, $n_i$ is the number of users who spoke the word referred by i, and $S_j^i$ represents the audio signal of that spoken word recorded by the speaker indicated by $j = \overline{1, n_i}$. The set of all prototype signals, $\{S_1^1, \dots, S_{n_1}^1, \dots, S_1^N, \dots, S_{n_N}^N\}$, represents the training set of our system.

Also, we introduce class labels for all these signals. The label of a signal of a spoken word will be its transcript, the written word. Thus, for each $i = \overline{1, N}$ and $j = \overline{1, n_i}$, we set a signal label $l(S_j^i)$, which is a string representing a vocabulary word. It is obvious that we have:

$$l(i) = l(S_1^i) = \dots = l(S_{n_i}^i), \quad i = \overline{1, N}, \qquad (1)$$

where $l(i)$ represents the label of the word-related class referred by i.

The next step is computing the prototype vectors of the recognition system; they represent the feature vectors of the training set. We perform the training feature extraction by applying a mel-cepstral analysis to the $S_j^i$ sounds, the Mel Frequency Cepstral Coefficients (MFCC) being the dominant features used for speech recognition [1,2,3].

A short-time signal analysis is performed for each of these vocal sounds. We choose to divide each signal into overlapping segments of length 256 samples, with overlaps of 128 samples. Each resulting signal segment is then windowed, by multiplying it with a Hamming window of length 256. We then compute the spectrum of each windowed sequence by applying the FFT (Fast Fourier Transform) to it, obtaining the acoustic vectors of the current signal $S_j^i$. The mel spectrum of these vectors is computed by converting them to the melodic scale, which is described as:

$$mel(f) = 2595 \log_{10}(1 + f / 700), \qquad (2)$$

where f represents the physical frequency and mel(f) the mel frequency. The mel cepstral acoustic vectors are obtained by applying first the logarithm, then the DCT (Discrete Cosine Transform) to the mel spectral acoustic vectors. Then we can compute the delta mel frequency cepstral coefficients (DMFCC), as the first order derivatives of the MFCC, and the delta delta mel frequency cepstral coefficients (DDMFCC), as the second order derivatives of the MFCC. We prefer to use the delta delta mel cepstral acoustic vectors for describing speech content.

These acoustic vectors have a dimension of 256. To reduce this size, we truncate each acoustic vector to its first 12 coefficients, which we consider sufficient for speech featuring. Then we construct a 12-row matrix by positioning these truncated delta delta mel cepstral vectors on the columns. This resulting DDMFCC-based matrix represents the chosen speech feature vector.
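For concreteness, the sketch below (in Python with numpy and scipy, our own choice of tooling) follows the feature extraction just described: 256-sample Hamming-windowed frames with 128-sample overlap, FFT, mel warping via relation (2), logarithm and DCT, second-order time derivatives, and truncation to 12 coefficients per frame. One caveat: where the text converts the full spectrum to the melodic scale directly, we substitute the common triangular mel filterbank construction; the filterbank size of 26 and all helper names are our illustrative assumptions, not the paper's specification.

import numpy as np
from scipy.fft import dct

def mel(f):
    # Relation (2): physical frequency -> mel frequency
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters equally spaced on the mel scale (standard construction)
    mel_points = np.linspace(mel(0.0), mel(sample_rate / 2.0), n_filters + 2)
    hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def ddmfcc_matrix(signal, sample_rate, frame_len=256, hop=128, n_coeffs=12):
    # Short-time framing with 50% overlap, Hamming-windowed
    # (assumes len(signal) >= frame_len)
    n_frames = (len(signal) - frame_len) // hop + 1
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Magnitude spectrum (FFT), warped to the mel scale by the filterbank
    spectrum = np.abs(np.fft.rfft(frames, n=frame_len, axis=1))
    fb = mel_filterbank(26, frame_len, sample_rate)
    mel_spec = spectrum @ fb.T
    # Logarithm + DCT -> mel cepstral vectors; keep the first 12 coefficients
    mfcc = dct(np.log(mel_spec + 1e-10), type=2, norm='ortho', axis=1)[:, :n_coeffs]
    # First- and second-order time derivatives (DMFCC, DDMFCC)
    dmfcc = np.gradient(mfcc, axis=0)
    ddmfcc = np.gradient(dmfcc, axis=0)
    # Feature vector: 12 rows, one column per frame
    return ddmfcc.T

Applied to each prototype signal, such a routine would yield the 12-row feature matrices used throughout the rest of the paper.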
Therefore, for each $i = \overline{1, N}$ and $j = \overline{1, n_i}$, a feature vector $V(S_j^i)$ is computed as a matrix with 12 rows and a number of columns depending on the length of $S_j^i$. The computed vectors represent the training feature vectors of the recognition system. The set $\{V(S_1^1), \dots, V(S_{n_1}^1), \dots, V(S_1^N), \dots, V(S_{n_N}^N)\}$ represents the training feature set of the speech recognition system.

3 Input Speech Analysis
In this section we focus on the input vocal sound analysis. As specified in the introduction, we consider only discrete speech sounds to be recognized by our system. Also, we impose the condition that the vocal sounds given as input contain only spoken words belonging to the given vocabulary.

Let S be the signal of a vocal sound we want to recognize by identifying its transcript. First, several pre-processing actions can be performed on this audio signal. If an amount of noise is still present in the sound, some filtering techniques must be applied to smooth the signal. Also, in the pre-processing stage an important audio effect, often used before speech recognition, can be applied to the signal. This preemphasis effect is applied as follows:

$$S[a] \leftarrow S[a] - \alpha \cdot S[a-1], \qquad (3)$$

where we set the control parameter $\alpha = 0.5$.

The next stage of the speech analysis consists of a word-based vocal segmentation. We must extract from S the signal segments corresponding to the spoken words. This task is performed by detecting the pauses that separate the words. A pause segment is characterized by its length and by its low amplitudes. Thus, we use two threshold values for the pause identification; let them be T and t. A pause represents a signal sequence $\{S[a], \dots, S[a+t]\}$ having the property $|S[a]|, \dots, |S[a+t]| \le T$. We choose the length-related parameter t = 500 and the amplitude-related threshold T to be small enough.

As a result of the pause identification, the word segment extraction process becomes quite simple to fulfil. Let $s_1, \dots, s_n$ be the word-related speech signals extracted from S. Obviously, the speech recognition of S consists of determining the written words corresponding to these signals: for each $i = \overline{1, n}$, the recognition system has to find the word represented by $s_i$. An automatic recognition is performed by comparing these speech signals with the signals from the training set.

A feature extraction process is applied to each $s_i$ sequence; the same mel-cepstral analysis is utilized. For each $i = \overline{1, n}$, a feature vector $V(s_i)$ is computed as the delta delta mel cepstral matrix, truncated to 12 coefficients per column. Each obtained vector has 12 rows and a number of columns depending on the length of $s_i$.
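A minimal sketch of this pre-processing and segmentation stage, under the parameter choices stated above ($\alpha = 0.5$, t = 500; the helper names are our own): preemphasis per relation (3), then word extraction by detecting runs of at least t consecutive samples whose amplitudes stay below T.

import numpy as np

def preemphasize(s, alpha=0.5):
    # Relation (3): S[a] <- S[a] - alpha * S[a-1]
    out = s.copy()
    out[1:] = s[1:] - alpha * s[:-1]
    return out

def segment_words(s, T=0.01, t=500):
    # Mark samples whose amplitude is below the pause threshold T
    quiet = np.abs(s) <= T
    segments, start = [], None
    run = 0  # length of the current low-amplitude run
    for a, q in enumerate(quiet):
        if q:
            run += 1
            # a quiet run reaching length t closes the current word segment
            if run == t and start is not None:
                segments.append(s[start : a - t + 1])
                start = None
        else:
            if start is None:
                start = a  # first loud sample opens a new word segment
            run = 0
    if start is not None:
        segments.append(s[start:])  # trailing word not followed by a pause
    return segments

With these helpers, segment_words(preemphasize(S), T=0.01) would reproduce the segmentation used in the experiment of Section 6. Note that quiet runs shorter than t are kept inside a word, so brief intra-word dips in amplitude do not split it.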
Having different dimensions, these feature vectors and the training feature vectors cannot be compared to each other using linear metrics, like the Euclidean distance or the Minkowski metric. One solution for finding the distance between a feature vector $V(s_k)$ and a training vector $V(S_j^i)$ could be their transformation into matrices of the same size. These feature matrices already have the same number of rows; only the number of columns can differ. The matrix with the greater number of columns (thus corresponding to the longer signal) could be resampled to obtain the same number of columns as the other. The two vectors, now having the same dimensions, could be compared with a Euclidean distance for matrices. This vector resampling solution is not optimal, because the resampling process may produce a loss of speech information in the transformed feature vector. If there is a great size difference between the vectors to be compared, the resampling could result in important classification errors. For this reason we propose a new type of distance measure in the next section.

4 A Hausdorff-based Nonlinear Metric
The classification stage of our recognition process requires a distance measure between vectors of different dimensions. Therefore, we introduce a nonlinear metric which works for matrices and is based upon the Hausdorff distance for sets [5].

First, let us present some general theory regarding the Hausdorff metric. If A and B are two sets of points, the Hausdorff metric is defined as the maximum distance from a set to the nearest point in the other set. Therefore, we have the following relation:

$$h(A, B) = \max_{a \in A} \{ \min_{b \in B} \{ dist(a, b) \} \}, \qquad (4)$$

where h represents the Hausdorff distance between sets and dist is any metric between points (for example the Euclidean distance).

In our case we must compare two matrices, differently sized but with one common dimension, instead of two sets of points. So, let us consider $A = (a_{ij})_{n \times m}$ and $B = (b_{ij})_{n \times p}$, the two matrices having the same first dimension. We use the notation n for the number of rows, although we already used it in the previous section. Let us assume that $m \le p$. We introduce two more vectors, $y = (y_i)_{p \times 1}$ and $z = (z_i)_{m \times 1}$, together with the max norms $\|y\|_p = \max_{1 \le i \le p} |y_i|$ and $\|z\|_m = \max_{1 \le i \le m} |z_i|$. With these notations we create a new nonlinear metric d having the following form:

$$d(A, B) = \max \left\{ \sup_{\|y\|_p = 1} \inf_{\|z\|_m = 1} \|By - Az\|_n, \ \ \sup_{\|z\|_m = 1} \inf_{\|y\|_p = 1} \|By - Az\|_n \right\} \qquad (5)$$

This restriction-based distance represents the Hausdorff distance between the sets $B\{y : \|y\|_p = 1\}$ and $A\{z : \|z\|_m = 1\}$ in the metric space $R^n$. Therefore, we have the following relation:

$$d(A, B) = h(B\{y : \|y\|_p = 1\}, \ A\{z : \|z\|_m = 1\}) \qquad (6)$$

As results from (5) and (6), the metric d is expressed in terms of the auxiliary vectors y and z. Trying to eliminate these terms, we have found a new form for d; it no longer depends on these vectors, and it is no longer a Hausdorff distance between the sets above. The new Hausdorff-based metric is given as follows:

$$d(A, B) = \max \left\{ \sup_{1 \le k \le p} \inf_{1 \le j \le m} \sup_{1 \le i \le n} |b_{ik} - a_{ij}|, \ \ \sup_{1 \le j \le m} \inf_{1 \le k \le p} \sup_{1 \le i \le n} |b_{ik} - a_{ij}| \right\} \qquad (7)$$

This nonlinear function d verifies the main properties of a distance: $d(A, B) \ge 0$, $d(A, B) = d(B, A)$ and $d(A, B) + d(B, C) \ge d(A, C)$.

Using the metric given by (7), the distance between any two matrices sharing a common dimension can be determined. For our speech recognition task, the matrices A and B are sound feature vectors. The provided distance constitutes a very good discriminator between these vectors in the classification process: from our tests it results that if two speeches are similar, then the distance $d(v_1, v_2)$ between their feature vectors becomes quite small. In the next section we utilize this new metric in the context of a minimum distance type classifier.
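Relation (7) can be read as the Hausdorff distance between the two sets of matrix columns, with the maximum over rows, $\sup_i |b_{ik} - a_{ij}|$, playing the role of the point metric dist from (4). A minimal sketch of this reading (Python with numpy; the function name is ours):

import numpy as np

def hausdorff_matrix_distance(A, B):
    # Relation (7): A is n x m, B is n x p; only the row dimension must match.
    assert A.shape[0] == B.shape[0], "matrices must share the row dimension"
    # Pairwise sup-norm distances between columns: D[j, k] = sup_i |b_ik - a_ij|
    D = np.max(np.abs(A[:, :, None] - B[:, None, :]), axis=0)
    # Directed distances, then the symmetric maximum of relation (7)
    a_to_b = np.max(np.min(D, axis=1))  # sup_j inf_k sup_i |b_ik - a_ij|
    b_to_a = np.max(np.min(D, axis=0))  # sup_k inf_j sup_i |b_ik - a_ij|
    return max(a_to_b, b_to_a)

Because only the row dimension must agree, $d(V(s_k), V(S_j^i))$ is well defined even when the two words have different lengths, i.e. different column counts.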
5 A Supervised Speech Classification Approach
The next stage of the automatic speech recognition process is the speech classification. Our pattern recognition system uses a supervised classifier [6]. As we know, the patterns to be classified are the sound signals $s_1, \dots, s_n$. We know that each of them represents a vocabulary word, but we do not know which word it is. The word can be identified by inserting that signal into a class and labelling it with the class label, which represents a written word. To be inserted into a word-related class, a sound $s_k$ has to be similar enough to the speech signals from the training set related to that class. This means that its feature vector $V(s_k)$ is close enough, in terms of distance, to the training feature vector set $\{V(S_1^i), \dots, V(S_{n_i}^i)\}$.

We propose a variant of the minimum distance classifier as our sound classification approach [6]. The classical variant of this classifier consists of a set of prototypes and an appropriate metric. A pattern to be recognized is inserted into the class corresponding to the closest prototype in terms of the chosen distance. Let us now extend the minimum distance classification technique to work for our speech classification task. For each word-related class we have not one but several prototypes. The nonlinear metric introduced in the last section is used as the distance measure between feature vectors. So, for each speech sound, each class is considered and the mean value of the distances between its feature vector and the training vectors of that class is computed. The speech signal is inserted into the class corresponding to the smallest mean value and receives the label of that class. Therefore, if $s_k$ is the current speech signal, it must be placed in the class indicated by $\arg\min_i \frac{1}{n_i} \sum_{j=1}^{n_i} d(V(s_k), V(S_j^i))$, where d represents the distance given by relation (7). Thus, the speech recognition process is formally described by:

$$l(s_k) = l\left( \arg\min_{i = \overline{1, N}} \frac{1}{n_i} \sum_{j=1}^{n_i} d\big(V(s_k), V(S_j^i)\big) \right), \quad k = \overline{1, n}. \qquad (8)$$

For each $k = \overline{1, n}$, the written word related to the speech sound $s_k$ is identified as $l(s_k)$. The transcript of the entire speech S results as a concatenation of these labels. Thus, the final result of the recognition process is the label $l(S)$, computed as follows:

$$l(S) = l(s_1) \oplus \dots \oplus l(s_n), \qquad (9)$$

where the operator $\oplus$ represents string concatenation.
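A minimal sketch of relations (8) and (9), assuming the hausdorff_matrix_distance function from the previous section's sketch and a training set organized as (label, list of prototype feature matrices) pairs; this organization and the function names are our illustrative assumptions. Relation (9) leaves the concatenation operator abstract; since the experiment in Section 6 yields the transcript 'John needs help', we join the labels with spaces here.

def classify_word(v, training):
    # training: list of (l(i), [V(S_1^i), ..., V(S_{n_i}^i)]) pairs, one per class i
    def mean_dist(prototypes):
        # mean of the distances from v to the training vectors of one class
        return sum(hausdorff_matrix_distance(v, p) for p in prototypes) / len(prototypes)
    # Relation (8): pick the class with the smallest mean distance
    return min(training, key=lambda cls: mean_dist(cls[1]))[0]

def transcribe(word_vectors, training):
    # Relation (9): concatenate the word labels into the final transcript
    return ' '.join(classify_word(v, training) for v in word_vectors)

Averaging over all prototypes of a class, rather than taking the single closest prototype, is what distinguishes this variant from the classical minimum distance classifier.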
6 Experiments
In this section we present some practical results of our experiments. We have tested the described recognition system for the English language and obtained satisfactory results. Let us now present one of our many tests in the word-based speech recognition field; for space reasons, we present a simple experiment in this paper.

Thus, we consider a discrete vocal sound whose speech has to be recognized. Its signal S is represented in Fig. 1.

[Fig. 1: the input vocal signal S]

We perform a segmentation on S, using T = 0.01, and obtain three speech signals, $s_1, s_2, s_3$, represented in Fig. 2.

[Fig. 2: the extracted word signals s_1, s_2, s_3]

For each $s_i$, a delta delta mel cepstral feature vector $V(s_i)$ is computed. These feature vectors are displayed as color images in Fig. 3.

[Fig. 3: the feature vectors V(s_1), V(s_2), V(s_3), displayed as color images]

For recognizing speech transcripts we use a small English vocabulary which contains the words {car, John, apple, help, hurry, needs}. The spoken words of S belong to this vocabulary. We consider three speakers for the third and the last word, and only two speakers for each of the others. Thus, the following training set is obtained: $S_1^1, S_2^1, S_1^2, S_2^2, S_1^3, S_2^3, S_3^3, S_1^4, S_2^4, S_1^5, S_2^5, S_1^6, S_2^6$ and $S_3^6$. All these prototype signals are displayed in Fig. 4; the signals related to the word (class) referred by i are represented on row i.

[Fig. 4: the prototype signals of the training set; row i holds the recordings of word i]

Also, for these classes we have the following labels: l(1) = 'car', l(2) = 'John', l(3) = 'apple', l(4) = 'help', l(5) = 'hurry', l(6) = 'needs'. The training feature vectors $V(S_j^i)$, resulting from the delta delta mel cepstral analysis, are displayed in Fig. 5.

[Fig. 5: the training feature vectors V(S_j^i)]

Then we compute the distances, given by (7), between the feature vectors displayed in Fig. 3 and the training vectors displayed in Fig. 5. For each $s_i$, the mean of the distances to each class is then calculated, and the resulting mean values are registered in Table 1.

Table 1
            1      2      3      4      5      6
V(s_1)    4.10   3.02   4.36   4.27   3.99   5.19
V(s_2)    6.41   4.24   5.46   5.79   6.59   3.01
V(s_3)    3.36   3.03   2.91   2.76   3.53   4.47

As we can see from this table, the minimum mean distance from $V(s_1)$ to the training feature vectors of a word-related class is 3.02, and it corresponds to the second class. Therefore $s_1$ must be inserted into that class, and $l(s_1) = l(2) = $ 'John'. From the table row corresponding to $V(s_2)$ we see that the closest class to this vector is the one referred by number 6, the corresponding mean distance value being 3.01. Therefore, we have $l(s_2) = l(6) = $ 'needs'. The third feature vector, $V(s_3)$, is at a minimum mean distance of 2.76 from the fourth class. Thus, we obtain the label $l(s_3) = l(4) = $ 'help'.

The final speech recognition result, the label of S, is obtained as $l(S) = l(s_1) \oplus l(s_2) \oplus l(s_3)$. Thus, the speech transcript of the initial vocal sound is $l(S) = $ 'John needs help'. This experiment does not describe a real speech recognition system, because of its very limited vocabulary, which consists of several words only. Our intention was to prove the discriminating power of the proposed metric by presenting this practical experiment.

7 Conclusions
We have proposed a formal model for an automatic speaker-independent, word-based, discrete speech recognition system. The novelty brought by this work is a nonlinear metric which properly discriminates between speech feature vectors of different dimensions. There exist Hausdorff-based metrics utilized in the image processing and recognition domain; we have created such a distance for the speech recognition field. Also, we have tested our recognition approach, obtaining good results.

In our experiments, small vocabularies were used. Our idea is to extend such a vocabulary over time, adding more and more words, so that it reaches a considerable size. Obviously, the Hausdorff-based distance we have provided will also work properly with the other types of speech recognition approaches mentioned in the introduction. Thus, our future research will focus on continuous speech recognition and phoneme-based speech recognition; in both cases we want to keep using our metric in the feature classification stage.

8 References
[1] L. Rabiner, B. H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ, 1993.
[2] M. N. Do, An Automatic Speaker Recognition System, Swiss Federal Institute of Technology, Lausanne, 2000.
[3] B. Logan, Mel Frequency Cepstral Coefficients for Music Modelling, Compaq Computer Corporation, Cambridge, 2000.
[4] G. D. Robson, Speech Recognition Technology, The Circuit Rider, 1995.
[5] N. Gregoire, M. Bouillot, Hausdorff Distance Between Convex Polygons, web project for CS 507 Computational Geometry, McGill University, 1998.
[6] R. O. Duda, Pattern Recognition for HCI, Department of Electrical Engineering, San Jose State University, 1997.