Searching through a Speech Memory for Efficient Coding, Recognition and Synthesis, or for Unified Speech Processing Without Transcriptions

Gérard CHOLLET (chollet@tsi.enst.fr), Dijana PETROVSKA (dijana.petrovski@unifr.ch)
ENST, CNRS-LTCI, 46 rue Barrault, 75634 PARIS cedex 13, France

ABSTRACT

The common denominator of different speech processing methods is the speech unit being used. The majority of current systems use phonemes (or derived units) as an atomic representation of speech. Using phonetic speech units leads to efficient representations and implementations for many speech processing systems. The major problem that arises when phoneme-based systems are developed is the lack of transcribed databases. This is the main bottleneck for developing new applications and for porting existing ones to new languages, since each of these applications requires new transcribed databases. In order to by-pass this tedious, error-prone and expensive transcription step, we need to develop systems that minimize the use of transcribed databases. This paper presents a unified approach to speech processing based on automatically derived speech units that do not require transcribed databases. These automatically derived units are denoted here as Automatic Language Independent Speech (ALIS) units. Results for very low bit-rate (<600 bps) speaker-independent coding and preliminary results for speaker recognition using these units are reported. Other perspectives are also described.

1. Introduction

The principles of pattern recognition are widely used in automatic speech processing: a similarity measure is evaluated between an unlabeled input and a labeled pattern in memory. For example, in order to achieve bit rates of the order of 400 bps in speech coding, it is necessary to use recognition and synthesis techniques. The coder and the decoder share a dictionary of speech segments. The information transmitted by the coder consists of the indexes of the recognized units and some prosodic parameters. At the decoder side, speech synthesis is used to reconstruct the output speech from the sequence of transmitted symbols.

When we deal only with speech coming from one speaker (the speaker-dependent case), after the recognition phase we use a similarity measure to evaluate which among the units available in the memory of the coder (and also present at the decoder side) best matches the incoming speech unit. By transmitting only these indexes (and some prosodic information), the transmitted bit rate is drastically reduced.

When we want to deal with speech coming from an unknown speaker, we have to use a speaker-independent system. It is built by training a speech recognizer with as much speech, from as many speakers, as possible. This speech recognizer determines the symbolic representation of the incoming unknown speech to be coded. The next step is to determine the best candidates for coding among the units shared by the coder and the decoder. If the similarity measure used during this matching phase conveys speaker-specific information, we can exploit it to perform speaker recognition and/or speaker adaptation.
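To make the orders of magnitude of such index-based coding concrete, the following sketch estimates the transmitted bit rate. The configuration (64 unit models, 8 synthesis representatives per unit, roughly 16 units per second) mirrors the experimental setup described in Section 3.2; the prosody budget per unit is an illustrative assumption, not a figure taken from our experiments.

```python
import math

def segmental_coder_rate(n_units=64, n_representatives=8,
                         units_per_second=16.0, prosody_bits_per_unit=8):
    """Rough bit-rate estimate for an index-based segmental coder.

    Assumes uniform encoding of the unit index and of the index of the
    chosen synthesis representative; prosody_bits_per_unit (pitch,
    energy, timing) is a placeholder value used only for illustration.
    """
    bits_per_unit = (math.ceil(math.log2(n_units))              # unit index: 6 bits
                     + math.ceil(math.log2(n_representatives))  # representative index: 3 bits
                     + prosody_bits_per_unit)                   # prosodic parameters
    return bits_per_unit * units_per_second

print(segmental_coder_rate())  # 272.0 bps with these illustrative settings
```

Even with a generous prosody budget, such a coder stays well below the <600 bps figure quoted in the abstract.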
Current coders working at bit rates lower than 400 bps are based on segmental units related to phones (a phone being the physical realization of the corresponding phoneme). The efficiency of these phonetic vocoders [5], [6] is due to the fact that the main information transmitted by this type of coder is the index of the recognized unit, together with some prosodic parameters. A well-known difficulty of such phoneme-based recognition is the variability of acoustic patterns: the acoustic realization of a phoneme depends on many factors (speaker, context, stress, environment, ...).

An alternative approach, using automatically derived speech units based on ALISP tools, was developed at ENST, ESIEE and VUT-BRNO [1]-[4]. These speech units are derived from a statistical analysis of the speech corpus and require neither phonetic nor orthographic transcriptions of the speech data. However, the experiments conducted so far have only involved the speaker-dependent case. In this paper we extend this technique to the speaker-independent case and show preliminary results illustrating the potential of this new method for speaker recognition.

The outline of this paper is the following: Section 2 explains the selection procedure of the speech segments. The database underlying the experiments and the details of the experimental setup are given in Section 3. Section 4 compares the ALISP speech units with phone units. The results of very low bit-rate speech coding for the speaker-independent case are given in Section 5. The experiments showing that the metric we use to evaluate the distortion between two speech segments conveys speaker-specific information are reported in Section 6. Section 7 gives some perspectives and concludes our work.

2. Selection of Speech Segments

In order to solve the problem of searching in a database representing our speech memory, and to avoid a full search, it is important to represent this memory efficiently. The selection of speech units is done in two steps.

2.1 Acoustic modeling of data-driven (ALIS) units

The first stage consists of a recognition phase: we use Hidden Markov Models for their well-known acoustic modeling capabilities. But in place of the widely used phonetic labels, we use a set of data-driven speech symbols automatically derived from the training speech data. This set of symbolic units is built through temporal decomposition, vector quantization and segmentation. These symbols are then used to train the Hidden Markov Models of the Automatic Language Independent Speech (ALIS) units, which are in turn used during the recognition phase of the incoming speech to be coded. This stage is schematically represented in Figure 1. Here is a more detailed description of the methods used:

2.1.1 Temporal Decomposition

After a classical pre-processing step, temporal decomposition [8] is used for the initial segmentation of the speech into quasi-stationary parts. The matrix of spectral parameters is decomposed into a limited number of events, each represented by a target and an interpolation function. The td95 package is used for the temporal decomposition (thanks to Frédéric Bimbot, IRISA, Rennes, France, for the permission to use it). A short-time singular value decomposition with adaptive windowing is applied during the initial search for the interpolation functions, followed by an iterative refinement of the targets and interpolation functions. At this point, the speech is segmented into spectrally stable segments. For each segment, we also compute its gravity-center frame.
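As a minimal sketch of this last step, the function below computes, for each event, the index of its gravity-center frame. We take it here as the center of gravity of the event's interpolation function over time; this definition and the array-based interface are assumptions made for illustration (the td95 package may define the gravity-center frame differently).

```python
import numpy as np

def gravity_center_frames(interpolation_functions):
    """Return, for each temporal-decomposition event, the index of its
    gravity-center frame, computed here as the center of gravity of the
    event's interpolation function over the frame axis (an assumed
    definition; the frame of maximum weight would be another option)."""
    centers = []
    for phi in interpolation_functions:   # phi: weights over frames
        phi = np.asarray(phi, dtype=float)
        t = np.arange(len(phi))
        centers.append(int(round((t * phi).sum() / phi.sum())))
    return centers

# Example: a triangular interpolation function peaking around frame 3
print(gravity_center_frames([[0.0, 0.2, 0.6, 1.0, 0.6, 0.2, 0.0]]))  # [3]
```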
2.1.2 VQ clustering and segmentation

Vector quantization (VQ) is used for the clustering of the temporal decomposition segments. The codebook training phase is based on the K-means algorithm, and the size of the coding dictionary is determined by the number of centroids in the codebook. The codebook is trained using only the gravity-center frames of the temporal decomposition segments. The result of this step is the labeling of each temporal decomposition segment with a label of the VQ codebook. The segmentation is then refined using the cumulated distance of all the vectors of each segment.

2.1.3 Hidden Markov Modeling

Hidden Markov Models (HMMs) are widely used in speech recognition because of their acoustic modeling capabilities. In these experiments, the HMMs are used to model and refine the segments obtained from the initial VQ segmentation of the training data. At the end of this step we have trained the HMM models that will be used for the recognition of the coding units.

2.2 DTW matching

The results of this first search are refined with a second search. Once the most probable sequence of symbols is estimated, we find, for each of these symbols, the examples in the speech memory (carrying the same label) that best match the coding segments. The distortion measure used in our experiments is the Dynamic Time Warping (DTW) distance. When dealing with speech examples coming from multiple speakers, we preserve the information about the identity of those speakers. In this way, we can choose the subset of the available examples so that all the training speakers are represented in a uniform manner, i.e., each speaker contributes the same number of examples to the chosen subset.
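A minimal sketch of the DTW distance used in this matching phase is given below. The local distance is taken as the Euclidean distance between cepstral frames, and path constraints, slope weights and length normalization are omitted; these simplifications are assumptions for illustration, not the exact settings of our coder.

```python
import numpy as np

def dtw_distance(x, y):
    """DTW distance between two segments x and y, each given as a
    (frames x dimensions) array of cepstral vectors.  Uses a Euclidean
    local distance and the classical three-way recursion."""
    x, y = np.atleast_2d(x), np.atleast_2d(y)
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]
```

The best synthesis representative for a recognized segment is then simply the memory example carrying the same ALIS label that minimizes this distance.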
3. Database and experimental setup

3.1 Database

For the experiments we used the BREF database [7], a large-vocabulary read-speech corpus for French. The texts were selected from 5 million words of the French newspaper "Le Monde". In total, 11,000 texts were chosen in order to maximize the number of distinct triphones, and separate text materials were selected for the training and test corpora. 120 speakers of different French dialects have been recorded, each providing between 5,000 and 10,000 words (approximately 40-70 min of speech). They are subdivided into 80 speakers for training, 40 for development and 20 for evaluation purposes.

3.2 Experimental setup

In this paper we address the issue of extending a speaker-dependent very low bit-rate coder, based on speech units automatically derived with ALISP, to the speaker-independent case. As a first step, a gender-dependent, speaker-independent coder is investigated. For the speaker-independent experiments, we took 33 male speakers from the training partition of the BREF database to train our speaker-independent (and gender-dependent) coder. For the evaluation, 3 male speakers were taken from the evaluation corpus. For a baseline comparison, we ran equivalent speaker-dependent experiments for these three male speakers from the evaluation set, dividing their speech data into a training and a test part. For ease of comparison, the test sentences are common to both experiments.

The BREF database is sampled at 16 kHz. The speech parameterization was done by classical Linear Predictive Coding (LPC) cepstral analysis. The Linear Prediction Cepstral Coefficients (LPCC) are calculated every 10 ms on a 20 ms window. For consistency with the previous speaker-dependent experiments on the Boston University corpus [2], [3], cepstral mean subtraction is applied in order to reduce the effect of the channel. The temporal decomposition was set up to produce 16 events per second on average. A codebook with 64 centroids is trained on the vectors from the gravity centers of the interpolation functions, while the segmentation is performed using cumulated distances over the entire segments. With the speech segments clustered in each class, we trained a corresponding HMM model with three states. The initial HMM models were refined through one re-segmentation and re-estimation step. With those models, the 8 longest segments per model were chosen from the training corpus to build the set of synthesis units, denoted as synthesis representatives. The original pitch and energy contours, as well as the optimal DTW time-warps between the original segment and the coded one, are used. The index of the best-matching DTW representative is also transmitted. The unit rate is evaluated assuming uniform encoding of the indices; the encoding of the representatives increases the rate by 3 additional bits per unit.

4. Comparison of ALISP units with phone units

The automatically derived speech units are the central part of the proposed method. With the experimental parameters reported in the previous section, the mean length of the temporal decomposition segments (prior to the HMM modeling and refinement) is 61 ms. We have studied the correspondence of the automatically derived units with an acoustic-phonetic segmentation. The phonetic units were obtained from forced Viterbi alignments of automatically generated phonetic transcriptions (originating from an automatic phonetizer). The set of phone units used in this experiment is given in Table 1.

Table 1: Phone unit set used in the comparison with ALISP units.

To quantify their correspondence, we first measured the overlapping of the n_a ALISP units with the n_p phone units. Then, the confusion matrix X, with dimensions n_p x n_a, is calculated from the relative overlapping values as follows:

x_{i,j} = \frac{1}{c(p_i)} \sum_{k=1}^{c(p_i)} r(p_i^k, a_j)

where c(p_i) is the number of occurrences of the phone p_i in the corpus, and r(p_i^k, a_j) is the relative overlapping of the k-th occurrence of the phone unit p_i with the unit a_j. The relative overlapping r is calculated from the absolute overlapping R by normalization with the length of the considered phone unit: r(p_i^k, a_j) = R(p_i^k, a_j) / L(p_i^k). The value of x_{i,j} gives the correspondence of the phone unit i with the ALISP unit j over the whole training set.

The resulting confusion matrix is shown in Figure 2, where the 35 phone units are arranged according to phone classes, and the 64 ALISP units are ordered so as to give a pseudo-diagonal shape. We can conclude that some units are similar. For example, the ALISP units He and H8 correspond to the closures b, d, g, while Hf and Hi correspond to the other subset of closures k, p, t. The fricative f has its corresponding unit HI, and the unit HM corresponds to the nasal set m, n. Among the most confused units is H2, which corresponds to a large set of phonemes.
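The following sketch shows one way to compute this correspondence matrix from two time-aligned labelings, each given as a list of (label, start, end) tuples; this interface is hypothetical, but the computation follows the formula above (relative overlaps accumulated and averaged over the occurrences of each phone).

```python
import numpy as np

def correspondence_matrix(phone_segs, alisp_segs, phone_labels, alisp_labels):
    """x[i, j] = (1 / c(p_i)) * sum_k r(p_i^k, a_j), where r is the
    absolute overlap of a phone occurrence with ALISP unit a_j divided
    by the length of that phone occurrence.  Segmentations are lists of
    (label, start, end) tuples, with times in seconds."""
    X = np.zeros((len(phone_labels), len(alisp_labels)))
    counts = np.zeros(len(phone_labels))
    for p_lab, p_start, p_end in phone_segs:
        i = phone_labels.index(p_lab)
        counts[i] += 1
        for a_lab, a_start, a_end in alisp_segs:
            overlap = max(0.0, min(p_end, a_end) - max(p_start, a_start))
            X[i, alisp_labels.index(a_lab)] += overlap / (p_end - p_start)
    return X / np.maximum(counts[:, None], 1)   # average over occurrences
```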
5. Speaker-independent coding

Some results and comments (with LPC synthesis).

6. Searching for the nearest speaker

As explained in the introduction, when the speech to be coded comes from an unknown speaker, a speaker-independent recognizer (trained with as much speech, from as many speakers, as possible) determines the symbolic representation of the incoming speech, and a matching phase then selects the best coding candidates among the units shared by the coder and the decoder. If the similarity measure used during this matching phase conveys speaker-specific information, it can be exploited for speaker recognition and/or speaker adaptation. In order to verify these ideas we have carried out some preliminary experiments, considering two different cases. First, if the speaker to be coded is present in the training database, does our similarity measure discover it? We denote this the known train speakers case. Second, how does the system behave when we code speech from a speaker unknown to the system? We denote this the impostor speakers case.

6.1 Coding speech coming from (known) train speakers

In order to study whether the DTW distance measure makes it possible to determine the speaker identity, we first considered the case of coding speech that is known to originate from one of the training speakers. Given the speech data to be coded, we first segment it with our ALISP recognizer. For each of the recognized segments, we then determine the best-matching examples from our subset of examples carrying the same label. As each of these examples is also tagged with the identity of its original speaker, we can report a histogram over the coded sentences, giving the relative frequencies with which the various identities present in the training database were chosen as the best-matching units. A maximal value of this relative frequency indicates that the identity of the given speaker was preferred over the other identities.

We report results concerning the choice of the speaker identity for the coding of short and long speech segments. The duration of the shorter tests is roughly 30 s, and that of the longer ones varies from 5 to 7 min. The relative frequency histograms of the preferred speaker identity, for coded speech segments originating from speaker number one, are reported in Figures 6.1 and 6.2; the durations of the coded speech are 30 s and 5 min, respectively. Speech coming from train speaker number 2 was also coded in a similar manner, with short and long speech data; these results are reported in Figures 6.3 and 6.4. From these experiments we can conclude that, at least for these two speakers, our algorithm chooses the true identity of the speaker as the closest one. The fact that the histograms show the same tendency for the short and the long speech data is a confirmation of the statistical significance of our test.
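A minimal sketch of how such a histogram can be accumulated is shown below; the speaker identifiers are hypothetical, and the input is simply the list of training-speaker identities attached to the best-matching example of each coded segment.

```python
from collections import Counter

def preferred_speaker(best_match_speakers):
    """Relative-frequency histogram of the training speakers whose
    examples were selected as best DTW matches, together with the
    identity that was chosen most often."""
    counts = Counter(best_match_speakers)
    total = sum(counts.values())
    histogram = {spk: n / total for spk, n in counts.items()}
    return histogram, max(histogram, key=histogram.get)

# Example with hypothetical speaker identities:
hist, winner = preferred_speaker(["spk01", "spk07", "spk01", "spk01", "spk12"])
print(hist, winner)  # {'spk01': 0.6, 'spk07': 0.2, 'spk12': 0.2} spk01
```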
6.2 Coding speech coming from (unknown) impostor speakers

In order to evaluate the behaviour of our system when the speech to be coded is uttered by an unknown speaker, we ran tests with speech coming from two speakers, denoted here as X and Y, who are not present in the training set. Their respective results are presented in Figures 6.3 and 6.4.

Even though these results need to be confirmed with a greater number of training and test speakers, we can conjecture that, when coding speech data coming from known speakers, the proposed algorithm can identify the speaker's identity by preferentially choosing (10% to 15% of the time) segments belonging to this speaker. When an impostor is present, the system has more difficulty choosing a preferred speaker (the maximum values being, in this case, 6% to 8%).

7. Conclusion and Perspectives

We have presented a unified approach to speech processing based on automatically derived ALIS units that require no transcribed databases, reported speaker-independent very low bit-rate coding experiments on the BREF corpus, and shown preliminary evidence that the DTW matching scores used during coding convey speaker-specific information that can be exploited for speaker recognition.

8. Acknowledgments

Several colleagues have contributed to this paper. They are, in alphabetical order: René Carré, Maurice Charbit, Guillaume Gravier, Nicolas Moreau, Dijana Petrovska, Jean-Pierre Tubach, François Yvon. Thanks to all of them.

9. References

1. Barbier, L., Techniques neuronales pour l'adaptation à l'environnement d'un système de reconnaissance automatique de la parole, Doctoral thesis, ENST, Paris, 1992.