WICT PROCEEDINGS, DECEMBER 2008

Building an Automatic Speech Annotation System

Anthony Psaila
Department of Communications and Computer Engineering, University of Malta
Email: antantyy@yahoo.co.uk

Prof. Paul Micallef
Department of Communications and Computer Engineering, University of Malta
Email: pjmica@eng.um.edu.mt

Abstract

The availability of phonetically annotated and labelled speech corpora is highly desirable for speech research and speech technology development. Building corpora through manual phoneme annotation is a laborious and time-consuming task, normally carried out by expert phoneticians. Automatic phoneme annotation is therefore highly desirable for building a Maltese Language Central Corpus database. For this purpose, two automated methods are proposed. The first performs phoneme annotation using the American English TIMIT database, while the second uses a set of manually-segmented Maltese utterances tailor-made for this project. Both require a set of utterances previously annotated down to phoneme level, from which a model is trained for each phoneme. Several algorithms may be used to build phoneme models, based on, among others, Neural Networks, Hidden Markov Models (HMMs), Wavelets and Dynamic Time Warping. The Hidden Markov Model was preferred, being the most widely used approach, and also because of the ready availability of the HTK software suite. Initially, the system automatically annotates simple Maltese words, but it will eventually be expanded to perform generic Maltese speech annotation. This paper presents the methods applied in more detail, together with the results obtained.

1 Introduction

Speech annotation is the task of partitioning a speech signal into the basic speech sounds of a language, known as phonemes.
It is the process of locating phoneme boundaries as precisely as possible given an a priori sequence of phonemes. Several algorithms have been developed over the years, but the most successful are based on Hidden Markov Models (HMMs). These are probabilistic models in which the observations are a probabilistic function of the hidden states. An HMM is defined in terms of its set of transition probabilities {aij} and state output probability distributions {bj(Ot)} [1].

2 The Power of HMMs

For an efficient observation system, it is imperative to optimize a set of HMM models against the available training data. For this purpose, a re-estimation procedure must be designed so that the model parameters are adjusted to better describe a given observation sequence. This adjustment of a model to an observation sequence is referred to as training. Training is both difficult and crucial. One well-known solution is an iterative procedure such as the Baum-Welch algorithm. In the proposed examples, the models were re-estimated several times during the training phase to obtain an optimized model. To uncover the hidden part, namely the start and end times of each phoneme, the Viterbi algorithm is usually applied to a known sequence of phonemes. This is an optimization procedure for selecting the best phoneme start and end times for a given observation, known as decoding or Maximum Likelihood Sequence estimation [2]. In this case, speech annotation was performed using HTK, a toolkit for building HMMs [3]. It is designed for building HMM-based speech processing tools, such as speech recognisers, or for performing speech annotation. Usually two stages are performed, each complementing the other. First, the HTK training tools are used to estimate HMM parameters from training utterances. Secondly, unknown utterances are transcribed using the recognition tools developed in the first stage.
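As an illustration of the Viterbi decoding step described above, the following is a minimal sketch, not HTK code, of forced alignment over a known left-to-right phoneme sequence. The function name and the toy frame log-likelihoods are invented for the example; a real system would obtain the likelihoods from the trained HMM output distributions.

```python
def viterbi_align(log_likes, n_phones):
    """Toy forced alignment: log_likes[t][j] is the log-likelihood of
    frame t under phoneme j; phonemes must appear in order (left-right).
    Returns the phoneme index assigned to each frame."""
    T = len(log_likes)
    NEG = float("-inf")
    # dp[t][j]: best score with frame t assigned to phoneme j
    dp = [[NEG] * n_phones for _ in range(T)]
    back = [[0] * n_phones for _ in range(T)]
    dp[0][0] = log_likes[0][0]          # must start in the first phoneme
    for t in range(1, T):
        for j in range(n_phones):
            stay = dp[t - 1][j]                      # remain in phoneme j
            move = dp[t - 1][j - 1] if j > 0 else NEG  # enter from j - 1
            if stay >= move:
                dp[t][j], back[t][j] = stay, j
            else:
                dp[t][j], back[t][j] = move, j - 1
            dp[t][j] += log_likes[t][j]
    # backtrack from the final phoneme at the last frame to recover boundaries
    path = [n_phones - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Two phonemes over six frames: the first two frames favour phoneme 0,
# the remaining four favour phoneme 1.
ll = [[-1, -5], [-1, -5], [-5, -1], [-5, -1], [-5, -1], [-5, -1]]
print(viterbi_align(ll, 2))  # [0, 0, 1, 1, 1, 1]
```

The boundary between the two phonemes falls where the best path switches state, which is exactly the information HVite later extracts in forced alignment mode.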
For this purpose, the HTK toolkit was used to implement Maltese speech annotation.

3 Using the DARPA TIMIT Database

Setting up a Maltese annotation system requires a considerable amount of previously-segmented data, which was not available for the purposes of this project. As an alternative, testing commenced using DARPA TIMIT [4], a well-known multi-speaker, American English annotated database. The purpose of this work was to automatically annotate Maltese speech by making use of existing material from other languages. Initial experiments were therefore devoted to building phoneme HMMs from an existing annotated database, i.e. DARPA TIMIT. The DARPA TIMIT database has already been used for bootstrapping recognition from another language to Maltese [5]. In this case, it was used for speech annotation. For this purpose, the HTK toolkit and 23 utterances from TIMIT were used for the training phase of the phoneme models. The topology used was a continuous-density Gaussian 5-state HMM. A sequence of parameterized feature vectors was extracted from the utterances. The system used MFCCs with Energy, Delta, Acceleration and Cepstral Mean Normalization. Each analysis used a 25 ms Hamming window with a 10 ms frame period. The training of all the phonemes was carried out by a system based on the HTK toolkit. For the purpose of system evaluation, the set of TIMIT phonemes had to be mapped onto the equivalent Maltese phonemes. In practice, some Maltese phonemes mapped directly onto TIMIT phonemes, while the rest used the nearest equivalent. System evaluation was performed on a set of Maltese utterances. A phoneme boundary found to be within tolerance, which for this experiment was 17.2 ms (see Section 5), was marked as correct. System performance was not satisfactory, because only 73% of the segmentations were within the tolerance range. However, these were only preliminary results, on which further research was developed.
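The framing stage of such an MFCC front end can be sketched as follows. This is a simplified illustration of a 25 ms Hamming window advanced every 10 ms; the function name and toy input are assumptions, the 16 kHz rate matches the sampling rate stated in Section 4, and a real front end would go on to compute the cepstral coefficients from each frame.

```python
import math

def frame_signal(samples, sample_rate=16000, win_ms=25, hop_ms=10):
    """Split a waveform into overlapping Hamming-windowed frames,
    mirroring the 25 ms window / 10 ms frame period used for MFCC analysis."""
    win = int(sample_rate * win_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)   # 160 samples at 16 kHz
    hamming = [0.54 - 0.46 * math.cos(2 * math.pi * n / (win - 1))
               for n in range(win)]
    frames = []
    for start in range(0, len(samples) - win + 1, hop):
        frames.append([samples[start + n] * hamming[n] for n in range(win)])
    return frames

# One second of 16 kHz audio yields (16000 - 400) // 160 + 1 = 98 frames.
frames = frame_signal([0.0] * 16000)
print(len(frames), len(frames[0]))  # 98 400
```

Each of these frames then produces one acoustic feature vector, so boundary locations are inherently quantized to the 10 ms frame grid, a point returned to in the Results section.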
This relatively unsatisfactory system performance was due to one or more of the following factors:

1) the training of the HMMs was carried out using a small number of utterances;
2) the training utterances were recorded by various speakers;
3) not all the Maltese phonemes had an exact mapping to TIMIT phonemes;
4) the results obtained were limited to a mapped set of Maltese phonemes.

To resolve some of these issues and increase system performance, it was decided to record some fresh single-speaker Maltese speech data for the training of the HMMs.

4 Using a Tailor-made Maltese Utterance Database

The second method was based on a speaker-dependent system utilizing a continuous-density Gaussian, 5-state, left-right HMM with a mixture topology. To build the Maltese speech database, 100 Maltese utterances were recorded by one speaker at a professional studio. Using the HTK toolkit and Cool Edit, all the utterances were manually labelled so that each phoneme had an associated start and end time. Of these utterances, 70 were used for the training of the HMMs, while the other 30 were used for the evaluation phase.

4.1 Data Preparation Phase

The labels of all 70 training utterances were collectively saved in one Master Label File (MLF). For the training of the HMMs, the 70 training speech waveforms were parameterized into MFCC coefficients, using Energy, Delta, Acceleration and Cepstral Mean coefficients. A 39-element acoustic vector was extracted from each frame. The source waveforms were sampled at 16000 Hz, using a 25 ms Hamming window and a frame period of 10 ms.

4.2 Initialization Phase

To create the HMM definitions for the required phonemes, a prototype model was designed with the topology already mentioned. Because several parameter distributions are inherently multi-modal, the model design contains multiple mixture components to improve annotation performance.
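The 39-element acoustic vector is consistent with 13 static coefficients (12 MFCCs plus energy) extended by their 13 delta and 13 acceleration coefficients. The following is a sketch of the standard delta regression over a window of plus/minus 2 frames (HTK's default delta window); the helper name and toy data are assumptions.

```python
def deltas(static, window=2):
    """Delta coefficients via the standard regression formula:
    d_t = sum_{k=1..W} k * (c_{t+k} - c_{t-k}) / (2 * sum_{k=1..W} k^2),
    with the sequence padded by repeating the edge frames."""
    T = len(static)
    dim = len(static[0])
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(T):
        row = []
        for d in range(dim):
            num = 0.0
            for k in range(1, window + 1):
                nxt = static[min(t + k, T - 1)][d]  # repeat last frame at the end
                prv = static[max(t - k, 0)][d]      # repeat first frame at the start
                num += k * (nxt - prv)
            row.append(num / denom)
        out.append(row)
    return out

# 13 static coefficients -> 13 deltas; applying deltas() to the deltas
# gives the 13 accelerations, for 39 features per frame in total.
static = [[float(t)] * 13 for t in range(5)]  # toy linear ramp
print(deltas(static)[2][0])  # 1.0 (slope of the ramp at the middle frame)
```

On the linear ramp the regression recovers the slope exactly, which is the intended behaviour of the delta features: they approximate the local time derivative of each static coefficient.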
Initially, the prototype is configured with all mean values equal to 0 and all diagonal variance values equal to 1. The HTK HInit tool was used for the initialization of the set of phoneme HMMs [3]. The system made use of the available training data and their respective MFCC parameters. In all there were 27 phonemes, plus one for sil to represent 'silence' and one for breath to represent 'voice breath'.

4.3 Iteration Phase

During this phase the initial HMM parameters are further re-estimated by the HTK HRest tool. This tool is based on the Baum-Welch algorithm, and it makes use of the available training material until the re-estimation change converges to within a given threshold.

4.4 Re-estimation Phase

This phase was dominated by the HERest embedded training tool [3]. It re-estimated the parameters, previously estimated by HRest, of all the HMM phoneme models simultaneously. For each training utterance, the corresponding phoneme models were concatenated, and the forward-backward algorithm was then used to accumulate the relevant statistics for each HMM in the sequence. Each training utterance was processed in turn, and had an associated label file with the corresponding transcription for that utterance. This process is illustrated in Fig 1. To avoid computing highly improbable combinations of the forward-backward probabilities, and also to speed up the computation, HERest made use of pruning [3]. Pruning restricts the computation of the forward and backward probabilities αt(i) and βt(i) to just those values within a threshold, while the rest are ignored. On completion, HERest outputs the new updated HMM definitions in a master macro file. These new phoneme HMM definitions are the key parameters during the phoneme annotation process.

Fig 1 File Processing in HERest (The HTK Book, 2002)

4.5 Evaluation Phase

The second set of 30 Maltese utterances was used for system evaluation. These utterances were parameterized into a series of feature vectors in the same way as the training utterances. Automatic annotation was performed by the HTK HVite tool in forced alignment mode. In this mode the system locates the phoneme boundaries of each speech utterance with a known sequence of phonemes. HVite computes a new network for each input utterance, using the phoneme-level transcriptions and a one-to-one phoneme dictionary. The HVite decoding process, when in forced alignment mode, builds a network with only one path, since it is constrained by the known phonemic sequence. For each test utterance, HVite constructs an alignment network using a master macro file, which contains the list of all the required HMM models. The annotation process then attaches the appropriate HMM definition to each phoneme instance. The decoder, by applying the Viterbi algorithm, selects the optimal path through the network, which leads to the best matching pronunciations. A lattice that includes model alignment information is eventually constructed. Finally, the lattice is converted into a transcription, producing the phoneme labels with their associated boundaries. The output transcriptions are saved in an output file containing the phoneme sequence with the associated boundaries. The annotation system works on these principles.

5 Results

Since the annotation system used a frame of 25 ms duration at a period of 10 ms, the location of the phoneme boundaries was frame-dependent. The resulting tolerance was calculated using the formula

    tolerance = Δ/√12 + frame period,

where Δ = 25 ms, giving 17.2 ms. Label markings within this 17.2 ms tolerance were established as correct. System performance was established by comparing the automatically annotated phoneme label markings with the manually marked ones.
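The 17.2 ms figure, and the boundary comparison against the manual markings, can be reproduced with a short sketch. This assumes the tolerance models the 25 ms frame's uncertainty as the standard deviation of a uniform distribution of width Δ, i.e. Δ/√12, plus the 10 ms frame period; the boundary times in the example are hypothetical.

```python
import math

# Frame-quantization tolerance: uniform uncertainty over a 25 ms frame
# (standard deviation delta / sqrt(12)) plus the 10 ms frame period.
delta = 25.0         # frame duration in ms
frame_period = 10.0  # frame shift in ms
tolerance = delta / math.sqrt(12) + frame_period
print(round(tolerance, 1))  # 17.2

def boundary_accuracy(auto_ms, manual_ms, tol):
    """Share (%) of automatic boundaries within tol ms of the manual ones."""
    correct = sum(1 for a, m in zip(auto_ms, manual_ms) if abs(a - m) <= tol)
    return 100.0 * correct / len(manual_ms)

# Hypothetical boundary times (ms) for a short utterance: three of the
# four automatic boundaries fall within the tolerance.
auto = [102.0, 250.0, 391.0, 515.0]
manual = [110.0, 245.0, 360.0, 520.0]
print(boundary_accuracy(auto, manual, tolerance))  # 75.0
```

The same per-boundary test, applied to all boundaries in the 30 evaluation utterances, yields the detection accuracy reported below.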
Detection accuracy was worked out using the following formula:

    Detection Accuracy = (number of boundaries correctly located / total number of boundaries tested) × 100%

Boundary detection accuracy reached the 88.11% mark. All the voiced phones, which total 652, were further divided into 5 categories, and their percentage accuracy is listed in Table 1. One can notice that the semivowels produced the worst performance; they also constitute the group of phonemes that is most difficult to determine manually.

    Category      Quantity   Accuracy (<= 17.2 ms)
    vowels        244        87.91%
    stops         141        97.16%
    affricates    26         100.00%
    fricatives    132        93.94%
    semivowels    109        72.94%
    all phones    652        89.11%

Table 1 Percentage accuracy for each category

Fig 2 illustrates a comparison between the manual and automatic transcriptions for waveform S071. The upper set of labels represents the manual transcription, while the lower set represents the automatic transcription.

6 System Drawbacks and Recommendations

For the proper training of the model parameters, and to better evaluate the variety of phonemes, a larger number of manually-segmented recorded utterances would be preferable. It is also important to use high-quality speech recordings for better phoneme identification. Ideally, each utterance should consist of clearly spoken words. Another drawback was that the speech utterances were not phonetically balanced, so some phonemes had relatively few occurrences. Since the training of the Hidden Markov Models depends on the manual labelling of the training data, it is very desirable that the phoneme markings be as accurate as possible. For those phonemes that presented difficulty in annotation, semivowels in particular, one would recommend that dedicated HMMs be applied to specifically train this group of phonemes.
This could be carried out using an HMM with a different number of states or another topology. Although the annotation system was speaker-dependent, it could be expanded into a speaker-independent system. This would require a larger, multi-speaker annotated Maltese database for the training of the HMMs.

Fig 2 A WaveSurfer screen shot showing, vertically from top to bottom, the S071 waveform, the manual phonetic labels and the automatically aligned phonetic labels.

The design of the HMM model was based on a tri-mixture topology, which somewhat limits the resulting distributions. To improve system performance and obtain a more robust model, the number of mixtures should preferably be increased to nine or more.

References

[1] L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, Vol. 77, No. 2, 1989.
[2] S. Young and G. Bloothooft, "Corpus-based Methods in Language and Speech Processing", 1997.
[3] S. Young, G. Evermann, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book, Cambridge University Engineering Dept, Cambridge, UK, 2002.
[4] J.S. Garofolo, L.F. Lamel, W.M. Fisher, J.G. Fiscus, D.S. Pallet, N.L. Dahlgren, "DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM", NIST, 1990.
[5] R. Ellul-Micallef, "Speech Annotation", Dissertation, Department of Communications and Computer Engineering, University of Malta, 2002.