Slovko: A Phoneme-based Speech Re-sequencing System in MATLAB

Miloš Cerňak
Institute of Informatics of the Slovak Academy of Sciences, Dúbravská c. 9, 845 07 Bratislava
Milos.Cernak@SAVBA.SK

Abstract

This paper presents a Slovak speech re-sequencing system. The system is derived from the Festival TTS system, and its first version uses the phoneme as the basic synthesis unit. The design of the system is divided into two parts. First, semi-automatic orthoepic transcription, speech labeling using HMMs, and the building of CARTs for each phoneme are done directly in the Festival system. A wrapper was created to convert the CARTs from LISP structures to a MATLAB-readable format. Second, the definition of the unit costs and the concatenation costs, the selection of candidate clusters of speech segments, and finally the Viterbi search are done in MATLAB.

1. Introduction

Speech re-sequencing systems are among the most popular approaches in present TTS systems. MATLAB is a mathematical tool often used by digital speech processing groups in academia. However, only a few works have been published in which MATLAB serves as the framework for a TTS system. The Slovko TTS system emanates from the Festival TTS system and is implemented in the MATLAB environment. Such a combination of two different systems allows utilizing the power of Festival and the Edinburgh Speech Tools for offline speech processing and CART creation, and the power of MATLAB for run-time speech segment selection and concatenation. The presented approach has two main advantages: first, MATLAB is a well-known tool widely used by many speech labs, and second, it offers many implementations of new scientific algorithms, which can easily be used in run-time speech segment selection and concatenation.

Section 2 outlines the process of automatic labeling of the recorded speech database using a statistical approach. Section 3 briefly describes CART-based phonetic transcription. Sections 4 and 5 give an overview of the unit selection and unit concatenation.

2. Automatic Speech Database Labeling

Two main approaches are known to give the best results for automatic speech database labeling [1]. The first uses the DTW algorithm and is based on a time alignment of the labeled speech and its re-synthesized version. The second one is a statistical approach: it uses HMMs, either with already trained acoustic models of context-dependent phonemes or without them. I followed the second approach, using HMMs without reference acoustic models. The Sphinx2 speech recognition system and the SphinxTrain tool for acoustic model creation from Carnegie Mellon University were used (http://www.speech.cs.cmu.edu/sphinx/).

The process of speech segmentation is shown in Fig. 1. The system takes speech signal files with precise orthographic transcriptions as its inputs. Customization of the Festival TTS system for the Slovak language was necessary in this step.

Figure 1: The flowchart of the automatic speech segmentation. CI means 'Context Independent', and CD means 'Context Dependent'.

Creating the acoustic models with the SphinxTrain tool and then aligning with HMMs is not an optimal solution. However, the concatenation of diphones instead of phones (see [2] as an example) or the use of the optimal coupling method [3] might minimize these errors. The classification and regression tree (CART) method can be used to automatically correct the time alignment errors as well. This approach is based on the assumption that HMMs make systematic errors: a small fraction of the annotated corpus is taken, a human expert corrects the mistakes, a CART is trained on this information, and this "correction" tree is then used to automatically correct the whole speech corpus [4].
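To make the interface between the off-line labeling and the MATLAB run-time concrete, the following minimal sketch shows how time-aligned labels could be read into MATLAB unit records. The three-column label format (start time, end time, phone name), the file name, and the field names are illustrative assumptions, not the exact output format of the Sphinx tools; the middle marker is here simply taken as the interval midpoint.

    % A minimal sketch, assuming a plain three-column label format
    % (start time, end time, phone name); the actual aligner output may differ.
    fid = fopen('sentence001.lab', 'r');     % hypothetical label file
    C = textscan(fid, '%f %f %s');           % start [s], end [s], phone name
    fclose(fid);
    n = numel(C{3});
    units = struct('name',     C{3}, ...
                   'sentence', repmat({'sentence001'}, n, 1), ...
                   't_start',  num2cell(C{1}), ...
                   't_middle', num2cell((C{1} + C{2}) / 2), ... % midpoint assumed
                   't_end',    num2cell(C{2}));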
3. Data-driven Phonetic Transcription

The process of building the phonetic transcription automatically consists of the following stages:
- An assignment of allowed phonemes to each Slovak letter. These assignments were created manually in order to facilitate automatic rule learning; the system is therefore semi-automatic.
- An alignment of the transcriptions of the training set, where each letter had to have its respective transcription.
- A generalization of the observed assignments to the test set.

More details about these steps can be found in [6]. A comparison with a knowledge-based transcription system was made. The test set consisted of 2918 phonemes. The knowledge-based system made 8 mistakes on the whole test set, while the data-driven system failed 67 times. The probability of correctly transcribing letters to phonemes was thus 99.73% for the classical approach and 97.70% for the automatic approach.

4. Unit Selection

The Slovko TTS system uses the technique described in [5]. Because the Festival TTS system implements this technique as well, the preprocessing for the creation of the CARTs is done directly in Festival. I developed a file format converter for the CARTs, from LISP structures into a MATLAB-readable format. The features used in the Slovko TTS system are as follows:
- Vowel or consonant
- Vowel length
- Vowel height
- Vowel frontness
- Lip rounding
- Consonant type
- Place of articulation
- Consonant voicing

These features were used as questions during the building of the CARTs, and they are extracted from the input text automatically in order to find the appropriate cluster of units in the speech database.

Let me define the speech synthesis framework for unit concatenation. Let $S = \langle s_1, s_2, \ldots, s_n \rangle$ be a speech vector consisting of $n$ units. Each unit $s_i$, where $1 \le i \le n$, is defined as

$s_i = (s_i^{name}, s_i^{sentence}, s_i^{start}, s_i^{middle}, s_i^{end}),$ (4.1)

where $s_i^{name}$ is a unique name of the unit, $s_i^{sentence}$ is the name of the speech database file the unit came from, and the last three items represent time markers of the unit within the sentence of the speech database.

Speech synthesis begins with a phonetic transcription of the input text, producing the sequence of phones $P = \langle p_1, p_2, \ldots, p_m \rangle$, where $p_1$ and $p_m$ are pause segments at the beginning and the end of the synthesized speech. For each phoneme, a feature vector $v$ is then created:

    % p: phone indices into F, the feature matrix of all Slovak phonemes
    for i = 2:m-1
        % concatenate the feature rows of the left and right neighbours
        v{i} = [F(p(i-1), :), F(p(i+1), :)];
    end

The matrix $F$ holds the features of all Slovak phonemes. The CART method finds a set of clusters of units from the speech database, indexed by the features. A specific cluster is selected using the feature vector $v(p_i)$ for each synthesized phoneme. In this way a lattice is formed, consisting of $m-2$ clusters. The lattice is weighted by a target cost and a concatenation cost. The target cost $d_u(u_j, T)$ is defined as the distance from the center of a cluster, and the concatenation cost $d_c(u_i, u_j)$ is defined as a weighted sum of the F0 difference between the concatenated units and the Euclidean distance of the concatenated micro-segments. The task of the optimal sequence search is to find the sequence

$\hat{U} = \langle u_1, u_2, \ldots, u_i \rangle, \quad 1 \le i \le m-2.$ (4.2)

This may be formalized as follows:

$\hat{U} = \arg\min \left( \sum_{j=1}^{N} d_u(u_j, T) + \sum_{j=2}^{N} d_c(u_{j-1}, u_j) \right).$ (4.3)

The Viterbi algorithm performs the efficient search for the optimal path. The relation of the vector $\hat{U}$ and $S$ is as follows:

$\hat{U} \subset S.$ (4.4)
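The search in (4.3) can be implemented directly in MATLAB. The following sketch is illustrative only: cand{k} is assumed to hold the candidate unit indices of the k-th cluster of the lattice, and targ_cost and concat_cost are placeholder functions standing for $d_u$ and $d_c$.

    % Minimal Viterbi sketch over the cluster lattice (cf. eq. 4.3).
    % cand{k}: candidate unit indices of the k-th cluster (assumed given);
    % targ_cost(k, u) and concat_cost(u, v) are placeholders for d_u and d_c.
    K = numel(cand);
    D = cell(1, K); back = cell(1, K);
    D{1} = arrayfun(@(u) targ_cost(1, u), cand{1});
    for k = 2:K
        nk = numel(cand{k});
        D{k} = inf(1, nk); back{k} = zeros(1, nk);
        for b = 1:nk
            for a = 1:numel(cand{k-1})
                c = D{k-1}(a) + concat_cost(cand{k-1}(a), cand{k}(b));
                if c < D{k}(b), D{k}(b) = c; back{k}(b) = a; end
            end
            D{k}(b) = D{k}(b) + targ_cost(k, cand{k}(b));
        end
    end
    [~, b] = min(D{K});                      % best final candidate
    U_hat = zeros(1, K);                     % optimal unit sequence (eq. 4.2)
    for k = K:-1:2, U_hat(k) = cand{k}(b); b = back{k}(b); end
    U_hat(1) = cand{1}(b);

With cluster sizes bounded by some constant c, the search costs O(K c^2) operations, which keeps the run-time selection tractable even in interpreted MATLAB.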
5. Unit Concatenation

The proposed method works on a pitch-synchronous basis. To speed up the computation, pitch markers have to be defined first. I did this with the pitchmark program of the Edinburgh Speech Tools. The speech signals are double low-pass filtered and then double high-pass filtered; double filtering (feeding the waveform through the filter, then reversing the waveform and feeding it through again) is performed to reduce any phase shift between the input and output of the filtering operation. Using variables for the minimal and maximal allowed pitch periods, the pitch markers are defined.

To achieve minimal audible changes at the concatenation of $s_i$ and $s_j$, the concatenation point should be at the end of a pitch cycle with a minimal pitch difference between the concatenated segments [7]. I followed this scheme in the proposed waveform interpolation method: a new artificial pitch period is created and inserted between the concatenated segments. Figure 2 shows this schematically.

Figure 2: The creation of a new artificial pitch period from two overlapping periods of the concatenated segments.

During the acoustic parameterization of the acoustic inventory (the speech database) I found the pitch markers $m_i$. For the concatenation of the units $s_i$ and $s_j$, the pitch mark $m_i$ closest to $s_i^{end}$ and the pitch mark $m_j$ closest to $s_j^{start}$ are found. The new borders of the units $s_i$ and $s_j$ are defined as

$\langle s_i^{start}, m_i \rangle$ (5.1)

and

$\langle m_{j+1}, s_j^{end} \rangle.$ (5.2)

The next pitch period from $s_i$ is selected as

$m_{end} = \langle m_i, m_{i+1} \rangle,$ (5.3)

and the next pitch period from $s_j$ is selected as

$m_{start} = \langle m_j, m_{j+1} \rangle.$ (5.4)

Let $m_{end}$ be a pitch period consisting of $N$ samples and $m_{start}$ a pitch period consisting of $M$ samples. The length of the new mutually adapted pitch period is defined as $L = \max(N, M)$. The overlap-and-add principle is used to create the new artificial pitch period. A Hann window of length $2N$ is applied to the $m_{end}$ pitch period, centered at the first sample of $m_{end}$; if $N < L$, $L - N$ zeros are added at the end of the pitch period. Similarly, a Hann window of length $2M$ is applied to the $m_{start}$ pitch period, centered at the last sample of $m_{start}$; if $M < L$, $L - M$ zeros are inserted at the beginning of the pitch period. Finally, both pitch periods are added, and the new pitch period is inserted between the units $\langle s_i^{start}, m_i \rangle$ and $\langle m_{j+1}, s_j^{end} \rangle$. Figure 3 shows synthesized speech with raw concatenation and with the proposed pitch-synchronous windowed method.

Figure 3: Synthesized speech with a raw concatenation (top), and with the pitch-synchronous windowed method applied (bottom).
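The windowing scheme above translates almost line by line into MATLAB. In the following sketch, m_end and m_start are assumed to be column vectors holding the two excised pitch periods, and hann() is the Signal Processing Toolbox window function (any Hann implementation would do).

    % Minimal sketch of the artificial pitch period (Section 5); m_end and
    % m_start are assumed column vectors holding the two excised periods.
    N = length(m_end);  M = length(m_start);
    L = max(N, M);                            % length of the adapted period
    w_end   = hann(2*N);                      % window centered at 1st sample of m_end
    w_start = hann(2*M);                      % window centered at last sample of m_start
    p_end   = [m_end .* w_end(N+1:2*N); zeros(L-N, 1)];   % falling half + zeros
    p_start = [zeros(L-M, 1); m_start .* w_start(1:M)];   % zeros + rising half
    p_new   = p_end + p_start;                % overlap-and-add result

Because both padded periods have length L, the addition is well defined, and the result fades out of the left unit while fading into the right one, which is exactly the interpolation effect shown in Figure 2.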
6. Discussion and Future Work

During the evaluation I also tried to synthesize with the original CARTs, but using diphone concatenation. For each unit $s_i$ from the speech vector $S$, a diphone is defined as

$\langle s_{i-1}^{middle}, s_i^{middle} \rangle.$ (6.1)

The concatenation cost was computed as described in the previous sections, and the Viterbi algorithm was used again to search for the optimal sequence of units. The final concatenation of the selected units was done as a concatenation of the selected diphones. However, the described approach was not as good as expected. The reason might be that the acoustic representation of phonemes, rather than diphones, was used to create the CARTs.

A further improvement might be dynamic concatenation. The concatenation points $s_i^{end}$ and $s_{i+1}^{start}$ are then not fixed and can be moved within the ranges $\langle s_i^{middle}, s_i^{end} \rangle$ and $\langle s_{i+1}^{start}, s_{i+1}^{middle} \rangle$, in accordance with some objective measurements (see [3] as an example).

Finally, prosodic modeling should not be neglected. There is an ongoing discussion about which is better: a knowledge-based approach or a data-driven approach. Recent results of data-driven approaches show that they are very promising and can produce very good models without years of exhausting research; the work done on the closely related Czech language [8] is a good example. The Slovko TTS system works with CARTs, so additional CARTs for prosody modeling could be included as well.

Slovko TTS is implemented in the MATLAB environment as a set of scripts. Generation scales linearly, and producing one second of synthesized speech takes approximately one second on an AMD Athlon 1.5 GHz, which is far from real-time processing (for example, the generation of 5 s of speech takes roughly 10 s: 5 s for synthesis plus 5 s for playback). Slovko TTS was, however, designed as an experimental tool to verify data-driven speech synthesis approaches for the Slovak language. Many technicians at most universities are familiar with this environment, which enables a rapid transfer of knowledge and of speech synthesis techniques to teachers and their students, who can easily either further extend the synthesizer or replace some blocks with more effective algorithms. The current version of Slovko TTS is fully intelligible and maintains the speaker's characteristics.

7. Acknowledgements

This work was supported by the Slovak Agency for Science VEGA, grant No. 2/2087/22.

8. References

[1] F. Malfrère, O. Deroo, T. Dutoit, and C. Ris. Phonetic Alignment: Speech Synthesis-based vs. Viterbi-based. Speech Communication, 40:503–515, 2003.
[2] M. Beutnagel, A. Conkie, and A. Syrdal. Diphone Synthesis Using Unit Selection. In Third International Workshop on Speech Synthesis, Sydney, Australia, 1998.
[3] A. Conkie and S. Isard. Optimal Coupling of Diphones. In J. P. H. van Santen, R. W. Sproat, J. P. Olive, and J. Hirschberg, editors, Progress in Speech Synthesis, pages 293–303. Springer-Verlag, New York, 1997.
[4] J. Adell and A. Bonafonte. Towards Phone Segmentation for Concatenative Speech Synthesis. In 5th ISCA Speech Synthesis Workshop, pages 139–144, Pittsburgh, USA, 2004.
[5] A. W. Black and P. Taylor. Automatically Clustering Similar Units for Unit Selection in Speech Synthesis. In Proceedings of the European Conference on Speech Communication and Technology, volume 2, pages 601–604, Rhodes, Greece, 1997.
[6] M. Cerňak, M. Rusko, M. Trnka, and S. Daržágín. Data-driven Versus Knowledge-based Approaches to Orthoepic Transcription in Slovak. In Proceedings of the 2nd ICETA, pages 95–98, Košice, Slovakia, 2003.
[7] J. Yi. Corpus-Based Unit Selection for Natural-Sounding Speech Synthesis. PhD thesis, MIT, 2003.
[8] R. Batušek. A Duration Model for Czech Text-to-Speech Synthesis. In Speech Prosody 2002, Aix-en-Provence, France, 2002.