ELECTRONICS AND ELECTRICAL ENGINEERING
ISSN 1392 – 1215    20XX. No. X(XX)
ELEKTRONIKA IR ELEKTROTECHNIKA

SIGNAL TECHNOLOGY    T 121    SIGNALŲ TECHNOLOGIJA

Speech Segmentation Analysis Using Synthetic Signals

Mindaugas Greibus, Laimutis Telksnys
Vilnius University Institute of Mathematics and Informatics, Akademijos str. 4, LT-08663 Vilnius, Lithuania, phone: +37068789464, e-mail: mindaugas.greibus@exigenservices.com, telksnys@ktl.mii.lt

Introduction

The speech segmentation problem can be defined differently depending on the application in which the segments will be used. Speech segments can be divided into levels: phone, syllable, word and phrase. An automated speech recognition (ASR) algorithm can solve segmentation tasks indirectly. Blind segmentation is used for tasks where speech information has to be extracted, e.g. speech corpora construction or initial segment extraction to reduce the amount of data for ASR. Automatic speech segmentation is attractive for different types of solutions. Cosi et al. [1] state that the advantage of automated segmentation is that its results are predictable: mistakes are made in a coherent way.

Evaluating speech segmentation precision requires appropriate reference segments. Finding a transcription that can serve as an ideal for comparison is a non-trivial task [2]. Manual segmentation depends on the expert's experience and is based on non-quantitative criteria for defining where segments start and end. The same person can put a change point marker in a slightly different place each time he or she does the segmentation. In their experiment, Wesenick et al. [2] compared manual and automatic phoneme segmentation performance. The results showed that, in general, both the experts and the algorithm make errors on certain phoneme transitions: nasal to nasal and vowel to lateral.
Manual and automatic segmentation also showed the same tendencies for the best segmentation cases: the algorithm and the experts both performed well on the phoneme transitions voiceless plosive to nasal and voiceless plosive to vowel.

A natural speech corpus does not always provide enough speech cases for experimentation. Elements of a natural speech corpus can also suffer from common issues: transcription errors, unpredictable noise, lack of control over the speech model, and no way to adjust environmental noise. Several speech model complexity levels would allow identifying the places where an algorithm fails. A synthetic signal generated by a speech model could solve some of these issues.

Malfrère et al. [3] propose using synthetic speech for phone segment alignment. The authors mention that synthetic speech does not require corpora for training, yet it is possible to generate signals that are similar enough to natural speech. Other approaches, such as HMM, require a large amount of training data and supervised training. Malfrère noticed that phone alignment works when the original signal and the synthetic signals are sex dependent. Horak [4] used a similar algorithm. The results of the experiment showed that the segmentation error was smallest for vowels in comparison with other phoneme types, and that fricatives showed the highest error rate.

Sethy and Narayanan [5] investigated which signal features are best for aligning synthetic and natural speech signals. The following features were tested: Mel-frequency cepstrum coefficients and their deltas, line spectral frequencies, formants, energy and its delta, and the zero crossing rate. No feature or feature set was found to be the best for all phonemes. It was proposed to use two feature combinations depending on which phonemes are compared.

Most often, papers describe a speech model being used for phoneme alignment. Synthetic speech can be employed as a part of a speech algorithm.
A speech model can also be used for speech segmentation. This paper proposes an approach that uses synthetic speech as modeled signals to compare speech segmentation algorithms at various noise levels. The next section describes the method for generating the synthetic signals. After that, the speech model adequateness criteria are presented. How the speech model can be used for speech algorithm comparison is defined before the section on the segmentation result evaluation method. Evaluation results of two segmentation algorithms are given in separate sections. The conclusions summarize the results of this paper.

Speech Model for Synthesis

Natural speech is a complex process as it is highly dynamic. Speech generation relies on multiple mechanisms: linguistic, articulatory, acoustic, and perceptual [6]. Producing a synthetic speech signal is not a trivial task. Speech signal generation has several steps [7]: text analysis, transformation to computer pronunciation instructions using linguistic analysis, and speech waveform generation using recorded speech. The quality of synthetic speech cannot be measured in a straightforward way. A common approach is qualitative assessment of the synthesized speech. Generating synthetic speech that is the same as natural speech is a challenging task.

There are multiple speech engines that can be used to simplify speech signal generation: Mbrola [7], FreeTTS [8], Mary [9]. The Mbrola project supports 25 languages, which is very convenient for evaluating speech algorithms when multi-language support is important. Mbrola is also used by other researchers in their papers [4], [5], [10].

The Mbrola speech engine plays an important part in the speech synthesis as it generates waveforms. Text analysis and linguistic analysis should be done separately. This can be done with the transformation rules that are used in Letter-To-Sound engines like FreeTTS [8] and eSpeak [11]. This step is called Natural Language Processing [12].
Such processing uses 3 models: phonetisation, duration, and pitch. The phonetisation model uses lexical analysis to map graphemes to phonemes. This mapping converts a stream of orthographical symbols into symbols of the corresponding sequence of sounds [13]. The pitch model [8] determines synthesized speech parameters for pitch, tone, stress, and amplitude. This model is important for natural speech.

In natural speech, phonemes are pronounced differently each time. The speech rate of the Lithuanian language can fluctuate from 2.6 up to 5.7 syllables per second [14]. To imitate duration properties, the duration can be defined as a random variable distributed as $\mathcal{N}(\mu_{DUR}, \sigma_{DUR}^2)$. The mean phoneme duration ($\mu_{DUR}$) can be defined to be the same as in natural speech, and the phoneme duration variance ($\sigma_{DUR}^2$) should be chosen so that it does not exceed natural speech limits. Synthetic speech without the randomization property would not be very efficient from the research point of view.

Noise resistance is another important point when evaluating speech algorithm performance. Most often, stationary noise can be generated by computer, while for non-stationary noise real-life samples should be used. Gaussian stationary noise is the worst-case scenario, as it has a flat spectrum and filtering techniques are hard to apply. As synthetic speech signals contain no noise, it is easy to control the noisification level, the SNR (1), measured in dB:

$$SNR = 10 \log_{10} \left( \frac{\sigma_S^2}{\sigma_N^2} \right), \qquad (1)$$

where $\sigma_S^2$ is the speech variance and $\sigma_N^2$ is the noise variance.

For the signal to be noisified properly, the speech signal has to be scaled (2):

$$\bar{y}[t] = \frac{\sigma_N \sqrt{10^{SNR/10}}}{\sigma_S} \, y[t], \qquad (2)$$

where $\bar{y}[t]$ is sample $t$ of the speech signal scaled for noisification and $y[t]$ is sample $t$ of the speech signal.

Synthetic speech could make investigation easier, as it is possible to control the noise and the content and to generate an unlimited number of speech samples. The speech model signal has to be sufficient to represent natural speech properties.
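The duration randomization and the noisification in (1) and (2) can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: the function names, the 10 ms duration floor, the example standard deviation and the sine wave standing in for a speech signal are all assumptions made for illustration.

```python
import math
import random
import statistics


def sample_duration_ms(mean_ms, std_ms, rng, min_ms=10.0):
    """Draw a phoneme duration from N(mean_ms, std_ms**2).

    The floor min_ms is an assumption that keeps durations positive.
    """
    return max(min_ms, rng.gauss(mean_ms, std_ms))


def snr_db(speech_var, noise_var):
    """Equation (1): SNR in dB from the speech and noise variances."""
    return 10.0 * math.log10(speech_var / noise_var)


def scale_for_snr(speech, target_snr_db, noise_std):
    """Equation (2): scale the speech so that noise with standard
    deviation noise_std yields the target SNR."""
    sigma_s = statistics.pstdev(speech)
    gain = noise_std * math.sqrt(10.0 ** (target_snr_db / 10.0)) / sigma_s
    return [gain * y for y in speech]


def noisify(speech, target_snr_db, rng, noise_std=1.0):
    """Scale the speech, then add stationary Gaussian noise."""
    scaled = scale_for_snr(speech, target_snr_db, noise_std)
    return [y + rng.gauss(0.0, noise_std) for y in scaled]


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in "speech": a sine wave, used only to exercise the functions.
    speech = [math.sin(2.0 * math.pi * t / 40.0) for t in range(400)]
    for level in (30, 15, 10, 5, 0):
        noisy = noisify(speech, level, rng)
        print(level, "dB ->", len(noisy), "samples")
    print("duration:", sample_duration_ms(65.0, 10.0, rng), "ms")
```

Keeping the scaling step separate from the noise generation makes it possible to check the resulting SNR exactly with (1) before any random noise is added.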
Adequate Speech Model

In general, whether a model is adequate enough can be defined by situation-dependent criteria [15]: whether it is “right”, “useful”, “believable” and “worth the costs”. The cost of a synthetic speech corpus generated with a speech model is low compared with a natural speech corpus, as less human effort is required to collect enough data for investigations. A synthetic speech corpus has to match the major requirements of the speech situations investigated by the researcher. Extreme naturalness may not be wanted in regular research work, as such synthetic speech can require too many resources while its effect on the results can be insignificant in comparison with a more primitive speech model.

A speech model provides the ability to control the speech content and the generated sample size. Such a tool can help a researcher compare speech algorithms faster than with a specialized natural speech corpus.

How similar a synthetic speech signal is to natural speech can be defined through a set of signal attributes that determine the signal's suitability for communication [16]. In general, speech signal quality measurements [17] can be divided into two types: objective and subjective. An objective measurement cannot always unambiguously define the quality that is good enough for experimentation. Intelligibility cannot easily be expressed quantitatively. Methods that use intelligibility require a group of experts who evaluate how realistic the sound is.

In this paper the segmentation error was used as the quantitative similarity measurement. The first step is to record natural speech signals of a selected phrase. Next, synthetic signals should be generated using the same phrase. After that, the segmentation algorithms should be executed against both corpora and the segmentation errors calculated. The similarity of synthetic and natural speech can be expressed as the difference between the segmentation errors produced by the segmentation algorithms.
If the error values are close, then the synthetic signals are similar to natural speech from the speech segmentation perspective.

For speech segmentation algorithms, the speech model should match the VIP (Variable, Intelligible, Pronounceable) criteria:
Variable – synthesized speech segment properties must be different each time, within the limits of natural speech.
Intelligible – signals must be recognized as speech and words must be understood.
Pronounceable – phonemes of the specific language should be used.

In natural speech, the same word will in most cases be pronounced differently depending on context; without this, the speech model would be too static in comparison with natural speech. As would be expected, the speech model should generate signals that are understandable by humans. In general it is not acceptable to reuse phonemes of another language for experiments, as each language has its own specific pronunciation.

Speech Model for Segmentation Algorithms

A speech segmentation algorithm under development has to be tested with different kinds of speech signals and environments. It is important for researchers to control the environment during their experiments. This makes it easier to identify defective places, compare with other algorithms and optimize algorithm parameters. Generated signals can cover cases that are rare in natural language. Knowing where speech segments start and end eliminates the human errors of manual segmentation. Signals generated in the way presented above will be less complex than natural speech, but they are good enough to be understood by humans.

Fig. 1. Speech algorithm analysis using synthetic signals

The steps of speech model preparation with the segmentation algorithm method are shown in figure 1. First of all, a synthetic speech corpus should be prepared for the experiment. Researchers should identify which cases should be covered in their experiment: the set of phonemes and the noise. Example: compare two segmentation algorithms with the phrase "What is the time?". 50 signals should be generated for each noise level. Five different levels of white noise must be used: 30 dB, 15 dB, 10 dB, 5 dB and 0 dB (250 signals).

Next, the text information has to be transformed into instructions for a speech engine. A simple transformation can be done in two steps: the first step is to map one or more symbols of the text alphabet to the alphabet accepted by the speech engine, and the second step is to define the phoneme durations. The mapping from the regular alphabet to a computer-readable phonetic alphabet depends on the speech engine. Mbrola uses the SAMPA [18] alphabet, which defines how the transformation should be done. Example generated by eSpeak: word "what", phonemes [w-Q-t]; word "is", phonemes [I-z]; word "the", phonemes [D-@]; word "time", phonemes [taI-m].

As mentioned earlier, each phoneme has its own duration. Example generated by eSpeak: [w]=65 ms, [Q]=62 ms, [t]=101 ms, [I]=74 ms, [z]=65 ms, [D]=65 ms, [@]=134 ms, [aI]=109 ms, [m]=159 ms. The phoneme duration value is used as the mean of the random number.

The pitch model is important for natural speech, but it was found not to be essential for speech segmentation algorithms. Defining the pitch brings a big overhead, because many speech engine parameters have to be estimated, while the results were not affected significantly. For the experiment described in the next section, a synthetic speech corpus was created that does not require prosody information, and the default values of the speech engine were used. This can be improved in the next experiment iterations.

Once the parameters are prepared for speech generation, it is possible to generate an unlimited number of speech signals that are different from the speech segmentation algorithm perspective. Each generated signal should have a transcription file and an audio file.

The speech segmentation algorithm should label each automatic segment (auto-segment). The labeling method requires a training data set. The same speech signal generation method is used for it as well. The training data set should not be used for testing.

When the data set is ready, the speech segmentation algorithm processes the audio files and generates a sequence of auto-segments. Each auto-segment is matched to the reference signals and normally a label is assigned. The next step is to compare the auto-segments with the original segments. Each segment is matched with its reference in the original transcription file, and the correctness of boundaries and labels is evaluated.

Speech Result Evaluation

Figure 2 describes the errors that can appear during segmentation. The original transcription (MANL) is compared with the auto-segmentation (AUTO) results. The first task is to find whether segment boundaries are detected correctly. Several error types may be detected (figure 2):
EXS – noise was automatically detected as a segment.
SFT – an automatic segment boundary was shifted.
ITR – an additional boundary was inserted in a segment.
JON – boundary between …

Fig. 2. Automatic speech segment comparison with reference

The auto-segment boundary detection error rate is calculated using (3):

$$ERR_B = \frac{EXS + SFT + ITR + JON}{N_F} \cdot 100, \qquad (3)$$

where $ERR_B$ is the boundary detection error rate and $N_F$ is the number of segments found during the experiment.

Segments with incorrect boundaries are not used for pattern matching to define the label of a segment. A recognition result can be rejected if the algorithm finds several similar references. A recognition error occurs when the algorithm says that a segment belongs to class A, but it actually belongs to class E (figure 2). The recognition error rate is calculated using (4):

$$ERR_R = \frac{AE}{N_B} \cdot 100, \qquad (4)$$

where $ERR_R$ is the recognition error rate, $AE$ counts such misrecognized segments and $N_B$ is the number of correct segments from the boundary detection step.

Fig. 3. MEaRBS algorithm boundary detection cases

Experiment Data

Synthetic speech and natural speech were tested in order to see whether generated signals can be used for speech segmentation algorithm comparison. For the experiment, speech corpora were created as described above.
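As a minimal sketch of how the error rates (3) and (4) could be computed once the error counts for a corpus are available, consider the Python functions below. The function names and the example counts are hypothetical and are not taken from the experiment.

```python
def boundary_error_rate(exs, sft, itr, jon, n_found):
    """Equation (3): boundary detection error rate in percent.

    exs, sft, itr, jon are the counts of the four boundary error types;
    n_found is the number of segments found during the experiment.
    """
    if n_found <= 0:
        raise ValueError("n_found must be positive")
    return (exs + sft + itr + jon) / n_found * 100.0


def recognition_error_rate(misrecognized, n_correct_boundaries):
    """Equation (4): recognition error rate in percent.

    misrecognized is the number of segments assigned a wrong class;
    n_correct_boundaries is the number of segments whose boundaries
    were detected correctly.
    """
    if n_correct_boundaries <= 0:
        raise ValueError("n_correct_boundaries must be positive")
    return misrecognized / n_correct_boundaries * 100.0


if __name__ == "__main__":
    # Hypothetical counts for one noise level, for illustration only.
    err_b = boundary_error_rate(exs=1, sft=2, itr=1, jon=0, n_found=40)
    err_r = recognition_error_rate(misrecognized=3, n_correct_boundaries=36)
    print("ERR_B = %.1f %%, ERR_R = %.1f %%" % (err_b, err_r))
```

Only segments with correct boundaries enter the denominator of the recognition error rate, which mirrors the rule that segments with incorrect boundaries are not passed on to pattern matching.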
Two segmentation algorithms were executed against the same signal set: Multi Feature Extremum and Rule Based Segmentation (MEaRBS) [19] and Threshold [20]. It will be shown that both algorithms showed similar results with synthetic and natural speech, while their segmentation results still allow saying which one performed better.

The synthetic speech corpora creation method is described above. A speaker-dependent natural speech corpus model was chosen for the experiment. A speaker-independent one would require more effort to collect data and train the recognition engine. For natural speech, 51 signals of the same phrase were recorded by a male speaker. One signal was used to train the speech recognition engine. All the natural speech signals were segmented semi-automatically. Segmentation was done in two steps: boundary detection and segment label definition. The expert segmentation results for natural speech are used for comparison with the automated segmentation results.

Experiment Results: MEaRBS Algorithm

The MEaRBS algorithm segmented 300 files (250 synthetic and 50 natural speech signals). Figure 3 shows the results graphically. The vertical axis represents the signal types: natural speech, the total of all synthetic signals, and synthetic speech with different noise levels: 30 dB, 15 dB, 10 dB, 5 dB and 0 dB. The horizontal axis represents the percentage of cases. The algorithm showed the worst results for synthetic speech with the 0 dB noise level. Comparing all percentage values, natural speech showed a result similar to that of 30 dB synthetic speech.

The error rates are presented in figure 4. Speech segment boundary detection for natural speech and for synthetic speech with 5 dB noise showed similar results, but natural speech had higher recognition errors. The recognition result for the synthetic signal with 10 dB noise is similar to that for natural speech.

Fig. 4. MEaRBS algorithm segmentation and label class recognition errors

Experiment Results: Threshold Algorithm

Another experiment was done with the static threshold segmentation algorithm [20]. The same experiment data and experiment system parameters were kept. The results are shown in figure 5. Natural speech segmentation results are similar to those for synthetic speech at 5 dB and 10 dB, but the segment recognition results are better. This happened because the DTW algorithm is very sensitive to finding the beginning and end of a segment, and this algorithm detected boundaries in a different way than the MEaRBS algorithm.

Fig. 5. Threshold algorithm segmentation and label class recognition errors

Conclusions

A method for comparing speech segmentation algorithms using synthetic speech corpora was presented. A modified version of speech model usage for the segmentation evaluation method was reviewed and proposed. The experiment showed that segmentation results for natural speech and synthetic speech are similar: the confidence intervals of the speech segmentation error for the 15 dB synthetic signal and for natural speech overlap. Synthetic speech can therefore be used for speech research, for comparing some algorithms and for investigating certain cases.

References

1. P. Cosi, D. Falavigna, and M. Omologo, A preliminary statistical evaluation of manual and automatic segmentation discrepancies // in Second European Conference on Speech Communication and Technology, 1991.
2. B. M. Wesenick and A. Kipp, Estimating the Quality of Phonetic Transcriptions and Segmentations of Speech Signals // in Proceedings of the Fourth International Conference on Spoken Language (ICSLP 96), 1996, P. 129–132.
3. F. Malfrere, O. Deroo, T. Dutoit, and C. Ris, Phonetic alignment: speech synthesis-based vs. Viterbi-based // Speech Communication, vol. 40, P. 503–515, 2003.
4. P. Horak, Automatic Speech Segmentation Based on Alignment with a Text-to-Speech System // Improvements in Speech Synthesis, P. 328–338, 2002.
5. A. Sethy and S.S.
Narayanan, Refined speech segmentation for concatenative speech synthesis // in Seventh International Conference on Spoken Language Processing, 2002.
6. L. Deng, Dynamic Speech Models: Theory, Algorithms, and Applications. Morgan and Claypool, 2006.
7. T. Dutoit, High-quality text-to-speech synthesis: An overview // Journal of Electrical and Electronics Engineering Australia, vol. 17, P. 25–36, 1997.
8. W. Walker, P. Lamere, and P. Kwok, FreeTTS: a performance case study // Sun Microsystems, Inc., 2002.
9. M. Schroder and J. Trouvain, The German text-to-speech synthesis system MARY: A tool for research, development and teaching // International Journal of Speech Technology, vol. 6, P. 365–377, 2003.
10. F. Malfrere and T. Dutoit, High-quality speech synthesis for phonetic speech segmentation // in Fifth European Conference on Speech Communication and Technology, 1997.
11. H. Yang, C. Oehlke, and C. Meinel, German Speech Recognition: A Solution for the Analysis and Processing of Lecture Recordings // in 10th IEEE/ACIS International Conference on Computer and Information Science, Sanya, 2011, P. 201–206.
12. A. Chabchoub and others, An automatic Mbrola tool for high quality Arabic speech synthesis // International Journal of Computer Applications, vol. 36, P. 1–5, 2011.
13. F. Yvon and others, Objective evaluation of grapheme to phoneme conversion for text-to-speech synthesis in French // Computer Speech and Language, vol. 12, P. 393–410, 1998.
14. A. Kazlauskiene and K. Velickaite, Pastabos del lietuviu kalbejimo tempo // Acta Linguistica Lituanica, vol. 48, P. 49–58, 2003.
15. H.B. Weil and F.L. Street, What Is An Adequate Model? // System Dynamics Conference, P. 281–321, 1983.
16. J.H. Eggen, On the quality of synthetic speech evaluation and improvements // Technische Universiteit Eindhoven, 1992.
17. A. Anskaitis, Koduoto balso kokybės tyrimas // Vilnius Gediminas Technical University, 2009.
18. D. Gibbon, R. Moore, and R.
Winski, Handbook of standards and resources for spoken language systems, 1997.
19. M. Greibus and L. Telksnys, Rule Based Speech Signal Segmentation // Journal of Telecommunications and Information Technology (JTIT), vol. 1, P. 37–44, 2011.
20. S.V. Gerven and F. Xie, A comparative study of speech detection methods // in Fifth European Conference on Speech Communication and Technology, 1997, P. 1095–1098.

Date of paper submission: 2012 03 12

M. Greibus, L. Telksnys. Speech Segmentation Analysis Using Synthetic Signals // Electronics and Electrical Engineering. – Kaunas: Technologija, 20XX. – No. X(XXX). – P. XX–XX.
A certain number of speech cases is needed when real-life problems related to speech signals are solved. A natural speech corpus may suffer from a limited number of sound cases for an experiment. On the other hand, an experimenter can create a huge speech corpus of synthetic signals that reflect the speech cases under investigation and are similar enough to natural speech. This paper presents a comparative use of synthetic and natural speech corpora for speech segmentation algorithms. Adequateness criteria for synthetic and natural corpora were defined for speech segmentation. The experiment results showed that synthetic signals can be used for speech algorithm research. Ill. X, bibl. X, tabl. X (in English; abstracts in English and Lithuanian).

M. Greibus, L. Telksnys. Speech Segmentation Analysis Using Synthetic Signals (Šnekos segmentavimo analizė naudojantis sintezuotus signalus) // Elektronika ir elektrotechnika. – Kaunas: Technologija, 20XX. – No. X(XXX). – P. XX–XX.
Solving practical problems related to speech signals requires a certain number of specific situations. Natural speech corpora do not always contain enough such cases. Synthesized corpora can be used instead of a natural corpus. This paper presents the use of a synthetic corpus for solving the speech segmentation task. Synthetic corpora that are sufficiently close to natural ones can be used in research. This paper defines a similarity measure of a synthetic corpus for speech segmentation algorithms. It is shown experimentally that a synthetic corpus can be used to investigate speech segmentation tasks. Ill. X, bibl. X, tabl. X (in English; abstracts in English and Lithuanian).