ELECTRONICS AND ELECTRICAL ENGINEERING ISSN 1392 – 1215 20XX. No. X(XX) ELEKTRONIKA IR ELEKTROTECHNIKA SIGNAL TECHNOLOGY T 121 SIGNALŲ TECHNOLOGIJA Speech Segmentation Analysis Using Synthetic Signals Mindaugas Greibus, Laimutis Telksnys Vilnius University Institute of Mathematics and Informatics, Akademijos str., 4, LT-08663 Vilnius, Lithuania, phone: +37068789464, e-mail: mindaugas.greibus@exigenservices.com, telksnys@ktl.mii.lt Introduction Speech segmentation problem can defined differently depending on application that the segments will be used. Speech segments can be divided into levels: phone, syllable, word and phrase. Automated speech recognition algorithm can solve segmentation tasks indirectly. Blind segmentation is used for task when it is needed extract speech information e.g. speech corpora construction, initial segment extraction to reduce amount of data for ASR. Automatic speech segmentation is attractive for different type of solutions. Cosi et al. [1] defines that automated segmentation advantage is that results are predictable: mistakes is done in a coherent way. In the experiment Wesenick et al. [2] they compared manual and automatic phoneme segmentation performance. Results showed that in general errors are done by experts and algorithm for certain phonemes transitions: nasal to nasal and vowel to lateral. Manual and automatic segmentation showed same tendencies for the best segmentation cases also: algorithm and experts were performed well with phoneme transitions: voiceless plosive to nasal, voiceless plosive - vowel. A natural speech corpus not always provides enough speech cases for experimentation. Also elements in natural speech corpus can suffer common issues: transcription errors, unpredictable noises, lack of possibility to control speech model, lack of environment noises adjustment. Several speech model complexity levels would allow indentify erroneous algorithm places. Synthetic signal generated by speech model could solve some of the issues. Some authors [3][4] in their article propose use synthetic speech for phone segment alignment. Author mentioned that synthetic speech does not requires having corpora for training, but it is possible generate signals that will be similar enough to natural speech. Other approaches like HMM requires large amount of training data and supervised training. Malfrère noticed that phone alignment is working when original signal and synthetic signals are sex dependent. Sethy and Narayanan [5] was investigating what signal features should be the best for aligning synthetic and natural speech signals. Features were tested: Melfrequency cepstrum coefficients and their deltas, line spectral frequencies, formants, energy and its delta and the zero crossing rate. It was found out that there was no feature or feature set that would be the best for all the phonemes. It was proposed to use two feature combinations depending of what phonemes are compared. The most often the papers describes that speech model is used for phoneme alignment. Synthetic speech can be employed as a part of a speech algorithm. It can be possible use speech model for the speech segmentation also. Adequate Speech Model for Synthesis Natural speech is a complex process as it is highly dynamic. Speech generation is relaying on multiple mechanisms: linguistic, articulatory, acoustic, and perceptual[6]. Produce synthetic speech signal is not a trivial task. Speech signal generation has several steps [7]: text analysis, transforming to computer pronunciation instruction using linguistic analysis and speech waveform generation using recorded speech. Quality of synthetic speech cannot be measured straight forward. Common approach it is qualitative assessment of synthesized speech. It is challenging task generate synthetic speech same as natural speech. There are multiple speech engines that can be used to simplify speech signals generation: Mbrola[7], FreeTTS [8] and some others. Mbrola project supports 25 languages. This is very convenient to evaluate speech algorithms if multi language is important. Also Mbrola it is used by other researches in their papers[3][5] Mbrola speech engine plays important part in the speech synthesis as it generates waveforms. Text analysis and linguistic analysis should be done separately. It can be done by transformation rules that are used in transformation Letter-To-Sound engines like FreeTTS [8], eSpeak [9]. Such processing use 3 models: phonetisation, duration, pitch. Phonetisation model uses lexical analysis to map grapheme to phoneme. This mapping converts one stream of orthographical symbols into symbols of the corresponding sequence of sound. Pitch model [8] determines synthesized speech parameters for pitch, tone, stress, and amplitude. This model is important for natural speech. In natural speech phonemes are spelled differently each time. Speech rate of Lithuanian language can fluctuate from 2.6 up to 5.7 syllables per second [10]. In order to imitate duration properties it is possible define duration as random variable∼ 𝒩(𝜇𝐷𝑈𝑅 , 𝜎𝐷𝑈𝑅 2 ). Synthetic speech without randomization property would not be very efficient from the research point of view. Noise resistance it is another important point when evaluating speech algorithm performance. The most often the stationary noise can be generating by computer and for non-stationary should be used real life samples. As synthetic speech signals has no noise it is easy to control noisification level SNR (1) measured in dB. 𝜎𝑆2 (1) 𝑆𝑁𝑅 = 10 log ( 2 ) 𝜎𝑁 , where 𝜎𝑆2 –speech variance, 𝜎𝑁2 – noise variance In order that a signal would be noisified properly it is needed to scale speech signal (2) 𝑦̅[𝑡] = 𝑆𝑁𝑅 𝜎𝑁 √10 10 𝑦[𝑡] (2) 𝜎𝑠 , where 𝑦̅[𝑡] – t sample of noisified speech signal; 𝑦[𝑡] – t sample of speech signal. Synthetic speech could make investigation easier as it is possible control the noise, content and generate unlimited speech samples. Speech model signal has to be sufficient to represent natural speech properties. In generate if model is adequate enough it can be defined by situation-depended criteria [11]: if it is “right”, “useful” and “believable”. Synthetic speech corpus has to match major requirements for the speech situations that are investigated by researcher. Extreme naturalness may require too much effort and the effect of the results can be insignificant in comparison with more primitive speech model. Speech model provides ability to control speech content, generated sample size. Such tool can help researcher to compare speech algorithms faster than with a specialized natural speech corpus. The definition, how synthetic speech signal is similar to natural speech, can be defined as set of signal attributes that defines signal suitability for communication [12]. In general speech signal quality measurements [13] can be divided in two types: objective and subjective. The objective measurement cannot always unambiguously define quality that is good enough for experimentation. The intelligibility measurement cannot express quantitatively easily. The methods, that using intelligibility, requires expert group and they evaluate how realistic sound is. In this paper segmentation error was used as the quantitative similarity measurement. The first step it is record natural speech signals with selected phrase. Next synthetic signals should be generated using same phrase. After segmentation algorithms should be executed against both corpora and segmentation errors calculated. Similarity of synthetic and natural speech can be expressed as segmentation error difference produced by segmentation algorithms. If errors values are close, then synthetic signal are similar to natural speech from speech segmentation perspective. For speech segmentation algorithms requires that the speech model should match VIP (Variable, Intelligible, Pronounceable) criteria. Variable – synthesized speech segment properties must be different each time within limits of natural speech. Intelligible – signals must be recognized as speech and words must be understood. Pronounceable – phonemes of specific language should be used. Natural behavior of speech in the most cases same word will be pronounced differently according to context, without this speech model would be too static in comparison with the natural speech. As it would be expected speech model should generate signals that it is understandable by human. In general it is not acceptable reuse other language phonemes for experiments, as each language has its own specific pronunciation. Speech Model for Segmentation Algorithms Speech segmentation algorithm creation has to be tested in different cases of speech signal and environments. For researchers are important to control environment during their experiments. This allows indemnify defective places easier, compare with other algorithms and optimize algorithm parameters. Generated signals can cover cases that are rare in natural language. If it is known in advance segment boundaries, then segmentation human errors are eliminated. In the way as presented above the generated signals will be less complex comparing with natural speech, but the signals are good enough to understand by human. Fig. 1. Speech algorithm analysis using synthetic signals Speech model preparation with segmentation algorithm method steps are defined figure 1. Synthetic speech corpus should be prepared for an experiment first of all. Researchers should identify what cases should be covered in their experiment: the set of phoneme and noise. Example: Compare two segmentation algorithms with phrase: "What is the time?". The 50 signals should be generated for each noise level. It must be used 5 different levels of white noise: 30dB, 15dB, 10dB, 5dB, 0dB (250 signals). The next it is needed transform text information to instructions for a speech engine. Simple transformation can be done in two steps: first step it is map one or more symbols from text alphabet to speech engine acceptable alphabet and the second step it is define phoneme duration. Mapping from regular alphabet to computer readable phonetic alphabet depends on speech engine. Mbrola uses SAMPA [14] alphabet that defines how transformation should be done. When speech model parameters are prepared for speech generation, then it is possible to generate unlimited speech signals, which are different from speech segmentation algorithm perspective. Each generated signal should have transcription and audio files. A speech segmentation algorithm process audio files and generates sequence of auto-segments. Each auto-segment is matched to the references signals and normally the label is assigned. Next step it is compare auto-segments and original segments. Each segment is matched with reference in original transcription file and correctness of boundaries and labels are evaluated. Speech Result Evaluation and Experiment Data Figure 2 describes what errors can appear during segmentation. It is compared original transcription (MANL) and auto-segmentation (AUTO) results. First task it is find if segment boundaries are detected correctly. Few error types may be detected (figure 2): EXS –automatically noise was detected as segment; SFT – automatic segment boundary was shifted; ITR – additional boundary was inserted in a segment; JON – boundary between Segmentation (MEaRBS) and Threshold. It will be shown that both algorithms were showed similar results with synthetic and natural speech, but their segmentation results allow saying which performed better. Synthetic speech corpora creation method is described above. It was chosen speaker dependent natural speech corpus model for the experiment. Speaker independent should require more effort to collect data and train recognition engine. For natural speech it was recorded 51 signals of the same phrase by male speaker. One signal was used to train speech recognition engine All the natural speech signals where segmented semi-automatically way. Segmentation was done in two steps boundaries detection and segment label definition. The expert segmentation results of natural speech are used to compare with automated segmentation results. Experiment Results MEaRBS algorithm segmented 300 files (250 synthetic and 50 natural speech signals). In figure 3 results are shown graphically: vertically it is representing signal types: natural speech, total of all synthetic signals and synthetic speech with different noise levels: 30 dB, 15 dB, 10 dB, 5dB, 0dB. Horizontally it is representing case percent value. It is possible to see that the algorithm showed the worst results for synthetic speech with 0 dB noise level. Comparing all percent values natural speech showed similar result as 30dB synthetic speech. Fig. 2. Automatic speech segment comparison with referece Auto-segment boundary detection error rate is calculated using (3). 𝐸𝑋𝑆 + 𝑆𝐹𝑇 + 𝐼𝑇𝑅 + 𝐽𝑂𝑁 𝐸𝑅𝑅𝐵 = 100 (3) 𝑁𝐹 𝐸𝑅𝑅𝐵 – boundary detection error rate, 𝑁𝐹 – number of segments that are found during experiment. Segment with incorrect boundaries errors are not used for pattern matching to define a label of a segment. Recognition result can be rejected if is algorithm found see similar references. Recognition error rate (4) is when algorithm says that it was segment class A, but it was segment class E (figure 2). It is calculated: 𝐴𝐸 (4) 𝐸𝑅𝑅𝑅 = 100 𝑁𝐵 𝐸𝑅𝑅𝑅 – recognition error rate, 𝑁𝐵 – number of correct segments from boundary detection step. Synthetic speech and natural speech was tested in order to see if generated signals can be use speech segmentation algorithm comparison. For the experiment was created speech corpora as described above. Against same signal set were executed two segmentation algorithms [15]: Multi Feature Extremum and Rule Based Fig. 3 MEaRBS algorithm boundary detection cases Error rates were represented in figure 4. It is possible to see that speech segment boundaries for natural speech and synthetic speech with 5dB noise showed similar result, but natural speech had higher errors for recognition. Recognition result of synthetic signal with 10db noise is similar with natural speech. Fig. 4 MEaRBS algorithm segmentation and label class recognition errors Another experiment was done with static threshold segmentation algorithm. It was kept same experiment data and experiment system parameters. The results are shown in figure 5. Natural speech segmentation results are similar to synthetic speech 5dB and 10dB, but segment recognition results are better. This happened, because DTW algorithm is very sensitive to finding beginning end of a segment and this algorithm detected boundaries in different way that MEaRBS algorithm. Fig. 5 Threshold algorithm segmentation and label class recognition errors Conclusions The method of speech segmentation algorithm comparison using synthetic speech corpora was presented. It was reviewed and proposed modified version of speech model usage for segmentation evaluation method. It was shown by experiment that results of segmentation results using natural speech and synthetic speech is similar: speech segmentation error is in the confidence intervals of 15 dB synthetic signal and natural speech overlaps. It can be noted that synthetic speech can be used for speech research and some algorithms comparison and certain cases investigation. References 1. P. Cosi, D. Falavigna, and M. Omologo, A preliminary statistical evaluation of manual and automatic segmentation discrepancies// in Second European Conference on Speech Communication and Technology, 1991. 2. B. M. Wesenick and A. Kipp, Estimating The Quality Of Phonetic Transcriptions And Segmentations Of Speech Signals// in Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on, 1996, P. 129--132. 3. P. Horak, Automatic Speech Segmentation Based on Alignment with a Text-to-Speech System// Improvements in Speech Synthesis. – , P. 328--338, 2002. 4. F. Malfrere, O. Deroo, T. Dutoit, and C. Ris, Phonetic alignment: speech synthesis-based vs. viterbibased// Speech Communication. – , vol. 40, P. 503--515, 2003. 5. A. Sethy and S.S. Narayanan, Refined speech segmentation for concatenative speech synthesis// in Seventh International Conference on Spoken Language Processing, 2002. 6. L. Deng, Dynamic Speech Models: Theory, Algorithms, and Applications.: Morgan and Claypool, 2006. 7. T. Dutoit, High-quality text-to-speech synthesis: An overview// JOURNAL OF ELECTRICAL AND ELECTRONICS ENGINEERING AUSTRALIA. – , vol. 17, P. 25--36, 1997. 8. W. Walker, P. Lamere, and P. Kwok, FreeTTS: a performance case study// Sun Microsystems, Inc., 2002. 9. H. Yang, C. Oehlke, and C. Meinel, German Speech Recognition: A Solution for the Analysis and Processing of Lecture Recordings// in 10th IEEE/ACIS International Conference on Computer and Information Science, Sanya, 2011, P. 201 - 206. 10. A. Kazlauskiene and K. Velickaite, Pastabos del lietuviu kalbejimo tempo// Acta Linguistica Lituanica. – , vol. 48, P. 49–58, 2003. 11. H.B. Weil and F.L. Street, What Is An Adequate Model?// System Dynamics Conference, P. 281-321, 1983. 12. J.H. Eggen, On the quality of synthetic speech evaluation and improvements// Technische Universiteit Eindhoven, 1992. 13. A. Anskaitis, Koduoto balso kokybės tyrimas// Vilnius Gediminas Technical University, 2009. 14. D. Gibbon, R. Moore, and R. Winski, Handbook of standards and resources for spoken language systems., 1997. 15. M. Greibus and L. Telksnys, Rule Based Speech Signal Segmentation// Journal of Telecommunications and Information Technology (JTIT)2011, vol. 1, P. 37--44, 2011. Date of paper submission: 2012 03 12 M. Greibus, L. Telksnys. Speech Segmentation Analysis Using Synthetic Signals // Electronics and Electrical Engineering. – Kaunas: Technologija, 20XX. – No. X(XXX). – P. XX–XX. It is needed a certain nuber of speech cases when real life problems related with speech signals are solved. A narual speech corpora may suffer from limited sound cases for an experiment. In other hand an experimenter can create huge speech corpora with synthetic signals that reflect investigative speech cases that it is similar enough to natural speech. In this paper is presented comparative usage of synthetic and natural speech corpora for speech segmentation algorithms. It was define adequateness synthetic and natural corpora criteria for speech segmentation. The experiments results showed that synthetic signals can be used for speech algorithm research. Ill. X, bibl. X, tabl. X (in English; abstracts in English and Lithuanian). M. Greibus, L. Telksnys. Šnekos segmentavimo analizė naudojantis sintezuotus signalus// Elektronika ir elektrotechnika. – Kaunas: Technologija, 20XX. – Nr. X(XXX). – P. XX–XX. Sprendžiant praktinius uždavinius susijusiais su šnekos signalais, reikia turėti tam tikrą skaičių specifinių situacijų. Neretai natūralios šnekos garsynuose užtenkamai atvejų ne visuomet galima surasti. Vietoje tokio garsyno galima naudoti sintetintus. Šiame darbe yra pateiktas sintetinio garsyno panaudojimas šnekos segmentavimo uždaviniui spręsti. Sintetiniai garsynai, kurie yra pakankamai artimi natūraliems, gal būti naudojami tyrimuose. Šiame darbe yra apibrėžtas sintetinio garsyno panašumo matas šnekos segmentavimo algoritmams. Eksperimentiškai yra parodyta, kad sintetinis garsynas gali būti naudojimas šnekos segmentavimo uždaviniams tirti. Il. X, bibl. X, lent. X (anglų kalba; santraukos anglų ir lietuvių k.).