IMPLEMENTATION OF SIMULATED ANNEALING IN UNIT SELECTION FOR MALAY TEXT-TO-SPEECH SYSTEM

LIM YEE CHEA

A dissertation submitted in fulfillment of the requirements for the award of the degree of Master of Science (Mathematics)

Faculty of Science
Universiti Teknologi Malaysia

NOVEMBER 2009

DEDICATION

Dedicated to Jesus Christ, my personal Lord and Savior, my pastor, church members, my beloved mum, dad, brother and sister.

ACKNOWLEDGEMENTS

"Let us then with confidence draw near to the throne of grace, that we may receive mercy and find grace to help in time of need."

First and foremost, I want to thank Jesus for His grace and mercy throughout this project. It is by His hand and wisdom that I was guided to finish my work. I would like to extend my appreciation to my honorable supervisor, Dr. Zaitul Marlizawati Zainuddin, and my co-supervisor, Dr. Tan Tian Swee, for their academic guidance, suggestions, support and encouragement shown during the course of my study. The patience, tolerance, diligence and dedication shown to me have given me great encouragement and a good example to follow.

Finally, I would love to convey my gratitude to my beloved family members and church members for their love and care shown to me throughout the study. They have given me so much assistance, comfort and prayer support, both financially and spiritually, which words cannot express and which will forever be remembered in my heart. I especially want to thank Mohd Redzuan bin Jamaludin for his willingness and guidance in using Matlab.

ABSTRACT

The unit selection method has become the predominant approach in speech synthesis. The quality of unit selection based concatenative speech synthesis is primarily governed by how well two successive units can be joined together. Therefore, the main purpose of unit selection is to minimize the audible discontinuities.
The process of unit selection is based on phonetic context and Simulated Annealing, which selects units from a large database by minimizing a criterion that is often called the cost. This dissertation presents a variable-length unit selection Malay text-to-speech system that is capable of providing more natural and accurate unit selection for synthesized speech. Unit selection methods have been implemented to allow a speech unit to be selected not only as a phoneme, diphone or triphone, but also as a string of phonemes that can be matched directly to the database. Mel Frequency Cepstral Coefficients (MFCC) have been introduced as spectral parameters in the unit selection based speech synthesis. A distance measurement is needed to measure the difference between two vectors of this speech feature. The spectral distance used is the Euclidean distance.

ABSTRAK

Kaedah pemilihan unit telah menjadi cara utama dalam sintesis pertuturan. Kualiti pemilihan unit dalam penyambungan perkataan adalah berpandukan kepada betapa baiknya kedua-dua unit dapat disambung bersama. Oleh itu, matlamat utama pemilihan unit adalah untuk mengurangkan ketidaksinambungan yang boleh didengar. Proses pemilihan unit adalah bergantung pada konteks fonetik dan Simulated Annealing yang memilih unit daripada pangkalan data dengan meminimumkan satu kriteria, yang selalunya dipanggil kos. Disertasi ini melaksanakan satu pemilihan unit berlainan panjang yang mampu memberikan pemilihan unit yang lebih tepat dan semula jadi untuk pertuturan sintesis. Untuk mengadakan pemilihan unit pertuturan yang bukan hanya terhad kepada fonem, dua fonem atau tiga fonem tetapi juga satu rangkaian fonem yang boleh terus dipadankan kepada pangkalan data, kaedah pemilihan unit telah dilaksanakan. Mel Frequency Cepstral Coefficients (MFCC) sebagai parameter spektrum telah diperkenalkan dalam pemilihan unit pertuturan sintesis. Pengiraan jarak diperlukan untuk mengira jarak antara dua vektor ini.
Jarak spektrum yang digunakan ialah Jarak Euclidean.

TABLE OF CONTENTS

CHAPTER  TITLE  PAGE

TITLE PAGE  i
DECLARATION PAGE  ii
DEDICATION  iii
ACKNOWLEDGEMENTS  iv
ABSTRACT  v
ABSTRAK  vi
TABLE OF CONTENTS  vii
LIST OF TABLES  xi
LIST OF FIGURES  xiii
LIST OF SYMBOLS  xvi
LIST OF APPENDICES  xviii

1  INTRODUCTION  1
   1.0  Introduction  1
   1.1  Background of the Problem  2
   1.2  Problem Statement  3
   1.3  Objective of the Study  3
   1.4  Scopes of the Study  3
   1.5  Significance of the Study  4
   1.6  Research Methodology  4
   1.7  Dissertation Layout  5

2  LITERATURE REVIEW  6
   2.1  Speech synthesis  6
   2.1.1  Concatenative Speech Synthesis  7
   2.2  Unit Selection  9
   2.2.1  Non-Uniformed or Variable Length Unit Selection  11
   2.2.2  Corpus-based Unit Selection  12
   2.3  Cost function for unit selection  14
   2.3.1  The Acoustic Parameters  16
   2.3.2  Linguistic Features  16
   2.3.3  Local cost  17
   2.3.3.1  Sub-cost on prosody  19
   2.3.3.2  Sub-cost on F0 discontinuity  20
   2.3.3.3  Sub-cost on phonetic environment  20
   2.3.3.4  Sub-cost on spectral discontinuity  21
   2.3.3.5  Sub-cost on phonetic appropriateness  22
   2.3.3.6  Other sub-costs  23
   2.3.3.7  Integrated cost  23
   2.4  Cost weighting  24
   2.5  Target cost  25
   2.6  Concatenation cost  26
   2.7  Spectral Distances  29
   2.8  Feature Extraction  30
   2.8.1  MFCC  30
   2.9  Distance Measures  32
   2.9.1  Simple Distance Measures  33
   2.9.1.1  Absolute Distance  33
   2.9.1.2  Euclidean Distance  34
   2.9.2  Statistically Motivated Distance Measures  34
   2.9.2.1  Mahalanobis Distance  34
   2.9.2.2  Kullback-Leibler (KL) distance  35
   2.10  Heuristic Method  36
   2.10.1  Simulated Annealing  37
   2.10.2  Approaches to improve SA algorithm  39
   2.10.3  Polynomial approximation  40
   2.10.4  Annealing Schedule  41
   2.10.4.1  Theoretically optimum cooling schedule  41
   2.10.4.2  Geometric cooling schedule  42
   2.10.4.3  Cooling schedule of Van Laarhoven et al.  42
   2.10.4.4  Cooling schedule of Otten et al.  43
   2.10.4.5  Cooling schedule of Huang et al.  43
   2.10.4.6  Adaptive cooling schedules  44
   2.10.4.7  A new adaptive cooling schedule  44
   2.11  Parallel SA  46
   2.12  Segmented Simulated Annealing  47

3  PROPOSED SYSTEM AND IMPLEMENTATION  49
   3.0  Introduction  49
   3.1  System Design Flow  50
   3.2  Malay Phonetics and Phone Sets  51
   3.3  Malay Phoneme  51
   3.3.1  Malay Vowels  51
   3.3.2  Malay Consonant  51
   3.4  Phoneme Units Database  52
   3.5  Feature Extraction  55
   3.6  Phonetic context  58
   3.7  Unit Selection  59
   3.8  Concatenation  60

4  SIMULATED ANNEALING  63
   4.0  Introduction  63
   4.1  Procedure of Simulated Annealing  65
   4.2  Initial Solution  67
   4.3  The cooling schedule  67
   4.3.1  Markov chain  70
   4.4  Neighbourhood Generation Mechanism  70
   4.5  Metropolis's criterion  80
   4.6  Stopping criteria  82
   4.7  Unit Selection  82
   4.7.1  Phonetic context  82
   4.7.2  Concatenation Cost  88
   4.7.2.1  Concatenation cost for Move 1  89
   4.7.2.2  Concatenation cost for Move 2  90
   4.7.2.3  Concatenation cost for Move 3  90
   4.7.2.4  Concatenation cost for Move 4  91
   4.8  Concatenation  100

5  TESTING, ANALYSIS AND RESULT  107
   5.1  Experiment  107
   5.2  Test Materials  107
   5.3  Test Conditions  107
   5.4  Test Procedure  108
   5.5  Profiles of Listeners  109
   5.5.1  Percentages of Listeners by Gender  110
   5.5.2  Percentage of Listeners by Race  111
   5.5.3  Percentage of Listeners by State of Origin  112
   5.6  Result and Analysis  113
   5.6.1  Word Level Testing  113
   5.6.2  Mean Opinion Score  114

6  CONCLUSION AND RECOMMENDATION  117
   6.1  Conclusion  117
   6.2  Suggestion for Future Work  120

REFERENCES  121
APPENDICES A-E  131-163

LIST OF TABLES

TABLE NO.  TITLE  PAGE

2.1  Sub-cost functions  17
3.1  Total units after extracting the phoneme units from the carrier sentences  54
4.1  Maximum number of iterations for Markov chain length 1 and 2 to reach final temperature greater than 0.1  70
4.2  The information of the 10 words before filtering using phonetic context  86
4.3  The information of the 10 words after filtering using partially matched phonetic context (left phonetic context)  87
4.4  The information of the 10 words after filtering using fully matched phonetic context (left and right phonetic context)  88
4.5  Information of concatenation cost (Move 1) with temperature reduction rate, α = 0.90  89
4.6  Information of concatenation cost (Move 2) with temperature reduction rate, α = 0.90  90
4.7  Information of concatenation cost (Move 3) with temperature reduction rate, α = 0.90  90
4.8  Information of concatenation cost (Move 4) with temperature reduction rate, α = 0.90  91
4.9  Information of concatenation cost with temperature reduction rate, α = 0.95  92
4.10  Information of concatenation cost with temperature reduction rate, α = 0.85  93
4.11  Information of concatenation cost with temperature reduction rate, α = 0.80  94
4.12  Information of concatenation cost with temperature reduction rate, α = 0.95  95
4.13  Information of concatenation cost with temperature reduction rate, α = 0.90  96
4.14  Information of concatenation cost with temperature reduction rate, α = 0.85  97
4.15  Information of concatenation cost with temperature reduction rate, α = 0.80  98
4.16  The sequences of the 10 selected words  100
5.1  Profiles of listeners  109
5.2  Words selected for listening test  113
5.3  The score line of synthesized words considering the concatenation cost  115
5.4  The score line of the 10 synthesized words considering the concatenation cost  116

LIST OF FIGURES

FIGURE NO.  TITLE  PAGE

2.1  Classes of waveform synthesis methods for speech synthesis  7
2.2  Viterbi search  8
2.3  Architecture of corpus-based unit selection concatenative speech synthesizer  13
2.4  Schematic diagram of cost function  15
2.5  Example of unit search algorithm. The shortest path is marked in blue  28
2.6  Example of unit search algorithm. The difference in cost between the optimal sequences of two graphs is evaluated for d3 in pre-selection  29
2.7  Objective spectral distances  30
2.8  Block diagram of the conventional MFCC extraction algorithm  31
2.9  Parallel Simulated Annealing taxonomy  46
2.10  Segmented simulated annealing  48
3.1  Block diagram of system design flow  50
3.2  A set of coefficients transformed by the MFCC algorithm  53
3.3  Speech unit database  54
3.4  The GUI to extract MFCC coefficients  55
3.5  The GUI to extract MFCC coefficients  56
3.6  The 12 coefficients extracted for phoneme "_m"  56
3.7  The 12 coefficients extracted for phoneme "a"  57
3.8  Distance measure and speech feature  57
3.9  The candidate unit for phoneme "_n" that matched the right phonetic context  58
3.10  The candidate unit for phoneme "a" that matched the left and right phonetic context  59
3.11  Unit selection  60
3.12  Waveform for phoneme "_n"  61
3.13  Waveform for phoneme "a"  61
3.14  Waveform for phoneme "s"  61
3.15  Waveform for phoneme "i"  62
3.16  Concatenation of the best matching units for the word "nasi"  62
4.1  SA flow diagram to find the best speech unit sequence  66
4.2  Temperature reduction pattern for various reduction rates with Markov chain length 1  69
4.3  Temperature reduction pattern for various reduction rates with Markov chain length 2  69
4.4  Metropolis criterion  81
4.5  The feasible search region to form the Malay word "kampung" before filtering using phonetic context  84
4.6  The feasible search region to form the Malay word "kampung" after filtering using partially matched phonetic context (left phonetic context)  85
4.7  The feasible search region to form the Malay word "kampung" after filtering using fully matched phonetic context (left and right phonetic context)  85
4.8  SA best, mean and worst solutions for ten problems from Table 4.12  99
4.9  Waveform "_s1"  101
4.10  Waveform "e537"  101
4.11  Waveform "l362"  101
4.12  Waveform "a2710"  101
4.13  Waveform "n1031"  102
4.14  Waveform "j7"  102
4.15  Waveform "u206"  102
4.16  Waveform "t142"  102
4.17  Waveform "ny1"  103
4.18  Waveform "a2060"  103
4.19  Concatenation waveform for the word "selanjutnya"  103
4.20  Spectrogram for the word "nasi"  104
4.21  Spectrogram for the word "berpengetahuan"  104
4.22  Spectrogram for the word "demikian"  105
4.23  Spectrogram for the word "demikian" that does not consider concatenation cost  105
4.24  Spectrogram zoom-in for the word "demikian" from Figure 4.22  106
4.25  Spectrogram zoom-in for the word "demikian" from Figure 4.23  106
5.1  Percentage of listeners by gender  110
5.2  Percentage of listeners by race  111
5.3  Percentage of listeners by state of origin  112
5.4  Level of intelligibility of the 10 selected words  114
5.5  Results of the mean opinion score  115

LIST OF SYMBOLS / ABBREVIATIONS

AC  Average cost
kb  Boltzmann constant
S  Configuration set
C  Cost function
E  Energy
Cmax  Estimation of the maximum value of the cost function
⟨f(T)⟩  Expected cost in equilibrium
FFT  Fast Fourier Transform
F0  Fundamental frequency
GUI  Graphical user interface
KL  Kullback-Leibler
LSF  Line spectral frequencies
LP  Linear prediction
LPC  Linear Predictive Coefficients
LC  Local cost
MC  Maximum cost
MOS  Mean Opinion Score
MCD  Mel-cepstral distortion
MFCCs  Mel-Frequency Cepstral Coefficients
Mel(f)  Mel scale
MCA  Multiple centroid analysis
NCp  Norm cost
N  Neighbourhood structure
PLP  Perceptual linear prediction
P(E)  Probability of acceptance
δ  Real number
Cpro  Sub-cost on prosody
CF0  Sub-cost on F0 discontinuity
Cenv  Sub-cost on phonetic environment
Cspec  Sub-cost on spectral discontinuity
Capp  Sub-cost on phonetic appropriateness
T  Temperature
α  Temperature reduction rate
TTS  Text-to-speech
TSP  Travelling salesman problem
U  Upper bound
σ²(T)  Variance in the cost at equilibrium

LIST OF APPENDICES

APPENDIX  TITLE  PAGE

A  Source Code of MFCC  131
B  Source Code of Simulated Annealing (Move 1)  137
C  Source Code of Simulated Annealing (Move 2)  145
D  Source Code of Simulated Annealing (Move 3)  153
E  Evaluation Questionnaire  161

CHAPTER 1

INTRODUCTION

1.0 Introduction

Corpus-based concatenative synthesis has recently become the major trend because the resulting speech sounds more natural than that produced by parameter-driven production models (Chou, 1999). Unit selection synthesizers in their current state produce highly intelligible, near-natural synthetic speech (Tsiakoulis et al., 2008). This method creates speech by re-sequencing pre-recorded speech units selected from a very large speech database (Cepko et al., 2008). Speech is produced by searching through a large speech database (corpus) and concatenating the selected units, thus forming the output signal. This approach shows its superiority over formant and articulatory synthesis because it concatenates natural acoustic units with little or no modification, thus offering better speech quality (Janicki et al., 2008). Synthetic speech is produced by concatenating speech units from a very large speech corpus containing enough prosodic and spectral varieties for all synthesis units (Vepa et al., 2002; Vepa and King, 2004). Hence, it is possible to synthesize highly natural-sounding speech by selecting an appropriate sequence of units (Vepa et al., 2002). The selection of the best unit sequence from the database, the sequence with the lowest overall distance, can be treated as a search problem. Since the quality of the resulting synthetic speech depends to a large extent on the variability and availability of representative units, it is crucial to design a corpus that covers all speech units and most of their variations in a feasible size (Min et al., 2001). The unit selection process is based on a cost function that consists of a target cost and a join cost.
The join cost is a measurement of the acoustic smoothness between the concatenated units (Dong and Li, 2008). This dissertation focuses on concatenation costs, which generally use a distance measure on a parameterization of the speech signal. MFCCs are chosen as spectral parameters as they are the most commonly used in state-of-the-art recognizers (Rabiner and Juang, 1993). A distance measurement is needed to measure the difference between two vectors of this speech feature. The spectral distance used is the Euclidean distance. The Mel Frequency Cepstral Coefficients were derived using standard methods commonly used in speech recognition. MFCCs are representative of the real cepstrum of a windowed short-time signal derived from the Fast Fourier Transform (FFT) of the speech signal (Wei et al., 2006).

1.1 Background of the Problem

The main problem with the existing Malay text-to-speech (TTS) synthesis systems is the poor quality of the generated speech sound. This poor quality is caused by the inability of traditional TTS systems to provide multiple choices of unit for generating accurate synthesized speech (Tan and Sheikh, 2008b). Most current Malay TTS systems utilize diphone concatenation, which supports only a single unit for each existing diphone, so the selection of speech units for concatenation may not be accurate enough (Tan and Sheikh, 2008b). The current trend in high-quality text-to-speech (TTS) systems is to concatenate acoustic units selected from a large-scale corpus of continuous read speech. Thus, a robust unit selection is needed to handle the huge volume of data in the database (Blouin et al., 2002). Artifacts such as phase mismatches and discontinuities in spectral shape exist because units are extracted from disjoint phonetic contexts, and these can have a deleterious effect on perception (Hunt and Black, 1996).
Unit selection is nominally cast as a multivariate optimization task, where the available unit inventory is searched for the "best" sequence of units making up the target utterance. This optimization relies on suitable cost criteria to characterize relevant aspects of acoustic and prosodic context (Bellegarda, 2008).

1.2 Problem Statement

The task of this research is to use Simulated Annealing to find the minimum-cost path through the speech units.

1.3 Objective of the Study

The dissertation aims to achieve the three objectives outlined in this section:
i) To implement Mel Frequency Cepstral Coefficients (MFCCs) in unit selection.
ii) To implement a heuristic optimization method in unit selection.
iii) To evaluate the performance of the heuristic optimization method in unit selection.

1.4 Scopes of the Study

This dissertation presents a variable-length unit selection scheme to select text-to-speech (TTS) synthesis units from a phoneme-based corpus that supports the phoneme patterns of Malay text-to-speech. The speech features selected are MFCCs. The spectral distance used is the Euclidean distance. A heuristic method, namely Simulated Annealing, is implemented in unit selection to select the best sequence of units.

1.5 Significance of the Study

For the Malay TTS system, this is the first implementation of unit selection using a heuristic method, namely Simulated Annealing. The performance of this algorithm will be evaluated based on the values of the cost functions obtained and a listening test. By doing so, the advantages and disadvantages of this method compared to other existing unit selection methods will be known.

1.6 Research Methodology

Variable-length unit selection is capable of providing more natural and accurate unit selection for synthesized speech and has been implemented in the Malay text-to-speech system in this project (Tan and Sheikh, 2008b).
During synthesis, proper units are selected by searching for the database units closest to the symbolic target sequence using Simulated Annealing. The number of possible units at a given time can reach the tens of thousands if a database is built from a 100-hour corpus (Nishizawa and Kawai, 2006). Therefore, a heuristic optimization method is needed to select the appropriate units without having to go through all possible combinations of unit sequences. The C++ program code for Simulated Annealing was developed. To make the acoustic distortion measures correspond to human perception more consistently, the Mel Frequency Cepstral Coefficients (MFCC) have been introduced as spectral parameters in the unit selection based speech synthesis (John et al., 1993). A distance measurement is needed to measure the difference between two vectors of this speech feature. The spectral distance used is the Euclidean distance. A smaller Euclidean distance means a closer match at the concatenation point and thus a better generated speech sound. The performance of the heuristic method and other unit selection methods was evaluated based on the values of the cost functions obtained and a listening test.

1.7 Dissertation Layout

This dissertation is divided into six major parts. Chapter 1 includes the introduction, background, objective and scope of the thesis. The purpose is to show how this research differs from other conventional methods. Chapter 2 provides a comprehensive study of various unit selection methods. The focus is on the cost function for unit selection, speech features and spectral distance. It also includes a discussion of Simulated Annealing (SA), with the purpose of laying a foundation for possible approaches to improve the performance of SA. Chapter 3 describes the proposed system and implementation.
It discusses the processes involved in generating the waveform for a synthesized word: contextual linguistics, selection of speech units, concatenation and output sound. Chapter 4 describes the procedure for SA. It also describes the unit selection procedure from contextual linguistics through SA to concatenation. Various parameter settings and neighbourhood generation mechanisms for SA are used to investigate the performance of SA. Chapter 5 presents the listening test for the synthesized words based on the results in Chapter 4. The purpose is to justify the contribution of the concatenation cost in improving the speech quality. Chapter 6 provides the conclusion for the system. It also gives some recommendations for further improvement of the system.

CHAPTER 2

LITERATURE REVIEW

2.1 Speech synthesis

There are two types of speech synthesis methods (Figure 2.1): parameter synthesis and concatenative synthesis (Hirai and Tenpaku, 2004). The parameter synthesis method involves encoding and decoding of speech samples (Hirai and Tenpaku, 2004). Before the speech samples are stored in a database, they are encoded into parameters such as Linear Predictive Coefficients (LPC). In the synthesis stage, these speech samples are decoded. Encoding of the speech samples is required because of the small memory storage size. During the encoding process, information is lost, and as a consequence speech intelligibility degrades (Hirai and Tenpaku, 2004). Concatenative synthesis does not involve encoding and decoding of speech samples. During speech concatenation, the original speech segments are selected and concatenated (Hirai and Tenpaku, 2004). "Original" here means the segments are used as they are, or only lightly processed. In this case, speech intelligibility and the speaker's identity are maintained since the information is well preserved. However, this method requires large storage capabilities.
This method also incurs a large computational cost in searching for appropriate speech segments to concatenate. The issues of large storage capabilities and computational cost are being resolved by the ever-increasing advancements in computer technology (Hirai and Tenpaku, 2004). For the synthesis system in this dissertation, the length of a segment is a phoneme.

Figure 2.1 Classes of waveform synthesis methods for speech synthesis (Schwarz, 2007): parametric synthesis (source-filter, articulatory) and concatenative synthesis (fixed inventory of diphones, triphones or demisyllables, and non-uniform unit selection).

2.1.1 Concatenative Speech Synthesis

Several different techniques exist for synthesizing speech (Chappell and Hansen, 2002). The corpus-based concatenative approach to speech synthesis has been widely explored in recent years (Sakai et al., 2008). Concatenative synthesis starts with a collection of speech waveform signals and concatenates individual segments to form a new utterance. The concatenative approach is based on the idea of re-combining natural prosodic contours and phoneme sequences using a superpositional framework (Jan et al., 2005). In concatenative speech synthesis, speech segments, each of which is often generalized as a unit, are selected from the speech corpus through the minimization of an overall cost. The Viterbi search (Figure 2.2), based on the dynamic programming (DP) approach, is typically employed to find the unit sequence with the minimal cost (Nishizawa and Kawai, 2008). The final speech is more natural than with other forms of synthesis since concatenative synthesis begins with a set of natural speech segments (Chappell and Hansen, 2002).

Figure 2.2 Viterbi search (Sakai et al., 2008).

There exist several possible choices for the basic synthesis unit in a concatenative speech synthesizer, such as phonemes, diphones, demisyllables, syllables, words or phrases (Min et al., 2001).
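The Viterbi search mentioned above can be sketched as a dynamic program over a lattice of candidate units. The following C++ fragment is a minimal illustrative sketch, not the original system's code; the function name and the cost-table layout are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// targetCost[t][j]: target cost of candidate j at target position t.
// joinCost[t][i][j]: concatenation cost from candidate i at position t
// to candidate j at position t + 1. Returns the minimal total cost.
double viterbiMinCost(const std::vector<std::vector<double>>& targetCost,
                      const std::vector<std::vector<std::vector<double>>>& joinCost) {
    // best[i]: cost of the cheapest unit sequence ending at candidate i
    std::vector<double> best = targetCost[0];
    for (std::size_t t = 1; t < targetCost.size(); ++t) {
        std::vector<double> next(targetCost[t].size(),
                                 std::numeric_limits<double>::infinity());
        for (std::size_t j = 0; j < targetCost[t].size(); ++j)
            for (std::size_t i = 0; i < best.size(); ++i)
                next[j] = std::min(next[j], best[i] + joinCost[t - 1][i][j]
                                                    + targetCost[t][j]);
        best = next;  // keep only the previous column, as in Viterbi search
    }
    return *std::min_element(best.begin(), best.end());
}
```

Because only the previous column of path costs is kept, the search is linear in the number of target positions and quadratic in the number of candidates per position.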
Smaller units and larger units each have their own advantages and disadvantages. For small units such as phonemes, it is not difficult to collect a speech corpus that embodies many prosodic and spectral varieties (Min et al., 2001). The disadvantage is that the synthesized speech tends to suffer more distortion caused by mismatches between concatenated units, since small units mean many more concatenation points. For larger units such as words or phrases, it is almost impossible to cover many varieties in a feasible size (Min et al., 2001), although they have fewer concatenation points. Each segment's boundary for concatenation is chosen during synthesis in order to best fit the adjacent segments in the synthesized utterance. Spectral mismatch can be computed using an objective measure to determine the level of spectral fit between segments at various possible boundaries. The spectral mismatch is measured at the various possible segment boundaries, and the minimum measure score indicates the closest spectral match (Chappell and Hansen, 2002). A large database can yield high speech quality from direct concatenation of segments, since the database contains enough sample segments to include a close match for each desired segment (Chappell and Hansen, 2002). However, a large database is also costly in terms of database collection, search requirements, and segment memory storage and organization (Chappell and Hansen, 2002). For databases that contain multiple instances of each speech unit, segment selection is based upon two cost functions: the target cost and the concatenation cost (Chappell and Hansen, 2002). The target cost measures the difference between an available segment and a theoretically ideal segment, and the concatenation cost measures the acoustic smoothness between the concatenated units (Dong and Li, 2008). Several spectral distance measures have been compared when used as concatenation costs (Chappell and Hansen, 2002).
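As a concrete illustration of one such spectral distance, the Euclidean distance used in this dissertation can be computed between two MFCC frame vectors as follows. This is a minimal sketch; the function name is an assumption for this example, not taken from the original system.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Euclidean (L2) distance between two MFCC vectors, e.g. the 12
// coefficients extracted at the boundary frames of two candidate units,
// used as a concatenation (join) cost at the unit boundary.
double euclideanDistance(const std::vector<double>& a,
                         const std::vector<double>& b) {
    assert(a.size() == b.size());  // both frames use the same coefficient count
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = a[i] - b[i];
        sum += d * d;              // accumulate squared differences
    }
    return std::sqrt(sum);         // L2 norm of the difference vector
}
```

A smaller value indicates a closer spectral match at the concatenation point, which is the quantity the unit selection search tries to minimize.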
2.2 Unit Selection

The unit selection method has recently become the major approach in speech synthesis and has captured the attention of most researchers. The speech units need to be carefully selected from a large database of continuous read speech recorded from a professional speaker in order to yield a high-quality TTS system (Sarathy and Ramakrishnan, 2008). To provide enough speech target specifications for unit selection, the database should be designed to cover as much of the prosodic and phonetic characteristics of the language as possible (Sarathy and Ramakrishnan, 2008). Unit selection becomes much slower when a larger unit database is used for high-quality sounds, because the computational effort in the search for the optimal unit sequence is proportional to the square of the number of possible units (Nishizawa and Kawai, 2006). The aim of unit selection speech synthesis is to select a sequence of units that requires little signal processing, or ideally no signal processing at all (Robert et al., 2007). In current speech synthesizers, the process of unit selection is based on some type of dynamic programming that selects units from a large database by minimizing an integrated cost (Wu et al., 2004). In corpus-based TTS, the search for the optimum unit sequence is performed by a Viterbi algorithm that minimizes a cost function (Díaz and Banga, 2006). There are two different types of spectral distortion in unit selection synthesis: contextual unit distortion and inter-unit distortion (Sagisaka, 1994). The unit selection algorithm uses two cost functions: the target cost and the concatenation cost (Fek et al., 2006). The quality of unit selection based concatenative speech synthesis is determined by how well two successive units can be joined together to minimize the audible discontinuities (Vepa et al., 2002). The concatenation cost is used as the objective measure of discontinuity.
Contextual unit distortion and inter-unit distortion are most often represented by the target cost and the concatenation cost, respectively. Contextual unit distortion is a measure over a whole unit of the distortion caused by the difference between the unit-extraction context and the target synthesis context. Most of the time, the phonetic contextual difference is used to represent this contextual unit distortion (Sagisaka, 1994). Inter-unit distortion is a measure of the spectral discontinuity at unit concatenation boundaries. This spectral discontinuity is computed from the difference in the spectral envelopes of neighboring units at the concatenation points (Sagisaka, 1994). Combinations of units should be considered in unit selection, since the suitability of a synthesis unit includes not only the similarity between a synthesis target and a selected unit but also the smoothness between neighboring units (Nishizawa and Kawai, 2006). According to Fek et al. (2006), the target cost is assigned a null cost if the phonetic contexts are fully matched, meaning that the left and right phonetic contexts of the input unit and the candidate match exactly. Likewise, according to Fek et al. (2006), the concatenation cost is assigned a null cost if the units were consecutive in the speech database. The coverage of units in all the different speaker attitudes and prosodic styles is one of the biggest problems in the unit selection approach (Sarathy and Ramakrishnan, 2008). Even recording several hundred thousand sentences will not be enough to guarantee full coverage of all target feature combinations (Sarathy and Ramakrishnan, 2008). Therefore, unit selection is very important for substituting units that do not match the target specification with the closest substitutes.
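The null-cost rules of Fek et al. (2006) described above can be sketched in code. The `Unit` structure, the fixed penalty values, and the function names are illustrative assumptions for this example; the spectral-distance term of the concatenation cost is stubbed out as a constant.

```cpp
#include <cassert>
#include <string>

// A candidate unit carries its phonetic context in the corpus and its
// position in the recorded speech database.
struct Unit {
    std::string leftContext;   // phoneme preceding this unit in the corpus
    std::string rightContext;  // phoneme following this unit in the corpus
    int dbIndex;               // position of the unit in the speech database
};

// Target cost: null when both the left and right phonetic contexts of the
// candidate match the target's contexts exactly; otherwise a penalty.
double targetCost(const Unit& target, const Unit& candidate) {
    bool fullMatch = target.leftContext == candidate.leftContext &&
                     target.rightContext == candidate.rightContext;
    return fullMatch ? 0.0 : 1.0;
}

// Concatenation cost: null when the two units were consecutive in the
// database (a natural join needs no measurement); otherwise a
// spectral-distance-based cost, stubbed here as a constant.
double concatenationCost(const Unit& prev, const Unit& next) {
    if (next.dbIndex == prev.dbIndex + 1) return 0.0;  // consecutive units
    return 1.0;  // placeholder for, e.g., Euclidean distance between MFCCs
}
```

The null cost for consecutive units is what biases the search toward maximally long contiguous stretches of the corpus, reducing the number of audible joins.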
To improve the naturalness of the synthesized speech, a possible approach is to increase the length of the basic units from demi-phones, phones, diphones, triphones, syllables and words to non-uniform or variable-length units (Wu et al., 2004). Since fewer concatenation points yield less spectral discontinuity, longer synthesis units reduce the effect of spectral distortion (Wu et al., 2004). In concatenative systems, speech units can be either fixed-size diphones or variable-length units (Hasim et al., 2006). Fixed-size units allow only one unit length in the speech unit database. The length can be a phoneme, diphone, syllable or word, depending on the application. The approach that utilizes variable-length units is known as unit selection (Hasim et al., 2006). For Malay speech unit concatenation, two types of units are used, namely single unit and unit selection. The single-unit method has only one occurrence of each possible unit. For this type of method, the diphone level has been used by most research and available commercial TTS systems, namely the Festival Speech Synthesis system, Mbrola, and German TTS systems (Taylor et al., 1999). In the unit selection method, a large speech corpus containing more than one instance of each unit is recorded, so it provides more accurate unit selection with multiple choices for every unit. The variable-length units are selected based on an estimated objective measure to optimize the synthetic speech quality (Hasim et al., 2006).

2.2.1 Non-Uniformed or Variable Length Unit Selection

Nowadays, many speech synthesis systems use non-uniform or variable-length unit concatenation in an effort to minimize audible signal discontinuities at the concatenation points (Stylianou and Syrdal, 2001).
To produce the output speech, the most appropriate units of variable lengths, with the desired prosodic features, are automatically retrieved from the corpus, selected on-line in real time, and concatenated during the synthesis process (Chou and Tseng, 1998). Usually, longer units can be used in the synthesis if they appear in the corpus with the desired prosodic features; the need for signal modification to obtain the desired prosodic features for a voice unit is then significantly reduced, which matters because signal modification usually degrades the naturalness of the synthesized speech (Chou and Tseng, 1998). The quality of a generated waveform depends mainly on the number of concatenation points. Therefore, higher perceived speech quality can be produced as the length of the elements used in the synthesized speech increases, that is, as the number of concatenation points decreases (Nagy et al., 2005).

2.2.2 Corpus-based Unit Selection

The concept of corpus-based or unit selection synthesis is that the corpus is searched for maximally long phonetic strings to match the sounds to be synthesized (Piits et al., 2007). Corpus-based speech tends to elicit considerably higher ratings of naturalness in auditory tests than diphone or triphone synthesis (Nagy et al., 2005), because there are fewer real concatenation points (Fek et al., 2006). To solve the problems of fixed-size unit inventory synthesis, e.g. diphone synthesis, a promising methodology has been proposed using corpus-based concatenative speech synthesis, i.e. unit selection (Hasim et al., 2006). The speech corpus contains more than one instance of each unit to capture the prosodic and spectral variability found in natural speech. In corpus-based systems, acoustic units of varying sizes are selected from a large speech corpus and concatenated. If an appropriate unit is found in the unit inventory, the need for signal modification of the selected units is minimized (Hasim et al., 2006).
A unit selection algorithm is required to choose, from the inventory, the units that best match the target specification of the input sequence, because there is more than one instance of each unit (Hasim et al., 2006). A factor that has been argued to contribute to the perceived lack of naturalness of synthesized speech is the frequency of unit concatenations (Möbius, 2000). The unit selection algorithm favors choosing adjacent speech segments in order to minimize the number of joins (Hasim et al., 2006). Thus, the limitations of concatenative synthesis with a fixed acoustic unit inventory have been overcome by corpus-based approaches. Unit selection produces much better output speech quality than fixed-size unit inventory synthesis in terms of naturalness. However, some challenges still face unit selection. One problem is the inconsistency of the resulting speech quality (Hasim et al., 2006). If the unit selection algorithm fails to find a good match for a target unit, the selected unit needs to undergo prosodic modifications, which degrade the speech quality at that segment join. To overcome this problem, the speech corpus should be designed to cover all the prosodic and acoustic variations of the units that can be found in an utterance. It is infeasible to record larger and larger databases, since this will slow down the unit selection; a better way is to aim for optimal coverage of the language. Another problem faced by unit selection is that glitches may appear at the concatenation points when the speech waveforms are concatenated. A speech model is generally used for speech representation and waveform generation, to ensure smooth concatenation of the speech waveforms and to enable prosodic modifications of the speech units (Hasim et al., 2006). Figure 2.3 shows the components of the Malay waveform generator modules.
Figure 2.3 Architecture of corpus-based unit selection concatenative speech synthesizer.

Algorithm 2.1 shows the procedure to find an appropriate speech unit sequence for synthesis using the perturbation of model parameters (Hirai et al., 2002).

Algorithm 2.1
1. Estimate the values of the speech features depending on the input text. These values are called the "expected values" of a feature.
2. According to the expected values, a speech unit sequence which gives the minimum cost C0 is found in the speech database, and the IDs of the units in the sequence are substituted into S0.
3. The cost C0 and the speech unit sequence S0 are substituted into Cmin and Smin.
4. Choose a feature from among F0, speech unit duration, power, etc., and substitute it into F. Execute the processing shown below:
   4.1 The model parameter values of F are adjusted within the range in which naturalness and clarity are maintained, and the new expected value sequence is substituted into T.
   4.2 Based on T, a speech unit sequence S, which gives the minimum cost C, is found in the speech database. If C < Cmin, then C and S are substituted into Cmin and Smin.
5. Repeat step (4) until the number of repetitions exceeds the limit or until there is no expectation of renewing Cmin.
6. Synthesize the speech based on the speech unit sequence Smin.

2.3 Cost function for unit selection

The cost function for unit selection is viewed as a transformation of objective features, such as acoustic measures and linguistic information, into a perceptual measure (Figure 2.4). The predicted perceptual measure, which is expected to capture the degradation of synthetic speech naturalness, is considered a cost (Toda et al., 2006). Perceptual experiments should be conducted, and the components of the cost function determined based on the results of those experiments.
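The perturbation loop of Algorithm 2.1 above can be sketched as follows. This is a minimal illustration, not the implementation of Hirai et al. (2002): the function names, the +/-5% adjustment range, and the `search` callable standing in for the database search are all assumptions introduced here.

```python
import random

def perturbation_search(expected, features, search, max_iter=50, seed=0):
    """Sketch of Algorithm 2.1: perturb the expected feature values (F0,
    duration, power, ...) and keep the unit sequence with minimum cost.

    expected -- dict mapping feature name -> list of expected values
    features -- feature names eligible for perturbation (step 4)
    search   -- callable(expected) -> (cost, unit_id_sequence), standing in
                for the minimum-cost database search (steps 2 and 4.2)
    """
    rng = random.Random(seed)
    c_min, s_min = search(expected)      # steps 1-3: initial C0, S0
    for _ in range(max_iter):            # step 5: iterate up to the limit
        f = rng.choice(features)         # step 4: pick a feature F
        trial = dict(expected)
        # step 4.1: adjust F within a range assumed to preserve naturalness
        # and clarity (+/-5% here is purely illustrative)
        trial[f] = [v * rng.uniform(0.95, 1.05) for v in expected[f]]
        c, s = search(trial)             # step 4.2: re-search the database
        if c < c_min:
            c_min, s_min = c, s
    return c_min, s_min                  # step 6: synthesize from Smin
```

A caller would plug in the real database search for `search`; the loop only ever keeps a perturbed expectation when it lowers the cost.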
In practice, it is possible to experimentally transform acoustic measures into a perceptual measure provided the acoustic features have simple structures, as in the case of F0 or phoneme duration. Most of the time, however, acoustic features have such complex structures that this kind of experiment is infeasible (Toda et al., 2006). Although various studies have searched for an acoustic measure that can capture perceptual characteristics, nothing satisfactory has been found so far (Klabbers and Veldhuis, 2001; Stylianou and Syrdal, 2001; Ding and Campbell, 1998; Wouters and Macon, 1998). Besides, phonetic information can be transformed into perceptual measures from perceptual experiments (Kawai and Tsuzaki, 2002). However, acoustic measures that can represent the characteristics of instances of waveform segments are still necessary, since phonetic information can only evaluate the difference between phonetic categories (Toda et al., 2006).

Figure 2.4 Schematic diagram of cost function (Toda, 2003).

2.3.1 The Acoustic Parameters

The parameters that fall into this category are normally prosodic parameters that describe the pitch and duration of a unit. The use of prosody alone is not enough to reflect spectral mismatches; both spectral parameters and prosodic parameters need to be included in the unit for unit selection (Dong and Li, 2008). In this dissertation, MFCC is employed to represent the spectral information, using 12 MFCC coefficients, and the phoneme is chosen as the basic synthesis unit.

2.3.2 Linguistic Features

The linguistic features are used for predicting the acoustic parameters and can be obtained from the input text. The linguistic features that can be derived from the utterance files include context units, syllable information, syllable position information, word information, phrase information and utterance information.
Context units describe the phone identities of the previous two and next two units. Syllable information describes the stress, accent and length of the previous, current and next syllables. Syllable position information covers the syllable position in the word and phrase, the stressed syllable position in the phrase, the accented syllable position in the phrase, the distance from the stressed syllable, the distance from the accented syllable, and the name of the vowel in the syllable. Word information describes the length and part-of-speech of the previous, current and next words, and the position of the word in the phrase. Phrase information describes the lengths (in number of words and syllables) of the previous, current and next phrases, the position of the current phrase in the major phrase, and the boundary tone of the current phrase. Finally, utterance information describes the lengths in number of syllables, words and phrases (Dong and Li, 2008).

2.3.3 Local cost

The degradation of naturalness caused by an individual candidate unit can be expressed using a local cost: the higher the local cost, the less natural the synthesized speech. The cost function is defined as a weighted sum of five sub-cost functions (Table 2.1), each representing either source information or vocal tract information (Toda et al., 2006).

Table 2.1 Sub-cost functions (Toda, 2003).
    Source information:       prosody (F0, duration) C_pro; F0 discontinuity C_F0
    Vocal tract information:  phonetic environment C_env; spectral discontinuity C_spec; phonetic inappropriateness C_app

The local cost LC(u_i, t_i) at a candidate unit u_i for a target phoneme t_i is given by

LC(u_i, t_i) = w_pro · C_pro(u_i, t_i) + w_F0 · C_F0(u_i, u_{i-1}) + w_env · C_env(u_i, u_{i-1}) + w_spec · C_spec(u_i, u_{i-1}) + w_app · C_app(u_i, t_i),

w_pro + w_F0 + w_env + w_spec + w_app = 1,

where w_pro, w_F0, w_env, w_spec and w_app denote the weights of the respective sub-costs. All sub-costs need to be normalized so that they have positive values with the same mean.
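The weighted sum above can be sketched directly. This is a minimal sketch: the dictionary keys and the function name are illustrative, not taken from Toda (2003).

```python
def local_cost(subcosts, weights):
    """Weighted sum of the five sub-costs of Table 2.1.

    subcosts, weights -- dicts keyed by 'pro', 'F0', 'env', 'spec', 'app';
    the weights are assumed to sum to one, as required above."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[k] * subcosts[k] for k in subcosts)
```

With all five weights equal, the local cost reduces to the plain average of the five sub-costs.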
The previous unit u_{i-1} denotes a candidate unit for the (i-1)-th target phoneme t_{i-1} (Toda et al., 2006). The sub-costs C_env, C_spec and C_F0 become null when the candidate units u_{i-1} and u_i are adjacent in the corpus. The five sub-costs can be further divided into a target cost C_t and a concatenation cost C_c (Hunt and Black, 1996; Campbell and Black, 1997). The unit selection process is thus based on two types of cost function: the target cost and the concatenation cost (Dong and Li, 2008). The local cost (Toda, 2003; Toda et al., 2006) is given by

LC(u_i, t_i) = w_t · C_t(u_i, t_i) + w_c · C_c(u_i, u_{i-1}),    w_t + w_c = 1,

such that the target cost C_t and the concatenation cost C_c can be written as

C_t(u_i, t_i) = (w_pro / w_t) · C_pro(u_i, t_i) + (w_app / w_t) · C_app(u_i, t_i)

C_c(u_i, u_{i-1}) = (w_env / w_c) · C_env(u_i, u_{i-1}) + (w_spec / w_c) · C_spec(u_i, u_{i-1}) + (w_F0 / w_c) · C_F0(u_i, u_{i-1}).

Since w_pro/w_t + w_app/w_t = 1, it follows that w_t = w_pro + w_app. Similarly, w_env/w_c + w_spec/w_c + w_F0/w_c = 1, therefore w_c = w_F0 + w_env + w_spec.

The target cost takes into account the phone label and its position in the word, the phone label and its position in the syllable, and the segment duration (according to statistical data). The concatenation cost takes into account adjacent phones in the same speech segment, avoids overly large duration, pitch and energy differences between adjacent phones, and favors similar phonetic features for adjacent phones (Prudon et al., 2002). It is not easy to reach a balance between the target cost and the join cost: if more weight is given to the concatenation cost, the target cost may be weighted low and thus result in bad synthesis. Combining these two costs is necessary as a way of lessening this behavior (Vepa and King, 2004).

2.3.3.1 Sub-cost on prosody: C_pro

The difference in prosodic parameters (F0 contour and phoneme duration) between a candidate segment and the target causes a degradation of naturalness.
The C_pro sub-cost captures the degradation of naturalness caused by this phenomenon (Toda et al., 2006). To calculate the difference in the F0 contour, a phoneme is divided into several parts, and the difference in the averaged log-scaled F0 is computed in each part. The prosodic cost is represented as the average of the costs calculated in each phoneme (Toda, 2003). This sub-cost function (Toda et al., 2006) can be estimated from the results of perceptual experiments and can be written as

C_pro(u_i, t_i) = (1/M) · Σ_{m=1}^{M} P( D_F0(u_i, t_i, m), D_d(u_i, t_i) )     (2.1)

where D_F0(u_i, t_i, m) is the difference in the averaged log-scaled F0 in the m-th divided part; D_F0 is set to zero in unvoiced phonemes. D_d is the difference in duration; it is calculated for each phoneme and used in the calculation of the cost in each part. M is the number of divisions (Toda, 2003). P is a nonlinear function determined from the results of perceptual experiments on the degradation of naturalness caused by prosody modification, assuming that the output speech is synthesized with prosody modification. When prosody modification is not performed, the function should instead be determined from experiments on the degradation of naturalness caused by using a prosody different from that of the target (Toda, 2003).

2.3.3.2 Sub-cost on F0 discontinuity: C_F0

F0 discontinuity at a segment boundary causes a degradation of naturalness, which the C_F0 sub-cost captures. This sub-cost (Toda et al., 2006) is computed as a distance based on the log-scaled F0 at the boundary and is given by

C_F0(u_i, u_{i-1}) = P( D_F0(u_i, u_{i-1}), 0 )

where D_F0 is the difference in log-scaled F0 at the boundary and is set to zero at unvoiced phoneme boundaries. The function P of Equation (2.1) is used to normalize the dynamic range of the sub-cost.
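The per-part averaging of Equation (2.1), including its handling of unvoiced parts, can be sketched as below. The perceptual function P is passed in as a callable because Toda (2003) determines it experimentally; the function name and argument layout here are assumptions for illustration only.

```python
import math

def prosody_subcost(f0_cand, f0_targ, dur_diff, P):
    """Sketch of Eq. (2.1): per-part averaged log-F0 differences, mapped
    through the perceptual function P and averaged over the M parts.

    f0_cand, f0_targ -- per-part average F0 values (Hz); None marks an
                        unvoiced part, for which D_F0 is set to zero
    dur_diff         -- duration difference D_d, shared by all parts
    P                -- nonlinear perceptual mapping (hypothetical here)
    """
    M = len(f0_cand)
    total = 0.0
    for c, t in zip(f0_cand, f0_targ):
        if c is None or t is None:
            d_f0 = 0.0               # unvoiced: D_F0 = 0
        else:
            d_f0 = abs(math.log(c) - math.log(t))
        total += P(d_f0, dur_diff)   # same D_d used in every part
    return total / M
```

Any monotone mapping can stand in for P when experimenting with the formula.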
The C_F0 sub-cost becomes zero when the units u_{i-1} and u_i are adjacent in the corpus (Toda, 2003).

2.3.3.3 Sub-cost on phonetic environment: C_env

A mismatch of phonetic environments between a candidate segment and the target causes a degradation of naturalness (Toda et al., 2006). The C_env sub-cost captures this degradation and is given by

C_env(u_i, u_{i-1}) = (1/2) · { S_s(u_{i-1}, E_s(u_{i-1}), u_i) + S_p(u_i, E_p(u_i), u_{i-1}) }     (2.2)
                    = (1/2) · { S_s(u_{i-1}, E_s(u_{i-1}), t_i) + S_p(u_i, E_p(u_i), t_{i-1}) }     (2.3)

where S_s and S_p are sub-cost functions that capture the degradation of naturalness caused by the mismatch with the succeeding and preceding environment, respectively, and E_s and E_p are the succeeding and preceding phonemes in the corpus. Thus, S_s(u_{i-1}, E_s(u_{i-1}), t_i) is the degradation caused by the mismatch with the succeeding environment of the phoneme for u_{i-1} when E_s(u_{i-1}) is substituted with the phoneme for t_i, and S_p(u_i, E_p(u_i), t_{i-1}) is the degradation caused by the mismatch with the preceding environment of the phoneme u_i when E_p(u_i) is substituted with the phoneme for t_{i-1}. The sub-cost functions S_s and S_p are determined from the results of perceptual experiments. Equation (2.2) is transformed into Equation (2.3) by considering that the phoneme for u_i is equivalent to the phoneme for t_i, and that the phoneme for u_{i-1} is equivalent to the phoneme for t_{i-1} (Toda, 2003). This sub-cost is calculated from the results of perceptual experiments. It does not always become zero even when there is no mismatch of phonetic environments, since it also reflects the difficulty of concatenation caused by the uncertainty of segmentation (Klabbers and Veldhuis, 2001). The sub-cost becomes null when the units u_{i-1} and u_i are adjacent in the corpus (Toda, 2003).
2.3.3.4 Sub-cost on spectral discontinuity: C_spec

Spectral discontinuity at a segment boundary causes a degradation of naturalness, which the C_spec sub-cost captures (Toda et al., 2006). This sub-cost is determined as the weighted sum of the mel-cepstral distortion between frames of a segment and those of the preceding segment around the boundary (Toda, 2003). The sub-cost (Toda, 2003) can be written as

C_spec(u_i, u_{i-1}) = c_s · Σ_{f=-w/2}^{w/2-1} h(f) · MCD(u_i, u_{i-1}, f)

where h is a triangular weighting function of length w, and MCD(u_i, u_{i-1}, f) is the mel-cepstral distortion between the f-th frame from the concatenation frame (f = 0) of the preceding segment u_{i-1} and the f-th frame from the concatenation frame (f = 0) of the succeeding segment u_i in the corpus. Concatenation is conducted between the (-1)-th frame of u_{i-1} and the 0-th frame of u_i. c_s is a coefficient that normalizes the dynamic range of the sub-cost. The mel-cepstral distortion (Toda, 2003) calculated for each frame pair is written as

MCD = (10 / ln 10) · sqrt( 2 · Σ_{d=1}^{40} ( mc_α^(d) - mc_β^(d) )^2 )

where mc_α^(d) and mc_β^(d) denote the d-th order mel-cepstral coefficients of frame α and frame β, respectively. This sub-cost becomes zero when the segments u_{i-1} and u_i are adjacent in the corpus (Toda, 2003).

2.3.3.5 Sub-cost on phonetic appropriateness: C_app

Phonetic inappropriateness causes a degradation of naturalness. The C_app sub-cost captures this degradation, caused by outlying segments (Toda et al., 2006). The sub-cost is computed as the mel-cepstral distortion between the mean vector of a candidate segment and that of the target (Toda et al., 2006). The sub-cost C_app (Toda, 2003) can be written as

C_app(u_i, t_i) = c_t · MCD( CEN(u_i), CEN(t_i) )

where CEN is a mean cepstrum calculated over the frames around the phoneme center.
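The mel-cepstral distortion formula and the triangular-weighted boundary sum above can be sketched as follows. This is an illustrative sketch: the function names are invented here, and the window weights are passed in explicitly rather than generated from a triangular function.

```python
import math

def mel_cepstral_distortion(mc_a, mc_b):
    """Mel-cepstral distortion between two frames' coefficient vectors,
    following the formula above: (10 / ln 10) * sqrt(2 * sum of squared
    coefficient differences)."""
    sq = sum((a - b) ** 2 for a, b in zip(mc_a, mc_b))
    return (10.0 / math.log(10)) * math.sqrt(2.0 * sq)

def spectral_subcost(frames_prev, frames_next, weights, c_s=1.0):
    """Sketch of C_spec: weighted sum of frame-pair MCDs around the
    concatenation boundary. `weights` plays the role of the triangular
    window h(f); c_s normalizes the dynamic range."""
    return c_s * sum(h * mel_cepstral_distortion(a, b)
                     for h, a, b in zip(weights, frames_prev, frames_next))
```

For frame pairs taken from adjacent corpus segments the coefficient vectors coincide, so the sub-cost is zero, matching the adjacency property stated above.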
Here MCD is the mel-cepstral distortion between the mean cepstrum of the segment u_i and that of the target t_i, and c_t denotes a coefficient that normalizes the dynamic range of the sub-cost (Toda, 2003). This sub-cost is set to zero for unvoiced phonemes (Toda, 2003).

2.3.3.6 Other sub-costs

Besides these, there exist many other sub-costs, such as the acoustic phonemic target cost, phoneme cost, phrase cost, etc. The acoustic phonemic target cost (Jan et al., 2005) measures the acoustic distance to the acoustic template of a phonetic class. The foot or phoneme cost is assigned to violations of the same-class constraint. The phrase or foot cost measures mismatches between the target and the phrase match sequence in terms of the number and lengths of the feet, and the sentence or phrase cost measures analogous mismatches at the sentence level. The cost definition must consider spectral compatibility, addressed by the sub-costs phonological identity difference, phoneme characteristic difference, signal f0, signal f0 derivative, signal energy difference, and signal energy derivative difference. Long-term compatibility is addressed by the sub-costs phonological identity difference, phoneme characteristic difference, syllabic position difference, word position difference, breath group position difference, system-predicted duration difference and signal duration difference. The sub-costs syllabic position difference, word position difference and breath group position difference are devised to favour the coherence of prosodic groups (Blouin et al., 2002).

2.3.3.7 Integrated cost

In unit selection, the optimum set of units for an utterance is selected from a speech corpus. Therefore, the local costs of individual units need to be integrated into a cost for the unit sequence. This cost is referred to as the integrated cost in this dissertation (Toda et al., 2006).
The average cost (AC), given by

AC = (1/N) · Σ_{i=1}^{N} LC(u_i, t_i),

is often used as the integrated cost (Hunt and Black, 1996; Campbell and Black, 1997), where N denotes the number of targets in the utterance. The silence before the utterance is represented by the target t_0 and the candidate u_0, whereas the silence after the utterance is denoted by t_N and u_N. Both sub-costs C_pro and C_app are fixed to zero for the pause. Since the average cost reflects the level of naturalness over the entire synthetic utterance, a unit with an expensive cost can still be selected in the output sequence of units, provided the sequence is optimal in terms of the average cost (Toda et al., 2006). It might be assumed that the largest cost in the sequence has a significant effect on the degradation of naturalness in synthetic speech. To verify this assumption, the maximum cost (MC) is used as the integrated cost:

MC = max_{1 ≤ i ≤ N} LC(u_i, t_i).

Besides these two integration methods, there is another cost, called the norm cost NC_p, which is given by

NC_p = [ (1/N) · Σ_{i=1}^{N} { LC(u_i, t_i) }^p ]^{1/p}.

The norm cost is equivalent to the average cost when p is set to 1, and equal to the maximum cost when p approaches infinity. Thus, both the mean value and the maximum value can be obtained from the norm cost by varying p.

2.4 Cost weighting

Many combinations of features exist for which not every feature can be simultaneously satisfied, so some compromises need to be made. Since each sub-cost has a relative importance in the whole cost function, tuning the weights to reflect this relative importance is an important stage in the design of the selection algorithm (Díaz and Banga, 2006). The highest weight is assigned to the most important sub-cost.
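The three integration methods of Section 2.3.3.7 (AC, MC and NC_p) can be sketched directly; the function names are illustrative. Note how the norm cost interpolates between the other two as p grows.

```python
def average_cost(local_costs):
    """AC: mean of the local costs over the utterance."""
    return sum(local_costs) / len(local_costs)

def maximum_cost(local_costs):
    """MC: the single largest local cost in the sequence."""
    return max(local_costs)

def norm_cost(local_costs, p):
    """NC_p: p-norm style integration. Reduces to the average cost at
    p = 1 and approaches the maximum cost as p grows large."""
    n = len(local_costs)
    return (sum(c ** p for c in local_costs) / n) ** (1.0 / p)
```

For example, on local costs [1, 2, 3], NC_1 gives the mean 2.0, while NC_100 is already very close to the maximum 3.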
Various approaches have been presented for automatically tuning the weights of the cost functions employed in the speech unit selection process. In one approach, various weight sets were integrated into the selection process, and the set that gave the smallest mean square error between the original and the synthetic pitch contours was selected as the candidate. However, results show that there is no set of weights with consistent performance across all (or almost all) of the sets (Díaz and Banga, 2006). Thus, adjusting the weights by manual tuning seems to be the only solution to this problem, and further research needs to be carried out on optimizing the weights. There is, however, a possible approach to weight optimization using multiple linear regression (MLR): a multiple linear regression, calculated on sub-cost values generated from the training corpus as a function of an acoustic measure of concatenation quality, is used to optimize the sub-cost weights (Blouin et al., 2002). Besides that, target cost weights can be adjusted automatically using linear regression in the synthesis system. Taking the context cost as a target cost element makes it very critical to use the trained weights, since some context mismatches exist when using such weights. To solve this problem, adjust-and-listen operations need to be performed, starting from the automatically trained weights, until satisfactory results are obtained. The need for such manual tuning may arise because the objective function used in weight training is not perceptually suitable (Hamza et al., 2001).

2.5 Target cost

The target cost is a measure of how far a candidate unit is from the desired position in the synthesized phrase. In the classical approach, it is measured against, for example, a desired pitch contour, where the distance is derived directly from the text. It is based on two factors: the syllable's position in a word, and the presence and type of a boundary tone.
The target cost equals zero if both factors match between the unit in the corpus and the target phrase. Otherwise, various cost values are assigned for different cases, depending on the syllable type (stressed, final, other) and the type of boundary tone (ending, low, high, none) (Janicki et al., 2008). The target cost function reflects how well a unit's phonetic context and prosodic characteristics match those required in the synthetic phone sequence (Vepa and King, 2004). In other words, the target cost is the mismatch between the desired and the candidate's acoustic characteristics (Cepko et al., 2008). These characteristics are represented by features that can be predicted from the input text, such as duration, intensity and the intonation curve. The target cost function captures the degradation of naturalness caused by the difference between a target and a selected unit as a mismatch in the phonetic environment, log F0 (fundamental frequency), phone duration, and MFCC (mel-frequency cepstral coefficients) (Nishizawa and Kawai, 2006). The target cost is usually calculated as the weighted sum of the differences between prosodic and phonetic parameters of the target and candidate units (Vepa and King, 2004). The target cost can be further divided into two types: phonetic target costs and prosodic target costs. The phonetic target cost (Zhao et al., 2006) contains sub-costs for the left phone context and the right phone context. The prosodic target costs (Zhao et al., 2006) contain sub-costs for position in phrase, position in word, position in syllable, accent level in word and emphasis level in phrase, etc.

2.6 Concatenation cost

Features included in the concatenation cost calculation may be spectral coefficients that parameterize the borders of the speech units in the corpus. The concatenation cost is the distortion between these parameters of two adjacent candidate units (Zhao et al., 2006).
The concatenation cost is used to measure the acoustic smoothness between the concatenated units (Dong and Li, 2008). According to Nishizawa and Kawai (2006), the concatenation cost function captures the degradation of naturalness caused by discontinuity at the unit boundary in F0 and MFCC. According to Janicki et al. (2008), the concatenation cost is a measure of how much we lose by concatenating acoustic units. It is also based on linguistic information, because its main component is a cost related to the change of a phoneme's context. This information can be obtained directly from the analysis of the input and corpus texts and is computed as follows (Janicki et al., 2008):
• a null cost is assigned if the left neighbor in the corpus is exactly the same unit as the left neighbor in the synthesized phrase;
• a certain cost value is assigned if the left neighbor in the corpus is the same phoneme as the left neighbor in the synthesized phrase;
• a higher cost is applied if the left corpus neighbor belongs only to the same phonetic category as the left neighbor in the phrase;
• an even higher cost is applied if only the voicing information agrees;
• the highest cost is set in all other cases.
A further component of the concatenation cost is a measure of the F0 difference, which is calculated only for units neighboring voiced parts (Janicki et al., 2008). It is believed that unit selection estimating the concatenation cost from the phoneme's context change, together with the F0 difference, should bring similar results to methods using spectral distance measures (Wouters and Macon, 1998), but at a much lower computational cost. The ideal concatenation cost should correlate highly with human listeners' perceptions of discontinuity at concatenation points.
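The five tiered rules of Janicki et al. (2008) listed above can be sketched as a simple cascade. The numeric tiers below are purely illustrative, not the values used in the cited work, and the boolean parameters are an assumed interface.

```python
def context_concat_cost(left_corpus, left_phrase,
                        same_phoneme, same_category, same_voicing):
    """Tiered left-context concatenation cost, following the five rules
    above. left_corpus / left_phrase identify the left-neighbor unit in
    the corpus and in the synthesized phrase; the booleans report the
    progressively weaker kinds of agreement."""
    if left_corpus == left_phrase:   # exactly the same unit: null cost
        return 0.0
    if same_phoneme:                 # same phoneme label
        return 1.0
    if same_category:                # only the same phonetic category
        return 2.0
    if same_voicing:                 # only voicing agrees
        return 3.0
    return 4.0                       # all other cases: highest cost
```

The cascade checks the strongest condition first, so each candidate pays only for the best level of context agreement it achieves.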
That is, the concatenation cost should predict the degree of perceived discontinuity, even though its computation is based only on measurable properties of the candidate units such as spectral parameters, amplitude, and F0 (Vepa and King, 2004). The concatenation cost function contains a component that measures differences in the spectral properties of the speech on either side of a proposed join between two candidate units (Vepa and King, 2004). Blouin et al. (2002) constructed a concatenation cost function based on phonetic and prosodic features. The function is defined as a weighted sum of sub-costs, each of which is a function of various symbolic and prosodic parameters. Multiple linear regression was used to optimize the weights as a function of an acoustic measure of concatenation quality. Perceptual evaluation results showed that the concatenation sub-cost weights determined automatically were better than hand-tuned weights, with or without applying smoothing after unit concatenation (Vepa and King, 2004). Acoustic measures and phonetic features were compared by Kawai and Tsuzaki (2002) in terms of their ability to predict audible discontinuities. MFCCs were employed to derive the acoustic measures, and the distance between certain frames was measured using Euclidean distances. Phonetic features were found to be more efficient than acoustic measures at predicting audible discontinuities (Vepa and King, 2004). Figure 2.5 shows unit selection based on the minimization of target cost and concatenation cost. Figure 2.6 shows the search graphs for unit selection for the global optimum and a local minimum.

Figure 2.5 Example of unit search algorithm. The shortest path is marked in blue (Janicki et al., 2008).

(a) The search graph for the global optimum. (b) The search graph for the local minimum where d3 is fixed.

Figure 2.6 Example of unit search algorithm.
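The shortest-path search of Figure 2.5, which minimizes the summed target and concatenation costs over the candidate lattice, can be sketched as a dynamic-programming (Viterbi-style) pass. This is a minimal sketch under assumed interfaces: the two cost callables stand in for the cost functions of Sections 2.5 and 2.6, and none of the names come from the cited systems.

```python
def viterbi_select(candidates, target_cost, concat_cost):
    """Find the minimum-cost unit path through the candidate lattice.

    candidates  -- candidates[i] is the list of candidate units for target i
    target_cost -- callable(i, unit) -> cost of unit against target i
    concat_cost -- callable(prev_unit, unit) -> join cost
    Returns (total_cost, best_path)."""
    # best[u] = (accumulated cost, path) for each candidate of target 0
    best = {u: (target_cost(0, u), [u]) for u in candidates[0]}
    for i in range(1, len(candidates)):
        new_best = {}
        for u in candidates[i]:
            # cheapest predecessor for this candidate
            prev = min(candidates[i - 1],
                       key=lambda v: best[v][0] + concat_cost(v, u))
            cost = best[prev][0] + concat_cost(prev, u) + target_cost(i, u)
            new_best[u] = (cost, best[prev][1] + [u])
        best = new_best
    return min(best.values(), key=lambda t: t[0])
```

Keeping only the best path into each candidate makes the search linear in the number of targets, rather than exponential in the number of unit combinations.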
The difference in cost between the optimal sequences of the two graphs is evaluated for d3 in pre-selection (Nishizawa and Kawai, 2008).

2.7 Spectral Distances

The concatenation cost function consists of a distance measure that operates on some parameterization of the final and initial frames of the two units to be concatenated (Vepa and King, 2004), as shown in Figure 2.7. Thus, a wide variety of distance measures and parameterizations are possible.

Figure 2.7 Objective spectral distances (Vepa and King, 2004).

2.8 Feature Extraction

2.8.1 MFCC

One of the most important issues in the field of speech synthesis is feature extraction (Khor, 2007). MFCCs are a representation defined as the real cepstrum of a windowed short-time signal derived from the fast Fourier transform (FFT) of the speech signal (Vepa and King, 2004). MFCCs differ from the real cepstrum in that a nonlinear, perceptually motivated frequency scale is used, which approximates the behavior of the human auditory system. Examples of parameterizations include linear prediction (LP) coefficients; the LP spectrum; mel-frequency cepstral coefficients (MFCC); line spectral frequencies (LSF); perceptual linear prediction (PLP) coefficients; the PLP spectrum; and multiple centroid analysis (MCA) coefficients. MFCCs have good performance in speech recognition systems. Good performance in speech synthesis systems, meanwhile, is likely to be determined more by the relationship between parameter values and human perception of speech sounds (Donovan, 2003). A study was conducted by Wouters and Macon (1998) to measure the correlation between a number of likely spectral distance measures, based on different speech parameterizations, and human perception of the differences between speech sounds.
Their results indicated that MFCCs worked as well as any other parameterization they tested, when used either in a Euclidean distance measure or in a Mahalanobis distance measure computed using parameter variances estimated from their whole database.

Figure 2.8 Block diagram of the conventional MFCC extraction algorithm (Khor, 2007): preemphasis, windowing and overlapping, FFT, mel-frequency filter bank, cepstrum and logged energy, followed by the MFCC and delta coefficients.

MFCC is capable of capturing the phonetically important characteristics of speech (Wong and Sridharan, 2001). MFCCs are coefficients that represent audio based on perception. The mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another. A cepstrum is the result of taking the Fourier transform (FT) of the decibel spectrum as if it were a signal; the cepstrum serves as an excellent feature vector for representing the human voice and musical signals. The spectrum is usually first transformed using the mel frequency bands, and the result is called the MFCCs. The mel scale is given by

m = 2595 · log10( 1 + f / 700 ).

Equivalently, the transformation from linear frequency to mel frequency can be written as

Mel(f) = 1127 · ln( 1 + f / 700 ).

The perception of a particular frequency by the auditory system is influenced by the energy in a critical band around it; taking the final inverse Fourier transform of the mel-warped spectrum in the calculation of the cepstral coefficients produces the MFCCs (Figure 2.8).

2.9 Distance Measures

Two stages are involved in the computation of a distance measure: extracting the feature and quantifying it. The feature is extracted from each of a pair of candidate units, and then the distance between the feature vectors representing the units is computed (Kirkpatrick and O'Brien, 2006). A distance measure is needed to measure the difference between two vectors of such parameters.
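The two equivalent mel-scale formulas of Section 2.8.1 can be sketched as below; the function names are illustrative. The two forms agree because 2595 / ln 10 ≈ 1127.

```python
import math

def hz_to_mel_log10(f):
    """Mel scale via the base-10 form: m = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def hz_to_mel_ln(f):
    """Equivalent natural-log form: Mel(f) = 1127 * ln(1 + f/700)."""
    return 1127.0 * math.log(1.0 + f / 700.0)
```

By construction, 1000 Hz maps to roughly 1000 mel, and the scale grows more slowly than linear frequency above that point, mirroring the compressed high-frequency resolution of the auditory system.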
Some examples of distance measures are the absolute magnitude distance, the Euclidean distance, the Mahalanobis distance and the Kullback-Leibler (KL) divergence. All the distance measures listed above are metrics except the KL divergence. The symmetrical version of the KL divergence can be used to compute the distance between two speech parameterizations (Vepa and King, 2004). A psychoacoustic experiment on listeners' detectability of signal discontinuities in concatenative speech synthesis was performed by Stylianou and Syrdal (2001). The results indicated that a symmetrical Kullback-Leibler (KL) distance between FFT-based power spectra and the Euclidean distance between MFCCs have the highest prediction rates (Stylianou and Syrdal, 2001). According to Klabbers et al. (2000), the Kullback-Leibler (KL) distance was found to be the best predictor of audible discontinuities.

Distance measures have many applications in speech technologies. For speech coding, they are applied as objective measures of speech quality and also in the design of vector quantization algorithms. Therefore, in unit selection synthesis, an objective distance measure which is able to predict audible discontinuities is very important (Stylianou and Syrdal, 2001); concatenation cost distance measures are precisely those best able to predict audible discontinuities. Higher concatenation costs will be assigned to the units that are predicted to produce audible discontinuities in concatenation, and thus they will be less likely to be selected. The most widely used distance in speech recognition is currently the Euclidean distance between MFCCs. Inspired by speech recognition methods, some speech synthesis unit selection algorithms use the Euclidean distance between MFCCs (Stylianou and Syrdal, 2001).
2.9.1 Simple Distance Measures

2.9.1.1 Absolute Distance

The simple absolute distance is calculated as the sum of the absolute magnitude differences between the individual features of the two feature vectors, as shown in Equation (2.4):

D_abs(X, Y) = Σ_{i=1}^{N} |X_i − Y_i|   (2.4)

2.9.1.2 Euclidean Distance

The Euclidean distance between two feature vectors, X and Y, is calculated as shown in Equation (2.5). However, the Euclidean distance does not take into account the variances or covariances of the distribution of the feature vectors (Vepa and King, 2004).

D_Eu(X, Y) = sqrt( Σ_{i=1}^{n} (X_i − Y_i)^2 )   (2.5)

2.9.2 Statistically Motivated Distance Measures

Well-known examples of distance measures from statistics are the Mahalanobis distance and the Kullback-Leibler divergence (Vepa and King, 2004).

2.9.2.1 Mahalanobis Distance

The Mahalanobis distance is a generalization of the standardized (Euclidean) distance in that it takes account of the variance or covariance of individual features (Vepa and King, 2004). Donovan (2001) has used this distance measure in a concatenation cost function. The Mahalanobis distance needs the estimation of covariance matrices. The off-diagonal elements of the covariance matrix are assumed to be zero, which saves computational cost and storage. Results showed that making this diagonal covariance assumption was reasonable, since using full covariance matrices did not improve performance over using diagonal matrices. Equation (2.6) shows the Mahalanobis distance for two feature vectors, X and Y, with a diagonal covariance matrix:

D_Ma(X, Y)^2 = Σ_{i=1}^{n} [ (X_i − Y_i) / σ_i ]^2   (2.6)

where σ_i is the standard deviation of the i-th feature of the feature vectors (i.e., the diagonal entries of the covariance matrix).

2.9.2.2 Kullback-Leibler (KL) Distance

The KL distance originates from statistics and is asymmetrical. It is also known as the divergence, discrimination or relative entropy in information theory.
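The three metric distances of Equations (2.4)-(2.6) can be sketched directly. A minimal sketch in Python (function names are illustrative; the Mahalanobis function returns the distance itself, i.e. the square root of the quantity in Equation (2.6)):

```python
import math

def absolute_distance(x, y):
    """Equation (2.4): sum of absolute magnitude differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def euclidean_distance(x, y):
    """Equation (2.5): square root of the summed squared differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def mahalanobis_distance(x, y, sigma):
    """Equation (2.6), diagonal covariance: sigma[i] is the standard
    deviation of the i-th feature; returns the square root of D^2."""
    return math.sqrt(sum(((xi - yi) / si) ** 2
                         for xi, yi, si in zip(x, y, sigma)))
```

Note that with all standard deviations equal to one, the Mahalanobis distance reduces to the Euclidean distance, which illustrates the "generalization" relationship stated above.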
The KL distance is used to quantify differences between two probability distributions or densities. It can also be employed to quantify differences in the shape of strictly positive sequences (or functions) whose sum (or integral) equals one. In concatenative speech synthesis, it has been used to quantify the differences between spectral envelopes at concatenation points (Veldhuis, 2002). The spectral envelopes at the boundary are viewed as probability distributions, normalized as in Equation (2.7) (Zhao et al., 2006). In other words, the distance between the vectors is computed to quantify the degree of similarity between two feature vectors, P and Q (Kirkpatrick and O'Brien, 2006). The original asymmetrical definition of the KL distance is changed into a symmetrical version, as shown in Equation (2.8), with the property that d_SKL(P, Q) = d_SKL(Q, P). The symmetric Kullback-Leibler distance measures the dissimilarity between two probability distributions, P(θ) and Q(θ) (Kirkpatrick and O'Brien, 2006). It has the important property that it emphasizes the differences in spectral regions with high energy rather than differences in spectral regions with low energy. Thus, more emphasis falls on spectral peaks than on the valleys between the peaks, and low frequencies are emphasized more than high frequencies (Klabbers et al., 2000). With the normalization

(1/2π) ∫_0^{2π} P(θ) dθ = (1/2π) ∫_0^{2π} Q(θ) dθ = 1,

the asymmetric KL distance is

d_KL(P, Q) = (1/2π) ∫_0^{2π} P(θ) log[ P(θ) / Q(θ) ] dθ   (2.7)

and the symmetrized version is

d_SKL(P, Q) = (1/2) [ d_KL(P, Q) + d_KL(Q, P) ]
            = (1/4π) ∫_0^{2π} [ P(θ) − Q(θ) ] log[ P(θ) / Q(θ) ] dθ   (2.8)

To evaluate Equation (2.8), the standard procedure is to perform the integral as a summation over discrete frequencies.
The discrete summation approximation can be written as

D_SKL(P, Q) = Σ_{i=1}^{N} (P_i − Q_i) log(P_i / Q_i)   (2.9)

Equation (2.9) is valid only if the subscript i is a frequency index (Vepa and King, 2004). Thus, this distance measure has not been used for MFCCs.

2.10 Heuristic Method

There are two different types of heuristics: constructive methods and destructive methods (Manuel, 1997). Constructive methods can be categorized as iterative improvement algorithms: starting from an initial configuration, they generate a sequence of configurations until a satisfactory one is found. A destructive method is a hybrid "strategic oscillation" approach that complements a constructive approach by employing alternating constructive and destructive phases of varying amplitudes (Manuel, 1997).

Metaheuristics have experienced remarkable growth over the past decade. The success of metaheuristics lies in their flexibility in applications to complex optimization problems (Vasan and Komaragiri, 2009). Among the most popular metaheuristics are Simulated Annealing, tabu search and Genetic Algorithms. These heuristics have a unique search mechanism that allows them to escape local optima (Vasan and Komaragiri, 2009).

Heuristic algorithms do not guarantee that an optimal solution will be found. However, solutions that are close to optimal, within a few percent of the optimum, can be found quickly. An ideal global optimization method should simultaneously meet two requirements: it should find the global minimum of a multimodal objective function and it should have a high convergence speed (Wang et al., 1996). Simple heuristics have been used to select suitable candidate units in some speech synthesis systems reported in the literature (Qing et al., 2008). The selection criterion is based on cost degradation from the optimal sequence.
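The discrete symmetric KL distance of Equation (2.9) is a one-line computation once both spectra are normalized and strictly positive. A minimal sketch in Python (the function name is illustrative; the inputs are assumed to be normalized positive sequences, as required by the derivation above):

```python
import math

def symmetric_kl(p, q):
    """Equation (2.9): discrete symmetric Kullback-Leibler distance between
    two normalized, strictly positive spectra p and q. Each term is
    non-negative because (p_i - q_i) and log(p_i / q_i) share the same sign."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

The symmetry d_SKL(P, Q) = d_SKL(Q, P) follows directly, since swapping p and q flips the sign of both factors in every term.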
A lattice generated by splitting all candidate units into instances is searched with a Viterbi shortest-path algorithm (Blouin et al., 2002). In concatenative speech synthesis systems, each speech unit in the system's speech database is evaluated in order to find the most appropriate unit, i.e. the one with the lowest cost (Hirai et al., 2002).

2.10.1 Simulated Annealing (SA)

The SA algorithm is based on Monte-Carlo methods and may be considered a special form of iterative improvement (Manuel, 1997). Kirkpatrick et al. (1983) first proposed SA as a method for solving combinatorial optimization problems. Simulated Annealing was given this name in analogy to the annealing process in thermodynamics (Gao and Tian, 2007), specifically the way metal is heated and then gradually cooled so that its particles attain the minimum energy state; the optimization process is carried out by applying the Metropolis criterion (Triki et al., 2005). The algorithm is controlled by the parameters of the cooling schedule. Simulated annealing is a random search method that can be employed to solve mixed discretization problems and complicated non-linear problems (Turgut et al., 2003). It is generally applicable for solving combinatorial optimization problems by generating a sequence of moves at descending values of a control parameter (Jeong and Kim, 1990). The aim of simulated annealing is to choose a good solution to an optimization problem according to some cost function on the state space of possible solutions (Rose et al., 1990).

Simulated annealing is a generalization of the local search algorithm (Turgut et al., 2003). In its iterative process, the algorithm is allowed to accept non-improving neighbouring solutions with a certain probability (Turgut et al., 2003), in order to avoid being trapped at a poor local optimum.
An iterative improvement algorithm, by contrast, allows only cost-decreasing moves to be accepted (Turgut et al., 2003). The probability of the system being at a state with energy E in thermal equilibrium, where k_b denotes the Boltzmann constant, is given by the Boltzmann distribution (Vasan and Komaragiri, 2009):

P(E) = exp(−E / (k_b T))

This equation indicates that a system at a high temperature has an almost uniform probability of being at any energy state, but at a low temperature it has only a small probability of being at a high energy state (Vasan and Komaragiri, 2009). It has been shown that simulated annealing eventually produces better solutions than the iterative improvement algorithm (Turgut et al., 2003). The generation scheme used to obtain a new configuration is called a move, and it is crucial to both the quality of the solution and the speed of the algorithm (Jeong and Kim, 1990). Since the search strategy of SA can avoid the search process being trapped in a local optimum, SA is an effective global optimization algorithm (Gao and Tian, 2007). It has been shown theoretically that the SA algorithm reaches a global optimum of the optimization problem with probability one, provided a set of conditions regarding the acceptance and generation mechanisms is satisfied (Manuel, 1997). Besides that, simulated annealing scales to large optimization problems and is robust against premature convergence to local optima (Turgut et al., 2003). Due to these advantages, SA-based methods have great potential for obtaining high-quality solutions when applied to various discrete optimization problems. However, the simulated annealing algorithm needs a large amount of iterative computation to obtain a global optimum solution (Manuel, 1997). Thus, SA has a slow convergence rate.
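The Boltzmann distribution above translates directly into the acceptance rule used by SA: an improving move is always accepted, while a worsening move with cost increase ΔE is accepted with probability exp(−ΔE/T). A minimal sketch in Python (the function name is illustrative; the Boltzmann constant is absorbed into the temperature T, as is usual in SA implementations):

```python
import math

def acceptance_probability(delta_e: float, temperature: float) -> float:
    """Metropolis acceptance rule: cost-decreasing moves (delta_e <= 0)
    are always accepted; cost-increasing moves are accepted with the
    Boltzmann probability exp(-delta_e / T)."""
    if delta_e <= 0:
        return 1.0
    return math.exp(-delta_e / temperature)
```

As the text states, the same cost increase is far more likely to be accepted at a high temperature than at a low one, which is exactly what lets SA escape local optima early and settle down later.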
2.10.2 Approaches to improve the SA algorithm

There are two approaches to speeding up simulated annealing: cooling schedule improvements and generation mechanism design (Manuel, 1997). Cooling schedule improvements deal with careful control of the annealing process. A number of problem-independent general annealing processes have been reported in the literature (Manuel, 1997). Adaptive cooling schedules, whose efficiency stems from the fact that the annealing parameters are determined automatically from measures of statistical quantities related to the particular problem at hand, have also been used in the literature (Manuel, 1997).

In generation mechanism design, smart generation mechanisms based on the idea of range limiting or of changes in the cost function are employed to reduce the chance of generating a next state that is going to be rejected (Manuel, 1997). Such moves lead to a significant reduction in computational time at low temperatures, where the probability of acceptance is very low. The enlargement of the neighbourhood structure, obtained by combining several simple moves into a complex one, is another method of generation mechanism design (Manuel, 1997). A larger neighbourhood structure allows a faster exploration of the configuration set and a higher probability of escaping from local minima. However, these approaches are usually problem-dependent and therefore have not been widely used (Manuel, 1997).

Although the theoretical basis of the algorithm has been known for almost two decades, there is still a lack of practical information about it. Thus, it is not an easy task for a user to design his own algorithm. More importantly, no theoretical results give a clear statement about which temperature decrement rule should be used or what kind of neighbourhood should be chosen (Triki et al., 2005).
2.10.3 Polynomial approximation

As asymptotic convergence requires infinite computing time, and as a simple enumeration of the configuration set has exponential time complexity, polynomial-time approximations are used in practice, while preserving as much as possible the flavour of the convergence theory (Manuel, 1997). The proper procedure (Manuel, 1997) is to choose a suitable cooling schedule, that is, to decide on

i) the initial condition, i.e. the initial temperature t0;
ii) the decrement rule (annealing schedule) for the temperature;
iii) the equilibrium condition, i.e. the length of the Markov chains; and
iv) the stop condition, i.e. the final temperature.

In doing so, most cooling schedules try to establish and maintain equilibrium at each temperature level by appropriately adjusting the length of the Markov chains and the cooling rate. A polynomial-time approximation can at best reach pseudo-equilibrium (Manuel, 1997).

To solve a discrete optimization problem using simulated annealing, the following steps need to be performed. First, express the problem as a cost function optimization problem by defining the configuration set S, the cost function C, and the neighbourhood structure N. Next, choose an annealing schedule. Finally, conduct the annealing process. A general simulated annealing procedure (Algorithm 2.2) is described by the following pseudo-code (Manuel, 1997):

Algorithm 2.2:
1. Initialize (s0, t0)
2. k := 0
3. s := s0
4. Until equilibrium is reached do:
   4.1 Generate s' from N(s)
   4.2 Metropolis test:
       4.2.1 If min{ 1, exp( −(C(s') − C(s)) / t_k ) } > random[0, 1), then
             continue with s := s'.
       4.2.2 Else continue with the old s.
5. If the stopping criterion is valid, stop.
6. k := k + 1
7. Calculate t_k
8. Go to 4.

Initially, an initial solution s0 is randomly generated.
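Algorithm 2.2 can be sketched in runnable form. A minimal sketch in Python, assuming a geometric cooling rule for step 7 and a fixed Markov-chain length for the inner loop (the function names, default parameter values and best-so-far tracking are illustrative choices, not part of the pseudo-code above):

```python
import math
import random

def simulated_annealing(cost, neighbour, s0, t0=10.0, alpha=0.9,
                        chain_length=100, t_final=1e-3, seed=0):
    """Sketch of Algorithm 2.2. `cost` maps a configuration to a real number;
    `neighbour(s, rng)` draws a random configuration from N(s). Cooling uses
    the geometric rule T_{k+1} = alpha * T_k; the stop condition is a final
    temperature t_final."""
    rng = random.Random(seed)
    s, best = s0, s0
    t = t0
    while t > t_final:
        for _ in range(chain_length):          # inner loop: one Markov chain
            s_new = neighbour(s, rng)
            delta = cost(s_new) - cost(s)
            # Metropolis test: accept improving moves outright, and
            # worsening moves with probability exp(-delta / t).
            if delta <= 0 or math.exp(-delta / t) > rng.random():
                s = s_new
                if cost(s) < cost(best):
                    best = s                    # keep the best solution seen
        t *= alpha                              # outer loop: cool down
    return best
```

For example, minimizing cost(x) = (x − 3)^2 with the neighbour move x + uniform(−1, 1) settles near x = 3, as expected for a single-minimum cost function.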
To allow most of the proposed transitions to pass the Metropolis criterion, the corresponding initial temperature t0 has to be set high enough. A free search of the configuration set is intended at the beginning of the algorithm. The search becomes stricter as t decreases, so fewer proposed transitions are accepted. Finally, a local minimum is reached at very small values of t. At this stage, the annealing process evolves to a final configuration where no proposed transition is accepted at all. The last optimum solution found can be interpreted as a solution of the discrete optimization problem.

2.10.4 Annealing Schedule

There are several theoretical and empirical cooling schedules, which can be categorized into classes such as monotonic schedules, adaptive schedules, geometric schedules and quadratic cooling schedules (Nader and Saeed, 2004). The SA algorithm usually contains two nested loops: an outer loop, which controls the decrease of the temperature, and an inner loop, in which the temperature remains constant and which consists of a Metropolis algorithm (Triki et al., 2005).

2.10.4.1 Theoretically optimum cooling schedule

The annealing schedule given by Formula (2.10) ensures convergence of SA to the optimum solution (Hajek, 1988):

T_k = C / ln(1 + k)   (2.10)

where k = 1, 2, ... is the index of the outer loop and C is the depth of the deepest local minimum. However, this optimum cooling schedule is only of theoretical interest, since the decrease of the temperature is too slow.

2.10.4.2 Geometric cooling schedule

The most frequently used annealing schedule is given by

T_{k+1} = α · T_k,   (0 < α < 1)

where α denotes the cooling factor (Triki et al., 2005). Usually the value of α is chosen in the range between 0.5 and 0.99. Since it is very simple, this cooling schedule provides a baseline for comparison with more sophisticated schedules.

2.10.4.3 Cooling schedule of Van Laarhoven et al.
The annealing schedule presented in Equation (2.11), which is shown to terminate in polynomial time, was proposed by Van Laarhoven and Aarts (1987):

T_{k+1} = T_k · 1 / ( 1 + T_k ln(1 + δ) / (3σ(T_k)) )   (2.11)

where δ is a "small" real number. By keeping consecutive homogeneous Markov chains close to each other, it is hoped that a small number of transitions will be sufficient to reach thermal equilibrium after each temperature decrement. Lundy and Mees (1986) described a similar annealing schedule, given in Equation (2.12):

T_{k+1} = T_k · 1 / ( 1 + (γ/U) T_k )   (2.12)

where U is some upper bound on ( f(x) − f_opt ) and γ is a "small" real number.

2.10.4.4 Cooling schedule of Otten et al.

The annealing schedule proposed by Otten and van Ginneken (1984) can be written as

T_{k+1} = T_k − T_k^3 / ( M_k σ^2(T_k) )   (2.13)

where M_k is given by

M_k = [ C_max + T_k ln(1 + δ) ] T_k / [ σ^2(T_k) ln(1 + δ) ]   (2.14)

and C_max is an estimate of the maximum value of the cost function. After substituting Equation (2.14) into (2.13), the annealing schedule can be simplified to

T_{k+1} = T_k · 1 / ( 1 + T_k ln(1 + δ) / C_max )

which is similar to the annealing schedule (2.11).

2.10.4.5 Cooling schedule of Huang et al.

The annealing schedule proposed by Huang et al. (1986) is based on the average cost values of consecutive Markov chains. From Equation (2.15),

d⟨f(T)⟩ / dT = σ^2(T) / T^2   (2.15)

where ⟨f(T)⟩ is the expected cost in equilibrium and σ^2(T) is the variance of the cost at equilibrium, the authors obtained

d⟨f(T)⟩ / d ln(T) = σ^2(T) / T

Thus

ln(T) − ln(T − δT) ≈ T Δ(T) / σ^2(T)

where

Δ(T) = ⟨f(T)⟩ − ⟨f(T − δT)⟩

Finally,

T − δT = T exp( −T Δ(T) / σ^2(T) )   (2.16)

The Δ(T) in Equation (2.16) is replaced by Huang et al. (1986) with λσ(T), where λ (0 < λ ≤ 1) is a constant parameter that has to be determined by the user.
It is expected that quasi-equilibrium will be maintained by keeping the difference Δ(T) smaller than the standard deviation of the cost. Finally, the annealing schedule in (2.17), with a typical value of λ equal to 0.7, is obtained:

T_{k+1} = T_k exp( −λ T_k / σ(T_k) )   (2.17)

This annealing schedule has been widely used and is known to be an efficient general cooling schedule (Triki et al., 2005).

2.10.4.6 Adaptive cooling schedules

In adaptive cooling schedules, the computation of the next temperature value is based on the past history of the run. The aims of adaptive cooling schedules are to keep the annealing as close to equilibrium as possible and to keep the annealing process as short as possible (Triki et al., 2005). These two goals are contradictory.

2.10.4.7 A new adaptive cooling schedule

Triki et al. (2005) looked for a cooling schedule that would allow them to control the difference in average cost between two sequences of trials. This cooling schedule is based on the observation that the average cost does not vary proportionally to the temperature. The annealing schedule given by Triki et al. (2005) is shown in Equation (2.18):

T_{k+1} = T_k · ( 1 − T_k Δ(T_k) / σ^2(T_k) )   (2.18)

Δ(T) can be controlled by the user, where Δ(T) is defined by Formula (2.19):

Δ(T) = ⟨f(T)⟩ − ⟨f(T − δT)⟩   (2.19)

The annealing schedule in Equation (2.18) has several good properties. Firstly, this schedule relies only on the parameter Δ(T). Information on the problem difficulty can be obtained during the execution of the SA algorithm. The theoretical evolution of the expected cost is fixed once and for all by the choice of Δ(T). To determine whether the current temperature fits the problem well, the practical average cost at temperature T is compared with the theoretical expected cost. If these two costs fit well, the problem is said to be easy at this temperature.
Otherwise, the problem is said to be difficult at this temperature and a new tuning of the SA parameters is necessary (Triki et al., 2005). The practical expected cost and the theoretical expected cost can serve as a guideline for thermal equilibrium: when the difference between them becomes significant, thermal equilibrium has not been reached (Triki et al., 2005). The possible actions then include increasing the number of trials at the current temperature, choosing a new, smaller Δ(T), or stopping SA and starting a greedy algorithm (Triki et al., 2005).

2.11 Parallel SA

Figure 2.9 Parallel simulated annealing taxonomy: synchronous algorithms (serial-like and altered generation, with subclasses such as functional decomposition, simple serializable set, decision tree, spatial decomposition and shared state-space) and asynchronous algorithms (spatial decomposition, shared state-space and systolic) (Daniel, 1995).

Parallel simulated annealing techniques fall into a taxonomy of three major classes, as shown in Figure 2.9: serial-like, altered generation and asynchronous (Daniel, 1995). An algorithm is called synchronous if adequate synchronization ensures that cost function calculations are the same as those in a similar sequential algorithm (Daniel, 1995). Serial-like and altered generation are the two major categories of synchronous algorithms (Daniel, 1995). Serial-like algorithms maintain the convergence properties of sequential annealing (Daniel, 1995). Altered generation algorithms modify state generation but calculate the same cost function (Daniel, 1995). Asynchronous algorithms eliminate some synchronization and tolerate errors to obtain a better speedup, but at the possible cost of reduced outcome quality (Daniel, 1995).
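Looking back at Section 2.10.4, the temperature decrement rules compared there differ only in how the next temperature is computed from the current one, so they can be sketched side by side. A minimal sketch in Python (function names are illustrative; sigma is the standard deviation of the cost at the current temperature, which in practice would be estimated from the trials of the last Markov chain but is passed in here as a plain number):

```python
import math

def geometric_step(t, alpha=0.9):
    """Section 2.10.4.2: T_{k+1} = alpha * T_k, with 0 < alpha < 1."""
    return alpha * t

def huang_step(t, sigma, lam=0.7):
    """Equation (2.17), Huang et al.: T_{k+1} = T_k * exp(-lambda*T_k/sigma).
    lam is the user-chosen constant lambda, typically 0.7."""
    return t * math.exp(-lam * t / sigma)

def triki_step(t, delta, sigma):
    """Equation (2.18), Triki et al.: T_{k+1} = T_k * (1 - T_k*delta/sigma^2),
    where delta is the user-controlled cost decrease Delta(T)."""
    return t * (1.0 - t * delta / sigma ** 2)
```

All three rules are monotonically decreasing for sensible parameters; the geometric rule ignores the cost statistics, while the Huang and Triki rules slow the cooling down when the cost variance is large, i.e. when the problem is "difficult" at the current temperature.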
Furthermore, simulated annealing can be parallelized by generating several perturbations to the current solution simultaneously, which requires synchronization to guarantee correct evaluation of the cost function (Durand and White, 2000). Many parallel versions of SA have been developed in order to improve SA performance. One approach to parallel SA is to generate and evaluate several moves simultaneously. This approach is application-independent and allows the exploitation of a reasonable amount of parallelism (Durand and White, 2000).

2.12 Segmented Simulated Annealing

Figure 2.10 shows the procedure for segmented simulated annealing. One of the weaknesses of SA is its limited coverage of the search space. In certain cases, this weakness prevents the search from converging to an acceptable solution if the initial parameters are not near the optimum region (McGookin and Murray-Smith, 2006). This problem can be eliminated by the segmented simulated annealing (SSA) algorithm (McGookin et al., 1996; Atkinson, 1992). The idea behind SSA is to consecutively execute a number of single SA processes. SSA covers more of the search space than conventional SA because each single SA process starts at a different point in the search space, chosen from a wide range of possible initial values, which segments the search space into smaller regions (McGookin and Murray-Smith, 2006). The number of final costs available is equal to the number of consecutively executed single SA processes. These final cost values are sorted into ascending order, and the best (smallest) cost is taken to be the optimum, with its corresponding parameters providing the optimal result. Because SSA explores the search space much more widely than conventional SA, particularly in the initial stages, SSA is a better search method.
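The SSA procedure just described — run several independent SA processes from different starting points, sort the final costs into ascending order, and keep the minimum-cost run — can be sketched in a few lines. A minimal sketch in Python (the function name is illustrative; `run_sa` stands for one complete single-SA process that maps an initial configuration to a (final cost, final solution) pair):

```python
def segmented_simulated_annealing(run_sa, n_runs, initial_values):
    """SSA sketch: execute n_runs single SA processes, each from a different
    initial value, then take the minimum-cost run as the optimum."""
    results = [run_sa(s0) for s0 in initial_values[:n_runs]]
    results.sort(key=lambda r: r[0])   # sort final costs into ascending order
    return results[0]                  # best (smallest) cost and its solution
```

For illustration, if `run_sa` were simply the cost function (x − 3)^2 paired with its input, the run started nearest x = 3 would be returned as the optimum.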
Figure 2.10 Segmented simulated annealing (McGookin and Murray-Smith, 2006): randomly generate initial values; generate the desired number of scaling factors (one for each run); apply simulated annealing for one scaling factor and store the results; repeat until all desired runs are completed; sort the runs into ascending cost order; take the minimum-cost run as the optimum.

CHAPTER 3

PROPOSED SYSTEM AND IMPLEMENTATION

3.0 Introduction

The main function of this system is to select the best candidate units to form the speech utterance based on cost functions. This is done by implementing unit selection using speech features (MFCCs) and applying a distance measure (the Euclidean distance) between pairs of speech feature vectors. A heuristic method, namely Simulated Annealing, is employed during the search process. The shortest-distance units are selected and concatenated. A listening test on the resulting speech utterance (waveform) is conducted, and conclusions are drawn from the result of the listening test.

3.1 System Design Flow

Figure 3.1 Block diagram of the system design flow: from the input text, search for and select the candidate units from the database based on phonetic context (target cost); take all the selected units as the input for unit selection; implement SA in unit selection, where units are selected by cost minimization (concatenation cost); concatenate the selected units to produce the output sound; design words and perform a listening test; draw conclusions from the listening test and evaluate the performance of SA.

3.2 Malay Phonetics and Phone Sets

The smallest units of a language are its phones (Farid, 1980). There are a total of 32 phones in the Malay language, comprising 6 vowels and 26 consonants (Nik et al., 1989).
The Malay language has fewer vowels than US English because the pronunciation rules of Malay are more direct than those of English (Raminah and Rahim, 1987). A Malay vowel sound such as "a" has the same pronunciation wherever it is placed, whether at the front, middle or back of a word (Onn, 1993). In English, by contrast, the pronunciation of the vowel "a" may vary with its context or position.

3.3 Malay Phonemes

Malay phonemes can be classified into two main categories, namely Malay vowels and consonants (Tan, 2003).

3.3.1 Malay Vowels

Vowels are steady-state voiced sounds (Luis, 1997). There is a unique area function for each vowel; the area function is influenced by the shape of the oral cavity and the position of the tongue (Rowden, 1992). There are six vowels in the Malay language: three front vowels, one central vowel and two back vowels (Farid, 1980). They are "a", "e" (unstressed "e"), "e" (stressed "e"), "i", "o" and "u" (Farid, 1980).

3.3.2 Malay Consonants

There are 26 consonants in the Malay language (Farid, 1980). The special consonants in Malay are "kh" and "sy" (Raminah and Rahim, 1987).

3.4 Phoneme Units Database

The most commonly used units in speech synthesis are probably phonemes, because they are the normal linguistic representation of speech. Using phonemes provides maximum flexibility with rule-based systems (Khor, 2007). Since the phoneme is the basic unit of speech, diphone, triphone or variable-length units can be formed from phonemes (Tan, 2009). Unit selection requires a very large speech database. In this dissertation, a phoneme database built by the Center of Biomedical Engineering, Universiti Teknologi Malaysia, is used. The database was built from many sentences, each cut into smaller units, the phonemes. Since a single sentence may contain several instances of the same phoneme type, the database contains various samples of each phoneme type, and each phoneme may have up to thousands of samples.
Each phoneme's waveform was transformed into a set of coefficients by MFCC feature extraction. For each phoneme's waveform, the initial frame and the final frame were each transformed into 12 coefficients. In other words, each phoneme's waveform is represented by a total of 24 coefficients, as shown in Figure 3.2.

Figure 3.2 The set of coefficients produced by the MFCC algorithm for a phoneme: 12 coefficients (C1 to C12) for the initial frame and 12 coefficients (C1' to C12') for the final frame.

The phoneme database used in this dissertation consists of a total of 73 different phonemes. The total units after extracting the phoneme units from the carrier sentences are shown in Table 3.1. The two phonemes with the highest frequency of occurrence are "a" and "e", which contribute almost 18.36 and 8.6 percent respectively (Tan, 2009). The two phonemes with the lowest frequency of occurrence are "_z" and "iu". Instead of keeping the phoneme speech units in their original carrier sentences, Tan (2009) grouped the same phonemes into their respective folders, as shown in Figure 3.3. This reduces the buffer or memory allocation needed when the system accesses a certain phoneme from its source (Tan, 2009). For example, to form a sentence consisting of 15 phonemes, the system might otherwise need to allocate memory for each of the 15 origin sentences before the phonemes could be extracted from them, which would consume a lot of memory and slow down the concatenation process (Tan and Sheikh, 2008a). Each phoneme folder contains one wave file per sample; for example, if the phoneme folder "_a" contains 107 samples, there will be 107 wave files inside that folder.
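The 24-coefficient representation of Figure 3.2 suggests a simple data structure: each candidate unit carries the 12 MFCCs of its initial frame and the 12 MFCCs of its final frame, and the cost of joining two consecutive units is the Euclidean distance between the final frame of the left unit and the initial frame of the right unit. A minimal sketch in Python (the class and function names are illustrative, not the system's actual code):

```python
import math
from dataclasses import dataclass

@dataclass
class PhonemeUnit:
    """One candidate unit: 12 MFCCs for its initial frame (C1..C12) and
    12 MFCCs for its final frame (C1'..C12'), as in Figure 3.2."""
    name: str
    initial_mfcc: list
    final_mfcc: list

def join_cost(left: PhonemeUnit, right: PhonemeUnit) -> float:
    """Concatenation cost: Euclidean distance between the final frame of
    the left unit and the initial frame of the right unit."""
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(left.final_mfcc, right.initial_mfcc)))
```

A pair of units whose boundary frames match exactly gets a join cost of zero, which is the ideal (inaudible) concatenation.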
Figure 3.3 Speech unit database (Tan and Sheikh, 2008b).

Table 3.1: Total units after extracting the phoneme units from the carrier sentences (Tan and Sheikh, 2008a)

Pho   Total  %      Pho   Total  %      Pho   Total  %      Pho   Total  %
_a    107    0.64   _n    33     0.2    d     313    1.86   o     206    1.22
_ai   4      0.02   _ny   2      0.01   e     1448   8.6    p     276    1.64
_au   1      0.01   _o    21     0.12   eh    124    0.74   q     1      0.01
_b    256    1.52   _p    320    1.9    f     38     0.23   r     838    4.98
_c    29     0.17   _r    58     0.34   g     169    1      s     410    2.44
_d    269    1.6    _s    258    1.53   h     374    2.22   sy    8      0.05
_e    3      0.02   _sy   5      0.03   i     970    5.76   t     652    3.87
_eh   10     0.06   _t    178    1.06   ia    87     0.52   u     696    4.13
_f    17     0.1    _u    42     0.25   io    3      0.02   ua    107    0.64
_g    30     0.18   _v    5      0.03   iu    1      0.01   ui    2      0.01
_h    65     0.39   _w    18     0.11   j     164    0.97   v     10     0.06
_i    74     0.44   _y    59     0.35   k     665    3.95   w     72     0.43
_ia   12     0.07   _z    1      0.01   kh    7      0.04   y     52     0.31
_j    49     0.29   a     3076   18.3   l     514    3.05   z     13     0.08
_k    248    1.47   ai    97     0.58   m     492    2.92
_kh   8      0.05   au    26     0.15   n     1293   7.68
_l    72     0.43   b     253    1.5    ng    500    2.97
_m    447    2.66   c     77     0.46   ny    91     0.54

3.5 Feature Extraction

To compute the spectral distance between two phonemes, e.g. M and A, the feature vectors of the final frame of phoneme M and the initial frame of phoneme A have to be computed. Various distance measures exist for measuring the distance between these two feature vectors. In this dissertation, MFCCs are used as the speech feature (parameterization) and the Euclidean distance is applied to pairs of these feature vectors. The formula for the Euclidean distance is as follows:

D_Eu(X, Y) = sqrt( Σ_{i=1}^{n} (X_i − Y_i)^2 )   (3.1)

Figure 3.4 The GUI to extract MFCC coefficients.

Figure 3.5 The GUI to extract MFCC coefficients.

In Figure 3.4 and Figure 3.5, the item circled in red is the phoneme and the item circled in black is the number of candidates available for that particular phoneme. After clicking the button "MFCC", the MFCC coefficients of the front and end frames of that particular phoneme are extracted.

Figure 3.6 The 12 coefficients extracted for phoneme "_m".

Figure 3.7 The 12 coefficients extracted for phoneme "a".
In Figure 3.6 and Figure 3.7, the items circled in red indicate the phoneme, the number of candidates, and whether the coefficients belong to the front or the end of the MFCCs. The items circled in black are the 12 coefficients.

Figure 3.8 Distance measure and speech feature: the final-frame feature vector (C) of phoneme M and the initial-frame feature vector (C') of phoneme A are extracted by MFCC and compared using the Euclidean distance.

Figure 3.8 shows the distance measure and the speech feature, which are the Euclidean distance and MFCCs. For example, to compute the spectral distance between phonemes "m" and "a" from a database, the feature vector of the final frame of "m" and the feature vector of the initial frame of "a" have to be obtained. Once these features are obtained, the distance measure, which is the Euclidean distance, is applied to this pair of feature vectors.

3.6 Phonetic context

The best candidate for an input unit is the unit whose left and right phonetic contexts exactly match those of the input unit. Therefore, a "filtering" process was conducted using the phonetic context to reduce computational time and effort. This process will be discussed further in the next chapter. For example, to form the word "nasi", the right phonetic context for "n" is "a", and the left and right phonetic contexts for "a" are "n" and "s" respectively. Therefore, only units that match the left and right phonetic contexts have a chance to be selected, while units that do not match are eliminated.

Figure 3.9 The candidate units for phoneme "_n" that match the right phonetic context.

Figure 3.10 The candidate units for phoneme "a" that match the left and right phonetic contexts.

In Figure 3.9, the item circled in red is the right phonetic context for phoneme "_n", which is "a". There are a total of 33 candidate units for phoneme "_n" in the database. After filtering by phonetic context, 22 candidate units that did not match the phonetic context were eliminated, while 11 candidate units remained.
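A minimal sketch of this context-filtering step, assuming each candidate is stored together with its recorded left and right neighbour phonemes (the tuple layout and function name are hypothetical, not the system's actual data layout):

```python
def filter_candidates(candidates, left_ctx, right_ctx):
    """Keep only candidates whose recorded left/right neighbour phonemes
    exactly match the target context (a fully matched phonetic context).
    `candidates` is a list of (unit_id, left_phoneme, right_phoneme)."""
    return [c for c in candidates
            if c[1] == left_ctx and c[2] == right_ctx]

# Toy pool of candidates for phoneme "a" inside the word "nasi",
# where the required context is left = "_n" and right = "s".
pool = [(1, "_n", "s"), (2, "m", "s"), (3, "_n", "t"), (4, "_n", "s")]
kept = filter_candidates(pool, "_n", "s")   # candidates 1 and 4 survive
```

For a word-initial phoneme such as "_n" only the right context would be checked; the sketch above covers the word-internal case where both sides must match.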
In Figure 3.10, the item circled in black is the left phonetic context for phoneme "a", which is "_n", and the item circled in red is the right phonetic context for phoneme "a", which is "s". There are a total of 3076 candidate units for phoneme "a" in the database. After filtering by phonetic context, 3073 candidate units that did not match the phonetic context were eliminated, while only 3 candidate units remained. The retained candidate units are used as the input for Simulated Annealing.

3.7 Unit Selection

Unit selection (Figure 3.11) reads the input from the text file. It then starts to search for the shortest path for the desired word from the database. The total number of sequences is equal to the total number of candidates for phoneme i times the total number of candidates for phoneme i+1, and so on up to phoneme n in a word, where n is the total number of phonemes. A heuristic method is employed since there is a huge number of combinations of sequences that can form a word. The heuristic method helps to search for a minimum "path" within a reasonable time without having to go through all combinations. The heuristic method (meta-heuristic) mentioned here refers to Simulated Annealing. After that, unit selection stores the selected units from the database in a result file. This result is used to call the waveforms from the database to concatenate and produce sound.

Figure 3.11 Unit selection (Tan and Sheikh, 2008c).

3.8 Concatenation

To generate natural-sounding synthesized speech waveforms, one approach is to select and concatenate units from a large speech database (Hunt and Black, 1996). All the selected units are joined together according to the desired sequence of phonemes to form an utterance of a word or sentence. Figure 3.12, Figure 3.13, Figure 3.14 and Figure 3.15 show the waveforms for phonemes "_n", "a", "s" and "i". These four waveforms are joined together to form the output sound for the Malay word "nasi", as shown in Figure 3.16.

Figure 3.12 Waveform for phoneme "_n".

Figure 3.13 Waveform for phoneme "a".
Figure 3.14 Waveform for phoneme "s".

Figure 3.15 Waveform for phoneme "i".

Figure 3.16 Concatenation of the best matching units for the word "nasi".

CHAPTER 4

SIMULATED ANNEALING

4.0 Introduction

Annealing, a physical process in statistical mechanics, is often performed in order to relax a system to the state with the minimum free energy. A crystalline solid is heated by increasing the temperature until the solid melts into liquid; the temperature is then lowered slowly until the solid achieves its regular crystal lattice shape. The particles of the solid rearrange themselves at each temperature, and a solid is capable of reaching thermal equilibrium at each temperature provided the cooling is slow enough. When the system reaches its frozen state, a crystalline solid with low energy (non-defected) is formed. However, the solid may become a glass with a non-crystalline structure, or a defected crystal with meta-stable amorphous structures, if the temperature is lowered too fast (Jeon and Kim, 2004).

Simulated annealing (SA) was developed by Kirkpatrick et al. (1983) by combining statistical mechanics with optimization principles. SA is a local search algorithm. It is known to be one of the most efficient heuristic algorithms and is particularly well suited to combinatorial optimization problems because it can escape local minima. Therefore, SA has been recognized as a powerful tool for solving a large number of optimization problems in a variety of application areas (Koulamas et al., 1994).

SA uses a stochastic approach to direct the search (Hasan et al., 2000). The operation used in obtaining the neighborhood of a solution is called a move. The search is allowed to approach a neighboring state even if the move causes the value of the objective function to become worse. The random design of the simulated annealing method changes with a probabilistic acceptance criterion during the search (Pantelides and Tzan, 2000).
This unique characteristic enables the search process to avoid being trapped in local minima, since non-improving moves may also be accepted as the current solution. Moves that increase the value of the objective function are accepted with a certain probability, establishing simulated annealing as a global optimization method (Pantelides and Tzan, 2000).

Simulated annealing performs a simulation of the physical annealing process. The cost function corresponds to the energy function, and a configuration in optimization corresponds to a state in statistical physics. The aim of SA is to minimize the cost function. SA is governed by a control parameter, the temperature, and an annealing schedule; each temperature can be held for one or a few iterations. Simulated annealing converges to the global optimal solution under certain conditions. These conditions concern the way neighborhood solutions are generated and the cooling schedule (Liu, 1999). Simulated annealing converges to the global optimal solution provided that the computation time is long enough (Liu, 1999). However, it then requires excessive computation time to obtain a near-optimal solution, which is often unrealistic. In the application of unit selection in a Malay Text-to-Speech system, it is essential to get a good solution within a reasonable time. Instead of allowing SA to consume much time in searching for the minimum solution, it is more practical to put more effort into improving the parameters, since the performance of SA depends very much on the selection of parameters.

Since SA makes little use of memory, it cannot efficiently perform intensification and diversification mechanisms (Jeon and Kim, 2004). It is difficult to simultaneously apply intensification and diversification mechanisms in simulated annealing, due to the lack of memory structures and histories, because the current solution is only affected by the previous solution (Metropolis criterion).
The diversification mechanism is mainly performed at high temperature, where most solutions are accepted (random walks). The intensification mechanism is mainly performed as the temperature decreases, because only slightly perturbed solutions are accepted.

4.1 Procedure of Simulated Annealing

First, an initial solution needs to be generated. In the beginning, the initial configuration, including the initial temperature and the annealing schedule, needs to be determined. After the initial temperature is chosen, an initial solution is generated randomly or by applying a simple construction procedure. The temperature is set at a high level so that almost all moves will be accepted initially. Then, the value of the cost function is calculated. The value of the cost function based on the initial solution is immediately accepted as the current solution. Next, a new solution from the neighborhood of the current solution is generated by applying a move. The temperature is lowered according to the annealing schedule during the procedure, until almost no moves are accepted. The value of the cost function based on the new solution is calculated and compared to the current cost function value. If it is better than the current value, it is accepted as the current solution. Otherwise, the new value is accepted as the current solution only when the Metropolis criterion, which is based on Boltzmann's probability, is met. The process then continues from the new current solution.

Figure 4.1 SA flow diagram to find the best speech unit sequence: Start; initialize a sequence at the initial temperature; swap a phoneme using the move to obtain the neighbourhood solution; evaluate the neighbourhood solution; accept or reject the neighbourhood solution based on the Metropolis criterion; increment the counter and decrease the temperature according to the annealing schedule; if the stopping criteria are not met, repeat from the move step, else end.
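The loop in Figure 4.1 can be sketched as follows. This is a simplified illustration, not the system's actual implementation: `costs` and `candidates` stand in for the real concatenation-cost evaluation and candidate database, the move shown is the random re-selection of one phoneme's candidate, the Markov chain length is one, and cooling is geometric.

```python
import math
import random

def simulated_annealing(costs, candidates, t0=1000.0, tf=0.1, alpha=0.9, seed=0):
    """Minimal SA sketch for unit selection (after Figure 4.1).
    `candidates[i]` is the number of candidate units for phoneme i;
    `costs(seq)` returns the total concatenation cost of a sequence."""
    rng = random.Random(seed)
    current = [0] * len(candidates)       # initial solution: first candidates
    current_cost = costs(current)
    best, best_cost = list(current), current_cost
    t = t0
    while t > tf:                         # stopping criterion: final temperature
        # Move: randomly re-pick one phoneme's candidate (neighbourhood solution)
        neighbour = list(current)
        i = rng.randrange(len(candidates))
        neighbour[i] = rng.randrange(candidates[i])
        n_cost = costs(neighbour)
        delta = n_cost - current_cost
        # Metropolis criterion: accept improvements outright, and accept
        # worse moves with probability exp(-delta / t)
        if delta < 0 or rng.random() < math.exp(-delta / t):
            current, current_cost = neighbour, n_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
        t *= alpha                        # geometric cooling, chain length 1
    return best, best_cost
```

Because the best solution seen so far is tracked separately, the returned cost is never worse than the cost of the initial solution, even when the final accepted solution is a non-improving one.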
4.2 Initial Solution

An initial solution can also be considered a random solution: it is a starting solution that will be used in the search process (Ghazanfari et al., 2007). In this dissertation, the initial solution is fixed: the first candidate of each related phoneme is chosen as the initial solution. For example, to form the Malay word "masin", 5 different phonemes are required, namely "_m", "a", "s", "i" and "n".

Example: _m[1] a[1] s[1] i[1] n[1]

The numerical value in brackets "[ ]" represents the candidate number for the corresponding phoneme. The objective function for cost minimization in unit selection is represented by

u^{*} = \arg\min_{u \in U} \sum_{i=1}^{N} C_c(u_i, u_{i-1})    (4.1)

where u = u_1, u_2, ..., u_N are the units in the inventory U which minimize the concatenation cost in Equation (4.1). Obtaining the local concatenation cost value requires the parameterization of the units and a distance measure. The concatenation cost between units u_i and u_{i-1}, taken here as the Euclidean distance over the 12 MFCC coefficients, can be written as

C_c(u_i, u_{i-1}) = \sqrt{\sum_{j=1}^{12} (u_{i,j} - u_{i-1,j})^2}

4.3 The cooling schedule

The choice of cooling schedule has an important effect on the performance of the SA algorithm (Ali et al., 2002). For this reason, modifications and improvements have been tried by tuning the parameters (cooling rate) for a better quality or time tradeoff. The annealing schedule must be specified in any implementation of SA. The value of the temperature parameter, T, varies from a relatively large value to a small value close to zero. These temperature values are controlled by a cooling schedule that specifies the initial temperature and the decreasing temperature values at each stage of the algorithm. The following geometric function has been taken as the temperature reduction function:

T_{k+1} = \alpha T_k,    k = 0, 1, 2, 3, ...,    0 < \alpha < 1

where T_k is the temperature at stage k and \alpha is the temperature reduction rate. In this dissertation, various temperature reduction rates were used, namely 0.80, 0.85, 0.90 and 0.95.
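As an illustration of the geometric schedule, the following sketch counts how many iterations the reduction T_{k+1} = \alpha T_k allows before the temperature falls to the final value. With an initial temperature of 1000 and a final temperature of 0.1 (the values adopted later in this section), it reproduces the iteration counts reported in Table 4.1; the function name is illustrative.

```python
def iterations_to_cool(t0=1000.0, tf=0.1, alpha=0.80, chain_length=1):
    """Count SA iterations under the geometric schedule T_{k+1} = alpha * T_k,
    spending `chain_length` iterations at each temperature, while the
    temperature remains greater than tf."""
    t, n = t0, 0
    while t > tf:
        n += chain_length
        t *= alpha
    return n

# Iteration counts for the four reduction rates with Markov chain length 1;
# these match Table 4.1 (42, 57, 88 and 180 iterations).
counts = {a: iterations_to_cool(alpha=a) for a in (0.80, 0.85, 0.90, 0.95)}
```

Doubling the chain length simply doubles each count, which is why the Markov chain length 2 column of Table 4.1 is exactly twice the length 1 column.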
Figure 4.2 and Figure 4.3 show the temperature reduction pattern for these temperature reduction rates, for Markov chain lengths equal to one and two respectively. The initial temperature T_0 is set relatively high so that most of the moves are accepted in the early stages and there is little chance of the algorithm intensifying into the region of a local minimum. The initial temperature and final temperature (Chen and Su, 2002) in unit selection are set according to Equation (4.2) and Equation (4.3) respectively:

T_i = -1 / \ln P_i    (4.2)

T_f = -1 / \ln P_f    (4.3)

where P_i is the desired initial probability and P_f is the desired final probability. The parameter values in unit selection are given as follows:

- initial temperature, T_0 = -1 / \ln(0.999) = 999.499 ≈ 1000
- final temperature, T_f = -1 / \ln(0.00001) = 0.0869 ≈ 0.1
- temperature reduction rate, \alpha = 0.80, 0.85, 0.90, 0.95

Figure 4.2 Temperature reduction pattern for various reduction rates with Markov chain length 1.

Figure 4.3 Temperature reduction pattern for various reduction rates with Markov chain length 2.

4.3.1 Markov chain

The length of the Markov chain is required to decide how many trials are to be used at each value of T. Two Markov chain lengths were used for unit selection in this dissertation. With the first, the temperature is reduced according to the annealing schedule at every iteration; with the second, the temperature is reduced according to the annealing schedule after every two successive iterations.

Table 4.1 Maximum number of iterations for Markov chain lengths 1 and 2 to reach a final temperature greater than 0.1.
Temperature      Maximum number of iterations to reach T_f > 0.1
reduction rate   Markov chain length = 1    Markov chain length = 2
0.80             42                         84
0.85             57                         114
0.90             88                         176
0.95             180                        360

4.4 Neighbourhood Generation Mechanism

The effect of the neighbourhood structure in SA has been studied by Cheh et al. (1991), who found that a small neighbourhood is better than a larger one for a number of problems. Different neighbourhood sizes on the travelling salesman problem (TSP) have been tested by Goldstein and Waterman (1988), who found that the best neighbourhood size is related to the problem size. Yao (1991) found that a larger neighbourhood is better if the initial solution is far away from the optimal solution. Yao (1993) showed that the performance of a dynamic neighbourhood size on the TSP was significantly better than standard SA with a fixed neighbourhood.

At each move, the generation scheme used to obtain a new configuration is crucial to both the quality of the solution and the speed of the algorithm. Four different moves are used in this dissertation, as follows:

Move 1: Randomly swap each of the phonemes, one at a time.
Move 2: Swap the phoneme following the greatest local cost.
Move 3: Swap the phoneme preceding the greatest local cost.
Move 4: Let the greatest local cost be at position i, 1 ≤ i ≤ n. Swap the next phoneme when the local cost at i+1 is greater than that at i-1; swap the previous phoneme when the local cost at i+1 is less than that at i-1. Swap the first phoneme when the greatest local cost is at i = 1, and swap the last phoneme when it is at i = n.

4.4.1 Move 1

Move 1: Randomly swapping each of the phonemes, one at a time.

Example:

Initial solution: phoneme1[1] phoneme2[1] phoneme3[1] phoneme4[1] phoneme5[1]

The local costs between adjacent phonemes are LocalCost(i) between phonemes 1 and 2, LocalCost(i+1) between phonemes 2 and 3, LocalCost(i+2) between phonemes 3 and 4, and LocalCost(i+3) between phonemes 4 and 5. The total cost is computed as \sum_{i=1}^{n} LocalCost[i], where n is the total number of local costs.

Iteration 1: Apply the move to obtain a neighborhood solution.
Neighborhood solution: phoneme1[1] phoneme2[1] phoneme3[7] phoneme4[1] phoneme5[1]

Phoneme 3 is chosen randomly to swap, and its candidate number is changed randomly from 1 to 7. When phoneme 3 is changed, LocalCost(i+1) and LocalCost(i+2) change, while LocalCost(i) and LocalCost(i+3) remain unchanged. If this neighborhood solution is accepted as the current solution, another neighborhood solution is generated based on this newly accepted current solution.

Iteration 2:

Neighborhood solution: phoneme1[9] phoneme2[1] phoneme3[7] phoneme4[1] phoneme5[1]

Phoneme 1 is chosen randomly to swap, and its candidate number is changed randomly from 1 to 9. When phoneme 1 is changed, only LocalCost(i) changes, while the other local costs remain unchanged. If this neighborhood solution is accepted as the current solution, another neighborhood solution is generated based on this newly accepted current solution.

Iteration 3:

Neighborhood solution: phoneme1[9] phoneme2[1] phoneme3[7] phoneme4[1] phoneme5[14]

Phoneme 5 is chosen randomly to swap, and its candidate number is changed randomly from 1 to 14. When phoneme 5 is changed, only LocalCost(i+3) changes, while the other local costs remain unchanged. If this neighborhood solution is accepted as the current solution, another neighborhood solution is generated based on this newly accepted current solution.

Iteration 4:

Neighborhood solution: phoneme1[9] phoneme2[26] phoneme3[7] phoneme4[1] phoneme5[14]

Phoneme 2 is chosen randomly to swap, and its candidate number is changed randomly from 1 to 26.
When phoneme 2 is changed, LocalCost(i) and LocalCost(i+1) change, while LocalCost(i+2) and LocalCost(i+3) remain unchanged. If this neighborhood solution is accepted as the current solution, another neighborhood solution is generated based on this newly accepted current solution.

Iteration 5:

Neighborhood solution: phoneme1[9] phoneme2[26] phoneme3[7] phoneme4[8] phoneme5[14]

Phoneme 4 is chosen randomly to swap, and its candidate number is changed randomly from 1 to 8. When phoneme 4 is changed, LocalCost(i+2) and LocalCost(i+3) change, while LocalCost(i) and LocalCost(i+1) remain unchanged. If this neighborhood solution is rejected as the current solution, the search goes back to the previous iteration and generates another neighborhood solution based on the solution of the previous iteration (the current solution).

4.4.2 Move 2

Move 2: Swap the phoneme following the greatest local cost.

Since the task assigned to Simulated Annealing is minimization, the greatest local cost should have the highest priority to be swapped. Therefore, in this case, the values of the individual local costs are used as guidance to decide which phoneme needs to be swapped, and the move used is to swap the phoneme following the greatest local cost.

Example:

Initial solution: phoneme1[1] phoneme2[1] phoneme3[1] phoneme4[1] phoneme5[1]

The total cost is computed as \sum_{i=1}^{n} LocalCost[i], where n is the total number of local costs.

Iteration 1: Apply the move to obtain a neighborhood solution. For LocalCost(i) > LocalCost(i+1) > LocalCost(i+2) > LocalCost(i+3), since the greatest distance is from phoneme 1 to phoneme 2, we can swap either the phoneme following the greatest local cost, the phoneme preceding it, or both phonemes, so that the distance between them changes.
In this case, the move used is to swap the phoneme following the greatest local cost. Phoneme 2 is chosen to swap since LocalCost(i) has the greatest magnitude, and its candidate number is changed randomly from 1 to 9. When phoneme 2 is changed, LocalCost(i) and LocalCost(i+1) change, while LocalCost(i+2) and LocalCost(i+3) remain unchanged. If this neighborhood solution is accepted as the current solution, another neighborhood solution is generated based on this newly accepted current solution.

Neighborhood solution: phoneme1[1] phoneme2[9] phoneme3[1] phoneme4[1] phoneme5[1]

Iteration 2: For LocalCost(i+2) > LocalCost(i) > LocalCost(i+1) > LocalCost(i+3), since the phoneme following LocalCost(i+2) is phoneme 4, phoneme 4 needs to be swapped. Its candidate number is changed randomly from 1 to 21. When phoneme 4 is changed, LocalCost(i+2) and LocalCost(i+3) change, while LocalCost(i) and LocalCost(i+1) remain unchanged. If this neighborhood solution is accepted as the current solution, another neighborhood solution is generated based on this newly accepted current solution.

Neighborhood solution: phoneme1[1] phoneme2[9] phoneme3[1] phoneme4[21] phoneme5[1]

Iteration 3: For LocalCost(i+1) > LocalCost(i) > LocalCost(i+2) > LocalCost(i+3), since the phoneme following LocalCost(i+1) is phoneme 3, phoneme 3 needs to be swapped. Its candidate number is changed randomly from 1 to 42. When phoneme 3 is changed, LocalCost(i+1) and LocalCost(i+2) change, while LocalCost(i) and LocalCost(i+3) remain unchanged. If this neighborhood solution is accepted as the current solution, another neighborhood solution is generated based on this newly accepted current solution.
Neighborhood solution: phoneme1[1] phoneme2[9] phoneme3[42] phoneme4[21] phoneme5[1]

Iteration 4: For LocalCost(i+3) > LocalCost(i) > LocalCost(i+2) > LocalCost(i+1), since the phoneme following LocalCost(i+3) is phoneme 5, phoneme 5 needs to be swapped. Its candidate number is changed randomly from 1 to 7. When phoneme 5 is changed, only LocalCost(i+3) changes, while the others remain unchanged. If this neighborhood solution is accepted as the current solution, another neighborhood solution is generated based on this newly accepted current solution.

Neighborhood solution: phoneme1[1] phoneme2[9] phoneme3[42] phoneme4[21] phoneme5[7]

4.4.3 Move 3

Move 3: Swap the phoneme preceding the greatest local cost.

In this case, the concept of choosing which phoneme to swap is similar to move 2, but the difference is that this time the phoneme preceding the greatest local cost is chosen to swap.

Example:

Initial solution: phoneme1[1] phoneme2[1] phoneme3[1] phoneme4[1] phoneme5[1]

The total cost is computed as \sum_{i=1}^{n} LocalCost[i], where n is the total number of local costs.

Iteration 1: Apply the move to obtain a neighborhood solution. For LocalCost(i) > LocalCost(i+1) > LocalCost(i+2) > LocalCost(i+3), since the phoneme preceding LocalCost(i) is phoneme 1, phoneme 1 needs to be swapped. Its candidate number is changed randomly from 1 to 8. When phoneme 1 is changed, only LocalCost(i) changes, while the others remain unchanged. If this neighborhood solution is accepted as the current solution, another neighborhood solution is generated based on this newly accepted current solution.
Neighborhood solution: phoneme1[8] phoneme2[1] phoneme3[1] phoneme4[1] phoneme5[1]

Iteration 2: For LocalCost(i+2) > LocalCost(i) > LocalCost(i+1) > LocalCost(i+3), since the phoneme preceding LocalCost(i+2) is phoneme 3, phoneme 3 needs to be swapped. Its candidate number is changed randomly from 1 to 7. When phoneme 3 is changed, LocalCost(i+1) and LocalCost(i+2) change, while LocalCost(i) and LocalCost(i+3) remain unchanged. If this neighborhood solution is accepted as the current solution, another neighborhood solution is generated based on this newly accepted current solution.

Neighborhood solution: phoneme1[8] phoneme2[1] phoneme3[7] phoneme4[1] phoneme5[1]

Iteration 3: For LocalCost(i+1) > LocalCost(i) > LocalCost(i+2) > LocalCost(i+3), since the phoneme preceding LocalCost(i+1) is phoneme 2, phoneme 2 needs to be swapped. Its candidate number is changed randomly from 1 to 3. When phoneme 2 is changed, LocalCost(i) and LocalCost(i+1) change, while LocalCost(i+2) and LocalCost(i+3) remain unchanged. If this neighborhood solution is accepted as the current solution, another neighborhood solution is generated based on this newly accepted current solution.

Neighborhood solution: phoneme1[8] phoneme2[3] phoneme3[7] phoneme4[1] phoneme5[1]

4.4.4 Move 4

Move 4: Let the greatest local cost be at position i, 1 ≤ i ≤ n. Swap the next phoneme when the local cost at i+1 is greater than that at i-1, and swap the previous phoneme when the local cost at i+1 is less than that at i-1. Swap the first phoneme when the greatest local cost is at i = 1, and swap the last phoneme when it is at i = n.

Move 2 and move 3 were combined to form move 4.
In move 4, whether to swap the previous phoneme or the next phoneme is determined by the magnitudes of the local costs on either side of the greatest local cost. The purpose of this move is to speed up the algorithm towards the minimum solution. When a phoneme other than the first or last phoneme is swapped, two local costs also change, and the decision whether to swap the previous or the next phoneme determines which two local costs change. With move 4, the pair of local costs with the higher magnitude is selected to change.

Example:

Initial solution: phoneme1[1] phoneme2[1] phoneme3[1] phoneme4[1] phoneme5[1]

The total cost is computed as \sum_{i=1}^{n} LocalCost[i], where n is the total number of local costs.

Iteration 1: Apply the move to obtain a neighborhood solution. For LocalCost(i+1) > LocalCost(i) > LocalCost(i+2) > LocalCost(i+3), since LocalCost(i+1) has the greatest magnitude, whether to swap phoneme 2 or phoneme 3 is determined by the magnitudes of LocalCost(i) and LocalCost(i+2). Since LocalCost(i) > LocalCost(i+2), LocalCost(i) should be given a higher priority to change than LocalCost(i+2), and therefore phoneme 2 (the previous phoneme) is swapped. LocalCost(i) and LocalCost(i+1) change when phoneme 2 is swapped. The candidate number for phoneme 2 is changed randomly from 1 to 5.

Neighborhood solution: phoneme1[1] phoneme2[5] phoneme3[1] phoneme4[1] phoneme5[1]

Iteration 2: For LocalCost(i+2) > LocalCost(i+3) > LocalCost(i) > LocalCost(i+1), since LocalCost(i+3) > LocalCost(i+1), phoneme 4 is swapped. LocalCost(i+2) and LocalCost(i+3) change when phoneme 4 is swapped. The candidate number for phoneme 4 is changed randomly from 1 to 13.
Neighborhood solution: phoneme1[1] phoneme2[5] phoneme3[1] phoneme4[13] phoneme5[1]

Iteration 3: For LocalCost(i) > LocalCost(i+1) > LocalCost(i+2) > LocalCost(i+3), phoneme 1 is swapped. LocalCost(i) changes when phoneme 1 is swapped. The candidate number for phoneme 1 is changed randomly from 1 to 11.

Neighborhood solution: phoneme1[11] phoneme2[5] phoneme3[1] phoneme4[13] phoneme5[1]

Iteration 4: For LocalCost(i+3) > LocalCost(i+1) > LocalCost(i+2) > LocalCost(i), phoneme 5 is swapped. LocalCost(i+3) changes when phoneme 5 is swapped. The candidate number for phoneme 5 is changed randomly from 1 to 32.

Neighborhood solution: phoneme1[11] phoneme2[5] phoneme3[1] phoneme4[13] phoneme5[32]

4.5 Metropolis's criterion

According to Metropolis's criterion (Figure 4.4), a random number λ in [0, 1] is generated from a uniform distribution when the difference between the cost function values of the newly generated and current solutions (ΔE) is equal to or greater than zero. The newly generated solution is accepted as the current solution if Equation (4.4) is met:

P(δ) = \exp(-ΔE / T)    (4.4)

Figure 4.4 Metropolis criterion: if C_new < C_prev, the new solution is accepted as the current solution; otherwise, prob = \exp((C_prev - C_new) / T) is computed and the new solution is accepted only if prob > λ, where λ is a random number from 0 to 1, and rejected otherwise.

4.6 Stopping criteria

In this dissertation, the stopping criterion is based on three conditions:

- The maximum number of iterations is used as a stopping criterion: the algorithm terminates when the total number of iterations N > ε (ε is a user-determined value).
- If the neighborhood solution has not improved after β iterations (β is a user-determined value), terminate the algorithm.
- If the newly generated temperature T_k is less than the final temperature ω, that is, T_k < ω (ω is a user-determined value), terminate the algorithm.

The user-determined values ε, β and ω may influence the performance of SA when they are changed.

4.7 Unit Selection

Unit selection reads the input text and starts to search for the most suitable units from the database. The process of unit selection involves two stages. The first stage is filtering the candidate units with the phonetic context, and the second stage is the selection of the unit sequence that results in the smallest sum of local costs. The target word utterance is formed by concatenating the waveforms of the phonemes according to the unit sequence that holds the smallest sum of local costs.

4.7.1 Phonetic context

According to Fek et al. (2006), the target cost is assigned a null cost if the phonetic context is fully matched. The phonetic context is used as a target cost in this dissertation. The computation of this target cost is not included in this dissertation; instead, the phonetic context is used as a "filtering tool" to narrow down the search space. Since the best candidate for an input unit is the unit that fully matches in terms of phonetic context, a "filtering" process was conducted using the phonetic context to reduce computational time and effort. During the "filtering" process, only the candidates whose phonetic contexts are fully matched were retained. A fully matched phonetic context means that the left and right phonetic contexts of the input unit and the candidate match exactly. For example, to form the word "saya", the left phonetic context for "a" is "s" and the right phonetic context for "a" is "y". If this combination of left and right phonetic contexts for "a" can be found in the database, it is called a fully matched phonetic context.
If the left phonetic context for "a" is other than "s" while the right phonetic context is "y", it is called a partially matched phonetic context, since only the right phonetic context matches. If the left phonetic context for "a" is "s" while the right phonetic context is other than "y", it is also called a partially matched phonetic context, since only the left phonetic context matches. If the left phonetic context for "a" is other than "s" and the right phonetic context is other than "y", it is called a non-matched phonetic context. In this dissertation, the candidates whose phonetic contexts are fully matched after the "filtering" process were used as the input for Simulated Annealing. The search region for Simulated Annealing is significantly reduced after the "filtering" process. Based on the findings of Fek et al. (2006), since the best candidate unit will be the one that fully matches the phonetic context, it is appropriate to conduct the "filtering" process before performing Simulated Annealing, so that the SA algorithm excludes units that only partially match, or do not match at all, the phonetic context.

For example, to form the Malay word "kampung", 6 different phonemes are required, namely "_k", "a", "m", "p", "u" and "ng". The total numbers of candidates for these phonemes are 248, 3076, 492, 276, 696 and 500 respectively, as shown in Figure 4.5. Therefore, before the phonetic context "filtering" process, the total number of possible sequences is 248 × 3076 × 492 × 276 × 696 × 500 = 3.605 × 10^16. Figure 4.6 shows the feasible search region after filtering using a partially matched phonetic context. After the phonetic context "filtering" process using the fully matched phonetic context, the total number of possible sequences is 40 × 2 × 22 × 17 × 3 × 36 = 3231360 ≈ 3.23 × 10^6.
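The sequence counts quoted above for "kampung" can be reproduced directly, together with the per-phoneme reductions discussed in this section. The list and variable names below are illustrative; the candidate counts are those given in the text.

```python
# Candidates for _k, a, m, p, u, ng before and after fully matched
# phonetic-context filtering (the "kampung" example in the text).
before = [248, 3076, 492, 276, 696, 500]
after = [40, 2, 22, 17, 3, 36]

def product(xs):
    """Total number of candidate sequences = product of per-phoneme counts."""
    p = 1
    for x in xs:
        p *= x
    return p

seq_before = product(before)   # possible sequences before filtering
seq_after = product(after)     # possible sequences after filtering
# Per-phoneme reduction in the number of candidates, in percent.
reduction = [round(100 * (b - a) / b, 2) for b, a in zip(before, after)]
```

The product over `after` gives 3231360 ≈ 3.23 × 10^6, so the reduction in the total sequence space relative to the unfiltered product is indeed almost 100%.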
The percentage of reduction in the total number of candidates for the phonemes "_k", "a", "m", "p", "u" and "ng" is 83.87%, 99.93%, 95.53%, 93.84%, 99.57% and 92.80% respectively. The percentage of reduction in the total number of possible sequences after the phonetic-context "filtering" process is almost 100%.

Figure 4.5 The feasible search region to form the Malay word "kampung" before filtering using phonetic context.

Figure 4.6 The feasible search region to form the Malay word "kampung" after filtering using partially matched phonetic context (left phonetic context).

Figure 4.7 The feasible search region to form the Malay word "kampung" after filtering using fully matched phonetic context (left and right phonetic context).

After the "filtering" process, the total number of candidates for each individual phoneme is significantly reduced. These new candidate totals become the feasible search region for Simulated Annealing (Figure 4.7). Before performing the SA algorithm, the total number of candidates for each individual phoneme needs to be determined, to ensure that the algorithm does not step into the infeasible region.

Table 4.2 The information of the 10 words before filtering using phonetic context.
Word | Phonemes involved | Candidates per phoneme | Possible sequences

Category 1 (4 ≤ x ≤ 6, x = number of phonemes)
nasi | _n, a, s, i | 33, 3076, 410, 970 | 4.037 × 10^10
musim | _m, u, s, i, m | 447, 696, 410, 970, 492 | 6.087 × 10^13
janji | _j, a, n, j, i | 49, 3076, 1293, 164, 970 | 3.10 × 10^13
kampung | _k, a, m, p, u, ng | 248, 3076, 492, 276, 696, 500 | 3.605 × 10^16

Category 2 (7 ≤ x ≤ 9)
vitamin | _v, i, t, a, m, i, n | 5, 970, 652, 3076, 492, 970, 1293 | 6.002 × 10^18
demikian | _d, e, m, i, k, ia, n | 269, 1448, 492, 970, 665, 87, 1293 | 1.391 × 10^19
muktamad | _m, u, k, t, a, m, a, d | 447, 696, 665, 652, 3076, 492, 3076, 313 | 1.9655 × 10^23
informasi | _i, n, f, o, r, m, a, s, i | 74, 1293, 38, 206, 838, 492, 3076, 410, 970 | 3.778 × 10^23

Category 3 (x ≥ 10)
selanjutnya | _s, e, l, a, n, j, u, t, ny, a | 258, 1448, 514, 3076, 1293, 164, 696, 652, 91, 3076 | 1.591 × 10^28
berpengetahuan | _b, e, r, p, e, ng, e, t, a, h, ua, n | 256, 1448, 838, 276, 1448, 500, 1448, 652, 3076, 374, 107, 1293 | 9.327 × 10^33

Table 4.3 The information of the 10 words after filtering using partially matched phonetic context (left phonetic context).
Word | Phonemes involved | Candidates per phoneme | Possible sequences

Category 1 (4 ≤ x ≤ 6, x = number of phonemes)
nasi | _n, a, s, i | 11, 11, 129, 74 | 1155066
musim | _m, u, s, i, m | 16, 16, 59, 74, 41 | 45825536
janji | _j, a, n, j, i | 20, 20, 755, 50, 13 | 196300000
kampung | _k, a, m, p, u, ng | 40, 40, 137, 76, 39, 36 | 2.339 × 10^10

Category 2 (7 ≤ x ≤ 9)
vitamin | _v, i, t, a, m, i, n | 3, 3, 65, 199, 137, 38, 74 | 4.4848 × 10^10
demikian | _d, e, m, i, k, ia, n | 24, 24, 190, 38, 87, 3, 29 | 3.1477 × 10^10
muktamad | _m, u, k, t, a, m, a, d | 16, 16, 87, 20, 199, 137, 130, 89 | 1.4051 × 10^14
informasi | _i, n, f, o, r, m, a, s, i | 19, 19, 3, 4, 32, 19, 130, 129, 74 | 3.2686 × 10^12

Category 3 (x ≥ 10)
selanjutnya | _s, e, l, a, n, j, u, t, ny, a | 138, 138, 111, 227, 755, 50, 29, 61, 1, 64 | 2.0508 × 10^18
berpengetahuan | _b, e, r, p, e, ng, e, t, a, h, ua, n | 141, 141, 410, 12, 30, 138, 24, 36, 199, 200, 1, 28 | 3.899 × 10^20

Table 4.4 The information of the 10 words after filtering using fully matched phonetic context (left and right phonetic context).
Word | Phonemes involved | Candidates per phoneme | Possible sequences

Category 1 (4 ≤ x ≤ 6, x = number of phonemes)
nasi | _n, a, s, i | 11, 3, 25, 74 | 61050
musim | _m, u, s, i, m | 16, 6, 7, 2, 41 | 55104
janji | _j, a, n, j, i | 20, 2, 12, 2, 13 | 12480
kampung | _k, a, m, p, u, ng | 40, 2, 22, 17, 3, 36 | 3231360

Category 2 (7 ≤ x ≤ 9)
vitamin | _v, i, t, a, m, i, n | 3, 1, 26, 10, 10, 11, 74 | 6349200
demikian | _d, e, m, i, k, ia, n | 24, 2, 14, 5, 2, 3, 29 | 584640
muktamad | _m, u, k, t, a, m, a, d | 16, 1, 7, 3, 10, 40, 4, 89 | 47846400
informasi | _i, n, f, o, r, m, a, s, i | 19, 2, 1, 2, 5, 6, 11, 25, 74 | 46398000

Category 3 (x ≥ 10)
selanjutnya | _s, e, l, a, n, j, u, t, ny, a | 138, 13, 58, 44, 12, 10, 5, 1, 1, 64 | 1.7581 × 10^11
berpengetahuan | _b, e, r, p, e, ng, e, t, a, h, ua, n | 141, 109, 11, 3, 3, 24, 3, 16, 20, 1, 1, 28 | 9.8157 × 10^11

4.7.2 Concatenation Cost

The performance of each move used in this dissertation was tested on the 10 different words. The same conditions were set for all moves: the length of the Markov chain is 1, the temperature reduction rate is α = 0.90, and the initial solution is fixed. These conditions also include the common stopping criteria: final temperature Tf ≤ 0.1, maximum number of iterations 100, and number of non-improving moves 50. Once the move with the best performance was found, its performance was investigated further by varying the cooling schedule and the length of the Markov chain. Each word was run 10 times under the same conditions. The mean, variance and standard deviation of the concatenation cost were calculated, and the performance of the moves was evaluated based on the sum of the mean values over the 10 words.
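The test setup above, a random swap move over the filtered candidates, geometric cooling, a fixed Markov chain length per temperature, and the three stopping criteria, can be sketched as follows. This is a hedged illustration, not the dissertation's actual Matlab code: the candidate features are synthetic stand-ins for MFCC border vectors, the initial temperature is an assumed value, and all names are illustrative.

```python
import math
import random

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def concatenation_cost(seq, candidates):
    # Sum of local join costs between the end features of one selected
    # unit and the start features of the next.
    units = [candidates[i][j] for i, j in enumerate(seq)]
    return sum(euclidean(units[k]["end"], units[k + 1]["start"])
               for k in range(len(units) - 1))

def anneal(candidates, t0=100.0, t_final=0.1, alpha=0.90, chain_len=1,
           max_iter=100, max_non_improving=50, seed=0):
    rng = random.Random(seed)
    seq = [0] * len(candidates)          # fixed initial solution
    cost = concatenation_cost(seq, candidates)
    best_seq, best_cost = list(seq), cost
    t, iters, non_improving = t0, 0, 0
    while t > t_final and iters < max_iter and non_improving < max_non_improving:
        for _ in range(chain_len):       # Markov chain at this temperature
            i = rng.randrange(len(candidates))      # a move 1 style swap:
            j = rng.randrange(len(candidates[i]))   # any phoneme, any candidate
            new_seq = list(seq)
            new_seq[i] = j
            new_cost = concatenation_cost(new_seq, candidates)
            delta = new_cost - cost
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                seq, cost = new_seq, new_cost       # accept the move
            if cost < best_cost:
                best_seq, best_cost = list(seq), cost
                non_improving = 0
            else:
                non_improving += 1
            iters += 1
        t *= alpha                       # geometric cooling: T <- alpha * T
    return best_seq, best_cost

# Toy problem: 4 phonemes, 5 candidates each, random 12-dim "MFCC" borders.
data_rng = random.Random(42)
cands = [[{"start": [data_rng.random() for _ in range(12)],
           "end": [data_rng.random() for _ in range(12)]}
          for _ in range(5)] for _ in range(4)]
best_seq, best_cost = anneal(cands)
```

The returned cost can never exceed the cost of the fixed initial solution, since the best-so-far sequence is tracked separately from the current one.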
4.7.2.1 Concatenation cost for Move 1

Table 4.5 Information of concatenation cost (Move 1) with temperature reduction rate α = 0.90

Word | Mean | Variance | Std. deviation | Initial solution | Best solution | Worst solution
nasi | 40.1364 | 4.7705 | 2.1841 | 51.9894 | 37.1871 | 44.2674
musim | 57.1876 | 1.2936 | 1.1374 | 66.3190 | 55.6639 | 58.9858
janji | 57.4715 | 1.5463 | 1.2435 | 65.8612 | 55.7146 | 59.4232
kampung | 45.5589 | 6.0959 | 2.4690 | 57.0421 | 42.3074 | 51.4532
vitamin | 57.5345 | 9.2171 | 3.0360 | 71.3736 | 54.3926 | 64.0537
demikian | 61.0513 | 3.4004 | 1.8440 | 66.3976 | 58.8410 | 64.4931
muktamad | 56.5940 | 4.0877 | 2.0218 | 69.2485 | 52.7411 | 59.9678
informasi | 92.8231 | 7.7344 | 2.7811 | 102.8064 | 90.1583 | 98.8206
selanjutnya | 88.0075 | 25.3428 | 5.0342 | 96.2091 | 78.6626 | 96.0506
berpengetahuan | 95.0096 | 29.3829 | 5.4206 | 109.5395 | 85.2593 | 103.3070
Total | 651.3744

4.7.2.2 Concatenation cost for Move 2

Table 4.6 Information of concatenation cost (Move 2) with temperature reduction rate α = 0.90

Word | Mean | Variance | Std. deviation | Initial solution | Best solution | Worst solution
nasi | 38.7232 | 1.7800 | 1.3342 | 51.9894 | 38.0371 | 42.3188
musim | 55.6213 | 0.5375 | 0.7332 | 66.3190 | 54.8516 | 56.4404
janji | 65.8612 | 0 | 0 | 65.8612 | 65.8612 | 65.8612
kampung | 51.9043 | 0 | 0 | 57.0421 | 51.9043 | 51.9043
vitamin | 68.1684 | 0 | 0 | 71.3736 | 68.1684 | 68.1684
demikian | 62.3619 | 0 | 0 | 66.3976 | 62.3619 | 62.3619
muktamad | 66.9015 | 0 | 0 | 69.2485 | 66.9015 | 66.9015
informasi | 93.7267 | 4.7221 | 2.1731 | 102.8064 | 91.0594 | 96.6425
selanjutnya | 90.7814 | 0 | 0 | 96.2091 | 90.7814 | 90.7814
berpengetahuan | 98.1096 | 3.5564 | 1.8859 | 109.5395 | 94.9970 | 100.4276
Total | 692.1595

4.7.2.3 Concatenation cost for Move 3

Table 4.7 Information of concatenation cost (Move 3) with temperature reduction rate α = 0.90

Word | Mean | Variance | Std. deviation | Initial solution | Best solution | Worst solution
nasi | 51.6553 | 0 | 0 | 51.9894 | 51.6553 | 51.6553
musim | 57.3512 | 0.3507 | 0.5922 | 66.3190 | 56.5752 | 57.8808
janji | 58.8172 | 0 | 0 | 65.8612 | 58.8172 | 58.8172
kampung | 51.3213 | 0 | 0 | 57.0421 | 51.3213 | 51.3213
vitamin | 65.5459 | 1.7338 | 1.3168 | 71.3736 | 62.9330 | 67.1848
demikian | 61.9506 | 2.9976 | 1.7314 | 66.3976 | 59.9471 | 64.7144
muktamad | 59.6220 | 1.2380 | 1.1127 | 69.2485 | 58.2360 | 60.9776
informasi | 102.7709 | 0 | 0 | 102.8064 | 102.7709 | 102.7709
selanjutnya | 91.7929 | 0.8504 | 0.9222 | 96.2091 | 91.1065 | 93.4464
berpengetahuan | 95.3239 | 6.0698 | 2.4637 | 109.5395 | 94.0278 | 101.2508
Total | 696.1512

4.7.2.4 Concatenation cost for Move 4

Table 4.8 Information of concatenation cost (Move 4) with temperature reduction rate α = 0.90

Word | Mean | Variance | Std. deviation | Initial solution | Best solution | Worst solution
nasi | 41.9025 | 4.0708 | 2.0176 | 51.9894 | 39.7428 | 44.8005
musim | 58.1836 | 2.3585 | 1.5357 | 66.3190 | 56.0629 | 59.9161
janji | 58.8172 | 0 | 0 | 65.8612 | 58.8172 | 58.8172
kampung | 50.6723 | 3.1929 | 1.7869 | 57.0421 | 46.5059 | 51.9677
vitamin | 62.0547 | 5.2145 | 2.2835 | 71.3736 | 59.5890 | 64.6396
demikian | 65.1283 | 2.6202 | 1.6187 | 66.3976 | 62.3619 | 66.3976
muktamad | 66.9015 | 0 | 0 | 69.2485 | 66.9015 | 66.9015
informasi | 94.5730 | 4.5584 | 2.1350 | 102.8064 | 91.1099 | 97.3322
selanjutnya | 91.3895 | 0.2920 | 0.5404 | 96.2091 | 90.7814 | 92.2531
berpengetahuan | 98.8900 | 10.1918 | 3.1925 | 109.5395 | 94.9008 | 104.8383
Total | 688.5126

From Table 4.5 to Table 4.8, the move that yields the smallest sum of the mean values over the 10 words is move 1. In other words, move 1 performs best among the four moves. The advantage of move 1 is its flexibility in swapping the phonemes: move 1 does not take the values of the local costs into consideration when deciding which phoneme to swap. In other words, move 1 has a varying neighbourhood size in every iteration. This advantage is significant when one or more local costs have high magnitudes compared to the others, because all phonemes have an equal chance of being swapped regardless of the magnitude of their local cost. Due to this advantage, move 1 performed markedly better than the other moves for the 10 words tested. Among the 10 words tested, every word except "berpengetahuan" has one or more local costs with relatively high magnitude compared to the others.
Therefore, move 2, move 3 and move 4 underperform in unit selection. The disadvantage of move 1 is that the best local cost may not be maintained in the next iteration because of its randomness.

Since move 1 has the best performance, various annealing schedules and Markov chain lengths were tested on move 1 to investigate its performance under different conditions.

Case 1: Markov chain length = 1.
Stopping criteria:
- final temperature Tf ≤ 0.1
- maximum number of iterations = 200
- number of non-improving moves = 0.5(200) = 100

Table 4.9 Information of concatenation cost with temperature reduction rate α = 0.95

Word | Mean | Variance | Std. deviation | Initial solution | Best solution | Worst solution
nasi | 39.777 | 1.5701 | 1.2531 | 51.9894 | 38.1209 | 40.5318
musim | 56.2531 | 0.3269 | 0.5718 | 66.3190 | 55.4287 | 57.1868
janji | 56.8056 | 1.7449 | 1.3209 | 65.8612 | 55.3994 | 59.2640
kampung | 43.9745 | 6.3534 | 2.5206 | 57.0421 | 41.1393 | 48.4903
vitamin | 56.2057 | 7.3691 | 2.7146 | 71.3736 | 51.1585 | 60.4980
demikian | 58.8773 | 1.2913 | 1.1363 | 66.3976 | 56.7844 | 60.6892
muktamad | 56.0925 | 3.0699 | 1.7521 | 69.2485 | 53.7636 | 58.2923
informasi | 92.3077 | 4.0464 | 2.0116 | 102.8064 | 89.8871 | 96.8190
selanjutnya | 87.7347 | 9.6309 | 3.1034 | 96.2091 | 80.8717 | 90.9065
berpengetahuan | 95.1154 | 6.2367 | 2.4973 | 109.5395 | 91.4759 | 98.8498
Total | 643.1435

Case 2: Markov chain length = 1.
Stopping criteria:
- final temperature Tf ≤ 0.1
- maximum number of iterations = 60
- number of non-improving moves = 0.5(60) = 30

Table 4.10 Information of concatenation cost with temperature reduction rate α = 0.85

Word | Mean | Variance | Std. deviation | Initial solution | Best solution | Worst solution
nasi | 41.1636 | 2.1525 | 1.4671 | 51.9894 | 39.2396 | 43.5353
musim | 57.4694 | 0.8764 | 0.9361 | 66.3190 | 56.4528 | 59.2198
janji | 58.1928 | 2.5889 | 1.6090 | 65.8612 | 55.3994 | 59.8581
kampung | 46.8516 | 5.5348 | 2.3526 | 57.0421 | 43.5470 | 50.3501
vitamin | 59.7214 | 13.1650 | 3.6284 | 71.3736 | 56.5820 | 66.2310
demikian | 60.5946 | 5.6822 | 2.3837 | 66.3976 | 57.5656 | 65.8790
muktamad | 55.3585 | 13.4849 | 3.6722 | 69.2485 | 53.4601 | 60.0769
informasi | 95.8624 | 10.2939 | 3.2084 | 102.8064 | 90.5174 | 100.3186
selanjutnya | 89.3682 | 9.5203 | 3.0855 | 96.2091 | 87.2320 | 91.7274
berpengetahuan | 97.6107 | 18.4947 | 4.3005 | 109.5395 | 87.0187 | 102.6322
Total | 662.1932

Case 3: Markov chain length = 1.
Stopping criteria:
- final temperature Tf ≤ 0.1
- maximum number of iterations = 50
- number of non-improving moves = 0.5(50) = 25

Table 4.11 Information of concatenation cost with temperature reduction rate α = 0.80

Word | Mean | Variance | Std. deviation | Initial solution | Best solution | Worst solution
nasi | 37.0614 | 4.1179 | 2.0293 | 51.9894 | 38.1209 | 44.0197
musim | 57.5359 | 0.8788 | 0.9374 | 66.3190 | 56.3752 | 59.0334
janji | 58.1860 | 4.5681 | 2.1373 | 65.8612 | 56.0573 | 61.5893
kampung | 47.8566 | 10.5943 | 3.2549 | 57.0421 | 43.8004 | 52.7271
vitamin | 60.5878 | 17.1983 | 4.1471 | 71.3736 | 55.4193 | 67.6045
demikian | 61.6131 | 4.8180 | 2.1950 | 66.3976 | 57.6987 | 64.9662
muktamad | 59.4896 | 23.0488 | 4.8009 | 69.2485 | 52.7343 | 65.3014
informasi | 94.5739 | 15.6052 | 3.9503 | 102.8064 | 88.3331 | 99.2731
selanjutnya | 89.8916 | 13.6941 | 3.7006 | 96.2091 | 88.2295 | 96.2091
berpengetahuan | 100.4161 | 24.3634 | 4.9359 | 109.5395 | 94.1026 | 106.1951
Total | 667.2120

From Table 4.5, Table 4.9, Table 4.10 and Table 4.11, move 1 performs best under the conditions Markov chain length = 1 and temperature reduction rate α = 0.95 (Table 4.9).
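Under the geometric cooling used in these cases, Tk+1 = αTk, the number of temperature reductions needed to fall from T0 to Tf is ceil(ln(Tf/T0)/ln α). The initial temperature below is an assumed value (the text does not state T0); it is chosen only to illustrate why slower rates need larger iteration budgets, as in the case definitions above.

```python
import math

def cooling_steps(t0, t_final, alpha):
    """Number of geometric-cooling reductions to bring t0 below t_final."""
    return math.ceil(math.log(t_final / t0) / math.log(alpha))

T0, TF = 100.0, 0.1    # T0 is an assumption; TF is the stated criterion
steps = {a: cooling_steps(T0, TF, a) for a in (0.95, 0.90, 0.85, 0.80)}
# Slower cooling (alpha closer to 1) needs more temperature steps, which is
# consistent with pairing alpha = 0.95 with the largest iteration cap.
```

With these assumed values, α = 0.95 needs 135 reductions while α = 0.80 needs only 31, matching the pattern of the per-case iteration limits.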
When the cooling is slow enough, the algorithm is able to reach thermal equilibrium at each temperature value, which helps it avoid being trapped in a local minimum. The solution quality obtained for the slow temperature reduction rate, α = 0.95, is better than for the faster rates α = 0.90, 0.85 and 0.80. When the cooling is too fast, the algorithm cannot reach thermal equilibrium at each temperature value and becomes trapped in a local minimum. Although the solution quality obtained with a slow temperature reduction rate is better, it also results in a slower convergence rate.

Case 4: Markov chain length = 2.
Stopping criteria:
- final temperature Tf ≤ 0.1
- maximum number of iterations = 400
- number of non-improving moves = 0.5(400) = 200

Table 4.12 Information of concatenation cost with temperature reduction rate α = 0.95

Word | Mean | Variance | Std. deviation | Initial solution | Best solution | Worst solution
nasi | 38.8704 | 0.7515 | 0.8669 | 51.9894 | 37.5301 | 39.8161
musim | 55.8377 | 0.1841 | 0.4291 | 66.3190 | 55.3601 | 56.3920
janji | 55.8641 | 3.7639 | 1.9401 | 65.8612 | 53.6312 | 58.9424
kampung | 43.3521 | 8.0114 | 2.8304 | 57.0421 | 38.0765 | 46.3593
vitamin | 56.0249 | 1.7922 | 1.3387 | 71.3736 | 53.0617 | 58.3926
demikian | 58.3768 | 1.9648 | 1.4017 | 66.3976 | 55.9996 | 60.1783
muktamad | 53.4289 | 3.2450 | 1.8014 | 69.2485 | 50.1260 | 56.1836
informasi | 90.1820 | 2.2295 | 1.4932 | 102.8064 | 88.5677 | 91.6438
selanjutnya | 88.6255 | 8.6738 | 2.9451 | 96.2091 | 84.8192 | 93.8592
berpengetahuan | 91.6801 | 5.8925 | 2.4274 | 109.5395 | 86.6432 | 94.6761
Total | 632.2425

Case 5: Markov chain length = 2.
Stopping criteria:
- final temperature Tf ≤ 0.1
- maximum number of iterations = 200
- number of non-improving moves = 0.5(200) = 100

Table 4.13 Information of concatenation cost with temperature reduction rate α = 0.90

Word | Mean | Variance | Std. deviation | Initial solution | Best solution | Worst solution
nasi | 39.5469 | 2.6193 | 1.6184 | 51.9894 | 37.7650 | 42.6251
musim | 56.0799 | 0.4742 | 0.6886 | 66.3190 | 55.4613 | 57.3321
janji | 56.3270 | 3.1464 | 1.7738 | 65.8612 | 53.6312 | 58.4735
kampung | 42.6240 | 5.1535 | 2.2701 | 57.0421 | 39.1755 | 47.1570
vitamin | 56.8142 | 1.2225 | 1.1057 | 71.3736 | 55.1818 | 58.8009
demikian | 58.6278 | 1.6375 | 1.2800 | 66.3976 | 56.7614 | 60.2722
muktamad | 53.2059 | 4.9892 | 2.2336 | 69.2485 | 50.1260 | 57.1076
informasi | 90.2507 | 2.9655 | 1.7221 | 102.8064 | 87.8247 | 92.9634
selanjutnya | 89.7432 | 7.9122 | 2.8129 | 96.2091 | 85.8571 | 94.0317
berpengetahuan | 94.0035 | 2.4743 | 1.5730 | 109.5395 | 91.6833 | 95.2747
Total | 637.2231

Case 6: Markov chain length = 2.
Stopping criteria:
- final temperature Tf ≤ 0.1
- maximum number of iterations = 150
- number of non-improving moves = 0.5(150) = 75

Table 4.14 Information of concatenation cost with temperature reduction rate α = 0.85

Word | Mean | Variance | Std. deviation | Initial solution | Best solution | Worst solution
nasi | 39.6497 | 2.7807 | 1.6675 | 51.9894 | 37.1871 | 42.3749
musim | 56.7848 | 0.6914 | 0.8315 | 66.3190 | 55.5627 | 58.0270
janji | 56.6454 | 4.2224 | 2.0549 | 65.8612 | 53.6312 | 59.5977
kampung | 46.9636 | 9.0054 | 3.0009 | 57.0421 | 43.4254 | 51.5412
vitamin | 57.7452 | 9.0776 | 3.0129 | 71.3736 | 52.7207 | 62.6273
demikian | 59.0589 | 2.1474 | 1.4654 | 66.3976 | 55.7617 | 60.9374
muktamad | 56.8856 | 3.5898 | 1.8947 | 69.2485 | 54.4198 | 59.7742
informasi | 90.0772 | 1.6477 | 1.2836 | 102.8064 | 88.4301 | 92.3604
selanjutnya | 89.7157 | 9.7443 | 3.1216 | 96.2091 | 85.8864 | 94.0497
berpengetahuan | 95.1791 | 8.5402 | 2.9224 | 109.5395 | 91.0962 | 99.3100
Total | 648.6352

Case 7: Markov chain length = 2.
Stopping criteria:
- final temperature Tf ≤ 0.1
- maximum number of iterations = 100
- number of non-improving moves = 0.5(100) = 50

Table 4.15 Information of concatenation cost with temperature reduction rate α = 0.80

Word | Mean | Variance | Std. deviation | Initial solution | Best solution | Worst solution
nasi | 39.3888 | 0.9839 | 0.9919 | 51.9894 | 38.5778 | 41.6367
musim | 56.9937 | 0.6505 | 0.8065 | 66.3190 | 55.7416 | 58.2279
janji | 56.5346 | 1.9137 | 1.3834 | 65.8612 | 53.6312 | 58.6272
kampung | 45.5048 | 11.1886 | 3.3449 | 57.0421 | 40.5324 | 51.5542
vitamin | 58.0246 | 10.0880 | 3.1762 | 71.3736 | 53.4715 | 62.9814
demikian | 61.0635 | 3.9300 | 1.9824 | 66.3976 | 59.1747 | 64.9493
muktamad | 56.9372 | 6.8791 | 2.6228 | 69.2485 | 53.8201 | 62.4109
informasi | 93.8204 | 8.5837 | 2.9298 | 102.8064 | 87.9695 | 97.4976
selanjutnya | 92.7170 | 11.3913 | 3.3751 | 96.2091 | 86.2980 | 96.2091
berpengetahuan | 99.4340 | 21.0200 | 4.5847 | 109.5395 | 92.7515 | 105.3478
Total | 660.4186

According to Triki et al. (2005), the probability distribution is closer to the quasi-equilibrium for a longer Markov chain. In Table 4.12 to Table 4.15 the length of the Markov chain is 2. The solution quality obtained for Markov chain length 2 is better than for Markov chain length 1 for all four temperature reduction rates. Therefore, move 1 performs best under the longer Markov chain length of 2 and the slow temperature reduction rate α = 0.95, as shown in Table 4.12. The best solution, mean and worst solution from Table 4.12 are plotted in Figure 4.8.

Figure 4.8 SA best solutions, means and worst solutions for the ten problems from Table 4.12.

4.8 Concatenation

The concatenation of waveforms to form the target word utterances is based on the result in Table 4.12, since it yields the smallest sum of the mean values over the 10 words. The unit sequences of the 10 words from Table 4.12 are presented in Table 4.16.

Table 4.16 The sequences of the 10 selected words.
Word | Sequence
nasi | _n[1] a[1084] s[246] i[805]
musim | _m[26] u[528] s[31] i[929] m[478]
janji | _j[6] a[2943] n[1053] j[131] i[644]
kampung | _k[5] a[549] m[407] p[168] u[664] ng[26]
vitamin | _v[1] i[784] t[89] a[2675] m[235] i[691] n[243]
demikian | _d[13] e[425] m[336] i[831] k[491] ia[46] n[956]
muktamad | _m[26] u[292] k[486] t[397] a[83] m[395] a[968] d[243]
informasi | _i[1] n[948] f[22] o[129] r[655] m[440] a[2826] s[305] i[442]
selanjutnya | _s[1] e[537] l[362] a[2710] n[1031] j[7] u[206] t[142] ny[1] a[2060]
berpengetahuan | _b[1] e[1426] r[30] p[172] e[1013] ng[347] e[1028] t[595] a[661] h[221] ua[65] n[245]

Figure 4.9 Waveform "_s1"
Figure 4.10 Waveform "e537"
Figure 4.11 Waveform "l362"
Figure 4.12 Waveform "a2710"
Figure 4.13 Waveform "n1031"
Figure 4.14 Waveform "j7"
Figure 4.15 Waveform "u206"
Figure 4.16 Waveform "t142"
Figure 4.17 Waveform "ny1"
Figure 4.18 Waveform "a2060"
Figure 4.19 Concatenated waveform for the word "selanjutnya".
Figure 4.20 Spectrogram for the word "nasi".
Figure 4.21 Spectrogram for the word "berpengetahuan".
Figure 4.22 Spectrogram for the word "demikian".
Figure 4.23 Spectrogram for the word "demikian" without considering concatenation cost.

Figure 4.9 to Figure 4.18 show the waveforms for the phonemes "_s", "e", "l", "a", "n", "j", "u", "t", "ny" and "a" respectively. Figure 4.19 is the waveform for the word "selanjutnya" after concatenating the waveforms from Figure 4.9 to Figure 4.18. Figure 4.20, Figure 4.21 and Figure 4.22 are the spectrograms obtained for the words "nasi", "berpengetahuan" and "demikian"; these figures correspond to the speech unit sequences in Table 4.16. Figure 4.23 is the spectrogram for the word "demikian" that does not consider the concatenation cost; it only considers the matching of the left and right phonetic context. The circled red parts in Figure 4.22 and Figure 4.23 are zoomed in and presented in Figure 4.24 and Figure 4.25 respectively.
Figure 4.24 Spectrogram zoom-in for the word "demikian" from Figure 4.22.
Figure 4.25 Spectrogram zoom-in for the word "demikian" from Figure 4.23.

The spectrogram that considers the concatenation cost (Figure 4.24) shows a better join condition: the red and green lines in Figure 4.24 are straight, which indicates a smooth join between two segments during synthesis. In the spectrogram that does not consider the concatenation cost (Figure 4.25), the line is curved, which indicates discontinuities, or spectral mismatch, between the concatenated units.

CHAPTER 5

TESTING, ANALYSIS AND RESULT

5.1 Experiment

A formal listening test was conducted to evaluate the output sound. The following sections describe the test materials, test conditions, test procedure, listeners and the statistical analysis of the results.

5.2 Test Materials

Ten words were selected for the listening test. These are the words formed by the unit selection system after concatenation, and they range from 4 to 12 phonemes.

5.3 Test Conditions

The ten words selected from the unit selection system were synthesized under the following conditions:
- Units: Phoneme
- Feature extraction: Mel Frequency Cepstral Coefficients
- Spectral distance: Euclidean distance

5.4 Test Procedure

The test was carried out in two parts. The first part consists of the ten words selected from the unit selection system. Listeners could take as long as they pleased over each word and take a short break between words. The first part of the listening test assesses the intelligibility of the synthesized words. Listeners were requested to play the sound file more than once for each word and to write down what they heard. The second part of the listening test is a mean opinion score (MOS): listeners were requested to play and listen to the output sounds and tick the words they think are better in terms of naturalness.
In this part, there are a total of two sound files for each word, labelled "a" and "b". Sound files "a" are the ten words selected from the unit selection system. Sound files "b" are the same ten words selected from the unit selection system but without considering the concatenation cost. The purpose of this part is to test the naturalness of the synthesized words.

5.5 Profiles of Listeners

A total of 15 listeners participated in the listening test. They come from different backgrounds, genders, races and states of origin. Table 5.1 shows the profiles of the listeners.

Table 5.1 Profiles of listeners

Category | Number of participants | Percentage
Gender: Male | 8 | 53.33%
Gender: Female | 7 | 46.67%
Race: Malay | 6 | 40%
Race: Chinese | 8 | 53.33%
Race: Others | 1 | 6.67%
State: Johor | 5 | 33.33%
State: Pulau Pinang | 2 | 13.33%
State: Perak | 2 | 13.33%
State: Kedah | 1 | 6.67%
State: Kuala Lumpur | 1 | 6.67%
State: Selangor | 1 | 6.67%
State: Sarawak | 2 | 13.33%
State: Pahang | 1 | 6.67%
Malay as first spoken language: Yes | 6 | 40%
Malay as first spoken language: No | 9 | 60%

5.5.1 Percentage of Listeners by Gender

A total of fifteen listeners (six Malay, eight Chinese and one other) participated in the listening test. The ages of the listeners ranged between 22 and 34, with a mean age of 25 years. All participants were speakers of the Malay language with no hearing loss. Figure 5.1 shows the percentage of listeners by gender.

Figure 5.1 Percentage of listeners by gender.

5.5.2 Percentage of Listeners by Race

Three races were represented among the listeners: Malay (40%), Chinese (53.33%) and others (6.67%). Figure 5.2 shows the percentage of listeners by race.

Figure 5.2 Percentage of listeners by race.

5.5.3 Percentage of Listeners by State of Origin

The listeners come from eight different states.
These states are Johor (33.33%), Pulau Pinang (13.33%), Perak (13.33%), Kedah (6.67%), Kuala Lumpur (6.67%), Selangor (6.67%), Sarawak (13.33%) and Pahang (6.67%). Figure 5.3 shows the percentage of listeners by state of origin.

Figure 5.3 Percentage of listeners by state of origin.

5.6 Result and Analysis

5.6.1 Word Level Testing

This section covers word level testing, which assesses the intelligibility of the synthesized words. The listeners were required to write down what they heard for all 10 words listed in Table 5.2.

Table 5.2 Words selected for the listening test.

Number | Word | No. of phonemes
1 | nasi | 4
2 | musim | 5
3 | janji | 5
4 | kampung | 6
5 | vitamin | 7
6 | demikian | 7
7 | muktamad | 8
8 | informasi | 9
9 | selanjutnya | 10
10 | berpengetahuan | 12

All the listeners wrote the correct answers for all the selected words except word 2, "musim". Six participants wrote a wrong answer for word 2; they were confused about the pronunciation of the first phoneme, "_m". Five participants wrote "busim" and one participant wrote "pusim". Therefore, the intelligibility rate of the 10 selected synthesized words is 96%. Figure 5.4 shows the level of intelligibility of the 10 selected words.

Figure 5.4 Level of intelligibility of the 10 selected words.

5.6.2 Mean Opinion Score

This section of the listening test is the mean opinion score. Its objective is to compare the naturalness of the synthesized words with and without considering the concatenation cost. Figure 5.5 shows the results of the mean opinion score. From Figure 5.5, 13 out of 15 listeners rate the synthesized words that consider the concatenation cost as sounding better. One listener rates the two categories of synthesized words as equal in quality, while one participant states that the synthesized words that do not consider the concatenation cost sound better.
Therefore, the naturalness rate of the 10 selected synthesized words is 92.86%. Table 5.3 shows the score line of the synthesized words that consider the concatenation cost.

Figure 5.5 Results of the mean opinion score.

Table 5.3 The score line of the synthesized words considering the concatenation cost.

Score | Frequency
9/10 | 1
8/10 | 2
7/10 | 6
6/10 | 4
5/10 | 1
3/10 | 1

Table 5.4 The scores of the 10 synthesized words considering the concatenation cost.

Word | Score
nasi | 7/15
musim | 10/15
janji | 15/15
kampung | 12/15
vitamin | 8/15
demikian | 15/15
muktamad | 4/15
informasi | 3/15
selanjutnya | 12/15
berpengetahuan | 13/15

From Table 5.4, two words have a perfect score: "janji" and "demikian". Five words score 8 or more: "musim", "kampung", "selanjutnya", "vitamin" and "berpengetahuan". However, three words do not score well: "nasi", "muktamad" and "informasi". The reason is that spectral mismatch exists in these words, especially at the first and last phonemes. No left phonetic context exists for the first phoneme, and no right phonetic context exists for the last phoneme, so the first and last phonemes match the phonetic context only partially. The quality of the synthesized words is therefore degraded by spectral mismatch at the first and last phonemes.

CHAPTER 6

CONCLUSION AND RECOMMENDATION

6.1 Conclusion

This dissertation has reviewed related methods and algorithms for unit selection, and a first version of unit selection using Simulated Annealing for a corpus-based Malay Text-to-Speech system has been implemented. The main purpose of this dissertation is to select the speech unit sequences that yield the lowest concatenation cost. The system has achieved its aim of improving the speech quality obtained from the first version of the corpus-based Malay Text-to-Speech system.
The system's level of concatenation is phoneme based, using variable-length unit selection. The speech units were selected using corpus-based unit selection; the corpus contains 381 sentences and 16826 phonemes with a total size of 37.6 MB. The storage format for the corpus-based Malay Text-to-Speech system is wave, and the sampling frequency is 16 kHz. In order to produce high-quality output speech, the lowest overall cost is a must, since it minimizes contextual differences and spectral discontinuities. Unit selection is based on two cost functions, the target cost and the concatenation cost. Since the target cost and the concatenation cost have different relative importance in the whole cost function, tuning the weights is an important stage in the design of the selection algorithm (Díaz and Banga, 2006). However, results show that no set of weights gives consistent performance across all (or almost all) of the sets (Díaz and Banga, 2006).

The target cost can be further divided into two types, phonetic target costs and prosodic target costs. The phonetic target cost (Zhao et al., 2006) contains sub-costs for the left phone context and the right phone context. In the proposed method, only the phonetic target cost is employed. The contextual linguistic information (target cost) is used as a filtering tool: only the speech units that match the left and right phonetic context can be chosen. For the first phoneme, only a match of the right phonetic context is required, and for the last phoneme, only a match of the left phonetic context. The retained candidate units are used as the input for Simulated Annealing. Since the computation of the target cost is not included in the cost function for unit selection, the cost function is left with only the concatenation cost.
Therefore, an advantage of the proposed method is that weight tuning is not required for the cost function in unit selection, since there is only one component in the cost function, namely the concatenation cost. The retained candidate units then undergo concatenation cost minimization. The features included in the concatenation cost calculation were MFCC-type coefficients that parameterize the borders of the speech units in the corpus. The concatenation cost is the distortion between these parameters of two adjacent candidate units (Zhao et al., 2006). Two stages are involved in the computation of a distance measure: feature extraction and quantifying the difference. A speech unit sequence that yields a smaller concatenation cost gives a better join condition at the concatenation points and thus produces better speech quality. To calculate the concatenation cost, feature extraction (MFCC) is performed for all the speech units, which are transformed into 12-dimensional MFCCs. The distance measure used for the concatenation cost is the Euclidean distance. Higher concatenation costs predict audible discontinuities, so such sequences are less likely to be selected.

The search for the minimum-cost sequence is solved using SA. Four different types of moves are used to obtain the neighbourhood solution. The moves can be divided into two categories: moves that swap a phoneme without regard to the magnitude of the local cost, and moves that swap a phoneme based on the magnitude of the local cost. In this research, the former type performed better than the latter, since the latter has the weakness of becoming trapped in a local minimum. For the annealing schedule, four different temperature reduction rates were used in this research. The slower temperature reduction rates performed better than the faster ones.
For the length of the Markov chain, two different lengths were used in this research: reducing the temperature in every iteration and reducing the temperature after every two iterations. The latter approach, which has the longer Markov chain length, performed better than the shorter one. The SA implemented in this research therefore has high robustness, since its performance is not unduly sensitive to small changes in the parameter settings.

The contribution of the concatenation cost was evaluated by conducting a listening test. Ten different Malay words were selected for the listening test; they can be divided into three categories based on the number of phonemes in each word. The listening test was conducted to validate the ability of the concatenation cost to predict spectral discontinuity, and it justified the contribution of the concatenation cost in unit selection. Therefore, Simulated Annealing is a suitable method for unit selection, since it has contributed to improving the speech quality by selecting the best speech unit sequence within reasonable computational time and effort.

6.2 Suggestion for Future Work

In this dissertation, the selected speech feature is MFCCs and the spectral distance used is the Euclidean distance. The suggestions for future work, aimed at investigating which combination better predicts spectral discontinuities, are as follows:

1. Implementation of unit selection using the Euclidean distance with Linear Predictive Coefficients (LPC).
2. Implementation of unit selection using the Kullback-Leibler distance with MFCCs.
3. Implementation of unit selection using the Mahalanobis distance with MFCCs.
4. Comparison of the performance of the three combinations above.

For the heuristic method, the move used in Simulated Annealing needs to be enhanced. The decision of which phonemes to swap should be based on the total number of candidate units for each particular phoneme, not the magnitude of the local cost.
In other words, a phoneme with a large number of candidate units should be swapped more frequently. Different annealing schedules and Markov chain lengths can also be investigated in future work. Other heuristic methods, such as the Genetic Algorithm and Tabu Search, can also be applied to unit selection. Hybridization of these heuristic methods, and comparison of their performance, can be performed in the future. The two meta-heuristic methods SA and GA can be hybridized to yield a more effective algorithm. One possible approach is to replace the crossover and mutation processes of GA with SA operators (Hwang and He, 2006). This approach maintains the advantages and avoids the disadvantages of both search algorithms. Thanks to the special characteristics of SA, this hybrid algorithm has a better fine-tuning ability for searching for the global optimum and a stronger hill-climbing ability for escaping from local minima than the standard GA (Hwang and He, 2006).

REFERENCES

Ali, M. M., Törn, A. and Viitanen, S. (2002). A direct search variant of the simulated annealing algorithm for optimization involving continuous variables. Computers & Operations Research. 29(1), 87-102.
Atkinson, A. C. (1992). A segmented algorithm for simulated annealing. Statistics and Computing. 2, 221-230.
Bellegarda, J. R. (2008). Unit-Centric Feature Mapping for Inventory Pruning in Unit Selection Text-to-Speech Synthesis. Audio, Speech, and Language Processing, IEEE Transactions. Jan. 2008. 74-82.
Blouin, C., Rosec, O., Bagshaw, P. C. and Alessandro, C. (2002). Concatenation cost calculation and optimisation for unit selection in TTS. Proceedings of 2002 IEEE Workshop on Speech Synthesis. 11-13 September. Santa Monica, USA, 231-234.
Campbell, W. N. and Black, A. W. (1997). Prosody and the selection of source units for concatenative synthesis. In: Van Santen, J.P.H., Sproat, R.W., Olive, J.P., Hirschberg, J.
(Eds.), Progress in Speech Synthesis. Springer. New York, 279-292.
Cepko, J., Talafova, R. and Vrabec, J. (2008). Indexing join costs for faster unit selection synthesis. Systems, Signals and Image Processing, 2008. IWSSIP 2008. 15th International Conference. 25-28 June. Bratislava, Slovakia, 503-506.
Chappell, D. T. and Hansen, J. H. L. (2002). A comparison of spectral smoothing methods for segment concatenation based speech synthesis. Speech Communication. 36(3-4), 343-373.
Cheh, K. M., Goldberg, J. B. and Askin, R. G. (1991). A note on the effect of neighbourhood structure in simulated annealing. Computers and Operations Research. 18, 537-547.
Chen, T. Y. and Su, J. J. (2002). Efficiency improvement of simulated annealing in optimal structural designs. Advances in Engineering Software. 33(7-10), 675-680.
Chou Fu-chiang (1999). Corpus-based Technologies for Chinese Text-To-Speech Synthesis. Ph.D. Dissertation. Department of Electrical Engineering, National Taiwan University, ROC.
Chou, F. C. and Tseng, C. Y. (1998). Corpus-based Mandarin speech synthesis with contextual syllabic units based on phonetic properties. Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing. 12-15 May. Seattle, WA, 893-896.
Daniel Rex Greening (1995). Simulated Annealing with Errors. Ph.D. Thesis. University of California, Los Angeles.
Díaz, F. C. and Banga, E. R. (2006). A method for combining intonation modelling and speech unit selection in corpus-based speech synthesis systems. Speech Communication. 48(8), 941-956.
Ding, W. K. F. and Campbell, N. (1998). Improving speech synthesis of CHATR using a perceptual discontinuity function and constraints of prosodic modification. Proc. 3rd ESCA/COCOSDA International Workshop on Speech Synthesis. Nov. 1998. Jenolan Caves, Australia, 191-194.
Dong, M. H. and Li, H. Z. (2008). Predicting Spectral and Prosodic Parameters for Unit Selection in Speech Synthesis. Chinese Spoken Language Processing.
ISCSLP '08. 6th International Symposium. 16-19 December. 1-4.
Donovan, R. E. (2001). A new distance measure for costing spectral discontinuities in concatenative speech synthesizers. The 4th ISCA Tutorial and Research Workshop on Speech Synthesis. Perthshire, Scotland, 59-62.
Donovan, R. E. (2003). Topics in decision tree based speech synthesis. Computer Speech & Language. 17(1), January 2003, 43-67.
Durand, M. D. and White, S. R. (2000). Trading accuracy for speed in parallel simulated annealing with simultaneous moves. Parallel Computing. 26(1), 135-150.
Farid, M. O. (1980). Aspects of Malay Phonology and Morphology. Bangi: Universiti Kebangsaan Malaysia.
Fek, M., Pesti, P., Nemeth, G., Zainko, C. and Olaszy, G. (2006). Text, Speech and Dialogue, 9th International Conference, Lecture Notes in Computer Science. 11-15 September. Czech Republic, 367-374.
Ghazanfari, M., Alizadeh, S., Fathian, M. and Koulouriotis, D. E. (2007). Comparing simulated annealing and genetic algorithm in learning FCM. Applied Mathematics and Computation. 192(1), 56-68.
Gao, M. and Tian, J. (2007). Path Planning for Mobile Robot Based on Improved Simulated Annealing Artificial Neural Network. Natural Computation, 2007. ICNC 2007. Third International Conference. 24-27 August. Haikou, China, 8-12.
Goldstein, L. and Waterman, M. (1988). Neighborhood size in the simulated annealing algorithm. American Journal of Mathematical and Management Sciences. 8, 409-423.
Hajek, B. (1988). Cooling schedules for optimal annealing. Mathematics of Operations Research. 13(2), 311-329.
Hamza, W., Rashwan, M. and Afify, M. (2001). Quantitative method for modeling context in concatenative synthesis using large speech database. Acoustics, Speech, and Signal Processing, 2001. Proceedings. (ICASSP '01). 2001 IEEE International Conference. 7-11 May. Salt Lake City, Utah, USA, 789-792.
Hasan, M., AlKhamis, T. and Ali, J. (2000).
A comparison between simulated annealing, genetic algorithm and tabu search methods for the unconstrained quadratic Pseudo-Boolean function. Computers & Industrial Engineering. 38(3), 323-340.
Hasim, S., Tunga, G. and Yasar, S. (2006). A Corpus-Based Concatenative Speech Synthesis System for Turkish. Turkish Journal of Electrical Engineering & Computer Sciences. 14(2), 209-223.
Hirai, T. and Tenpaku, S. (2004). Using 5 ms segments in concatenative speech synthesis. Fifth ISCA ITRW on Speech Synthesis. 16 June. Pittsburgh, PA, USA, 37-42.
Hirai, T., Tenpaku, S. and Shikano, K. (2002). Speech unit selection based on target values driven by speech data in concatenative speech synthesis. Speech Synthesis, 2002. Proceedings of 2002 IEEE Workshop. 11-13 September. Santa Monica, USA, 43-46.
Huang, M. D., Romeo, F. and Sangiovanni-Vincentelli, A. L. (1986). An efficient general cooling schedule for simulated annealing. In: Proceedings of the IEEE International Conference on Computer-Aided Design. Santa Clara, 381-384.
Hunt, A. and Black, A. (1996). Unit selection in a concatenative speech synthesis system using large speech database. Proc. Int. Conf. Acoust., Speech, Signal Process. Atlanta, GA, 373-376.
Hwang, S. F. and He, R. S. (2006). Improving real-parameter genetic algorithm with simulated annealing for engineering problems. Advances in Engineering Software. 37(6), 406-418.
Jan, V. S., Alexander, K., Esther, K. and Taniya, M. (2005). Synthesis of prosody using multi-level unit sequences. Speech Communication. 46(3-4), 365-375.
Janicki, A., Meus, P. and Topczewski, M. (2008). Taking advantage of pronunciation variation in unit selection speech synthesis for polish. Communications, Control and Signal Processing, 2008. ISCCSP 2008. 3rd International Symposium. 12-14 March. St. Julians, 1133-1137.
Jeon, Y. J. and Kim, J. C. (2004). Application of simulated annealing and tabu search for loss minimization in distribution systems.
International Journal of Electrical Power & Energy Systems. 26(1), 9-18.
Jeong, C. S. and Kim, M. H. (1990). Fast parallel simulated annealing for traveling salesman problem. Neural Networks, 1990. IJCNN International Joint Conference. 17-21 June. Washington, D.C., 947-953.
John, R. D., Jr., John, G. P. and John, H. L. H. (1993). Discrete-Time Processing of Speech Signals. New York: Macmillan Publishing Company.
Kawai, H. and Tsuzaki, M. (2002). Acoustic measures vs. phonetic features as predictors of audible discontinuity in concatenative speech synthesis. Proc. ICSLP. September 2002. Denver, U.S.A., 2621-2624.
Khor Ai Peng (2007). Implementation of Unit Selection by Using Euclidean Distance in Malay Text To Speech. Bachelor Degree Thesis. Universiti Teknologi Malaysia, Skudai.
Kirkpatrick, B. and O'Brien, D. S. R. (2006). A Comparison of Spectral Continuity Measures as a Join Cost in Concatenative Speech Synthesis. Irish Signals and Systems Conference, 2006. IET. 28-30 June. Dublin, Ireland, 515-520.
Kirkpatrick, S., Gelatt, C. D. and Vecchi, M. P. (1983). Optimization by Simulated Annealing. Science. 220, 671-680.
Klabbers, E. and Veldhuis, R. (2001). Reducing audible spectral discontinuities. IEEE Transactions on Speech and Audio Processing. 9(1), 39-51.
Klabbers, E., Veldhuis, R. and Koppen, K. (2000). A solution to the reduction of concatenation artifacts in speech synthesis. Proc. ICSLP. 16-20 October. Beijing, China, 35-62.
Koulamas, C., Antony, S. R. and Jaen, R. (1994). A survey of simulated annealing applications to operations research problems. Omega. 22, 41-56.
Liu, J. (1999). The impact of neighbourhood size on the process of simulated annealing: Computational experiments on the flowshop scheduling problem. Computers & Industrial Engineering. 37(1-2), 285-288.
Luis, M. T. (1997). Speech Coding and Synthesis Using Parametric Curves. Master Thesis. University of East Anglia.
Lundy, M. and Mees, A. (1986).
Convergence of an annealing algorithm. Mathematical Programming. 34, 111-124.
Manuel, D. A. (1997). Constructing efficient simulated annealing algorithms. Discrete Applied Mathematics. 77(2), 139-159.
McGookin, E. W. and Murray-Smith, D. J. (2006). Submarine manoeuvring controllers' optimisation using simulated annealing and genetic algorithms. Control Engineering Practice. 14(1), 1-15.
McGookin, E. W., Murray-Smith, D. J. and Li, Y. (1996). Segmented simulated annealing applied to sliding mode controller design. Proceedings of the 13th World Congress of IFAC. San Francisco, USA, 333-338.
Min, C., Hu, P., Hong, Y. and Chang, E. (2001). Selecting non-uniform units from a very large corpus for concatenative speech synthesizer. Acoustics, Speech, and Signal Processing, 2001. Proceedings. (ICASSP '01). 2001 IEEE International Conference. 7-11 May. Salt Lake City, UT, USA, 785-788.
Möbius, B. (2000). Corpus-Based Speech Synthesis: Methods and Challenges. Arbeitspapiere des Instituts für Maschinelle Sprachverarbeitung. 6(4), 87-116.
Nader, A. and Saeed, Z. (2004). Adaptive temperature control for simulated annealing: a comparative study. Computers & Operations Research. 31(14), 2439-2451.
Nagy, A., Pesti, P., Németh, G. and Bőhm, T. (2005). Design Issues of a Corpus-Based Speech Synthesizer. Hungarian Journal on Communications. 6, 18-24.
Nik, S. K., Farid, M. O. and Hashim, M. (1989). Tatabahasa Dewan: Perkataan. Kuala Lumpur: Dewan Bahasa Dan Pustaka.
Nishizawa, N. and Kawai, H. (2006). A Short-Latency Unit Selection Method with Redundant Search for Concatenative Speech Synthesis. Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference. 14-19 May. Toulouse, France, I-I.
Nishizawa, N. and Kawai, H. (2008). Unit database pruning based on the cost degradation criterion for concatenative speech synthesis. Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference. 31 March-4 April.
Las Vegas, Nevada, U.S.A., 3969-3972.
Onn, H. M. (1993). Binaan dan Fungsi Perkataan dalam Bahasa Melayu: Suatu Huraian dari Sudut Tatabahasa Generatif. Kuala Lumpur: Dewan Bahasa Dan Pustaka.
Otten, R. H. J. M. and van Ginneken, L. P. P. P. (1984). Floor plan design using simulated annealing. In: Proceedings of the IEEE International Conference on Computer-Aided Design. Santa Clara, 96-98.
Pantelides, C. P. and Tzan, S. R. (2000). Modified iterated simulated annealing algorithm for structural synthesis. Advances in Engineering Software. 31(6), 391-400.
Piits, L., Mihkla, M., Nurk, T. and Kiisel, I. (2007). Designing a Speech Corpus for Estonian Unit Selection Synthesis. Proceedings of the 16th Nordic Conference of Computational Linguistics NODALIDA-2007. 24-26 May. Tartu.
Prudon, R., Alessandro, C. and Mareuil, P. B. (2002). Prosody synthesis by unit selection and transplantation on diphones. Speech Synthesis. Proceedings of 2002 IEEE Workshop. 11-13 September. Santa Monica, USA, 119-122.
Qing, G., Bin, W. and Katae, N. (2008). Speech Database Compacted for an Embedded Mandarin TTS System. Chinese Spoken Language Processing, 2008. ISCSLP '08. 6th International Symposium. 16-19 December. Kunming, China, 1-4.
Rabiner, L. R. and Juang, B. H. (1993). Fundamentals of Speech Recognition. Second ed. Prentice-Hall, Englewood Cliffs, NJ.
Raminah, S. and Rahim, S. (1987). Kajian Bahasa untuk Pelatih Maktab Perguruan. 8th ed. Petaling Jaya: Penerbit Fajar Bakti Sdn. Bhd.
Robert, A. J. C., Korin, R. and Simon, K. (2007). Multisyn: Open-domain unit selection for the Festival speech synthesis system. Speech Communication. 49(4), 317-330.
Rose, J., Klebsch, W. and Wolf, J. (1990). Temperature measurement and equilibrium dynamics of simulated annealing placements. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions. March 1990. 253-259.
Rowden, C. (1992). Speech Processing. UK: McGraw-Hill, Inc.
Sagisaka, Y. (1994).
Recent advances in Japanese speech synthesis research. International Symposium on Speech, Image Processing and Neural Networks. 13-16 April 1994. Hong Kong, 146-150.
Sakai, S., Kawahara, T. and Nakamura, S. (2008). Admissible stopping in viterbi beam search for unit selection in concatenative speech synthesis. Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference. 31 March-4 April. Las Vegas, Nevada, U.S.A., 4613-4616.
Sarathy, K. P. and Ramakrishnan, A. G. (2008). A research bed for unit selection based text to speech synthesis. Spoken Language Technology Workshop. 15-19 December. Goa, India, 229-232.
Schwarz, D. (2007). Corpus-Based Concatenative Synthesis. IEEE Signal Processing Magazine. 24(2), 92-104.
Stylianou, Y. and Syrdal, A. K. (2001). Perceptual and objective detection of discontinuities in concatenative speech synthesis. Proc. ICASSP. May 2001. Salt Lake City, U.S.A., 837-840.
Tan Tian Swee (2003). The Design and Verification of Malay Text to Speech Synthesis System. Master Thesis. Universiti Teknologi Malaysia, Skudai.
Tan, T. S. and Sheikh, H. (2008a). Corpus Design for Malay Corpus-based Speech Synthesis System. American Journal of Applied Sciences. 6(4), 696-702. ISSN 1546-9239.
Tan, T. S. and Sheikh, H. (2008b). Corpus-based Malay text-to-speech synthesis system. APCC 2008. 14th Asia-Pacific Conference on Communications, 2008. 14-16 October. 1-5.
Tan, T. S. and Sheikh, H. (2008c). Implementation of Phonetic Context Variable Length Unit Selection Module for Malay Text to Speech. Science Publications. Journal of Computer Science. 4(7), 550-556. ISSN 1549-3636.
Tan, T. S. and Sheikh, H. (2003). Building Malay TTS Using Festival Speech Synthesis System. Conference of The Malaysia Science and Technology, MSTC 2002. September 2-3. Johor Bahru, Malaysia, 120.
Tan, T. S., Sheikh, H. and Hussain, A. (2003). Building Malay Diphone Database for Malay Text to Speech Synthesis System Using Festival Speech Synthesis System.
Proc. of The International Conference on Robotics, Vision, Information and Signal Processing (ROVISP 2003). 22-24 January 2003. 634-648.
Tan Tian Swee (2009). Corpus-based Malay Text-To-Speech Synthesis System. Ph.D. Thesis. Universiti Teknologi Malaysia, Skudai.
Taylor, P., Black, A. and Caley, R. (1999). Festival Speech Synthesis System: system documentation (1.4.0). Human Communication Research Centre Technical Report. HCRC/TR, 83-202.
Toda, T. (2003). High-Quality and Flexible Speech Synthesis with Segment Selection and Voice Conversion. Doctoral Thesis. Nara Institute of Science and Technology.
Toda, T., Kawai, H., Tsuzaki, M. and Shikano, K. (2006). An evaluation of cost functions sensitively capturing local degradation of naturalness for segment selection in concatenative speech synthesis. Speech Communication. 48(1), 45-56.
Triki, E., Collette, Y. and Siarry, P. (2005). A theoretical study on the behavior of simulated annealing leading to a new cooling schedule. European Journal of Operational Research. 166(1), 77-92.
Tsiakoulis, P., Chalamandaris, A., Karabetsos, S. and Raptis, S. (2008). A statistical method for database reduction for embedded unit selection speech synthesis. Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference. 31 March-4 April. Las Vegas, Nevada, U.S.A., 4601-4604.
Turgut, D., Turgut, B., Elmasri, R. and Le, T. V. (2003). Optimizing clustering algorithm in mobile ad hoc networks using simulated annealing. Wireless Communications and Networking, 2003. WCNC 2003. 2003 IEEE. 16-20 March. New Orleans, Louisiana, USA, 1492-1497.
Van Laarhoven, P. J. M. and Aarts, E. H. L. (1987). Simulated Annealing: Theory and Applications. Kluwer Academic Publishers.
Vasan, A. and Komaragiri, S. R. (2009). Comparative analysis of Simulated Annealing, Simulated Quenching and Genetic Algorithms for optimal reservoir operation. Applied Soft Computing. 9(1), 274-281.
Veldhuis, R.
(2002). The Centroid of the Symmetrical Kullback-Leibler Distance. IEEE Signal Processing Letters. 9(3), 96-99.
Vepa, J. and King, S. (2004). Join cost for unit selection synthesis. In: Narayanan, S. and Alwan, A. (Eds.), Text to Speech Synthesis. Prentice Hall.
Vepa, J., King, S. and Taylor, P. (2002). New objective distance measures for spectral discontinuities in concatenative speech synthesis. Proc. IEEE 2002 Workshop on Speech Synthesis. 11-13 September 2002. Santa Monica, USA.
Wang, Y., Yan, W. and Zhang, G. (1996). Adaptive simulated annealing for the optimal design of electromagnetic devices. Magnetics, IEEE Transactions. 32(3), 1214-1217.
Wei, H., Chan, C. F., Chiu, S. and Pun, K. P. (2006). An efficient MFCC extraction method in speech recognition. IEEE International Symposium on Circuits and Systems. 21-24 May 2006. Island of Kos, Greece, 4.
Wong, E. and Sridharan, S. (2001). Comparison of Linear Prediction Cepstrum Coefficients and Mel-Frequency Cepstrum Coefficients for Language Identification. Proc. 2001 International Symposium on Intelligent Multimedia, Video and Speech Processing. 2-4 May. Hong Kong, 95-98.
Wouters, J. and Macon, M. W. (1998). A perceptual evaluation of distance measures for concatenative speech synthesis. Proc. ICSLP. 1998. Sydney, Australia, 2747-2750.
Wu, C. H., Hsia, C. C., Chen, J. F. and Liu, T. H. (2004). Variable-length unit selection using LSA-based syntactic structure cost. International Symposium on Chinese Spoken Language Processing. 15-18 December 2004. Hong Kong, 201-204.
Yao, X. (1991). Simulated annealing with extended neighbourhood. International Journal of Computer Mathematics. 40, 169-189.
Yao, X. (1993). Comparison of different neighbourhood sizes in simulated annealing. Proceedings of the Fourth Australian Conference on Neural Networks (ACNN'93). 216-219.
Zhao, Y., Liu, P., Li, Y., Chen, Y. and Chu, M. (2006).
Measuring Target Cost in Unit Selection With KL-Divergence Between Context-Dependent HMMs. In proceeding of ICASSP 2006. 14-19 May. Toulouse, France, 725-728. APPENDIX A Source Code of MFCC void CFeatureMfccDlg::MfccFront(CString fileName) { int i; fileName.Replace (".wav", "F.snd"); short *data = new short[512]; CFile file; for( i=0;i<512;i++) { data[i]=(short)m_Wave.buf[i]; } file.Open(fileName, CFile::modeCreate|CFile::modeWrite|CFile::typeBinary); if(file) { file.Write((void*)data,sizeof(short)*512); file.Close(); } m_cMFCC.MFCC(12,fileName,0); delete [] data; } void CFeatureMfccDlg::MfccEnd(CString fileName) { int i; fileName.Replace (".wav", "E.snd"); short *data = new short[512]; CFile file; int j=0; for( i=m_Wave.NoOfSample-512;i<m_Wave.NoOfSample;i++) { data[j]=(short)m_Wave.buf[i]; j++; } file.Open(fileName, CFile::modeCreate|CFile::modeWrite|CFile::typeBinary); if(file) { file.Write((void*)data,sizeof(short)*512); file.Close(); } m_cMFCC.MFCC(12,fileName,0); delete [] data; 131 void CFeatureMfccDlg::OnButtonMfcc() { UpdateData(TRUE); int i; for(i=m_iStart;i<=m_iEnd;i++) { CString tempStr; tempStr.Format(m_sPhoName+"%d.wav",i); ReadWav(tempStr); MfccFront(tempStr); MfccEnd(tempStr); } } void CFeatureMfccDlg::OnButtonFmfcc() { int i; short *data = new short[512]; CFile file; for( i=0;i<512;i++) { data[i]=(short)m_Wave.buf[i]; } file.Open("front.snd", CFile::modeCreate|CFile::modeWrite|CFile::typeBinary); if(file) { file.Write((void*)data,sizeof(short)*512); file.Close(); } m_cMFCC.MFCC(12,"front.snd",0); delete [] data; } void CFeatureMfccDlg::OnButtonReadwav() { m_Wave.Load("_e1.wav"); m_Wave.SetFormatToSamples(); WORD nChannels; DWORD nSamplesPerSec; WORD nBitsPerSample; m_Wave.GetParameters(nChannels, nSamplesPerSec, nBitsPerSample); m_Wave.Stop (); m_Wave.ReadWav("_e1.wav"); m_Wave.Play (this, 0, m_Wave.NoOfSample-1); } 132 void CFeatureMfccDlg::ReadWav(CString fileName) { m_Wave.Load(fileName); m_Wave.SetFormatToSamples(); WORD nChannels; DWORD 
nSamplesPerSec; WORD nBitsPerSample; m_Wave.GetParameters(nChannels, nSamplesPerSec, nBitsPerSample); m_Wave.Stop (); m_Wave.ReadWav(fileName); } void CFeature::MFCC(int coeffNo, CString file, int flag) { SpeechFile = file; Num_Feature = coeffNo; MODE = flag; for(int z=0;z<NO_FILTER;z++) { filters[z].first=((float)filter_data[z].first/SAMPLINGFREQ)*(float)S AMPLES + 0.5; filters[z].middle=((float)filter_data[z].middle/SAMPLINGFREQ)* (float)SAMPLES + 0.5; filters[z].final=((float)filter_data[z].final/SAMPLINGFREQ)* (float)SAMPLES + 0.5; if(filters[z].first==filters[z].middle) { printf("Error filter_data %d. The first sample is equal to the midlle sample !\n",z); puts("Use CALC_MEL to recalculate the frequencies."); exit(1); } if(filters[z].final==filters[z].middle) { printf("Error filter_data %d. The final sample is equal to the midlle sample !\n",z); puts("Use CALC_MEL to recalculate the frequencies."); exit(1); } if(filters[z].final==filters[z].first) { printf("Error filter_data %d. The final sample is equal to the initial sample !\n",z); puts("Use CALC_MEL to recalculate the frequencies."); exit(1); } } 133 for(z=0;z<NO_FILTER;z++) { weights[z]=new float [SAMPLINGFREQ/2.0+1]; weights[z][filters[z].first]=0.0; weights[z][filters[z].middle]=1.0; weights[z][filters[z].final]=0.0; m=1.0/(float)(filters[z].middle-filters[z].first); c=-m*filters[z].first; for(int w=filters[z].first+1;w<filters[z].middle;w++) weights[z][w]=m*w+c; m=-1.0/(float)(filters[z].final-filters[z].middle); c=-m*filters[z].final; for(w=filters[z].middle+1;w<filters[z].final;w++) weights[z][w]=m*w+c; } PreProcessing(); SampleFeature=new float[Frame]; SampleBlock = new double[Frame]; SampleWindow = new double[Frame]; cmel=new double[Num_Feature]; int i,j,k; for(i=0;i<TotalFrame;i++) { for(j=0;j<=Frame-1;j++) { FrameBlocking(i,j); HammingWindow(j); SampleFeature[j]=(float)SampleWindow[j]; } rsfft(SampleFeature, (int)log(512.0)/log(2.0)); for(k=0;k<Frame/2;k++) { mag[k] = 
(SampleFeature[k]*SampleFeature[k])+ (SampleFeature[Frame-k-1]*SampleFeature[Frame-k-1]); //fout<<"mag["<<k<<"] = "<<mag[k]<<endl; } for(int p=0;p<NO_FILTER;p++) { xk[p]=0; avgE=0; for(int q=filters[p].first;q<=filters[p].final;q++) { xk[p] += mag[q]*weights[p][q]; avgE++; 134 } xk[p] /= (float)avgE; if(xk[p]==0) xk[p]= 0.1; xk[p]=log10(xk[p]); } for(int r=0;r<Num_Feature;r++) { cmel[r]=0; for(int s=0;s<NO_FILTER;s++) cmel[r]+=xk[s]*cos((float)(r+1)*(s+10.5)*(3.1428571/NO_FILTER)); COEFFBuf[i][r]=cmel[r]; //copy cmel ke COEFFBuf } if(MODE==3||MODE==4||MODE==5) { Energy(i); } } if(SpeechFile.Find(".snd")>0) SpeechFile.Replace(".snd",".mfc"); else if(SpeechFile.Find(".txt")>0) SpeechFile.Replace(".txt",".mfc"); else if(SpeechFile.Find(".wav")>0) SpeechFile.Replace(".wav",".mfc"); if(MODE==0) { WriteFeature(); } else if(MODE==1) { Delta(); WriteFeature(); } else if(MODE==2) { Delta(); DeltaDelta(); WriteFeature(); } else if(MODE==3) { for(int i=0;i<TotalFrame;i++) COEFFBuf[i][Num_Feature]=EnergyBuf[i]; 135 WriteFeature(); delete []EnergyBuf; } else if(MODE==4) { Delta(); DeltaEnergy(); for(int i=0;i<TotalFrame;i++) { COEFFBuf[i][2*Num_Feature]=EnergyBuf[i]; COEFFBuf[i][2*Num_Feature+1]=EnergyBuf[TotalFrame+i]; } WriteFeature(); delete []EnergyBuf; } else if(MODE==5) { Delta(); DeltaDelta(); DeltaEnergy(); DeltaDeltaEnergy(); for (int i=0;i<TotalFrame;i++) { COEFFBuf[i][3*Num_Feature]=EnergyBuf[i]; COEFFBuf[i][3*Num_Feature+1]=EnergyBuf[TotalFrame+i]; COEFFBuf[i][3*Num_Feature+2]=EnergyBuf[2*TotalFrame+i]; } WriteFeature(); delete []EnergyBuf; } delete []SampleFeature; delete []cmel; delete []HamWindow; delete []Sample; delete []SamplePreemphasis; delete []SampleBlock; delete []SampleWindow; for (int y=0;y<TotalFrame;y++) delete [] COEFFBuf[y]; delete [] COEFFBuf; for (y=0;y<NO_FILTER;y++) delete [] weights[y]; } APPENDIX B Source Code of Simulated Annealing (Move 1) typedef struct { float mfccF[12]; float mfccE[12]; }UnitMFCC; typedef struct { UnitMFCC iUnitMFCC[150]; 
int iTotalUnit; }Unit; typedef struct { Unit iunitSel[20]; Unit iStage[20]; int stage,nextStage; int iTUnitSel; }UnitSel; typedef struct { float UnitCost[20]; float fTotalCost; CString sequence[200]; }UnitCost; class CSimanDlg : public CDialog { // Construction public: int iStage; Unit iunitSel[50]; CString Inpho[20]; int BU[20]; int CU[20]; int RU[20]; int iTotalstage; CString sCurProjPath; CSimanDlg(CWnd* pParent = NULL); // standard constructor 137 // Dialog Data //{{AFX_DATA(CSimanDlg) enum { IDD = IDD_SIMAN_DIALOG }; CString m_dDistant; CString m_dDistance; CString m_dshortestDis; CString m_sPho; int m_iNumber; //}}AFX_DATA // ClassWizard generated virtual function overrides //{{AFX_VIRTUAL(CSimanDlg) protected: virtual void DoDataExchange(CDataExchange* pDX); // DDX/DDV support //}}AFX_VIRTUAL // Implementation protected: HICON m_hIcon; // Generated message map functions //{{AFX_MSG(CSimanDlg) virtual BOOL OnInitDialog(); afx_msg void OnSysCommand(UINT nID, LPARAM lParam); afx_msg void OnPaint(); afx_msg HCURSOR OnQueryDragIcon(); afx_msg void OnButtonCompute(); afx_msg void OnButtonNext(); //}}AFX_MSG DECLARE_MESSAGE_MAP() }; CSimanDlg::CSimanDlg(CWnd* pParent /*=NULL*/) : CDialog(CSimanDlg::IDD, pParent) { //{{AFX_DATA_INIT(CSimanDlg) m_dDistant = _T(""); m_dDistance = _T(""); m_dshortestDis = _T(""); m_sPho = _T(""); m_iNumber = 0; //}}AFX_DATA_INIT // Note that LoadIcon does not require a subsequent DestroyIcon in Win32 m_hIcon = AfxGetApp()->LoadIcon(IDR_MAINFRAME); iTotalstage=4; iStage=0; } 138 void CSimanDlg::OnButtonCompute() { UpdateData(TRUE); int i,j; double UnitCost[200]; double d=0; double f; double temperature=1000; double curSol[200],TotalCost; double delta,prob; double g,u,shortest; int join,stage,iNonImproveIte=0,largestjoinPosition; CString sequence[200],Bestsequence[200],selectedUnit[200]; TotalCost=0; double LocalCost[20],greatestJoin=0,LocalCostNe[20],greatestJoinNe=0; int imaxIteration=200, iMaxNoOfJoin=2,iMaxNoOfStage=4,cc,ranNo; 
ofstream ofp,ofpseq; ofp.open("result.dat",ios::out); ofpseq.open("bestsequence.dat",ios::out); srand(time(0));//ori for(j=0;j<imaxIteration;j++)//number of iterations. { cc=iMaxNoOfJoin+1; ranNo=rand()%cc; join=ranNo; if(j==0)//initial solution { for(join=0;join<=iMaxNoOfJoin;join++) { d=0; BU[join]=1;//sample number .. for(i=0;i<12;i++) { d+=(pow(iunitSel[join].iUnitMFCC[BU[join]].mfccE[i] -iunitSel[join+1].iUnitMFCC[BU[join]].mfccF[i],2)); } LocalCost[join]=sqrt(d);/ TotalCost+=sqrt(d); if(LocalCost[join]>greatestJoin) { greatestJoin=LocalCost[join]; TRACE("greatestJoin:%lf\n",greatestJoin); } CString str,s,st; str.Format("%d",join); s.Format("%d",BU[join]); st.Format("%d",join+1); 139 selectedUnit[join]="iunitSel["+str+"].iUnitMFCC["+s+"]"; if(join==iMaxNoOfJoin) { selectedUnit[join+1]="iunitSel["+st+"].iUnitMFCC["+s+"]"; } } UnitCost[j]=TotalCost; curSol[0]=UnitCost[j]; shortest=curSol[0]; CString tempStr; tempStr.Format("%lf",curSol[0]); m_dDistant=tempStr; for(stage=0;stage<iMaxNoOfStage;stage++) { sequence[0]=selectedUnit[stage]; } } if(j>0 && temperature <=1000 && temperature>0.1 &&iNonImproveIte<100) { int c; f=0; TotalCost=0; for(join=0;join<=iMaxNoOfJoin;join++) { f=0; if(j==1) { CU[join]=BU[join]; } if(j>1) { LocalCost[join]=LocalCostNe[join]; } if(join==ranNo) { RU[join]=c; for(i=0;i<12;i++) { f+=(pow(iunitSel[join].iUnitMFCC[CU[join]].mfccE[i] -iunitSel[join+1].iUnitMFCC[RU[join]].mfccF[i],2)); } CString str,s,st,string; str.Format("%d",join); s.Format("%d",RU[join]); st.Format("%d",BU[join]); 140 string.Format("%d",join+1); selectedUnit[join+1]="iunitSel["+string+"].iUnitMFCC["+s+"]"; if(join==0) { selectedUnit[join]="iunitSel["+str+"].iUnitMFCC["+st+"]"; } largestjoinPosition=join; RU[largestjoinPosition]=c; } if(largestjoinPosition+1==join ) { CU[join]=RU[join]; for(i=0;i<12;i++) { f+=(pow(iunitSel[join].iUnitMFCC[RU[largestjoinPosition]].mfccE[i] -iunitSel[join+1].iUnitMFCC[CU[join]].mfccF[i],2)); } CU[join+1]=RU[join]; } if(join!=ranNo && 
largestjoinPosition+1!=join) { for(i=0;i<12;i++) { f+=(pow(iunitSel[join].iUnitMFCC[CU[join]].mfccE[i] -iunitSel[join+1].iUnitMFCC[RU[join]].mfccF[i],2)); } CString str,s,st,string; str.Format("%d",join); s.Format("%d",RU[join]); st.Format("%d",BU[join]); string.Format("%d",join+1); selectedUnit[join+1]="iunitSel["+string+"].iUnitMFCC["+s+"]"; if(join==0) { selectedUnit[join]="iunitSel["+str+"].iUnitMFCC["+st+"]" } CU[join+1]=RU[join]; } LocalCostNe[join]=sqrt(f); TotalCost+=sqrt(f); if(join==iMaxNoOfJoin) { for(stage=0;stage<iMaxNoOfStage;stage++) { sequence[j]=selectedUnit[stage]; 141 } } } UnitCost[j]=TotalCost; ofp<<j<<" "<<UnitCost[j]<<endl; if (j>=1 ) { if(UnitCost[j]<curSol[j-1]) { curSol[j]=UnitCost[j]; if(curSol[j]<shortest) { shortest=curSol[j]; iNonImproveIte=0; for(stage=0;stage<iMaxNoOfStage;stage++) { Bestsequence[j]=selectedUnit[stage]; ofpseq<<j<<" "<<Bestsequence[j]<<endl; } } else { iNonImproveIte++; } CString tempStr; tempStr.Format("%lf",curSol[j]); m_dDistance=tempStr; temperature*=0.95; } else { iNonImproveIte++; g=1+rand()%100; u=1/g; delta=UnitCost[j]-curSol[j-1]; prob=exp(-fabs((delta)/temperature)); if (prob>u) { curSol[j]=UnitCost[j]; CString tempStr; tempStr.Format("%lf",curSol[j]); m_dDistance=tempStr; } 142 else { curSol[j]=curSol[j-1]; } temperature*=0.95; } } CString sh; sh.Format("%lf",shortest); m_dshortestDis=sh; } } ofp.close(); ofpseq.close(); UpdateData(FALSE); } void CSimanDlg::OnButtonNext() { UpdateData (TRUE); int i,len2; int a,b; char input[10]; CString wavName,mfccF,mfccE; CString tempStr; tempStr.Format("%d",m_iNumber); ifstream ifp3; ifp3.open(tempStr+".dat",ios::in); ifp3>>len2; ifstream ifp; ifstream ifp1; Inpho[iStage]=m_sPho; iunitSel[m_iNumber-1].iTotalUnit=len2; for(i=0;i<len2;i++) { ifp3>>input; wavName=input; ifp3>>input; mfccF=input; ifp3>>input; mfccE=input; ifp.open(sCurProjPath+"\\"+Inpho[m_iNumber-1]+"\\"+mfccE,ios::in); for (a=0;a<12;a++) { ifp>>iunitSel[m_iNumber-1].iUnitMFCC[i+1].mfccE[a]; } ifp.close(); 
143 ifp1.open(sCurProjPath+"\\"+Inpho[m_iNumber1]+"\\"+mfccF,ios::in); for (b=0;b<12;b++) { ifp1>>iunitSel[m_iNumber-1].iUnitMFCC[i+1].mfccF[b]; } ifp1.close(); } ifp3.close(); iStage++; } APPENDIX C Source Code of Simulated Annealing (Move 2) typedef struct { float mfccF[12]; float mfccE[12]; }UnitMFCC; typedef struct { UnitMFCC iUnitMFCC[150]; int iTotalUnit; }Unit; typedef struct { Unit iunitSel[20]; Unit iStage[20]; int stage,nextStage; int iTUnitSel; }UnitSel; typedef struct { float UnitCost[20]; float fTotalCost; CString sequence[200]; }UnitCost; class CSimanDlg : public CDialog { // Construction public: int iStage; Unit iunitSel[50]; CString Inpho[20]; int BU[20]; int CU[20]; int RU[20]; int iTotalstage; CString sCurProjPath; CSimanDlg(CWnd* pParent = NULL); // standard constructor 145 // Dialog Data //{{AFX_DATA(CSimanDlg) enum { IDD = IDD_SIMAN_DIALOG }; CString m_dDistant; CString m_dDistance; CString m_dshortestDis; CString m_sPho; int m_iNumber; //}}AFX_DATA // ClassWizard generated virtual function overrides //{{AFX_VIRTUAL(CSimanDlg) protected: virtual void DoDataExchange(CDataExchange* pDX); // DDX/DDV support //}}AFX_VIRTUAL // Implementation protected: HICON m_hIcon; // Generated message map functions //{{AFX_MSG(CSimanDlg) virtual BOOL OnInitDialog(); afx_msg void OnSysCommand(UINT nID, LPARAM lParam); afx_msg void OnPaint(); afx_msg HCURSOR OnQueryDragIcon(); afx_msg void OnButtonCompute(); afx_msg void OnButtonNext(); //}}AFX_MSG DECLARE_MESSAGE_MAP() }; CSimanDlg::CSimanDlg(CWnd* pParent /*=NULL*/) : CDialog(CSimanDlg::IDD, pParent) { //{{AFX_DATA_INIT(CSimanDlg) m_dDistant = _T(""); m_dDistance = _T(""); m_dshortestDis = _T(""); m_sPho = _T(""); m_iNumber = 0; //}}AFX_DATA_INIT // Note that LoadIcon does not require a subsequent DestroyIcon in Win32 m_hIcon = AfxGetApp()->LoadIcon(IDR_MAINFRAME); iTotalstage=4; iStage=0; } 146 void CSimanDlg::OnButtonCompute() { UpdateData(TRUE); int i,j; double UnitCost[200]; double d=0; double f; double 
  temperature=1000;
  double curSol[200],TotalCost;
  double delta,prob;
  double g,u,shortest;
  int join,stage,iNonImproveIte=0,largestjoinPosition;
  CString sequence[200],Bestsequence[200],selectedUnit[200];
  TotalCost=0;
  double LocalCost[20],greatestJoin=0,LocalCostNe[20],greatestJoinNe=0;
  double TMP[20],tmp; // scratch array for sorting the local join costs
  int imaxIteration=200,iMaxNoOfJoin=2,iMaxNoOfStage=4,cc,ranNo;
  ofstream ofp,ofpseq;
  ofp.open("result.dat",ios::out);
  ofpseq.open("bestsequence.dat",ios::out);
  srand(time(0));//ori
  for(j=0;j<imaxIteration;j++)//number of iterations
  {
    cc=iMaxNoOfJoin+1;
    ranNo=rand()%cc;
    join=ranNo;
    if(j==0)//initial solution
    {
      for(join=0;join<=iMaxNoOfJoin;join++)
      {
        d=0;
        BU[join]=1;//sample number
        for(i=0;i<12;i++)
        {
          d+=(pow(iunitSel[join].iUnitMFCC[BU[join]].mfccE[i]
                 -iunitSel[join+1].iUnitMFCC[BU[join]].mfccF[i],2));
        }
        LocalCost[join]=sqrt(d);
        TotalCost+=sqrt(d);
        if(LocalCost[join]>greatestJoin)
        {
          greatestJoin=LocalCost[join];
          TRACE("greatestJoin:%lf\n",greatestJoin);
        }
        CString str,s,st;
        str.Format("%d",join);
        s.Format("%d",BU[join]);
        st.Format("%d",join+1);
        selectedUnit[join]="iunitSel["+str+"].iUnitMFCC["+s+"]";
        if(join==iMaxNoOfJoin)
        {
          selectedUnit[join+1]="iunitSel["+st+"].iUnitMFCC["+s+"]";
        }
      }
      UnitCost[j]=TotalCost;
      curSol[0]=UnitCost[j];
      shortest=curSol[0];
      CString tempStr;
      tempStr.Format("%lf",curSol[0]);
      m_dDistant=tempStr;
      for(stage=0;stage<iMaxNoOfStage;stage++)
      {
        sequence[0]=selectedUnit[stage];
      }
    }
    if(j>0 && temperature<=1000 && temperature>0.1 && iNonImproveIte<50)
    {
      int c;
      f=0;
      TotalCost=0;
      greatestJoinNe=0;
      for(join=0;join<=iMaxNoOfJoin;join++)
      {
        f=0;
        if(j==1)
        {
          CU[join]=BU[join];
        }
        if(j>1)
        {
          LocalCost[join]=LocalCostNe[join];
          greatestJoin=TMP[iMaxNoOfJoin];
        }
        if(greatestJoin==LocalCost[join])
        {
          c=1+rand()%iunitSel[join+1].iTotalUnit;
          RU[join]=c;
          for(i=0;i<12;i++)
          {
            f+=(pow(iunitSel[join].iUnitMFCC[CU[join]].mfccE[i]
                   -iunitSel[join+1].iUnitMFCC[RU[join]].mfccF[i],2));
          }
          CString str,s,st,string;
          str.Format("%d",join);
          s.Format("%d",RU[join]);
          st.Format("%d",BU[join]);
          string.Format("%d",join+1);
          selectedUnit[join+1]="iunitSel["+string+"].iUnitMFCC["+s+"]";
          if(join==0)
          {
            selectedUnit[join]="iunitSel["+str+"].iUnitMFCC["+st+"]";
          }
          largestjoinPosition=join;
          RU[largestjoinPosition]=c;
        }
        if(largestjoinPosition+1==join)
        {
          CU[join]=RU[join];
          for(i=0;i<12;i++)
          {
            f+=(pow(iunitSel[join].iUnitMFCC[RU[largestjoinPosition]].mfccE[i]
                   -iunitSel[join+1].iUnitMFCC[CU[join]].mfccF[i],2));
          }
        }
        if(greatestJoin!=LocalCost[join] && largestjoinPosition+1!=join)
        {
          for(i=0;i<12;i++)
          {
            f+=(pow(iunitSel[join].iUnitMFCC[CU[join]].mfccE[i]
                   -iunitSel[join+1].iUnitMFCC[RU[join]].mfccF[i],2));
          }
          CString str,s,st,string;
          str.Format("%d",join);
          s.Format("%d",RU[join]);
          st.Format("%d",BU[join]);
          string.Format("%d",join+1);
          selectedUnit[join+1]="iunitSel["+string+"].iUnitMFCC["+s+"]";
          if(join==0)
          {
            selectedUnit[join]="iunitSel["+str+"].iUnitMFCC["+st+"]";
          }
          CU[join+1]=RU[join];
        }
        LocalCostNe[join]=sqrt(f);
        TotalCost+=sqrt(f);
        if(join==iMaxNoOfJoin)
        {
          for(stage=0;stage<iMaxNoOfStage;stage++)
          {
            sequence[j]=selectedUnit[stage];
          }
        }
      }
      for(join=0;join<=iMaxNoOfJoin;join++)
        TMP[join]=LocalCostNe[join];
      for(int z=0;z<=iMaxNoOfJoin;z++)
        for(join=0;join<iMaxNoOfJoin;join++)
          if(TMP[join]>TMP[join+1])
          {
            tmp=TMP[join];
            TMP[join]=LocalCostNe[join+1];
            TMP[join+1]=tmp;
          }
      UnitCost[j]=TotalCost;
      if(j>=1)
      {
        if(UnitCost[j]<curSol[j-1])
        {
          curSol[j]=UnitCost[j];
          if(curSol[j]<shortest)
          {
            shortest=curSol[j];
            iNonImproveIte=0;
            for(stage=0;stage<iMaxNoOfStage;stage++)
            {
              Bestsequence[j]=selectedUnit[stage];
            }
          }
          else
          {
            iNonImproveIte++;
          }
          CString tempStr;
          tempStr.Format("%lf",curSol[j]);
          m_dDistance=tempStr;
          temperature*=0.9;
        }
        else
        {
          iNonImproveIte++;
          g=1+rand()%100;
          u=1/g;
          delta=UnitCost[j]-curSol[j-1];
          prob=exp(-fabs((delta)/temperature));
          if(prob>u)
          {
            curSol[j]=UnitCost[j];
            CString tempStr;
            tempStr.Format("%lf",curSol[j]);
            m_dDistance=tempStr;
          }
          else
          {
            curSol[j]=curSol[j-1];
          }
          temperature*=0.9;
        }
      }
      CString sh;
      sh.Format("%lf",shortest);
      m_dshortestDis=sh;
    }
  }
  UpdateData(FALSE);
}

void CSimanDlg::OnButtonNext()
{
  UpdateData(TRUE);
  int i,len2;
  int a,b;
  char input[10];
  CString wavName,mfccF,mfccE;
  CString tempStr;
  tempStr.Format("%d",m_iNumber);
  ifstream ifp3;
  ifp3.open(tempStr+".dat",ios::in);
  ifp3>>len2;
  ifstream ifp;
  ifstream ifp1;
  Inpho[iStage]=m_sPho;
  iunitSel[m_iNumber-1].iTotalUnit=len2;
  for(i=0;i<len2;i++)
  {
    ifp3>>input;
    wavName=input;
    ifp3>>input;
    mfccF=input;
    ifp3>>input;
    mfccE=input;
    ifp.open(sCurProjPath+"\\"+Inpho[m_iNumber-1]+"\\"+mfccE,ios::in);
    for(a=0;a<12;a++)
    {
      ifp>>iunitSel[m_iNumber-1].iUnitMFCC[i+1].mfccE[a];
    }
    ifp.close();
    ifp1.open(sCurProjPath+"\\"+Inpho[m_iNumber-1]+"\\"+mfccF,ios::in);
    for(b=0;b<12;b++)
    {
      ifp1>>iunitSel[m_iNumber-1].iUnitMFCC[i+1].mfccF[b];
    }
    ifp1.close();
  }
  ifp3.close();
  iStage++;
}

APPENDIX D

Source Code of Simulated Annealing (Move 3)

typedef struct
{
  float mfccF[12];
  float mfccE[12];
}UnitMFCC;

typedef struct
{
  UnitMFCC iUnitMFCC[150];
  int iTotalUnit;
}Unit;

typedef struct
{
  Unit iunitSel[20];
  Unit iStage[20];
  int stage,nextStage;
  int iTUnitSel;
}UnitSel;

typedef struct
{
  float UnitCost[20];
  float fTotalCost;
  CString sequence[200];
}UnitCost;

class CSimanDlg : public CDialog
{
// Construction
public:
  int iStage;
  Unit iunitSel[50];
  CString Inpho[20];
  int BU[20];
  int CU[20];
  int RU[20];
  int iTotalstage;
  CString sCurProjPath;
  CSimanDlg(CWnd* pParent = NULL);

// Dialog Data
  //{{AFX_DATA(CSimanDlg)
  enum { IDD = IDD_SIMAN_DIALOG };
  CString m_dDistant;
  CString m_dDistance;
  CString m_dshortestDis;
  CString m_sPho;
  int m_iNumber;
  //}}AFX_DATA

  // ClassWizard generated virtual function overrides
  //{{AFX_VIRTUAL(CSimanDlg)
protected:
  virtual void DoDataExchange(CDataExchange* pDX); // DDX/DDV support
  //}}AFX_VIRTUAL

// Implementation
protected:
  HICON m_hIcon;

  // Generated message map functions
  //{{AFX_MSG(CSimanDlg)
  virtual BOOL OnInitDialog();
  afx_msg void OnSysCommand(UINT nID, LPARAM lParam);
  afx_msg void OnPaint();
  afx_msg HCURSOR OnQueryDragIcon();
  afx_msg void
               OnButtonCompute();
  afx_msg void OnButtonNext();
  //}}AFX_MSG
  DECLARE_MESSAGE_MAP()
};

CSimanDlg::CSimanDlg(CWnd* pParent /*=NULL*/)
  : CDialog(CSimanDlg::IDD, pParent)
{
  //{{AFX_DATA_INIT(CSimanDlg)
  m_dDistant = _T("");
  m_dDistance = _T("");
  m_dshortestDis = _T("");
  m_sPho = _T("");
  m_iNumber = 0;
  //}}AFX_DATA_INIT
  // Note that LoadIcon does not require a subsequent DestroyIcon in Win32
  m_hIcon = AfxGetApp()->LoadIcon(IDR_MAINFRAME);
  iTotalstage=4;
  iStage=0;
}

void CSimanDlg::OnButtonCompute()
{
  UpdateData(TRUE);
  int i,j;
  double UnitCost[200];
  double d=0;
  double f;
  double temperature=1000;
  double curSol[200],TotalCost;
  double delta,prob;
  double g,u,shortest;
  int join,stage,iNonImproveIte=0,largestjoinPosition;
  CString sequence[200],Bestsequence[200],selectedUnit[200];
  TotalCost=0;
  double LocalCost[20],greatestJoin=0,LocalCostNe[20],greatestJoinNe=0;
  double TMP[20],tmp; // scratch array for sorting the local join costs
  int imaxIteration=200,iMaxNoOfJoin=2,iMaxNoOfStage=4,cc,ranNo;
  ofstream ofp,ofpseq;
  ofp.open("result.dat",ios::out);
  ofpseq.open("bestsequence.dat",ios::out);
  srand(time(0));//ori
  for(j=0;j<imaxIteration;j++)//number of iterations
  {
    if(j==0)//initial solution
    {
      for(join=0;join<=iMaxNoOfJoin;join++)
      {
        d=0;
        BU[join]=1;//sample number
        for(i=0;i<12;i++)
        {
          d+=(pow(iunitSel[join].iUnitMFCC[BU[join]].mfccE[i]
                 -iunitSel[join+1].iUnitMFCC[BU[join]].mfccF[i],2));
        }
        LocalCost[join]=sqrt(d);
        TotalCost+=sqrt(d);
        if(LocalCost[join]>greatestJoin)
        {
          greatestJoin=LocalCost[join];
        }
        CString str,s,st;
        str.Format("%d",join);
        s.Format("%d",BU[join]);
        st.Format("%d",join+1);
        selectedUnit[join]="iunitSel["+str+"].iUnitMFCC["+s+"]";
        if(join==iMaxNoOfJoin)
        {
          selectedUnit[join+1]="iunitSel["+st+"].iUnitMFCC["+s+"]";
        }
      }
      UnitCost[j]=TotalCost;
      curSol[0]=UnitCost[j];
      shortest=curSol[0];
      CString tempStr;
      tempStr.Format("%lf",curSol[0]);
      m_dDistant=tempStr;
      for(stage=0;stage<iMaxNoOfStage;stage++)
      {
        sequence[0]=selectedUnit[stage];
      }
    }
    if(j>0 && temperature<=1000 && temperature>0.1 && iNonImproveIte<50)
    {
      int c;//random integer number between 1-13
      f=0;
      TotalCost=0;
      greatestJoinNe=0;
      for(join=iMaxNoOfJoin;join>=0;join--)
      {
        f=0;
        if(j==1)
        {
          CU[join]=BU[join];
        }
        if(j>1)
        {
          LocalCost[join]=LocalCostNe[join];
          greatestJoin=TMP[iMaxNoOfJoin];
        }
        if(greatestJoin==LocalCost[join])
        {
          c=1+rand()%iunitSel[join+1].iTotalUnit;
          RU[join]=c;
          for(i=0;i<12;i++)
          {
            f+=(pow(iunitSel[join+1].iUnitMFCC[CU[join]].mfccF[i]
                   -iunitSel[join].iUnitMFCC[RU[join]].mfccE[i],2));
          }
          CString str,s,st,string;
          str.Format("%d",join+1);
          s.Format("%d",RU[join]);
          st.Format("%d",BU[join]);
          string.Format("%d",join);
          selectedUnit[join]="iunitSel["+string+"].iUnitMFCC["+s+"]";
          if(join==iMaxNoOfJoin)
          {
            selectedUnit[join+1]="iunitSel["+str+"].iUnitMFCC["+st+"]";
          }
          largestjoinPosition=join;
          RU[largestjoinPosition]=c;
        }
        if(largestjoinPosition-1==join)
        {
          CU[join]=RU[join];
          for(i=0;i<12;i++)
          {
            f+=(pow(iunitSel[join+1].iUnitMFCC[RU[largestjoinPosition]].mfccF[i]
                   -iunitSel[join].iUnitMFCC[CU[join]].mfccE[i],2));
          }
        }
        if(greatestJoin!=LocalCost[join] && largestjoinPosition-1!=join)
        {
          for(i=0;i<12;i++)
          {
            f+=(pow(iunitSel[join+1].iUnitMFCC[CU[join]].mfccF[i]
                   -iunitSel[join].iUnitMFCC[RU[join]].mfccE[i],2));
          }
          CString str,s,st,string;
          str.Format("%d",join);
          s.Format("%d",RU[join]);
          st.Format("%d",BU[join]);
          string.Format("%d",join+1);
          selectedUnit[join]="iunitSel["+str+"].iUnitMFCC["+s+"]";//new
          if(join==iMaxNoOfJoin)
          {
            selectedUnit[join+1]="iunitSel["+string+"].iUnitMFCC["+st+"]";
          }
          CU[join-1]=RU[join];
        }
        LocalCostNe[join]=sqrt(f);
        TotalCost+=sqrt(f);
        if(join==0)
        {
          for(stage=0;stage<iMaxNoOfStage;stage++)
          {
            sequence[j]=selectedUnit[stage];
          }
        }
      }
      for(join=0;join<=iMaxNoOfJoin;join++)
        TMP[join]=LocalCostNe[join];
      for(int z=0;z<=iMaxNoOfJoin;z++)
        for(join=0;join<iMaxNoOfJoin;join++)
          if(TMP[join]>TMP[join+1])
          {
            tmp=TMP[join];
            TMP[join]=LocalCostNe[join+1];
            TMP[join+1]=tmp;
          }
      UnitCost[j]=TotalCost;
      if(j>=1)
      {
        if(UnitCost[j]<curSol[j-1])
        {
          curSol[j]=UnitCost[j];
          if(curSol[j]<shortest)
          {
            shortest=curSol[j];
            iNonImproveIte=0;
            for(stage=0;stage<iMaxNoOfStage;stage++)
            {
              Bestsequence[j]=selectedUnit[stage];
            }
          }
          else
          {
            iNonImproveIte++;
          }
          CString tempStr;
          tempStr.Format("%lf",curSol[j]);
          m_dDistance=tempStr;
          temperature*=0.9;
        }
        else
        {
          iNonImproveIte++;
          g=1+rand()%100;
          u=1/g;
          delta=UnitCost[j]-curSol[j-1];
          prob=exp(-fabs((delta)/temperature));
          if(prob>u)
          {
            curSol[j]=UnitCost[j];
            CString tempStr;
            tempStr.Format("%lf",curSol[j]);
            m_dDistance=tempStr;
          }
          else
          {
            curSol[j]=curSol[j-1];
          }
          temperature*=0.9;
        }
      }
      CString sh;
      sh.Format("%lf",shortest);
      m_dshortestDis=sh;
    }
  }
  UpdateData(FALSE);
}

void CSimanDlg::OnButtonNext()
{
  UpdateData(TRUE);
  int i,len2;
  int a,b;
  char input[10];
  CString wavName,mfccF,mfccE;
  CString tempStr;
  tempStr.Format("%d",m_iNumber);
  ifstream ifp3;
  ifp3.open(tempStr+".dat",ios::in);
  ifp3>>len2;
  ifstream ifp;
  ifstream ifp1;
  Inpho[iStage]=m_sPho;
  iunitSel[m_iNumber-1].iTotalUnit=len2;
  for(i=0;i<len2;i++)
  {
    ifp3>>input;
    wavName=input;
    ifp3>>input;
    mfccF=input;
    ifp3>>input;
    mfccE=input;
    ifp.open(sCurProjPath+"\\"+Inpho[m_iNumber-1]+"\\"+mfccE,ios::in);
    for(a=0;a<12;a++)
    {
      ifp>>iunitSel[m_iNumber-1].iUnitMFCC[i+1].mfccE[a];
    }
    ifp.close();
    ifp1.open(sCurProjPath+"\\"+Inpho[m_iNumber-1]+"\\"+mfccF,ios::in);
    for(b=0;b<12;b++)
    {
      ifp1>>iunitSel[m_iNumber-1].iUnitMFCC[i+1].mfccF[b];
    }
    ifp1.close();
  }
  ifp3.close();
  iStage++;
}

APPENDIX E

Evaluation Questionnaire

Malay Text To Speech

Please Note:
• You do not have to take part in this questionnaire.
• If you find any of these questions intrusive, feel free to leave them unanswered.
• Any data collected will remain strictly confidential, and anonymity will be preserved.

SECTION 1 - PERSONAL AND BACKGROUND DETAILS

Age: _____________          Gender: □ Female □ Male
Race: □ Malay □ Chinese □ Indian □ Other __________
Is Malay your first spoken language? □ Yes □ No
Where do you use computers? □ Home □ Work □ School □ I Don't □ Other __________
What is your level of education? □ Primary □ Secondary □ Tertiary □ Other ______
Where is your state of origin?
□ Perak □ Perlis □ Kedah □ Pulau Pinang □ Pahang □ Kelantan □ Terengganu
□ N. Sembilan □ Selangor □ Kuala Lumpur □ Johor □ Sabah □ Sarawak □ Labuan
□ Other __________

The next section will commence shortly.
PLEASE DO NOT TURN THE PAGE UNTIL TOLD TO DO SO. THANK YOU.

SECTION 2 - Word Level Testing (Unit Selection)

Words answer sheet: there will be 10 words in this section. Listen to each sound file and write down your answer.

Word 1 ____________   Word 2 ____________   Word 3 ____________
Word 4 ____________   Word 5 ____________   Word 6 ____________
Word 7 ____________   Word 8 ____________   Word 9 ____________
Word 10 ____________

SECTION 3 - Mean Opinion Score (MOS) Test

Listening test: each of the ten words is played twice. Tick the version that you think sounds better in terms of naturalness.

Word 1:  □ Word 1a  □ Word 1b
Word 2:  □ Word 2a  □ Word 2b
Word 3:  □ Word 3a  □ Word 3b
Word 4:  □ Word 4a  □ Word 4b
Word 5:  □ Word 5a  □ Word 5b
Word 6:  □ Word 6a  □ Word 6b
Word 7:  □ Word 7a  □ Word 7b
Word 8:  □ Word 8a  □ Word 8b
Word 9:  □ Word 9a  □ Word 9b
Word 10: □ Word 10a □ Word 10b

Thank you for your participation.

<<<<<< End of questionnaire >>>>>>
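The OnButtonCompute listings in the appendices above all share two core pieces: a Euclidean join cost over 12-dimensional MFCC vectors and a Metropolis-style acceptance rule. The following standalone sketch distills those two pieces outside the MFC dialog; the function names joinCost and accept are illustrative and do not appear in the original program, which inlines both computations.

```cpp
#include <cmath>
#include <cstdlib>

// Euclidean distance between two 12-dimensional MFCC vectors,
// as computed in the join-cost loops of OnButtonCompute.
double joinCost(const double a[12], const double b[12])
{
  double d = 0;
  for (int i = 0; i < 12; i++)
    d += std::pow(a[i] - b[i], 2);
  return std::sqrt(d);
}

// Acceptance rule from the listings: a cheaper candidate sequence is
// always accepted; a costlier one is accepted when exp(-|delta|/T)
// exceeds u = 1/g, where g is a random integer in [1,100].
bool accept(double newCost, double curCost, double temperature)
{
  if (newCost < curCost)
    return true;
  double g = 1 + std::rand() % 100;
  double u = 1.0 / g;
  double delta = newCost - curCost;
  return std::exp(-std::fabs(delta) / temperature) > u;
}
```

In the listings, each call to this acceptance step is followed by geometric cooling (temperature *= 0.95 in Move 1, 0.9 in Moves 2 and 3), and a counter of non-improving iterations stops the search after 50 rejections.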