IMPLEMENTATION OF SIMULATED ANNEALING IN UNIT SELECTION FOR MALAY TEXT-TO-SPEECH SYSTEM

LIM YEE CHEA
A dissertation submitted in fulfillment of the
requirements for the award of the degree of
Master of Science (Mathematics)
Faculty of Science
Universiti Teknologi Malaysia
NOVEMBER 2009
Dedicated to Jesus Christ,
my personal Lord and Savior,
my pastor, Church members,
my beloved mum, dad, brother and sister.
ACKNOWLEDGEMENTS
“Let us then with confidence draw near to the throne of grace, that we may
receive mercy and find grace to help in time of need.” First and foremost, I want to
thank Jesus for His grace and mercy throughout this project. It is by His hand and
wisdom that He guided me to finish this work.
I would like to extend my appreciation to my honorable supervisor, Dr. Zaitul
Marlizawati Zainuddin and my co-supervisor, Dr. Tan Tian Swee, for their academic
guidance, suggestions, support and encouragement shown during the course of my
study. The patience, tolerance, diligence and dedication shown to me have given me
great encouragement and a good example to follow after.
Finally, I would love to convey my gratitude to my beloved family members
and church members for their love and care shown to me along the process of the
study. They have given me so much assistance, comfort and prayer support, either
financially or spiritually, of which words could not express and will forever be
remembered in my heart. Here I especially want to thank Mohd Redzuan bin
Jamaludin for his willingness and guidance in using Matlab.
ABSTRACT
The unit selection method has become the predominant approach in speech
synthesis. The quality of unit selection based concatenative speech synthesis is
primarily governed by how well two successive units can be joined together.
Therefore, the main purpose of unit selection is to minimize the audible
discontinuities. The unit selection process is based on phonetic context and on
Simulated Annealing, which selects units from a large database by minimizing a
criterion that is often called the cost. This dissertation presents a variable-length
unit selection Malay text-to-speech system that is capable of providing more natural
and accurate unit selection for synthesized speech. Unit selection methods have been
implemented to provide the capability of selecting a speech unit that is not limited
to a phoneme, diphone or triphone, but can also be a string of phonemes matched
directly against the database. The Mel Frequency Cepstral Coefficients (MFCC)
have been introduced as spectral parameters in the unit selection based speech
synthesis. A distance measure is needed to quantify the difference between two
vectors of this speech feature. The spectral distance used is the Euclidean distance.
ABSTRAK
Kaedah pilihan unit telah menjadi cara utama dalam sintesis pertuturan.
Kualiti untuk pilihan unit dalam penyambungan perkataan adalah berpandukan
kepada betapa baiknya kedua-dua unit menyambung bersama. Oleh itu, matlamat
utama dalam pilihan unit adalah untuk mengurangkan komposisi jarak. Proses untuk
pilihan unit adalah bergantung pada konteks fonetik dan Simulated Annealing yang
memilih unit dari pangkalan data dengan meminimumkan satu kriteria, yang
selalunya dipanggil kos. Disertasi ini melaksanakan satu pemilihan unit berlainan
panjang yang mampu memberikan pemilihan unit yang lebih tepat dan semulajadi
untuk pertuturan sintesis. Untuk mengadakan pemilihan pertuturan unit yang
berupaya bukan hanya terhad kepada fonem, dua fonem atau tiga fonem tetapi juga
satu rangkaian fonem yang boleh terus dipadankan kepada pangkalan data, kaedah
pemilihan unit telah dilaksanakan. Mel Frequency Cepstral Coefficients (MFCC)
sebagai parameter spektrum telah diperkenalkan dalam pemilihan unit pertuturan
sintesis. Pengiraan jarak adalah diperlukan untuk mengira jarak antara dua vektor
ini. Jarak spektrum yang digunakan adalah Jarak Euclidean.
TABLE OF CONTENTS

CHAPTER  TITLE  PAGE

TITLE PAGE  i
DECLARATION PAGE  ii
DEDICATION  iii
ACKNOWLEDGEMENT  iv
ABSTRACT  v
ABSTRAK  vi
TABLE OF CONTENTS  vii
LIST OF TABLES  xi
LIST OF FIGURES  xiii
LIST OF SYMBOLS  xvi
LIST OF APPENDICES  xviii

1  INTRODUCTION  1
1.0  Introduction  1
1.1  Background of the Problem  2
1.2  Problem Statement  3
1.3  Objective of the Study  3
1.4  Scopes of the Study  3
1.5  Significance of the Study  4
1.6  Research Methodology  4
1.7  Dissertation Layout  5

2  LITERATURE REVIEW  6
2.1  Speech Synthesis  6
2.1.1  Concatenative Speech Synthesis  7
2.2  Unit Selection  9
2.2.1  Non-Uniformed or Variable Length Unit Selection  11
2.2.2  Corpus-based Unit Selection  12
2.3  Cost Function for Unit Selection  14
2.3.1  The Acoustic Parameters  16
2.3.2  Linguistic Features  16
2.3.3  Local Cost  17
2.3.3.1  Sub-cost on prosody  19
2.3.3.2  Sub-cost on discontinuity  20
2.3.3.3  Sub-cost on phonetic environment  20
2.3.3.4  Sub-cost on spectral discontinuity  21
2.3.3.5  Sub-cost on phonetic appropriateness  22
2.3.3.6  Other sub-costs  23
2.3.3.7  Integrated cost  23
2.4  Cost Weighting  24
2.5  Target Cost  25
2.6  Concatenation Cost  26
2.7  Spectral Distances  29
2.8  Feature Extraction  30
2.8.1  MFCC  30
2.9  Distance Measures  32
2.9.1  Simple Distance Measures  33
2.9.1.1  Absolute Distance  33
2.9.1.2  Euclidean Distance  34
2.9.2  Statistically Motivated Distance Measures  34
2.9.2.1  Mahalanobis Distance  34
2.9.2.2  Kullback-Leibler (KL) Distance  35
2.10  Heuristic Method  36
2.10.1  Simulated Annealing  37
2.10.2  Approaches to Improve the SA Algorithm  39
2.10.3  Polynomial Approximation  40
2.10.4  Annealing Schedule  41
2.10.4.1  Theoretically optimum cooling schedule  41
2.10.4.2  Geometric cooling schedule  42
2.10.4.3  Cooling schedule of Van Laarhoven et al.  42
2.10.4.4  Cooling schedule of Otten et al.  43
2.10.4.5  Cooling schedule of Huang et al.  43
2.10.4.6  Adaptive cooling schedules  44
2.10.4.7  A new adaptive cooling schedule  44
2.11  Parallel SA  46
2.12  Segmented Simulated Annealing  47

3  PROPOSED SYSTEM AND IMPLEMENTATION  49
3.0  Introduction  49
3.1  System Design Flow  50
3.2  Malay Phonetics and Phone Sets  51
3.3  Malay Phoneme  51
3.3.1  Malay Vowels  51
3.3.2  Malay Consonant  51
3.4  Phoneme Units Database  52
3.5  Feature Extraction  55
3.6  Phonetic Context  58
3.7  Unit Selection  59
3.8  Concatenation  60

4  SIMULATED ANNEALING  63
4.0  Introduction  63
4.1  Procedure of Simulated Annealing  65
4.2  Initial Solution  67
4.3  The Cooling Schedule  67
4.3.1  Markov Chain  70
4.4  Neighbourhood Generation Mechanism  70
4.5  Metropolis's Criterion  80
4.6  Stopping Criteria  82
4.7  Unit Selection  82
4.7.1  Phonetic Context  82
4.7.2  Concatenation Cost  88
4.7.2.1  Concatenation cost for Move 1  89
4.7.2.2  Concatenation cost for Move 2  90
4.7.2.3  Concatenation cost for Move 3  90
4.7.2.4  Concatenation cost for Move 4  91
4.8  Concatenation  100

5  TESTING, ANALYSIS AND RESULT  107
5.1  Experiment  107
5.2  Test Materials  107
5.3  Test Conditions  107
5.4  Test Procedure  108
5.5  Profiles of Listeners  109
5.5.1  Percentages of Listeners by Gender  110
5.5.2  Percentage of Listeners by Race  111
5.5.3  Percentage of Listeners by State of Origin  112
5.6  Result and Analysis  113
5.6.1  Word Level Testing  113
5.6.2  Mean Opinion Score  114

6  CONCLUSION AND RECOMMENDATION  117
6.1  Conclusion  117
6.2  Suggestion for Future Work  120

REFERENCES  121
APPENDICES A-E  131-163
LIST OF TABLES

TABLE NO.  TITLE  PAGE

2.1  Sub-cost functions  17
3.1  Total units after extracting the phoneme units from the carrier sentences  54
4.1  Maximum number of iterations for Markov Chain lengths 1 and 2 to reach a final temperature greater than 0.1  70
4.2  The information of the 10 words before filtering using phonetic context  86
4.3  The information of the 10 words after filtering using partially matched phonetic context (left phonetic context)  87
4.4  The information of the 10 words after filtering using fully matched phonetic context (left and right phonetic context)  88
4.5  Information of concatenation cost (Move 1) with temperature reduction rate, α = 0.90  89
4.6  Information of concatenation cost (Move 2) with temperature reduction rate, α = 0.90  90
4.7  Information of concatenation cost (Move 3) with temperature reduction rate, α = 0.90  90
4.8  Information of concatenation cost (Move 4) with temperature reduction rate, α = 0.90  91
4.9  Information of concatenation cost with temperature reduction rate, α = 0.95  92
4.10  Information of concatenation cost with temperature reduction rate, α = 0.85  93
4.11  Information of concatenation cost with temperature reduction rate, α = 0.80  94
4.12  Information of concatenation cost with temperature reduction rate, α = 0.95  95
4.13  Information of concatenation cost with temperature reduction rate, α = 0.90  96
4.14  Information of concatenation cost with temperature reduction rate, α = 0.85  97
4.15  Information of concatenation cost with temperature reduction rate, α = 0.80  98
4.16  The sequences of the 10 selected words  100
5.1  Profiles of Listeners  109
5.2  Words selected for listening test  113
5.3  The score line of synthesis words considering the concatenation cost  115
5.4  The score line of the 10 synthesis words considering the concatenation cost  116
LIST OF FIGURES

FIGURE NO.  TITLE  PAGE

2.1  Classes of waveform synthesis methods for speech synthesis  7
2.2  Viterbi search  8
2.3  Architecture of corpus-based unit selection concatenative speech synthesizer  13
2.4  Schematic diagram of cost function  15
2.5  Example of unit search algorithm. The shortest path is marked in blue  28
2.6  Example of unit search algorithm. The difference in cost between the optimal sequences of two graphs is evaluated for d3 in pre-selection  29
2.7  Objective spectral distances  30
2.8  Block diagram of the conventional MFCC extraction algorithm  31
2.9  Parallel Simulated Annealing taxonomy  46
2.10  Segmented simulated annealing  48
3.1  Block diagram of system design flow  50
3.2  A set of coefficients transformed from the MFCC algorithm  53
3.3  Speech unit database  54
3.4  The GUI to extract MFCC coefficients  55
3.5  The GUI to extract MFCC coefficients  56
3.6  The 12 coefficients extracted for phoneme "_m"  56
3.7  The 12 coefficients extracted for phoneme "a"  57
3.8  Distance measure and speech feature  57
3.9  The candidate units for phoneme "_n" that matched the right phonetic context  58
3.10  The candidate units for phoneme "a" that matched the left and right phonetic context  59
3.11  Unit selection  60
3.12  Waveform for phoneme "_n"  61
3.13  Waveform for phoneme "a"  61
3.14  Waveform for phoneme "s"  61
3.15  Waveform for phoneme "i"  62
3.16  Concatenation of the best matching units for the word "nasi"  62
4.1  SA flow diagram to find the best speech unit sequence  66
4.2  Temperature reduction pattern for various reduction rates with Markov Chain length 1  69
4.3  Temperature reduction pattern for various reduction rates with Markov Chain length 2  69
4.4  Metropolis criterion  81
4.5  The feasible search region to form the Malay word "kampung" before filtering using phonetic context  84
4.6  The feasible search region to form the Malay word "kampung" after filtering using partially matched phonetic context (left phonetic context)  85
4.7  The feasible search region to form the Malay word "kampung" after filtering using fully matched phonetic context (left and right phonetic context)  85
4.8  SA best, mean and worst solutions for ten problems from Table 4.12  99
4.9  Waveform "_s1"  101
4.10  Waveform "e537"  101
4.11  Waveform "l362"  101
4.12  Waveform "a2710"  101
4.13  Waveform "n1031"  102
4.14  Waveform "j7"  102
4.15  Waveform "u206"  102
4.16  Waveform "t142"  102
4.17  Waveform "ny1"  103
4.18  Waveform "a2060"  103
4.19  Concatenation waveform for the word "selanjutnya"  103
4.20  Spectrogram for the word "nasi"  104
4.21  Spectrogram for the word "berpengetahuan"  104
4.22  Spectrogram for the word "demikian"  105
4.23  Spectrogram for the word "demikian" without considering the concatenation cost  105
4.24  Zoomed-in spectrogram for the word "demikian" from Figure 4.22  106
4.25  Zoomed-in spectrogram for the word "demikian" from Figure 4.23  106
5.1  Percentage of listeners by gender  110
5.2  Percentage of listeners by race  111
5.3  Percentage of listeners by state of origin  112
5.4  Level of intelligibility of the 10 selected words  114
5.5  Results of the mean opinion score  115
LIST OF SYMBOLS/ ABBREVIATIONS

AC  Average cost
kb  Boltzmann constant
S  Configuration set
C  Cost function
E  Energy
Cmax  Estimation of the maximum value of the cost function
⟨f(T)⟩  Expected cost in equilibrium
FFT  Fast Fourier Transform
F0  Fundamental Frequency
GUI  Graphical User Interface
KL  Kullback-Leibler
LSF  Line spectral frequencies
LP  Linear prediction
LPC  Linear Predictive Coefficients
LC  Local cost
MC  Maximum cost
MOS  Mean Opinion Score
MCD  Mel-cepstral distortion
MFCCs  Mel-Frequency Cepstral Coefficients
Mel(f)  Mel scale
MCA  Multiple centroid analysis
NCp  Norm cost
N  Neighbourhood structure
PLP  Perceptual linear prediction
P(E)  Probability of acceptance
δ  Real number
Cpro  Sub-cost on prosody
CF0  Sub-cost on F0 discontinuity
Cenv  Sub-cost on phonetic environment
Cspec  Sub-cost on spectral discontinuity
Capp  Sub-cost on phonetic appropriateness
T  Temperature
α  Temperature reduction rate
TTS  Text-to-speech
TSP  Travelling salesman problem
U  Upper bound
σ²(T)  Variance in the cost at equilibrium
LIST OF APPENDICES

APPENDIX  TITLE  PAGE

A  Source Code of MFCC  131
B  Source Code of Simulated Annealing (Move 1)  137
C  Source Code of Simulated Annealing (Move 2)  145
D  Source Code of Simulated Annealing (Move 3)  153
E  Evaluation Questionnaire  161
CHAPTER 1
INTRODUCTION
1.0 Introduction
Corpus-based concatenative synthesis has recently become the major trend
because the resulting speech sounds more natural than that produced by parameter-driven
production models (Chou, 1999). Unit selection synthesizers in their current
state produce highly intelligible, near-natural synthetic speech (Tsiakoulis et al.,
2008). This method creates speech by re-sequencing pre-recorded speech units
selected from a very large speech database (Cepko et al., 2008). Speech is produced
by searching through a large speech database (corpus) and concatenating selected
units, thus forming the output signal. This approach shows its superiority over
formant and articulatory synthesis because it concatenates natural acoustic units
with little or no modification, thus offering better speech quality (Janicki et al.,
2008). Synthetic speech is produced by concatenating speech units from a very large
speech corpus containing enough prosodic and spectral variety for all synthesis
units (Vepa et al., 2002; Vepa and King, 2004). Hence, it is possible to synthesize
highly natural-sounding speech by selecting an appropriate sequence of units (Vepa
et al., 2002). The selection of the best unit sequence from the database can be
treated as a search for the sequence with the lowest overall distance. Since the
quality of the resulting synthetic speech depends to a large extent on the variability
and availability of representative units, it is crucial to design a corpus that covers
all speech units and most of their variations in a feasible size (Min et al., 2001).
The unit selection process is based on a cost function that consists of a target cost
and a join cost. The join cost is a measurement of the acoustic smoothness between
the concatenated units (Dong and Li, 2008). This dissertation focuses on
concatenation costs, which generally use a distance measure on a parameterization
of the speech signal. MFCCs are chosen as spectral parameters because they are the
most commonly used in state-of-the-art recognizers (Rabiner and Juang, 1993). A
distance measure is needed to quantify the difference between two vectors of this
speech feature; the spectral distance used is the Euclidean distance. The Mel
Frequency Cepstral Coefficients were derived using standard methods commonly
used in speech recognition. MFCCs are representative of the real cepstrum of a
windowed short-time signal derived from the Fast Fourier Transform (FFT) of the
speech signal (Wei et al., 2006).
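The Euclidean spectral distance just described can be sketched in a few lines of Python (the dissertation's own implementation uses C++ and Matlab; the vectors below are invented for illustration, not real MFCC values):

```python
import math

def euclidean_distance(mfcc_a, mfcc_b):
    """Euclidean distance between two equal-length MFCC vectors."""
    if len(mfcc_a) != len(mfcc_b):
        raise ValueError("MFCC vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mfcc_a, mfcc_b)))

# Illustrative 4-dimensional frames (real MFCC vectors typically hold
# 12 or 13 coefficients).
end_of_left_unit = [1.0, 0.5, -0.2, 0.1]
start_of_right_unit = [1.2, 0.4, -0.1, 0.1]
print(round(euclidean_distance(end_of_left_unit, start_of_right_unit), 4))
```

A smaller distance at the join indicates a smoother transition between the two units, which is exactly the quantity the concatenation cost penalizes.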
1.1 Background of the Problem

The main problem with the existing Malay text-to-speech (TTS) synthesis
systems is the poor quality of the generated speech. This poor quality is caused by
the inability of traditional TTS systems to provide multiple candidate units for
generating accurate synthesized speech (Tan and Sheikh, 2008b). Most current
Malay TTS systems use diphone concatenation, which supports only a single unit
for each existing diphone, so the selection of speech units for concatenation may
not be accurate enough (Tan and Sheikh, 2008b). The current trend in high-quality
text-to-speech (TTS) systems is to concatenate acoustic units selected from a
large-scale corpus of continuous read speech. Thus, a robust unit selection is
needed to handle the huge volume of data in the database (Blouin et al., 2002).
Artifacts such as phase mismatches and discontinuities in spectral shape exist
because units are extracted from disjoint phonetic contexts, and these can have a
deleterious effect on perception (Hunt and Black, 1996). Unit selection is nominally
cast as a multivariate optimization task, where the available unit inventory is
searched for the "best" sequence of units making up the target utterance. This
optimization relies on suitable cost criteria to characterize relevant aspects of
acoustic and prosodic context (Bellegarda, 2008).
1.2 Problem Statement

The task of this research is to use Simulated Annealing to find the
minimum-cost path through the speech units.
1.3 Objective of the Study

The dissertation aims to achieve the three objectives outlined in this section:
i) To implement Mel Frequency Cepstral Coefficients (MFCCs) in unit selection.
ii) To implement a heuristic optimization method in unit selection.
iii) To evaluate the performance of the heuristic optimization method in unit selection.
1.4 Scopes of the Study

This dissertation presents a variable-length unit selection scheme to select
text-to-speech (TTS) synthesis units from a phoneme-based corpus, supporting
phoneme patterns in Malay text-to-speech. The speech features selected are
MFCCs. The spectral distance used is the Euclidean distance. A heuristic method,
namely Simulated Annealing, is implemented in unit selection to select the best
sequence of units.
1.5 Significance of the Study

For the Malay TTS system, this is the first implementation of unit selection
using a heuristic method, namely Simulated Annealing. The performance of this
algorithm will be evaluated based on the values of the cost functions obtained and
a listening test. By doing so, the advantages and disadvantages of this method
compared to other existing unit selection methods will become known.
1.6 Research Methodology

The variable-length unit selection implemented in the Malay text-to-speech
system in this project is capable of providing more natural and accurate unit
selection for synthesized speech (Tan and Sheikh, 2008b). During synthesis, proper
units are selected by searching for the database units closest to the symbolic target
sequence using Simulated Annealing. The number of possible units at a given time
can run into the tens of thousands if the database is built from a 100-hour corpus
(Nishizawa and Kawai, 2006). Therefore, a heuristic optimization method is needed
to select the appropriate units without having to go through all possible
combinations of unit sequences. The C++ program code for Simulated Annealing
was developed. To make the acoustic distortion measures correspond to human
perception more consistently, the Mel Frequency Cepstral Coefficients (MFCC)
have been introduced as spectral parameters in the unit selection based speech
synthesis (John et al., 1993). A distance measure is needed to quantify the
difference between two vectors of this speech feature; the spectral distance used is
the Euclidean distance. The smaller the Euclidean distance, the closer the match at
the concatenation point and the better the generated speech sound. The performance
of the heuristic method and of other unit selection methods was evaluated based on
the values of the cost functions obtained and a listening test.
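The selection step described above can be sketched as a minimal simulated annealing loop in Python (the dissertation's implementation is in C++; the geometric cooling parameters and the toy cost model, in which each candidate unit is reduced to a single boundary feature vector, are illustrative assumptions):

```python
import math
import random

def join_cost(u, v):
    """Euclidean distance between the boundary feature vectors of two units."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def sequence_cost(choice, candidates):
    """Total concatenation cost of one candidate choice per phoneme slot."""
    units = [candidates[i][c] for i, c in enumerate(choice)]
    return sum(join_cost(units[i], units[i + 1]) for i in range(len(units) - 1))

def anneal(candidates, t0=1.0, alpha=0.9, t_min=0.01, moves_per_t=50, seed=0):
    rng = random.Random(seed)
    choice = [rng.randrange(len(c)) for c in candidates]   # initial solution
    cost = sequence_cost(choice, candidates)
    best, best_cost = list(choice), cost
    t = t0
    while t > t_min:
        for _ in range(moves_per_t):       # Markov chain at this temperature
            trial = list(choice)
            i = rng.randrange(len(candidates))
            trial[i] = rng.randrange(len(candidates[i]))   # neighbourhood move
            trial_cost = sequence_cost(trial, candidates)
            delta = trial_cost - cost
            # Metropolis criterion: accept improvements, and accept worse
            # moves with probability exp(-delta / T).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                choice, cost = trial, trial_cost
                if cost < best_cost:
                    best, best_cost = list(choice), cost
        t *= alpha                         # geometric cooling schedule
    return best, best_cost

# Toy database: three phoneme slots, each with a few candidate units.
candidates = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.1, 0.0], [5.0, 5.0], [2.0, 2.0]],
    [[0.2, 0.1], [6.0, 6.0]],
]
best, best_cost = anneal(candidates)
print(best, round(best_cost, 3))
```

Because SA only needs to evaluate one neighbouring solution per move, the loop never enumerates all combinations of unit sequences, which is the point of using a heuristic here.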
1.7 Dissertation Layout

This dissertation is divided into six major parts. Chapter 1 includes the
introduction, background, objectives and scope of the study. The purpose is to
show how this research differs from other conventional methods.
Chapter 2 provides a comprehensive study of various unit selection
methods. The focus is on the cost function for unit selection, speech features and
spectral distance. It also includes a discussion of Simulated Annealing (SA), with
the purpose of laying a foundation for possible approaches to improve the
performance of SA.
Chapter 3 describes the proposed system and its implementation. It
discusses the process involved in generating the waveform for a synthesized word,
from contextual linguistics and selection of speech units to concatenation and
output sound.
Chapter 4 describes the procedure for SA. It also describes the unit
selection procedure from contextual linguistics through SA to concatenation.
Various parameter settings and neighbourhood generation mechanisms for SA are
used to investigate its performance.
Chapter 5 presents the listening test for the synthesized words based on the
results in Chapter 4. The purpose is to justify the contribution of the concatenation
cost to improving speech quality.
Chapter 6 provides the conclusion of the study. It also gives some
recommendations for further improvement of the system.
CHAPTER 2
LITERATURE REVIEW
2.1 Speech Synthesis

There are two types of speech synthesis methods (Figure 2.1): parametric
synthesis and concatenative synthesis (Hirai and Tenpaku, 2004). The parametric
synthesis method involves encoding and decoding of speech samples (Hirai and
Tenpaku, 2004). Before the speech samples are stored in a database, they are
encoded into parameters such as Linear Predictive Coefficients (LPC). In the
synthesis stage, these speech samples are decoded. Encoding of the speech samples
is required because of small memory storage sizes. During the encoding process,
information is lost, and as a consequence speech intelligibility degrades (Hirai and
Tenpaku, 2004). Concatenative synthesis does not involve encoding and decoding
of speech samples. During speech concatenation, original speech segments are
selected and concatenated (Hirai and Tenpaku, 2004). "Original" here means the
segments are used as they are, or only lightly processed. In this case, speech
intelligibility and the speaker's identity are maintained, since the information is
well preserved. However, this method requires large storage capabilities. It also
incurs a large computational cost when searching for appropriate speech segments
to concatenate. The issues of large storage capabilities and computational cost are
being resolved by ever-increasing advancements in computer technology (Hirai and
Tenpaku, 2004). For the synthesis system in this dissertation, the length of a
segment is a phoneme.
[Figure 2.1 is a taxonomy diagram: waveform synthesis divides into parametric
synthesis (source-filter, articulatory) and concatenative synthesis, the latter using
either a fixed inventory (diphones, triphones, demisyllables) or nonuniform unit
selection.]
Figure 2.1 Classes of waveform synthesis methods for speech synthesis (Schwarz,
2007).
2.1.1 Concatenative Speech Synthesis

Several different techniques exist for synthesizing speech (Chappell and
Hansen, 2002). The corpus-based concatenative approach to speech synthesis has
been widely explored in recent years (Sakai et al., 2008). Concatenative synthesis
starts with a collection of speech waveform signals and concatenates individual
segments to form a new utterance. The concatenative approach is based on the idea
of re-combining natural prosodic contours and phoneme sequences using a
superpositional framework (Jan et al., 2005). In concatenative speech synthesis,
speech segments, each of which is often generalized as a unit, are selected from a
speech corpus through the minimization of an overall cost. The Viterbi search
(Figure 2.2), based on the dynamic programming (DP) approach, is typically
employed to find the unit sequence with the minimal cost (Nishizawa and Kawai,
2008). The final speech is more natural than with other forms of synthesis, since
concatenative synthesis begins with a set of natural speech segments (Chappell and
Hansen, 2002).
Figure 2.2 Viterbi search (Sakai et al., 2008)
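The minimal-cost search that the Viterbi algorithm performs over a unit lattice can be illustrated with a compact Python sketch; the lattice sizes and cost values below are invented for illustration, not taken from any cited system:

```python
def viterbi_select(target_costs, concat_cost):
    """Minimal-cost path through a unit lattice.

    target_costs[i][j]: target cost of candidate j at position i.
    concat_cost(i, j, k): cost of joining candidate j at position i
    to candidate k at position i + 1.
    Returns (best candidate indices, total cost).
    """
    n = len(target_costs)
    best = [list(target_costs[0])]          # cheapest path ending at (0, j)
    back = []                               # back-pointers for backtracking
    for i in range(1, n):
        row, brow = [], []
        for k in range(len(target_costs[i])):
            paths = [best[i - 1][j] + concat_cost(i - 1, j, k)
                     for j in range(len(target_costs[i - 1]))]
            j_best = min(range(len(paths)), key=paths.__getitem__)
            row.append(paths[j_best] + target_costs[i][k])
            brow.append(j_best)
        best.append(row)
        back.append(brow)
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][k]
    seq = [k]
    for brow in reversed(back):             # walk the back-pointers
        k = brow[k]
        seq.append(k)
    seq.reverse()
    return seq, total

# Two candidates per position; joining identical indices is free here.
targets = [[0.0, 1.0], [2.0, 0.5], [0.3, 0.3]]
join = lambda i, j, k: 0.0 if j == k else 1.0
print(viterbi_select(targets, join))
```

Each position keeps only the cheapest path per candidate, so the search cost grows with the square of the number of candidates per position rather than with the number of full sequences.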
Several possible choices exist for the basic synthesis unit in a concatenative
speech synthesizer, such as phonemes, diphones, demisyllables, syllables, words or
phrases (Min et al., 2001). Smaller and larger units each have their own advantages
and disadvantages. For small units like phonemes, it is not difficult to collect a
speech corpus that embodies many prosodic and spectral varieties (Min et al.,
2001). The disadvantage is that the synthesized speech tends to suffer more
distortion caused by mismatches between concatenated units, since small units
mean many more concatenation points. For larger units such as words or phrases, it
is almost impossible to cover many varieties in a feasible corpus size (Min et al.,
2001), although they have fewer concatenation points. Each segment's boundary
for concatenation is chosen during synthesis so as to best fit the adjacent segments
in the synthesized utterance. Spectral mismatch can be computed using an objective
measure to determine the level of spectral fit between segments at various possible
boundaries. The spectral mismatch is measured at various possible segment
boundaries, and the minimum measured score indicates the closest spectral match
(Chappell and Hansen, 2002).
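The boundary-fitting idea can be made concrete with a short Python sketch; the frame vectors, and the assumption that each segment edge offers a handful of alternative cut points, are invented for illustration:

```python
import math

def spectral_distance(frame_a, frame_b):
    """Euclidean distance between two spectral feature frames."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)))

def best_boundary(left_edge_frames, right_edge_frames):
    """Try every pair of candidate cut points and keep the pair whose
    spectral mismatch is smallest (the closest spectral match)."""
    return min(
        ((spectral_distance(l, r), i, j)
         for i, l in enumerate(left_edge_frames)
         for j, r in enumerate(right_edge_frames)),
        key=lambda t: t[0])

# Invented candidate frames near the edges of two adjacent segments.
left = [[0.9, 0.1], [1.0, 0.2], [1.1, 0.4]]
right = [[1.4, 0.9], [1.05, 0.25], [0.7, 0.0]]
mismatch, i, j = best_boundary(left, right)
print(i, j, round(mismatch, 3))
```

The pair of cut points returned is the one the objective measure scores as the best spectral fit, mirroring the boundary selection described above.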
A large database can yield high speech quality for direct concatenation of
segments, since the database contains enough sample segments to include a close
match for each desired segment (Chappell and Hansen, 2002). However, a large
database is also costly in terms of database collection, search requirements, and
segment memory storage and organization (Chappell and Hansen, 2002). For
databases that contain multiple instances of each speech unit, segment selection is
based upon two cost functions: the target cost and the concatenation cost (Chappell
and Hansen, 2002). The target cost measures the difference between available
segments and a theoretical ideal segment, and the concatenation cost measures the
acoustic smoothness between the concatenated units (Dong and Li, 2008). Several
spectral distance measures have been compared when used as concatenation costs
(Chappell and Hansen, 2002).
2.2 Unit Selection

The unit selection method has recently become the major approach in
speech synthesis and captures the attention of most researchers. The speech units
need to be carefully selected from a large database of continuous read speech
recorded from a professional speaker in order to yield a high-quality TTS system
(Sarathy and Ramakrishnan, 2008). To provide enough speech target specifications
for unit selection, the database should be designed to cover as much of the prosodic
and phonetic characteristics of the language as possible (Sarathy and
Ramakrishnan, 2008). Unit selection becomes much slower when a larger unit
database is used for high-quality sound, because the computational effort in the
search for the optimal unit sequence is proportional to the square of the number of
possible units (Nishizawa and Kawai, 2006). The aim of unit selection speech
synthesis is to select a sequence of units that requires little signal processing, or
ideally no signal processing at all (Robert et al., 2007).
In current speech synthesizers, the process of unit selection is based on
some type of dynamic programming that selects units from a large database by
minimizing an integrated cost (Wu et al., 2004). In corpus-based TTS, the search
for the optimum unit sequence is performed by a Viterbi algorithm that minimizes a
cost function (Díaz and Banga, 2006). There are two different types of spectral
distortion in unit selection synthesis: contextual unit distortion and inter-unit
distortion (Sagisaka, 1994). The unit selection algorithm uses two cost functions:
the target cost and the concatenation cost (Fek et al., 2006). The quality of unit
selection based concatenative speech synthesis is determined by how well two
successive units can be joined together to minimize audible discontinuities (Vepa et
al., 2002). The concatenation cost is used as the objective measure of discontinuity.
Contextual unit distortion and inter-unit distortion are often represented by
the target cost and the concatenation cost respectively. Contextual unit distortion is
a measure, over a whole unit, of the difference between the unit-extraction context
and the target synthesis context; phonetic contextual difference is most often used
to represent it (Sagisaka, 1994). Inter-unit distortion is a measure of spectral
discontinuity at unit concatenation boundaries. This spectral discontinuity is
computed from the difference in the spectral envelopes of neighbouring units at the
concatenation points (Sagisaka, 1994).
The combinations of units should be considered in unit selection, since the
suitability of a synthesis unit includes not only the similarity between a synthesis
target and a selected unit but also the smoothness between neighbouring units
(Nishizawa and Kawai, 2006). According to Fek et al. (2006), the target cost is
assigned a null cost if the phonetic context is fully matched, that is, if the left and
right phonetic contexts of the input unit and the candidate match exactly. Likewise,
according to Fek et al. (2006), the concatenation cost is assigned a null cost if the
units were consecutive in the speech database.
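The two null-cost rules attributed to Fek et al. (2006) can be sketched in Python; the dictionary fields (`left`, `right`, `source`, `index`, the frame fields) and the unit mismatch penalties are assumptions made for illustration, not the actual costs used by Fek et al.:

```python
def target_cost(target_unit, candidate):
    """Null cost when the candidate's left and right phonetic contexts
    both match the target's; otherwise penalize each mismatching side."""
    cost = 0.0
    if candidate["left"] != target_unit["left"]:
        cost += 1.0
    if candidate["right"] != target_unit["right"]:
        cost += 1.0
    return cost

def concatenation_cost(prev, cand, spectral_distance):
    """Null cost when the two candidates were consecutive in the recorded
    corpus; otherwise a spectral distance measured at the join."""
    if prev["source"] == cand["source"] and prev["index"] + 1 == cand["index"]:
        return 0.0
    return spectral_distance(prev["end_frame"], cand["start_frame"])

# Illustrative candidates for the phoneme "a" in the word "nasi".
target = {"left": "_n", "right": "s"}
cand_a = {"left": "_n", "right": "s", "source": "sent_12", "index": 4,
          "end_frame": [0.5], "start_frame": [0.4]}
prev = {"source": "sent_12", "index": 3,
        "end_frame": [0.6], "start_frame": [0.2]}
dist = lambda u, v: abs(u[0] - v[0])
assert target_cost(target, cand_a) == 0.0            # fully matched context
assert concatenation_cost(prev, cand_a, dist) == 0.0  # consecutive in corpus
```

Both rules bias the search toward candidates recorded in the right phonetic context and toward runs of units that were already adjacent in the corpus, which is what makes longer matched strings effectively free.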
The coverage of units across all speaker attitudes and prosodic styles is one
of the biggest problems in the unit selection approach (Sarathy and Ramakrishnan,
2008). Even recording several hundred thousand sentences is not enough to
guarantee full coverage of the target feature combinations (Sarathy and
Ramakrishnan, 2008). Therefore, unit selection is very important in substituting
units that do not match the target specification with their closest substitutes.
To improve the naturalness of the synthesized speech, a possible approach
is to increase the length of the basic units from demi-phones, phones, diphones,
triphones, syllables and words to non-uniform or variable-length units (Wu et al.,
2004). Since fewer concatenation points yield less spectral discontinuity, longer
synthesis units reduce the effect of spectral distortion (Wu et al., 2004).
In concatenative systems, speech units can be either fixed-size diphones or
variable-length units (Hasim et al., 2006). Fixed-size units allow only one unit
length in the speech unit database; the length can be a phoneme, diphone, syllable
or word, depending on the application. The approach that utilizes variable-length
units is known as unit selection (Hasim et al., 2006). For Malay speech unit
concatenation, two types of units are used, namely the single unit and 'unit
selection'. The single-unit method has only one occurrence of each possible unit.
For this type of method, the diphone level has been used by most research and
available commercial TTS systems, namely the Festival Speech Synthesis system,
Mbrola, and German TTS systems (Taylor et al., 1999).

For the unit selection method, since a large speech corpus containing more
than one instance of each unit is recorded, it provides more accurate unit selection
with multiple choices for all corresponding units. The variable-length units are
selected based on some estimated objective measure to optimize the synthetic
speech quality (Hasim et al., 2006).
2.2.1
Non-Uniformed or Variable Length Unit Selection
Nowadays, many speech synthesis systems use non-uniform or variable-length unit concatenation in an effort to minimize audible signal discontinuities at the concatenation points (Stylianou and Syrdal, 2001). To produce the output speech, the most appropriate units of variable lengths with the desired prosodic features are automatically retrieved from the corpus, selected on-line in real time, and concatenated during the synthesis process (Chou and Tseng, 1998). Usually, longer units can be used in the synthesis if they appear in the corpus with the desired prosodic features; this significantly reduces the need for signal modification, which usually degrades the naturalness of the synthesized speech (Chou and Tseng, 1998). The quality of a generated waveform depends mainly on the number of concatenation points. Therefore, higher perceived speech quality can be produced as the length of the elements used in the synthesized speech increases, that is, as the number of concatenation points decreases (Nagy et al., 2005).
2.2.2
Corpus-based Unit Selection
The concept of corpus-based or unit selection synthesis is that the corpus is
searched for maximally long phonetic strings to match the sounds to be synthesized
(Piits et al., 2007). Corpus-based speech tends to elicit considerably higher ratings of naturalness in auditory tests than diphone or triphone synthesis (Nagy et al., 2005) because there are fewer real concatenation points (Fek et al., 2006). To solve the problems of fixed-size unit inventory synthesis, e.g. diphone synthesis, a promising methodology has been proposed: corpus-based concatenative speech synthesis, or unit selection (Hasim et al., 2006). The speech corpus contains more than one instance of each unit so as to capture the prosodic and spectral variability found in natural speech. In corpus-based systems, acoustic units of varying sizes are selected from a large speech corpus and concatenated. If an appropriate unit is found in the unit inventory, the need for signal modification of the selected units is minimized (Hasim et al., 2006). Because there is more than one instance of each unit, a unit selection algorithm is required to choose the units from the inventory that best match the target specification of the input sequence (Hasim et al., 2006). One factor that has been argued to contribute to the perceived lack of naturalness of synthesized speech is the frequency of unit concatenations (Möbius, 2000). The unit selection algorithm therefore favors choosing adjacent speech segments in order to minimize the number of joins (Hasim et al., 2006). Thus, corpus-based approaches overcome the limitations of concatenative synthesis with a fixed acoustic unit inventory.
Unit selection produces much better output speech quality than fixed-size unit inventory synthesis in terms of naturalness. However, unit selection still faces some challenges. One problem is the inconsistency of the resulting speech quality (Hasim et al., 2006). If the unit selection algorithm fails to find a good match for a target unit, the selected unit needs to undergo prosodic modifications, which degrade the speech quality at the segment join. To overcome this problem, the speech corpus should be designed to cover all the prosodic and acoustic variations of the units that can be found in an utterance. Recording larger and larger databases is infeasible, since it slows down the unit selection; a better way is to seek optimal coverage of the language. Another problem faced by unit selection is that glitches appear at the concatenation points of the synthesized utterance when the speech waveforms are concatenated. A speech model is generally used for speech representation and waveform generation to ensure smooth concatenation of the speech waveforms and to enable prosodic modifications of the speech units (Hasim et al., 2006). Figure 2.3 shows the components of the Malay waveform generator module.
Figure 2.3 Architecture of corpus-based unit selection concatenative speech
synthesizer.
Algorithm 2.1 shows how an appropriate speech unit sequence for synthesis is found using the perturbation of model parameters (Hirai et al., 2002).
Algorithm 2.1
1. Estimate the values of the speech features from the input text. These values are called the “expected values” of a feature.
2. According to the expected values, find the speech unit sequence showing the minimum cost C0 in the speech database and substitute the IDs of the units in the sequence into S0.
3. Substitute the cost C0 and the speech unit sequence S0 into Cmin and Smin.
4. Choose a feature from among F0, speech unit duration, power, etc., and substitute it into F. Execute the processing shown below:
4.1 Adjust the model parameter values of F within the range in which naturalness and clarity are maintained, and substitute the new expected-value sequence into T.
4.2 Based on T, find the speech unit sequence S showing the minimum cost C in the speech database. If C < Cmin, then substitute C and S into Cmin and Smin.
5. Repeat step (4) until the number of repetitions exceeds the limit or until there is no expectation of renewing Cmin.
6. Synthesize the speech based on the speech unit sequence Smin.
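The search loop of Algorithm 2.1 can be sketched in Python as follows. The `search` and `adjust` callables are hypothetical stand-ins for the database search (steps 2 and 4.2) and the model-parameter adjustment (step 4.1); this is a minimal illustration, not the implementation of Hirai et al. (2002).

```python
import random

def perturbation_search(expected, features, search, adjust, max_iter=20):
    """Sketch of Algorithm 2.1: perturb the model parameters of one
    feature at a time and keep the unit sequence with the lowest cost.
    search(expected) returns (cost, unit_sequence);
    adjust(expected, f) returns a perturbed copy of the expected values
    for feature f (hypothetical helper)."""
    # Steps 2-3: initial search with the unperturbed expected values.
    c_min, s_min = search(expected)
    # Steps 4-5: repeatedly perturb one feature (F0, duration, power, ...).
    for _ in range(max_iter):
        f = random.choice(features)     # step 4: choose a feature F
        t = adjust(expected, f)         # step 4.1: perturbed targets T
        c, s = search(t)                # step 4.2: best sequence for T
        if c < c_min:                   # keep only improvements
            c_min, s_min = c, s
    return c_min, s_min                 # step 6: synthesize from s_min
```

The stopping rule in step 5 is simplified here to a fixed iteration limit; the "no expectation of renewing Cmin" criterion would require a model-specific test.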
2.3
Cost function for unit selection
The cost function for unit selection is viewed as a transformation of objective features, such as acoustic measures and linguistic information, into a perceptual measure (Figure 2.4). The predicted perceptual measure, which is expected to capture the degradation of synthetic speech naturalness, is considered a cost (Toda et al., 2006).
Ideally, perceptual experiments should be conducted and the components of the cost function determined from their results. In practice, acoustic measures can be transformed experimentally into a perceptual measure only when the acoustic features have simple structures, as in the case of F0 or phoneme duration. Most of the time, acoustic features have such complex structures that this kind of experiment is infeasible (Toda et al., 2006). Although various studies have searched for an acoustic measure that can capture perceptual characteristics, nothing satisfactory has been found so far (Klabbers and Veldhuis, 2001; Stylianou and Syrdal, 2001; Ding and Campbell, 1998; Wouters and Macon, 1998).
In addition, phonetic information can be transformed into perceptual measures through perceptual experiments (Kawai and Tsuzaki, 2002). However, acoustic measures that can represent the characteristics of individual waveform segments are still necessary, since phonetic information can only evaluate the difference between phonetic categories (Toda et al., 2006).
Figure 2.4 Schematic diagram of cost function (Toda, 2003).
2.3.1
The Acoustic Parameters
The parameters that fall into this category are normally prosodic parameters that describe the pitch and duration of a unit. Prosody alone is not enough to reflect spectral mismatches; both spectral and prosodic parameters need to be included in the unit representation for unit selection (Dong and Li, 2008). In this dissertation, MFCCs are employed as the parameters representing spectral information, with 12 MFCC coefficients used, and the basic synthesis unit chosen is the phoneme.
2.3.2
Linguistic Features
Linguistic features are used for predicting the acoustic parameters and can be obtained from the input text. The linguistic features that can be derived from the utterance files include context units, syllable information, syllable position information, word information, phrase information and utterance information. Context units describe the phone identities of the previous two and next two units. Syllable information describes the stress, accent and length of the previous, current and next syllables. Syllable position information covers the syllable position in the word and phrase, the stressed and accented syllable positions in the phrase, the distances from the stressed and accented syllables, and the name of the vowel in the syllable. Word information describes the length and part-of-speech of the previous, current and next words, and the position of the word in the phrase. Phrase information describes the lengths (in numbers of words and syllables) of the previous, current and next phrases, the position of the current phrase in the major phrase, and the boundary tone of the current phrase. Finally, utterance information describes the lengths of the utterance in numbers of syllables, words and phrases (Dong and Li, 2008).
2.3.3
Local cost
The local cost expresses the degradation of naturalness caused by an individual candidate unit: the higher the local cost, the less natural the synthesized speech. The local cost function is defined as a weighted sum of five sub-cost functions (Table 2.1), each representing either source information or vocal tract information (Toda et al., 2006).
Table 2.1 Sub-cost functions (Toda, 2003).
Source information: Prosody (F0, duration), Cpro; F0 discontinuity, CF0.
Vocal tract information: Phonetic environment, Cenv; Spectral discontinuity, Cspec; Phonetic inappropriateness, Capp.
The local cost LC(ui, ti) at a candidate unit ui for a target phoneme ti is given by

LC(ui, ti) = wpro ⋅ Cpro(ui, ti) + wF0 ⋅ CF0(ui, ui−1) + wenv ⋅ Cenv(ui, ui−1) + wspec ⋅ Cspec(ui, ui−1) + wapp ⋅ Capp(ui, ti),
wpro + wF0 + wenv + wspec + wapp = 1,

where wpro, wF0, wenv, wspec and wapp denote the weights of the respective sub-costs. All sub-costs are normalized so that they have positive values with the same mean. The unit ui−1 is a candidate unit for the (i−1)th target phoneme ti−1 (Toda et al., 2006). The sub-costs CF0, Cenv and Cspec become null when the candidate units ui−1 and ui are adjacent in the corpus. The five sub-costs can be further divided into a target cost Ct and a concatenation cost Cc (Hunt and Black, 1996; Campbell and Black, 1997).
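As a minimal sketch (not Toda's implementation), the weighted sum of the five sub-costs, with the concatenation sub-costs nulled for corpus-adjacent units, might look like this in Python:

```python
def local_cost(sub_costs, weights, adjacent=False):
    """Weighted sum of the five sub-costs.  `sub_costs` and `weights`
    map the names 'pro', 'F0', 'env', 'spec', 'app' to values; the
    weights must sum to 1.  The concatenation sub-costs (F0, env, spec)
    are nulled when the candidate units are adjacent in the corpus."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    concat = {'F0', 'env', 'spec'}
    return sum(weights[k] * (0.0 if adjacent and k in concat else sub_costs[k])
               for k in weights)
```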
The unit selection process is based on two types of cost function, the target cost and the concatenation cost (Dong and Li, 2008). The local cost (Toda, 2003; Toda et al., 2006) is given by

LC(ui, ti) = wt ⋅ Ct(ui, ti) + wc ⋅ Cc(ui, ui−1),
wt + wc = 1,

where the target cost Ct and concatenation cost Cc can be written as

Ct(ui, ti) = (wpro / wt) ⋅ Cpro(ui, ti) + (wapp / wt) ⋅ Capp(ui, ti),
Cc(ui, ui−1) = (wenv / wc) ⋅ Cenv(ui, ui−1) + (wspec / wc) ⋅ Cspec(ui, ui−1) + (wF0 / wc) ⋅ CF0(ui, ui−1).

Since wpro/wt + wapp/wt = 1, it follows that wt = wpro + wapp. Similarly, since wenv/wc + wspec/wc + wF0/wc = 1, it follows that wc = wF0 + wenv + wspec.
The target cost takes into account the phone label and its position in the word, the phone label and its position in the syllable, and the segment duration (according to statistical data). The concatenation cost takes into account adjacent phones in the same speech segment, avoids too large duration, pitch and energy differences between adjacent phones, and favors similar phonetic features for adjacent phones (Prudon et al., 2002). It is not easy to reach a balance between the target cost and the join cost: if more weight is given to the concatenation cost, the target cost may be weighted low and thus result in bad synthesis. Combining these two costs is necessary as a way of lessening this behavior (Vepa and King, 2004).
2.3.3.1 Sub-cost on prosody: Cpro
The difference in prosodic parameters (F0 contour and phoneme duration) between a candidate segment and the target causes a degradation of naturalness, which the Cpro sub-cost captures (Toda et al., 2006). To calculate the difference in the F0 contour, a phoneme is divided into several parts, and the difference in the averaged log-scaled F0 is computed in each part. The prosodic cost is then the average of the costs calculated over the parts (Toda, 2003). This sub-cost function (Toda et al., 2006) can be estimated from the results of perceptual experiments. The sub-cost Cpro can be written as
Cpro(ui, ti) = (1/M) ⋅ ∑_{m=1}^{M} P( DF0(ui, ti, m), Dd(ui, ti) )    (2.1)
where DF0(ui, ti, m) is the difference in the averaged log-scaled F0 in the m-th divided part; DF0 is set to zero for unvoiced phonemes. Dd is the difference in duration, calculated once per phoneme and used in the cost calculation of each part. M is the number of divisions (Toda, 2003).
P is a nonlinear function determined from the results of perceptual experiments on the degradation of naturalness caused by prosody modification, under the assumption that the output speech is synthesized with prosody modification. When prosody modification is not performed, the function should instead be determined from experiments on the degradation of naturalness caused by using a prosody different from that of the target (Toda, 2003).
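Equation (2.1) can be illustrated with a short sketch; the perceptual function `P` is passed in as a callable, since its actual shape comes from listening experiments:

```python
def prosody_cost(df0_parts, dd, P):
    """Sub-cost C_pro of Equation (2.1): the average, over the M divided
    parts of a phoneme, of the perceptual penalty P applied to the
    per-part log-F0 difference and the per-phoneme duration difference.
    `P` is the nonlinear perceptual function obtained from listening
    tests (here a hypothetical callable)."""
    M = len(df0_parts)
    return sum(P(df0, dd) for df0 in df0_parts) / M
```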
2.3.3.2 Sub-cost on F0 discontinuity: CF0
The F0 discontinuity at a segment boundary causes a degradation of naturalness, which the CF0 sub-cost captures. This sub-cost (Toda et al., 2006) is computed as a distance based on the log-scaled F0 at the boundary and is given by

CF0(ui, ui−1) = P( DF0(ui, ui−1), 0 )

where DF0 is the difference in log-scaled F0 at the boundary and is set to zero at unvoiced phoneme boundaries. The function P from Equation (2.1) is used to normalize the dynamic range of the sub-cost. The sub-cost becomes zero when the units ui−1 and ui are adjacent in the corpus (Toda, 2003).
2.3.3.3 Sub-cost on phonetic environment: Cenv
The mismatch of phonetic environments between a candidate segment and the target causes a degradation of naturalness (Toda et al., 2006). The Cenv sub-cost captures this degradation and is given by

Cenv(ui, ui−1) = (1/2) { Ss(ui−1, Es(ui−1), ui) + Sp(ui, Ep(ui), ui−1) }    (2.2)
             = (1/2) { Ss(ui−1, Es(ui−1), ti) + Sp(ui, Ep(ui), ti−1) }    (2.3)
Ss and Sp are the sub-cost functions that capture the degradation of naturalness caused by the mismatch with the succeeding and preceding environments, respectively. Es and Ep are the succeeding and preceding phonemes in the corpus, respectively. Thus, Ss(ui−1, Es(ui−1), ti) is the degradation caused by the mismatch with the succeeding environment of the phoneme for ui−1 when Es(ui−1) is substituted by the phoneme for ti, and Sp(ui, Ep(ui), ti−1) is the degradation caused by the mismatch with the preceding environment of the phoneme ui when Ep(ui) is substituted by the phoneme for ti−1. The sub-cost functions Ss and Sp are determined from the results of perceptual experiments. Equation (2.2) is transformed into Equation (2.3) by considering that the phoneme for ui is equivalent to the phoneme for ti and the phoneme for ui−1 is equivalent to the phoneme for ti−1 (Toda, 2003).
This sub-cost is calculated from the results of perceptual experiments. It does not always become zero even when there is no mismatch of phonetic environments, since it also reflects the difficulty of concatenation caused by the uncertainty of segmentation (Klabbers and Veldhuis, 2001). The sub-cost becomes null when the units ui−1 and ui are adjacent in the corpus (Toda, 2003).
2.3.3.4 Sub-cost on spectral discontinuity: Cspec
The spectral discontinuity at a segment boundary causes a degradation of naturalness, which the Cspec sub-cost captures (Toda et al., 2006). This sub-cost is determined as a weighted sum of the mel-cepstral distortions between the frames of a segment and those of the preceding segment around the boundary (Toda, 2003). The sub-cost (Toda, 2003) can be written as

Cspec(ui, ui−1) = cs ⋅ ∑_{f=−w/2}^{w/2−1} h(f) ⋅ MCD(ui, ui−1, f)
where h is a triangular weighting function of length w, and MCD(ui, ui−1, f) is the mel-cepstral distortion between the f-th frame from the concatenation frame (f = 0) of the preceding segment ui−1 and the f-th frame from the concatenation frame (f = 0) of the succeeding segment ui in the corpus. Concatenation is conducted between the −1-th frame of ui−1 and the 0-th frame of ui. cs is a coefficient that normalizes the dynamic range of the sub-cost. The mel-cepstral distortion (Toda, 2003) calculated for each frame pair is written as
MCD = (20 / ln 10) ⋅ √( 2 ⋅ ∑_{d=1}^{40} ( mcα^(d) − mcβ^(d) )² )
where mcα^(d) and mcβ^(d) denote the d-th order mel-cepstral coefficients of frames α and β, respectively. This sub-cost becomes zero when the segments ui−1 and ui are adjacent in the corpus (Toda, 2003).
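A frame-pair mel-cepstral distortion of this form might be computed as follows; this is a sketch in which the coefficient count and scaling follow the formula as printed here, not a definitive implementation of Toda (2003):

```python
import math

def mel_cepstral_distortion(mc_a, mc_b):
    """Mel-cepstral distortion between two frames, following the form
    (20 / ln 10) * sqrt(2 * sum_d (mc_a[d] - mc_b[d])**2) over the
    mel-cepstral coefficients (d = 1..40 in the text; here the length
    is simply that of the input vectors)."""
    sq = sum((a - b) ** 2 for a, b in zip(mc_a, mc_b))
    return (20.0 / math.log(10)) * math.sqrt(2.0 * sq)
```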
2.3.3.5 Sub-cost on phonetic appropriateness: Capp
Phonetic inappropriateness causes a degradation of naturalness, which the Capp sub-cost captures using outlying segments (Toda et al., 2006). The sub-cost is computed as the mel-cepstral distortion between the mean vector of a candidate segment and that of the target (Toda et al., 2006). The sub-cost Capp (Toda, 2003) can be written as

Capp(ui, ti) = ct ⋅ MCD( CEN(ui), CEN(ti) )

where CEN is the mean cepstrum calculated over the frames around the phoneme center, MCD is the mel-cepstral distortion between the mean cepstrum of the segment ui and that of the target ti, and ct is a coefficient that normalizes the dynamic range of the sub-cost (Toda, 2003). This sub-cost is set to zero for unvoiced phonemes (Toda, 2003).
2.3.3.6 Other sub-costs
Besides that, there exist many others sub-cost such as acoustic phonemic
target cost, phoneme cost, phrase cost, etc. Acoustic phonemic target cost (Jan et al.,
2005) is used to measure acoustic distance to acoustic template of a phonetic class.
Foot or phoneme cost is assigned to violations of the same-class constraint. Phrase or
foot cost measure mismatches between the target and the phrase match sequence in
terms of the number and lengths of the feet. Sentence or phrase cost measure
mismatches between the target and the phrase match sequence in terms of the
number and lengths of the feet.
The cost definition must consider spectral compatibility, addressed by the sub-costs: phonological identity difference, phoneme characteristic difference, signal F0 difference, signal F0 derivative difference, signal energy difference and signal energy derivative difference. Long-term compatibility is addressed by the sub-costs: phonological identity difference, phoneme characteristic difference, syllabic position difference, word position difference, breath group position difference, system-predicted duration difference and signal duration difference. The sub-costs for syllabic position difference, word position difference and breath group position difference are devised to favour the coherence of prosodic groups (Blouin et al., 2002).
2.3.3.7 Integrated cost
The optimum set of units for an utterance is selected from a speech corpus in
unit selection. Therefore, local costs for individual units need to be integrated into a
cost for a unit sequence. This cost is referred to as an integrated cost in this
dissertation (Toda et al., 2006). The average cost (AC), given by

AC = (1/N) ⋅ ∑_{i=1}^{N} LC(ui, ti),

is often used as the integrated cost (Hunt and Black, 1996; Campbell and Black, 1997), where N denotes the number of targets in the utterance. The silence before the utterance is represented by the target t0 and the candidate u0, whereas the silence after the utterance is denoted by tN and uN. Both the sub-costs Cpro and Capp are fixed to zero for the pauses. Since the average cost reflects the level of naturalness over the entire synthetic utterance, a unit with an expensive cost can still be selected in the output sequence of units, on the condition that the sequence is optimal in terms of the average cost (Toda et al., 2006).
It might be assumed that the largest cost in the sequence has a significant effect on the degradation of naturalness in synthetic speech. To verify this assumption, the maximum cost (MC),

MC = max_{1<i<N} LC(ui, ti),

is used as the integrated cost. Besides these two integration methods, there is another cost, the norm cost NCp, given by

NCp = [ (1/N) ⋅ ∑_{i=1}^{N} { LC(ui, ti) }^p ]^(1/p).

The norm cost is equivalent to the average cost when p is set to 1, and equal to the maximum cost when p approaches infinity. Thus, both the mean value and the maximum value can be obtained from the norm cost by varying p.
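The three integration methods can be demonstrated with one function, since the average and (approximately) the maximum both fall out of the norm cost by varying p:

```python
def norm_cost(local_costs, p):
    """Norm cost NC_p over the local costs of a unit sequence.
    p = 1 gives the average cost; p -> infinity approaches the maximum."""
    N = len(local_costs)
    return (sum(c ** p for c in local_costs) / N) ** (1.0 / p)
```

For example, with local costs [1, 2, 4], p = 1 gives the mean 7/3, while a large p such as 50 already lies close to the maximum of 4.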
2.4
Cost weighting
There exist many combinations of features for which not every feature can be satisfied simultaneously, so some compromises need to be made. Since each sub-cost has a relative importance within the whole cost function, tuning the weights to reflect this relative importance is an important stage in the design of the selection algorithm (Díaz and Banga, 2006); the highest weight is assigned to the most important sub-cost. Various approaches have been presented for automatically tuning the weights of the cost functions employed in the speech unit selection process. In one approach, various weight sets were integrated into the selection process, and the set giving the smallest mean square error between the original and synthetic pitch contours was selected as the candidate. However, the results show that no set of weights performs consistently across all (or almost all) of the sets (Díaz and Banga, 2006). Thus, adjusting the weights by manual tuning seems to be the only solution to this problem, and further research is needed on optimizing the weights.
However, one possible approach to weight optimization is multiple linear regression (MLR): a multiple linear regression, calculated on sub-cost values generated with the training corpus as a function of an acoustic measure of concatenation quality, is used to optimize the sub-cost weights (Blouin et al., 2002).
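A toy version of such MLR-based weight fitting, solving the normal equations directly, could look like the following; this is illustrative only, as Blouin et al. (2002) do not publish their solver:

```python
def fit_weights(X, y):
    """Least-squares fit (multiple linear regression, no intercept) of
    weights w so that X w ~ y, via the normal equations and Gaussian
    elimination.  X: rows of sub-cost values per training join;
    y: the acoustic measure of concatenation quality for each join."""
    k = len(X[0])
    # Normal equations: (X^T X) w = X^T y
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    # Forward elimination with partial pivoting
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution
    w = [0.0] * k
    for r in range(k - 1, -1, -1):
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, k))) / A[r][r]
    return w
```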
In addition, target cost weights can be adjusted automatically using linear regression in the synthesis system. Including the context cost as a target cost element makes it very critical to use trained weights, since some context mismatches exist when such weights are used. To solve this problem, adjust-and-listen operations need to be performed, starting from the automatically trained weights, until satisfactory results are obtained. The need for such manual tuning may arise because the objective function used in weight training is not perceptually suitable (Hamza et al., 2001).
2.5
Target cost
The target cost is a measure of how far a candidate unit is from the desired position in the synthesized phrase. In the classical approach it is measured against, for example, a desired pitch contour, with the distance derived directly from the text. It is based on two factors: the syllable's position in a word, and the presence and type of a boundary tone. The target cost equals zero if both factors match between the unit in the corpus and the target phrase; otherwise, various cost values are assigned for different cases, depending on the syllable type (stressed, final, other) and the type of boundary tone (ending, low, high, none) (Janicki et al., 2008).
The target cost function expresses how well a unit's phonetic context and prosodic characteristics match those required in the synthetic phone sequence (Vepa and King, 2004). In other words, the target cost is the mismatch between the desired and the candidate's acoustic characteristics (Cepko et al., 2008). These characteristics are denoted by features that can be predicted from the input text, such as duration, intensity and the intonation curve. The target cost function captures the degradation of naturalness caused by the differences between a target and a selected unit in the phonetic environment, log F0 (fundamental frequency), phone duration, and MFCC (mel-frequency cepstral coefficients) (Nishizawa and Kawai, 2006). The target cost is usually calculated as the weighted sum of the differences between the prosodic and phonetic parameters of the target and candidate units (Vepa and King, 2004).
The target cost can be further divided into two types: phonetic target costs and prosodic target costs. The phonetic target cost (Zhao et al., 2006) contains sub-costs for the Left Phone Context and the Right Phone Context. The prosodic target costs (Zhao et al., 2006) contain sub-costs for Position in Phrase, Position in Word, Position in Syllable, Accent Level in Word, Emphasis Level in Phrase, etc.
2.6
Concatenation cost
The features included in the concatenation cost calculation may be spectral-type coefficients that parameterize the borders of the speech units in the corpus; the concatenation cost is then the distortion between these parameters of two adjacent candidate units (Zhao et al., 2006). The concatenation cost measures the acoustic smoothness between the concatenated units (Dong and Li, 2008). According to Nishizawa and Kawai (2006), the concatenation cost function captures the degradation of naturalness caused by discontinuity at the unit boundary in F0 and MFCC.
According to Janicki et al. (2008), the concatenation cost is a measure of how much is lost by concatenating acoustic units. It is also based on linguistic information, because its main component is a cost related to the change of a phoneme's context. This information can be obtained directly from the analysis of the input and corpus texts, and the cost is computed as follows (Janicki et al., 2008):
• a null cost is assigned if the left neighbor in the corpus is exactly the same unit as the left neighbor in the synthesized phrase;
• a certain cost value is assigned if the left neighbor in the corpus is the same phoneme as the left neighbor in the synthesized phrase;
• a higher cost is applied if the left corpus neighbor belongs only to the same phonetic category as the left neighbor in the phrase;
• an even higher cost is applied if only the voicing information agrees;
• the highest cost is set in all other cases.
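The cost ladder above can be sketched directly; the numeric cost levels here are placeholders, as Janicki et al. (2008) do not fix them in this summary, and the category/voicing agreement flags are assumed precomputed:

```python
def context_cost(corpus_left, phrase_left, same_category, same_voicing):
    """Sketch of the context-change cost ladder of Janicki et al. (2008).
    `corpus_left` / `phrase_left` are (unit_id, phoneme) pairs for the
    left neighbor in the corpus and in the synthesized phrase."""
    if corpus_left == phrase_left:        # exactly the same unit
        return 0.0
    if corpus_left[1] == phrase_left[1]:  # same phoneme, different unit
        return 1.0
    if same_category:                     # same phonetic category only
        return 2.0
    if same_voicing:                      # only voicing agrees
        return 3.0
    return 4.0                            # all other cases
```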
One component of the concatenation cost is a measure of the F0 difference, which is calculated only for units neighboring voiced parts (Janicki et al., 2008). It is believed that unit selection that estimates the concatenation cost based on the change of the phoneme's context, together with the F0 difference, should bring results similar to methods using spectral distance measures (Wouters and Macon, 1998), but at a much lower computational cost.
The ideal concatenation cost should correlate highly with human listeners' perceptions of discontinuity at the concatenation points. That is, the concatenation cost should predict the degree of perceived discontinuity, even though its computation is based only on measurable properties of the candidate units, such as spectral parameters, amplitude and F0 (Vepa and King, 2004).
The concatenation cost function contains a component that measures differences in the spectral properties of the speech on either side of a proposed join between two candidate units (Vepa and King, 2004). Blouin et al. (2002) constructed a concatenation cost function based on phonetic and prosodic features; the function is defined as a weighted sum of sub-costs, each of which is a function of various symbolic and prosodic parameters. Multiple linear regression was used to optimize the weights as a function of an acoustic measure of concatenation quality. Perceptual evaluation results showed that the concatenation sub-cost weights determined automatically were better than hand-tuned weights, with or without applying smoothing after unit concatenation (Vepa and King, 2004).
Kawai and Tsuzaki (2002) compared acoustic measures and phonetic features in terms of their ability to predict audible discontinuities. MFCCs were employed to derive the acoustic measures, and the distances between certain frames were measured using Euclidean distances. Phonetic features were found to be more efficient than acoustic measures at predicting audible discontinuities (Vepa and King, 2004). Figure 2.5 shows unit selection based on minimizing the target and concatenation costs. Figure 2.6 shows the search graphs for the global optimum and a local minimum.
Figure 2.5 Example of unit search algorithm. The shortest path is marked in blue
(Janicki et al., 2008).
(a) The search graph for the global optimum.
(b) The search graph for the local minimum where d3 is fixed.
Figure 2.6 Example of unit search algorithm. The difference in cost between the
optimal sequences of two graphs is evaluated for d3 in pre-selection (Nishizawa and
Kawai, 2008).
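The shortest-path search illustrated in Figures 2.5 and 2.6 is typically a Viterbi-style dynamic program over the candidate lattice. A minimal sketch follows, with hypothetical `target_cost` and `concat_cost` callables standing in for the cost functions of Sections 2.5 and 2.6:

```python
def viterbi_select(candidates, target_cost, concat_cost):
    """Dynamic-programming search for the unit sequence minimizing the
    summed target and concatenation costs.  candidates[i] lists the
    candidate units for target position i."""
    # best[u] = (cost of the cheapest path ending at u, that path)
    best = {u: (target_cost(u, 0), [u]) for u in candidates[0]}
    for i, layer in enumerate(candidates[1:], start=1):
        new = {}
        for u in layer:
            # cheapest predecessor for u among the previous layer
            prev, (c, path) = min(
                ((p, best[p]) for p in best),
                key=lambda kv: kv[1][0] + concat_cost(kv[0], u))
            new[u] = (c + concat_cost(prev, u) + target_cost(u, i),
                      path + [u])
        best = new
    return min(best.values(), key=lambda cp: cp[0])
```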
2.7
Spectral Distances
The concatenation cost function consists of a distance measure that operates
on some parameterization of the final and initial frames of two units to be
concatenated (Vepa and King, 2004), as shown in Figure 2.7. Thus, a wide variety of
distance measures and parameterizations are possible.
Figure 2.7 Objective spectral distances (Vepa and King, 2004).
2.8
Feature Extraction
2.8.1
MFCC
One of the most important issues in the field of speech synthesis is feature extraction (Khor, 2007). MFCCs are a representation defined as the real cepstrum of a windowed short-time signal derived from the fast Fourier transform (FFT) of the speech signal (Vepa and King, 2004). MFCCs differ from the real cepstrum in that a nonlinear, perceptually motivated frequency scale is used, which approximates the behavior of the human auditory system. Examples of parameterizations include linear prediction (LP) coefficients; the LP spectrum; mel-frequency cepstral coefficients (MFCC); line spectral frequencies (LSF); perceptual linear prediction (PLP) coefficients; the PLP spectrum; and multiple centroid analysis (MCA) coefficients.
MFCCs perform well in speech recognition systems. Good performance in speech synthesis systems, meanwhile, is likely to be determined more by the relationship between parameter values and the human perception of speech sounds (Donovan, 2003). Wouters and Macon (1998) conducted a study to measure the correlation between a number of candidate spectral distance measures, based on different speech parameterizations, and the human perception of the differences between speech sounds. Their results indicated that MFCCs worked as well as any other parameterization they tested, whether used in a Euclidean distance measure or in a Mahalanobis distance measure computed using parameter variances estimated from their whole database.
[Figure 2.8 depicts the processing chain: Speech → Preemphasis → Windowing & Overlapping → FFT → Mel-Frequency Filter Bank → Cepstrum, with Logged Energy and Delta features appended to form the MFCC vector.]
Figure 2.8 Block diagram of the conventional MFCC extraction algorithm (Khor, 2007).
MFCCs are capable of capturing the phonetically important characteristics of speech (Wong and Sridharan, 2001). MFCCs are coefficients that represent audio based on perception. The mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another. A cepstrum is the result of taking the Fourier transform (FT) of the decibel spectrum as if it were a signal; the cepstrum serves as an excellent feature vector for representing the human voice and musical signals. The spectrum is usually first transformed using the mel frequency bands, and the result is called the MFCCs. The mel scale is given by

m = 2595 ⋅ log10( 1 + f/700 ).

Equivalently, the transformation from linear frequency to mel frequency can be written as

Mel(f) = 1127 ⋅ ln( 1 + f/700 ).
It has been found that the perception of a particular frequency by the auditory system is influenced by the energy in a critical band around that frequency; a final inverse Fourier Transform applied to the log mel filter-bank energies produces the cepstral coefficients, the MFCCs (Figure 2.8).
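The two mel-scale formulas above are in fact the same mapping, since 2595/ln(10) ≈ 1127. A minimal sketch of the conversion (function names are illustrative, not from the system described here):

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert a linear frequency in Hz to the mel scale:
    m = 2595 * log10(1 + f/700), equivalently 1127 * ln(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse mapping from mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

By construction, 1000 Hz maps to approximately 1000 mel, and the scale grows roughly logarithmically above that point, mirroring how listeners judge pitch distances.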
2.9 Distance Measures
Two stages are involved in the computation of a distance measure: feature extraction and quantification. A feature vector is extracted from each of the pair of candidate units and then the distance between the feature vectors representing the units is computed (Kirkpatrick and O'Brien, 2006). The distance measure quantifies the difference between two vectors of such parameters. Some examples of distance measures are the absolute magnitude distance, Euclidean distance, Mahalanobis distance and Kullback-Leibler (KL) divergence. All the distance measures listed above are metrics except the KL divergence. The symmetrical version of the KL divergence can be used to compute the distance between two speech parameterizations (Vepa and King, 2004). A psychoacoustic experiment on listeners' detectability of signal discontinuities in concatenative speech synthesis was performed by Stylianou and Syrdal (2001). The results indicated that a symmetrical Kullback-Leibler (KL) distance between FFT-based power spectra and the Euclidean distance between MFCCs had the highest prediction rates (Stylianou and Syrdal, 2001). According to Klabbers et al. (2000), the Kullback-Leibler (KL) distance was found to be the best predictor of audible discontinuities.
Distance measures have many applications in speech technologies. For speech coding, they are applied as objective measures of speech quality and also in the design of vector quantization algorithms. In unit selection synthesis, therefore, an objective distance measure which is able to predict audible discontinuities is very important (Stylianou and Syrdal, 2001): the concatenation cost should be a distance measure that is best able to predict audible discontinuities. Higher concatenation costs will be assigned to units that are predicted to produce audible discontinuities in concatenation, and thus they will be less likely to be selected.
Currently, the most widely used distance in speech recognition is the Euclidean distance between MFCCs. Inspired by speech recognition methods, some speech synthesis unit selection algorithms also use the Euclidean distance between MFCCs (Stylianou and Syrdal, 2001).
2.9.1 Simple Distance Measures
2.9.1.1 Absolute Distance
The simple absolute distance is calculated as the sum of the absolute magnitude differences between individual features of the two feature vectors, as shown in Equation (2.4):

D_abs(X, Y) = Σ_{i=1}^{N} |X_i − Y_i|   (2.4)
2.9.1.2 Euclidean Distance
The Euclidean distance between two feature vectors, X and Y, is calculated as shown in Equation (2.5). However, the Euclidean distance does not take into account the variances or covariances of the distribution of the feature vectors (Vepa and King, 2004).

D_Eu(X, Y) = sqrt( Σ_{i=1}^{n} (X_i − Y_i)² )   (2.5)

2.9.2 Statistically Motivated Distance Measures
Well-known examples of distance measures from statistics are the Mahalanobis distance and the Kullback-Leibler divergence (Vepa and King, 2004).
2.9.2.1 Mahalanobis Distance
The Mahalanobis distance is a generalization of the standardized (Euclidean) distance in that it takes into account the variance or covariance of individual features (Vepa and King, 2004). Donovan (2001) used this distance measure in a concatenation cost function. The Mahalanobis distance requires the estimation of covariance matrices. The off-diagonal elements of the covariance matrix are assumed to be zero, which saves computational cost and storage. Results showed that making this diagonal covariance assumption was reasonable, since using full covariance matrices did not improve performance over using diagonal matrices. Equation (2.6) shows the Mahalanobis distance for two feature vectors, X and Y, with a diagonal covariance matrix:
D_Ma(X, Y)² = Σ_{i=1}^{n} [ (X_i − Y_i) / σ_i ]²   (2.6)

where σ_i is the standard deviation of the i-th feature of the feature vectors (i.e., the i-th diagonal entry of the covariance matrix).
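The three distances in Equations (2.4)–(2.6) can be sketched as follows (a minimal illustration; the vectors and standard deviations in the usage example are arbitrary toy values):

```python
import math

def absolute_distance(x, y):
    # Equation (2.4): sum of absolute component differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def euclidean_distance(x, y):
    # Equation (2.5)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def mahalanobis_distance(x, y, sigma):
    # Equation (2.6), diagonal-covariance form; sigma holds the
    # per-feature standard deviations estimated from the database
    return math.sqrt(sum(((xi - yi) / si) ** 2
                         for xi, yi, si in zip(x, y, sigma)))

# Toy usage: two 2-dimensional feature vectors
x, y = [1.0, 2.0], [4.0, 6.0]
d_abs = absolute_distance(x, y)          # 3 + 4 = 7
d_eu = euclidean_distance(x, y)          # sqrt(9 + 16) = 5
d_ma = mahalanobis_distance(x, y, [3.0, 4.0])  # sqrt(1 + 1)
```

Note how the Mahalanobis distance shrinks differences along features with large variance, which is exactly the normalization the plain Euclidean distance lacks.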
2.9.2.2 Kullback–Leibler (KL) distance
The KL distance originates from statistics and is asymmetrical. It is also known as the divergence, discrimination or relative entropy in information theory. It is used to quantify differences between two probability distributions or densities. The KL distance can also be employed to quantify differences in shape of strictly positive sequences (or functions) whose sum (or integral) equals one. In concatenative speech synthesis, it has been used to quantify the differences between spectral envelopes at concatenation points (Veldhuis, 2002). The spectral envelopes at the boundary are viewed as probability distributions, as shown in Equation (2.7) (Zhao et al., 2006). In other words, the distance between the vectors is computed to quantify the degree of similarity between two feature vectors, P and Q (Kirkpatrick and O'Brien, 2006).

The original asymmetrical definition of the KL distance is changed into a symmetrical version, as shown in Equation (2.8), with the property that d_SKL(P, Q) = d_SKL(Q, P). The symmetric Kullback-Leibler distance measures the dissimilarity between two probability distributions, P(θ) and Q(θ) (Kirkpatrick and O'Brien, 2006). It has the important property that it emphasizes differences in spectral regions with high energy rather than differences in spectral regions with low energy. Thus, more emphasis falls on spectral peaks than on the valleys between the peaks, and low frequencies are emphasized more than high frequencies (Klabbers et al., 2000).
(1/2π) ∫_0^{2π} P(θ) dθ = (1/2π) ∫_0^{2π} Q(θ) dθ = 1

d_KL(P, Q) = (1/2π) ∫_0^{2π} P(θ) log [ P(θ) / Q(θ) ] dθ   (2.7)
d_SKL(P, Q) = (1/2) [ d_KL(P, Q) + d_KL(Q, P) ]
            = (1/2) { (1/2π) ∫_0^{2π} P(θ) log [ P(θ)/Q(θ) ] dθ + (1/2π) ∫_0^{2π} Q(θ) log [ Q(θ)/P(θ) ] dθ }
            = (1/4π) ∫_0^{2π} P(θ) log [ P(θ)/Q(θ) ] dθ + (1/4π) ∫_0^{2π} Q(θ) log [ Q(θ)/P(θ) ] dθ
            = (1/4π) ∫_0^{2π} ( P(θ) − Q(θ) ) log [ P(θ)/Q(θ) ] dθ   (2.8)
To evaluate Equation (2.8), the standard procedure is to perform the integral as a summation over discrete frequencies. The discrete summation approximation can be written as

D_SKL(P, Q) = Σ_{i=1}^{N} (P_i − Q_i) log ( P_i / Q_i )   (2.9)

Equation (2.9) is valid only if the subscript i is a frequency index (Vepa and King, 2004). Thus, this distance measure has not been used with MFCCs.
2.10 Heuristic Method
There are two different types of heuristics: constructive methods and destructive methods (Manuel, 1997). Constructive methods can be categorized as iterative improvement algorithms: starting from an initial configuration, they generate a sequence of configurations until a satisfactory one is found. A destructive method is a hybrid "strategic oscillation" approach that complements a constructive approach by employing alternating constructive and destructive phases of varying amplitudes (Manuel, 1997). Metaheuristics have experienced remarkable growth over the past decade. The success of metaheuristics lies in their flexibility in application to complex optimization problems (Vasan and Komaragiri, 2009). Among the most popular metaheuristics are simulated annealing, tabu search and genetic algorithms. These heuristics have unique search mechanisms that allow them to escape local optima (Vasan and Komaragiri, 2009).
Heuristic algorithms do not guarantee that an optimal solution will be found. However, solutions that are close to optimal, within a few percent of the optimum, can often be found quickly. An ideal global optimization method should simultaneously meet two requirements: finding the global minimum of a multimodal objective function and having a high convergence speed (Wang et al., 1996).
Simple heuristics have been used in some previously reported speech synthesis systems to select suitable candidate units (Qing et al., 2008). The selection criterion is based on the cost degradation from the optimal sequence. A lattice generated by splitting all candidate units into instances is searched with a Viterbi shortest-path algorithm (Blouin et al., 2002). In concatenative speech synthesis systems, each speech unit in the system's speech database is evaluated in order to find the most appropriate unit, i.e. the one with the lowest cost (Hirai et al., 2002).
2.10.1 Simulated Annealing (SA)
The SA algorithm is based on Monte Carlo methods and may be considered a special form of iterative improvement (Manuel, 1997). Kirkpatrick et al. (1983) first proposed SA as a method for solving combinatorial optimization problems. Simulated annealing was given this name in analogy to the annealing process in thermodynamics (Gao and Tian, 2007), specifically the way metal is heated and then gradually cooled so that its particles attain the minimum energy state, with the optimization process carried out by applying the Metropolis criterion (Triki et al., 2005). The algorithm is controlled by the parameters of the cooling schedule. Simulated annealing is a random search method that can be employed to solve mixed discretization problems and complicated non-linear problems (Turgut et al., 2003). It is generally applicable to combinatorial optimization problems, which it solves by generating a sequence of moves at descending values of a control parameter (Jeong and Kim, 1990). The aim of simulated annealing is to choose a good solution to an optimization problem according to some cost function on the state space of possible solutions (Rose et al., 1990).
Simulated annealing is a generalization of the local search algorithm (Turgut et al., 2003). In its iterative process, the simulated annealing algorithm is allowed to accept non-improving neighbouring solutions with a certain probability (Turgut et al., 2003), to avoid being trapped at a poor local optimum, whereas an iterative improvement algorithm would accept only cost-decreasing moves (Turgut et al., 2003). The probability of acceptance of a state with energy E at thermal equilibrium, where k_b denotes the Boltzmann constant, is given by the Boltzmann distribution (Vasan and Komaragiri, 2009):

P(E) = exp( −E / (k_b T) )
This equation indicates that a system at a high temperature has almost uniform
probability of being at any energy state, but at a low temperature it has a small
probability of being at a high energy state (Vasan and Komaragiri, 2009).
However, it has been shown that simulated annealing eventually produces better solutions than the iterative improvement algorithm (Turgut et al., 2003). The generation scheme for obtaining a new configuration is called a move, and it is crucial to both the quality of the solution and the speed of the algorithm (Jeong and Kim, 1990). Since its search strategy can avoid the search process being trapped in a local optimum, SA is an effective global optimization algorithm (Gao and Tian, 2007). It has been shown theoretically that a global optimum of the optimization problem can be reached with probability one, provided a set of conditions regarding the acceptance and generation mechanisms is satisfied (Manuel, 1997). Besides that, simulated annealing scales to large optimization problems and is robust against premature convergence to local optima (Turgut et al., 2003). Due to these advantages, SA-based methods may have great potential for obtaining high-quality solutions when applied to various discrete optimization problems. However, the simulated annealing algorithm requires a large amount of iterative computation, even though it can obtain a global optimum solution (Manuel, 1997). Thus, SA has a slow convergence rate.
2.10.2 Approaches to improve SA algorithm
There are two approaches to speed up simulated annealing which are cooling
schedule improvements and generation mechanism design (Manuel, 1997). For
cooling schedule improvements, it has to deal with careful control of the annealing
process. There are a number of problem-independent general annealing processes
that have been reported in the literature (Manuel, 1997). The adaptive cooling
schedules which explain their efficiency has been used in the literature. The
annealing parameters are determined automatically from measures of statistical
quantities related to the particular problem at hand (Manuel, 1997).
In generation mechanism design, smart generation mechanisms based on the idea of range limiting or changes in the cost function are employed to reduce the chance of generating a next state that is going to be rejected (Manuel, 1997). Such moves lead to a significant reduction in computational time at low temperatures, where the probability of acceptance is very low. Enlarging the neighbourhood structure by combining several simple moves into a complex one is another method of generation mechanism design (Manuel, 1997). A larger neighbourhood structure allows faster exploration of the configuration set and a higher probability of escaping from local minima. However, these approaches are usually problem-dependent and therefore have not been widely used (Manuel, 1997).
However, although the theoretical basis of the algorithm has been known for almost two decades, there is still a lack of practical information about it. Thus, it is not an easy task for a user to design his own algorithm. More importantly, no theoretical results give a clear statement about which temperature decrement rule should be used or what kind of neighbourhood should be chosen (Triki et al., 2005).
2.10.3 Polynomial approximation
As asymptotic convergence requires infinite computing time, and as a simple enumeration of the configuration set has exponential time complexity, polynomial-time approximations are used in practice, while preserving as much as possible the flavour of the convergence theory (Manuel, 1997). The proper procedure (Manuel, 1997) is to choose a suitable cooling schedule, that is, to decide on
i) the initial condition, i.e. the initial temperature, t_0,
ii) the decrement rule (annealing schedule) for the temperature,
iii) the equilibrium condition, i.e. the length of the Markov chains, and
iv) the stop condition, i.e. the final temperature.
In doing so, most cooling schedules try to establish and maintain equilibrium at each temperature level by appropriately adjusting the length of the Markov chains and the cooling rate. Polynomial-time approximation can at best reach pseudo-equilibrium (Manuel, 1997). To solve a discrete optimization problem using simulated annealing, the following steps need to be performed: first, express the problem as a cost function optimization problem by defining the configuration set S, the cost function C, and the neighbourhood structure N. Next, choose an annealing schedule. Finally, conduct the annealing process.
A general approach for a simulated annealing procedure (Algorithm 2.2) is described in the following pseudo-code (Manuel, 1997):

Algorithm 2.2:
1. Initialize(s_0, t_0)
2. k := 0
3. s := s_0
4. Until equilibrium is reached, do:
   4.1 Generate s' from N_s
   4.2 Metropolis test:
       4.2.1 If min{ 1, exp( −(C(s') − C(s)) / t_k ) } > random[0, 1), then continue with s := s'.
       4.2.2 Else continue with the old s.
5. If the stopping criterion is valid, stop.
6. k := k + 1.
7. Calculate t_k.
8. Go to 4.
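Algorithm 2.2 can be sketched in Python as follows. This is a minimal illustration only: a geometric cooling rule and a fixed-length inner loop stand in for the equilibrium and stopping conditions, which Algorithm 2.2 leaves abstract, and the one-dimensional cost function and neighbourhood at the end are hypothetical examples.

```python
import math
import random

def simulated_annealing(cost, neighbour, s0, t0=10.0, alpha=0.9,
                        chain_len=100, t_final=1e-3):
    """Minimize `cost` following Algorithm 2.2.
    cost: configuration -> float; neighbour: configuration -> new configuration."""
    s, c_s = s0, cost(s0)
    best, best_c = s, c_s
    t = t0
    while t > t_final:                 # stop condition (step 5)
        for _ in range(chain_len):     # Markov chain at fixed temperature (step 4)
            s_new = neighbour(s)       # step 4.1: generate s' from N_s
            c_new = cost(s_new)
            delta = c_new - c_s
            # step 4.2: Metropolis test — improvements are always accepted,
            # deteriorations with probability exp(-delta / t)
            if delta <= 0 or math.exp(-delta / t) > random.random():
                s, c_s = s_new, c_new
                if c_s < best_c:
                    best, best_c = s, c_s
        t *= alpha                     # step 7: geometric cooling
    return best, best_c

# Hypothetical example: minimize (x - 3)^2 over the real line
random.seed(0)
sol, c = simulated_annealing(lambda x: (x - 3.0) ** 2,
                             lambda x: x + random.uniform(-1.0, 1.0),
                             s0=0.0)
```

At high t the exponential is close to one, so almost any move passes the test; as t falls, the acceptance of cost-increasing moves becomes increasingly rare, exactly as described in the text below the algorithm.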
Initially, an initial solution s_0 is randomly generated. To allow most of the proposed transitions to pass the Metropolis criterion, the corresponding initial temperature t_0 has to be set high enough. A free search of the configuration set is intended at the beginning of the algorithm. The search becomes stricter as t decreases, and fewer proposed transitions are accepted. Finally, a local minimum is reached at very small values of t. At this stage, the annealing process evolves to a final configuration where no proposed transition is accepted at all. The last optimum solution found can be interpreted as a solution of the discrete optimization problem.
2.10.4 Annealing Schedule
There are several theoretical and empirical cooling schedules, which can be categorized into classes such as monotonic schedules, adaptive schedules, geometric schedules and quadratic cooling schedules (Nader and Saeed, 2004). An SA algorithm usually contains two nested loops: an outer loop, which controls the decrease of the temperature, and an inner loop, in which the temperature remains constant and which consists of a Metropolis algorithm (Triki et al., 2005).
2.10.4.1 Theoretically optimum cooling schedule
The annealing schedule given by Equation (2.10) ensures convergence of the SA to the optimum solution (Hajek, 1988):

T_k = C / ln(1 + k)   (2.10)

where k = 1, 2, ... is the index of the outer loop and C is the depth of the deepest local minimum. However, this optimum cooling schedule is only of theoretical interest, since the decrease of the temperature is too slow.
2.10.4.2 Geometric cooling schedule
The most frequently used annealing schedule is given by

T_{k+1} = α · T_k,   (0 < α < 1)

where α denotes the cooling factor (Triki et al., 2005). Usually the value of α is chosen in the range between 0.5 and 0.99. Since it is very simple, this cooling schedule provides a baseline for comparison with more sophisticated schedules.
2.10.4.3 Cooling schedule of Van Laarhoven et al.
The annealing schedule presented in Equation (2.11), shown to terminate in polynomial time, was proposed by Van Laarhoven and Aarts (1987):

T_{k+1} = T_k · 1 / ( 1 + T_k · ln(1 + δ) / (3σ(T_k)) )   (2.11)

where δ is a "small" real number. By keeping the homogeneous Markov chains close to each other, it is hoped that a small number of transitions will be sufficient to reach thermal equilibrium after each temperature decrement. Lundy and Mees (1986) described a similar annealing schedule, given in Equation (2.12):

T_{k+1} = T_k · 1 / ( 1 + (γ/U) · T_k )   (2.12)

where U is some upper bound on ( f(x) − f_opt ) and γ is a "small" real number.
2.10.4.4 Cooling schedule of Otten et al.
The annealing schedule proposed by Otten and van Ginneken (1984) can be written as

T_{k+1} = T_k − (1/M_k) · T_k³ / σ²(T_k)   (2.13)

where M_k is given by

M_k = T_k · ( C_max + T_k · ln(1 + δ) ) / ( σ²(T_k) · ln(1 + δ) )   (2.14)

and C_max is an estimate of the maximum value of the cost function. After substituting Equation (2.14) into (2.13), the annealing schedule in Equation (2.13) can be simplified to

T_{k+1} = T_k · 1 / ( 1 + T_k · ln(1 + δ) / C_max ),

which is similar to the annealing schedule (2.11).
2.10.4.5 Cooling schedule of Huang et al.
The annealing schedule proposed by Huang et al. (1986) is based on the average cost values of consecutive Markov chains. Starting from Equation (2.15),

d⟨f(T)⟩ / dT = σ²(T) / T²   (2.15)

where ⟨f(T)⟩ is the expected cost in equilibrium and σ²(T) is the variance of the cost at equilibrium, the authors obtained

d⟨f(T)⟩ / d ln(T) = σ²(T) / T

Thus

ln(T) − ln(T − δT) = T · Δ(T) / σ²(T)

where

Δ(T) = ⟨f(T)⟩ − ⟨f(T − δT)⟩
Finally,

T − δT = T · exp( −T · Δ(T) / σ²(T) )   (2.16)
The Δ(T) in Equation (2.16) is replaced by Huang et al. (1986) with λσ(T), where λ (0 < λ ≤ 1) is a constant parameter that has to be determined by the user. It is expected that quasi-equilibrium will be maintained by setting the difference Δ(T) to be less than the standard deviation of the cost. Finally, the annealing schedule in (2.17), with a typical value of λ equal to 0.7, is obtained:

T_{k+1} = T_k · exp( −λT_k / σ(T_k) )   (2.17)
This annealing schedule has been widely used and is known to be an efficient general
cooling schedule (Triki et al., 2005).
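The schedules in Sections 2.10.4.2 through 2.10.4.5 differ only in how the next temperature is computed from the current one. A side-by-side sketch (the quantities σ(T_k), σ²(T_k) and Δ(T_k) are assumed to be estimated from the Markov chain at the current temperature; parameter names are illustrative):

```python
import math

def geometric(t_k, alpha=0.9):
    # Section 2.10.4.2: T_{k+1} = alpha * T_k, with 0 < alpha < 1
    return alpha * t_k

def van_laarhoven(t_k, sigma_k, delta=0.1):
    # Equation (2.11): slowdown proportional to T_k / sigma(T_k)
    return t_k / (1.0 + t_k * math.log(1.0 + delta) / (3.0 * sigma_k))

def huang(t_k, sigma_k, lam=0.7):
    # Equation (2.17): T_{k+1} = T_k * exp(-lambda * T_k / sigma(T_k))
    return t_k * math.exp(-lam * t_k / sigma_k)

def triki(t_k, sigma_sq_k, delta_cost):
    # Equation (2.18): T_{k+1} = T_k * (1 - T_k * Delta(T_k) / sigma^2(T_k))
    return t_k * (1.0 - t_k * delta_cost / sigma_sq_k)
```

All four decrease the temperature monotonically for sensible parameter values; the adaptive rules slow the cooling when the cost variance is large, i.e. when the system is still far from equilibrium.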
2.10.4.6 Adaptive cooling schedules
In adaptive cooling schedules, the computation of the next temperature value is based on the past history of the run. The aims of adaptive cooling schedules are to keep the annealing as close to equilibrium as possible and to keep the annealing process as short as possible (Triki et al., 2005). These two goals contradict each other.
2.10.4.7 A new adaptive cooling schedule
Triki et al. (2005) looked for a cooling schedule that would allow them to control the difference in average cost between two sequences of trials. This cooling schedule is based on the observation that the average cost does not vary proportionally to the temperature. The annealing schedule given by Triki et al. (2005) is as in Equation (2.18):

T_{k+1} = T_k · ( 1 − T_k · Δ(T_k) / σ²(T_k) )   (2.18)

Δ(T) can be controlled by the user, where Δ(T) is defined by Equation (2.19):

Δ(T) = ⟨f(T)⟩ − ⟨f(T − δT)⟩   (2.19)
The annealing schedule in Equation (2.18) has several good properties. Firstly, this schedule relies only on the parameter Δ(T). Information on the problem difficulty can be obtained during the execution of the SA algorithm. The theoretical evolution of the expected cost is assigned once and for all by the choice of Δ(T). To determine whether the current temperature fits the problem well, the practical average cost at temperature T is compared with the theoretical expected cost. If these two costs fit well, the problem is said to be easy at this temperature. Otherwise, the problem is said to be difficult at this temperature, and a new tuning of the SA parameters is necessary (Triki et al., 2005).
The practical expected cost and the theoretical expected cost can serve as a guideline for thermal equilibrium. When the difference between the practical expected cost and the theoretical expected cost becomes significant, thermal equilibrium has not been reached (Triki et al., 2005). The possible actions then include increasing the number of trials at the current temperature, choosing a new smaller Δ(T), or stopping SA and starting a greedy algorithm (Triki et al., 2005).
2.11 Parallel SA

[Figure: taxonomy tree. Parallel Simulated Annealing splits into Synchronous — comprising Serial-Like (Functional Decomposition, Simple Serializable Set, Decision Tree) and Altered Generation (Spatial Decomposition, Shared State-Space) — and Asynchronous (Spatial Decomposition, Shared State-Space, Systolic).]
Figure 2.9 Parallel Simulated Annealing Taxonomy (Daniel, 1995).
A taxonomy of parallel simulated annealing techniques with three major classes is shown in Figure 2.9. The three major classes are serial-like, altered generation and asynchronous (Daniel, 1995). An algorithm is called synchronous if adequate synchronization ensures that its cost function calculations are the same as those in a similar sequential algorithm (Daniel, 1995). Serial-like and altered generation are the two major categories of synchronous algorithms (Daniel, 1995). Serial-like algorithms maintain the convergence properties of sequential annealing (Daniel, 1995). Altered generation algorithms modify state generation but calculate the same cost function (Daniel, 1995). Asynchronous algorithms eliminate some synchronization and tolerate errors to obtain better speedup, at the possible price of reduced solution quality (Daniel, 1995).
Furthermore, simulated annealing can be parallelized by generating several perturbations to the current solution simultaneously, which requires synchronization to guarantee correct evaluation of the cost function (Durand and White, 2000). Many parallel versions of SA have been developed in order to improve SA performance. One approach to parallel SA is to generate and evaluate several moves simultaneously. This approach is application-independent and allows the exploitation of a reasonable amount of parallelism (Durand and White, 2000).
2.12 Segmented Simulated Annealing
Figure 2.10 shows the procedure for segmented simulated annealing. One of the weaknesses of SA is its limited coverage of the search space. In certain cases, this weakness prevents the search from converging to an acceptable solution if the initial parameters are not near the optimum region (McGookin and Murray-Smith, 2006). This problem can be eliminated by the segmented simulated annealing (SSA) algorithm (McGookin et al., 1996; Atkinson, 1992). The idea behind SSA is to consecutively execute a number of single SA processes. SSA covers more of the search space than conventional SA because each single SA process starts at a different point in the search space, chosen from a wide range of possible initial values, which segments the search space into smaller regions (McGookin and Murray-Smith, 2006). The number of final costs available is equal to the number of consecutively executed single SA processes. These final cost values are sorted into ascending order. The best cost, i.e. the smallest value, is taken to be the optimum, with its corresponding parameters providing the optimal result. Because SSA provides a much wider exploration of the search space than conventional SA, particularly in the initial stages, it is a better search method.
[Figure: flow chart — randomly generate initial values; generate the desired number of scaling factors (one for each run); apply simulated annealing for one scaling factor; store the results; repeat until all desired runs are completed; sort the runs into ascending cost order; take the minimum-cost run as the optimum; end.]
Figure 2.10 Segmented simulated annealing (McGookin and Murray-Smith, 2006).
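The SSA procedure in Figure 2.10 amounts to running several independent SA processes from different starting points and keeping the cheapest result. A minimal sketch (here `single_sa_run` stands for any single-run SA routine such as Algorithm 2.2, returning a (solution, cost) pair; the names are illustrative):

```python
def segmented_sa(single_sa_run, random_start, n_runs=10):
    """Run `single_sa_run` (start -> (solution, cost)) from n_runs
    different initial points and return the minimum-cost result."""
    results = [single_sa_run(random_start()) for _ in range(n_runs)]
    results.sort(key=lambda r: r[1])   # sort runs into ascending cost order
    return results[0]                  # smallest cost is taken as the optimum
```

Because each run starts in a different region, the collection of runs explores far more of the search space than a single annealing chain would, at the cost of n_runs times the computation.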
CHAPTER 3
PROPOSED SYSTEM AND IMPLEMENTATION
3.0 Introduction
The main function of this system is to select the best candidate units to form the speech utterance based on cost functions. This is done by implementing unit selection using speech features (MFCCs) and applying a distance measure (the Euclidean distance) between pairs of speech feature vectors. A heuristic method, namely simulated annealing, is employed during the search process. The shortest-distance units are selected and concatenated. A listening test on the resulting speech utterance (waveform) is conducted, and conclusions are drawn based on the result of the listening test.
3.1 System Design Flow

[Figure: flow chart — input text; search for and select the candidate units from the database based on phonetic context (target cost); take all the selected units as input for unit selection; implement SA in unit selection, where units are selected by cost minimization (concatenation cost); concatenate the selected units; output sound; design words and perform the listening test; draw conclusions based on the listening test and evaluate the performance of SA.]
Figure 3.1 Block Diagram of System Design Flow.
3.2 Malay Phonetics and Phone Sets

The smallest units of a language are its phone sets (Farid, 1980). There are a total of 32 phone sets in the Malay language, including 6 vowels and 26 consonants (Nik et al., 1989). The Malay language has fewer vowels than US English because the pronunciation rules of Malay are more direct than those of English (Raminah and Rahim, 1987). A Malay vowel sound such as "a" has the same pronunciation wherever it is placed, whether at the front, middle or back of a word (Onn, 1993). In English, the vowel "a" may vary with its context or position.
3.3 Malay Phoneme

Malay phonemes can be classified into two main categories, namely Malay vowels and consonants (Tan, 2003).
3.3.1 Malay Vowels
Vowels are steady-state voiced sounds (Luis, 1997). Each vowel has a unique area function, which is influenced by the shape of the oral cavity and the position of the tongue (Rowden, 1992). There are six vowels in the Malay language: three front vowels, one central vowel and two back vowels (Farid, 1980). They are "a", "e" (unstressed "e"), "e" (stressed "e"), "i", "o" and "u" (Farid, 1980).
3.3.2 Malay Consonants
There are 26 consonants in the Malay language (Farid, 1980). The special consonants in Malay are "kh" and "sy" (Raminah and Rahim, 1987).
3.4 Phoneme Units Database

The most commonly used units in speech synthesis are probably phonemes, because they are the normal linguistic representation of speech. Using phonemes provides maximum flexibility in rule-based systems (Khor, 2007). Diphones, triphones or variable-length units can be formed from phonemes, since the phoneme is the basic unit of speech (Tan, 2009).
Unit selection requires a very large speech database. In this dissertation, a phoneme database built by the Center of Biomedical Engineering, Universiti Teknologi Malaysia, is used. The database was made by recording many sentences and cutting them into smaller units, the phonemes. Since a single sentence may contain several phonemes of the same type, the database contains various samples of each phoneme type; each phoneme may have up to a thousand samples. Each phoneme's waveform was transformed into a set of coefficients by MFCC feature extraction: the initial frame and the final frame were each transformed into 12 coefficients. In other words, each phoneme's waveform is represented by a total of 24 coefficients, as shown in Figure 3.2.
[Figure: for each phoneme M, the initial frame yields MFCC coefficients C1–C12 and the final frame yields MFCC coefficients C1'–C12'.]
Figure 3.2 A set of coefficients transformed by the MFCC algorithm.
The phoneme database used in this dissertation contains a total of 73 different phonemes. The total units after extracting the phoneme units from the carrier sentences are shown in Table 3.1. The two phonemes with the highest frequency of occurrence are "a" and "e", which contribute almost 18.36 and 8.6 percent respectively (Tan, 2009). The two phonemes with the lowest frequency of occurrence are "_z" and "iu".

Instead of keeping the phoneme speech units in their original carrier sentences, Tan (2009) grouped the same phonemes into their respective folders, as shown in Figure 3.3. This reduces the buffer or memory allocation when the system wants to access a certain phoneme from its source (Tan, 2009). For example, to form a sentence consisting of 15 phonemes, it may otherwise require 15 memory allocations, one for each origin sentence, before the phonemes can be extracted from their origin sources. This would consume a lot of memory and slow down the process of concatenation (Tan and Sheikh, 2008a). Each phoneme folder contains the wave files: for example, if phoneme folder "a" consists of 107 samples, there will be 107 wave files inside the folder "a".
Figure 3.3 Speech unit database (Tan and Sheikh, 2008b).
Table 3.1: Total units after extracting the phoneme units from the carrier sentences
(Tan and Sheikh, 2008a)
Pho   Total  %      Pho   Total  %      Pho   Total  %      Pho   Total  %
_a      107  0.64   _n      33  0.2     d      313  1.86    o      206  1.22
_ai       4  0.02   _ny      2  0.01    e     1448  8.6     p      276  1.64
_au       1  0.01   _o      21  0.12    eh     124  0.74    q        1  0.01
_b      256  1.52   _p     320  1.9     f       38  0.23    r      838  4.98
_c       29  0.17   _r      58  0.34    g      169  1       s      410  2.44
_d      269  1.6    _s     258  1.53    h      374  2.22    sy       8  0.05
_e        3  0.02   _sy      5  0.03    i      970  5.76    t      652  3.87
_eh      10  0.06   _t     178  1.06    ia      87  0.52    u      696  4.13
_f       17  0.1    _u      42  0.25    io       3  0.02    ua     107  0.64
_g       30  0.18   _v       5  0.03    iu       1  0.01    ui       2  0.01
_h       65  0.39   _w      18  0.11    j      164  0.97    v       10  0.06
_i       74  0.44   _y      59  0.35    k      665  3.95    w       72  0.43
_ia      12  0.07   _z       1  0.01    kh       7  0.04    y       52  0.31
_j       49  0.29   a     3076  18.3    l      514  3.05    z       13  0.08
_k      248  1.47   ai      97  0.58    m      492  2.92
_kh       8  0.05   au      26  0.15    n     1293  7.68
_l       72  0.43   b      253  1.5     ng     500  2.97
_m      447  2.66   c       77  0.46    ny      91  0.54
3.5 Feature Extraction

To compute the spectral distance between two phonemes, e.g. M and A, the feature vector of the final frame of phoneme M and the feature vector of the initial frame of phoneme A have to be computed. Various distance measures exist for measuring the distance between these two feature vectors. In this dissertation, MFCCs are used as the speech feature or parameterization, and the Euclidean distance is applied to measure pairs of these feature vectors. The formula for the Euclidean distance is as follows:
DEu(X, Y) = ∑_{i=1}^{n} (X_i − Y_i)^2                                  (3.1)

Figure 3.4 The GUI to extract MFCCs coefficients.
Figure 3.5 The GUI to extract MFCCs coefficients.
In Figure 3.4 and Figure 3.5, the red circle marks the phoneme and the black circle marks the number of candidates available for that particular phoneme. After clicking the “MFCC” button, the front and end MFCC coefficients of that particular phoneme are all extracted.
Figure 3.6 The 12 coefficients extracted for phoneme “_m”.
Figure 3.7 The 12 coefficients extracted for phoneme “a”.
In Figure 3.6 and Figure 3.7, the red circles mark the phoneme, the number of candidates, and whether the front or the end of the MFCCs is shown. The black circle marks the 12 coefficients.
[Diagram: phonemes M and A are each converted to MFCCs, giving feature vectors C′ and C, which are compared with the Euclidean distance.]
Figure 3.8 Distance measure and speech feature.
Figure 3.8 shows the speech feature and the distance measure used, namely MFCCs and the Euclidean distance. For example, to compute the spectral distance between phonemes “m” and “a” from a database, the feature vector of the final frame of “m” and the feature vector of the initial frame of “a” have to be obtained. Once these features are obtained, the distance measure, the Euclidean distance, is applied to this pair of feature vectors.
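This frame-to-frame comparison can be sketched in a few lines; the 12-element vectors below are made-up MFCC values for illustration, not measured coefficients.

```python
def spectral_distance(x, y):
    """Sum of squared differences between two MFCC feature vectors
    (the distance measure of Equation 3.1)."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

# final frame of "m" vs. initial frame of "a" (hypothetical coefficients)
frame_m = [1.0, 0.5, -0.2, 0.0, 0.1, 0.3, -0.4, 0.2, 0.0, 0.1, -0.1, 0.05]
frame_a = [0.8, 0.4, -0.1, 0.1, 0.0, 0.2, -0.3, 0.2, 0.1, 0.1, -0.2, 0.00]
d = spectral_distance(frame_m, frame_a)
```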
3.6 Phonetic Context
The best candidate for an input unit is the unit whose left and right phonetic contexts exactly match those of the input unit. Therefore, a “filtering” process using the phonetic context was conducted to reduce computational time and effort. This process is discussed further in the next chapter.

For example, to form the word “nasi”, the right phonetic context for “n” is “a”, and the left and right phonetic contexts for “a” are “n” and “s” respectively. Only units that match both the left and right phonetic contexts have a chance of being selected, while units that do not match are eliminated.
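A sketch of this elimination step; the candidate contexts below are invented for illustration.

```python
def filter_by_context(candidates, left, right):
    """Keep only candidates whose stored left/right phonetic contexts
    exactly match the required ones (a fully matched phonetic context)."""
    return [c for c in candidates if c["left"] == left and c["right"] == right]

# Hypothetical candidates for phoneme "a" in "nasi":
# required left context is "n", required right context is "s".
units_a = [
    {"id": 1, "left": "n", "right": "s"},   # fully matched, kept
    {"id": 2, "left": "m", "right": "s"},   # partial match, eliminated
    {"id": 3, "left": "n", "right": "t"},   # partial match, eliminated
    {"id": 4, "left": "n", "right": "s"},   # fully matched, kept
]
kept = filter_by_context(units_a, "n", "s")
```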
Figure 3.9 The candidate unit for phoneme “_n” that matched right phonetic context.
Figure 3.10 The candidate unit for phoneme “a” that matched left and right phonetic
context.
In Figure 3.9, the red circle marks the right phonetic context for phoneme “_n”, which is “a”. There are a total of 33 candidate units for phoneme “_n” in the database. After filtering by phonetic context, 22 candidate units that did not match the phonetic context were eliminated, while 11 candidate units remained. In Figure 3.10, the black circle marks the left phonetic context for phoneme “a”, which is “_n”, and the red circle marks the right phonetic context for phoneme “a”, which is “s”. There are a total of 3076 candidate units for phoneme “a” in the database. After filtering by phonetic context, 3073 candidate units that did not match the phonetic context were eliminated, while only 3 candidate units remained. The retained candidate units are used as the input for Simulated Annealing.
3.7 Unit Selection
Unit selection (Figure 3.11) reads the input from a text file and starts to search for the shortest path for the desired word through the database. The total number of sequences equals the number of candidates for phoneme i times the number of candidates for phoneme i+1, and so on up to phoneme n, where n is the total number of phonemes in the word. A heuristic method is employed since there is a huge number of combinations of sequences that can form a word; it helps to search for the minimum “path” within a reasonable time without having to go through every combination. The heuristic method (meta-heuristic) mentioned here is Simulated Annealing. After that, unit selection writes the selected units from the database into a result file. This result is used to call the waveforms from the database to concatenate and produce the sound.
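The size of this combination space is just the product of the per-phoneme candidate counts; the counts used below are the post-filtering numbers quoted for “kampung” in Section 4.7.1.

```python
from functools import reduce
from operator import mul

def total_sequences(candidate_counts):
    """Number of possible unit sequences for a word: the product of the
    number of candidates available for each of its phonemes."""
    return reduce(mul, candidate_counts, 1)

# candidate counts per phoneme after phonetic-context filtering ("kampung")
n = total_sequences([40, 2, 22, 17, 3, 36])  # 3231360 sequences
```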
Figure 3.11 Unit selection (Tan and Sheikh, 2008c).
3.8 Concatenation
To generate natural-sounding synthesized speech waveforms, one approach is to select and concatenate units from a large speech database (Hunt and Black, 1996). All the selected units are joined together according to the desired sequence of phonemes to form an utterance of a word or sentence. Figure 3.12, Figure 3.13, Figure 3.14 and Figure 3.15 show the waveforms for phonemes “_n”, “a”, “s” and “i”. These four waveforms are joined together to form the output sound for the Malay word “nasi”, as shown in Figure 3.16.
Figure 3.12 Waveform for phoneme “_n”.
Figure 3.13 Waveform for phoneme “a”.
Figure 3.14 Waveform for phoneme “s”.
Figure 3.15 Waveform for phoneme “i”.
Figure 3.16 Concatenation of the best matching units for the word “nasi”.
CHAPTER 4
SIMULATED ANNEALING
4.0 Introduction
Annealing, a physical process in statistical mechanics, is often performed in order to relax a system to the state with minimum free energy. A crystalline solid is heated by raising the temperature until the solid melts into a liquid; the temperature is then lowered slowly until the material settles into its regular crystal lattice. The particles of the solid rearrange themselves at each temperature, and the solid is capable of reaching thermal equilibrium at each temperature provided the cooling is slow enough. When the system reaches its frozen state, a low-energy (defect-free) crystalline solid is formed. However, the solid may become a glass with a non-crystalline structure, or a defective crystal with meta-stable amorphous structures, if the temperature is lowered too fast (Jeon and Kim, 2004).
Simulated annealing (SA) was developed by Kirkpatrick et al. (1983) by combining statistical mechanics with optimization principles. SA is a local search algorithm known to be one of the most efficient heuristic algorithms, particularly well suited to combinatorial optimization problems because it can escape local minima. SA has therefore been recognized as a powerful tool for solving a large number of optimization problems in a variety of application areas (Koulamas et al., 1994).
SA uses a stochastic approach to direct the search (Hasan et al., 2000). The operation used to obtain the neighborhood of a solution is called a move. The search is allowed to move to a neighboring state even if the move makes the value of the objective function worse. The random design of the simulated annealing method changes with a probabilistic acceptance criterion during the search (Pantelides and Tzan, 2000). This unique characteristic enables the search process to avoid being trapped in local minima, since non-improving moves may also be accepted as the current solution. Moves that increase the value of the objective function are accepted with a certain probability, establishing simulated annealing as a global optimization method (Pantelides and Tzan, 2000).
Simulated annealing simulates this annealing process: the cost function corresponds to the energy function, and a configuration in optimization corresponds to a state in statistical physics. The aim of SA is to minimize the cost function. SA has a control parameter, the temperature, governed by an annealing schedule, and each temperature can have one or more iterations. Simulated annealing converges to the globally optimal solution under certain conditions, which concern the way neighborhood solutions are generated and the cooling schedule (Liu, 1999). Simulated annealing converges to the globally optimal solution provided the computation time is long enough (Liu, 1999); however, the excessive computation time required to obtain a near-optimal solution is often unrealistic. In the application of unit selection in a Malay Text-to-Speech system, it is essential to get a good solution within a reasonable time. Instead of allowing SA to consume much of its time searching for the minimum solution, it is more practical to put effort into tuning the parameters, since the performance of SA depends very much on their selection.
Since SA makes little use of memory, it cannot efficiently perform intensification and diversification mechanisms (Jeon and Kim, 2004). It is difficult to apply intensification and diversification simultaneously in simulated annealing without memory structures and histories, because the current solution is affected only by the previous solution (Metropolis criterion). The diversification mechanism operates mainly at high temperatures, where most solutions are accepted (random walks). The intensification mechanism operates mainly as the temperature decreases, because only slightly perturbed solutions are then accepted.
4.1 Procedure of Simulated Annealing
First, the initial configuration, including the initial temperature and the annealing schedule, needs to be determined. After the initial temperature is chosen, an initial solution is generated randomly or by applying a simple construction procedure. The temperature is set at a high level so that almost all moves are accepted initially. Then, the value of the cost function for the initial solution is calculated and immediately accepted as the current solution. Next, a new solution is generated from the neighborhood of the current solution by applying a move. The value of the cost function for the new solution is calculated and compared with the current value. If it is better than the current value, the new solution is accepted as the current solution. Otherwise, the new solution is accepted as the current solution only when the Metropolis criterion, which is based on Boltzmann’s probability, is met. The process then continues from the new current solution, while the temperature is lowered according to the annealing schedule until almost no moves are accepted.
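The procedure above can be sketched as a generic loop. This is an outline only: the cost and move functions are placeholders, and the geometric cooling and Metropolis acceptance follow the schedule and criterion described later in this chapter.

```python
import math
import random

def simulated_annealing(initial, cost, move, t0=1000.0, tf=0.1, alpha=0.9):
    """Generic SA loop: improving moves are always accepted; worsening
    moves are accepted with probability exp(-dE/T) (Metropolis criterion)."""
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    t = t0
    while t > tf:
        candidate = move(current)
        delta = cost(candidate) - current_cost
        if delta < 0 or random.random() < math.exp(-delta / t):
            current, current_cost = candidate, current_cost + delta
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        t *= alpha  # lower the temperature (geometric schedule)
    return best, best_cost
```

For unit selection, `initial` would be a list of candidate indices, `cost` the sum of local concatenation costs, and `move` one of the four swap moves described in Section 4.4.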
[Flow diagram: Start → initialize a sequence at the initial temperature → swap a phoneme using the move to obtain the neighbourhood solution → evaluate the neighbourhood solution → accept or reject the neighbourhood solution based on the Metropolis criterion → if the stopping criterion is not met, increment the counter, decrease the temperature according to the annealing schedule, and repeat → End]
Figure 4.1 SA flow diagram to find best speech unit sequence.
4.2 Initial Solution
An initial solution is the starting solution used in the search process, and it may simply be a random solution (Ghazanfari et al., 2007). In this dissertation, the initial solution is fixed: the first candidate of each phoneme is chosen as the initial solution.

For example, to form the Malay word “masin”, 5 different phonemes are required: “_m”, “a”, “s”, “i” and “n”.

Example:
_m[1]  a[1]  s[1]  i[1]  n[1]
The numerical value in brackets “[ ]” represents the candidate number for the corresponding phoneme. The objective function for cost minimization in unit selection is represented by
u* = arg min_{u∈U} ∑_{i=1}^{N} C_c(u_i, u_{i−1})                       (4.1)
where u = u_1, u_2, …, u_N are the units in the inventory U that minimize the concatenation cost in Equation (4.1). Obtaining the local concatenation cost value requires a parameterization of the units and a distance measure. The concatenation cost between units u_i and u_{i−1} can be written as
C_c(u_i, u_{i−1}) = ∑_{j=1}^{12} (u_{i,j} − u_{i−1,j})^2

where the index j runs over the 12 MFCC coefficients of the units.

4.3 The Cooling Schedule
The choice of a cooling schedule has an important effect on the performance of the SA algorithm (Ali et al., 2002). For this reason, modifications and improvements have been tried by tuning the parameters (cooling rate) for a better quality or time tradeoff. The annealing schedule must be specified in any implementation of SA. The value of the temperature parameter, T, varies from a relatively large value to a small value close to zero. These temperature values are controlled by a cooling schedule that specifies the initial temperature and the decreasing temperature values at each stage of the algorithm. The following geometric function has been taken as the temperature reduction function:
T_{k+1} = α T_k,   k = 0, 1, 2, 3, …,   0 < α < 1
where T_k is the temperature at stage k and α is the temperature reduction rate. In this dissertation, several temperature reduction rates were used: 0.80, 0.85, 0.90 and 0.95. Figure 4.2 and Figure 4.3 show the temperature reduction pattern for these rates with Markov chain length equal to one and two respectively.
The initial temperature T_0 is set relatively high so that most moves are accepted in the early stages and there is little chance of the algorithm intensifying into the region of a local minimum. The initial and final temperatures in unit selection (Chen and Su, 2002) are set according to Equation (4.2) and Equation (4.3) respectively:
T_i = −1 / ln(P_i)                                                     (4.2)

T_f = −1 / ln(P_f)                                                     (4.3)
where P_i is the desired initial probability and P_f is the desired final probability. The parameter values in unit selection are given as follows:
- initial temperature, T_0 = −1 / ln(0.999) = 999.499 ≈ 1000
- final temperature, T_f = −1 / ln(0.00001) = 0.0869 ≈ 0.1
- temperature reduction rate, α = 0.80, 0.85, 0.90, 0.95
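These settings follow directly from Equations (4.2) and (4.3); a quick check of the quoted values:

```python
import math

def temperature_from_probability(p):
    """T = -1 / ln(p): the temperature at which a unit cost increase is
    accepted with probability p (Equations 4.2 and 4.3)."""
    return -1.0 / math.log(p)

t0 = temperature_from_probability(0.999)    # 999.499..., rounded to 1000
tf = temperature_from_probability(0.00001)  # 0.0869..., rounded to 0.1
```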
Figure 4.2 Temperature reduction pattern for various reduction rate with Markov
chain length 1.
Figure 4.3 Temperature reduction pattern for various reduction rate with Markov
chain length 2.
4.3.1 Markov Chain
The length of the Markov chain determines how many trials are used at each value of T. Two Markov chain lengths were used for unit selection in this dissertation. With the first, the temperature was reduced according to the annealing schedule at every iteration; with the second, the temperature was reduced according to the annealing schedule after every two successive iterations.
Table 4.1 Maximum number of iterations for Markov chain lengths 1 and 2 to reach a final temperature greater than 0.1.

Temperature reduction rate   Markov chain length = 1   Markov chain length = 2
0.80                          42                        84
0.85                          57                        114
0.90                          88                        176
0.95                         180                        360
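The entries of Table 4.1 can be reproduced from the schedule parameters: the number of temperature reductions needed for 1000·α^k to fall to 0.1 is ⌈ln(0.1/1000)/ln α⌉, multiplied by the Markov chain length.

```python
import math

def iterations_to_final(t0, tf, alpha, chain_length=1):
    """Iterations spent before the geometric schedule T_{k+1} = alpha*T_k
    brings the temperature from t0 down to tf, with chain_length
    iterations at each temperature value."""
    reductions = math.ceil(math.log(tf / t0) / math.log(alpha))
    return reductions * chain_length

table = {a: (iterations_to_final(1000, 0.1, a),
             iterations_to_final(1000, 0.1, a, 2))
         for a in (0.80, 0.85, 0.90, 0.95)}
# table == {0.80: (42, 84), 0.85: (57, 114), 0.90: (88, 176), 0.95: (180, 360)}
```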
4.4 Neighbourhood Generation Mechanism
The effect of the neighbourhood structure in SA was studied by Cheh et al. (1991), who found that a small neighbourhood is better than a larger one for a number of problems. Different neighbourhood sizes for the travelling salesman problem (TSP) were tested by Goldstein and Waterman (1988), who found that the best neighbourhood size is related to the problem size. Yao (1991) found that a larger neighbourhood is better if the initial solution is far from the optimal solution, and Yao (1993) showed that the performance of a dynamic neighbourhood size on the TSP was significantly better than standard SA with a fixed neighbourhood.
At each move, the generation scheme used to obtain a new configuration is crucial to both the quality of the solution and the speed of the algorithm. Four different moves are used in this dissertation:

Move 1: Randomly swap each of the phonemes, one at a time.
Move 2: Swap the next phoneme of the greatest local cost.
Move 3: Swap the previous phoneme of the greatest local cost.
Move 4: Let the greatest local cost be at position i, 1 ≤ i ≤ n. Swap the next phoneme when the local cost at i+1 is greater than that at i−1, and swap the previous phoneme when the local cost at i+1 is less than that at i−1. Swap the first phoneme when the greatest local cost is at i = 1, and the last phoneme when it is at i = n.
4.4.1 Move 1

Move 1: Randomly swap each of the phonemes, one at a time.
Example:
Initial Solution:

phoneme1[1]   phoneme2[1]   phoneme3[1]   phoneme4[1]   phoneme5[1]
     Local Cost(i)   Local Cost(i+1)   Local Cost(i+2)   Local Cost(i+3)

The total cost is computed as ∑_{i=1}^{n} Local Cost[i], where n is the total number of Local Costs.
Iteration 1:
Apply the move to obtain neighborhood solution.
Neighborhood solution:

phoneme1[1]   phoneme2[1]   phoneme3[7]   phoneme4[1]   phoneme5[1]
     Local Cost(i)   Local Cost(i+1)   Local Cost(i+2)   Local Cost(i+3)
72
Phoneme 3 is chosen randomly to be swapped, and its candidate number is changed randomly from 1 to 7. When phoneme 3 is changed, Local Cost (i+1) and Local Cost (i+2) change, while Local Cost (i) and Local Cost (i+3) remain unchanged. If this neighborhood solution is accepted as the current solution, another neighborhood solution is generated based on this newly accepted current solution.
Iteration 2:
Neighborhood solution:

phoneme1[9]   phoneme2[1]   phoneme3[7]   phoneme4[1]   phoneme5[1]
     Local Cost(i)   Local Cost(i+1)   Local Cost(i+2)   Local Cost(i+3)
Phoneme 1 is chosen randomly to be swapped, and its candidate number is changed randomly from 1 to 9. When phoneme 1 is changed, only Local Cost (i) changes, while the other Local Costs remain unchanged. If this neighborhood solution is accepted as the current solution, another neighborhood solution is generated based on this newly accepted current solution.
Iteration 3:
Neighborhood solution:

phoneme1[9]   phoneme2[1]   phoneme3[7]   phoneme4[1]   phoneme5[14]
     Local Cost(i)   Local Cost(i+1)   Local Cost(i+2)   Local Cost(i+3)
Phoneme 5 is chosen randomly to be swapped, and its candidate number is changed randomly from 1 to 14. When phoneme 5 is changed, only Local Cost (i+3) changes, while the other Local Costs remain unchanged. If this neighborhood solution is accepted as the current solution, another neighborhood solution is generated based on this newly accepted current solution.
73
Iteration 4:
Neighborhood solution:

phoneme1[9]   phoneme2[26]   phoneme3[7]   phoneme4[1]   phoneme5[14]
     Local Cost(i)   Local Cost(i+1)   Local Cost(i+2)   Local Cost(i+3)
Phoneme 2 is chosen randomly to be swapped, and its candidate number is changed randomly from 1 to 26. When phoneme 2 is changed, Local Cost (i) and Local Cost (i+1) change, while Local Cost (i+2) and Local Cost (i+3) remain unchanged. If this neighborhood solution is accepted as the current solution, another neighborhood solution is generated based on this newly accepted current solution.
Iteration 5:
Neighborhood solution:

phoneme1[9]   phoneme2[26]   phoneme3[7]   phoneme4[8]   phoneme5[14]
     Local Cost(i)   Local Cost(i+1)   Local Cost(i+2)   Local Cost(i+3)
Phoneme 4 is chosen randomly to be swapped, and its candidate number is changed randomly from 1 to 8. When phoneme 4 is changed, Local Cost (i+2) and Local Cost (i+3) change, while Local Cost (i) and Local Cost (i+1) remain unchanged. If this neighborhood solution is rejected as the current solution, the search goes back to the previous iteration and generates another neighborhood solution based on that iteration’s current solution.
4.4.2 Move 2

Move 2: Swap the next phoneme of the greatest local cost.
Since the task assigned to Simulated Annealing is minimization, the greatest Local Cost should have the highest priority to be changed. Therefore, in this case, the value of each local cost is used as guidance to decide which phoneme needs to be swapped: the move swaps the phoneme that follows the greatest local cost.
Example:
Initial Solution:

phoneme1[1]   phoneme2[1]   phoneme3[1]   phoneme4[1]   phoneme5[1]
     Local Cost(i)   Local Cost(i+1)   Local Cost(i+2)   Local Cost(i+3)

The total cost is computed as ∑_{i=1}^{n} Local Cost[i], where n is the total number of Local Costs.
Iteration 1:
Apply the move to obtain neighborhood solution.
For Local Cost (i) > Local Cost (i+1) > Local Cost (i+2) > Local Cost (i+3), the greatest distance is from phoneme 1 to phoneme 2, so we can swap the next phoneme of the greatest local cost, the previous phoneme, or both, so that the distance between them changes. In this case, the move swaps the next phoneme of the greatest local cost. Phoneme 2 is chosen to be swapped since Local Cost (i) has the greatest magnitude, and its candidate number is changed randomly from 1 to 9. When phoneme 2 is changed, Local Cost (i) and Local Cost (i+1) change, while Local Cost (i+2) and Local Cost (i+3) remain unchanged. If this neighborhood solution is accepted as the current solution, another neighborhood solution is generated based on this newly accepted current solution.
Neighborhood solution:

phoneme1[1]   phoneme2[9]   phoneme3[1]   phoneme4[1]   phoneme5[1]
     Local Cost(i)   Local Cost(i+1)   Local Cost(i+2)   Local Cost(i+3)
75
Iteration 2:
For Local Cost (i+2) > Local Cost (i) > Local Cost (i+1) > Local Cost (i+3),
since the next phoneme for Local Cost (i+2) is phoneme 4, therefore phoneme 4
needs to be swap. The candidate’s number is changed randomly from 1 to 21. When
phoneme 4 is changed, Local Cost (i+2) and Local Cost (i+3) will be changed while
Local Cost (i) and Local Cost (i+1) remain unchanged. If this neighborhood solution
is accepted as current solution, then generate another neighborhood solution based on
this newly accepted current solution.
Neighborhood solution:

phoneme1[1]   phoneme2[9]   phoneme3[1]   phoneme4[21]   phoneme5[1]
     Local Cost(i)   Local Cost(i+1)   Local Cost(i+2)   Local Cost(i+3)
Iteration 3:
For Local Cost (i+1) > Local Cost (i) > Local Cost (i+2) > Local Cost (i+3),
since the next phoneme for Local Cost (i+1) is phoneme 3, therefore phoneme 3
needs to be swap. The candidate’s number is changed randomly from 1 to 42. When
phoneme 3 is changed, Local Cost (i+1) and Local Cost (i+2) will be changed while
Local Cost (i) and Local Cost (i+3) remain unchanged. If this neighborhood solution
is accepted as current solution, then generate another neighborhood solution based on
this newly accepted current solution.
Neighborhood solution:

phoneme1[1]   phoneme2[9]   phoneme3[42]   phoneme4[21]   phoneme5[1]
     Local Cost(i)   Local Cost(i+1)   Local Cost(i+2)   Local Cost(i+3)
Iteration 4:
For Local Cost (i+3) > Local Cost (i) > Local Cost (i+2) > Local Cost (i+1),
since the next phoneme for Local Cost (i+3) is phoneme 5, therefore phoneme 5
needs to be swap. The candidate’s number is changed randomly from 1 to 7. When
phoneme 5 is changed, only Local Cost (i+3) will be changed while others remain
unchanged. If this neighborhood solution is accepted as current solution, then
76
generate another neighborhood solution based on this newly accepted current
solution.
Neighborhood solution:

phoneme1[1]   phoneme2[9]   phoneme3[42]   phoneme4[21]   phoneme5[7]
     Local Cost(i)   Local Cost(i+1)   Local Cost(i+2)   Local Cost(i+3)

4.4.3 Move 3

Move 3: Swap the previous phoneme of the greatest local cost.
In this case, the concept of choosing which phoneme to swap is similar to Move 2; the difference is that the phoneme preceding the greatest Local Cost is chosen to be swapped.
Example:
Initial Solution:

phoneme1[1]   phoneme2[1]   phoneme3[1]   phoneme4[1]   phoneme5[1]
     Local Cost(i)   Local Cost(i+1)   Local Cost(i+2)   Local Cost(i+3)

The total cost is computed as ∑_{i=1}^{n} Local Cost[i], where n is the total number of Local Costs.
Iteration 1:
Apply the move to obtain neighborhood solution.
For Local Cost (i) > Local Cost (i+1) > Local Cost (i+2) > Local Cost (i+3), the previous phoneme for Local Cost (i) is phoneme 1, so phoneme 1 needs to be swapped; its candidate number is changed randomly from 1 to 8. When phoneme 1 is changed, only Local Cost (i) changes, while the others remain unchanged. If this neighborhood solution is accepted as the current solution, another neighborhood solution is generated based on this newly accepted current solution.
Neighborhood solution:

phoneme1[8]   phoneme2[1]   phoneme3[1]   phoneme4[1]   phoneme5[1]
     Local Cost(i)   Local Cost(i+1)   Local Cost(i+2)   Local Cost(i+3)
Iteration 2:
For Local Cost (i+2) > Local Cost (i) > Local Cost (i+1) > Local Cost (i+3),
since the previous phoneme for Local Cost (i+2) is phoneme 3, therefore phoneme 3
needs to be swap. The candidate’s number is changed randomly from 1 to 7. When
phoneme 3 is changed, Local Cost (i+1) and Local Cost (i+2) will be changed while
Local Cost (i) and Local Cost (i+3) remain unchanged. If this neighborhood solution
is accepted as current solution, then generate another neighborhood solution based on
this newly accepted current solution.
Neighborhood solution:

phoneme1[8]   phoneme2[1]   phoneme3[7]   phoneme4[1]   phoneme5[1]
     Local Cost(i)   Local Cost(i+1)   Local Cost(i+2)   Local Cost(i+3)
Iteration 3:
For Local Cost (i+1) > Local Cost (i) > Local Cost (i+2) > Local Cost (i+3),
since the previous phoneme for Local Cost (i+1) is phoneme 2, therefore phoneme 2
needs to be swap. The candidate’s number is changed randomly from 1 to 3. When
phoneme 2 is changed, Local Cost (i) and Local Cost (i+1) will be changed while
Local Cost (i+2) and Local Cost (i+3) remain unchanged. If this neighborhood
solution is accepted as current solution, then generate another neighborhood solution
based on this newly accepted current solution.
78
Neighborhood solution:

phoneme1[8]   phoneme2[3]   phoneme3[7]   phoneme4[1]   phoneme5[1]
     Local Cost(i)   Local Cost(i+1)   Local Cost(i+2)   Local Cost(i+3)

4.4.4 Move 4
Move 4: Let the greatest local cost be at position i, 1 ≤ i ≤ n. Swap the next phoneme when the local cost at i+1 is greater than that at i−1, and swap the previous phoneme when the local cost at i+1 is less than that at i−1. Swap the first phoneme when the greatest local cost is at i = 1, and the last phoneme when it is at i = n.
Moves 2 and 3 were combined to form Move 4. In Move 4, whether to swap the previous or the next phoneme is determined by the magnitudes of the local costs on either side of the greatest Local Cost. The purpose of this move is to speed up the algorithm towards the minimum solution. When a phoneme other than the first or the last is swapped, two Local Costs change; the decision of whether to swap the previous or the next phoneme determines which two. With Move 4, the pair of Local Costs with the higher magnitude is selected to be changed.
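Move 4's choice of which phoneme to swap can be sketched as follows, using 0-based indices where `local_costs[k]` is the cost between phonemes k and k+1; this is an illustrative reading of the rule, not code from the dissertation.

```python
def move4_target(local_costs):
    """Return the 0-based index of the phoneme Move 4 swaps.
    local_costs[k] is the local cost between phonemes k and k+1."""
    n = len(local_costs)
    i = max(range(n), key=lambda k: local_costs[k])  # greatest local cost
    if i == 0:
        return 0              # greatest cost at the start: swap first phoneme
    if i == n - 1:
        return n              # greatest cost at the end: swap last phoneme
    if local_costs[i + 1] > local_costs[i - 1]:
        return i + 1          # right neighbour larger: swap next phoneme
    return i                  # left neighbour larger: swap previous phoneme
```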
Example:

Initial Solution:

phoneme1[1]   phoneme2[1]   phoneme3[1]   phoneme4[1]   phoneme5[1]
     Local Cost(i)   Local Cost(i+1)   Local Cost(i+2)   Local Cost(i+3)

The total cost is computed as ∑_{i=1}^{n} Local Cost[i], where n is the total number of Local Costs.
Iteration 1:
Apply the move to obtain neighborhood solution.
For Local Cost (i+1) > Local Cost (i) > Local Cost (i+2) > Local Cost (i+3), Local Cost (i+1) has the greatest magnitude, so whether to swap phoneme 2 or phoneme 3 is determined by the magnitudes of Local Cost (i) and Local Cost (i+2). Since Local Cost (i) > Local Cost (i+2), Local Cost (i) is given the higher priority to be changed, and phoneme 2 (the previous phoneme) is swapped. Local Cost (i) and Local Cost (i+1) change when phoneme 2 is swapped, and the candidate number for phoneme 2 is changed randomly from 1 to 5.
Neighborhood solution:

phoneme1[1]   phoneme2[5]   phoneme3[1]   phoneme4[1]   phoneme5[1]
     Local Cost(i)   Local Cost(i+1)   Local Cost(i+2)   Local Cost(i+3)
Iteration 2:
For Local Cost (i+2) > Local Cost (i+3) > Local Cost (i) > Local Cost (i+1),
since Local Cost (i+3) > Local Cost (i+1), therefore phoneme 4 is swapped. The
Local Cost (i+2) and Local Cost (i+3) are changed when phoneme 4 is swapped. The
candidate’s number for phoneme 4 is changed randomly from 1 to 13.
Neighborhood solution:

phoneme1[1]   phoneme2[5]   phoneme3[1]   phoneme4[13]   phoneme5[1]
     Local Cost(i)   Local Cost(i+1)   Local Cost(i+2)   Local Cost(i+3)
Iteration 3:
For Local Cost (i) > Local Cost (i+1) > Local Cost (i+2) > Local Cost (i+3),
phoneme 1 is swapped. The Local Cost (i) is changed when phoneme 1 is swapped.
The candidate’s number for phoneme 1 is changed randomly from 1 to 11.
Neighborhood solution:

phoneme1[11]   phoneme2[5]   phoneme3[1]   phoneme4[13]   phoneme5[1]
     Local Cost(i)   Local Cost(i+1)   Local Cost(i+2)   Local Cost(i+3)
Iteration 4:
For Local Cost (i+3) > Local Cost (i+1) > Local Cost (i+2) > Local Cost (i),
phoneme 5 is swapped. The Local Cost (i+3) is changed when phoneme 3 is
swapped. The candidate’s number for phoneme 3 is changed randomly from 1 to 32.
Neighborhood solution:

phoneme1[11]   phoneme2[5]   phoneme3[1]   phoneme4[13]   phoneme5[32]
     Local Cost(i)   Local Cost(i+1)   Local Cost(i+2)   Local Cost(i+3)

4.5 Metropolis Criterion
According to the Metropolis criterion (Figure 4.4), a random number λ in [0, 1] is generated from a uniform distribution when the difference ΔE between the cost function values of the newly generated and current solutions is greater than or equal to zero. The newly generated solution is accepted as the current solution if the condition in Equation (4.4) is met:
P(ΔE) = exp(−ΔE / T)                                                   (4.4)
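The criterion can be sketched directly from Equation (4.4); the `rng` argument is a hypothetical hook so the random draw can be fixed for testing.

```python
import math
import random

def metropolis_accept(cost_new, cost_prev, temperature, rng=random.random):
    """Accept an improving solution outright; accept a worsening one
    only when exp(-(cost_new - cost_prev) / T) exceeds a uniform draw."""
    if cost_new < cost_prev:
        return True
    prob = math.exp(-(cost_new - cost_prev) / temperature)
    return prob > rng()
```

At high temperatures the acceptance probability is close to one, so almost any move passes; as T falls, worsening moves are rejected almost surely.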
[Flow diagram: if C_new < C_prev, the new solution is accepted as the current solution; otherwise prob = exp((C_prev − C_new) / T) is computed and the new solution is accepted only if prob > λ, where λ is a random number from 0 to 1; otherwise it is rejected.]
Figure 4.4 Metropolis criterion.
4.6 Stopping Criteria
In this dissertation, the stopping criterion is based on three conditions:

- The maximum number of iterations: terminate when the total number of iterations N > ε, where ε is a user-determined value.
- If the neighborhood solution has not improved after β iterations, where β is a user-determined value, terminate the algorithm.
- If the newly generated temperature T_k is less than the final temperature ω, that is, T_k < ω, where ω is a user-determined value, terminate the algorithm.

The user-determined values ε, β and ω may influence the performance of SA when they are changed.
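The three conditions combine into a single check; ε, β and ω appear here as explicit parameters.

```python
def should_stop(iteration, no_improve_count, temperature,
                epsilon, beta, omega):
    """Stop when N > epsilon, when beta iterations have passed without
    improvement, or when the temperature has fallen below omega."""
    return (iteration > epsilon
            or no_improve_count >= beta
            or temperature < omega)
```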
4.7 Unit Selection
Unit selection reads the input text and starts to search for the most suitable units from the database. The process involves two stages: the first stage filters the candidate units by phonetic context, and the second selects the unit sequence that yields the smallest sum of local costs. The target word utterance is formed by concatenating the waveforms of the phonemes according to the unit sequence with the smallest sum of local costs.
4.7.1 Phonetic context
According to Fek et al. (2006), the target cost is assigned a null cost if the phonetic context is fully matched. The phonetic context is used as a target cost in this dissertation; the computation of this target cost is not included here, but the phonetic context is used as a “filtering tool” to narrow down the search space. Since the best candidate for an input unit is the unit that fully matches its phonetic context, a “filtering” process using phonetic context was conducted to reduce computational time and effort.
During the “filtering” process, only the candidates whose phonetic contexts are fully matched were retained. A fully matched phonetic context means that the left and right phonetic contexts of the input unit and of the candidate are exactly matched. For example, to form the word “saya”, the left phonetic context for “a” is “s” and the right phonetic context for “a” is “y”. If this combination of left and right phonetic contexts for “a” is found in the database, it is called a fully matched phonetic context. If the left phonetic context for “a” is other than “s” while the right phonetic context is “y”, it is called a partially matched phonetic context, since only the right phonetic context matches; likewise, if the left phonetic context for “a” is “s” while the right phonetic context is other than “y”, it is also a partially matched phonetic context, since only the left phonetic context matches. If the left phonetic context for “a” is other than “s” and the right phonetic context is other than “y”, it is called a non-matched phonetic context.
In this dissertation, the candidates whose phonetic contexts are fully matched after the "filtering" process were used as the input for Simulated Annealing, so the search region for Simulated Annealing is significantly reduced. Based on the finding of Fek et al. (2006) that the best candidate unit is the one whose phonetic context matches fully, it is appropriate to conduct the "filtering" process before performing Simulated Annealing, so that the SA algorithm excludes units that only partially match or do not match the phonetic context.
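The matching rules above can be sketched as a small filtering routine. The candidate database layout (a dict of unit records with "left"/"right" context fields) is an assumption for illustration, not the dissertation's implementation; the first and last phonemes match one side only, as discussed in Chapter 6.

```python
def filter_candidates(target, database):
    """Keep only candidates whose left and right phonetic contexts fully
    match the target unit; first/last phonemes are matched on one side only."""
    n = len(target)
    retained = []
    for i, phoneme in enumerate(target):
        left = target[i - 1] if i > 0 else None       # no left context for first phoneme
        right = target[i + 1] if i < n - 1 else None  # no right context for last phoneme
        matches = [
            unit for unit in database.get(phoneme, [])
            if (left is None or unit["left"] == left)
            and (right is None or unit["right"] == right)
        ]
        retained.append(matches)
    return retained

# Toy database for the word "saya": candidates for "a" with varying contexts.
db = {
    "a": [{"left": "s", "right": "y", "id": 1},   # fully matched
          {"left": "k", "right": "y", "id": 2},   # partially matched (right only)
          {"left": "s", "right": "m", "id": 3}],  # partially matched (left only)
    "s": [{"left": None, "right": "a", "id": 4}],
    "y": [{"left": "a", "right": "a", "id": 5}],
}
result = filter_candidates(["s", "a", "y", "a"], db)  # only unit 1 survives for the first "a"
```

Only the fully matched candidate for the first "a" is retained, which is exactly the pool that is handed to Simulated Annealing.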
For example, the Malay word "kampung" requires six phonemes: "_k", "a", "m", "p", "u" and "ng". The total numbers of candidates for these phonemes are 248, 3076, 492, 276, 696 and 500 respectively, as shown in Figure 4.5. Therefore, before the phonetic context "filtering" process, the total number of possible sequences is 248 × 3076 × 492 × 276 × 696 × 500 ≈ 3.605 × 10^16. Figure 4.6 shows the feasible search region after filtering using partially matched phonetic context (2.339 × 10^10 possible sequences). After the phonetic context "filtering" process using fully matched phonetic context, the total number of possible sequences is 40 × 2 × 22 × 17 × 3 × 36 = 3,231,360 ≈ 3.23 × 10^6. The percentages of reduction in the total number of candidates for phonemes "_k", "a", "m", "p", "u" and "ng" are 83.87%, 99.93%, 95.53%, 93.84%, 99.57% and 92.80% respectively, and the percentage of reduction in the total number of possible sequences after the phonetic context "filtering" process is almost 100%.
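The products and reduction percentages above can be verified with a few lines; the candidate counts are those reported for "kampung" in Figures 4.5 and 4.7.

```python
import math

# Candidate counts for "kampung" (_k, a, m, p, u, ng).
before = [248, 3076, 492, 276, 696, 500]  # before filtering
after = [40, 2, 22, 17, 3, 36]            # after fully matched filtering

seq_before = math.prod(before)            # ~3.605e16 possible sequences
seq_after = math.prod(after)              # 3,231,360 possible sequences

# Per-phoneme reduction percentages, e.g. 83.87% for "_k".
reduction = [100.0 * (b - a) / b for b, a in zip(before, after)]
# Overall reduction in the number of possible sequences: effectively 100%.
overall = 100.0 * (1 - seq_after / seq_before)
```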
Figure 4.5 The feasible search region to form the Malay word "kampung" before filtering using phonetic context.
Figure 4.6 The feasible search region to form the Malay word "kampung" after filtering using partially matched phonetic context (left phonetic context).
Figure 4.7 The feasible search region to form the Malay word "kampung" after filtering using fully matched phonetic context (left and right phonetic context).
After the "filtering" process, the total number of candidates for each individual phoneme is significantly reduced. These new totals become the feasible search region for Simulated Annealing (Figure 4.7). Before performing the SA algorithm, the total number of candidates for each individual phoneme needs to be determined, to ensure that the algorithm does not step into the infeasible region.
Table 4.2 The information of the 10 words before filtering using phonetic context.

Words            Phonemes involved                      Total number of candidates per phoneme                             Possible number of sequences
Category 1 (4 ≤ x ≤ 6), x = number of phonemes
nasi             _n, a, s, i                            33, 3076, 410, 970                                                 4.037 × 10^10
musim            _m, u, s, i, m                         447, 696, 410, 970, 492                                            6.087 × 10^13
janji            _j, a, n, j, i                         49, 3076, 1293, 164, 970                                           3.10 × 10^13
kampung          _k, a, m, p, u, ng                     248, 3076, 492, 276, 696, 500                                      3.605 × 10^16
Category 2 (7 ≤ x ≤ 9)
vitamin          _v, i, t, a, m, i, n                   5, 970, 652, 3076, 492, 970, 1293                                  6.002 × 10^18
demikian         _d, e, m, i, k, ia, n                  269, 1448, 492, 970, 665, 87, 1293                                 1.391 × 10^19
muktamad         _m, u, k, t, a, m, a, d                447, 696, 665, 652, 3076, 492, 3076, 313                           1.9655 × 10^23
informasi        _i, n, f, o, r, m, a, s, i             74, 1293, 38, 206, 838, 492, 3076, 410, 970                        3.778 × 10^23
Category 3 (x ≥ 10)
selanjutnya      _s, e, l, a, n, j, u, t, ny, a         258, 1448, 514, 3076, 1293, 164, 696, 652, 91, 3076                1.591 × 10^28
berpengetahuan   _b, e, r, p, e, ng, e, t, a, h, ua, n  256, 1448, 838, 276, 1448, 500, 1448, 652, 3076, 374, 107, 1293    9.327 × 10^33
Table 4.3 The information of the 10 words after filtering using partially matched phonetic context (left phonetic context).

Words            Phonemes involved                      Total number of candidates per phoneme                             Possible number of sequences
Category 1 (4 ≤ x ≤ 6), x = number of phonemes
nasi             _n, a, s, i                            11, 11, 129, 74                                                    1,155,066
musim            _m, u, s, i, m                         16, 16, 59, 74, 41                                                 45,825,536
janji            _j, a, n, j, i                         20, 20, 755, 50, 13                                                196,300,000
kampung          _k, a, m, p, u, ng                     40, 40, 137, 76, 39, 36                                            2.339 × 10^10
Category 2 (7 ≤ x ≤ 9)
vitamin          _v, i, t, a, m, i, n                   3, 3, 65, 199, 137, 38, 74                                         4.4848 × 10^10
demikian         _d, e, m, i, k, ia, n                  24, 24, 190, 38, 87, 3, 29                                         3.1477 × 10^10
muktamad         _m, u, k, t, a, m, a, d                16, 16, 87, 20, 199, 137, 130, 89                                  1.4051 × 10^14
informasi        _i, n, f, o, r, m, a, s, i             19, 19, 3, 4, 32, 19, 130, 129, 74                                 3.2686 × 10^12
Category 3 (x ≥ 10)
selanjutnya      _s, e, l, a, n, j, u, t, ny, a         138, 138, 111, 227, 755, 50, 29, 61, 1, 64                         2.0508 × 10^18
berpengetahuan   _b, e, r, p, e, ng, e, t, a, h, ua, n  141, 141, 410, 12, 30, 138, 24, 36, 199, 200, 1, 28                3.899 × 10^20
Table 4.4 The information of the 10 words after filtering using fully matched phonetic context (left and right phonetic context).

Words            Phonemes involved                      Total number of candidates per phoneme                             Possible number of sequences
Category 1 (4 ≤ x ≤ 6), x = number of phonemes
nasi             _n, a, s, i                            11, 3, 25, 74                                                      61,050
musim            _m, u, s, i, m                         16, 6, 7, 2, 41                                                    55,104
janji            _j, a, n, j, i                         20, 2, 12, 2, 13                                                   12,480
kampung          _k, a, m, p, u, ng                     40, 2, 22, 17, 3, 36                                               3,231,360
Category 2 (7 ≤ x ≤ 9)
vitamin          _v, i, t, a, m, i, n                   3, 1, 26, 10, 10, 11, 74                                           6,349,200
demikian         _d, e, m, i, k, ia, n                  24, 2, 14, 5, 2, 3, 29                                             584,640
muktamad         _m, u, k, t, a, m, a, d                16, 1, 7, 3, 10, 40, 4, 89                                         47,846,400
informasi        _i, n, f, o, r, m, a, s, i             19, 2, 1, 2, 5, 6, 11, 25, 74                                      46,398,000
Category 3 (x ≥ 10)
selanjutnya      _s, e, l, a, n, j, u, t, ny, a         138, 13, 58, 44, 12, 10, 5, 1, 1, 64                               1.7581 × 10^11
berpengetahuan   _b, e, r, p, e, ng, e, t, a, h, ua, n  141, 109, 11, 3, 3, 24, 3, 16, 20, 1, 1, 28                        9.8157 × 10^11
4.7.2 Concatenation Cost

The performance of each move used in this dissertation was tested on the 10 different words. The same conditions were set for all the moves: the length of the Markov chain is 1, the temperature reduction rate is α = 0.90, and the initial solution is fixed. The conditions also included the common stopping criteria, namely a final temperature T_f ≤ 0.1, a maximum of 100 iterations and a maximum of 50 non-improving moves. When the move with the best performance was found, its performance was investigated further by varying the cooling schedule and the length of the Markov chain. Each word was run 10 times under the same conditions. The mean, variance and standard deviation of the concatenation cost were calculated, and the performance of the moves was evaluated based on the sum of the mean values for the 10 words.
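The experimental conditions above (geometric cooling by a factor α, a fixed Markov chain length, and the three stopping criteria) can be sketched as the following Simulated Annealing loop; the cost function, the move and the initial temperature are illustrative placeholders, not the dissertation's implementation.

```python
import math
import random

def simulated_annealing(initial, cost, move, t0=100.0, alpha=0.90,
                        chain_length=1, t_final=0.1,
                        max_iters=100, max_non_improving=50):
    """Minimise `cost` with geometric cooling T <- alpha * T.
    Stops when T <= t_final, the iteration budget is exhausted,
    or too many consecutive non-improving moves occur."""
    current = best = initial
    t, iters, non_improving = t0, 0, 0
    while t > t_final and iters < max_iters and non_improving < max_non_improving:
        for _ in range(chain_length):   # Markov chain at the current temperature
            candidate = move(current)
            delta = cost(candidate) - cost(current)
            # Accept improving moves always; worse moves with probability exp(-delta/T).
            if delta < 0 or random.random() < math.exp(-delta / t):
                current = candidate
            if cost(current) < cost(best):
                best, non_improving = current, 0
            else:
                non_improving += 1
            iters += 1
        t *= alpha                      # geometric temperature reduction
    return best

# Toy usage: minimise a 1-D quadratic by random perturbation.
sol = simulated_annealing(5.0, cost=lambda x: x * x,
                          move=lambda x: x + random.uniform(-1.0, 1.0))
```

The same skeleton covers all the cases below: only `alpha`, `chain_length` and the stopping limits change between experiments.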
4.7.2.1 Concatenation cost for Move 1
Table 4.5 Information of concatenation cost (Move 1) with temperature reduction rate, α = 0.90

Words            Mean      Variance   Standard    Initial    Best       Worst
                                      deviation   solution   solution   solution
nasi             40.1364    4.7705    2.1841       51.9894    37.1871    44.2674
musim            57.1876    1.2936    1.1374       66.3190    55.6639    58.9858
janji            57.4715    1.5463    1.2435       65.8612    55.7146    59.4232
kampung          45.5589    6.0959    2.4690       57.0421    42.3074    51.4532
vitamin          57.5345    9.2171    3.0360       71.3736    54.3926    64.0537
demikian         61.0513    3.4004    1.8440       66.3976    58.8410    64.4931
muktamad         56.5940    4.0877    2.0218       69.2485    52.7411    59.9678
informasi        92.8231    7.7344    2.7811      102.8064    90.1583    98.8206
selanjutnya      88.0075   25.3428    5.0342       96.2091    78.6626    96.0506
berpengetahuan   95.0096   29.3829    5.4206      109.5395    85.2593   103.3070
Total           651.3744
4.7.2.2 Concatenation cost for Move 2
Table 4.6 Information of concatenation cost (Move 2) with temperature reduction rate, α = 0.90

Words            Mean      Variance   Standard    Initial    Best       Worst
                                      deviation   solution   solution   solution
nasi             38.7232    1.7800    1.3342       51.9894    38.0371    42.3188
musim            55.6213    0.5375    0.7332       66.3190    54.8516    56.4404
janji            65.8612    0         0            65.8612    65.8612    65.8612
kampung          51.9043    0         0            57.0421    51.9043    51.9043
vitamin          68.1684    0         0            71.3736    68.1684    68.1684
demikian         62.3619    0         0            66.3976    62.3619    62.3619
muktamad         66.9015    0         0            69.2485    66.9015    66.9015
informasi        93.7267    4.7221    2.1731      102.8064    91.0594    96.6425
selanjutnya      90.7814    0         0            96.2091    90.7814    90.7814
berpengetahuan   98.1096    3.5564    1.8859      109.5395    94.9970   100.4276
Total           692.1595
4.7.2.3 Concatenation cost for Move 3
Table 4.7 Information of concatenation cost (Move 3) with temperature reduction rate, α = 0.90

Words            Mean      Variance   Standard    Initial    Best       Worst
                                      deviation   solution   solution   solution
nasi             51.6553    0         0            51.9894    51.6553    51.6553
musim            57.3512    0.3507    0.5922       66.3190    56.5752    57.8808
janji            58.8172    0         0            65.8612    58.8172    58.8172
kampung          51.3213    0         0            57.0421    51.3213    51.3213
vitamin          65.5459    1.7338    1.3168       71.3736    62.9330    67.1848
demikian         61.9506    2.9976    1.7314       66.3976    59.9471    64.7144
muktamad         59.6220    1.2380    1.1127       69.2485    58.2360    60.9776
informasi       102.7709    0         0           102.8064   102.7709   102.7709
selanjutnya      91.7929    0.8504    0.9222       96.2091    91.1065    93.4464
berpengetahuan   95.3239    6.0698    2.4637      109.5395    94.0278   101.2508
Total           696.1512
4.7.2.4 Concatenation cost for Move 4
Table 4.8 Information of concatenation cost (Move 4) with temperature reduction rate, α = 0.90

Words            Mean      Variance   Standard    Initial    Best       Worst
                                      deviation   solution   solution   solution
nasi             41.9025    4.0708    2.0176       51.9894    39.7428    44.8005
musim            58.1836    2.3585    1.5357       66.3190    56.0629    59.9161
janji            58.8172    0         0            65.8612    58.8172    58.8172
kampung          50.6723    3.1929    1.7869       57.0421    46.5059    51.9677
vitamin          62.0547    5.2145    2.2835       71.3736    59.5890    64.6396
demikian         65.1283    2.6202    1.6187       66.3976    62.3619    66.3976
muktamad         66.9015    0         0            69.2485    66.9015    66.9015
informasi        94.5730    4.5584    2.1350      102.8064    91.1099    97.3322
selanjutnya      91.3895    0.2920    0.5404       96.2091    90.7814    92.2531
berpengetahuan   98.8900   10.1918    3.1925      109.5395    94.9008   104.8383
Total           688.5126
From Table 4.5 to Table 4.8, the move that yields the smallest sum of the mean values for the 10 words is move 1. In other words, move 1 performs best compared to the others. The advantage of move 1 is its flexibility in swapping the phonemes: it does not take the values of the local costs into consideration when determining which phoneme to swap, so it explores neighbourhoods of various sizes in every iteration. This advantage is significant when one or more local costs have high magnitudes compared to the others, because all phonemes have an equal chance to be swapped regardless of the magnitude of their local cost. Due to this advantage, move 1 outperforms the other moves for the 10 words tested. For all the words tested except "berpengetahuan", there exist one or more local costs with relatively high magnitudes compared to the others; therefore move 2, move 3 and move 4 underperform in unit selection. The disadvantage of move 1 is that the best local cost may not be maintained in the next iteration, due to its randomness.
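Move 1 as described picks a phoneme position uniformly at random, ignoring the local costs, and swaps in another candidate for that position. A minimal sketch, assuming sequences and candidate pools are plain Python lists (not the dissertation's Matlab code):

```python
import random

def move_1(sequence, candidates):
    """Swap the unit at one randomly chosen phoneme position with another
    candidate for that phoneme, regardless of the local cost values."""
    pos = random.randrange(len(sequence))  # every position equally likely
    pool = candidates[pos]
    if len(pool) > 1:
        new_unit = random.choice([u for u in pool if u != sequence[pos]])
        sequence = sequence[:pos] + [new_unit] + sequence[pos + 1:]
    return sequence

# Toy example: 3 phonemes with small candidate pools (indices stand in for units).
cands = [[0, 1, 2], [0, 1], [0, 1, 2, 3]]
new_seq = move_1([0, 0, 0], cands)  # exactly one position changes
```

Because the swap stays within each phoneme's candidate pool, the move can never step into the infeasible region mentioned above.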
Since move 1 has the best performance, various annealing schedules and lengths of the Markov chain were tested on move 1 to investigate its performance under different conditions.
Case 1:
Markov chain length = 1.
Stopping criteria:
- final temperature, T_f ≤ 0.1
- maximum number of iterations = 200
- number of non-improving moves = 0.5(200) = 100
Table 4.9 Information of concatenation cost with temperature reduction rate, α = 0.95

Words            Mean      Variance   Standard    Initial    Best       Worst
                                      deviation   solution   solution   solution
nasi             39.7770    1.5701    1.2531       51.9894    38.1209    40.5318
musim            56.2531    0.3269    0.5718       66.3190    55.4287    57.1868
janji            56.8056    1.7449    1.3209       65.8612    55.3994    59.2640
kampung          43.9745    6.3534    2.5206       57.0421    41.1393    48.4903
vitamin          56.2057    7.3691    2.7146       71.3736    51.1585    60.4980
demikian         58.8773    1.2913    1.1363       66.3976    56.7844    60.6892
muktamad         56.0925    3.0699    1.7521       69.2485    53.7636    58.2923
informasi        92.3077    4.0464    2.0116      102.8064    89.8871    96.8190
selanjutnya      87.7347    9.6309    3.1034       96.2091    80.8717    90.9065
berpengetahuan   95.1154    6.2367    2.4973      109.5395    91.4759    98.8498
Total           643.1435
Case 2:
Markov chain length = 1.
Stopping criteria:
- final temperature, T_f ≤ 0.1
- maximum number of iterations = 60
- number of non-improving moves = 0.5(60) = 30
Table 4.10 Information of concatenation cost with temperature reduction rate, α = 0.85

Words            Mean      Variance   Standard    Initial    Best       Worst
                                      deviation   solution   solution   solution
nasi             41.1636    2.1525    1.4671       51.9894    39.2396    43.5353
musim            57.4694    0.8764    0.9361       66.3190    56.4528    59.2198
janji            58.1928    2.5889    1.6090       65.8612    55.3994    59.8581
kampung          46.8516    5.5348    2.3526       57.0421    43.5470    50.3501
vitamin          59.7214   13.1650    3.6284       71.3736    56.5820    66.2310
demikian         60.5946    5.6822    2.3837       66.3976    57.5656    65.8790
muktamad         55.3585   13.4849    3.6722       69.2485    53.4601    60.0769
informasi        95.8624   10.2939    3.2084      102.8064    90.5174   100.3186
selanjutnya      89.3682    9.5203    3.0855       96.2091    87.2320    91.7274
berpengetahuan   97.6107   18.4947    4.3005      109.5395    87.0187   102.6322
Total           662.1932
Case 3:
Markov chain length = 1.
Stopping criteria:
- final temperature, T_f ≤ 0.1
- maximum number of iterations = 50
- number of non-improving moves = 0.5(50) = 25
Table 4.11 Information of concatenation cost with temperature reduction rate, α = 0.80

Words            Mean      Variance   Standard    Initial    Best       Worst
                                      deviation   solution   solution   solution
nasi             37.0614    4.1179    2.0293       51.9894    38.1209    44.0197
musim            57.5359    0.8788    0.9374       66.3190    56.3752    59.0334
janji            58.1860    4.5681    2.1373       65.8612    56.0573    61.5893
kampung          47.8566   10.5943    3.2549       57.0421    43.8004    52.7271
vitamin          60.5878   17.1983    4.1471       71.3736    55.4193    67.6045
demikian         61.6131    4.8180    2.1950       66.3976    57.6987    64.9662
muktamad         59.4896   23.0488    4.8009       69.2485    52.7343    65.3014
informasi        94.5739   15.6052    3.9503      102.8064    88.3331    99.2731
selanjutnya      89.8916   13.6941    3.7006       96.2091    88.2295    96.2091
berpengetahuan  100.4161   24.3634    4.9359      109.5395    94.1026   106.1951
Total           667.2120
From Table 4.5, Table 4.9, Table 4.10 and Table 4.11, move 1 performs best under a Markov chain length of 1 with temperature reduction rate α = 0.95 (Table 4.9). When the cooling is slow enough, the algorithm is able to reach thermal equilibrium at each temperature value and avoid being trapped in a local minimum. The solution quality obtained for the slow temperature reduction rate, α = 0.95, is better than for the faster temperature reduction rates, α = 0.80, 0.85 and 0.90. When the cooling is too fast, the algorithm cannot reach thermal equilibrium at each temperature value and becomes trapped in a local minimum. Although the solution quality obtained for a slow temperature reduction rate is better, it comes at the cost of a slower convergence rate.
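This trade-off can be quantified: with geometric cooling, the number of temperature reductions needed to fall from T0 to T_f is ⌈log(T_f/T0)/log α⌉, so a slower rate directly lengthens the run. A small illustration (the starting temperature T0 = 100 is an assumed value, not taken from the dissertation):

```python
import math

def cooling_steps(t0, t_final, alpha):
    """Number of reductions T <- alpha * T needed until T <= t_final."""
    return math.ceil(math.log(t_final / t0) / math.log(alpha))

t0, t_final = 100.0, 0.1
steps = {alpha: cooling_steps(t0, t_final, alpha) for alpha in (0.80, 0.85, 0.90, 0.95)}
# Slower cooling (alpha = 0.95) needs roughly 135 reductions versus 31 for alpha = 0.80.
```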
Case 4:
Markov chain length = 2.
Stopping criteria:
- final temperature, T_f ≤ 0.1
- maximum number of iterations = 400
- number of non-improving moves = 0.5(400) = 200
Table 4.12 Information of concatenation cost with temperature reduction rate, α = 0.95

Words            Mean      Variance   Standard    Initial    Best       Worst
                                      deviation   solution   solution   solution
nasi             38.8704    0.7515    0.8669       51.9894    37.5301    39.8161
musim            55.8377    0.1841    0.4291       66.3190    55.3601    56.3920
janji            55.8641    3.7639    1.9401       65.8612    53.6312    58.9424
kampung          43.3521    8.0114    2.8304       57.0421    38.0765    46.3593
vitamin          56.0249    1.7922    1.3387       71.3736    53.0617    58.3926
demikian         58.3768    1.9648    1.4017       66.3976    55.9996    60.1783
muktamad         53.4289    3.2450    1.8014       69.2485    50.1260    56.1836
informasi        90.1820    2.2295    1.4932      102.8064    88.5677    91.6438
selanjutnya      88.6255    8.6738    2.9451       96.2091    84.8192    93.8592
berpengetahuan   91.6801    5.8925    2.4274      109.5395    86.6432    94.6761
Total           632.2425
Case 5:
Markov chain length = 2.
Stopping criteria:
- final temperature, T_f ≤ 0.1
- maximum number of iterations = 200
- number of non-improving moves = 0.5(200) = 100
Table 4.13 Information of concatenation cost with temperature reduction rate, α = 0.90

Words            Mean      Variance   Standard    Initial    Best       Worst
                                      deviation   solution   solution   solution
nasi             39.5469    2.6193    1.6184       51.9894    37.7650    42.6251
musim            56.0799    0.4742    0.6886       66.3190    55.4613    57.3321
janji            56.3270    3.1464    1.7738       65.8612    53.6312    58.4735
kampung          42.6240    5.1535    2.2701       57.0421    39.1755    47.1570
vitamin          56.8142    1.2225    1.1057       71.3736    55.1818    58.8009
demikian         58.6278    1.6375    1.2800       66.3976    56.7614    60.2722
muktamad         53.2059    4.9892    2.2336       69.2485    50.1260    57.1076
informasi        90.2507    2.9655    1.7221      102.8064    87.8247    92.9634
selanjutnya      89.7432    7.9122    2.8129       96.2091    85.8571    94.0317
berpengetahuan   94.0035    2.4743    1.5730      109.5395    91.6833    95.2747
Total           637.2231
Case 6:
Markov chain length = 2.
Stopping criteria:
- final temperature, T_f ≤ 0.1
- maximum number of iterations = 150
- number of non-improving moves = 0.5(150) = 75
Table 4.14 Information of concatenation cost with temperature reduction rate, α = 0.85

Words            Mean      Variance   Standard    Initial    Best       Worst
                                      deviation   solution   solution   solution
nasi             39.6497    2.7807    1.6675       51.9894    37.1871    42.3749
musim            56.7848    0.6914    0.8315       66.3190    55.5627    58.0270
janji            56.6454    4.2224    2.0549       65.8612    53.6312    59.5977
kampung          46.9636    9.0054    3.0009       57.0421    43.4254    51.5412
vitamin          57.7452    9.0776    3.0129       71.3736    52.7207    62.6273
demikian         59.0589    2.1474    1.4654       66.3976    55.7617    60.9374
muktamad         56.8856    3.5898    1.8947       69.2485    54.4198    59.7742
informasi        90.0772    1.6477    1.2836      102.8064    88.4301    92.3604
selanjutnya      89.7157    9.7443    3.1216       96.2091    85.8864    94.0497
berpengetahuan   95.1791    8.5402    2.9224      109.5395    91.0962    99.3100
Total           648.6352
Case 7:
Markov chain length = 2.
Stopping criteria:
- final temperature, T_f ≤ 0.1
- maximum number of iterations = 100
- number of non-improving moves = 0.5(100) = 50
Table 4.15 Information of concatenation cost with temperature reduction rate, α = 0.80

Words            Mean      Variance   Standard    Initial    Best       Worst
                                      deviation   solution   solution   solution
nasi             39.3888    0.9839    0.9919       51.9894    38.5778    41.6367
musim            56.9937    0.6505    0.8065       66.3190    55.7416    58.2279
janji            56.5346    1.9137    1.3834       65.8612    53.6312    58.6272
kampung          45.5048   11.1886    3.3449       57.0421    40.5324    51.5542
vitamin          58.0246   10.0880    3.1762       71.3736    53.4715    62.9814
demikian         61.0635    3.9300    1.9824       66.3976    59.1747    64.9493
muktamad         56.9372    6.8791    2.6228       69.2485    53.8201    62.4109
informasi        93.8204    8.5837    2.9298      102.8064    87.9695    97.4976
selanjutnya      92.7170   11.3913    3.3751       96.2091    86.2980    96.2091
berpengetahuan   99.4340   21.0200    4.5847      109.5395    92.7515   105.3478
Total           660.4186
According to Triki et al. (2005), the probability distribution is closer to quasi-equilibrium for a longer Markov chain. In Table 4.12, Table 4.13, Table 4.14 and Table 4.15, the length of the Markov chain is 2. The solution quality obtained for a Markov chain length of 2 is better than for a length of 1 across all four temperature reduction rates. Therefore, move 1 performs best with the longer Markov chain length of 2 and the slow temperature reduction rate α = 0.95, as shown in Table 4.12. The best solution, mean and worst solution from Table 4.12 are plotted in Figure 4.8.
[Figure: line plot of concatenation cost (y-axis, 30 to 100) against problem number (x-axis, 1 to 10), with series for the best solution, mean and worst solution.]
Figure 4.8 SA best solutions, mean and worst solutions for the ten problems from Table 4.12.
4.8 Concatenation

The concatenation of waveforms to form the target word utterances is based on the result in Table 4.12, since it yields the smallest sum of the mean values for the 10 words. The unit sequences for the 10 words from Table 4.12 are presented in Table 4.16.
Table 4.16 The sequences of the 10 selected words.

Words            Sequences
nasi             _n[1] a[1084] s[246] i[805]
musim            _m[26] u[528] s[31] i[929] m[478]
janji            _j[6] a[2943] n[1053] j[131] i[644]
kampung          _k[5] a[549] m[407] p[168] u[664] ng[26]
vitamin          _v[1] i[784] t[89] a[2675] m[235] i[691] n[243]
demikian         _d[13] e[425] m[336] i[831] k[491] ia[46] n[956]
muktamad         _m[26] u[292] k[486] t[397] a[83] m[395] a[968] d[243]
informasi        _i[1] n[948] f[22] o[129] r[655] m[440] a[2826] s[305] i[442]
selanjutnya      _s[1] e[537] l[362] a[2710] n[1031] j[7] u[206] t[142] ny[1] a[2060]
berpengetahuan   _b[1] e[1426] r[30] p[172] e[1013] ng[347] e[1028] t[595] a[661] h[221] ua[65] n[245]
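The concatenation step itself simply appends the selected unit waveforms in sequence order. A minimal sketch with toy sample lists standing in for the 16 kHz wave data (loading the corpus files is omitted, and the unit names are placeholders):

```python
def concatenate_units(units):
    """Join the selected unit waveforms end to end, in sequence order."""
    out = []
    for samples in units:
        out.extend(samples)
    return out

# Toy "waveforms" standing in for the units of "nasi": _n[1] a[1084] s[246] i[805].
selected = {"_n": [0.1, 0.2], "a": [0.3], "s": [0.4, 0.5], "i": [0.6]}
word = concatenate_units([selected[p] for p in ["_n", "a", "s", "i"]])
```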
Figure 4.9 Waveform "_s1"
Figure 4.10 Waveform "e537"
Figure 4.11 Waveform "l362"
Figure 4.12 Waveform "a2710"
Figure 4.13 Waveform "n1031"
Figure 4.14 Waveform "j7"
Figure 4.15 Waveform "u206"
Figure 4.16 Waveform "t142"
Figure 4.17 Waveform "ny1"
Figure 4.18 Waveform "a2060"
Figure 4.19 Concatenated waveform for the word "selanjutnya".
Figure 4.20 Spectrogram for the word "nasi".
Figure 4.21 Spectrogram for the word "berpengetahuan".
Figure 4.22 Spectrogram for the word "demikian".
Figure 4.23 Spectrogram for the word "demikian" without considering concatenation cost.
Figure 4.9 to Figure 4.18 show the waveforms for the phonemes "_s", "e", "l", "a", "n", "j", "u", "t", "ny" and "a" respectively. Figure 4.19 shows the waveform for the word "selanjutnya" after concatenating the waveforms from Figure 4.9 to Figure 4.18. Figure 4.20, Figure 4.21 and Figure 4.22 show the spectrograms obtained for the words "nasi", "berpengetahuan" and "demikian"; these figures are obtained according to the speech unit sequences in Table 4.12. Figure 4.23 shows the spectrogram for the word "demikian" without considering the concatenation cost; it only considers the matching of the left and right phonetic contexts. The circled red parts in Figure 4.22 and Figure 4.23 are zoomed in and presented in Figure 4.24 and Figure 4.25 respectively.
Figure 4.24 Spectrogram zoom in for the word “demikian” from Figure 4.22.
Figure 4.25 Spectrogram zoom in for the word “demikian” from Figure 4.23.
The spectrogram that considers the concatenation cost (Figure 4.24) shows a better join condition: the red and green lines in Figure 4.24 are straight, which indicates a smooth join between the two segments during synthesis. In the spectrogram that does not consider the concatenation cost (Figure 4.25), the line is curved, which indicates discontinuities or spectral mismatch between the concatenated units.
CHAPTER 5
TESTING, ANALYSIS AND RESULT
5.1 Experiment
A formal listening test was conducted to evaluate the output sound. The following sections describe the test materials, test conditions, test procedures, listeners, and the statistical analysis of the results.
5.2 Test Materials
Ten words were selected for the listening test. These are words formed by the unit selection system after concatenation, and they range from 4 to 12 phonemes.
5.3 Test Conditions
The ten words selected from the unit selection system used the following synthesis conditions:
- Units: Phoneme
- Feature Extraction: Mel Frequency Cepstral Coefficients
- Spectral Distance: Euclidean distance
5.4 Test Procedure
The test was carried out in two parts. The first part consists of the ten words selected from the unit selection system. Listeners could take as long as they pleased over each word and take a short break between words. This part evaluates the intelligibility of the synthesized words: listeners were allowed to play the sound file more than once for each word and were required to write down what they heard. The second part of the listening test is a mean opinion score (MOS): listeners were requested to play the output sounds and tick the words they think sound better in terms of naturalness. In this part, there are two sound files for each word, labelled "a" and "b". Sound files "a" are the ten words selected from the unit selection system; sound files "b" are the same ten words selected from the unit selection system but without considering the concatenation cost. The purpose of this part is to test the naturalness of the synthesized words.
5.5 Profiles of Listeners

A total of 15 listeners participated in the listening test. They come from different backgrounds, genders, races and states of origin. Table 5.1 shows the profiles of the listeners.
Table 5.1 Profiles of Listeners

                                 Number of Participants   Percentage
Gender
  Male                           8                        53.33%
  Female                         7                        46.67%
Race
  Malay                          6                        40%
  Chinese                        8                        53.33%
  Others                         1                        6.67%
State of Participants
  Johor                          5                        33.33%
  Pulau Pinang                   2                        13.33%
  Perak                          2                        13.33%
  Kedah                          1                        6.67%
  Kuala Lumpur                   1                        6.67%
  Selangor                       1                        6.67%
  Sarawak                        2                        13.33%
  Pahang                         1                        6.67%
Malay as First Spoken Language
  Yes                            6                        40%
  No                             9                        60%
5.5.1 Percentage of Listeners by Gender

A total of fifteen listeners (six Malay, eight Chinese and one other) participated in the listening test. The ages of the listeners ranged between 22 and 34, with a mean age of 25 years. All the participants were fluent speakers of the Malay language with no hearing loss. Figure 5.1 shows the percentage of listeners by gender.
Figure 5.1 Percentage of listeners by gender.
5.5.2 Percentage of Listeners by Race

Listeners of three races participated in the listening test: Malay (40%), Chinese (53.33%) and others (6.67%). Figure 5.2 shows the percentage of listeners by race.
Figure 5.2 Percentage of listeners by race.
5.5.3 Percentage of Listeners by State of Origin

The listeners come from eight different states: Johor (33.33%), Pulau Pinang (13.33%), Perak (13.33%), Kedah (6.67%), Kuala Lumpur (6.67%), Selangor (6.67%), Sarawak (13.33%) and Pahang (6.67%). Figure 5.3 shows the percentage of listeners by state of origin.
Figure 5.3 Percentage of listeners by state of origin.
5.6 Result and Analysis

5.6.1 Word Level Testing

This section covers word level testing, which evaluates the intelligibility of the synthesized words. The listeners were required to write down what they heard for all 10 words in Table 5.2.
Table 5.2 Words selected for listening test.

Number   Words            No. of phonemes
1        nasi             4
2        musim            5
3        janji            5
4        kampung          6
5        vitamin          7
6        demikian         7
7        muktamad         8
8        informasi        9
9        selanjutnya      10
10       berpengetahuan   12
All the listeners wrote the correct answers for all the selected words except word 2, "musim". Six participants wrote a wrong answer for word 2, being confused by the pronunciation of the first phoneme, "_m": five wrote "busim" and one wrote "pusim". Therefore, the intelligibility rate of the 10 selected synthesized words is 96% (144 correct responses out of 150). Figure 5.4 shows the level of intelligibility of the 10 selected words.
Figure 5.4 Level of intelligibility of the 10 selected words.
5.6.2 Mean Opinion Score

This section of the listening test is the mean opinion score. Its objective is to compare the naturalness of the synthesized words with and without considering the concatenation cost. Figure 5.5 shows the results of the mean opinion score. From Figure 5.5, 13 out of 15 listeners rated the synthesized words that consider the concatenation cost as sounding better. One listener rated the quality of the two categories of synthesized words as the same, while another stated that the synthesized words that do not consider the concatenation cost sound better. Therefore, the naturalness rate of the 10 selected synthesized words is 92.86% (13 of the 14 listeners who expressed a preference). Table 5.3 shows the score line of the synthesized words that consider the concatenation cost.
Figure 5.5 Results of the mean opinion score.
Table 5.3 The score line of synthesized words that consider the concatenation cost.

Score   Frequency
9/10    1
8/10    2
7/10    6
6/10    4
5/10    1
3/10    1
Table 5.4 The score line of the 10 synthesized words that consider the concatenation cost.

Words            Score
nasi             7/15
musim            10/15
janji            15/15
kampung          12/15
vitamin          8/15
demikian         15/15
muktamad         4/15
informasi        3/15
selanjutnya      12/15
berpengetahuan   13/15
From Table 5.4, two words have a perfect score: "janji" and "demikian". Five words score 8 or more: "musim", "kampung", "selanjutnya", "vitamin" and "berpengetahuan". However, three words do not score well: "nasi", "muktamad" and "informasi". The reason is that spectral mismatch exists in these words, especially at the first and last phonemes. There is no left phonetic context for the first phoneme and no right phonetic context for the last phoneme, so the first and last phonemes only partially match the phonetic context. The quality of these synthesized words is therefore degraded by spectral mismatch at the first and last phonemes.
CHAPTER 6
CONCLUSION AND RECOMMENDATION

6.1 Conclusion

This dissertation has reviewed related methods and algorithms for unit selection, and a first version of unit selection using Simulated Annealing for the Corpus-based Malay Text-to-Speech system has been implemented. The main purpose of this dissertation is to select the speech unit sequences that result in the lowest concatenation cost. The system has achieved its aim of improving the speech quality of the first version of the Corpus-based Malay Text-to-Speech system.
The system concatenates at the phoneme level using variable length unit selection. The speech units were selected by Corpus-based unit selection from a corpus of 381 sentences and 16826 phonemes with a total size of 37.6 MB. The storage format for the Corpus-based Malay Text-to-Speech system is wave, and the sampling frequency is 16 kHz.
To produce high quality output speech, the lowest overall cost is essential, since it minimizes contextual differences and spectral discontinuities. Unit selection is based on two cost functions, the target cost and the concatenation cost. Since the target cost and the concatenation cost have relative importance in the whole cost function, tuning the weights is an important stage in the design of the selection algorithm (Díaz and Banga, 2006). However, results show that there does not exist a set of weights with consistent performance across all (or almost all) of the sets (Díaz and Banga, 2006).
The target cost can be further divided into two types, phonetic target costs and prosodic target costs. The phonetic target cost (Zhao et al., 2006) contains sub-costs for the left phone context and the right phone context. In the proposed method, only the phonetic target cost is employed, and the contextual linguistics (target cost) is used as a filtering tool. Only the speech units that match the left and right phonetic contexts can be chosen; for the first phoneme only the right phonetic context, and for the last phoneme only the left phonetic context, must match. The retained candidate units are used as the input for Simulated Annealing. Since the computation of the target cost is not included in the cost function for unit selection, the cost function is left with only the concatenation cost. The advantage of the proposed method is therefore that weight tuning is not required, since there is only one component in the cost function, the concatenation cost.
The retained candidate units then undergo concatenation cost minimization. The features included in the concatenation cost calculation are MFCC-type coefficients that parameterize the borders of the speech units in the corpus. The concatenation cost is the distortion between these parameters for two adjacent candidate units (Zhao et al., 2006). Two stages are involved in the computation of a distance measure: feature extraction and quantification. A speech unit sequence that yields a smaller concatenation cost gives a better join condition at the concatenation point and thus produces better speech quality. To calculate the concatenation cost, feature extraction (MFCC) is performed for all the speech units, transforming them into 12-dimensional MFCCs; the distance measure used is the Euclidean distance. Higher concatenation costs are predicted to produce audible discontinuities, and such sequences are less likely to be selected.
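A sketch of this cost: the Euclidean distance between the border MFCC vectors of successive units, summed along the sequence. The unit record fields and the 3-dimensional toy vectors are assumptions for illustration; the dissertation uses 12-dimensional MFCCs.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def concatenation_cost(sequence):
    """Sum of border distances between successive units. Units are assumed
    to carry MFCC vectors of their end and start frames."""
    return sum(euclidean(prev["end_mfcc"], nxt["start_mfcc"])
               for prev, nxt in zip(sequence, sequence[1:]))

# Toy 3-dimensional stand-ins for the 12-dimensional MFCC border vectors.
seq = [{"end_mfcc": [0.0, 0.0, 0.0]},
       {"start_mfcc": [3.0, 4.0, 0.0], "end_mfcc": [1.0, 1.0, 1.0]},
       {"start_mfcc": [1.0, 1.0, 1.0]}]
cost = concatenation_cost(seq)  # 5.0 + 0.0 = 5.0
```

A sequence whose adjacent border vectors coincide, as in the second join above, contributes zero cost, which is the smooth-join condition the selection aims for.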
The search for the minimum cost sequence is carried out using SA. Four different types of moves are used to obtain neighbourhood solutions. The moves can be divided into two categories: moves that swap a phoneme's unit without regard to the magnitude of the local cost, and moves that swap based on the magnitude of the local cost. In this research, the former moves performed better than the latter, since the latter have a tendency to become trapped in local minima. For the annealing schedule, four different temperature reduction rates were used in this research. The slower reduction rates performed better than the faster ones. For the length of the Markov chain, two different lengths were used: reducing the temperature at every iteration, and reducing it only after every two iterations. The latter approach, with the longer Markov chain, performed better than the shorter one. The SA implemented in this research is therefore considered robust, as its performance degrades only gradually under small changes in the parameter settings.
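The acceptance rule and geometric cooling behind this search can be sketched as follows, as a minimal illustration of the standard Metropolis criterion together with the 0.95 reduction rate that appears in the appendix listing (the helper names are hypothetical):

```cpp
#include <cmath>

// Metropolis acceptance: an improving move (delta <= 0) is always accepted;
// a worsening move is accepted with probability exp(-delta / temperature),
// compared against u, a uniform random number in (0, 1) supplied by the caller.
bool accept_move(double delta, double temperature, double u)
{
    if (delta <= 0.0)
        return true;
    return std::exp(-delta / temperature) > u;
}

// Geometric cooling, T <- alpha * T. The appendix listing uses alpha = 0.95;
// slower rates (alpha closer to 1) performed better in this research.
double cool(double temperature, double alpha = 0.95)
{
    return temperature * alpha;
}
```

At high temperature almost any move is accepted, which lets the search escape local minima; as the temperature drops, only improving or nearly cost-neutral moves survive.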
The contribution of the concatenation cost was evaluated by conducting a listening test. Ten different Malay words were selected for the listening test; these can be divided into three categories based on the number of phonemes in the word. The listening test was conducted to validate the ability of the concatenation cost to predict spectral discontinuity, and its results justified the contribution of the concatenation cost to unit selection. Therefore, Simulated Annealing is a suitable method for unit selection, since it improves the speech quality by selecting the best speech unit sequence within reasonable computational time and effort.
6.2  Suggestions for Future Work
In this dissertation, the speech feature selected is the MFCC and the spectral distance used is the Euclidean distance. To investigate which combination best predicts the spectral discontinuities, the following future work is suggested:
1. Implementation of unit selection using the Euclidean distance with Linear Predictive Coefficients (LPC).
2. Implementation of unit selection using Kullback-Leibler distance with
MFCCs.
3. Implementation of unit selection using Mahalanobis distance with MFCCs.
4. Comparison of the performance of three combinations above.
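As a sketch of the third suggestion, a Mahalanobis distance with a diagonal covariance (a common simplification, where each coefficient is weighted by the inverse of its variance) could replace the Euclidean distance; with all variances equal to one it reduces to the Euclidean distance used in this dissertation. All names here are illustrative:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Mahalanobis distance with a diagonal covariance: each squared coefficient
// difference is divided by that coefficient's variance, so noisier
// coefficients contribute less to the join cost.
double mahalanobis_diag(const std::vector<double>& x,
                        const std::vector<double>& y,
                        const std::vector<double>& var)
{
    double d = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double diff = x[i] - y[i];
        d += diff * diff / var[i];
    }
    return std::sqrt(d);
}
```

The per-coefficient variances would be estimated once from the MFCC vectors of the whole corpus.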
For the heuristic method, the moves used in Simulated Annealing need to be enhanced. The decision of which phoneme to swap should be based on the total number of candidate units for that particular phoneme, not on the magnitude of the local cost. In other words, a phoneme with a large number of candidate units should be swapped more frequently. Different annealing schedules and Markov chain lengths can also be studied in future work. Other heuristic methods, such as the Genetic Algorithm and Tabu Search, can also be applied to unit selection, and hybridization of these heuristic methods, together with a comparison of their performance, can be carried out in the future.
The two meta-heuristic methods, SA and GA, can be hybridized to yield a more effective algorithm. One possible approach is to replace the crossover and mutation processes of GA with SA operators (Hwang and He, 2006). This approach maintains the advantages and avoids the disadvantages of both search algorithms. Thanks to the special characteristics of SA, this hybrid algorithm has a better fine-tuning ability when searching for the global optimum and a stronger hill-climbing ability for escaping from local minima than the standard GA (Hwang and He, 2006).
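A minimal sketch of this SA-as-mutation idea, assuming the caller supplies the cost function, the perturbed individual and a uniform random number (all names are hypothetical, not taken from Hwang and He, 2006):

```cpp
#include <cmath>
#include <vector>

// SA-style mutation for a GA individual: propose a perturbed copy and keep it
// only if it passes the Metropolis test at the current temperature. cost_fn,
// the perturbation and u (uniform in (0,1)) are supplied by the caller.
template <typename CostFn>
std::vector<int> sa_mutation(const std::vector<int>& individual,
                             const std::vector<int>& perturbed,
                             CostFn cost_fn,
                             double temperature,
                             double u)
{
    const double delta = cost_fn(perturbed) - cost_fn(individual);
    if (delta <= 0.0 || std::exp(-delta / temperature) > u)
        return perturbed;   // accepted: replaces the plain GA mutation result
    return individual;      // rejected: the individual survives unchanged
}
```

Applied inside each GA generation, this keeps the population-based exploration of GA while giving each individual the temperature-controlled local refinement of SA.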
REFERENCES
Ali, M. M., Törn, A. and Viitanen, S. (2002). A direct search variant of the simulated
annealing algorithm for optimization involving continuous variables. Computers
& Operations Research. 29(1), 87-102.
Atkinson, A. C. (1992). A segmented algorithm for simulated annealing. Statistics
and Computing, (2), 221–230.
Bellegarda, J. R. (2008). Unit-Centric Feature Mapping for Inventory Pruning in Unit
Selection Text-to-Speech Synthesis. Audio, Speech, and Language Processing,
IEEE Transactions. Jan. 2008. 74-82.
Blouin, C., Rosec, O., Bagshaw, P.C. and Alessandro, C. (2002). Concatenation cost
calculation and optimisation for unit selection in TTS. Proceedings of 2002
IEEE Workshop on Speech Synthesis.. 11-13 September. Santa Monica, USA,
231-234.
Campbell, W.N. and Black, A. W. (1997). Prosody and the selection of source units
for concatenative synthesis. In: Van Santen, J.P.H., Sproat, R.W., Olive, J.P.,
Hirschberg, J. (Eds.), Progress in Speech Synthesis. Springer. New York, 279–
292.
Cepko, J., Talafova, R. and Vrabec, J. (2008). Indexing join costs for faster unit
selection synthesis. Systems, Signals and Image Processing, 2008. IWSSIP 2008.
15th International Conference.25-28 June. Bratislava, Slovakia, 503-506.
Chappell, D. T. and Hansen, J. H. L. (2002). A comparison of spectral smoothing
methods for segment concatenation based speech synthesis. Speech
Communication, 36(3-4). 343-373.
Cheh, K. M., Goldberg, J. B. and Askin, R. G. (1991). A note on the effect of
neighbourhood structure in simulated annealing. Computers and Operations
Research, 18. 537-547.
Chen, T. Y. and Su, J. J. (2002). Efficiency improvement of simulated annealing in
optimal structural designs. Advances in Engineering Software. 33(7-10), 675-680.
Chou Fu-chiang (1999). Corpus-based Technologies for Chinese Text-To-Speech
Synthesis. Ph.D Dissertation, Department of Electrical Engineering National
Taiwan University, ROC.
Chou, F. C. and Tseng, C. Y. (1998). Corpus-based Mandarin speech synthesis with
contextual syllabic units based on phonetic properties. Proceedings of the 1998
IEEE International Conference on Acoustics, Speech and Signal Processing. 12-15 May. Seattle, WA, 893-896.
Daniel Rex Greening (1995). Simulated Annealing with Errors. Ph.D. Thesis.
University of California. Los Angeles.
Díaz, F. C. and Banga, E. R. (2006). A method for combining intonation modelling
and speech unit selection in corpus-based speech synthesis systems. Speech Communication. 48(8), 941-956.
Ding, W. K. F. and Campbell, N. (1998). Improving speech synthesis of CHATR
using a perceptual discontinuity function and constraints of prosodic
modification. Proc. 3rd ESCA/COCOSDA International Workshop on Speech
Synthesis. Nov. 1998. Jenolan Caves, Australia, 191-194.
Dong, M. H. and Li, H. Z. (2008). Predicting Spectral and Prosodic Parameters for
Unit Selection in Speech Synthesis. Chinese Spoken Language Processing.
ISCSLP '08. 6th International Symposium.16-19 December. 1-4.
Donovan, R. E. (2001). A new distance measure for costing spectral discontinuities
in concatenative speech synthesizers. The 4th ISCA Tutorial and Research
Workshop on Speech Synthesis. Perthshire, Scotland, 59-62.
Donovan, R. E. (2003). Topics in decision tree based speech synthesis. Computer
Speech & Language, 17( 1). January 2003. 43-67.
Durand, M. D. and White, S. R. (2000). Trading accuracy for speed in parallel
simulated annealing with simultaneous moves. Parallel Computing. 26(1), 135-150.
Farid, M. O. (1980). Aspects of Malay Phonology and Morphology. Bangi: Universiti
Kebangsaan Malaysia.
Fek, M., Pesti, P., Nemeth, G., Zainko, C. and Olaszy, G. (2006). Lecture Notes in
Computer Science. Text, Speech and Dialogue, 9th International Conference.
11-15 September. Czech Republic, 367-374.
Ghazanfari, M., Alizadeh, S., Fathian, M. and Koulouriotis, D. E. (2007).
Comparing simulated annealing and genetic algorithm in learning FCM. Applied
Mathematics and Computation.192(1). 56-68.
Gao, M. and Tian, J. (2007). Path Planning for Mobile Robot Based on Improved
Simulated Annealing Artificial Neural Network. Natural Computation, 2007.
ICNC 2007. Third International Conference. 24-27 August. Haikou, China, 8-12.
Goldstein, L. and M. Waterman (1988). Neighborhood size in the simulated
annealing algorithm. American Journal of Mathematical and Management
Sciences, 8, 409-423.
Hajek, B. (1988). Cooling schedules for optimal annealing. Mathematics of
Operations Research. 13(2), 311–329.
Hamza, W., Rashwan, M. and Afify, M. (2001). Quantitative method for modeling
context in concatenative synthesis using large speech database. Acoustics,
Speech, and Signal Processing, 2001. Proceedings. (ICASSP '01). 2001 IEEE
International Conference. 7-11 May. Salt Lake City, Utah, USA, 789-792.
Hasan, M., AlKhamis. T., and Ali, J. (2000). A comparison between simulated
annealing, genetic algorithm and tabu search methods for the unconstrained
quadratic Pseudo-Boolean function. Computers & Industrial Engineering. 38(3),
323-340.
Hasim, S., Tunga, G., and Yasar, S. (2006). A Corpus-Based Concatenative Speech
Synthesis System for Turkish. Turkish Journal Of Electrical Engineering &
Computer Sciences, 14(2), 209-223.
Hirai, T. and Tenpaku, S. (2004). Using 5 ms segments in concatenative speech.
Fifth ISCA ITRW on Speech Synthesis. 16 Jun. Pittsburgh, PA, USA. 37-42.
Hirai, T., Tenpaku, S.and Shikano, K. (2002). Speech unit selection based on target
values driven by speech data in concatenative speech synthesis. Speech
Synthesis, 2002. Proceedings of 2002 IEEE Workshop. 11-13 September. Santa
Monica, USA, 43-46.
Huang, M. D., Romeo, F. and Sangiovanni-Vincentelli, A.L. (1986). An efficient
general cooling schedule for simulated annealing, In: Proceedings of the IEEE
International Conference on Computer-Aided Design. Santa Clara, 381-384.
Hunt, A. and Black, A. (1996). Unit selection in a concatenative speech synthesis
system using large speech database. Proc. Int. Conf. Acoust., Speech, Signal
Process. Atlanta, GA, 373–376.
Hwang, S. F. and He, R. S. (2006). Improving real-parameter genetic algorithm with
simulated annealing for engineering problems. Advances in Engineering
Software. 37(6), 406-418.
Jan, V. S., Alexander, K., Esther, K. and Taniya, M. (2005). Synthesis of prosody
using multi-level unit sequences. Speech Communication.46(3-4), 365-375.
Janicki, A., Meus, P.and Topczewski, M. (2008). Taking advantage of pronunciation
variation in unit selection speech synthesis for polish. Communications, Control
and Signal Processing, 2008. ISCCSP 2008. 3rd International Symposium. 12-14 March. St. Julians, 1133-1137.
Jeon, Y. J. and Kim, J. C. (2004). Application of simulated annealing and tabu search
for loss minimization in distribution systems. International Journal of Electrical
Power & Energy Systems. 26(1), 9-18.
Jeong, C. S. and Kim, M. H. (1990). Fast parallel simulated annealing for traveling
salesman problem. Neural Networks, 1990., 1990 IJCNN International Joint
Conference. 17-21 June. Washington, D.C, 947-953.
Deller, J. R., Jr., Proakis, J. G. and Hansen, J. H. L. (1993). Discrete-Time
Processing of Speech Signals. Macmillan Publishing Company, New York.
Kawai, H. and Tsuzaki, M. (2002). Acoustic measures vs. phonetic features as
predictors of audible discontinuity in concatenative speech synthesis. Proc.
ICSLP. September 2002. Denver, U.S.A., 2621-2624.
Khor Ai Peng (2007). Implementation of Unit Selection by Using Euclidean Distance
in Malay Text To Speech. Bachelor Degree Thesis: Universiti Teknologi
Malaysia, Skudai.
Kirkpatrick, B., and O'Brien, D. S. R. (2006). A Comparison of Spectral Continuity
Measures as a Join Cost in Concatenative Speech Synthesis. Irish Signals and
Systems Conference, 2006. IET. 28-30 June. Dublin, Ireland, 515–520.
Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. (1983). Optimization by Simulated
Annealing, Science 220, 671-680.
Klabbers, E. and Veldhuis, R. (2001). Reducing audible spectral discontinuities.
IEEE Transactions on Speech and Audio Processing. 9(1), 39-51.
Klabbers, E., Veldhuis, R. and Koppen, K. (2000). A solution to the reduction of
concatenation artifacts in speech synthesis. Proc. ICSLP. 16-20 October.
Beijing, China. 35-62.
Koulamas, C., Antony, S.R., and R. Jaen (1994). A survey of simulated annealing
applications to operations research problems. Omega, 22. 41-56.
Liu, J. (1999). The impact of neighbourhood size on the process of simulated
annealing: Computational experiments on the flowshop scheduling problem.
Computers & Industrial Engineering. 37(1-2), 285-288.
Luis, M. T. (1997). Speech Coding and Synthesis Using Parametric Curves.
University of East Anglia: Master Thesis.
Lundy, M. and Mees, A. (1986). Convergence of an annealing Algorithm.
Mathematical Programming 34, 111–124.
Manuel, D. A. (1997). Constructing efficient simulated annealing algorithms.
Discrete Applied Mathematics. 77(2), 139-159.
McGookin, E. W. and Murray-Smith, D. J. (2006). Submarine manoeuvring
controllers’ optimisation using simulated annealing and genetic algorithms.
Control Engineering Practice. 14(1), 1-15.
McGookin, E. W., Murray-Smith, D. J., & Li, Y. (1996). Segmented simulated
annealing applied to sliding mode controller design. Proceedings of the 13th
world congress of IFAC, San Francisco, USA. 333–338.
Min C., Hu, P., Hong, Y. and Chang, E. (2001). Selecting non-uniform units from a
very large corpus for concatenative speech synthesizer. Acoustics, Speech, and
Signal Processing, 2001. Proceedings. (ICASSP '01). 2001 IEEE International
Conference. 7-11 May. Salt Lake City, UT, USA, 785-788.
Möbius, B. (2000). Corpus-Based Speech Synthesis: Methods and Challenges.
Arbeitspapiere des Instituts für Maschinelle Sprachverarbeitung. 6(4), 87-116.
Nader, A. and Saeed, Z. (2004). Adaptive temperature control for simulated
annealing: a comparative study. Computers & Operations Research. 31(14).
2439-2451.
Nagy, A., Pesti, P., Németh, G. and Bőhm, T. (2005). Design Issues of a Corpus-Based Speech Synthesizer. Hungarian Journal on Communications. 6, 18-24.
Nik, S. K., Farid, M. O. and Hashim, M. (1989). Tatabahasa Dewan: Perkataan.
Kuala Lumpur: Dewan Bahasa Dan Pustaka.
Nishizawa, N. and Kawai, H. (2006). A Short-Latency Unit Selection Method with
Redundant Search for Concatenative Speech Synthesis. Acoustics, Speech and
Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International
Conference. 14-19 May. Toulouse, France, I-I.
Nishizawa, N. and Kawai, H. (2008). Unit database pruning based on the cost
degradation criterion for concatenative speech synthesis. Acoustics, Speech and
Signal Processing, 2008. ICASSP 2008. IEEE International Conference. 31
March. Las Vegas, Nevada, U.S.A, 3969-3972.
Onn, H. M. (1993). Binaan dan Fungsi Perkataan dalam Bahasa Melayu: Suatu
Huraian dari Sudut Tatabahasa Generatif. Kuala Lumpur: Dewan Bahasa Dan
Pustaka.
Otten, R. H. J. M. and van Ginneken, L.P.P.P. (1984). Floor plan design using
simulated annealing. In: Proceedings of the IEEE International Conference on
Computer-Aided Design. Santa Clara, 96–98.
Pantelides, C. P., and Tzan, S. R. (2000). Modified iterated simulated annealing
algorithm for structural synthesis. Advances in Engineering Software. 31(6),
391-400.
Piits, L., Mihkla, M., Nurk, T. and Kiisel, I. (2007). Designing a Speech Corpus for
Estonian Unit Selection Synthesis. Proceeding of the 16th Nordic Conference of
Computational Linguistics NODALIDA-2007. 24-26 May. Tartu.
Prudon, R., Alessandro, C. and Mareuil, P.B. (2002). Prosody synthesis by unit
selection and transplantation on diphones. Speech Synthesis. Proceedings of
2002 IEEE Workshop. 11-13 September. Santa Monica, USA, 119 – 122.
Qing, G., Bin, W. and Katae, N. (2008). Speech Database Compacted for an
Embedded Mandarin TTS System. Chinese Spoken Language Processing, 2008.
ISCSLP '08. 6th International Symposium. 16-19 December. Kunming, China,
1-4.
Rabiner, L.R. and Juang, B.H. (1993). Fundamentals of Speech Recognition. second
ed. Prentice-Hall, Englewood Cliffs, NJ.
Raminah, S. and Rahim, S. (1987). Kajian Bahasa untuk Pelatih Maktab Perguruan.
8th ed. Petaling Jaya: Penerbit Fajar Bakti Sdn. Bhd.
Robert, A. J. C., Korin, R. and Simon, K. (2007). Multisyn: Open-domain unit
selection for the Festival speech synthesis system. Speech Communication.
49(4), 317-330.
Rose, J., Klebsch, W.and Wolf, J.(1990). Temperature measurement and equilibrium
dynamics of simulated annealing placements. Computer-Aided Design of
Integrated Circuits and Systems, IEEE Transactions. March 1990. 253 – 259.
Rowden, C. (1992). Speech Processing. UK: McGraw-Hill, Inc.
Sagisaka, Y. (1994). Recent advances in Japanese speech synthesis research.
International Symposium on Speech, Image Processing and Neural Networks.
13-16 April 1994. Hong Kong, 146 – 150.
Sakai, S., Kawahara, T. and Nakamura, S. (2008). Admissible stopping in viterbi
beam search for unit selection in concatenative speech synthesis. Acoustics,
Speech and Signal Processing, 2008. ICASSP 2008. IEEE International
Conference. 31 March-4 April. Las Vegas, Nevada, U.S.A, 4613-4616.
Sarathy, K.P. and Ramakrishnan, A.G. (2008). A research bed for unit selection
based text to speech synthesis. Spoken Language Technology Workshop. 15-19
December. Goa, India, 229 – 232.
Schwarz, D. (2007). Corpus-Based Concatenative Synthesis. Signal Processing
Magazine, IEEE.24(2), 92-104.
Stylianou, Y. and Syrdal, A. K. (2001). Perceptual and objective detection of
discontinuities in concatenative speech synthesis. Proc. ICASSP. May 2001. Salt
Lake City, U.S.A., 837-840.
Tan Tian Swee (2003). The Design and Verification of Malay Text to Speech
Synthesis System. Master Thesis. Universiti Teknologi Malaysia, Skudai.
Tan, T. S. and Sheikh H. (2008a). Corpus Design for Malay Corpus-based Speech
Synthesis System. American Journal of Applied Sciences 6(4): 696-702, ISSN
1546-9239.
Tan, T. S. and Sheikh, H. (2008b). Corpus-based Malay text-to-speech synthesis
system. APCC 2008. 14th Asia-Pacific Conference on Communications, 2008.
14-16 October. 1-5.
Tan, T. S. and Sheikh, H. (2008c). Implementation of Phonetic Context Variable
Length Unit Selection Module for Malay Text to Speech. Science Publications.
Journal of Computer Science 4(7), 550-556, ISSN 1549-3636.
Tan, T. S. and Sheikh. H. (2003). Building Malay TTS Using Festival Speech
Synthesis System. Conference of The Malaysia Science and Technology.
September 2-3. Johor Bahru, Malaysia, MSTC 2002, 120.
Tan, T. S., Sheikh, H., and Hussain, A. (2003). Building Malay Diphone Database
for Malay Text to Speech Synthesis System Using Festival Speech Synthesis
System. Proc of The International Conference on Robotics, Vision, Information
and Signal Processing 2003, 22-24 January 2003: ROVISP: 634-648.
Tan Tian Swee. (2009). Corpus-based Malay Text-To-Speech Synthesis System.
Ph.D. Thesis. Universiti Teknologi Malaysia, Skudai.
Taylor, P., Black, A. and Caley, R. (1999). Festival Speech Synthesis System:
system documentation (1.4.0). Human Communication Research Centre
Technical Report. HCRC/TR, 83-202.
Toda, T. (2003). High-Quality and Flexible Speech Synthesis with Segment Selection
and Voice Conversion. Doctoral thesis. Nara Institute of Science and
Technology.
Toda, T., Kawai, H., Tsuzaki, M. and Shikano, K. (2006). An evaluation of cost
functions sensitively capturing local degradation of naturalness for segment
selection in concatenative speech synthesis. Speech Communication. 48(1), 45-56.
Triki, E., Collette, Y. and Siarry, P. (2005). A theoretical study on the behavior of
simulated annealing leading to a new cooling schedule. European Journal of
Operational Research, 166(1), 77-92.
Tsiakoulis, P., Chalamandaris, A., Karabetsos, S. and Raptis, S. (2008). A statistical
method for database reduction for embedded unit selection speech synthesis.
Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE
International Conference. 31 March-4 April. Las Vegas, Nevada, U.S.A., 4601-4604.
Turgut, D., Turgut, B., Elmasri, R. and Le, T.V. (2003). Optimizing clustering
algorithm in mobile ad hoc networks using simulated annealing. Wireless
Communications and Networking, 2003. WCNC 2003. 2003 IEEE.20-20 March.
New Orleans, Louisiana, USA, 1492-1497.
Van Laarhoven, P. J. M. and Aarts, E.H.L. (1987). Simulated Annealing: Theory and
Applications. Kluwer Academic Publishers.
Vasan, A. and Komaragiri, S. R. (2009). Comparative analysis of Simulated
Annealing, Simulated Quenching and Genetic Algorithms for optimal reservoir
operation. Applied Soft Computing. 9(1). 274-281.
Veldhuis, R. (2002).The Centroid of the Symmetrical Kullback–Leibler Distance.
IEEE Signal Processing Letters. 9(3). 16 March 2002. 96-99.
Vepa, J. and King, S. (2004). Join cost for unit selection synthesis. Text to Speech
Synthesis, S. Naranyan, A. Alwan, Eds., Prentice Hall, 2004.
Vepa, J., King, S. and Taylor, P. (2002). New objective distance measures for
spectral discontinuities in concatenative speech synthesis. Proc. IEEE 2002
Workshop on Speech Synthesis, 11-13 September 2002. Santa Monica, USA.
Wang, Y., Yan, W. and Zhang, G. (1996). Adaptive simulated annealing for the
optimal design of electromagnetic devices. Magnetics, IEEE Transactions.
32(3). 1214-1217.
Wei, H., Chan, C. F., Chiu, S. and Pun, K. P. (2006). An efficient MFCC extraction
method in speech recognition. IEEE International Symposium on Circuits and
Systems. 21-24 May 2006. Island of Kos, Greece, 4.
Wong, E. and Sridharan, S. (2001). Comparison of Linear Prediction Cepstrum
Coefficients and Mel-Frequency Cepstrum Coefficients for Language
Identification. Proc. 2001 International Symposium on Intelligent Multimedia,
Video and Speech Processing, 2-4 May, Hong Kong. 95-98.
Wouters, J. and Macon, M. W. (1998). A perceptual evaluation of distance measures
for concatenative speech synthesis. Proc. ICSLP. 1998. Sydney, Australia,
2747-2750.
Wu, C. H., Hsia, C. C., Chen, J. F. and Liu, T. H. (2004). Variable-length unit
selection using LSA-based syntactic structure cost. International Symposium on
Chinese Spoken Language Processing. 15-18 December 2004. Hong Kong, 201
-204.
Yao, X. (1991). Simulated annealing with extended neighbourhood. International
Journal of Computer Mathematics, 40. 169-189.
Yao, X. (1993). Comparison of different neighbourhood sizes in simulated
annealing. Proceedings of the Fourth Australian Conference on Neural
Networks (A CNN'9 3), 216-219.
Zhao, Y., Liu, P., Li, Y., Chen, Y. and Chu, M. (2006). Measuring Target Cost in
Unit Selection With KL-Divergence Between Context-Dependent HMMs. In
proceeding of ICASSP 2006. 14-19 May. Toulouse, France, 725-728.
APPENDIX A
Source Code of MFCC
void CFeatureMfccDlg::MfccFront(CString fileName)
{
int i;
fileName.Replace (".wav", "F.snd");
short *data = new short[512];
CFile file;
for( i=0;i<512;i++)
{
data[i]=(short)m_Wave.buf[i];
}
file.Open(fileName, CFile::modeCreate|CFile::modeWrite|CFile::typeBinary);
if(file)
{
file.Write((void*)data,sizeof(short)*512);
file.Close();
}
m_cMFCC.MFCC(12,fileName,0);
delete [] data;
}
void CFeatureMfccDlg::MfccEnd(CString fileName)
{
int i;
fileName.Replace (".wav", "E.snd");
short *data = new short[512];
CFile file;
int j=0;
for( i=m_Wave.NoOfSample-512;i<m_Wave.NoOfSample;i++)
{
data[j]=(short)m_Wave.buf[i];
j++;
}
file.Open(fileName, CFile::modeCreate|CFile::modeWrite|CFile::typeBinary);
if(file)
{
file.Write((void*)data,sizeof(short)*512);
file.Close();
}
m_cMFCC.MFCC(12,fileName,0);
delete [] data;
}
void CFeatureMfccDlg::OnButtonMfcc()
{
UpdateData(TRUE);
int i;
for(i=m_iStart;i<=m_iEnd;i++)
{
CString tempStr;
tempStr.Format(m_sPhoName+"%d.wav",i);
ReadWav(tempStr);
MfccFront(tempStr);
MfccEnd(tempStr);
}
}
void CFeatureMfccDlg::OnButtonFmfcc()
{
int i;
short *data = new short[512];
CFile file;
for( i=0;i<512;i++)
{
data[i]=(short)m_Wave.buf[i];
}
file.Open("front.snd",
CFile::modeCreate|CFile::modeWrite|CFile::typeBinary);
if(file)
{
file.Write((void*)data,sizeof(short)*512);
file.Close();
}
m_cMFCC.MFCC(12,"front.snd",0);
delete [] data;
}
void CFeatureMfccDlg::OnButtonReadwav()
{
m_Wave.Load("_e1.wav");
m_Wave.SetFormatToSamples();
WORD nChannels;
DWORD nSamplesPerSec;
WORD nBitsPerSample;
m_Wave.GetParameters(nChannels, nSamplesPerSec, nBitsPerSample);
m_Wave.Stop ();
m_Wave.ReadWav("_e1.wav");
m_Wave.Play (this, 0, m_Wave.NoOfSample-1);
}
void CFeatureMfccDlg::ReadWav(CString fileName)
{
m_Wave.Load(fileName);
m_Wave.SetFormatToSamples();
WORD nChannels;
DWORD nSamplesPerSec;
WORD nBitsPerSample;
m_Wave.GetParameters(nChannels, nSamplesPerSec, nBitsPerSample);
m_Wave.Stop ();
m_Wave.ReadWav(fileName);
}
void CFeature::MFCC(int coeffNo, CString file, int flag)
{
SpeechFile = file;
Num_Feature = coeffNo;
MODE = flag;
for(int z=0;z<NO_FILTER;z++)
{
filters[z].first=((float)filter_data[z].first/SAMPLINGFREQ)*(float)SAMPLES + 0.5;
filters[z].middle=((float)filter_data[z].middle/SAMPLINGFREQ)*(float)SAMPLES + 0.5;
filters[z].final=((float)filter_data[z].final/SAMPLINGFREQ)*(float)SAMPLES + 0.5;
if(filters[z].first==filters[z].middle)
{
printf("Error filter_data %d. The first sample is equal to the middle sample !\n",z);
puts("Use CALC_MEL to recalculate the frequencies.");
exit(1);
}
if(filters[z].final==filters[z].middle)
{
printf("Error filter_data %d. The final sample is equal to the middle sample !\n",z);
puts("Use CALC_MEL to recalculate the frequencies.");
exit(1);
}
if(filters[z].final==filters[z].first)
{
printf("Error filter_data %d. The final sample is equal to the initial sample !\n",z);
puts("Use CALC_MEL to recalculate the frequencies.");
exit(1);
}
}
for(z=0;z<NO_FILTER;z++)
{
weights[z]=new float [SAMPLINGFREQ/2.0+1];
weights[z][filters[z].first]=0.0;
weights[z][filters[z].middle]=1.0;
weights[z][filters[z].final]=0.0;
m=1.0/(float)(filters[z].middle-filters[z].first);
c=-m*filters[z].first;
for(int w=filters[z].first+1;w<filters[z].middle;w++)
weights[z][w]=m*w+c;
m=-1.0/(float)(filters[z].final-filters[z].middle);
c=-m*filters[z].final;
for(w=filters[z].middle+1;w<filters[z].final;w++)
weights[z][w]=m*w+c;
}
PreProcessing();
SampleFeature=new float[Frame];
SampleBlock = new double[Frame];
SampleWindow = new double[Frame];
cmel=new double[Num_Feature];
int i,j,k;
for(i=0;i<TotalFrame;i++)
{
for(j=0;j<=Frame-1;j++)
{
FrameBlocking(i,j);
HammingWindow(j);
SampleFeature[j]=(float)SampleWindow[j];
}
rsfft(SampleFeature, (int)(log(512.0)/log(2.0)+0.5)); // log2(512) = 9
for(k=0;k<Frame/2;k++)
{
mag[k] = (SampleFeature[k]*SampleFeature[k])+
(SampleFeature[Frame-k-1]*SampleFeature[Frame-k-1]);
//fout<<"mag["<<k<<"] = "<<mag[k]<<endl;
}
for(int p=0;p<NO_FILTER;p++)
{
xk[p]=0;
avgE=0;
for(int q=filters[p].first;q<=filters[p].final;q++)
{
xk[p] += mag[q]*weights[p][q];
avgE++;
}
xk[p] /= (float)avgE;
if(xk[p]==0)
xk[p]= 0.1;
xk[p]=log10(xk[p]);
}
for(int r=0;r<Num_Feature;r++)
{
cmel[r]=0;
for(int s=0;s<NO_FILTER;s++)
cmel[r]+=xk[s]*cos((float)(r+1)*(s+0.5)*(3.1428571/NO_FILTER)); // DCT: cos((r+1)(s+0.5)*pi/NO_FILTER)
COEFFBuf[i][r]=cmel[r]; //copy cmel to COEFFBuf
}
if(MODE==3||MODE==4||MODE==5)
{
Energy(i);
}
}
if(SpeechFile.Find(".snd")>0)
SpeechFile.Replace(".snd",".mfc");
else if(SpeechFile.Find(".txt")>0)
SpeechFile.Replace(".txt",".mfc");
else if(SpeechFile.Find(".wav")>0)
SpeechFile.Replace(".wav",".mfc");
if(MODE==0)
{
WriteFeature();
}
else if(MODE==1)
{
Delta();
WriteFeature();
}
else if(MODE==2)
{
Delta();
DeltaDelta();
WriteFeature();
}
else if(MODE==3)
{
for(int i=0;i<TotalFrame;i++)
COEFFBuf[i][Num_Feature]=EnergyBuf[i];
WriteFeature();
delete []EnergyBuf;
}
else if(MODE==4)
{
Delta();
DeltaEnergy();
for(int i=0;i<TotalFrame;i++)
{
COEFFBuf[i][2*Num_Feature]=EnergyBuf[i];
COEFFBuf[i][2*Num_Feature+1]=EnergyBuf[TotalFrame+i];
}
WriteFeature();
delete []EnergyBuf;
}
else if(MODE==5)
{
Delta();
DeltaDelta();
DeltaEnergy();
DeltaDeltaEnergy();
for (int i=0;i<TotalFrame;i++)
{
COEFFBuf[i][3*Num_Feature]=EnergyBuf[i];
COEFFBuf[i][3*Num_Feature+1]=EnergyBuf[TotalFrame+i];
COEFFBuf[i][3*Num_Feature+2]=EnergyBuf[2*TotalFrame+i];
}
WriteFeature();
delete []EnergyBuf;
}
delete []SampleFeature;
delete []cmel;
delete []HamWindow;
delete []Sample;
delete []SamplePreemphasis;
delete []SampleBlock;
delete []SampleWindow;
for (int y=0;y<TotalFrame;y++)
delete [] COEFFBuf[y];
delete [] COEFFBuf;
for (y=0;y<NO_FILTER;y++)
delete [] weights[y];
}
APPENDIX B
Source Code of Simulated Annealing (Move 1)
typedef struct
{
float mfccF[12];
float mfccE[12];
}UnitMFCC;
typedef struct
{
UnitMFCC iUnitMFCC[150];
int iTotalUnit;
}Unit;
typedef struct
{
Unit iunitSel[20];
Unit iStage[20];
int stage,nextStage;
int iTUnitSel;
}UnitSel;
typedef struct
{
float UnitCost[20];
float fTotalCost;
CString sequence[200];
}UnitCost;
class CSimanDlg : public CDialog
{
// Construction
public:
int iStage;
Unit iunitSel[50];
CString Inpho[20];
int BU[20];
int CU[20];
int RU[20];
int iTotalstage;
CString sCurProjPath;
CSimanDlg(CWnd* pParent = NULL);   // standard constructor
// Dialog Data
//{{AFX_DATA(CSimanDlg)
enum { IDD = IDD_SIMAN_DIALOG };
CString m_dDistant;
CString m_dDistance;
CString m_dshortestDis;
CString m_sPho;
int m_iNumber;
//}}AFX_DATA
// ClassWizard generated virtual function overrides
//{{AFX_VIRTUAL(CSimanDlg)
protected:
virtual void DoDataExchange(CDataExchange* pDX);   // DDX/DDV support
//}}AFX_VIRTUAL
// Implementation
protected:
HICON m_hIcon;
// Generated message map functions
//{{AFX_MSG(CSimanDlg)
virtual BOOL OnInitDialog();
afx_msg void OnSysCommand(UINT nID, LPARAM lParam);
afx_msg void OnPaint();
afx_msg HCURSOR OnQueryDragIcon();
afx_msg void OnButtonCompute();
afx_msg void OnButtonNext();
//}}AFX_MSG
DECLARE_MESSAGE_MAP()
};
CSimanDlg::CSimanDlg(CWnd* pParent /*=NULL*/)
: CDialog(CSimanDlg::IDD, pParent)
{
//{{AFX_DATA_INIT(CSimanDlg)
m_dDistant = _T("");
m_dDistance = _T("");
m_dshortestDis = _T("");
m_sPho = _T("");
m_iNumber = 0;
//}}AFX_DATA_INIT
// Note that LoadIcon does not require a subsequent DestroyIcon in Win32
m_hIcon = AfxGetApp()->LoadIcon(IDR_MAINFRAME);
iTotalstage=4;
iStage=0;
}
void CSimanDlg::OnButtonCompute()
{
UpdateData(TRUE);
int i,j;
double UnitCost[200];
double d=0;
double f;
double temperature=1000;
double curSol[200],TotalCost;
double delta,prob;
double g,u,shortest;
int join,stage,iNonImproveIte=0,largestjoinPosition;
CString sequence[200],Bestsequence[200],selectedUnit[200];
TotalCost=0;
double LocalCost[20],greatestJoin=0,LocalCostNe[20],greatestJoinNe=0;
int imaxIteration=200, iMaxNoOfJoin=2,iMaxNoOfStage=4,cc,ranNo;
ofstream ofp,ofpseq;
ofp.open("result.dat",ios::out);
ofpseq.open("bestsequence.dat",ios::out);
srand(time(0));//ori
for(j=0;j<imaxIteration;j++)//number of iterations.
{
cc=iMaxNoOfJoin+1;
ranNo=rand()%cc;
join=ranNo;
if(j==0)//initial solution
{
for(join=0;join<=iMaxNoOfJoin;join++)
{
d=0;
BU[join]=1;//sample number ..
for(i=0;i<12;i++)
{
d+=(pow(iunitSel[join].iUnitMFCC[BU[join]].mfccE[i]
-iunitSel[join+1].iUnitMFCC[BU[join]].mfccF[i],2));
}
LocalCost[join]=sqrt(d);
TotalCost+=sqrt(d);
if(LocalCost[join]>greatestJoin)
{
greatestJoin=LocalCost[join];
TRACE("greatestJoin:%lf\n",greatestJoin);
}
CString str,s,st;
str.Format("%d",join);
s.Format("%d",BU[join]);
st.Format("%d",join+1);
selectedUnit[join]="iunitSel["+str+"].iUnitMFCC["+s+"]";
if(join==iMaxNoOfJoin)
{
selectedUnit[join+1]="iunitSel["+st+"].iUnitMFCC["+s+"]";
}
}
UnitCost[j]=TotalCost;
curSol[0]=UnitCost[j];
shortest=curSol[0];
CString tempStr;
tempStr.Format("%lf",curSol[0]);
m_dDistant=tempStr;
for(stage=0;stage<iMaxNoOfStage;stage++)
{
sequence[0]=selectedUnit[stage];
}
}
if(j>0 && temperature <=1000 && temperature>0.1
&&iNonImproveIte<100)
{
int c = rand()%iunitSel[ranNo+1].iTotalUnit + 1; // assumed: random replacement candidate index (c was uninitialized)
f=0;
TotalCost=0;
for(join=0;join<=iMaxNoOfJoin;join++)
{
f=0;
if(j==1)
{
CU[join]=BU[join];
}
if(j>1)
{
LocalCost[join]=LocalCostNe[join];
}
if(join==ranNo)
{
RU[join]=c;
for(i=0;i<12;i++)
{
f+=(pow(iunitSel[join].iUnitMFCC[CU[join]].mfccE[i]
-iunitSel[join+1].iUnitMFCC[RU[join]].mfccF[i],2));
}
CString str,s,st,string;
str.Format("%d",join);
s.Format("%d",RU[join]);
st.Format("%d",BU[join]);
string.Format("%d",join+1);
selectedUnit[join+1]="iunitSel["+string+"].iUnitMFCC["+s+"]";
if(join==0)
{
selectedUnit[join]="iunitSel["+str+"].iUnitMFCC["+st+"]";
}
largestjoinPosition=join;
RU[largestjoinPosition]=c;
}
if(largestjoinPosition+1==join )
{
CU[join]=RU[join];
for(i=0;i<12;i++)
{
f+=(pow(iunitSel[join].iUnitMFCC[RU[largestjoinPosition]].mfccE[i]
-iunitSel[join+1].iUnitMFCC[CU[join]].mfccF[i],2));
}
CU[join+1]=RU[join];
}
if(join!=ranNo && largestjoinPosition+1!=join)
{
for(i=0;i<12;i++)
{
f+=(pow(iunitSel[join].iUnitMFCC[CU[join]].mfccE[i]
-iunitSel[join+1].iUnitMFCC[RU[join]].mfccF[i],2));
}
CString str,s,st,string;
str.Format("%d",join);
s.Format("%d",RU[join]);
st.Format("%d",BU[join]);
string.Format("%d",join+1);
selectedUnit[join+1]="iunitSel["+string+"].iUnitMFCC["+s+"]";
if(join==0)
{
selectedUnit[join]="iunitSel["+str+"].iUnitMFCC["+st+"]";
}
CU[join+1]=RU[join];
}
LocalCostNe[join]=sqrt(f);
TotalCost+=sqrt(f);
if(join==iMaxNoOfJoin)
{
for(stage=0;stage<iMaxNoOfStage;stage++)
{
sequence[j]=selectedUnit[stage];
}
}
}
UnitCost[j]=TotalCost;
ofp<<j<<" "<<UnitCost[j]<<endl;
if (j>=1 )
{
if(UnitCost[j]<curSol[j-1])
{
curSol[j]=UnitCost[j];
if(curSol[j]<shortest)
{
shortest=curSol[j];
iNonImproveIte=0;
for(stage=0;stage<iMaxNoOfStage;stage++)
{
Bestsequence[j]=selectedUnit[stage];
ofpseq<<j<<" "<<Bestsequence[j]<<endl;
}
}
else
{
iNonImproveIte++;
}
CString tempStr;
tempStr.Format("%lf",curSol[j]);
m_dDistance=tempStr;
temperature*=0.95;
}
else
{
iNonImproveIte++;
g=1+rand()%100;
u=1/g;
delta=UnitCost[j]-curSol[j-1];
prob=exp(-fabs((delta)/temperature));
if (prob>u)
{
curSol[j]=UnitCost[j];
CString tempStr;
tempStr.Format("%lf",curSol[j]);
m_dDistance=tempStr;
}
else
{
curSol[j]=curSol[j-1];
}
temperature*=0.95;
}
}
CString sh;
sh.Format("%lf",shortest);
m_dshortestDis=sh;
}
}
ofp.close();
ofpseq.close();
UpdateData(FALSE);
}
void CSimanDlg::OnButtonNext()
{
UpdateData (TRUE);
int i,len2;
int a,b;
char input[10];
CString wavName,mfccF,mfccE;
CString tempStr;
tempStr.Format("%d",m_iNumber);
ifstream ifp3;
ifp3.open(tempStr+".dat",ios::in);
ifp3>>len2;
ifstream ifp;
ifstream ifp1;
Inpho[iStage]=m_sPho;
iunitSel[m_iNumber-1].iTotalUnit=len2;
for(i=0;i<len2;i++)
{
ifp3>>input;
wavName=input;
ifp3>>input;
mfccF=input;
ifp3>>input;
mfccE=input;
ifp.open(sCurProjPath+"\\"+Inpho[m_iNumber-1]+"\\"+mfccE,ios::in);
for (a=0;a<12;a++)
{
ifp>>iunitSel[m_iNumber-1].iUnitMFCC[i+1].mfccE[a];
}
ifp.close();
ifp1.open(sCurProjPath+"\\"+Inpho[m_iNumber-1]+"\\"+mfccF,ios::in);
for (b=0;b<12;b++)
{
ifp1>>iunitSel[m_iNumber-1].iUnitMFCC[i+1].mfccF[b];
}
ifp1.close();
}
ifp3.close();
iStage++;
}
APPENDIX C
Source Code of Simulated Annealing (Move 2)
typedef struct
{
float mfccF[12];
float mfccE[12];
}UnitMFCC;
typedef struct
{
UnitMFCC iUnitMFCC[150];
int iTotalUnit;
}Unit;
typedef struct
{
Unit iunitSel[20];
Unit iStage[20];
int stage,nextStage;
int iTUnitSel;
}UnitSel;
typedef struct
{
float UnitCost[20];
float fTotalCost;
CString sequence[200];
}UnitCost;
class CSimanDlg : public CDialog
{
// Construction
public:
int iStage;
Unit iunitSel[50];
CString Inpho[20];
int BU[20];
int CU[20];
int RU[20];
int iTotalstage;
CString sCurProjPath;
CSimanDlg(CWnd* pParent = NULL);
// standard constructor
// Dialog Data
//{{AFX_DATA(CSimanDlg)
enum { IDD = IDD_SIMAN_DIALOG };
CString m_dDistant;
CString m_dDistance;
CString m_dshortestDis;
CString m_sPho;
int m_iNumber;
//}}AFX_DATA
// ClassWizard generated virtual function overrides
//{{AFX_VIRTUAL(CSimanDlg)
protected:
virtual void DoDataExchange(CDataExchange* pDX);
// DDX/DDV support
//}}AFX_VIRTUAL
// Implementation
protected:
HICON m_hIcon;
// Generated message map functions
//{{AFX_MSG(CSimanDlg)
virtual BOOL OnInitDialog();
afx_msg void OnSysCommand(UINT nID, LPARAM lParam);
afx_msg void OnPaint();
afx_msg HCURSOR OnQueryDragIcon();
afx_msg void OnButtonCompute();
afx_msg void OnButtonNext();
//}}AFX_MSG
DECLARE_MESSAGE_MAP()
};
CSimanDlg::CSimanDlg(CWnd* pParent /*=NULL*/)
: CDialog(CSimanDlg::IDD, pParent)
{
//{{AFX_DATA_INIT(CSimanDlg)
m_dDistant = _T("");
m_dDistance = _T("");
m_dshortestDis = _T("");
m_sPho = _T("");
m_iNumber = 0;
//}}AFX_DATA_INIT
// Note that LoadIcon does not require a subsequent DestroyIcon in Win32
m_hIcon = AfxGetApp()->LoadIcon(IDR_MAINFRAME);
iTotalstage=4;
iStage=0;
}
void CSimanDlg::OnButtonCompute()
{
UpdateData(TRUE);
int i,j;
double UnitCost[200];
double d=0;
double f;
double temperature=1000;
double curSol[200],TotalCost;
double delta,prob;
double g,u,shortest;
int join,stage,iNonImproveIte=0,largestjoinPosition;
CString sequence[200],Bestsequence[200],selectedUnit[200];
TotalCost=0;
double LocalCost[20],greatestJoin=0,LocalCostNe[20],greatestJoinNe=0,TMP[20],tmp;
int imaxIteration=200, iMaxNoOfJoin=2,iMaxNoOfStage=4,cc,ranNo;
ofstream ofp,ofpseq;
ofp.open("result.dat",ios::out);
ofpseq.open("bestsequence.dat",ios::out);
srand(time(0));//seed the random number generator
for(j=0;j<imaxIteration;j++)//number of iterations.
{
cc=iMaxNoOfJoin+1;
ranNo=rand()%cc;
join=ranNo;
if(j==0)//initial solution
{
for(join=0;join<=iMaxNoOfJoin;join++)
{
d=0;
BU[join]=1;//sample number ..
for(i=0;i<12;i++)
{
d+=(pow(iunitSel[join].iUnitMFCC[BU[join]].mfccE[i]
-iunitSel[join+1].iUnitMFCC[BU[join]].mfccF[i],2));
}
LocalCost[join]=sqrt(d);
TotalCost+=sqrt(d);
if(LocalCost[join]>greatestJoin)
{
greatestJoin=LocalCost[join];
TRACE("greatestJoin:%lf\n",greatestJoin);
}
CString str,s,st;
str.Format("%d",join);
s.Format("%d",BU[join]);
st.Format("%d",join+1);
selectedUnit[join]="iunitSel["+str+"].iUnitMFCC["+s+"]";
if(join==iMaxNoOfJoin)
{
selectedUnit[join+1]="iunitSel["+st+"].iUnitMFCC["+s+"]";
}
}
UnitCost[j]=TotalCost;
curSol[0]=UnitCost[j];
shortest=curSol[0];
CString tempStr;
tempStr.Format("%lf",curSol[0]);
m_dDistant=tempStr;
for(stage=0;stage<iMaxNoOfStage;stage++)
{
sequence[0]=selectedUnit[stage];
}
}
if(j>0 && temperature <=1000 && temperature>0.1
&&iNonImproveIte<50)
{
int c;
f=0;
TotalCost=0;
greatestJoinNe=0;
for(join=0;join<=iMaxNoOfJoin;join++)
{
f=0;
if(j==1)
{
CU[join]=BU[join];
}
if(j>1)
{
LocalCost[join]=LocalCostNe[join];
greatestJoin=TMP[iMaxNoOfJoin];
}
if(greatestJoin==LocalCost[join])
{
c=1+rand()%iunitSel[join+1].iTotalUnit;
RU[join]=c;
for(i=0;i<12;i++)
{
f+=(pow(iunitSel[join].iUnitMFCC[CU[join]].mfccE[i]
-iunitSel[join+1].iUnitMFCC[RU[join]].mfccF[i],2));
}
CString str,s,st,string;
str.Format("%d",join);
s.Format("%d",RU[join]);
st.Format("%d",BU[join]);
string.Format("%d",join+1);
selectedUnit[join+1]="iunitSel["+string+"].iUnitMFCC["+s+"]";
if(join==0)
{
selectedUnit[join]="iunitSel["+str+"].iUnitMFCC["+st+"]";
}
largestjoinPosition=join;
RU[largestjoinPosition]=c;
}
if(largestjoinPosition+1==join )
{
CU[join]=RU[join];
for(i=0;i<12;i++)
{
f+=(pow(iunitSel[join].iUnitMFCC[RU[largestjoinPosition]].mfccE[i]
-iunitSel[join+1].iUnitMFCC[CU[join]].mfccF[i],2));
}
}
if(greatestJoin!=LocalCost[join] && largestjoinPosition+1!=join )
{
for(i=0;i<12;i++)
{
f+=(pow(iunitSel[join].iUnitMFCC[CU[join]].mfccE[i]
-iunitSel[join+1].iUnitMFCC[RU[join]].mfccF[i],2));
}
CString str,s,st,string;
str.Format("%d",join);
s.Format("%d",RU[join]);
st.Format("%d",BU[join]);
string.Format("%d",join+1);
selectedUnit[join+1]="iunitSel["+string+"].iUnitMFCC["+s+"]";
if(join==0)
{
selectedUnit[join]="iunitSel["+str+"].iUnitMFCC["+st+"]";
}
CU[join+1]=RU[join];
}
LocalCostNe[join]=sqrt(f);
TotalCost+=sqrt(f);
if(join==iMaxNoOfJoin)
{
for(stage=0;stage<iMaxNoOfStage;stage++)
{
sequence[j]=selectedUnit[stage];
}
}
}
for(join=0;join<=iMaxNoOfJoin;join++)
TMP[join]=LocalCostNe[join];
//bubble sort so that TMP[iMaxNoOfJoin] holds the largest local cost
for(int z=0;z<=iMaxNoOfJoin;z++)
for(join=0;join<iMaxNoOfJoin;join++)
if(TMP[join]>TMP[join+1])
{
tmp=TMP[join];
TMP[join]=TMP[join+1];
TMP[join+1]=tmp;
}
UnitCost[j]=TotalCost;
if (j>=1 )
{
if(UnitCost[j]<curSol[j-1])
{
curSol[j]=UnitCost[j];
if(curSol[j]<shortest)
{
shortest=curSol[j];
iNonImproveIte=0;
for(stage=0;stage<iMaxNoOfStage;stage++)
{
Bestsequence[j]=selectedUnit[stage];
}
}
else
{
iNonImproveIte++;
}
CString tempStr;
tempStr.Format("%lf",curSol[j]);
m_dDistance=tempStr;
temperature*=0.9;
}
else
{
iNonImproveIte++;
g=1+rand()%100;
u=1/g;
delta=UnitCost[j]-curSol[j-1];
prob=exp(-fabs((delta)/temperature));
if (prob>u)
{
curSol[j]=UnitCost[j];
CString tempStr;
tempStr.Format("%lf",curSol[j]);
m_dDistance=tempStr;
}
else
{
curSol[j]=curSol[j-1];
}
temperature*=0.9;
}
}
CString sh;
sh.Format("%lf",shortest);
m_dshortestDis=sh;
}
}
ofp.close();
ofpseq.close();
UpdateData(FALSE);
}
void CSimanDlg::OnButtonNext()
{
UpdateData (TRUE);
int i,len2;
int a,b;
char input[10];
CString wavName,mfccF,mfccE;
CString tempStr;
tempStr.Format("%d",m_iNumber);
ifstream ifp3;
ifp3.open(tempStr+".dat",ios::in);
ifp3>>len2;
ifstream ifp;
ifstream ifp1;
Inpho[iStage]=m_sPho;
iunitSel[m_iNumber-1].iTotalUnit=len2;
for(i=0;i<len2;i++)
{
ifp3>>input;
wavName=input;
ifp3>>input;
mfccF=input;
ifp3>>input;
mfccE=input;
ifp.open(sCurProjPath+"\\"+Inpho[m_iNumber-1]+"\\"+mfccE,ios::in);
for (a=0;a<12;a++)
{
ifp>>iunitSel[m_iNumber-1].iUnitMFCC[i+1].mfccE[a];
}
ifp.close();
ifp1.open(sCurProjPath+"\\"+Inpho[m_iNumber-1]+"\\"+mfccF,ios::in);
for (b=0;b<12;b++)
{
ifp1>>iunitSel[m_iNumber-1].iUnitMFCC[i+1].mfccF[b];
}
ifp1.close();
}
ifp3.close();
iStage++;
}
APPENDIX D
Source Code of Simulated Annealing (Move 3)
typedef struct
{
float mfccF[12];
float mfccE[12];
}UnitMFCC;
typedef struct
{
UnitMFCC iUnitMFCC[150];
int iTotalUnit;
}Unit;
typedef struct
{
Unit iunitSel[20];
Unit iStage[20];
int stage,nextStage;
int iTUnitSel;
}UnitSel;
typedef struct
{
float UnitCost[20];
float fTotalCost;
CString sequence[200];
}UnitCost;
class CSimanDlg : public CDialog
{
// Construction
public:
int iStage;
Unit iunitSel[50];
CString Inpho[20];
int BU[20];
int CU[20];
int RU[20];
int iTotalstage;
CString sCurProjPath;
CSimanDlg(CWnd* pParent = NULL);
// Dialog Data
//{{AFX_DATA(CSimanDlg)
enum { IDD = IDD_SIMAN_DIALOG };
CString m_dDistant;
CString m_dDistance;
CString m_dshortestDis;
CString m_sPho;
int m_iNumber;
//}}AFX_DATA
// ClassWizard generated virtual function overrides
//{{AFX_VIRTUAL(CSimanDlg)
protected:
virtual void DoDataExchange(CDataExchange* pDX);
// DDX/DDV support
//}}AFX_VIRTUAL
// Implementation
protected:
HICON m_hIcon;
// Generated message map functions
//{{AFX_MSG(CSimanDlg)
virtual BOOL OnInitDialog();
afx_msg void OnSysCommand(UINT nID, LPARAM lParam);
afx_msg void OnPaint();
afx_msg HCURSOR OnQueryDragIcon();
afx_msg void OnButtonCompute();
afx_msg void OnButtonNext();
//}}AFX_MSG
DECLARE_MESSAGE_MAP()
};
CSimanDlg::CSimanDlg(CWnd* pParent /*=NULL*/)
: CDialog(CSimanDlg::IDD, pParent)
{
//{{AFX_DATA_INIT(CSimanDlg)
m_dDistant = _T("");
m_dDistance = _T("");
m_dshortestDis = _T("");
m_sPho = _T("");
m_iNumber = 0;
//}}AFX_DATA_INIT
// Note that LoadIcon does not require a subsequent DestroyIcon in Win32
m_hIcon = AfxGetApp()->LoadIcon(IDR_MAINFRAME);
iTotalstage=4;
iStage=0;
}
void CSimanDlg::OnButtonCompute()
{
UpdateData(TRUE);
int i,j;
double UnitCost[200];
double d=0;
double f;
double temperature=1000;
double curSol[200],TotalCost;
double delta,prob;
double g,u,shortest;
int join,stage,iNonImproveIte=0,largestjoinPosition;
CString sequence[200],Bestsequence[200],selectedUnit[200];
TotalCost=0;
double LocalCost[20],greatestJoin=0,LocalCostNe[20],greatestJoinNe=0,TMP[20],tmp;
int imaxIteration=200, iMaxNoOfJoin=2,iMaxNoOfStage=4,cc,ranNo;
ofstream ofp,ofpseq;
ofp.open("result.dat",ios::out);
ofpseq.open("bestsequence.dat",ios::out);
srand(time(0));//seed the random number generator
for(j=0;j<imaxIteration;j++)//number of iterations.
{
if(j==0)//initial solution
{
for(join=0;join<=iMaxNoOfJoin;join++)
{
d=0;
BU[join]=1;//sample number ..
for(i=0;i<12;i++)
{
d+=(pow(iunitSel[join].iUnitMFCC[BU[join]].mfccE[i]
-iunitSel[join+1].iUnitMFCC[BU[join]].mfccF[i],2));
}
LocalCost[join]=sqrt(d);
TotalCost+=sqrt(d);
if(LocalCost[join]>greatestJoin)
{
greatestJoin=LocalCost[join];
}
CString str,s,st;
str.Format("%d",join);
s.Format("%d",BU[join]);
st.Format("%d",join+1);
selectedUnit[join]="iunitSel["+str+"].iUnitMFCC["+s+"]";
if(join==iMaxNoOfJoin)
{
selectedUnit[join+1]="iunitSel["+st+"].iUnitMFCC["+s+"]";
}
}
UnitCost[j]=TotalCost;
curSol[0]=UnitCost[j];
shortest=curSol[0];
CString tempStr;
tempStr.Format("%lf",curSol[0]);
m_dDistant=tempStr;
for(stage=0;stage<iMaxNoOfStage;stage++)
{
sequence[0]=selectedUnit[stage];
}
}
if(j>0 && temperature <=1000 && temperature>0.1
&&iNonImproveIte<50)
{
int c;//random integer number between 1-13
f=0;
TotalCost=0;
greatestJoinNe=0;
for(join=iMaxNoOfJoin;join>=0;join--)
{
f=0;
if(j==1)
{
CU[join]=BU[join];
}
if(j>1)
{
LocalCost[join]=LocalCostNe[join];
greatestJoin=TMP[iMaxNoOfJoin];
}
if(greatestJoin==LocalCost[join])
{
c=1+rand()%iunitSel[join+1].iTotalUnit;
RU[join]=c;
for(i=0;i<12;i++)
{
f+=(pow(iunitSel[join+1].iUnitMFCC[CU[join]].mfccF[i]
-iunitSel[join].iUnitMFCC[RU[join]].mfccE[i],2));
}
CString str,s,st,string;
str.Format("%d",join+1);
s.Format("%d",RU[join]);
st.Format("%d",BU[join]);
string.Format("%d",join);
selectedUnit[join]="iunitSel["+string+"].iUnitMFCC["+s+"]";
if(join==iMaxNoOfJoin)
{
selectedUnit[join+1]="iunitSel["+str+"].iUnitMFCC["+st+"]";
}
largestjoinPosition=join;
RU[largestjoinPosition]=c;
}
if(largestjoinPosition-1==join )
{
CU[join]=RU[join];
for(i=0;i<12;i++)
{
f+=(pow(iunitSel[join+1].iUnitMFCC[RU[largestjoinPosition]].mfccF[i]
-iunitSel[join].iUnitMFCC[CU[join]].mfccE[i],2));
}
}
if(greatestJoin!=LocalCost[join] && largestjoinPosition-1!=join )
{
for(i=0;i<12;i++)
{
f+=(pow(iunitSel[join+1].iUnitMFCC[CU[join]].mfccF[i]
-iunitSel[join].iUnitMFCC[RU[join]].mfccE[i],2));
}
CString str,s,st,string;
str.Format("%d",join);
s.Format("%d",RU[join]);
st.Format("%d",BU[join]);
string.Format("%d",join+1);
selectedUnit[join]="iunitSel["+str+"].iUnitMFCC["+s+"]";//new
if(join==iMaxNoOfJoin)
{
selectedUnit[join+1]="iunitSel["+string+"].iUnitMFCC["+st+"]";
}
if(join>0) CU[join-1]=RU[join];//avoid writing before CU[0] when join==0
}
LocalCostNe[join]=sqrt(f);
TotalCost+=sqrt(f);
if(join==0)
{
for(stage=0;stage<iMaxNoOfStage;stage++)
{
sequence[j]=selectedUnit[stage];
}
}
}
for(join=0;join<=iMaxNoOfJoin;join++)
TMP[join]=LocalCostNe[join];
//bubble sort so that TMP[iMaxNoOfJoin] holds the largest local cost
for(int z=0;z<=iMaxNoOfJoin;z++)
for(join=0;join<iMaxNoOfJoin;join++)
if(TMP[join]>TMP[join+1])
{
tmp=TMP[join];
TMP[join]=TMP[join+1];
TMP[join+1]=tmp;
}
UnitCost[j]=TotalCost;
if (j>=1 )
{
if(UnitCost[j]<curSol[j-1])
{
curSol[j]=UnitCost[j];
if(curSol[j]<shortest)
{
shortest=curSol[j];
iNonImproveIte=0;
for(stage=0;stage<iMaxNoOfStage;stage++)
{
Bestsequence[j]=selectedUnit[stage];
}
}
else
{
iNonImproveIte++;
}
CString tempStr;
tempStr.Format("%lf",curSol[j]);
m_dDistance=tempStr;
temperature*=0.9;
}
else
{
iNonImproveIte++;
g=1+rand()%100;
u=1/g;
delta=UnitCost[j]-curSol[j-1];
prob=exp(-fabs((delta)/temperature));
if (prob>u)
{
curSol[j]=UnitCost[j];
CString tempStr;
tempStr.Format("%lf",curSol[j]);
m_dDistance=tempStr;
}
else
{
curSol[j]=curSol[j-1];
}
temperature*=0.9;
}
}
CString sh;
sh.Format("%lf",shortest);
m_dshortestDis=sh;
}
}
ofp.close();
ofpseq.close();
UpdateData(FALSE);
}
void CSimanDlg::OnButtonNext()
{
UpdateData (TRUE);
int i,len2;
int a,b;
char input[10];
CString wavName,mfccF,mfccE;
CString tempStr;
tempStr.Format("%d",m_iNumber);
ifstream ifp3;
ifp3.open(tempStr+".dat",ios::in);
ifp3>>len2;
ifstream ifp;
ifstream ifp1;
Inpho[iStage]=m_sPho;
iunitSel[m_iNumber-1].iTotalUnit=len2;
for(i=0;i<len2;i++)
{
ifp3>>input;
wavName=input;
ifp3>>input;
mfccF=input;
ifp3>>input;
mfccE=input;
ifp.open(sCurProjPath+"\\"+Inpho[m_iNumber-1]+"\\"+mfccE,ios::in);
for (a=0;a<12;a++)
{
ifp>>iunitSel[m_iNumber-1].iUnitMFCC[i+1].mfccE[a];
}
ifp.close();
ifp1.open(sCurProjPath+"\\"+Inpho[m_iNumber-1]+"\\"+mfccF,ios::in);
for (b=0;b<12;b++)
{
ifp1>>iunitSel[m_iNumber-1].iUnitMFCC[i+1].mfccF[b];
}
ifp1.close();
}
ifp3.close();
iStage++;
}
APPENDIX E
Evaluation Questionnaire
Malay Text To Speech
Please Note:
• You do not have to take part in this questionnaire.
• If you find any of these questions intrusive feel free to leave them
unanswered
• Any data collected will remain strictly confidential, and anonymity
will be preserved.
SECTION 1 - PERSONAL AND BACKGROUND DETAILS
Age: _____________
Gender: □ Female □ Male
Race: □ Malay □ Chinese □ Indian □ Other __________
Is Malay your first spoken language? □ Yes □ No
Where do you use computers?
□ Home □ Work □ School □ Other __________ □ I Don’t
What is your level of education?
□ Primary □ Secondary
□ Tertiary □ Other______
Where is your state of origin?
□ Perak
□ Perlis
□ Kedah
□ Pulau Pinang
□ Pahang
□ Kelantan
□ Terengganu □ N. Sembilan □ Selangor
□ Kuala Lumpur
□ Johor
□ Sabah
□ Sarawak
□ Labuan
□ Other __________
The next section will commence shortly.
PLEASE DO NOT TURN THE PAGE UNTIL TOLD TO DO SO
THANK YOU.
SECTION 2 - Word Level Testing (Unit Selection)
Word answer sheet.
A: There will be 10 words in this section. Listen to the sound file and
write down your answer.
Word 1
Word 2
Word 3
Word 4
Word 5
Word 6
Word 7
Word 8
Word 9
Word 10
SECTION 3 - Mean Opinion Score (MOS) Listening Test
Each of the ten words will be played twice; tick the word that you think is
better in terms of naturalness.
Word 1:
Word 1a
Word 1b
Word 2:
Word 2a
Word 2b
Word 3:
Word 3a
Word 3b
Word 4:
Word 4a
Word 4b
Word 5:
Word 5a
Word 5b
Word 6:
Word 6a
Word 6b
Word 7:
Word 7a
Word 7b
Word 8:
Word 8a
Word 8b
Word 9:
Word 9a
Word 9b
Word 10:
Word 10a
Word 10b
Thank you for your participation.
<<<<<< End of questionnaire >>>>>>