Slovko: A Phoneme-based Speech Re-sequencing System in MATLAB
Miloš Cerňak
Institute of Informatics of the Slovak Academy of Sciences, Dúbravská c. 9, 845 07 Bratislava
Milos.Cernak@SAVBA.SK
Abstract
The paper presents a Slovak speech re-sequencing system. The system is derived from the Festival TTS system, and its first version uses the phoneme as the basic synthesis unit. The design of the system is divided into two parts. Firstly, the semi-automatic orthoepic transcription, the speech labeling using HMMs, and the building of CARTs for each phoneme are done directly in the Festival system. A wrapper was created to convert the CARTs from LISP structures to a MATLAB-readable format. Secondly, the definition of the unit costs and the concatenation costs, the selection of the candidate clusters of speech segments, and finally the Viterbi search are done in MATLAB.
1. Introduction
Speech re-sequencing systems are among the most popular approaches in present TTS systems. MATLAB is a mathematical tool, and it is often used by digital speech processing groups in academia. However, only a few works have been published in which MATLAB serves as the framework for a TTS system.
The Slovko TTS system is derived from the Festival TTS system, and it is implemented in the MATLAB environment. Using two different systems in this way allows exploiting the power of Festival and the Edinburgh Speech Tools in off-line speech processing and CART creation, and the power of MATLAB in run-time speech segment selection and concatenation. The presented approach has two main advantages. Firstly, MATLAB is a well-known tool that is widely used by many speech labs; secondly, it offers many implementations of new scientific algorithms, which can easily be used in the run-time speech segment selection and concatenation.
Section 2 outlines the process of automatic labeling of the recorded speech database using a statistical approach. Section 3 briefly describes the CART-based phonetic transcription. Sections 4 and 5 give an overview of the unit selection and the unit concatenation.
2. Automatic Speech Database Labeling
Two main approaches are known to give the best results for automatic speech database labeling [1]. The first uses the DTW algorithm and is based on the time alignment of the labeled speech against its re-synthesized version. The second one is a statistical approach: it uses HMMs, either with already trained acoustic models of context-dependent phonemes or without them. I followed the second approach, using HMMs without reference acoustic models. The Sphinx2 speech recognition system and the SphinxTrain tool for acoustic model creation of Carnegie Mellon University were used (http://www.speech.cs.cmu.edu/sphinx/).
The process of speech segmentation is shown in Fig. 1. The system takes speech signal files with precise orthographic transcriptions as its input. Customization of the Festival TTS system for the Slovak language was necessary in this step.
Creating the acoustic models with the SphinxTrain tool and the consecutive alignment by HMMs is not an optimal solution. However, the concatenation of diphones instead of phones (see [2] for an example) or the use of the optimal coupling method [3] can minimize these errors. The classification and regression tree (CART) method can be used to automatically correct the errors of the time alignment as well. This approach is based on the assumption that HMMs make systematic errors. A small fraction of the annotated corpus is taken and a human expert corrects the mistakes. Based on this information a CART is trained, and this "correction" tree is then used to automatically correct the whole speech corpus [4].
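As an illustration, such a correction tree could be trained in MATLAB roughly as follows (a minimal sketch with toy data, assuming the Statistics and Machine Learning Toolbox function fitrtree; the feature set is hypothetical, not the one used in [4]):

  % Toy data: five boundaries hand-corrected by an expert
  leftPhoneId  = [1; 4; 4; 7; 2];       % phoneme left of the boundary
  rightPhoneId = [4; 7; 1; 2; 4];       % phoneme right of the boundary
  hmmTime      = [0.10; 0.31; 0.55; 0.72; 0.90];  % HMM boundary times [s]
  expertTime   = [0.12; 0.30; 0.58; 0.74; 0.91];  % corrected times [s]

  X = [leftPhoneId, rightPhoneId];      % features describing each boundary
  y = expertTime - hmmTime;             % the systematic HMM error

  tree = fitrtree(X, y);                % train the "correction" CART
  correctedTime = hmmTime + predict(tree, X);   % shift the boundaries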
Figure 1: The flowchart of the automatic speech segmentation. CI means 'context-independent' and CD means 'context-dependent'.
3. Data-driven Phonetic Transcription
The process of building the phonetic transcription automatically consists of the following stages:
- An assignment of the allowed phonemes to each Slovak letter. These assignments were created manually, in order to facilitate the automatic learning of the rules. Therefore the system is semi-automatic.
- An alignment of the transcription of the training set, where each letter has to have its respective transcription.
- A generalization of the observed assignments on the test set.
More details about these steps can be found in [6]. A comparison with a knowledge-based transcription system was made. The test set consisted of 2918 phonemes. The knowledge-based system made 8 mistakes on the whole test set, while the data-driven system failed 67 times. The probability of a correct transcription of the letters to the right phonemes was therefore 99.73% (1 - 8/2918) for the classical approach and 97.70% (1 - 67/2918) for the automatic approach.
4. Unit Selection
The Slovko TTS system uses the clustering technique described in [5]. Because the Festival TTS system implements this technique as well, the preprocessing, i.e. the creation of the CARTs, is done directly in Festival. I developed a file format converter of the CARTs, from LISP structures into a MATLAB-readable format. The features used in the Slovko TTS system are as follows:
- Vowel or consonant
- Vowel length
- Vowel height
- Vowel frontness
- Lip rounding
- Consonant type
- Place of articulation
- Consonant voicing
These features were used as questions during the building of the CARTs, and they are extracted from the input text automatically in order to find an appropriate cluster of units in the speech database.
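As an illustration, a converted CART can be represented in MATLAB as a struct array of nodes and descended with the extracted features (a minimal sketch; the node layout and field names are assumptions, not the actual output format of the converter):

  function cluster = descendCART(tree, features)
  % tree: struct array of nodes; tree(n).feature is the index of the
  % feature questioned at node n, tree(n).value the answer tested,
  % tree(n).yes / tree(n).no the child node indices, tree(n).isLeaf a
  % flag, and tree(n).units the candidate cluster stored at a leaf.
  n = 1;                                   % start at the root
  while ~tree(n).isLeaf
      if features(tree(n).feature) == tree(n).value
          n = tree(n).yes;                 % question answered 'yes'
      else
          n = tree(n).no;                  % question answered 'no'
      end
  end
  cluster = tree(n).units;                 % candidate cluster of units
  end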
Let me define the speech synthesis framework for unit concatenation. Let $S = \langle s_1, s_2, \ldots, s_n \rangle$ be a speech vector consisting of $n$ units. Each unit $s_i$, where $1 \le i \le n$, is defined as

$$s_i = \langle s_i^{name}, s_i^{sentence}, s_i^{start}, s_i^{middle}, s_i^{end} \rangle, \qquad (4.1)$$

where $s_i^{name}$ is a unique name of the unit, $s_i^{sentence}$ is the name of the speech database file the unit came from, and the last three items represent the time markers of the unit in the sentence of the speech database.
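In MATLAB, a unit of the form (4.1) maps naturally onto a struct (a sketch; the values are illustrative, and tStart/tMiddle/tEnd stand for the three time markers, since end is a reserved word):

  % One unit of the speech vector S, following definition (4.1)
  s(1).name     = 'a_0042';        % unique name of the unit
  s(1).sentence = 'utt017';        % speech database file it came from
  s(1).tStart   = 0.482;           % time markers [s] within that sentence
  s(1).tMiddle  = 0.515;
  s(1).tEnd     = 0.551;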
Speech synthesis begins with a phonetic transcription of the input text, producing the sequence of phones $P = \langle p_1, p_2, \ldots, p_m \rangle$, where $p_1$ and $p_m$ are the pause segments at the beginning and the end of the synthesized speech. For each phoneme a feature vector $v$ is then created:
for i = 2:m-1
    left  = F(p(i-1), :);    % feature vector of the preceding phoneme
    right = F(p(i+1), :);    % feature vector of the following phoneme
    v{i}  = [left, right];   % context feature vector v(p_i)
end
The letter F denotes the matrix of the features of all Slovak phonemes. The CART method finds a set of clusters of units from the speech database, which is indexed by the features. A specific cluster is selected using the feature vector $v(p_i)$ for each synthesized phoneme. In this way a lattice is formed, consisting of $m - 2$ clusters. The lattice is weighted by a target cost and a concatenation cost. The target cost $d_u(\omega_j, T)$ is defined as the distance from the center of a cluster, and the concatenation cost $d_c(\omega_i, \omega_j)$ is defined as a weighted sum of the difference of F0 between the concatenated units and the Euclidean distance of the concatenated micro-segments.
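Written out, this concatenation cost takes the form (a sketch; the weights $w_1$ and $w_2$ are free parameters not specified here):

$$d_c(\omega_i, \omega_j) = w_1 \, \bigl| F_0(\omega_i) - F_0(\omega_j) \bigr| + w_2 \, \lVert x_i - x_j \rVert,$$

where $x_i$ and $x_j$ denote the micro-segments at the boundary of the concatenated units.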
The task of the optimal sequence search is to find the sequence

$$\hat{\Omega} = \langle \omega_1, \omega_2, \ldots, \omega_i, \ldots \rangle, \quad 1 \le i \le m - 2. \qquad (4.2)$$

This might be formalized as follows:

$$\hat{\Omega} = \arg\min_{\Omega} \left[ \sum_{j=1}^{N} d_u(\omega_j, T) + \sum_{j=2}^{N} d_c(\omega_{j-1}, \omega_j) \right], \qquad (4.3)$$

where $N = m - 2$ is the number of clusters in the lattice. The Viterbi algorithm performs the effective search for the optimal path. The relation of the vectors $\Omega$ and $S$ is as follows:

$$\Omega \subseteq S. \qquad (4.4)$$
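A minimal MATLAB sketch of this Viterbi search (assuming the target costs and concatenation costs have already been computed and stored in cell arrays; all names are illustrative):

  function path = viterbiSearch(D, C)
  % D: 1xK cell array; D{k}(c) is the target cost d_u of candidate c
  %    in cluster k.
  % C: 1x(K-1) cell array; C{k}(a,b) is the concatenation cost d_c
  %    between candidate a of cluster k and candidate b of cluster k+1.
  K = numel(D);
  acc  = D{1}(:);                          % accumulated cost per candidate
  back = cell(1, K);                       % back-pointers
  for k = 2:K
      total = acc + C{k-1} + D{k}(:)';     % cost of every transition
      [best, back{k}] = min(total, [], 1); % best predecessor per candidate
      acc = best(:);
  end
  [~, last] = min(acc);                    % cheapest final candidate
  path = zeros(1, K);
  path(K) = last;
  for k = K:-1:2                           % backtrack the optimal sequence
      path(k-1) = back{k}(path(k));
  end
  end

Here path(k) is the index of the selected candidate in the k-th cluster; the corresponding units are then concatenated as described in section 5.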
5. Unit Concatenation
The proposed method works on a pitch-synchronous basis. To speed up the computations, the pitch markers should be defined first. I have done this with the pitchmark program of the Edinburgh Speech Tools. The speech signals are double low-pass filtered and then double high-pass filtered. The double filtering (feeding the waveform through the filter, then reversing the waveform and feeding it through again) is performed to reduce any phase shift between the input and the output of the filtering operation. Using variables for the minimal and maximal allowed pitch periods, the pitch markers are defined.
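The double (forward-backward) filtering described above is exactly what MATLAB's filtfilt performs. A minimal sketch of this preprocessing (assuming the Signal Processing Toolbox; the filter orders and cutoff frequencies are illustrative, not the values used by the pitchmark program):

  fs = 16000;                              % sampling frequency [Hz]
  x  = randn(1, fs);                       % stand-in for a speech signal
  [bL, aL] = butter(4, 400/(fs/2));        % low-pass, 400 Hz cutoff
  [bH, aH] = butter(4, 50/(fs/2), 'high'); % high-pass, 50 Hz cutoff
  % filtfilt feeds the signal through each filter forward and backward,
  % which cancels the phase shift of the filtering operation
  y = filtfilt(bH, aH, filtfilt(bL, aL, x));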
Figure 2: The creation of a new artificial pitch period from two overlapping periods of the concatenated segments.
To achieve minimal audible changes at the concatenation of $\omega_i$ and $\omega_j$, the concatenation point should be at the end of a pitch cycle with a minimal difference of pitch between the concatenated segments [7]. I followed this scheme in the proposed waveform interpolation method. A new artificial pitch period was created and inserted between the concatenated segments. Figure 2 shows this schematically.
During the acoustic parameterization of the acoustic inventory (the speech database) I have found the pitch markers $m_i$. For the concatenation of the units $s_i$ and $s_j$, the closest pitch marks, $m_i$ to $s_i^{end}$ and $m_j$ to $s_j^{start}$, have been found. The new borders of both units $s_i$ and $s_j$ have been defined as

$$\langle s_i^{start}, m_i \rangle \qquad (5.1)$$

and

$$\langle m_{j+1}, s_j^{end} \rangle. \qquad (5.2)$$

The next pitch period from $s_i$ has been selected as

$$m_{end} = \langle m_i, m_{i+1} \rangle, \qquad (5.3)$$

and the next pitch period from $s_j$ has been selected as

$$m_{start} = \langle m_j, m_{j+1} \rangle. \qquad (5.4)$$
Let $m_{end}$ be a pitch period consisting of $N$ samples and $m_{start}$ a pitch period consisting of $M$ samples. The length of the new, mutually adapted pitch period has been defined as $L = \max(N, M)$.

The overlap-and-add principle has been used to create the new artificial pitch period. A Hann window of length $2N$ has been applied to the $m_{end}$ pitch period, centered at the first sample of $m_{end}$. If $N < L$, $L - N$ zeros have been added at the end of the pitch period. Similarly, a Hann window of length $2M$ has been applied to the $m_{start}$ pitch period, centered at the last sample of $m_{start}$. If $M < L$, $L - M$ zeros have been inserted at the beginning of the pitch period. Finally, both pitch periods have been added, and the new pitch period has been inserted between the units $\langle s_i^{start}, m_i \rangle$ and $\langle m_{j+1}, s_j^{end} \rangle$. Figure 3 shows the synthesized speech with the raw concatenation and with the proposed pitch-synchronous windowed method.

Figure 3: Synthesized speech with a raw concatenation (top), and with the pitch-synchronous windowed method applied (bottom).
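The windowing and overlap-and-add steps above can be sketched in MATLAB as follows (a minimal sketch under the stated definitions; mEnd and mStart hold the sample values of the pitch periods (5.3) and (5.4), and hann comes from the Signal Processing Toolbox):

  function newPeriod = artificialPitchPeriod(mEnd, mStart)
  % mEnd:   pitch period (5.3) taken from unit s_i, N samples
  % mStart: pitch period (5.4) taken from unit s_j, M samples
  N = numel(mEnd);
  M = numel(mStart);
  L = max(N, M);                          % length of the adapted period

  w = hann(2*N);                          % window centered at sample 1
  fadeOut = mEnd(:) .* w(N+1:2*N);        % decaying half of the window
  fadeOut = [fadeOut; zeros(L-N, 1)];     % zero-pad at the end

  w = hann(2*M);                          % window centered at last sample
  fadeIn = mStart(:) .* w(1:M);           % rising half of the window
  fadeIn = [zeros(L-M, 1); fadeIn];       % zero-pad at the beginning

  newPeriod = fadeOut + fadeIn;           % overlap-and-add
  end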
6. Discussion and Future Work
During the evaluation process I tried to synthesize using the original CARTs, but with a diphone concatenation. For each unit $s_i$ from the speech vector $S$, a diphone is defined as

$$\langle s_{i-1}^{middle}, s_i^{middle} \rangle. \qquad (6.1)$$

The concatenation cost was computed as described in the previous section 5, and the Viterbi algorithm was used again for the search for the optimal sequence of the units. The final concatenation of the selected units was done as a concatenation of the selected diphones. However, the described approach was not as good as expected. The reason might be the fact that the acoustic representation of the phonemes, instead of diphones, was used to create the CARTs.
A further improvement might be a dynamic concatenation. The concatenation points $\omega_i^{end}$ and $\omega_{i+1}^{start}$ are then not fixed, and can be moved in a certain range, $\langle \omega_i^{middle}, \omega_i^{end} \rangle$ and $\langle \omega_{i+1}^{start}, \omega_{i+1}^{middle} \rangle$, in accordance with some objective measurements (see [3] for an example).
Finally, prosodic modeling must not be neglected. There is an ongoing discussion about what is better: a knowledge-based approach or a data-driven approach. Recent results of data-driven approaches show that they are very promising and can produce very good models without years of exhausting research. The work done on the closely related Czech language [8] is a good example of that. The Slovko TTS system works with CARTs, so other CARTs for prosody modeling could be included as well.
Slovko TTS is implemented in the MATLAB environment as a set of scripts. Generation time scales linearly: one second of synthesized speech takes approximately one second on an AMD Athlon 1.5 GHz, which is far from real-time processing (for example, the generation of 5 s of speech takes roughly 10 s: 5 s for the synthesis plus 5 s for the playback). Slovko TTS was, however, designed as an experimental tool to verify data-driven speech synthesis approaches for the Slovak language. Many technicians at most universities are familiar with such an environment, which allows a rapid transfer of the knowledge and the implementations of speech synthesis techniques to teachers and their students, who can easily either further expand the synthesizer or replace some blocks with more effective algorithms. The current version of Slovko TTS is fully intelligible and maintains the speaker's characteristics.
7. Acknowledgements
This work was supported by the Slovak Agency for Science VEGA grant No. 2/2087/22.
8. References
[1] F. Malfrère, O. Deroo, T. Dutoit, and C. Ris. Phonetic Alignment: Speech Synthesis-based vs. Viterbi-based. Speech Communication, 40:503–515, 2003.
[2] M. Beutnagel, A. Conkie, and A. Syrdal. Diphone
Synthesis Using Unit Selection. In Third International
Workshop on Speech Synthesis, Sydney, Australia, 1998.
[3] A. Conkie and S. Isard. Optimal Coupling of Diphones. In Jan P. H. Van Santen, Richard W. Sproat, Joseph P. Olive, and Julia Hirschberg, editors, Progress in Speech Synthesis, pages 293–303. Springer-Verlag New York, Inc., New York, 1997.
[4] J. Adell and A. Bonafonte. Towards Phone Segmentation for Concatenative Speech Synthesis. In 5th ISCA Speech Synthesis Workshop, pages 139–144, Pittsburgh, USA, 2004.
[5] A. W. Black and P. Taylor. Automatically Clustering Similar Units for Unit Selection in Speech Synthesis. In Proc. of the European Conference on Speech Communication and Technology, volume 2, pages 601–604, Rhodes, Greece, 1997.
[6] M. Cerňak, M. Rusko, M. Trnka, and S. Daržágín. Data-driven Versus Knowledge-based Approaches to Orthoepic Transcription in Slovak. In Proceedings of the 2nd ICETA, pages 95–98, Košice, Slovakia, 2003.
[7] J. Yi. Corpus-Based Unit Selection for Natural-Sounding Speech Synthesis. PhD thesis, MIT, 2003.
[8] R. Batušek. A Duration Model for Czech Text-to-Speech Synthesis. In International Conference Speech Prosody 2002, Aix-en-Provence, France, 2002.