ELECTRONICS AND ELECTRICAL ENGINEERING
ISSN 1392 – 1215
20XX. No. X(XX)
ELEKTRONIKA IR ELEKTROTECHNIKA
SIGNAL TECHNOLOGY
T 121
SIGNALŲ TECHNOLOGIJA
Speech Segmentation Analysis Using Synthetic Signals
Mindaugas Greibus, Laimutis Telksnys
Vilnius University Institute of Mathematics and Informatics,
Akademijos str., 4, LT-08663 Vilnius, Lithuania, phone: +37068789464,
e-mail: mindaugas.greibus@exigenservices.com, telksnys@ktl.mii.lt
Introduction
The speech segmentation problem can be defined differently depending on the application in which the segments will be used. Speech segments can be divided into levels: phone, syllable, word and phrase. An automated speech recognition (ASR) algorithm can solve segmentation tasks indirectly. Blind segmentation is used for tasks where speech information has to be extracted, e.g. speech corpora construction or initial segment extraction to reduce the amount of data for ASR. Automatic speech segmentation is attractive for different types of solutions. Cosi et al. [1] state that the advantage of automated segmentation is that the results are predictable: mistakes are made in a coherent way.
To evaluate speech segmentation precision, appropriate reference segments are needed. It is a non-trivial task to find a transcription that can serve as an ideal for comparison [2]. Manual segmentation depends on expert experience and is based on non-quantitative criteria for deciding where segments start and end. The same person can put a change point marker in a slightly different place each time he or she segments the signal.
In their experiment, Wesenick and Kipp [2] compared manual and automatic phoneme segmentation performance. The results showed that, in general, both experts and the algorithm make errors on certain phoneme transitions: nasal to nasal and vowel to lateral. Manual and automatic segmentation also showed the same tendencies in the best cases: both the algorithm and the experts performed well on transitions from voiceless plosive to nasal and from voiceless plosive to vowel.
A natural speech corpus does not always provide enough speech cases for experimentation. Elements of a natural speech corpus can also suffer from common issues:
• transcription errors,
• unpredictable noises,
• lack of possibility to control the speech model,
• lack of control over environment noise.
Several levels of speech model complexity would make it possible to identify where an algorithm fails. Synthetic signals generated by a speech model could solve some of these issues.
Malfrère et al. [3] propose using synthetic speech for phone segment alignment. The authors note that synthetic speech does not require a training corpus, yet it is possible to generate signals that are similar enough to natural speech. Other approaches, such as HMMs, require a large amount of training data and supervised training. Malfrère noticed that phone alignment works when the original and synthetic signals are of the same speaker sex. Horak [4] used a similar algorithm. The results of his experiment showed that the segmentation error for vowels was the smallest among the phoneme types, while fricatives showed the highest error rate.
Sethy and Narayanan [5] investigated which signal features are best for aligning synthetic and natural speech signals. The tested features were: Mel-frequency cepstrum coefficients and their deltas, line spectral frequencies, formants, energy and its delta, and the zero crossing rate. It was found that no single feature or feature set was the best for all phonemes. They proposed using two feature combinations, depending on which phonemes are compared.
Most papers describe the use of a speech model for phoneme alignment. Synthetic speech can also be employed as part of a speech algorithm, and it should be possible to use a speech model for speech segmentation as well.
In this paper an approach is proposed that uses synthetic speech as modeled signals to compare speech segmentation algorithms at various noise levels. The next section describes the method for generating the synthetic signals. The speech model adequateness criteria are presented afterwards. How the speech model can be used for speech algorithm comparison is defined before the section on the segmentation result evaluation method. The evaluation results of two segmentation algorithms are presented in a separate section. The conclusions summarize the results of this paper.
Speech Model for Synthesis
Natural speech is a complex, highly dynamic process. Speech generation relies on multiple mechanisms: linguistic, articulatory, acoustic, and perceptual [6].
Producing a synthetic speech signal is not a trivial task. Speech signal generation has several steps [7]: text analysis, transformation into computer pronunciation instructions using linguistic analysis, and speech waveform generation using recorded speech. The quality of synthetic speech cannot be measured in a straightforward way; the common approach is a qualitative assessment of the synthesized speech.
It is a challenging task to generate synthetic speech that is the same as natural speech. There are multiple speech engines that can be used to simplify speech signal generation: Mbrola [7], FreeTTS [8], Mary [9]. The Mbrola project supports 25 languages, which is very convenient for evaluating speech algorithms when multilingual support is important. Mbrola is also used by other researchers in their papers [4], [5], [10].
The Mbrola speech engine plays an important part in speech synthesis as it generates the waveforms. Text analysis and linguistic analysis have to be done separately. They can be performed by the transformation rules used in Letter-To-Sound engines such as FreeTTS [8] and eSpeak [11]. This step is called Natural Language Processing [12]. Such processing uses three models: phonetisation, duration, and pitch.
The phonetisation model uses lexical analysis to map graphemes to phonemes. This mapping converts a stream of orthographical symbols into the symbols of the corresponding sequence of sounds [13].
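To make the phonetisation step concrete, a minimal dictionary-based sketch in Python is given below; the word-to-SAMPA mapping is taken from the eSpeak example quoted later in this paper, and the helper name `phonetise` is illustrative, not part of any engine.

```python
# Minimal dictionary-based phonetisation (grapheme-to-phoneme) sketch.
# The word-to-SAMPA mapping below is illustrative; a real Letter-To-Sound
# engine such as eSpeak derives it from language-specific rules.
G2P = {
    "what": ["w", "Q", "t"],
    "is":   ["I", "z"],
    "the":  ["D", "@"],
    "time": ["t", "aI", "m"],
}

def phonetise(text):
    """Convert an orthographic phrase into a flat SAMPA phoneme sequence."""
    phonemes = []
    for word in text.lower().rstrip("?!.").split():
        phonemes.extend(G2P[word])  # KeyError for out-of-vocabulary words
    return phonemes

print(phonetise("What is the time?"))
# ['w', 'Q', 't', 'I', 'z', 'D', '@', 't', 'aI', 'm']
```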
The pitch model [8] determines the synthesized speech parameters for pitch, tone, stress, and amplitude. This model is important for natural-sounding speech.
In natural speech, phonemes are pronounced differently each time. The speech rate of the Lithuanian language can fluctuate from 2.6 up to 5.7 syllables per second [14]. In order to imitate the duration properties, the phoneme duration can be defined as a random variable $D \sim \mathcal{N}(\mu_{DUR}, \sigma_{DUR}^2)$. The mean phoneme duration $\mu_{DUR}$ can be set the same as in natural speech, and the phoneme duration variance $\sigma_{DUR}^2$ should be chosen so that natural speech limits are not exceeded. Synthetic speech without this randomization property would not be very useful from the research point of view.
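A minimal sketch of such duration randomization is given below in Python; the per-phoneme mean durations are the eSpeak values quoted later in this paper, while the relative sigma and the clipping limits are illustrative assumptions.

```python
# Duration randomization sketch: D ~ N(mu_DUR, sigma_DUR^2), clipped so the
# limits of natural speech are not exceeded. rel_sigma and the clip range
# are assumptions, not values from the paper.
import random

MEAN_MS = {"w": 65, "Q": 62, "t": 101, "I": 74, "z": 65,
           "D": 65, "@": 134, "aI": 109, "m": 159}

def sample_duration_ms(phoneme, rel_sigma=0.15):
    mu = MEAN_MS[phoneme]
    d = random.gauss(mu, rel_sigma * mu)
    return max(0.5 * mu, min(1.5 * mu, d))  # keep within +/-50% of the mean

print({p: round(sample_duration_ms(p)) for p in MEAN_MS})
```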
Noise resistance is another important point when evaluating speech algorithm performance. Most often, stationary noise can be generated by computer, while for non-stationary noise real-life samples should be used. Stationary Gaussian noise is the worst-case scenario, as it has a flat spectrum and filtering techniques are hard to apply. Since synthetic speech signals contain no noise, it is easy to control the noisification level SNR (1), measured in dB:
$$SNR = 10 \log_{10} \left( \frac{\sigma_S^2}{\sigma_N^2} \right), \qquad (1)$$
where $\sigma_S^2$ is the speech variance and $\sigma_N^2$ is the noise variance.
In order to noisify a signal to a given SNR, the speech signal has to be scaled according to (2):
$$\bar{y}[t] = \frac{\sigma_N \sqrt{10^{SNR/10}}}{\sigma_S} \, y[t], \qquad (2)$$
where $\bar{y}[t]$ is sample $t$ of the scaled speech signal to which the noise is then added, and $y[t]$ is sample $t$ of the original speech signal.
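A minimal Python sketch of noisification at a controlled SNR, following (1) and (2); the function name and the stand-in signal are illustrative:

```python
# Noisification at a controlled SNR, following eq. (1) and (2). The speech
# signal is scaled and unit-variance white Gaussian noise is added.
import numpy as np

def noisify(speech, snr_db, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.standard_normal(len(speech))               # sigma_N ~ 1
    scale = noise.std() * np.sqrt(10 ** (snr_db / 10)) / speech.std()
    scaled = scale * speech                                # eq. (2)
    achieved = 10 * np.log10(scaled.var() / noise.var())   # eq. (1)
    assert abs(achieved - snr_db) < 1e-6                   # sanity check
    return scaled + noise

# Stand-in signal; the five noise levels are those used later in the paper.
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noisy = {snr: noisify(speech, snr) for snr in (30, 15, 10, 5, 0)}
```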
Synthetic speech can make the investigation easier, as it is possible to control the noise and the content and to generate an unlimited number of speech samples. The speech model signal has to be sufficient to represent natural speech properties.
Adequate Speech Model
In general, whether a model is adequate can be judged by situation-dependent criteria [15]: whether it is "right", "useful", "believable" and "worth the costs".
The cost of a synthetic speech corpus generated with a speech model is low compared with a natural speech corpus, as it requires less human effort to collect enough data for investigations.
A synthetic speech corpus has to meet the major requirements of the speech situations investigated by the researcher. Extreme naturalness may not be desirable in regular research work, as such synthetic speech can require too many resources while the effect on the results may be insignificant in comparison with a more primitive speech model.
A speech model provides the ability to control the speech content and the generated sample size. Such a tool can help a researcher compare speech algorithms faster than with a specialized natural speech corpus.
How similar a synthetic speech signal is to natural speech can be defined as a set of signal attributes that determine the signal's suitability for communication [16]. In general, speech signal quality measurements [17] can be divided into two types: objective and subjective. Objective measurements cannot always unambiguously define a quality that is good enough for experimentation. Intelligibility is not easily expressed quantitatively: methods that use intelligibility require a group of experts who evaluate how realistic the sound is.
In this paper the segmentation error was used as the quantitative similarity measurement. The first step is to record natural speech signals of a selected phrase. Next, synthetic signals are generated using the same phrase. Then the segmentation algorithms are executed against both corpora and the segmentation errors are calculated. The similarity of synthetic and natural speech can be expressed as the difference between the segmentation errors produced by the segmentation algorithms: if the error values are close, the synthetic signals are similar to natural speech from the speech segmentation perspective.
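A sketch of this similarity check is given below, assuming a hypothetical `run_algorithm` that returns an error rate in percent for a corpus; the closeness threshold is an illustrative assumption, not a value from this paper.

```python
# Similarity check sketch: one segmentation algorithm is run on a natural and
# a synthetic corpus; corpora count as similar if the error rates are close.
# `run_algorithm` and the 5-point threshold are illustrative assumptions.
def corpora_similar(run_algorithm, natural_corpus, synthetic_corpus,
                    max_diff_pct=5.0):
    err_nat = run_algorithm(natural_corpus)     # error rate, percent
    err_syn = run_algorithm(synthetic_corpus)
    return abs(err_nat - err_syn) <= max_diff_pct
```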
Speech segmentation algorithms require the speech model to match the VIP (Variable, Intelligible, Pronounceable) criteria:
• Variable – the properties of a synthesized speech segment must differ each time, within the limits of natural speech.
• Intelligible – the signals must be recognizable as speech and the words must be understood.
• Pronounceable – the phonemes of the specific language should be used.
It is the natural behavior of speech that in most cases the same word is pronounced differently depending on the context; without this, the speech model would be too static in comparison with natural speech. The speech model should generate signals that are understandable by a human. In general it is not acceptable to reuse the phonemes of another language for the experiments, as each language has its own specific pronunciation.
Speech Model for Segmentation Algorithms
A speech segmentation algorithm under development has to be tested on different speech signals and environments. It is important for researchers to control the environment during their experiments: this makes it easier to identify defective places, compare with other algorithms, and optimize algorithm parameters. Generated signals can cover cases that are rare in natural language. Knowing exactly where speech segments start and end eliminates the human errors of manual segmentation. Generated as presented above, the signals are less complex than natural speech, but they are good enough to be understood by a human.
Fig. 1. Speech algorithm analysis using synthetic signals

The steps of speech model preparation with a segmentation algorithm are shown in figure 1. First of all, a synthetic speech corpus should be prepared for the experiment. Researchers should identify which cases have to be covered in their experiment: the set of phonemes and the noise. Example: compare two segmentation algorithms on the phrase "What is the time?"; 50 signals are generated for each noise level, with 5 different levels of white noise: 30 dB, 15 dB, 10 dB, 5 dB, 0 dB (250 signals in total).
Next, the text information has to be transformed into instructions for a speech engine. A simple transformation can be done in two steps: the first step maps one or more symbols of the text alphabet to the alphabet accepted by the speech engine, and the second step defines the phoneme durations.
The mapping from the regular alphabet to a computer-readable phonetic alphabet depends on the speech engine. Mbrola uses the SAMPA [18] alphabet, which defines how this transformation should be done. An example generated by eSpeak: the word what maps to the phonemes [w-Q-t], is to [I-z], the to [D-@], and time to [t-aI-m].
As mentioned earlier, each phoneme has its own duration. An example generated by eSpeak: [w]=65 ms, [Q]=62 ms, [t]=101 ms, [I]=74 ms, [z]=65 ms, [D]=65 ms, [@]=134 ms, [aI]=109 ms, [m]=159 ms. The phoneme duration value is used as the mean of the random variable.
The pitch model is important for natural speech, but for speech segmentation algorithms it was found not to be essential. A pitch definition brings a large overhead, since many speech engine parameters have to be estimated, yet the results were not affected significantly. For the experiment described in the next section, a synthetic speech corpus was created that does not require prosody information, and the default values of the speech engine were used. This can be improved in the next experiment iterations.
When the parameters for speech generation are prepared, it is possible to generate unlimited speech signals that differ from the speech segmentation algorithm's perspective. Each generated signal should have a transcription file and an audio file.
The speech segmentation algorithm should label each automatic segment. The labeling method requires a training data set; the same speech signal generation method is used here as well. The training data set should not be used for testing.
When the data set is ready, the speech segmentation algorithm processes the audio files and generates a sequence of auto-segments. Each auto-segment is matched to the reference signals and, normally, a label is assigned.
The next step is to compare the auto-segments with the original segments. Each segment is matched with its reference in the original transcription file, and the correctness of the boundaries and labels is evaluated.
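The whole generation step of the example above could be scripted as sketched below; the `synthesize` placeholder, the file layout and the parameter names are assumptions for illustration (a real setup would drive a speech engine such as eSpeak/Mbrola).

```python
# Corpus generation loop for the example above: 50 signals per noise level,
# 5 white-noise levels, 250 signals in total. `synthesize` is a hypothetical
# stand-in for the speech engine call; `noisify` is the function from eq. (2).
import json
import numpy as np

def synthesize(phrase, rng):
    """Placeholder: return (samples, reference_segments) for one rendering
    with randomized phoneme durations."""
    raise NotImplementedError  # would drive eSpeak/Mbrola here

def build_corpus(phrase="What is the time?", per_level=50,
                 snr_levels=(30, 15, 10, 5, 0), seed=0):
    rng = np.random.default_rng(seed)
    for snr in snr_levels:
        for i in range(per_level):
            samples, segments = synthesize(phrase, rng)
            noisy = noisify(samples, snr, rng)             # see eq. (2)
            np.save(f"corpus/{snr}dB_{i:03d}.npy", noisy)  # audio file
            with open(f"corpus/{snr}dB_{i:03d}.json", "w") as f:
                json.dump(segments, f)                     # transcription
```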
Speech Result Evaluation
Figure 2 describes the errors that can appear during segmentation. The original transcription (MANL) is compared with the auto-segmentation (AUTO) results. The first task is to find whether the segment boundaries were detected correctly.
A few error types may be detected (figure 2):
• EXS – noise was automatically detected as a segment.
• SFT – an automatic segment boundary was shifted.
• ITR – an additional boundary was inserted inside a segment.
• JON – a boundary between two segments was missed, joining them.

Fig. 2. Automatic speech segment comparison with reference
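One possible way to detect the SFT, ITR and JON cases from boundary times alone is sketched below; the tolerances and the greedy matching are illustrative assumptions, and the EXS case additionally requires segment labels.

```python
# Boundary comparison sketch: match auto boundaries to reference boundaries.
# Tolerances and greedy nearest matching are illustrative assumptions; the
# EXS case (noise detected as a segment) additionally needs segment labels.
def classify_boundaries(ref, auto, exact_tol=0.01, shift_tol=0.05):
    """ref, auto: sorted boundary times in seconds; returns (SFT, ITR, JON)."""
    sft = itr = 0
    unmatched = list(ref)
    for b in auto:
        nearest = min(unmatched, key=lambda r: abs(r - b), default=None)
        if nearest is not None and abs(nearest - b) <= shift_tol:
            if abs(nearest - b) > exact_tol:
                sft += 1              # boundary found, but shifted
            unmatched.remove(nearest)
        else:
            itr += 1                  # extra boundary inserted
    jon = len(unmatched)              # reference boundaries never matched
    return sft, itr, jon
```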
The auto-segment boundary detection error rate is calculated using (3):
$$ERR_B = \frac{EXS + SFT + ITR + JON}{N_F} \cdot 100, \qquad (3)$$
where $ERR_B$ is the boundary detection error rate and $N_F$ is the number of segments found during the experiment.
Segments with incorrect boundaries are not used in the pattern matching that defines a segment's label. A recognition result can be rejected if the algorithm finds several references that are too similar. A recognition error (figure 2) occurs when the algorithm reports segment class A although the segment actually belongs to class E. The recognition error rate is calculated using (4):
$$ERR_R = \frac{AE}{N_B} \cdot 100, \qquad (4)$$
where $ERR_R$ is the recognition error rate and $N_B$ is the number of correct segments from the boundary detection step.
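A direct transcription of (3) and (4) into code, with an illustrative container for the error counts:

```python
# Error rates (3) and (4) from raw error counts; the container is illustrative.
from dataclasses import dataclass

@dataclass
class Counts:
    exs: int  # noise detected as a segment
    sft: int  # shifted boundary
    itr: int  # boundary inserted inside a segment
    jon: int  # segments joined (missed boundary)
    ae: int   # segment of class E recognized as class A
    n_f: int  # number of segments found
    n_b: int  # correctly bounded segments passed to recognition

def err_b(c: Counts) -> float:          # eq. (3)
    return (c.exs + c.sft + c.itr + c.jon) / c.n_f * 100

def err_r(c: Counts) -> float:          # eq. (4)
    return c.ae / c.n_b * 100
```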
Experiment Data
Synthetic and natural speech were tested in order to see whether generated signals can be used for speech segmentation algorithm comparison. The speech corpora for the experiment were created as described above. Two segmentation algorithms were executed against the same signal set: Multi Feature Extremum and Rule Based Segmentation (MEaRBS) [19] and a threshold algorithm [20]. It will be shown that both algorithms gave similar results on synthetic and natural speech, while their segmentation results still allow saying which performed better.
A speaker-dependent natural speech corpus model was chosen for the experiment, since a speaker-independent one would require more effort to collect the data and train the recognition engine. For the natural speech, 51 signals of the same phrase were recorded by a male speaker. One signal was used to train the speech recognition engine.
All natural speech signals were segmented in a semi-automatic way. Segmentation was done in two steps: boundary detection and segment label definition. The expert segmentation results of the natural speech are used for comparison with the automated segmentation results.
Experiment Results: MEaRBS Algorithm
The MEaRBS algorithm segmented 300 files (250 synthetic and 50 natural speech signals). In figure 3 the results are shown graphically: the vertical axis represents the signal types (natural speech, the total of all synthetic signals, and synthetic speech with the noise levels 30 dB, 15 dB, 10 dB, 5 dB, 0 dB), while the horizontal axis represents the percentage of each case. It can be seen that the algorithm showed the worst results for synthetic speech with the 0 dB noise level. Comparing all percentage values, natural speech showed results similar to the 30 dB synthetic speech.

Fig. 3. MEaRBS algorithm boundary detection cases

The error rates are presented in figure 4. It can be seen that the speech segment boundaries for natural speech and for synthetic speech with 5 dB noise showed similar results, but natural speech had higher recognition errors. The recognition result for the synthetic signal with 10 dB noise is similar to that of natural speech.

Fig. 4. MEaRBS algorithm segmentation and label class recognition errors
Experiment Results: Threshold Algorithm
Another experiment was done with the static threshold segmentation algorithm [20], keeping the same experiment data and system parameters. The results are shown in figure 5. The natural speech segmentation results are similar to those of the 5 dB and 10 dB synthetic speech, but the segment recognition results are better. This happened because the DTW algorithm is very sensitive to the detected beginning and end of a segment, and this algorithm detected the boundaries in a different way than the MEaRBS algorithm.

Fig. 5. Threshold algorithm segmentation and label class recognition errors
Conclusions
A method for speech segmentation algorithm comparison using synthetic speech corpora was presented. A modified version of speech model usage for segmentation evaluation was reviewed and proposed. It was shown experimentally that the segmentation results for natural and synthetic speech are similar: the confidence intervals of the speech segmentation error for the 15 dB synthetic signals and for natural speech overlap. It can be concluded that synthetic speech can be used for speech research, for the comparison of some algorithms, and for the investigation of certain cases.
References
1. P. Cosi, D. Falavigna, and M. Omologo, A preliminary statistical evaluation of manual and automatic segmentation discrepancies// in Second European Conference on Speech Communication and Technology, 1991.
2. B. M. Wesenick and A. Kipp, Estimating The Quality Of Phonetic Transcriptions And Segmentations Of Speech Signals// in Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on, 1996, P. 129–132.
3. F. Malfrère, O. Deroo, T. Dutoit, and C. Ris, Phonetic alignment: speech synthesis-based vs. Viterbi-based// Speech Communication. – , vol. 40, P. 503–515, 2003.
4. P. Horak, Automatic Speech Segmentation Based on Alignment with a Text-to-Speech System// Improvements in Speech Synthesis. – , P. 328–338, 2002.
5. A. Sethy and S. S. Narayanan, Refined speech segmentation for concatenative speech synthesis// in Seventh International Conference on Spoken Language Processing, 2002.
6. L. Deng, Dynamic Speech Models: Theory, Algorithms, and Applications. – Morgan and Claypool, 2006.
7. T. Dutoit, High-quality text-to-speech synthesis: An overview// Journal of Electrical and Electronics Engineering Australia. – , vol. 17, P. 25–36, 1997.
8. W. Walker, P. Lamere, and P. Kwok, FreeTTS: a performance case study// Sun Microsystems, Inc., 2002.
9. M. Schroder and J. Trouvain, The German text-to-speech synthesis system MARY: A tool for research, development and teaching// International Journal of Speech Technology. – , vol. 6, P. 365–377, 2003.
10. F. Malfrère and T. Dutoit, High-quality speech synthesis for phonetic speech segmentation// in Fifth European Conference on Speech Communication and Technology, 1997.
11. H. Yang, C. Oehlke, and C. Meinel, German Speech Recognition: A Solution for the Analysis and Processing of Lecture Recordings// in 10th IEEE/ACIS International Conference on Computer and Information Science, Sanya, 2011, P. 201–206.
12. A. Chabchoub and others, An automatic Mbrola tool for high quality Arabic speech synthesis// International Journal of Computer Application. – , vol. 36, P. 1–5, 2011.
13. F. Yvon and others, Objective evaluation of grapheme to phoneme conversion for text-to-speech synthesis in French// Computer Speech and Language. – , vol. 12, P. 393–410, 1998.
14. A. Kazlauskiene and K. Velickaite, Pastabos del lietuviu kalbejimo tempo [Notes on the tempo of spoken Lithuanian]// Acta Linguistica Lituanica. – , vol. 48, P. 49–58, 2003.
15. H. B. Weil and F. L. Street, What Is An Adequate Model?// System Dynamics Conference, P. 281–321, 1983.
16. J. H. Eggen, On the quality of synthetic speech evaluation and improvements// Technische Universiteit Eindhoven, 1992.
17. A. Anskaitis, Koduoto balso kokybės tyrimas [Investigation of the quality of coded voice]// Vilnius Gediminas Technical University, 2009.
18. D. Gibbon, R. Moore, and R. Winski, Handbook of standards and resources for spoken language systems., 1997.
19. M. Greibus and L. Telksnys, Rule Based Speech Signal Segmentation// Journal of Telecommunications and Information Technology (JTIT). – , vol. 1, P. 37–44, 2011.
20. S. V. Gerven and F. Xie, A comparative study of speech detection methods// in Fifth European Conference on Speech Communication and Technology, 1997, P. 1095–1098.
Date of paper submission: 2012 03 12
M. Greibus, L. Telksnys. Speech Segmentation Analysis Using Synthetic Signals // Electronics and Electrical Engineering. – Kaunas: Technologija, 20XX. – No. X(XXX). – P. XX–XX.
A certain number of speech cases is needed when real-life problems related to speech signals are solved. A natural speech corpus may provide too few sound cases for an experiment. On the other hand, an experimenter can create a huge speech corpus of synthetic signals that reflects the investigated speech cases and is similar enough to natural speech. This paper presents the comparative usage of synthetic and natural speech corpora for speech segmentation algorithms. Adequateness criteria of synthetic and natural corpora for speech segmentation were defined. The experiment results showed that synthetic signals can be used for speech algorithm research. Ill. X, bibl. X, tabl. X (in English; abstracts in English and Lithuanian).
M. Greibus, L. Telksnys. Speech Segmentation Analysis Using Synthetic Signals [Šnekos segmentavimo analizė naudojantis sintezuotus signalus] // Elektronika ir elektrotechnika. – Kaunas: Technologija, 20XX. – Nr. X(XXX). – P. XX–XX.
When solving practical problems related to speech signals, a certain number of specific situations is needed. Quite often enough cases cannot be found in natural speech corpora. Synthetic corpora can be used instead of a natural one. This work presents the usage of a synthetic corpus for solving the speech segmentation task. Synthetic corpora that are sufficiently close to natural ones can be used in research. This work defines a similarity measure of a synthetic corpus for speech segmentation algorithms. It is shown experimentally that a synthetic corpus can be used for investigating speech segmentation tasks. Ill. X, bibl. X, tabl. X (in English; abstracts in English and Lithuanian).