
ELECTRONICS AND ELECTRICAL ENGINEERING
ISSN 1392 – 1215
20XX. No. X(XX)
ELEKTRONIKA IR ELEKTROTECHNIKA
SIGNAL TECHNOLOGY
T 121
SIGNALŲ TECHNOLOGIJA
Speech Segmentation Analysis Using Synthetic Signals
Mindaugas Greibus, Laimutis Telksnys
Vilnius University Institute of Mathematics and Informatics,
Akademijos str., 4, LT-08663 Vilnius, Lithuania, phone: +37068789464,
e-mail: mindaugas.greibus@exigenservices.com, telksnys@ktl.mii.lt
Introduction
The speech segmentation problem can be defined differently depending on the application in which the segments will be used. Speech segments can be divided into levels: phone, syllable, word and phrase. An automated speech recognition algorithm can solve segmentation tasks indirectly. Blind segmentation is used for tasks where speech information must be extracted, e.g. speech corpora construction, or initial segment extraction to reduce the amount of data for ASR. Automatic speech segmentation is attractive for different types of solutions. Cosi et al. [1] note that an advantage of automated segmentation is that the results are predictable: mistakes are made in a coherent way.
In their experiment, Wesenick and Kipp [2] compared manual and automatic phoneme segmentation performance. The results showed that, in general, both experts and the algorithm make errors on certain phoneme transitions: nasal to nasal and vowel to lateral. Manual and automatic segmentation also showed the same tendencies in the best segmentation cases: both the algorithm and the experts performed well on voiceless plosive to nasal and voiceless plosive to vowel transitions.
A natural speech corpus does not always provide enough speech cases for experimentation. Elements of a natural speech corpus can also suffer from common issues: transcription errors, unpredictable noises, no control over the speech model, and no way to adjust environment noise. Several levels of speech model complexity would allow erroneous places in an algorithm to be identified. Synthetic signals generated by a speech model could solve some of these issues.
Some authors [3], [4] propose using synthetic speech for phone segment alignment. They note that synthetic speech does not require a training corpus, yet it is possible to generate signals that are similar enough to natural speech. Other approaches, such as HMMs, require a large amount of training data and supervised training. Malfrère et al. noticed that phone alignment works when the original and synthetic signals are matched by speaker sex.
Sethy and Narayanan [5] investigated which signal features are best for aligning synthetic and natural speech signals. The tested features were: Mel-frequency cepstrum coefficients and their deltas, line spectral frequencies, formants, energy and its delta, and the zero crossing rate. They found that no single feature or feature set was best for all phonemes, and proposed using two feature combinations depending on which phonemes are compared.
Most papers describe speech models used for phoneme alignment. However, synthetic speech can also be employed as part of a speech algorithm, and it is possible to use a speech model for speech segmentation as well.
Adequate Speech Model for Synthesis
Natural speech is a complex, highly dynamic process. Speech generation relies on multiple mechanisms: linguistic, articulatory, acoustic, and perceptual [6].
Producing a synthetic speech signal is not a trivial task. Speech signal generation has several steps [7]: text analysis, transformation into computer pronunciation instructions using linguistic analysis, and speech waveform generation using recorded speech. The quality of synthetic speech cannot be measured straightforwardly; the common approach is qualitative assessment of the synthesized speech.
Generating synthetic speech identical to natural speech is a challenging task. Multiple speech engines can be used to simplify speech signal generation: Mbrola [7], FreeTTS [8] and some others. The Mbrola project supports 25 languages, which is very convenient for evaluating speech algorithms when multi-language support is important. Mbrola is also used by other researchers in their papers [3], [5].
The Mbrola speech engine plays an important part in speech synthesis, as it generates the waveforms. Text analysis and linguistic analysis must be done separately. This can be done with the transformation rules used in Letter-To-Sound engines such as FreeTTS [8] and eSpeak [9]. Such processing uses three models: phonetisation, duration, and pitch. The phonetisation model uses lexical analysis to map graphemes to phonemes; this mapping converts a stream of orthographic symbols into the corresponding sequence of sound symbols. The pitch model [8] determines the synthesized speech parameters for pitch, tone, stress, and amplitude. This model is important for natural-sounding speech.
In natural speech, phonemes are pronounced differently each time. The speech rate of the Lithuanian language can fluctuate from 2.6 up to 5.7 syllables per second [10]. In order to imitate duration properties, duration can be defined as a random variable $\sim \mathcal{N}(\mu_{DUR}, \sigma_{DUR}^2)$. Synthetic speech without this randomization property would not be very useful from a research point of view.
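As an illustration, phoneme durations can be drawn from such a normal distribution. This is a minimal sketch; the mean, deviation and floor values below are arbitrary examples, not values taken from the paper:

```python
import random

def sample_duration_ms(mean_ms, std_ms, floor_ms=20.0):
    """Draw one phoneme duration from N(mean_ms, std_ms^2), clipped to a
    floor so a negative or implausibly short duration is never produced."""
    return max(random.gauss(mean_ms, std_ms), floor_ms)

# Example: 1000 durations for a hypothetical vowel, mean 90 ms, std 15 ms.
durations = [sample_duration_ms(90.0, 15.0) for _ in range(1000)]
```

Each generated signal thus gets slightly different phoneme durations, imitating the natural variation in speech rate.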
Noise resistance is another important point when evaluating speech algorithm performance. Stationary noise can usually be generated by computer, while real-life samples should be used for non-stationary noise. As a synthetic speech signal contains no noise, it is easy to control the noisification level via the SNR (1), measured in dB:

$$SNR = 10 \log_{10}\left(\frac{\sigma_S^2}{\sigma_N^2}\right) \qquad (1)$$

where $\sigma_S^2$ is the speech variance and $\sigma_N^2$ is the noise variance.
For a signal to be noisified properly, the speech signal must be scaled (2):

$$\bar{y}[t] = \frac{\sigma_N \sqrt{10^{SNR/10}}}{\sigma_S}\, y[t] \qquad (2)$$

where $\bar{y}[t]$ is sample $t$ of the scaled speech signal and $y[t]$ is sample $t$ of the original speech signal.
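Equations (1) and (2) can be implemented directly. The sketch below (pure Python, population variance with the mean removed) scales a speech signal so that mixing it with the given noise yields the target SNR:

```python
import math

def scale_for_snr(speech, noise, snr_db):
    """Scale `speech` per eq. (2) so that mixing it with `noise` gives
    the target SNR in dB as defined by eq. (1)."""
    def variance(x):
        m = sum(x) / len(x)
        return sum((v - m) ** 2 for v in x) / len(x)
    sigma_s = math.sqrt(variance(speech))
    sigma_n = math.sqrt(variance(noise))
    gain = sigma_n * math.sqrt(10.0 ** (snr_db / 10.0)) / sigma_s
    return [gain * y for y in speech]
```

After scaling, the noisified signal is simply the sample-wise sum of the scaled speech and the noise.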
Synthetic speech could make investigation easier, as it is possible to control the noise and content and to generate unlimited speech samples. The speech model signal has to be sufficient to represent natural speech properties.
In general, whether a model is adequate can be defined by situation-dependent criteria [11]: whether it is "right", "useful" and "believable". A synthetic speech corpus has to match the major requirements of the speech situations investigated by the researcher. Extreme naturalness may require too much effort while its effect on the results may be insignificant compared with a more primitive speech model. A speech model provides the ability to control the speech content and the generated sample size. Such a tool can help a researcher compare speech algorithms faster than with a specialized natural speech corpus.
How similar a synthetic speech signal is to natural speech can be defined as a set of signal attributes that determine the signal's suitability for communication [12]. In general, speech signal quality measurements [13] can be divided into two types: objective and subjective. Objective measurements cannot always unambiguously define a quality that is good enough for experimentation. Intelligibility, in turn, cannot easily be expressed quantitatively: methods based on intelligibility require a group of experts who evaluate how realistic the sound is.
In this paper, segmentation error was used as the quantitative similarity measurement. The first step is to record natural speech signals of a selected phrase. Next, synthetic signals are generated using the same phrase. The segmentation algorithms are then executed against both corpora and the segmentation errors are calculated. The similarity of synthetic and natural speech can be expressed as the difference between the segmentation errors produced by the segmentation algorithms: if the error values are close, the synthetic signals are similar to natural speech from the speech segmentation perspective.
Speech segmentation algorithms require that the speech model match the VIP (Variable, Intelligible, Pronounceable) criteria. Variable: synthesized speech segment properties must differ each time, within the limits of natural speech. Intelligible: the signals must be recognized as speech and the words must be understood. Pronounceable: the phonemes of the specific language should be used.
It is natural speech behaviour that, in most cases, the same word is pronounced differently according to context; without this, the speech model would be too static compared with natural speech. As would be expected, the speech model should generate signals that are understandable by a human. In general, it is not acceptable to reuse another language's phonemes for experiments, as each language has its own specific pronunciation.
Speech Model for Segmentation Algorithms
A speech segmentation algorithm under development has to be tested on different speech signal cases and environments. It is important for researchers to control the environment during their experiments. This makes it easier to identify defective places, compare with other algorithms, and optimize algorithm parameters. Generated signals can cover cases that are rare in natural language. If the segment boundaries are known in advance, human segmentation errors are eliminated. Signals generated in the way presented above are less complex than natural speech, but good enough to be understood by a human.
Fig. 1. Speech algorithm analysis using synthetic signals
The steps of speech model preparation and the segmentation algorithm method are defined in Fig. 1. First of all, a synthetic speech corpus should be prepared for the experiment. Researchers should identify which cases are to be covered in their experiment: the set of phonemes and the noise. Example: compare two segmentation algorithms with the phrase "What is the time?"; 50 signals should be generated for each noise level, using five different levels of white noise: 30 dB, 15 dB, 10 dB, 5 dB and 0 dB (250 signals in total).
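The corpus plan from this example can be written down as a simple enumeration. The sketch below is illustrative only; the file-naming scheme is invented, not taken from the paper:

```python
# Five white-noise levels, 50 synthetic signals each (250 signals in total).
SNR_LEVELS_DB = [30, 15, 10, 5, 0]
SIGNALS_PER_LEVEL = 50

corpus_plan = [
    ("what_is_the_time_snr%02d_%02d.wav" % (snr, i), snr)
    for snr in SNR_LEVELS_DB
    for i in range(SIGNALS_PER_LEVEL)
]
```

Enumerating the plan up front makes it easy to verify the corpus size before any signals are actually synthesized.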
Next, the text information has to be transformed into instructions for a speech engine. A simple transformation can be done in two steps: first, map one or more symbols from the text alphabet to the speech engine's alphabet; second, define the phoneme durations.
The mapping from the regular alphabet to a computer-readable phonetic alphabet depends on the speech engine. Mbrola uses the SAMPA [14] alphabet, which defines how the transformation should be done.
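The grapheme-to-phoneme mapping step can be sketched as a lookup table. The table below is purely illustrative and is not a real SAMPA definition; real phonetisation also needs context-dependent rules:

```python
# Hypothetical single-character grapheme-to-phone table, for illustration.
GRAPHEME_TO_PHONE = {"a": "a", "i": "i", "s": "s", "t": "t", "k": "k"}

def phonetise(word, table):
    """Map each known grapheme to its phone symbol; drop unknown symbols."""
    return [table[ch] for ch in word if ch in table]

phones = phonetise("taktika", GRAPHEME_TO_PHONE)
```

Each phone symbol would then be paired with a sampled duration to form the engine's pronunciation instructions.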
Once the speech model parameters are prepared, it is possible to generate an unlimited number of speech signals that differ from the speech segmentation algorithm's perspective. Each generated signal should have a transcription file and an audio file. A speech segmentation algorithm processes the audio files and generates a sequence of auto-segments. Each auto-segment is matched against the reference signals and, normally, a label is assigned. The next step is to compare the auto-segments with the original segments: each segment is matched with its reference in the original transcription file, and the correctness of boundaries and labels is evaluated.
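Boundary matching between auto-segments and the reference transcription can be sketched with a simple tolerance test. The 20 ms tolerance below is an assumed value for illustration, not one specified in the paper:

```python
def boundary_matches(auto_time, ref_boundaries, tolerance_s=0.02):
    """True if an automatically found boundary (in seconds) lies within
    `tolerance_s` of any reference boundary."""
    return any(abs(auto_time - r) <= tolerance_s for r in ref_boundaries)

reference = [0.00, 0.35, 0.80, 1.20]   # expert-defined boundaries
auto = [0.01, 0.36, 0.95, 1.19]        # hypothetical algorithm output
matched = [t for t in auto if boundary_matches(t, reference)]
```

Unmatched boundaries then feed into the error counters described in the next section.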
Speech Result Evaluation and Experiment Data
Figure 2 shows which errors can appear during segmentation. The original transcription (MANL) is compared with the auto-segmentation (AUTO) results. The first task is to check whether the segment boundaries are detected correctly. Several error types may be detected (Fig. 2): EXS – noise was automatically detected as a segment; SFT – an automatic segment boundary was shifted; ITR – an additional boundary was inserted inside a segment; JON – a boundary between two segments was missed, so the segments were joined.
The synthetic speech corpora creation method is described above. A speaker-dependent natural speech corpus model was chosen for the experiment; a speaker-independent one would require more effort to collect data and train the recognition engine. For natural speech, 51 signals of the same phrase were recorded by a male speaker. One signal was used to train the speech recognition engine. All the natural speech signals were segmented semi-automatically, in two steps: boundary detection and segment label definition. The expert segmentation results of the natural speech are used for comparison with the automated segmentation results.
Experiment Results
The MEaRBS algorithm segmented 300 files (250 synthetic and 50 natural speech signals). In Fig. 3 the results are shown graphically: the vertical axis represents the signal types (natural speech, the total of all synthetic signals, and synthetic speech at noise levels of 30 dB, 15 dB, 10 dB, 5 dB and 0 dB), while the horizontal axis represents the percentage of cases. The algorithm showed its worst results for synthetic speech at the 0 dB noise level. Comparing all percentage values, natural speech showed results similar to the 30 dB synthetic speech.
Fig. 2. Automatic speech segment comparison with reference
The auto-segment boundary detection error rate is calculated using (3):

$$ERR_B = \frac{EXS + SFT + ITR + JON}{N_F} \cdot 100 \qquad (3)$$

where $ERR_B$ is the boundary detection error rate and $N_F$ is the number of segments found during the experiment.
Segments with incorrect boundaries are not used in the pattern matching that defines a segment label. A recognition result can also be rejected if the algorithm finds too similar references. A recognition error occurs when the algorithm reports segment class A while it was actually segment class E (Fig. 2). The recognition error rate is calculated using (4):

$$ERR_R = \frac{AE}{N_B} \cdot 100 \qquad (4)$$

where $ERR_R$ is the recognition error rate, $AE$ is the number of mislabelled segments, and $N_B$ is the number of correct segments from the boundary detection step.
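Both error rates, (3) and (4), reduce to simple ratios over the error counters; a minimal sketch:

```python
def boundary_error_rate(exs, sft, itr, jon, n_found):
    """ERR_B of eq. (3): percentage of found segments with a boundary error."""
    return (exs + sft + itr + jon) / n_found * 100.0

def recognition_error_rate(mislabelled, n_correct):
    """ERR_R of eq. (4): percentage of correctly bounded segments whose
    label was assigned to the wrong class."""
    return mislabelled / n_correct * 100.0
```

For example, 2 EXS, 3 SFT, 1 ITR and 4 JON errors among 100 found segments give an ERR_B of 10 %.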
Synthetic and natural speech were tested in order to see whether the generated signals can be used for speech segmentation algorithm comparison. The speech corpora for the experiment were created as described above. Two segmentation algorithms [15] were executed against the same signal set: Multi Feature Extremum and Rule Based Segmentation (MEaRBS) and Threshold. Both algorithms showed similar results on synthetic and natural speech, but their segmentation results allow saying which performed better.
Fig. 3. MEaRBS algorithm boundary detection cases
The error rates are presented in Fig. 4. Speech segment boundaries for natural speech and for synthetic speech with 5 dB noise showed similar results, but natural speech had higher recognition errors. The recognition result for the synthetic signal with 10 dB noise is similar to that for natural speech.
Fig. 4. MEaRBS algorithm segmentation and label class recognition errors
Another experiment was done with the static threshold segmentation algorithm, keeping the same experiment data and system parameters. The results are shown in Fig. 5. The natural speech segmentation results are similar to those for synthetic speech at 5 dB and 10 dB, but the segment recognition results are better. This happened because the DTW algorithm is very sensitive to finding the beginning and end of a segment, and this algorithm detected boundaries in a different way than the MEaRBS algorithm.
Fig. 5. Threshold algorithm segmentation and label class recognition errors
Conclusions
A method for comparing speech segmentation algorithms using synthetic speech corpora was presented, including a reviewed and modified version of speech model usage for segmentation evaluation. Experiments showed that segmentation results obtained with natural and with synthetic speech are similar: the confidence intervals of the speech segmentation error for the 15 dB synthetic signal and for natural speech overlap. It can be concluded that synthetic speech can be used for speech research, for the comparison of some algorithms, and for the investigation of certain cases.
References
1. P. Cosi, D. Falavigna, and M. Omologo, A
preliminary statistical evaluation of manual and automatic
segmentation discrepancies// in Second European
Conference on Speech Communication and Technology,
1991.
2. B. M. Wesenick and A. Kipp, Estimating the Quality of Phonetic Transcriptions and Segmentations of Speech Signals// in Proc. Fourth International Conference on Spoken Language Processing (ICSLP 96), 1996, P. 129–132.
3. P. Horak, Automatic Speech Segmentation Based on Alignment with a Text-to-Speech System// Improvements in Speech Synthesis, P. 328–338, 2002.
4. F. Malfrère, O. Deroo, T. Dutoit, and C. Ris, Phonetic alignment: speech synthesis-based vs. Viterbi-based// Speech Communication, vol. 40, P. 503–515, 2003.
5. A. Sethy and S.S. Narayanan, Refined speech
segmentation for concatenative speech synthesis// in
Seventh International Conference on Spoken Language
Processing, 2002.
6. L. Deng, Dynamic Speech Models: Theory,
Algorithms, and Applications.: Morgan and Claypool,
2006.
7. T. Dutoit, High-quality text-to-speech synthesis: An overview// Journal of Electrical and Electronics Engineering Australia, vol. 17, P. 25–36, 1997.
8. W. Walker, P. Lamere, and P. Kwok, FreeTTS: a
performance case study// Sun Microsystems, Inc., 2002.
9. H. Yang, C. Oehlke, and C. Meinel, German Speech Recognition: A Solution for the Analysis and Processing of Lecture Recordings// in 10th IEEE/ACIS International Conference on Computer and Information Science, Sanya, 2011, P. 201–206.
10. A. Kazlauskienė and K. Veličkaitė, Pastabos dėl lietuvių kalbėjimo tempo (Remarks on Lithuanian speech tempo)// Acta Linguistica Lituanica, vol. 48, P. 49–58, 2003.
11. H.B. Weil and F.L. Street, What Is An Adequate
Model?// System Dynamics Conference, P. 281-321, 1983.
12. J.H. Eggen, On the quality of synthetic speech
evaluation and improvements// Technische Universiteit
Eindhoven, 1992.
13. A. Anskaitis, Koduoto balso kokybės tyrimas//
Vilnius Gediminas Technical University, 2009.
14. D. Gibbon, R. Moore, and R. Winski, Handbook of
standards and resources for spoken language systems.,
1997.
15. M. Greibus and L. Telksnys, Rule Based Speech Signal Segmentation// Journal of Telecommunications and Information Technology (JTIT), vol. 1, P. 37–44, 2011.
Date of paper submission: 2012 03 12
M. Greibus, L. Telksnys. Speech Segmentation Analysis Using Synthetic Signals // Electronics and Electrical
Engineering. – Kaunas: Technologija, 20XX. – No. X(XXX). – P. XX–XX.
A certain number of speech cases is needed when solving real-life problems related to speech signals. A natural speech corpus may offer only a limited set of sound cases for an experiment. On the other hand, an experimenter can create a huge speech corpus of synthetic signals that reflect the investigated speech cases and are similar enough to natural speech. This paper presents the comparative use of synthetic and natural speech corpora for speech segmentation algorithms. Adequacy criteria for synthetic and natural corpora for speech segmentation are defined. The experimental results showed that synthetic signals can be used for speech algorithm research. Ill. X, bibl. X, tabl. X (in English; abstracts in English and Lithuanian).
M. Greibus, L. Telksnys. Speech Segmentation Analysis Using Synthetic Signals // Elektronika ir elektrotechnika. – Kaunas: Technologija, 20XX. – No. X(XXX). – P. XX–XX.
When solving practical problems related to speech signals, a certain number of specific speech situations is required. Natural speech corpora often do not contain enough such cases. A synthetic corpus can be used instead. This paper presents the use of a synthetic corpus for solving the speech segmentation task. Synthetic corpora that are sufficiently close to natural ones can be used in research. A similarity measure between synthetic and natural corpora for speech segmentation algorithms is defined. It is shown experimentally that a synthetic corpus can be used to investigate speech segmentation tasks. Ill. X, bibl. X, tabl. X (in English; abstracts in English and Lithuanian).