1. INTRODUCTION
The objective of this thesis is to research and develop prosodic features for
discriminating proper names used in alerting (e.g., “John, can I have that book?”)
from referential context (e.g., “I saw John yesterday”). Prosodic measurements
based on pitch and energy are analyzed to introduce new prosodic-based features
to the Wake-Up-Word Speech Recognition System (Këpuska V. C., 2006). During
the process of finding and analyzing the prosodic features, an innovative data
collection method was designed and developed.
In a conventional automatic speech recognition system, the users are required to
physically activate the recognition system by clicking a button or by manually
starting the application. Using the Wake-Up-Word Speech Recognition System, a
person can activate a system by using their voice only. The Wake-Up-Word
Speech Recognition System will eventually further improve the way people use
speech recognition by enabling speech only interfaces.
In the Wake-Up-Word Speech Recognition System, a word or phrase is used as a
“Wake-Up-Word” (WUW) indicating to the system that the user requires its
attention (e.g., alerting context). Any user can activate the system by uttering a
WUW (e.g., “Operator”), that will enable the application to accept the command
that follows (e.g., “Next slide please”). The non-Wake-Up-Words (non-WUWs)
include the WUWs uttered in referential context, other words, sounds, and noise.
Since the WUW may also occur within a referential context, thereby indicating
that the user does not need attention from the system, it is important
for the system to be able to discriminate accurately between the two. The
following examples further demonstrate the use of the word “Operator” in those
two contexts:
Example sentence 1: “Operator, please go to the next slide.” (alerting context)
Example sentence 2: “We are using the word operator as the WUW.” (referential context)
The above cases indicate different user intentions. In the first example, the word
"operator" is being used as a way to alert the system and get its attention. In the
second example, the same word, "operator", is used, but in this case it is used in a
referential context. The current Wake-Up-Word Speech Recognition System
implements only the pre- and post-WUW silence as a prosodic feature to
differentiate between the alerting and referential contexts.
In this thesis, pitch and energy-based prosodic features are investigated. The
problem of general prosodic analysis is introduced in Section 1.1. In Chapter 2, the
use of pitch as a prosodic feature is described. In general, pitch represents the
intonation of the speech, and the intonation is used to convey linguistic and
paralinguistic information of that speech (Lehiste, 1970). The definition and
characteristics of pitch will be covered in Section 2.1. In Section 2.2, a pitch
estimation method known as Enhanced Super Resolution Fundamental Frequency
Determinator or eSRFD (Bagshaw, 1994) is introduced. Finally, in Section 2.3,
the derivation of multiple pitch-based features from the pitch measurements is presented,
and these features are used to find the best feature for discriminating WUWs used in
alerting contexts from those used in referential contexts.
In Chapter 3, an additional prosodic feature based on energy measurement is
described. The definition of prominence, an important prosodic feature based on
energy and pitch, and its characteristics will be covered in Section 3.1. In Section
3.2, a description of energy computation is presented. Finally, in Section 3.3, a
derivation of multiple energy features from the energy measurement is presented
and analyzed.
In Chapter 4, an innovative idea of performing speech data collection is presented.
After a number of prosodic analysis experiments conducted using the WUWII Corpus
(Tudor, 2007), validation of the results on a different data set was deemed necessary.
Since, to our knowledge, no suitable specialized speech database is available, an idea
from Dr. R. Wallace was adopted to collect the data from movies. We designed a system
which extracts speech from the audio channel and, if necessary, video information from
recorded media (e.g., DVD) of movies and/or TV series. This system is currently under
development by Dr. Këpuska's VoiceKey Group.
The problem definition and system introduction will be explained in Section 4.1,
followed by the system design in Section 4.2.
1.1 PROSODIC ANALYSIS
The word prosody refers to the intonation and rhythmic aspect of a language
(Merriam-Webster Dictionary). Its etymology comes from ancient Greek, where it
referred to singing with instrumental music. In later times, the word was used for
the "science of versification" and the "laws of meter" (William J. Hardcastle, 1997),
governing the modulation of the human voice in reading poetry aloud. In modern
phonetics, the word prosody most often refers to those properties of speech
that cannot be derived from the segmental sequence of phonemes underlying
human utterances.
Human speech cannot be fully characterized as the expression of phonemes,
syllables, or words. For example, we can notice that segments or syllables are
shortened or lengthened in normal speech, apparently in accordance with some
pattern. We can also hear that pitch moves up and down in some non-random way,
providing speech with a recognizable melody. In addition, one can hear that some
syllables or words are made to sound more prominent than others.
Based on the phonological aspect, prosody can be classified into prosodic
structure, tune, and prominence which can be described as follows:
1. Prosodic structure refers to the noticeable breaks or disjunctures between
words in sentences, which can also be interpreted as the duration of the
silence between words as a person speaks. This factor has been considered
in the current Wake-Up-Word Speech Recognition System, where a minimal
silence period before and after the WUW must be present. The silence period
just before the WUW is usually longer than the average silence period around
non-WUWs or other parts of the sentence.
2. Tune refers to the intonational melody of an utterance (Jurafsky & Martin),
which can be quantified by the pitch measurement, also known as the fundamental
frequency of the speech. The details of the pitch characteristics, the pitch
estimation algorithm, and the usage of pitch features are presented and
explained in Chapter 2.
3. Prominence includes the measurement of stress and accent in speech.
Prominence is measured in our experiments using the energy of the sound signal.
The details of the energy computation, the feature derivation based on energy,
and the experimental results are presented in Chapter 3.
2. PITCH FEATURES
In this chapter, the intonation melody of an utterance, computed using pitch
measurements, is described. The pitch characteristics and a comparison of various
pitch estimation algorithms (Bagshaw, 1994) are covered in Section 2.1. Based on
the comparison results of multiple fundamental frequency determination
algorithms (FDAs), the Enhanced Super Resolution Fundamental Frequency
Determinator (eSRFD) (Bagshaw, 1994) is selected as the algorithm of choice to
perform the pitch estimation. The details of the eSRFD algorithm are covered in
Section 2.2. The derivation of multiple pitch-based features and their performance
evaluation are covered in Section 2.3.
2.1 PITCH AND PITCH ESTIMATION METHODS
Intonation is one of the prosodic features that contain the information that may
be the key to discriminate between the referential context and the alerting
context. The intonation of speech is strictly interpreted as "the ensemble of pitch
variations in the course of an utterance" (Hart, 1975). Unlike tonal languages such
as Mandarin Chinese, in which lexical forms are characterized by different levels
or patterns of pitch on a particular phoneme, pitch in intonational languages such
as English, German, the Romance languages, and Japanese is used syntactically.
In addition, intonation patterns in intonational languages span groups of words,
which are called intonation groups.
Intonation groups of words are usually uttered in a single breath. The pitch
measurement in intonational languages reveals the emotion of a person
and/or the intention of his/her speech. For example, consider the following
sentence:
Can you pass me the phone?
The pattern of continuously rising pitch in the last three words in the above
sentence indicates a request.
Strictly speaking, pitch is defined as the fundamental frequency, or fundamental
repetition rate, of a sound. The typical pitch range is between 60-200 Hz for adult
males and 200-400 Hz for adult females and children. Contraction of the vocal folds
in humans produces a relatively high pitch and, conversely, relaxed vocal folds
produce a lower pitch. This explains why a person's voice rises in pitch when he/she
gets nervous or surprised. That human males usually have a lower voice pitch than
females and children can also be explained by the fact that males usually have
longer and larger vocal folds.
After years of development of pitch estimation algorithms, pitch estimation
methods can be classified into the following three categories:
1. Frequency-domain methods, such as CFD (cepstrum-based F0 determination) and
HPS (harmonic product spectrum), use a frequency-domain representation of the
speech signal to find the fundamental frequency.
2. Time-domain methods, such as FBFT (feature-based F0 tracking) (Phillips, 1985),
which uses perceptually motivated features, and PP (parallel processing), produce
fundamental frequency estimates by analyzing the waveform in the time domain.
3. Cross-correlation methods, such as IFTA (integrated F0 tracking algorithm) and
SRFD (super resolution F0 determination), use a waveform similarity metric based
on a normalized cross-correlation coefficient.
The eSRFD (Enhanced Super Resolution Fundamental Frequency Determinator)
method (Bagshaw, 1994) was chosen to extract the pitch measurements for the
Wake-Up-Word because of its high overall accuracy. According to Bagshaw's
experiments, the eSRFD algorithm achieves a combined voiced and unvoiced error
rate below 17% and low-gross fundamental frequency error rates of 2.1% and 4.2%
for males and females, respectively. Figure 2-1 and Figure 2-2 show the error rate
comparison charts between eSRFD and other FDAs for male and female voices,
respectively.
[Bar chart: error rates (%) of the CFD, HPS, FBFT, PP, IFTA, SRFD, and eSRFD algorithms on male speech, with bars for low-gross error, high-gross error, voiced error, and unvoiced error.]
Figure 2-1 FDA Evaluation Chart: Male Speech. Reproduced from (Bagshaw, 1994)
In Figure 2-1 and Figure 2-2, the purple bars indicate the low-gross F0 error, which
refers to the halving error where the pitch has been estimated wrongly with a
value about half of the actual pitch. The green bars represent the high-gross F0
error, which refers to the doubling error where the pitch has been estimated
wrongly with a value about twice that of the actual pitch. The voiced error,
represented by red bars, refers to unvoiced frames which have been misidentified
as voiced by the FDA. Finally, the blue bars show the unvoiced errors, in which
voiced data has been misidentified as unvoiced.
[Bar chart: error rates (%) of the CFD, HPS, FBFT, PP, IFTA, SRFD, and eSRFD algorithms on female speech, with bars for low-gross error, high-gross error, voiced error, and unvoiced error.]
Figure 2-2 FDA Evaluation Chart: Female Speech. Reproduced from (Bagshaw, 1994)
Figure 2-1 and Figure 2-2 present the fundamental frequency evaluation results for
male and female speech, respectively. They show that the eSRFD algorithm achieves
the lowest overall error rate. This result was confirmed in a more recent study
(Veprek & Scordilis, 2002). Consequently, eSRFD was chosen to be the FDA used in
the present project.
2.2 ESRFD PITCH ESTIMATION ALGORITHM
The eSRFD (Bagshaw, 1994) is an enhanced version of SRFD (Medan, 1991). The
program flow chart of the eSRFD FDA is illustrated in Figure 2-3.
The theory behind the SRFD (Medan, 1991) algorithm is to use a normalized
cross-correlation coefficient to quantify the degree of similarity between two
adjacent, non-overlapping sections of speech. In eSRFD, a frame is divided into
three consecutive sections instead of two as in the original SRFD algorithm.
At the beginning, the sample waveform is passed through a low-pass filter to
remove signal noise. The utterance is then divided into non-overlapping frames of
6.5 ms length ($t_{interval} = 6.5$ ms), and each frame contains a set of samples
$s_N = \{s(i) \mid i = -N_{max}, \ldots, N + N_{max}\}$, which is divided into three
consecutive segments, each containing the same number of samples, n (where n
varies with the candidate period). The segmentation is defined by Equation 2-1 and
is further illustrated in Figure 2-4 below.

$x_n = \{x(i) = s(i - n) \mid i = 1, \ldots, n\}$
$y_n = \{y(i) = s(i) \mid i = 1, \ldots, n\}$
$z_n = \{z(i) = s(i + n) \mid i = 1, \ldots, n\}$

Equation 2-1
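As a minimal illustration of this segmentation, the following Python sketch (using NumPy; the function name, the example sampling rate, and the candidate period n are illustrative choices of ours, not part of the eSRFD implementation) splits the signal around a frame boundary into the three adjacent sections x, y, and z of Equation 2-1:

import numpy as np

def segment_frame(s, start, n):
    # Three adjacent, non-overlapping sections of length n (Equation 2-1):
    # x precedes the frame, y is the frame itself, z follows it.
    x = s[start - n:start]            # x(i) = s(i - n)
    y = s[start:start + n]            # y(i) = s(i)
    z = s[start + n:start + 2 * n]    # z(i) = s(i + n)
    return x, y, z

# Example: 8 kHz audio and a candidate period of 100 samples.
fs = 8000
s = np.random.randn(fs)               # stand-in for a low-pass filtered utterance
x, y, z = segment_frame(s, start=4000, n=100)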
Figure 2-3 eSRFD Flow chart
Figure 2-4 Analysis segments of eSRFD FDA
In eSRFD each frame is processed by a silence detector, which labels the frame as
unvoiced or silent if the sum of the absolute values of x_min, x_max, y_min, y_max,
z_min, and z_max is smaller than a preset value (e.g., a 50 dB signal-to-noise level);
conversely, the frame is voiced if that sum is equal to or larger than the preset value.
No fundamental frequency search is performed if the frame is marked as unvoiced.
In those cases where at least one of the segments x_n, y_n, or z_n is not defined,
which usually happens at the beginning and the end of the speech file, the frames
are labeled as unvoiced and no FDA is applied to them.
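A rough sketch of this silence gate is given below; representing the preset value as a single amplitude threshold is our simplification of the 50 dB signal-to-noise criterion described above, and the function name is ours:

def is_unvoiced(x, y, z, threshold):
    # A frame is labeled unvoiced/silent when the sum of the absolute extreme
    # values of its three sections falls below the preset threshold.
    extremes = [x.min(), x.max(), y.min(), y.max(), z.min(), z.max()]
    return sum(abs(v) for v in extremes) < threshold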
If the frame is not labeled as unvoiced, then candidate values for the fundamental
period are searched over values of n within the range N_min to N_max using the
normalized cross-correlation coefficient P_x,y(n) as described by Equation 2-2.
$P_{x,y}(n) = \dfrac{\sum_{j=1}^{\lfloor n/L \rfloor} x(jL)\,y(jL)}{\sqrt{\sum_{j=1}^{\lfloor n/L \rfloor} x(jL)^{2}\,\sum_{j=1}^{\lfloor n/L \rfloor} y(jL)^{2}}}, \qquad \{\,n = N_{min} + iL \mid i = 0, 1, \ldots;\ N_{min} \le n \le N_{max}\,\}$

Equation 2-2
In Equation 2-2, the decimation factor L is used to lower the computational load of
the algorithm. Smaller L values allow higher resolution but also cause an increase in
the computational load of the FDA. Larger L values produce faster computation
with a lower-resolution search. The L value is set to 1, since the purpose of this
research is to find as accurately as possible the relationship between pitch
measurements in WUW words. Therefore, computational speed is considered
secondary and thus is not taken into account. However, the variable L will be
reconsidered when this algorithm is integrated into the WUW Speech Recognition
System.
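The decimated cross-correlation of Equation 2-2 can be transcribed directly as in the sketch below (with L = 1, every sample of the two sections is used; this is a literal transcription of the formula, not the optimized implementation used in eSRFD):

import numpy as np

def rho(x, y, n, L=1):
    # Normalized cross-correlation P_x,y(n) of Equation 2-2 for one candidate
    # lag n, using every L-th sample of the sections x and y.
    j = np.arange(1, n // L + 1)
    xs = x[j * L - 1]                  # x(jL), converted to 0-based indexing
    ys = y[j * L - 1]
    denom = np.sqrt(np.sum(xs ** 2) * np.sum(ys ** 2))
    return np.sum(xs * ys) / denom if denom > 0 else 0.0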
Figure 2-5 Analysis segments for Px,y(n) in the eSRFD
The candidate values of the fundamental period of a frame are found by locating
peaks in the normalized cross-correlation P_x,y(n). If this value exceeds a specified
threshold, T_srfd, then the frame is further considered to be a voiced candidate.
This threshold is adaptive and depends on the voicing classification of the previous
frame and three preset parameters. The definition of T_srfd is given in Equation 2-3.
If the previous frame is unvoiced or silent, then T_srfd is set to 0.88. If the previous
frame is voiced, then T_srfd is equal to the larger of 0.75 and 0.85 times the value of
P_x,y of the previous frame, P'_x,y. The threshold is adjusted because the present
frame is more likely to be voiced if the previous frame is also voiced.
$T_{srfd} = 0.88$ if the previous frame is unvoiced or silent.
$T_{srfd} = \max\left[0.75,\ 0.85\,P'_{x,y}(n'_{0})\right]$ if the previous frame is voiced.

Equation 2-3
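In code, the adaptive threshold is a one-line rule; the constants are those quoted above and the function name is ours:

def srfd_threshold(prev_frame_voiced, prev_rho=0.0):
    # T_srfd of Equation 2-3: fixed at 0.88 after an unvoiced or silent frame,
    # otherwise tied to the previous frame's cross-correlation value.
    if not prev_frame_voiced:
        return 0.88
    return max(0.75, 0.85 * prev_rho)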
If no candidates for the fundamental period are found in the frame, the frame is
reclassified as unvoiced and no further processing is applied to it. On the other
hand, if the frame is classified as voiced, then the optimal candidate is found as
described next.
After getting the first normalized cross-correlation coefficient Px,y, the second
normalized cross-correlation coefficient Py,z, will be calculated for the voiced
frame. The normalized cross-correlation coefficient Py,z is described by Equation
2-4.
$P_{y,z}(n) = \dfrac{\sum_{j=1}^{\lfloor n/L \rfloor} y(jL)\,z(jL)}{\sqrt{\sum_{j=1}^{\lfloor n/L \rfloor} y(jL)^{2}\,\sum_{j=1}^{\lfloor n/L \rfloor} z(jL)^{2}}}, \qquad \{\,n = N_{min} + iL \mid i = 0, 1, \ldots;\ N_{min} \le n \le N_{max}\,\}$

Equation 2-4
After the second normalized cross-correlation, a score is given to each candidate.
If the candidate pitch value of a frame has both P_x,y and P_y,z values larger than
T_srfd, then a score of 2 is given to the candidate. If only P_x,y is above T_srfd,
then a score of 1 is assigned. A higher score indicates a higher possibility that the
candidate represents the fundamental period of the frame. After the scores are
assigned, if there are one or more candidates with a score of 2, then all candidates
with a score of 1 in that frame are removed from the candidate list. If there is only
one candidate with a score of 2, then that candidate is taken as the best estimate of
the fundamental period of that particular frame. If there are multiple candidates
with a score of 1 but no candidate with a score of 2, then an optimal fundamental
period is sought from the remaining candidates.
In the case of multiple candidates with a score of 1 but none with a score of 2, the
candidates are sorted in ascending order of fundamental period. The last candidate
on the list has the largest fundamental period, denoted n_M, while n_m denotes the
fundamental period of the m-th candidate.
Figure 2-6 Analysis segments for q(nm) in the eSRFD
Then a third normalized cross-correlation coefficient, q(n_m), between two sections
of length n_M spaced n_m apart, is calculated for each candidate. The section of
length n_M within a frame is illustrated in Figure 2-6, and Equation 2-5 describes the
normalized cross-correlation coefficient q(n_m) used in this case.
$q(n_m) = \dfrac{\sum_{j=1}^{n_M} s(j)\,s(j + n_M - n_m)}{\sqrt{\sum_{j=1}^{n_M} s(j)^{2}\,\sum_{j=1}^{n_M} s(j + n_M - n_m)^{2}}}$

Equation 2-5
After the third normalized cross-correlation coefficient is computed, the q(n_m)
value of the first candidate on the list is assumed to be the optimal value. If a
subsequent q(n_m), multiplied by 0.77, is larger than the current optimal value,
then that candidate's q(n_m) becomes the new optimal value. The same rule is
applied throughout the entire list of candidates, resulting in the optimal candidate.
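A sketch of this selection rule, assuming the candidates are available as (period, q) pairs already sorted in ascending order of fundamental period (the data layout and function name are ours):

def pick_optimal(candidates):
    # candidates: list of (n_m, q(n_m)) pairs sorted by ascending period n_m.
    # The first candidate starts as the optimum; a later candidate replaces it
    # only when 0.77 * q exceeds the current optimal q value, which biases the
    # choice toward shorter fundamental periods.
    best_period, best_q = candidates[0]
    for n_m, q in candidates[1:]:
        if 0.77 * q > best_q:
            best_period, best_q = n_m, q
    return best_period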
In the case where only one candidate has a score of 1 and there is no candidate with
a score of 2, the possibility that the candidate is the true fundamental period of the
frame is low. In such a case, if both the previous and the next frames are silent, then
the current frame is an isolated frame and is reclassified as silent. If either the
previous or the next frame is voiced, then the candidate of the current frame is
assumed to be the optimal one and defines the fundamental period of the current
frame.
The above algorithm has a tendency to misidentify voiced frames as unvoiced or
silent. In order to counteract this imbalance, biasing is applied when all of the
following three conditions are satisfied:
- The two previous frames were voiced frames.
- The fundamental period of the previous frame is not temporarily on hold.
- The fundamental frequency of the previous frame is less than 7/4 times the
fundamental frequency of its next voiced frame and greater than 5/8 of it.
After obtaining the fundamental frequency, and in order to further minimize the
occurrence of doubling or halving errors, the pitch contour is passed through a
median filter.
The median filter has a default length of 7, but the size decreases to 5 or 3 when
there are fewer than 7 consecutive voiced frames. Figure 2-7 is an example of
doubling points being corrected by the median filter. In Figure 2-7, the top row
shows the pitch measurement generated by the eSRFD FDA and the bottom row
shows the corrected measurement after the median filter. As can be seen from the
figure, the two points marked as doubling errors were corrected by the median
filter.
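A minimal sketch of this post-processing step, assuming the contour is an array of per-frame F0 values with 0 marking unvoiced frames; the window-shrinking rule follows the description above, while the handling of runs shorter than three frames is our assumption:

import numpy as np

def median_smooth(f0):
    # Median-filter each run of consecutive voiced frames (f0 > 0); the window
    # shrinks from 7 to 5 or 3 for short voiced runs.
    f0 = np.asarray(f0, dtype=float)
    out = f0.copy()
    voiced = np.append(f0 > 0, False)
    start = None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            run = f0[start:i]
            win = 7 if len(run) >= 7 else (5 if len(run) >= 5 else 3)
            half = win // 2
            for k in range(half, len(run) - half):
                out[start + k] = np.median(run[k - half:k + half + 1])
            start = None
    return out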
Figure 2-7 Median filter example
We applied the above pitch estimation method to the WUWII (Wake-Up-Word II)
corpus. The WUWII corpus contains 3410 sample utterances and each utterance
sentence contains at least one of the five different WUWs. The five WUWs are
‘Wildfire’, ‘Operator’, ‘ThinkEngine’, ‘Onword’ and ‘Voyager’. Figure 2-8 displays a
sample utterance containing the following sentence where the word “Wildfire” is
the WUW of the sentence.
“Hi. You know, I have this cool wildfire service and, you
know, I'm gonna try to invoke it right now. Wildfire”
Figure 2-8 Example, WUWII00073_009.ulaw
In Figure 2-8, the first row shows the waveform of the speech, the second row
shows the pitch estimation from eSRFD FDA, the third shows the pitch estimation
after the median filter, and the last row shows the audio spectrogram of the
speech. The WUW of this sentence is ‘Wildfire’ which is the section delineated
between two red lines.
2.3 PITCH-BASED FEATURES
The pattern of the fundamental frequency contour of utterance waveforms
represents the intonation of the speech. To the best of our knowledge, the problem
of discriminating between words used in an alerting context and the same words
used in a referential context has not been addressed before. To accomplish this, a
specialized speech data corpus containing WUWs is necessary. In this project, the
corpus named WUWII (Këpuska V.) was chosen. The WUWII corpus
contains 3410 sample utterances and each utterance sentence contains at least
one of the five different WUWs. The five WUWs are ‘Wildfire’, ‘Operator’,
‘ThinkEngine’, ‘Onword’ and ‘Voyager’.
Our hypothesis is that the intonation will rise when the WUW is spoken; thus there
should be an increase in the average pitch and/or maximum pitch of the WUW
sections compared to the non-WUW sections of the utterance.
Based on the above hypothesis, the average pitch and maximum pitch of the
WUWs are considered and twelve pitch-based features are derived and listed in
Table 2-1. The features are represented as the relative change between A and B
which is defined in Equation 2-6 as:
Relative Change between A and B = (A-B)/B.
Equation 2-6 Relative Change
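For concreteness, the sketch below computes one of these features, APW_AP1SBW, from a per-frame pitch contour; the array bookkeeping and the exclusion of unvoiced frames (F0 = 0) from the averages are our assumptions, not a description of the actual implementation:

import numpy as np

def relative_change(a, b):
    # Relative change of A with respect to B (Equation 2-6).
    return (a - b) / b

def apw_ap1sbw(f0, wuw_start, wuw_end, prev_start):
    # APW_AP1SBW: relative change of the average pitch inside the WUW versus
    # the average pitch of the section just before the WUW.
    f0 = np.asarray(f0, dtype=float)
    wuw = f0[wuw_start:wuw_end]
    before = f0[prev_start:wuw_start]
    avg_wuw = wuw[wuw > 0].mean()          # unvoiced frames (F0 = 0) excluded
    avg_before = before[before > 0].mean()
    return relative_change(avg_wuw, avg_before)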
APW_AP1SBW: The relative change of the average pitch of the WUW to the average pitch of the section just before the WUW.
AP1sSW_AP1SBW: The relative change of the average pitch of the first section of the WUW to the average pitch of the section just before the WUW.
APW_APAll: The relative change of the average pitch of the WUW to the average pitch of the entire speech sample excluding the WUW sections.
AP1sSW_APAll: The relative change of the average pitch of the first section of the WUW to the average pitch of the entire speech sample excluding the WUW sections.
APW_APAllBW: The relative change of the average pitch of the WUW to the average pitch of the entire speech sample before the WUW.
AP1sSW_APAllBW: The relative change of the average pitch of the first section of the WUW to the average pitch of the entire speech sample before the WUW.
MaxPW_MaxP1SBW: The relative change of the maximum pitch in the WUW sections to the maximum pitch in the section just before the WUW.
MaxP1sSW_MaxP1SBW: The relative change of the maximum pitch in the first section of the WUW to the maximum pitch in the section just before the WUW.
MaxPW_MaxPAll: The relative change of the maximum pitch of the WUW to the maximum pitch of the entire speech sample excluding the WUW sections.
MaxP1sSW_MaxPAll: The relative change of the maximum pitch of the first section of the WUW to the maximum pitch of the entire speech sample excluding the WUW sections.
MaxP1sSW_MaxPAllBW: The relative change of the maximum pitch in the first section of the WUW to the maximum pitch of the entire speech sample before the WUW.
MaxPW_MaxPAllBW: The relative change of the maximum pitch in the WUW sections to the maximum pitch of the entire speech sample before the WUW.

Table 2-1 Pitch Features Definition
The pitch-based feature values have been calculated for the combination of all five
different WUWs and for each of the five individual WUWs. The detailed performance
results are shown in Appendix A. In this section, the results of the pitch-based
features are shown and explained using the combination of all five WUWs. This is
presented in Table 2-2 below.
Pitch-Based Features (WUW: All WUWs)

Feature | Valid Data | Pt > 0 | % > 0 | Pt = 0 | % = 0 | Pt < 0 | % < 0
APW_AP1SBW | 1415 | 726 | 51 | 0 | 0 | 689 | 49
AP1sSW_AP1SBW | 1415 | 735 | 52 | 0 | 0 | 680 | 48
APW_APALL | 2282 | 947 | 41 | 0 | 0 | 1335 | 59
AP1sSW_APALL | 2282 | 996 | 44 | 2 | 0 | 1284 | 56
APW_APALLBW | 2188 | 962 | 44 | 0 | 0 | 1226 | 56
AP1sSW_APALLBW | 2188 | 1003 | 46 | 2 | 0 | 1183 | 54
MaxP_MaxP1SBW | 1415 | 948 | 67 | 53 | 4 | 414 | 29
MaxP1sSW_MaxP1SBW | 1415 | 719 | 51 | 54 | 4 | 642 | 45
MaxPW_MaxPAll | 2282 | 1020 | 45 | 109 | 5 | 1153 | 51
MaxP1sSW_MaxPAll | 2282 | 716 | 31 | 213 | 9 | 1353 | 59
MaxP1sSW_MaxPAllBW | 2188 | 1069 | 49 | 111 | 5 | 1008 | 46
MaxPW_MaxPAllBW | 2188 | 1003 | 35 | 2 | 10 | 1183 | 55

Table 2-2 Pitch-Based Features Experimental Results of All WUWs
In Table 2-2, the first column indicates the name of the feature, and the second
column shows the number of valid data points; only samples with valid data are
processed for the particular feature. The third and fourth columns show the number
and percentage of samples, respectively, with that feature above zero. Similarly, the
fifth and sixth columns show the number and percentage of samples with that
feature equal to zero. Finally, the seventh and eighth columns show the number and
percentage of samples with that feature below zero.
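The columns of Table 2-2 amount to a simple tally over the valid samples of each feature; a sketch of that tally is shown below, where the list values stands in for one feature column:

def summarize(values):
    # Count and percentage of feature values above, equal to, and below zero.
    n = len(values)
    pos = sum(1 for v in values if v > 0)
    zero = sum(1 for v in values if v == 0)
    neg = n - pos - zero

    def pct(count):
        return round(100.0 * count / n) if n else 0

    return {"valid": n,
            "> 0": (pos, pct(pos)),
            "= 0": (zero, pct(zero)),
            "< 0": (neg, pct(neg))}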
From an examination of Table 2-2, we see that the feature with the highest
percentage of positive relative change is MaxP_MaxP1SBW, with only 67%. This
means that only 67% of the samples have a positive relative change between the
maximum pitch measurement of the WUW sections and the maximum pitch
measurement in the section just before that WUW. This result can also be
interpreted as showing that only 67% of the samples have a maximum pitch in the
WUW sections higher than the maximum pitch in the section just before the WUW.
The results for the five individual WUWs used in this study are summarized in
Table 2-3 below. The full detailed pitch-based feature experimental results can be
found in Appendix A.
WUW | Best Performance Feature | Percentage of Positive Relative Change
All WUWs | MaxP_MaxP1SBW | 67%
Operator | MaxP_MaxP1SBW | 58%
Onword | MaxP_MaxP1SBW | 58%
ThinkEngine | APW_AP1SBW | 65%
Wildfire | MaxP_MaxP1SBW | 66%
Voyager | MaxP_MaxP1SBW | 79%

Table 2-3 Summarized Pitch-Based Features Experimental Results
Although there appears to be no prior research which has established a definite
standard by which performance can be rated, in this project we somewhat
arbitrarily set a minimum of 80% as the criterion for any given feature to be
considered reliable. From the summarized results in Table 2-3, the feature with
the best performance is MaxP_MaxP1SBW, which has percentages of positive
relative change from 58% to 79% depending on the WUW. This "best performance"
of the pitch-based feature analysis is below our 80% minimum standard, which
makes it too low to be considered reliable in discriminating between WUWs and
non-WUWs.
In the pitch-based feature experiments, no significant discriminating pattern could
be found in the results obtained. These results could be improved if it were possible
to define clear syllabic boundaries. However, syllabic boundaries in the English
language are not clearly defined: there is no common agreement among linguists on
where they fall.
Based on the above experimental results, no pitch-based feature can be used for
discriminating WUWs from non-WUWs. Thus, other approaches, such as pitch
measurement patterns, are under consideration, and Raymond Sastraputera, a
graduate student working with Dr. Këpuska, will continue research on the new
approaches. Other possible approaches to pitch-based features are covered in
Chapter 6.
3. ENERGY FEATURES
As mentioned in Section 1.1, the prosodic feature known as prominence can be
measured using the energy of the utterance. If pitch represents the intonation of
speech, then energy represents the stress of the speech. In this chapter, the same
approach that was applied to pitch in Chapter 2 is used with energy to generate a
similar feature set.
3.1 ENERGY CHARACTERISTIC
In an English sentence, certain syllables are more prominent than others; these
syllables are called accented syllables. Accented syllables are usually either louder
or longer than the other syllables in the same word. In English, different positions
of the accented syllable within the same word are used to differentiate the meaning
of the word. For example, the noun object (['ɑb.dʒɛkt]) and the same word used as
a verb ([əb.'dʒɛkt]) (Cutler, 1986) have accented syllables in different locations.
The position of the accented syllable is indicated by " ' " in the phonetic
transcription. If this idea of accenting is applied to the entire sentence instead of to
a single word, then it may provide additional clues about the use of a word of
interest and its meaning within the sentence. Our hypothesis here is that the
prominence of WUWs should be more significant compared to the prominence of
the non-WUWs in the sentence.
Modeling the factors that shape a speaker's speech and determine how a particular
syllable within a sentence is accentuated is a very complex problem. However, the
accented syllables can be measured simply by using the energy of the speech signal
and its pitch change.
3.2 ENERGY EXTRACTION
The energy of a speech signal can be expressed by Parseval's Theorem as given in
Equation 3-1.

$\sum_{n=-\infty}^{\infty} |x[n]|^{2} = \dfrac{1}{2\pi}\int_{-\pi}^{\pi} |X(\omega)|^{2}\, d\omega$

Equation 3-1
In Equation 3-1, the energy of a signal is defined in both the time and frequency
domains. Both |x[n]|^2 and |X(ω)|^2 represent the energy density, which can be
thought of as energy per unit of time and energy per unit of frequency, respectively.
The energy is computed over fixed frames of 6.5 ms, the same frame size used in the
earlier pitch computations. After the energy is calculated for all samples of each
utterance in the WUWII corpus, the energy features can be computed in a manner
similar to the way the pitch-based features were calculated in Section 2.3. This is
done in the next section.
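A sketch of the per-frame energy computation under these settings is shown below (the time-domain side of Equation 3-1 applied to non-overlapping 6.5 ms frames; the sampling-rate handling and function name are our assumptions):

import numpy as np

def frame_energy(x, fs, frame_ms=6.5):
    # Energy per non-overlapping frame: the sum of |x[n]|^2 over each frame.
    frame_len = int(round(frame_ms * 1e-3 * fs))
    n_frames = len(x) // frame_len
    frames = np.reshape(x[:n_frames * frame_len].astype(float),
                        (n_frames, frame_len))
    return np.sum(frames ** 2, axis=1)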
3.3 ENERGY-BASED FEATURES
Using the same technique as in the previous experiments with pitch-based features,
12 energy-based features were computed and tested. The energy-based features are
derived in the same way as the pitch-based features (Equation 2-6). The features are
listed and defined in Table 3-1 below:
AEW_AE1SBW: The relative change of the average energy of the WUW to the average energy of the section just before the WUW.
AE1sSW_AE1SBW: The relative change of the average energy of the first section of the WUW to the average energy of the section just before the WUW.
AEW_AEAll: The relative change of the average energy of the WUW to the average energy of the entire speech sample excluding the WUW sections.
AE1sSW_AEAll: The relative change of the average energy of the first section of the WUW to the average energy of the entire utterance excluding the WUW sections.
AEW_AEAllBW: The relative change of the average energy of the WUW to the average energy of all speech before the WUW.
AE1sSW_AEAllBW: The relative change of the average energy of the first section of the WUW to the average energy of the entire speech sample before the WUW.
MaxEW_MaxE1SBW: The relative change of the maximum energy in the WUW sections to the maximum energy in the section just before the WUW.
MaxE1sSW_MaxE1SBW: The relative change of the maximum energy in the first section of the WUW to the maximum energy in the section just before the WUW.
MaxEW_MaxEAll: The relative change of the maximum energy in the WUW to the maximum energy of the entire speech sample excluding the WUW sections.
MaxE1sSW_MaxEAll: The relative change of the maximum energy in the first section of the WUW to the maximum energy of the entire speech sample excluding the WUW sections.
MaxE1sSW_MaxEAllBW: The relative change of the maximum energy in the first section of the WUW to the maximum energy of the entire speech before the WUW.
MaxEW_MaxEAllBW: The relative change of the maximum energy in the WUW sections to the maximum energy of the entire speech sample before the WUW.

Table 3-1 Energy-Based Features Definition
Some of these features may not be implementable in real-time applications, since
they rely on measurements taken after the WUW of interest. Nevertheless, even
those features may lead to interesting conclusions. For real-time speech recognition
systems, the features that do not rely on measurements after the WUW of interest
are the most useful. Table 3-2 below shows the results of the energy feature
measurements based on all five different WUWs of the WUWII corpus, namely the
words "Operator", "ThinkEngine", "Onword", "Wildfire" and "Voyager". The detailed
results for each of the five WUWs can be found in Appendix B.
Energy-Based Features (WUW: All WUWs)

Feature | Valid Data | Pt > 0 | % > 0 | Pt = 0 | % = 0 | Pt < 0 | % < 0
AEW_AE1SBW | 1479 | 1164 | 79 | 0 | 0 | 315 | 21
AE1sSW_AE1SBW | 1479 | 1283 | 84 | 1 | 0 | 240 | 16
AEW_AEAll | 2175 | 1059 | 49 | 9 | 9 | 1116 | 51
AE1sSW_AEAll | 2175 | 1155 | 53 | 2 | 0 | 1018 | 47
AEW_AEAllBW | 1969 | 1427 | 72 | 0 | 0 | 542 | 28
AE1sSW_AEAllBW | 1969 | 1562 | 79 | 3 | 0 | 404 | 21
MaxEW_MaxE1SBW | 1479 | 1244 | 84 | 20 | 1 | 215 | 15
MaxE1sSW_MaxE1SBW | 1479 | 1221 | 83 | 13 | 1 | 245 | 17
MaxEW_MaxEAll | 2175 | 1373 | 63 | 13 | 1 | 245 | 17
MaxE1sSW_MaxEAll | 2175 | 1336 | 61 | 25 | 1 | 814 | 37
MaxE1sSW_MaxEAllBW | 1969 | 1209 | 61 | 16 | 1 | 744 | 38
MaxEW_MaxEAllBW | 1969 | 1562 | 60 | 3 | 1 | 404 | 39

Table 3-2 Energy-Based Feature Experimental Results of All WUWs
In Table 3-2, the first column indicates the name of the feature, and the second
column shows the number of valid data points; only samples with valid data are
processed for the particular feature. The third and fourth columns show the number
and percentage of samples with that feature above zero. The fifth and sixth columns
show the number and percentage of samples with that feature equal to zero. The
seventh and eighth columns show the number and percentage of samples with that
feature less than zero.
Based on the results shown in Table 3-2, the following three features performed
the best in discriminating the WUW from other word tokens:
- AE1sSW_AE1SBW: The relative change of the average energy of the first section
of the WUW compared to the average energy of the last section before the WUW.
Using this feature, 84% of the data shows the average energy of the first section of
the WUW is higher than the average energy of the previous section. This result is
illustrated in Figure 3-1 below, depicting the distribution of this feature in blue and
its cumulative distribution in red. The cumulative plot shown in red is a continuous
curve that approaches a value of 100%; the distribution plot is discrete and is shown
here in blue. Both plots are presented in black in Appendices A and B.
[Plot: distribution and cumulative distribution (%) of (WUW1stAE-LSAE)/LSAE.]
Figure 3-1 Distribution and Cumulative plots of energy-based feature AE1sSW_AE1SBW of All WUWs
- MaxEW_MaxE1SBW: The relative change of the maximum energy in the WUW
sections compared to the maximum energy of the last section before the WUW.
Using this feature, 84% of the samples show that the maximum energy in the WUW
sections is higher than the maximum energy of the previous section. The
distribution and cumulative plots of this feature are shown in Figure 3-2 below.
[Plot: distribution and cumulative distribution (%) of (WUWMAXE-LSMAXE)/LSMAXE.]
Figure 3-2 Distribution and Cumulative plots of energy-based feature MaxEW_MaxE1SBW of All WUWs
- MaxE1sSW_MaxE1SBW: The relative change of the maximum energy of the first
section of the WUW compared to the maximum energy of the last section before the
WUW. This feature correctly discriminated 83% of the cases, showing a higher
maximum energy in the first section of the WUW than in the previous section. The
cumulative and distribution plots of this feature are shown in Figure 3-3.
[Plot: distribution and cumulative distribution (%) of (WUW1stMAXE-LSMAXE)/LSMAXE.]
Figure 3-3 Distribution and Cumulative plots of energy-based feature MaxE1sSW_MaxE1SBW of All WUWs
The above results are based on all the data, including all five different WUWs. Thus,
investigating each word independently may be more appropriate. The detailed
performance results for each of the five individual WUWs are covered in Appendix B.
Linguistically, one of the more common and useful WUWs is the word "Operator".
This word is also used in the current Wake-Up-Word Speech Recognition System
(Këpuska V. C., 2006). Based on the results presented in Table 3-3 below, two
features show that over 90% of the WUW cases have an average or maximum
energy higher than the other sections of speech. These two features are:
- AE1sSW_AE1SBW: The relative change of the average energy of the first section
of the WUW compared to the average energy of the last section before the WUW.
Using this feature, 94% of the samples have the first section of the WUW with
higher average energy than the previous section.
- AE1sSW_AEAllBW: The relative change of the average energy of the first section
of the WUW compared to the average energy of the entire speech before the WUW.
Using this feature, 91% of the samples show the first section of the WUW has higher
average energy.
Energy-Based Features (WUW: Operator)

Feature | Valid Data | Pt > 0 | % > 0 | Pt = 0 | % = 0 | Pt < 0 | % < 0
AEW_AE1SBW | 275 | 228 | 83 | 0 | 0 | 47 | 17
AE1sSW_AE1SBW | 275 | 258 | 94 | 0 | 0 | 17 | 6
AEW_AEAll | 418 | 248 | 59 | 0 | 0 | 170 | 41
AE1sSW_AEAll | 418 | 290 | 69 | 1 | 0 | 127 | 30
AEW_AEAllBW | 394 | 303 | 77 | 0 | 0 | 91 | 23
AE1sSW_AEAllBW | 394 | 359 | 91 | 1 | 0 | 34 | 9
MaxEW_MaxE1SBW | 275 | 240 | 87 | 1 | 0 | 34 | 12
MaxE1sSW_MaxE1SBW | 275 | 243 | 88 | 0 | 0 | 32 | 12
MaxEW_MaxEAll | 418 | 290 | 69 | 4 | 1 | 124 | 30
MaxE1sSW_MaxEAll | 418 | 285 | 68 | 6 | 1 | 127 | 30
MaxE1sSW_MaxEAllBW | 394 | 272 | 69 | 4 | 1 | 118 | 30
MaxEW_MaxEAllBW | 394 | 359 | 68 | 1 | 1 | 34 | 30

Table 3-3 Energy-Based Feature Experimental Results of the WUW "Operator"
Based on the experiments performed, the WUW "Wildfire" achieved the best overall
result. For this word, four features scored higher than 90%. These results are shown
in Table 3-4.
Energy-Based Features (WUW: Wildfire)

Feature | Valid Data | Pt > 0 | % > 0 | Pt = 0 | % = 0 | Pt < 0 | % < 0
AEW_AE1SBW | 282 | 253 | 90 | 0 | 0 | 29 | 10
AE1sSW_AE1SBW | 282 | 261 | 93 | 0 | 0 | 21 | 7
AEW_AEAll | 340 | 173 | 51 | 0 | 0 | 167 | 49
AE1sSW_AEAll | 340 | 185 | 54 | 0 | 0 | 155 | 46
AEW_AEAllBW | 298 | 252 | 85 | 0 | 0 | 46 | 15
AE1sSW_AEAllBW | 298 | 265 | 89 | 0 | 0 | 33 | 11
MaxEW_MaxE1SBW | 282 | 258 | 91 | 8 | 3 | 16 | 6
MaxE1sSW_MaxE1SBW | 282 | 253 | 90 | 2 | 1 | 27 | 10
MaxEW_MaxEAll | 340 | 230 | 68 | 4 | 1 | 106 | 31
MaxE1sSW_MaxEAll | 340 | 219 | 64 | 4 | 1 | 117 | 34
MaxE1sSW_MaxEAllBW | 298 | 195 | 65 | 4 | 1 | 99 | 33
MaxEW_MaxEAllBW | 298 | 265 | 62 | 0 | 1 | 33 | 36

Table 3-4 Energy-Based Feature Experimental Results of WUW "Wildfire"
The four best features were found to be the following:
- AEW_AE1SBW: The relative change of the average energy of the entire WUW
compared to the average energy of the last section just before the WUW. It shows
that in 90% of the samples the average energy of the WUW is higher than that of the
previous section.
- AE1sSW_AE1SBW: The relative change of the average energy of the first section
of the WUW compared to the average energy of the last section before the WUW.
Using this feature, 93% of the samples show the first section of the WUW has higher
average energy.
- MaxEW_MaxE1SBW: The relative change of the maximum energy of the WUW
sections compared to the maximum energy in the last section before the WUW.
Using this feature, 91% of the samples show the WUW has higher maximum energy.
- MaxE1sSW_MaxE1SBW: The relative change of the maximum energy of the first
section of the WUW compared to the maximum energy of the last section before the
WUW. Using this feature, 90% of the samples show the first section of the WUW has
higher maximum energy.
Overall, the experimental results for the energy-based features are summarized in
Table 3-5 below. The best two features are AE1sSW_AE1SBW, with scores ranging
from 71% to 94%, and MaxEW_MaxE1SBW, with scores ranging between 66% and
87%. For both of these features, the lowest scores occur for the WUW "ThinkEngine".
The reason the word ThinkEngine yields relatively lower energy-based feature
values is that the sound "th" is an unvoiced fricative and has the lowest relative
intensity of all English sounds (Fry, 1979). In addition, the nasal sound in "Engine"
also has a relatively low intensity among English sounds. Despite the lower
performance results for the WUW "ThinkEngine", the performance of these two
features is between 84% and 94%.
WUW | Best Performance Feature | Percentage of Positive Relative Change
All WUWs | AE1sSW_AE1SBW, MaxEW_MaxE1SBW | 84%
Operator | AE1sSW_AE1SBW | 94%
Onword | MaxEW_MaxE1SBW | 87%
ThinkEngine | MaxEW_MaxE1SBW | 71%
Wildfire | AE1sSW_AE1SBW | 93%
Voyager | MaxEW_MaxE1SBW | 83%

Table 3-5 Summarized Energy-Based Features Experimental Results
From the above results, it can be concluded that the WUW is frequently emphasized
or accentuated compared to the rest of the words in the utterance. Thus the
hypothesis that the prominence of the WUWs is more significant than the
prominence of the non-WUWs is verified. In addition, the energy-based features
AE1sSW_AE1SBW and MaxEW_MaxE1SBW can be used reliably for discriminating
WUWs from non-WUWs with properly selected WUWs.
4. DATA COLLECTION
In this chapter, we introduce a revolutionary way to collect speech samples. The
preliminary design of this data collection system will also be presented in this
chapter.
4.1 INTRODUCTION TO THE DATA COLLECTION
After developing the WUW discriminating features based on the two prosodic
measurements of pitch and energy, described in Chapters 2 and 3, it was realized
that the data used to generate those features may not be the most suitable. The
corpus used in this project is the WUWII corpus, which was collected by Dr.
Këpuska in 2002. It provides data on a number of WUWs in alerting situations but
does not contain data for the same words used in referential situations. As a result,
it was only possible to perform an analysis based on the changes between alerting
WUWs and the overall sentence, and not on cases in which the same word appears
in a referential situation. Another drawback of the current WUWII corpus is that it
contains speech that is not spontaneous. The speakers whose voices were used to
develop this data set were given each WUW and asked to make up a sentence using
it as a WUW in an alerting situation. Under these laboratory conditions, some
speakers may change the way they normally speak.
In order to perform a more complete analysis, we will need a corpus which
includes both alerting and referential WUWs in context with natural spoken
utterances. Therefore, it was decided to use a suggestion made by Dr. R. Wallace,
to extract audio samples from a movie or a TV program.
Extracting speech samples from movies and TV programs has the following
advantages compared to the data collection method used in developing WUWII
corpus:
1. The speech examples are more natural. Speech from professional actors sounds
natural since they tend to think and speak like a particular character and act out the
situation of the character they are depicting.
2. The data collection process will cost much less since we are not
compensating individuals to record their voices. There are no copyright
problems since the data is not being sold or used for commercial purposes.
3. A large amount of data can be collected in a short period of time once the
process is fully automated.
4. The voice channel data is of CD quality. In this project, speech data was
extracted from recorded videos rather than over conventional telephone
lines as was done in developing the WUWII corpus.
5. No manual labeling is required. We plan to use the transcripts obtained
from the video channel (see section 4.2), which provides time stamps for
all spoken sentences.
In view of the above listed advantages, it was decided to design an automatic data
collection system to collect specific speech data suitable for prosodic analysis of
proper first names used in referential contexts vs. alerting (or WUW) contexts.
4.2 SYSTEM DESIGN
The data collection project is part of the prosodic feature analysis project, which is
illustrated by the program flow chart shown in Figure 4-1. The prosodic feature
analysis project can be divided into three sub-projects. In Figure 4-1, the green
boxes represent the prosodic feature extraction project, which is described in
Chapters 2 and 3 of this thesis.
Figure 4-1 Program Flow Chart
The blue boxes depict the WUW data collection project. Finally, the purple boxes
represent a future project on video analysis.
In the prosodic feature analysis project, we use the prosodic features generated
from acoustic measurements to differentiate the context of the words. As part of the
WUW data collection project, language analysis tools will be used to automatically
classify the words of interest, in this case as referential or alerting. At the moment,
the capabilities of this tool, RelEx (Novamente LLC), must be augmented in order to
achieve this goal. The outcome of the WUW Speech Data Collection project will not
only be a specialized corpus for the prosodic analysis project, but also a confirmation
of the results from the prosodic analysis.
The detailed program flow chart of the WUW Speech Data Collection System is
shown in Figure 4-2 below.
Figure 4-2 WUW Audio Data Collection System Program Flow Diagram
The inputs to the system are (1) the video file of a movie or TV program, (2) a video
transcription file, which will be used if provided and otherwise extracted from the
video stream by the subtitle extractor SubRip (Zuggy), and (3) an English dictionary
of American first names (Campbell). In the case that there is no video transcription
file and the subtitles are encoded into the video stream, the subtitle extractor SubRip
(Zuggy) extracts the subtitles and the time stamps of each sentence from the video
stream. An example of a transcription file extracted in this manner is presented in
Figure 4-3 below.
Figure 4-3 Example of Video Transcription File
The transcription files provide the following information: date and time when the
files were created, the subtitle index number, the start time and end time of each
subtitle, and the subtitle transcription.
The media audio extractor (AOAMedia.com) extracts the audio channel from the
video file. Then, using the English dictionary of first names and the sentence
transcription with time markers, an application program called the sentence parser,
developed by VoiceKey team members (Pattarapong, Ronald, & Xeres, Sentence
Parser Program, 2009), selects sentences that include English first names. Figure 4-4
shows an example of the output of the sentence parser program.
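A simplified sketch of this selection step is shown below, assuming a standard SubRip (.srt) transcription file and a plain-text list of first names; the file handling, regular expressions, and output format are illustrative and are not those of the VoiceKey sentence parser:

import re

TIME = re.compile(r"(\d\d):(\d\d):(\d\d),(\d\d\d)")

def to_seconds(stamp):
    h, m, s, ms = (int(g) for g in TIME.match(stamp).groups())
    return 3600 * h + 60 * m + s + ms / 1000.0

def sentences_with_names(srt_path, names_path):
    # Yield (start_sec, end_sec, text) for every subtitle whose text contains
    # a word found in the first-name dictionary.
    with open(names_path) as f:
        names = {line.strip().lower() for line in f if line.strip()}
    with open(srt_path, encoding="utf-8", errors="ignore") as f:
        blocks = f.read().split("\n\n")
    for block in blocks:
        lines = [ln for ln in block.splitlines() if ln.strip()]
        if len(lines) < 3 or "-->" not in lines[1]:
            continue
        start, end = (to_seconds(t.strip()) for t in lines[1].split("-->"))
        text = " ".join(lines[2:])
        words = {w.lower() for w in re.findall(r"[A-Za-z']+", text)}
        if words & names:
            yield start, end, text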
Figure 4-4 An example of the output of the sentence parser application program
In the next step, the audio parser (Pattarapong, Ronald, & Xerxes, Audio Parser
Program, 2009) will use the information from the sentence parser to extract the
corresponding audio sections from the audio file produced by media audio
extractor.
After the extraction of a sentence that contains an English first name, RelEx
(Novamente LLC) is used to analyze the selected sentence. RelEx is an
English-language semantic relationship extractor based on the Carnegie-Mellon Link
Parser (Temperley, Lafferty, & Sleator, 2000). RelEx is able to provide sentence
information on various parts of speech, such as subjects, direct objects, and indirect
objects, and various word tags, such as verb, gender, and noun. The current task in
the WUW data collection project is to develop a rule-based process or a statistical
pattern recognition process based on the relationship information produced by
RelEx. Ultimately, the system will be able to accurately determine whether the name
in the sentence is used in a WUW or non-WUW context.
A necessary step in the automation process is to obtain precise time markers
indicating the words of interest. To achieve this, one could use the Hidden Markov
Model Toolkit (HTK) (Machine Intelligence Laboratory of the Cambridge University
Engineering Department) to perform forced alignment of the audio stream. The HTK
was initially developed by the Machine Intelligence Laboratory (formerly known as
the Speech, Vision, and Robotics Group) of the Cambridge University Engineering
Department (CUED). HTK uses Hidden Markov Models (HMMs), which compare the
acoustic features of the incoming audio with the known acoustic features of all 41
English phonemes to predict the most likely combination of phonemes associated
with the incoming audio. It then maps the words from the lexicon dictionary. In our
case, since the transcription of the sentences is known, HTK is used to map the
phonemes of known words to the corresponding time intervals. The phoneme time
labels, or equivalently the word boundaries of the spoken sentence, are used to
locate the WUWs or non-WUWs in the time domain. Note that this step can also be
performed with Microsoft's Speech Development Kit (SDK), a speech recognition
system that is fully integrated into Microsoft's Vista OS. The advantage of Microsoft's
system is that we do not need to train it, since the acoustic models are built in.
However, an application incorporating Microsoft's SDK features would need to be
developed. Alternatively, HTK does not require any significant integration coding;
however, it does require acoustically accurate models. Automation of the described
data collection process will be made possible by integrating the outputs from RelEx
with the forced alignment.
With time-segmented sentence labels of the audio stream indicating the WUW or
non-WUW context, a new corpus can be generated, similar to the WUWII corpus.
This data will then be used to perform prosodic analysis and to develop new features
or refine existing prosodic features. It is expected that further study with the new
data will not only validate the current prosodic analysis results, but will also provide
information useful for developing new prosodic features. The ultimate goal of this
speech data collection project is to build a suitable specialized corpus for the
research on finding reliable features to discriminate between WUWs and non-WUWs
in both alerting and referential contexts.
5. CONCLUSIONS
This thesis investigated two types of prosodic features and presented the design of
an innovative speech data collection system. The pitch-based feature experimental
results for all five WUWs are shown in Table 5-1, and the energy-based feature
results are shown in Table 5-2. In this study we arbitrarily decided that the positive
relative change of any feature should be 80% or higher before we would consider it
a reliable discriminator between WUWs and non-WUWs. In addition, it was found
that no single feature works best on all five WUWs used in this study. Each WUW
may require a different feature to achieve the best performance. It can be concluded
that no single feature will discriminate all WUWs equally well between their use in
alerting contexts and referential contexts.
Pitch Features (WUW: All WUWs)

Feature | Valid Data | Pt > 0 | % > 0 | Pt = 0 | % = 0 | Pt < 0 | % < 0
APW_AP1SBW | 1415 | 726 | 51 | 0 | 0 | 689 | 49
AP1sSW_AP1SBW | 1415 | 735 | 52 | 0 | 0 | 680 | 48
APW_APALL | 2282 | 947 | 41 | 0 | 0 | 1335 | 59
AP1sSW_APALL | 2282 | 996 | 44 | 2 | 0 | 1284 | 56
APW_APALLBW | 2188 | 962 | 44 | 0 | 0 | 1226 | 56
AP1sSW_APALLBW | 2188 | 1003 | 46 | 2 | 0 | 1183 | 54
MaxPW_MaxP1SBW | 1415 | 948 | 67 | 53 | 4 | 414 | 29
MaxP1sSW_MaxP1SBW | 1415 | 719 | 51 | 54 | 4 | 642 | 45
MaxPW_MaxPAll | 2282 | 1020 | 45 | 109 | 5 | 1153 | 51
MaxP1sSW_MaxPAll | 2282 | 716 | 31 | 213 | 9 | 1353 | 59
MaxP1sSW_MaxPAllBW | 2188 | 1069 | 49 | 111 | 5 | 1008 | 46
MaxPW_MaxPAllBW | 2188 | 1003 | 35 | 2 | 10 | 1183 | 55
Table 5-1 Pitch Features Result of All WUWs
Table 5-1 shows the performance results of the pitch-based features when all five
different WUWs are included. As can be seen from the table, the best performance is
only 67%, which is not high enough to allow reliable discrimination between WUWs
and non-WUWs.
Table 5-2 below shows the energy-based feature performance results for all five
WUWs used in the present study.
Energy Features (WUW: All WUWs)

Feature | Valid Data | Pt > 0 | % > 0 | Pt = 0 | % = 0 | Pt < 0 | % < 0
AEW_AE1SBW | 1479 | 1164 | 79 | 0 | 0 | 315 | 21
AE1sSW_AE1SBW | 1479 | 1283 | 84 | 1 | 0 | 240 | 16
AEW_AEAll | 2175 | 1059 | 49 | 9 | 9 | 1116 | 51
AE1sSW_AEAll | 2175 | 1155 | 53 | 2 | 0 | 1018 | 47
AEW_AEAllBW | 1969 | 1427 | 72 | 0 | 0 | 542 | 28
AE1sSW_AEAllBW | 1969 | 1562 | 79 | 3 | 0 | 404 | 21
MaxEW_MaxE1SBW | 1479 | 1244 | 84 | 20 | 1 | 215 | 15
MaxE1sSW_MaxE1SBW | 1479 | 1221 | 83 | 13 | 1 | 245 | 17
MaxEW_MaxEAll | 2175 | 1373 | 63 | 13 | 1 | 245 | 17
MaxE1sSW_MaxEAll | 2175 | 1336 | 61 | 25 | 1 | 814 | 37
MaxE1sSW_MaxEAllBW | 1969 | 1209 | 61 | 16 | 1 | 744 | 38
MaxEW_MaxEAllBW | 1969 | 1562 | 60 | 3 | 1 | 404 | 39
Table 5-2 Energy Features Result of All WUWs
One can see from Table 5-2 that there are several energy-based features with
positive relative changes above 80%. In addition, for some individual WUWs,
multiple energy-based features achieve a positive relative change of 90% or more,
as covered in Section 3.3 and detailed in Appendix B. These results provide firm
evidence that there is a significant increase in the energy measurement when WUWs
are spoken. These results confirm that the prominence of WUWs is more significant
than the prominence of non-WUWs. Therefore, we can conclude that energy-based
features can be used to discriminate between WUWs and non-WUWs. A future
improvement would be to quantify the level of change between WUWs and
non-WUWs.
6. FUTURE WORK
Two potential solutions are being considered to address the insufficient accuracy
reported in this work for pitch-based features:
1. Build a specialized corpus which contains the same words used both as WUWs and
as non-WUWs. The speech sentences in the current corpus, WUWII, only contain
WUWs but no non-WUWs. A new speech data collection system is presented in
Chapter 4, which will allow creation of a database from the collected data that
includes both WUWs and non-WUWs.
2. Use different approaches to defining pitch-based features. For example, in addition
to the average and maximum pitch measurements of the WUW, how the pitch pattern
changes should also be considered.
Finally, the new data collection system, which collects both WUWs and non-WUWs,
has been designed and partially implemented. Work on this data collection system
will be continued by the VoiceKey group at the Florida Institute of Technology. The
ultimate goal of this speech data collection project is to build a suitable specialized
corpus of data samples in order to find suitable prosodic features to reliably
discriminate between WUWs and non-WUWs.