1. INTRODUCTION

The objective of this thesis is to research and develop prosodic features for discriminating proper names used in an alerting context (e.g., "John, can I have that book?") from a referential context (e.g., "I saw John yesterday"). Prosodic measurements based on pitch and energy are analyzed in order to introduce new prosodic features into the Wake-Up-Word Speech Recognition System (Këpuska V. C., 2006). In the process of finding and analyzing these prosodic features, an innovative data collection method was designed and developed.

In a conventional automatic speech recognition system, users are required to physically activate the recognition system by clicking a button or by manually starting the application. Using the Wake-Up-Word Speech Recognition System, a person can activate a system by voice alone. The Wake-Up-Word Speech Recognition System will eventually further improve the way people use speech recognition by enabling speech-only interfaces. In the Wake-Up-Word Speech Recognition System, a word or phrase is used as a "Wake-Up-Word" (WUW), indicating to the system that the user requires its attention (i.e., an alerting context). Any user can activate the system by uttering a WUW (e.g., "Operator"), which enables the application to accept the command that follows (e.g., "Next slide please"). The non-Wake-Up-Words (non-WUWs) include WUWs uttered in a referential context, other words, sounds, and noise.

Since the WUW may also occur within a referential context, indicating that the user does not need attention from the system, it is important for the system to discriminate accurately between the two. The following examples further demonstrate the use of the word "Operator" in those two contexts:

Example sentence 1: "Operator, please go to the next slide." (alerting context)
Example sentence 2: "We are using the word operator as the WUW." (referential context)

The above cases indicate different user intentions. In the first example, the word "operator" is being used as a way to alert the system and get its attention. In the second example the same word, "operator", is used, but this time in a referential context. The current Wake-Up-Word Speech Recognition System implements only the pre- and post-WUW silence as a prosodic feature to differentiate the alerting and referential contexts. In this thesis, pitch- and energy-based prosodic features are investigated. The problem of general prosodic analysis is introduced in Section 1.1.

In Chapter 2, the use of pitch as a prosodic feature is described. In general, pitch represents the intonation of speech, and intonation conveys linguistic and paralinguistic information (Lehiste, 1970). The definition and characteristics of pitch are covered in Section 2.1. In Section 2.2, a pitch estimation method known as the Enhanced Super Resolution Fundamental Frequency Determinator, or eSRFD (Bagshaw, 1994), is introduced. Finally, in Section 2.3, multiple pitch-based features are derived from the pitch measurements in order to find the best feature for discriminating WUWs used in alerting contexts from those used in referential contexts.

In Chapter 3, an additional prosodic feature based on energy measurement is described. The definition of prominence, an important prosodic feature based on energy and pitch, and its characteristics are covered in Section 3.1. In Section 3.2, a description of the energy computation is presented.
Finally, in Section 3.3, a derivation of multiple energy features from the energy measurement is presented and analyzed.

In Chapter 4, an innovative approach to speech data collection is presented. After a number of prosodic analysis experiments conducted using the WUWII Corpus (Tudor, 2007), validation of the results on a different data set was deemed necessary. Since, to our knowledge, no suitable specialized speech database is available, an idea suggested by Dr. R. Wallace was adopted: to collect the data from movies. We designed a system which extracts speech from the audio channel and, if necessary, video information from recorded media (e.g., DVDs) of movies and/or TV series. This system is currently under development by Dr. Këpuska's VoiceKey Group. The problem definition and an introduction to the system are given in Section 4.1, followed by the system design in Section 4.2.

1.1 PROSODIC ANALYSIS

The word prosody refers to the intonational and rhythmic aspects of a language (Merriam-Webster Dictionary). Its etymology comes from ancient Greek, where it referred to singing with instrumental music. In later times, the word was used for the "science of versification" and the "laws of meter" (William J. Hardcastle, 1997), governing the modulation of the human voice in reading poetry aloud. In modern phonetics the word prosody most often refers to those properties of speech that cannot be derived from the segmental sequence of phonemes underlying human utterances.

Human speech cannot be fully characterized as the expression of phonemes, syllables, or words. For example, we can notice that the length of segments or syllables is shortened or lengthened in normal speech, apparently in accordance with some pattern. We can also hear that pitch moves up and down in some nonrandom way, providing speech with a recognizable melody. In addition, one can hear that some syllables or words are made to sound more prominent than others. From the phonological point of view, prosody can be classified into prosodic structure, tune, and prominence, which can be described as follows:

1. Prosodic structure refers to the noticeable breaks or disjunctures between words in sentences, which can also be interpreted as the duration of the silence between words as a person speaks. This factor has been considered in the current Wake-Up-Word Speech Recognition System, where a minimal silence period before and after the WUW must be present. The silence period just before the WUW is usually longer than the average silence period around non-WUWs or other parts of the sentence.

2. Tune refers to the intonational melody of an utterance (Jurafsky & Martin), which can be quantified by the pitch measurement, also known as the fundamental frequency of the speech. The details of the pitch characteristics, the pitch estimation algorithm, and the usage of pitch features are presented and explained in Chapter 2.

3. Prominence includes the measurement of stress and accent in speech. Prominence is measured in our experiments using the energy of the sound signal. The details of the energy computation, the feature derivation based on energy, and the experimental results are presented in Chapter 3.

2. PITCH FEATURES

In this chapter, the intonational melody of an utterance, computed using pitch measurements, is described. The pitch characteristics and a comparison of various pitch estimation algorithms (Bagshaw, 1994) are covered in Section 2.1.
Based on the comparison of multiple fundamental frequency determination algorithms (FDAs), the Enhanced Super Resolution Fundamental Frequency Determinator (eSRFD) (Bagshaw, 1994) is selected as the algorithm of choice to perform the pitch estimation. The details of the eSRFD algorithm are covered in Section 2.2. The derivation of multiple pitch-based features and their performance evaluations are covered in Section 2.3.

2.1 PITCH AND PITCH ESTIMATION METHODS

Intonation is one of the prosodic features that contain information that may be the key to discriminating between the referential context and the alerting context. The intonation of speech is strictly interpreted as "the ensemble of pitch variations in the course of an utterance" (Hart, 1975). Unlike tonal languages such as Mandarin Chinese, whose lexical forms are characterized by different levels or patterns of pitch on a particular phoneme, pitch in intonational languages such as English, German, the Romance languages, and Japanese is used syntactically. In addition, intonation patterns in the intonational languages are grouped over a number of words, forming what are called intonation groups. Intonation groups of words are usually uttered in one single breath. The pitch measurements in the intonational languages reveal the emotion of a person and/or the intention of his/her speech. For example, consider the following sentence:

Can you pass me the phone?

The pattern of continuously rising pitch over the last three words of the above sentence indicates a request.

Strictly speaking, pitch is defined as the fundamental frequency, or fundamental repetition rate, of a sound. The typical pitch range is between 60-200 Hz for adult males and 200-400 Hz for adult females and children. The contraction of the vocal folds in humans produces a relatively high pitch and, conversely, the expanded vocal folds produce a lower pitch. This explains why a person's voice rises in pitch when he/she gets nervous or surprised. That human males usually have a lower voice pitch than females and children can also be explained by the fact that males usually have longer and larger vocal folds.

After years of development of pitch estimation algorithms, pitch estimation methods can be classified into the following three categories:

1. Frequency-domain methods, such as CFD (cepstrum-based F0 determinator) and HPS (harmonic product spectrum), use a frequency-domain representation of the speech signal to find the fundamental frequency.

2. Time-domain methods, such as FBFT (feature-based F0 tracker) (Phillips, 1985), which uses perceptually motivated features, and PP (parallel processing) methods, produce fundamental frequency estimates by analyzing the waveform in the time domain.

3. Cross-correlation methods, such as IFTA (integrated F0 tracking algorithm) and SRFD (super resolution F0 determinator), use a waveform similarity metric based on a normalized cross-correlation coefficient.

The eSRFD (Enhanced Super Resolution Fundamental Frequency Determinator) (Bagshaw, 1994) was chosen to extract the pitch measurement for the Wake-Up-Word because of its high overall accuracy. According to Bagshaw's experiments, the eSRFD algorithm has a combined voiced and unvoiced error rate below 17% and a low-gross fundamental frequency error rate of 2.1% and 4.2% for males and females, respectively. Figure 2-1 and Figure 2-2 show the error rate comparison charts between eSRFD and other FDAs for male and female voices, respectively.
Figure 2-1 FDA evaluation chart, male speech: low-gross and high-gross error, voiced error, and unvoiced error rates (%) for the CFD, HPS, FBFT, PP, IFTA, SRFD, and eSRFD algorithms. Reproduced from (Bagshaw, 1994).

In Figure 2-1 and Figure 2-2, the purple bars indicate the low-gross F0 error, which refers to the halving error, where the pitch has been estimated wrongly with a value about half of the actual pitch. The green bars represent the high-gross F0 error, which refers to the doubling error, where the pitch has been estimated wrongly with a value about twice that of the actual pitch. The voiced error, represented by the red bars, refers to unvoiced frames which have been misidentified as voiced by the FDA. Finally, the blue bars show the unvoiced errors, in which voiced data has been misidentified as unvoiced.

Figure 2-2 FDA evaluation chart, female speech: the same error categories and algorithms as in Figure 2-1. Reproduced from (Bagshaw, 1994).

Figure 2-1 and Figure 2-2 show the male and female fundamental frequency evaluation charts, respectively. They show that the eSRFD algorithm achieves the lowest overall error rate. This result was confirmed in a more recent study (Veprek & Scordilis, 2002). Consequently, eSRFD was chosen to be the FDA used in the present project.

2.2 ESRFD PITCH ESTIMATION ALGORITHM

The eSRFD (Bagshaw, 1994) is an advanced version of SRFD (Medan, 1991). The program flow chart of the eSRFD FDA is illustrated in Figure 2-3. The idea behind the SRFD (Medan, 1991) algorithm is to use a normalized cross-correlation coefficient to quantify the degree of similarity between two adjacent, non-overlapping sections of speech. In eSRFD, a frame is divided into three consecutive sections instead of two as in the original SRFD algorithm.

At the beginning, the sample waveform is passed through a low-pass filter to remove signal noise. The sample utterance is then divided into non-overlapping frames of 6.5 ms length (t_interval = 6.5 ms), and each frame contains a set of samples, s_N, where

$$s_N = \{\, s(i) \mid i = -N_{max}, \ldots, N + N_{max} \,\},$$

which is divided into three consecutive segments, each containing an equal number of samples, n (the value of n varies during the candidate search). The segmentation is defined by Equation 2-1 and further illustrated in Figure 2-4 below.

$$x_n = \{\, x(i) = s(i - n) \mid i = 1, \ldots, n \,\}$$
$$y_n = \{\, y(i) = s(i) \mid i = 1, \ldots, n \,\}$$
$$z_n = \{\, z(i) = s(i + n) \mid i = 1, \ldots, n \,\}$$

Equation 2-1

Figure 2-3 eSRFD flow chart

Figure 2-4 Analysis segments of the eSRFD FDA

In eSRFD, each frame is processed by a silence detector, which labels the frame as unvoiced or silent if the sum of the absolute values of x_min, x_max, y_min, y_max, z_min, and z_max is smaller than a preset value (e.g., a 50 dB signal-to-noise level); conversely, the frame is voiced if this sum is equal to or larger than the preset value. No fundamental frequency search is performed if the frame is marked as unvoiced. In those cases where at least one of the segments x_n, y_n, or z_n is not defined, which usually happens at the beginning and the end of the speech file, the frames are labeled as unvoiced and no FDA is applied to them.

If the frame is not labeled as unvoiced, candidate values for the fundamental period are searched from values of n within the range N_min to N_max by using the normalized cross-correlation coefficient P_x,y(n) as described by Equation 2-2.
$$P_{x,y}(n) = \frac{\sum_{j=1}^{\lfloor n/L \rfloor} x(jL)\,y(jL)}{\sqrt{\sum_{j=1}^{\lfloor n/L \rfloor} x(jL)^2 \,\sum_{j=1}^{\lfloor n/L \rfloor} y(jL)^2}}, \qquad n \in \{\,N_{min} + iL \mid i = 0, 1, \ldots;\; N_{min} \le n \le N_{max}\,\}$$

Equation 2-2

In Equation 2-2, the decimation factor L is used to lower the computational load of the algorithm. Smaller L values allow a higher resolution search but increase the computational load of the FDA; larger L values produce faster computation with a lower resolution search. The value of L is set to 1 here, since the purpose of this research is to find as accurately as possible the relationship between pitch measurements and WUW words; computational speed is considered secondary and thus is not taken into account. However, the value of L will be reconsidered when this algorithm is integrated into the WUW Speech Recognition System.

Figure 2-5 Analysis segments for P_x,y(n) in the eSRFD

The candidate values of the fundamental period of a frame are found by locating peaks in the normalized cross-correlation P_x,y(n). If a peak value exceeds a specified threshold, T_srfd, then the frame is further considered to be a voiced candidate. This threshold is adaptive and depends on the voicing classification of the previous frame and three preset parameters. The definition of T_srfd is given in Equation 2-3. If the previous frame is unvoiced or silent, then T_srfd is set to 0.88. If the previous frame is voiced, then T_srfd is equal to the larger of 0.75 and 0.85 times the value P'_x,y of the previous frame. The threshold is adjusted because the present frame has a higher probability of being voiced if the previous frame is also voiced.

$$T_{srfd} = \begin{cases} 0.88 & \text{if the previous frame is unvoiced or silent,} \\ \max[\,0.75,\; 0.85\,P'_{x,y}(n'_0)\,] & \text{if the previous frame is voiced.} \end{cases}$$

Equation 2-3

If no candidates for the fundamental period are found in the frame, the frame is reclassified as unvoiced and no further processing is applied to it. If, on the other hand, the frame is classified as voiced, the following process is used to find the optimal candidate.

After the first normalized cross-correlation coefficient P_x,y is obtained, a second normalized cross-correlation coefficient, P_y,z, is calculated for the voiced frame. It has the same form as Equation 2-2, applied to the segments y_n and z_n, and is described by Equation 2-4.

$$P_{y,z}(n) = \frac{\sum_{j=1}^{\lfloor n/L \rfloor} y(jL)\,z(jL)}{\sqrt{\sum_{j=1}^{\lfloor n/L \rfloor} y(jL)^2 \,\sum_{j=1}^{\lfloor n/L \rfloor} z(jL)^2}}, \qquad n \in \{\,N_{min} + iL \mid i = 0, 1, \ldots;\; N_{min} \le n \le N_{max}\,\}$$

Equation 2-4

After the second normalized cross-correlation, a score is given to every candidate. If a candidate pitch value of a frame has both its P_x,y and P_y,z values larger than T_srfd, then a score of 2 is given to the candidate. If only P_x,y is above T_srfd, then a score of 1 is assigned. A higher score indicates a higher probability that the candidate represents the fundamental period of the frame. After the scores are assigned, if there are one or more candidates with a score of 2, then all candidates with a score of 1 in that frame are removed from the candidate list. If there is only one candidate with a score of 2, then that candidate is taken as the best estimate of the fundamental period of that particular frame. If there are multiple candidates with a score of 1 but no candidate with a score of 2, then an optimal fundamental period is sought from the remaining candidates.
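To make the candidate search concrete, the following is a minimal NumPy sketch of the first cross-correlation stage (Equations 2-2 and 2-3). It is an illustrative reconstruction, not Bagshaw's or the thesis implementation: the function names and the framing convention are our own assumptions, and for simplicity it keeps every lag whose coefficient exceeds the threshold rather than locating only the peaks.

```python
import numpy as np

def srfd_threshold(prev_voiced, prev_rho):
    # Equation 2-3: the adaptive threshold T_srfd.
    if not prev_voiced:
        return 0.88
    return max(0.75, 0.85 * prev_rho)

def candidate_periods(frame, n_min, n_max, L=1, t_srfd=0.88):
    """Scan candidate fundamental periods n for one frame.

    `frame` is assumed to hold the samples covering the x and y analysis
    segments, so that frame[0:n] plays the role of x_n (the n samples
    before the frame centre) and frame[n:2n] the role of y_n.  Returns
    (n, P_xy(n)) pairs whose normalized cross-correlation (Equation 2-2)
    exceeds t_srfd.
    """
    candidates = []
    for n in range(n_min, n_max + 1, L):
        x = frame[0:n:L].astype(float)       # decimated segment x_n
        y = frame[n:2 * n:L].astype(float)   # decimated segment y_n
        denom = np.sqrt(np.dot(x, x) * np.dot(y, y))
        if denom == 0.0:
            continue                          # degenerate (all-zero) segment
        rho = np.dot(x, y) / denom            # P_x,y(n)
        if rho > t_srfd:
            candidates.append((n, rho))
    return candidates
```

The same helper applied to frame[n:2n] and frame[2n:3n] would give the second coefficient P_y,z(n) of Equation 2-4.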
In the case of multiple candidates with a score of 1 but no candidate with a score of 2, the candidates are sorted in ascending order of fundamental period. The last candidate on the list has the largest fundamental period, denoted n_M, while n_m denotes the fundamental period of the m-th candidate.

Figure 2-6 Analysis segments for q(n_m) in the eSRFD

A third normalized cross-correlation coefficient, q(n_m), between two sections of length n_M spaced n_m apart, is then calculated for each candidate. The section of length n_M in a frame is illustrated in Figure 2-6, and Equation 2-5 describes the normalized cross-correlation coefficient q(n_m) used in this case.

$$q(n_m) = \frac{\sum_{j=1}^{n_M} s(j)\,s(j + n_m)}{\sqrt{\sum_{j=1}^{n_M} s(j)^2 \,\sum_{j=1}^{n_M} s(j + n_m)^2}}$$

Equation 2-5

After the third normalized cross-correlation coefficient is computed, the q(n_m) value of the first candidate on the list is taken as the provisional optimum. If a later candidate's q(n_m), multiplied by 0.77, is larger than the current optimal value, then that candidate's q(n_m) becomes the new optimal value. The same rule is applied throughout the entire list of candidates, yielding the optimal candidate.

For the case where only one candidate has a score of 1 and there is no candidate with a score of 2, the probability that this candidate represents the true fundamental period of the frame is low. In such a case, if both the previous frame and the next frame are silent, then the current frame is an isolated frame and is reclassified as silent. If either the previous or the next frame is voiced, then we assume the candidate of the current frame is the optimal one and it defines the fundamental period of the current frame.

The above algorithm has a high probability of misidentifying voiced frames as unvoiced or silent. To counteract this imbalance, a bias is applied when all of the following three conditions are satisfied:

1. The two previous frames were voiced frames.
2. The fundamental period of the previous frame is not temporarily on hold.
3. The fundamental frequency of the previous frame is less than 7/4 times the fundamental frequency of its next voiced frame and greater than 5/8 of it.

After the fundamental frequency is obtained, and in order to further minimize the occurrence of doubling or halving errors, the pitch contour is passed through a median filter. The median filter has a default length of 7, but the length decreases to 5 or 3 when there are fewer than 7 consecutive voiced frames. Figure 2-7 shows an example of doubling points being corrected by the median filter: the top row shows the pitch measurement generated by the eSRFD FDA, and the bottom row shows the corrected measurement after the median filter. As can be seen from the figure, the two points marked as doubling errors were corrected by the median filter.

Figure 2-7 Median filter example
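The following is a minimal sketch, under our own naming and framing assumptions, of such a variable-length median post-filter: the window shrinks from 7 to 5 or 3 on short voiced runs, as described above, and frames near the edges of a run are left unfiltered.

```python
import numpy as np

def median_smooth_pitch(f0, voiced, max_len=7):
    """Median-filter a pitch contour to suppress isolated doubling and
    halving errors.  `f0` is the per-frame pitch track; `voiced` is a
    boolean sequence marking voiced frames.  A sketch only."""
    f0 = np.asarray(f0, dtype=float)
    out = f0.copy()
    i = 0
    while i < len(f0):
        if not voiced[i]:
            i += 1
            continue
        j = i
        while j < len(f0) and voiced[j]:
            j += 1                      # [i, j) is a maximal voiced run
        run = f0[i:j]
        win = min(max_len, len(run))    # largest odd window (3, 5, or 7)
        if win % 2 == 0:
            win -= 1
        if win >= 3:
            half = win // 2
            for k in range(half, len(run) - half):
                out[i + k] = np.median(run[k - half:k + half + 1])
        i = j
    return out
```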
We applied the above pitch estimation method to the WUWII (Wake-Up-Word II) corpus. The WUWII corpus contains 3410 sample utterances, and each utterance contains at least one of five different WUWs: 'Wildfire', 'Operator', 'ThinkEngine', 'Onword', and 'Voyager'. Figure 2-8 displays a sample utterance containing the following sentence, where the word "Wildfire" is the WUW:

"Hi. You know, I have this cool wildfire service and, you know, I'm gonna try to invoke it right now. Wildfire"

Figure 2-8 Example, WUWII00073_009.ulaw

In Figure 2-8, the first row shows the waveform of the speech, the second row shows the pitch estimation from the eSRFD FDA, the third row shows the pitch estimation after the median filter, and the last row shows the spectrogram of the speech. The WUW of this sentence is 'Wildfire', which is the section delineated between the two red lines.

2.3 PITCH-BASED FEATURES

The pattern of the fundamental frequency contour of an utterance waveform represents the intonation of the speech. To the best of our knowledge, the problem of discriminating between words used in an alerting context and words used in a referential context has never been addressed before. To study it, a specialized speech data corpus containing WUWs is necessary. In this project, the corpus named WUWII (Këpuska V.) was chosen. The WUWII corpus contains 3410 sample utterances, and each utterance contains at least one of the five different WUWs: 'Wildfire', 'Operator', 'ThinkEngine', 'Onword', and 'Voyager'.

Our hypothesis is that the intonation rises when the WUW is spoken; thus there should be an increase in the average pitch and/or maximum pitch of the WUW sections compared to the non-WUW sections of the utterance. Based on this hypothesis, the average pitch and maximum pitch of the WUWs are considered, and twelve pitch-based features are derived and listed in Table 2-1. Each feature is expressed as the relative change between two quantities A and B, defined in Equation 2-6 as:

Relative change between A and B = (A - B) / B

Equation 2-6 Relative Change

APW_AP1SBW: The relative change of the average pitch of the WUW to the average pitch of the section just before the WUW.
AP1sSW_AP1SBW: The relative change of the average pitch of the first section of the WUW to the average pitch of the section just before the WUW.
APW_APAll: The relative change of the average pitch of the WUW to the average pitch of the entire speech sample excluding the WUW sections.
AP1sSW_APAll: The relative change of the average pitch of the first section of the WUW to the average pitch of the entire speech sample excluding the WUW sections.
APW_APAllBW: The relative change of the average pitch of the WUW to the average pitch of the entire speech sample before the WUW.
AP1sSW_APAllBW: The relative change of the average pitch of the first section of the WUW to the average pitch of the entire speech sample before the WUW.
MaxPW_MaxP1SBW: The relative change of the maximum pitch in the WUW sections to the maximum pitch in the section just before the WUW.
MaxP1sSW_MaxP1SBW: The relative change of the maximum pitch in the first section of the WUW to the maximum pitch in the section just before the WUW.
MaxPW_MaxPAll: The relative change of the maximum pitch of the WUW to the maximum pitch of the entire speech sample excluding the WUW sections.
MaxP1sSW_MaxPAll: The relative change of the maximum pitch of the first section of the WUW to the maximum pitch of the entire speech sample excluding the WUW sections.
MaxP1sSW_MaxPAllBW: The relative change of the maximum pitch in the first section of the WUW to the maximum pitch of the entire speech sample before the WUW.
MaxPW_MaxPAllBW: The relative change of the maximum pitch in the WUW sections to the maximum pitch of the entire speech sample before the WUW.

Table 2-1 Pitch-Based Feature Definitions
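As an illustration of how these definitions translate into a computation, the sketch below evaluates Equation 2-6 and one representative feature, APW_AP1SBW, from a per-frame pitch track. The function names and the convention that unvoiced frames carry a pitch value of 0 are our own assumptions, not the thesis code; section boundaries are assumed to come from the corpus word-boundary labels.

```python
import numpy as np

def relative_change(a, b):
    # Equation 2-6: relative change between A and B.
    return (a - b) / b

def apw_ap1sbw(f0, wuw_slice, before_slice):
    """Sketch of the APW_AP1SBW feature from Table 2-1: the relative
    change of the average WUW pitch against the average pitch of the
    section just before the WUW."""
    wuw = f0[wuw_slice]
    before = f0[before_slice]
    wuw = wuw[wuw > 0]              # keep voiced frames only
    before = before[before > 0]
    if wuw.size == 0 or before.size == 0:
        return None                 # sample not counted as valid data
    return relative_change(wuw.mean(), before.mean())

# Example (hypothetical frame indices): a positive return value counts
# toward the "Pt > 0" column of Table 2-2 below.
# apw_ap1sbw(np.array(f0_track), slice(120, 180), slice(90, 120))
```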
The pitch-based feature values have been calculated for the combination of all five different WUWs and for each of the five WUWs individually. The detailed performance results are shown in Appendix A. In this section, the results of the pitch-based features are shown and explained for the combination of all five WUWs, presented in Table 2-2 below.

Pitch-Based Features (WUW: All WUWs)

Feature               Valid Data   Pt > 0   % > 0   Pt = 0   % = 0   Pt < 0   % < 0
APW_AP1SBW                  1415      726      51        0       0      689      49
AP1sSW_AP1SBW               1415      735      52        0       0      680      48
APW_APAll                   2282      947      41        0       0     1335      59
AP1sSW_APAll                2282      996      44        2       0     1284      56
APW_APAllBW                 2188      962      44        0       0     1226      56
AP1sSW_APAllBW              2188     1003      46        2       0     1183      54
MaxPW_MaxP1SBW              1415      948      67       53       4      414      29
MaxP1sSW_MaxP1SBW           1415      719      51       54       4      642      45
MaxPW_MaxPAll               2282     1020      45      109       5     1153      51
MaxP1sSW_MaxPAll            2282      716      31      213       9     1353      59
MaxP1sSW_MaxPAllBW          2188     1069      49      111       5     1008      46
MaxPW_MaxPAllBW             2188     1003      46        2       0     1183      54

Table 2-2 Pitch-Based Features Experimental Results of All WUWs

In Table 2-2, the first column indicates the name of the feature, and the second column shows the number of valid data samples; only the samples with valid data are processed for each particular feature. The third and fourth columns show the number and percentage of samples, respectively, with that feature above zero. Similarly, the fifth and sixth columns show the number and percentage of samples with that feature equal to zero. Finally, the seventh and eighth columns show the number and percentage of samples with that feature below zero.

From an examination of Table 2-2, we see that the highest percentage of positive relative change among all the features is that of MaxPW_MaxP1SBW, with only 67%. This means that only 67% of the samples have a positive relative change between the maximum pitch measurement of the WUW sections and the maximum pitch measurement in the section just before the WUW. This result can also be interpreted as showing that only 67% of these samples have a maximum pitch in the WUW sections higher than the maximum pitch in the section preceding the WUW.

The results for the five individual WUWs used in this study are summarized in Table 2-3 below. The fully detailed pitch-based feature experimental results can be found in Appendix A.

WUW           Best Performing Feature   Percentage of Positive Relative Change
All WUWs      MaxPW_MaxP1SBW            67%
Operator      MaxPW_MaxP1SBW            58%
Onword        MaxPW_MaxP1SBW            58%
ThinkEngine   APW_AP1SBW                65%
Wildfire      MaxPW_MaxP1SBW            66%
Voyager       MaxPW_MaxP1SBW            79%

Table 2-3 Summarized Pitch-Based Features Experimental Results

Although there appears to be no prior research which has established a definite standard by which performance can be rated, in this project we somewhat arbitrarily set a minimum of 80% as the criterion for any given feature to be considered reliable. From the summarized results in Table 2-3, the feature with the best performance is MaxPW_MaxP1SBW, which has percentages of positive relative change from 58% to 79%, depending on the WUW. This "best performance" of the pitch-based feature analysis is below our 80% minimum standard, which makes it too low to be considered reliable in discriminating between WUWs and non-WUWs.

In the pitch-based feature experiments, no significant discriminating pattern could be found in the results obtained. These results could be improved if it were possible to define clear syllabic boundaries. However, syllabic boundaries in the English language are not clearly defined.
Indeed, there is no common agreement among linguists on where syllabic boundaries fall in English. Based on the above experimental results, no pitch-based feature can by itself be used for discriminating WUWs from non-WUWs. Thus, other approaches, such as pitch measurement patterns, are under consideration, and Raymond Sastraputera, a graduate student working with Dr. Këpuska, will continue research on the new approaches. Other possible pitch-based approaches are covered in Chapter 5.

3. ENERGY FEATURES

As mentioned in Section 1.1, the prosodic feature known as prominence can be measured using the energy of the utterance. If pitch represents the intonation of speech, then energy represents the stress of the speech. In this chapter, the same approach that was applied to pitch in Chapter 2 is used with energy to generate a similar feature set.

3.1 ENERGY CHARACTERISTICS

In an English sentence, certain syllables are more prominent than others; these are called accented syllables. Accented syllables are usually either louder or longer than the other syllables in the same word. In the English language, different positions of the accented syllable within the same word are used to differentiate the meaning of the word. For example, the word object as a noun (['ab.dzekt]) and the same word object used as a verb ([ab.'dzekt]) (Cutler, 1986) have their accented syllables in different locations; the position of the accented syllable is indicated by " ' " in the phonetic transcription. If this idea of accented speech is applied to the entire sentence instead of to a single word, then it may provide additional clues about the use of a word of interest and its meaning within the sentence. Our hypothesis here is that the prominence of WUWs should be more significant compared to the prominence of the non-WUWs in the sentence.

Classifying the factors that shape speakers' speech and how they choose to accentuate a particular syllable within a whole sentence is a very complex problem. However, the measurement of the accented syllables can be done simply by using the energy of the speech signal and its pitch change.

3.2 ENERGY EXTRACTION

The energy of a speech signal can be expressed by Parseval's theorem, as given in Equation 3-1:

$$\sum_{n} |x[n]|^2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} |X(\omega)|^2 \, d\omega$$

Equation 3-1

In Equation 3-1, the energy of a signal is defined in both the time and frequency domains. |x[n]|^2 and |X(ω)|^2 represent energy densities, which can be thought of as energy per unit of time and energy per unit of frequency, respectively. The energy is computed here over fixed-size frames (6.5 ms), the same frame size as was used in the earlier pitch computations. After the energy is calculated for all samples of each utterance in the WUWII corpus, the energy features can be computed in a manner similar to the way the pitch-based features were calculated in Section 2.3. This is done in the next section.

3.3 ENERGY-BASED FEATURES

Using the same technique as in the previous experiments with pitch-based features, 12 energy-based features were computed and tested. The energy-based features are derived in the same way as the pitch-based features (Equation 2-6). The features are listed and defined in Table 3-1 below:

AEW_AE1SBW: The relative change of the average energy of the WUW to the average energy of the section just before the WUW.
AE1sSW_AE1SBW: The relative change of the average energy of the first section of the WUW to the average energy of the section just before the WUW.
AEW_AEAll: The relative change of the average energy of the WUW to the average energy of the entire speech sample excluding the WUW sections.
AE1sSW_AEAll: The relative change of the average energy of the first section of the WUW to the average energy of the entire utterance excluding the WUW sections.
AEW_AEAllBW: The relative change of the average energy of the WUW to the average energy of all speech before the WUW.
AE1sSW_AEAllBW: The relative change of the average energy of the first section of the WUW to the average energy of the entire speech sample before the WUW.
MaxEW_MaxE1SBW: The relative change of the maximum energy in the WUW sections to the maximum energy in the section just before the WUW.
MaxE1sSW_MaxE1SBW: The relative change of the maximum energy in the first section of the WUW to the maximum energy in the section just before the WUW.
MaxEW_MaxEAll: The relative change of the maximum energy in the WUW to the maximum energy of the entire speech sample excluding the WUW sections.
MaxE1sSW_MaxEAll: The relative change of the maximum energy in the first section of the WUW to the maximum energy of the entire speech sample excluding the WUW sections.
MaxE1sSW_MaxEAllBW: The relative change of the maximum energy in the first section of the WUW to the maximum energy of the entire speech before the WUW.
MaxEW_MaxEAllBW: The relative change of the maximum energy in the WUW sections to the maximum energy of the entire speech sample before the WUW.

Table 3-1 Energy-Based Feature Definitions

In this experiment, some of the features may not be implementable in real-time applications, since they rely on measurements taken after the WUW of interest. Nevertheless, even those features may lead to interesting conclusions. For real-time speech recognition systems, the features that do not rely on measurements after the WUW of interest are the most useful.

Table 3-2 below shows the results of the energy feature measurements based on all five different WUWs of the WUWII corpus, namely the words "Operator", "ThinkEngine", "Onword", "Wildfire", and "Voyager". The detailed results for each of the five WUWs can be found in Appendix B.

Energy-Based Features (WUW: All WUWs)

Feature               Valid Data   Pt > 0   % > 0   Pt = 0   % = 0   Pt < 0   % < 0
AEW_AE1SBW                  1479     1164      79        0       0      315      21
AE1sSW_AE1SBW               1479     1283      84        1       0      240      16
AEW_AEAll                   2175     1059      49        0       0     1116      51
AE1sSW_AEAll                2175     1155      53        2       0     1018      47
AEW_AEAllBW                 1969     1427      72        0       0      542      28
AE1sSW_AEAllBW              1969     1562      79        3       0      404      21
MaxEW_MaxE1SBW              1479     1244      84       20       1      215      15
MaxE1sSW_MaxE1SBW           1479     1221      83       13       1      245      17
MaxEW_MaxEAll               2175     1373      63       13       1      245      17
MaxE1sSW_MaxEAll            2175     1336      61       25       1      814      37
MaxE1sSW_MaxEAllBW          1969     1209      61       16       1      744      38
MaxEW_MaxEAllBW             1969     1562      60        3       1      404      39

Table 3-2 Energy-Based Feature Experimental Results of All WUWs

In Table 3-2, the first column indicates the name of the feature, and the second column shows the number of valid data samples; only the samples with valid data are processed for each particular feature. The third and fourth columns show the number and percentage of samples with that feature above zero. The fifth and sixth columns show the number and percentage of samples with that feature equal to zero. The seventh and eighth columns show the number and percentage of samples with that feature below zero.
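For concreteness, the sketch below shows the frame-energy computation of Section 3.2 and one representative feature, AE1sSW_AE1SBW, computed from it. Names and conventions are our own assumptions, not the thesis code; section boundaries are again assumed to come from the corpus labels.

```python
import numpy as np

def frame_energies(samples, frame_len):
    """Short-time energy of non-overlapping frames (the time-domain sum
    of Equation 3-1, evaluated per frame).  `frame_len` is the number of
    samples in 6.5 ms at the corpus sampling rate."""
    n = len(samples) // frame_len
    frames = np.reshape(samples[:n * frame_len].astype(float),
                        (n, frame_len))
    return np.sum(frames ** 2, axis=1)

def ae1ssw_ae1sbw(energy, first_wuw_slice, before_slice):
    # Table 3-1's AE1sSW_AE1SBW: Equation 2-6 applied to average frame
    # energies of the first WUW section and the section just before it.
    e_wuw = energy[first_wuw_slice].mean()
    e_before = energy[before_slice].mean()
    return (e_wuw - e_before) / e_before
```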
Based on the results shown in Table 3-2, the following three features performed best in discriminating WUWs from other word tokens:

AE1sSW_AE1SBW: The relative change of the average energy of the first section of the WUW compared to the average energy of the last section before the WUW. Using this feature, 84% of the data shows that the average energy of the first section of the WUW is higher than the average energy of the previous section. This result is illustrated in Figure 3-1 below, which depicts the distribution of this feature in blue and its cumulative distribution in red. The cumulative plot is a continuous curve that approaches a value of 100%; the distribution plot is discrete. Both plots are presented in black in Appendices A and B.

Figure 3-1 Distribution and cumulative plots of the energy-based feature AE1sSW_AE1SBW for all WUWs (x-axis: (WUW1stAE-LSAE)/LSAE; y-axis: %)

MaxEW_MaxE1SBW: The relative change of the maximum energy in the WUW sections compared to the maximum energy of the last section before the WUW. Using this feature, 84% of the samples show that the maximum energy in the WUW sections is higher than the maximum energy of the previous section. The distribution and cumulative plots of this feature are shown in Figure 3-2 below.

Figure 3-2 Distribution and cumulative plots of the energy-based feature MaxEW_MaxE1SBW for all WUWs (x-axis: (WUWMAXE-LSMAXE)/LSMAXE; y-axis: %)

MaxE1sSW_MaxE1SBW: The relative change of the maximum energy of the first section of the WUW compared to the maximum energy of the last section before the WUW. This feature correctly discriminated the 83% of cases that exhibited a higher maximum energy in the first section of the WUW than in the previous section. The distribution and cumulative plots of this feature are shown in Figure 3-3.

Figure 3-3 Distribution and cumulative plots of the energy-based feature MaxE1sSW_MaxE1SBW for all WUWs (x-axis: (WUW1stMAXE-LSMAXE)/LSMAXE; y-axis: %)

The above results are based on all the data, including all five different WUWs; investigating each word independently may therefore be more appropriate. The detailed performance results for each of the five WUWs are covered in Appendix B.

Linguistically, one of the more common and useful WUWs is the word "Operator". This word is also used in the current Wake-Up-Word Speech Recognition System (Këpuska V. C., 2006). Based on the results presented in Table 3-3 below, two features show that over 90% of the WUW cases have an average or maximum energy higher than the other sections of speech. These two features are:

AE1sSW_AE1SBW: The relative change of the average energy of the first section of the WUW compared to the average energy of the last section before the WUW. Using this feature, 94% of the samples have a first WUW section with higher average energy than the previous section.

AE1sSW_AEAllBW: The relative change of the average energy of the first section of the WUW compared to the average energy of the entire speech before the WUW. Using this feature, 91% of the samples show that the first section of the WUW has higher average energy.
Energy-Based Features (WUW: Operator)

Feature               Valid Data   Pt > 0   % > 0   Pt = 0   % = 0   Pt < 0   % < 0
AEW_AE1SBW                   275      228      83        0       0       47      17
AE1sSW_AE1SBW                275      258      94        0       0       17       6
AEW_AEAll                    418      248      59        0       0      170      41
AE1sSW_AEAll                 418      290      69        1       0      127      30
AEW_AEAllBW                  394      303      77        0       0       91      23
AE1sSW_AEAllBW               394      359      91        1       0       34       9
MaxEW_MaxE1SBW               275      240      87        1       0       34      12
MaxE1sSW_MaxE1SBW            275      243      88        0       0       32      12
MaxEW_MaxEAll                418      290      69        4       1      124      30
MaxE1sSW_MaxEAll             418      285      68        6       1      127      30
MaxE1sSW_MaxEAllBW           394      272      69        4       1      118      30
MaxEW_MaxEAllBW              394      359      68        1       1       34      30

Table 3-3 Energy-Based Feature Experimental Results of the WUW "Operator"

Based on the performed experiments, the WUW "Wildfire" achieved the best overall result. Using this word, four features scored 90% or higher. These results are shown in Table 3-4.

Energy-Based Features (WUW: Wildfire)

Feature               Valid Data   Pt > 0   % > 0   Pt = 0   % = 0   Pt < 0   % < 0
AEW_AE1SBW                   282      253      90        0       0       29      10
AE1sSW_AE1SBW                282      261      93        0       0       21       7
AEW_AEAll                    340      173      51        0       0      167      49
AE1sSW_AEAll                 340      185      54        0       0      155      46
AEW_AEAllBW                  298      252      85        0       0       46      15
AE1sSW_AEAllBW               298      265      89        0       0       33      11
MaxEW_MaxE1SBW               282      258      91        8       3       16       6
MaxE1sSW_MaxE1SBW            282      253      90        2       1       27      10
MaxEW_MaxEAll                340      230      68        4       1      106      31
MaxE1sSW_MaxEAll             340      219      64        4       1      117      34
MaxE1sSW_MaxEAllBW           298      195      65        4       1       99      33
MaxEW_MaxEAllBW              298      265      62        0       1       33      36

Table 3-4 Energy-Based Feature Experimental Results of the WUW "Wildfire"

The four best features were found to be the following:

AEW_AE1SBW: The relative change of the average energy of the entire WUW compared to the average energy of the last section just before the WUW. For 90% of the samples, the average energy of the WUW is higher than that of the previous section.

AE1sSW_AE1SBW: The relative change of the average energy of the first section of the WUW compared to the average energy of the last section before the WUW. Using this feature, 93% of the samples show that the first section of the WUW has higher average energy.

MaxEW_MaxE1SBW: The relative change of the maximum energy of the WUW sections compared to the maximum energy in the last section before the WUW. Using this feature, 91% of the samples show that the WUW has higher maximum energy.

MaxE1sSW_MaxE1SBW: The relative change of the maximum energy of the first section of the WUW compared to the maximum energy in the last section before the WUW. Using this feature, 90% of the samples show that the first section of the WUW has higher maximum energy.

From an overall view, the experimental results of the energy-based features are summarized in Table 3-5 below. The best two features are AE1sSW_AE1SBW, with scores ranging from 71% to 94%, and MaxEW_MaxE1SBW, with a score range between 66% and 87%. For both of these features, the lowest scores occur for the WUW "ThinkEngine". The reason the word ThinkEngine yields relatively lower energy-based feature values is that the sound "th" is an unvoiced fricative and has the lowest relative intensity of all English sounds (Fry, 1979). In addition, the nasal sound "ng" also has a low relative intensity among English sounds. Despite the lower performance results for the WUW "ThinkEngine", the performance of these two features is between 84% and 94%.
WUW           Best Performing Feature             Percentage of Positive Relative Change
All WUWs      AE1sSW_AE1SBW, MaxEW_MaxE1SBW       84%
Operator      AE1sSW_AE1SBW                       94%
Onword        MaxEW_MaxE1SBW                      87%
ThinkEngine   MaxEW_MaxE1SBW                      71%
Wildfire      AE1sSW_AE1SBW                       93%
Voyager       MaxEW_MaxE1SBW                      83%

Table 3-5 Summarized Energy-Based Features Experimental Results

From the above results, it can be concluded that the WUW is frequently emphasized or accentuated compared to the rest of the words in the utterance. Thus, the hypothesis that the prominence of the WUWs is more significant than the prominence of the non-WUWs is verified. In addition, the energy-based features AE1sSW_AE1SBW and MaxEW_MaxE1SBW can be used reliably for discriminating WUWs from non-WUWs, given properly selected WUWs.

4. DATA COLLECTION

In this chapter, we introduce a novel way to collect speech samples. The preliminary design of this data collection system is also presented.

4.1 INTRODUCTION TO THE DATA COLLECTION

After developing the WUW discriminating features based on the two prosodic measurements of pitch and energy, described in Chapters 2 and 3, it was realized that the data used to generate those features may not be the most suitable. The corpus used in this project is the WUWII corpus, which was collected by Dr. Këpuska in 2002. It provides data on a number of WUWs in alerting situations, but it does not contain data for the same words used in referential situations. As a result, it was only possible to perform an analysis based on the changes between alerting uses of WUWs and the overall sentence, and not against cases in which the same word appears in a referential situation.

Another drawback of the current WUWII corpus is that it contains speech that is not spontaneous. The speakers whose voices were used to develop this data set were given each WUW and asked to make up a sentence using it as a WUW in an alerting situation. Under such laboratory conditions, a speaker may change the way he/she normally speaks.

In order to perform a more complete analysis, we need a corpus which includes both alerting and referential WUWs in the context of naturally spoken utterances. Therefore, it was decided to follow a suggestion made by Dr. R. Wallace: to extract audio samples from movies or TV programs. Extracting speech samples from movies and TV programs has the following advantages compared to the data collection method used in developing the WUWII corpus:

1. The speech examples are more natural. The speech from professional actors is more natural since they tend to think and speak like a particular character and act out the situation of the character they are depicting.

2. The data collection process costs much less, since we are not compensating individuals to record their voices. There are no copyright problems since the data is not being sold or used for commercial purposes.

3. A large amount of data can be collected in a short period of time once the process is fully automated.

4. The voice channel data is of CD quality. In this project, speech data was extracted from recorded videos rather than over conventional telephone lines, as was done in developing the WUWII corpus.

5. No manual labeling is required. We plan to use the transcripts obtained from the video channel (see Section 4.2), which provide time stamps for all spoken sentences.
In view of the above listed advantages, it was decided to design an automatic data collection system to collect speech data suitable for prosodic analysis of proper first names used in referential vs. alerting (WUW) contexts.

4.2 SYSTEM DESIGN

The data collection project is part of the prosodic features analysis project, which is illustrated by the program flow chart shown in Figure 4-1. The prosodic features analysis project can be divided into three sub-projects. In Figure 4-1, the green boxes represent the prosodic feature extraction work described in Chapters 2 and 3 of this thesis.

Figure 4-1 Program Flow Chart

The blue boxes depict the WUW data collection project. Finally, the purple boxes represent a future project on video analysis. In the prosodic feature analysis project, we use the prosodic features generated from acoustic measurements to differentiate the context of the words. In a part of the WUW data collection project, language analysis tools will be used to automatically classify the words of interest, in this case as referential or alerting. At the moment, the capabilities of this tool, RelEx (Novamente LLC), must be augmented in order to achieve this goal. The outcome of the WUW Speech Data Collection project will not only be a specialized corpus for the prosodic analysis project, but also a confirmation of the results from the prosodic analysis. The detailed program flow chart of the WUW Speech Data Collection System is shown in Figure 4-2 below.

Figure 4-2 WUW Audio Data Collection System Program Flow Diagram

The inputs of the system are (1) the video file of a movie or TV program; (2) a video transcription file, which is used if provided; and (3) an English dictionary of proper American first names (Campbell). In the case that no video transcription file is available and the subtitles are encoded into the video stream, the subtitle extractor SubRip (Zuggy) extracts the subtitles and the time stamps of each sentence from the video stream. An example of a transcription file extracted in this manner is presented in Figure 4-3 below.

Figure 4-3 Example of Video Transcription File

The transcription files provide the following information: the date and time when the files were created, the subtitle index number, the start and end time of each subtitle, and the subtitle text. The media audio extractor (AOAMedia.com) extracts the audio channel from the video file. Then, using the English dictionary of first names and the sentence transcription with time markers, an application called the sentence parser, developed by VoiceKey team members (Pattarapong, Ronald, & Xerxes, Sentence Parser Program, 2009), selects the sentences that include proper English-language first names. Figure 4-4 shows an example of the output of the sentence parser program.

Figure 4-4 An example of the output of the sentence parser application program

In the next step, the audio parser (Pattarapong, Ronald, & Xerxes, Audio Parser Program, 2009) uses the information from the sentence parser to extract the corresponding audio sections from the audio file produced by the media audio extractor.
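To illustrate the sentence-parser step just described, here is a minimal sketch that scans a SubRip (.srt) transcription for sentences containing known first names and returns their time stamps, assuming the standard SRT layout (an index line, a "start --> end" time line, then the text). The file name, name list, and function names are illustrative assumptions, not the actual VoiceKey implementation.

```python
import re

# "HH:MM:SS,mmm --> HH:MM:SS,mmm" time line of an SRT block.
TIME_RE = re.compile(
    r"(\d{2}:\d{2}:\d{2},\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2},\d{3})")

def sentences_with_names(srt_path, first_names):
    """Return (start, end, text) triples for every subtitle block whose
    text contains one of the given first names."""
    names = {n.lower() for n in first_names}
    hits = []
    with open(srt_path, encoding="utf-8", errors="replace") as f:
        blocks = f.read().split("\n\n")      # blocks separated by blank lines
    for block in blocks:
        lines = [ln.strip() for ln in block.strip().splitlines()]
        if len(lines) < 3:
            continue                          # malformed or empty block
        m = TIME_RE.search(lines[1])
        if not m:
            continue
        text = " ".join(lines[2:])
        words = re.findall(r"[A-Za-z']+", text)
        if any(w.lower() in names for w in words):
            hits.append((m.group(1), m.group(2), text))
    return hits

# Hypothetical usage:
# sentences_with_names("episode01.srt", {"John", "Mary", "Peter"})
```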
After the extraction of a sentence that contains an English first name, RelEx (Novamente LLC) is used to analyze the selected sentence. RelEx is an English-language semantic relationship extractor based on the Carnegie Mellon Link Parser (Temperley, Lafferty, & Sleator, 2000). RelEx is able to provide sentence information on various parts of speech, such as subjects, direct objects, and indirect objects, as well as various word tags such as verb, gender, and noun. The current work in the WUW data collection project is the development of a rule-based process or a statistical pattern recognition process based on the relationship information produced by RelEx. Ultimately, the system will be able to determine accurately whether the name in a sentence is used in a WUW or non-WUW context.

A necessary step in the automation process is to obtain precise time markers indicating the words of interest. To achieve this, one could use the Hidden Markov Model Toolkit (HTK) (Machine Intelligence Laboratory of the Cambridge University Engineering Department) to perform forced alignment of the audio stream. The HTK was initially developed by the Machine Intelligence Laboratory (formerly known as the Speech, Vision, and Robotics Group) of the Cambridge University Engineering Department (CUED). The HTK uses Hidden Markov Models (HMMs), which compare the acoustic features of the incoming audio with the known acoustic features of all 41 English phonemes to predict the most likely combination of phonemes associated with the incoming audio; it then maps the words from the lexicon dictionary. In our case, since the transcription of the sentences is known, HTK is used to map the phonemes of the known words to the corresponding time intervals. The phoneme time labels, or equivalently the word boundaries of the spoken sentence, are used to locate the WUWs or non-WUWs in the time domain.

Note that this step can also be performed by Microsoft's Speech Development Kit (SDK), a speech recognition system that is fully integrated into Microsoft's Vista OS. The advantage of Microsoft's system is that we do not need to train it, since the acoustic models are built in; however, an application incorporating Microsoft's SDK features would have to be developed. Alternatively, HTK does not require any significant integration coding, but it does require acoustically accurate models. Automation of the described data collection process will be made possible by integrating the outputs from RelEx with the forced alignment.
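As an illustration of how the forced-alignment output could be consumed, the sketch below extracts the time spans of a target word from a label file in HTK's label format (one "start end name" line per unit, with times in units of 100 ns). This is a hedged sketch of a possible post-processing step, not part of the existing system; the exact output format depends on the alignment options used.

```python
def word_intervals(lab_path, target):
    """Return (start_sec, end_sec) spans of `target` from an HTK-style
    label file, assuming 'start end name' lines in 100 ns units."""
    spans = []
    with open(lab_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:
                continue                      # skip malformed lines
            start, end, name = int(parts[0]), int(parts[1]), parts[2]
            if name.lower() == target.lower():
                spans.append((start * 1e-7, end * 1e-7))  # 100 ns -> s
    return spans

# Hypothetical usage: word_intervals("sentence_0042.lab", "John")
```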
With time-segmented sentence labels of the audio stream indicating the WUW or non-WUW context, a new corpus can be generated, just like the WUWII corpus. This data will then be used to perform prosodic analysis and to develop new, or refine existing, prosodic features. It is expected that further study with the new data will not only validate the current prosodic analysis results, but will also provide information useful for developing new prosodic features. The ultimate goal of this speech data collection project is to build a suitable specialized corpus for the research on finding reliable features to discriminate between WUWs and non-WUWs in both alerting and referential contexts.

5. CONCLUSIONS

This thesis investigated two sets of prosodic features and designed an innovative speech data collection system. The pitch-based feature results for all five WUWs are shown in Table 5-1, and the energy-based feature results are shown in Table 5-2. In this study we somewhat arbitrarily decided that a feature's percentage of positive relative change should be 80% or higher before we would consider it a reliable discriminator between WUWs and non-WUWs.

In addition, it was found that no single feature works best on all five WUWs used in this study. Each WUW may require a different feature to achieve the best performance. It can be concluded that the same feature will not discriminate all WUWs equally well between their use in alerting contexts and referential contexts.

Pitch Features (WUW: All WUWs)

Feature               Valid Data   Pt > 0   % > 0   Pt = 0   % = 0   Pt < 0   % < 0
APW_AP1SBW                  1415      726      51        0       0      689      49
AP1sSW_AP1SBW               1415      735      52        0       0      680      48
APW_APAll                   2282      947      41        0       0     1335      59
AP1sSW_APAll                2282      996      44        2       0     1284      56
APW_APAllBW                 2188      962      44        0       0     1226      56
AP1sSW_APAllBW              2188     1003      46        2       0     1183      54
MaxPW_MaxP1SBW              1415      948      67       53       4      414      29
MaxP1sSW_MaxP1SBW           1415      719      51       54       4      642      45
MaxPW_MaxPAll               2282     1020      45      109       5     1153      51
MaxP1sSW_MaxPAll            2282      716      31      213       9     1353      59
MaxP1sSW_MaxPAllBW          2188     1069      49      111       5     1008      46
MaxPW_MaxPAllBW             2188     1003      46        2       0     1183      54

Table 5-1 Pitch Features Result of All WUWs

Table 5-1 shows the performance of the pitch-based features when all five different WUWs are included. As can be seen from the table, the best performance is only 67%, which is not high enough to allow reliable discrimination between WUWs and non-WUWs. Table 5-2 below shows the energy-based feature performance results for all five WUWs used in the present study.

Energy Features (WUW: All WUWs)

Feature               Valid Data   Pt > 0   % > 0   Pt = 0   % = 0   Pt < 0   % < 0
AEW_AE1SBW                  1479     1164      79        0       0      315      21
AE1sSW_AE1SBW               1479     1283      84        1       0      240      16
AEW_AEAll                   2175     1059      49        0       0     1116      51
AE1sSW_AEAll                2175     1155      53        2       0     1018      47
AEW_AEAllBW                 1969     1427      72        0       0      542      28
AE1sSW_AEAllBW              1969     1562      79        3       0      404      21
MaxEW_MaxE1SBW              1479     1244      84       20       1      215      15
MaxE1sSW_MaxE1SBW           1479     1221      83       13       1      245      17
MaxEW_MaxEAll               2175     1373      63       13       1      245      17
MaxE1sSW_MaxEAll            2175     1336      61       25       1      814      37
MaxE1sSW_MaxEAllBW          1969     1209      61       16       1      744      38
MaxEW_MaxEAllBW             1969     1562      60        3       1      404      39

Table 5-2 Energy Features Result of All WUWs

One can see from Table 5-2 that there are several energy-based features with positive relative changes above 80%. In addition, some individual WUWs achieve multiple energy-based features having a positive relative change of 90% or more, as covered in Section 3.3 and detailed in Appendix B. These results provide firm evidence that there are significant increases in the energy measurement when WUWs are spoken, and they confirm that the prominence of WUWs is more significant than the prominence of non-WUWs. Therefore, we can conclude that energy-based features can be used to discriminate between WUWs and non-WUWs. A future improvement would be to quantify the level of change between WUWs and non-WUWs.

6. FUTURE WORK

Two potential solutions are being considered to address the insufficient accuracy reported in this work for pitch-based features:

1. Build a specialized corpus which contains the same words in both WUW and non-WUW contexts. The speech sentences in the current corpus, WUWII, contain only WUWs and no non-WUWs. A new speech data collection system is presented in Chapter 4, which will allow creation of a database from the collected data that includes both WUWs and non-WUWs.

2. Use different approaches in defining pitch-based features. For example, in addition to the average and maximum pitch measurements of the WUW, how the pitch pattern changes should also be considered.

Finally, the new data collection system, which collects both WUWs and non-WUWs, has been designed and partially implemented. Work on this data collection system will be continued by the VoiceKey group at Florida Institute of Technology.
The ultimate goal of this speech data collection project is to build a suitable specialized corpus of data samples in order to find prosodic features that reliably discriminate between WUWs and non-WUWs.