ABSTRACT

Title of Thesis: DETECTION OF IRREGULAR PHONATION IN SPEECH
Srikanth Vishnubhotla, Master of Science, January 2007
Directed By: Prof. Carol Y Espy-Wilson, Department of Electrical Engineering, University of Maryland

The problem addressed in this work is that of detecting and characterizing occurrences of irregular phonation in spontaneous speech. While published work tackles this problem as a two-hypothesis problem only in those regions of speech where phonation occurs, this work also focuses on distinguishing aperiodicity due to frication from aperiodicity arising from irregular voicing. In addition, this work deals with correction of a current pitch tracking algorithm in regions of irregular phonation, where most pitch trackers fail to perform well, as evidenced in the literature. Relying on the detection of such regions of irregular phonation, an acoustic parameter is then developed in order to characterize these regions for speaker identification applications.

The algorithm builds upon the Aperiodicity, Periodicity and Pitch (APP) detector, a system designed to measure the amount of aperiodic and periodic energy in a speech signal on a frame-by-frame basis. The detection performance of the algorithm has been tested on a clean speech corpus, the TIMIT database, and on a telephone speech corpus, the NIST 98 database, where regions of irregular phonation have been labeled by hand. The detection performance is 91.8% for the TIMIT database, with a false detection rate of 17.42%. The detection performance is 89.2% for the NIST 98 database, with a false detection rate of 12.8%. The corresponding pitch detection accuracy increased from 95.4% to 98.3% for the TIMIT database, and from 94.8% to 97.4% for the NIST 98 database, on a frame basis, with the reference pitch coming from the ESPS pitch tracker. The creakiness parameter was added to a set of seven acoustic parameters for speaker identification on the NIST 98 database, and the performance was found to be enhanced by 1.5% for female speakers and 0.4% for male speakers for a population of 250 speakers. These results lead to the conclusion that the creakiness detection parameter can be useful for speech technology. This work also has potential applications in the field of non-intrusive diagnosis of pathological voices.

DETECTION OF IRREGULAR PHONATION IN SPEECH

by Srikanth Vishnubhotla

Thesis submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Master of Science, 2007

Advisory Committee:
Dr. Carol Y Espy-Wilson, Chair and Advisor
Dr. Shihab Shamma
Dr. Rama Chellappa

© Copyright by Srikanth Vishnubhotla, 2007

ACKNOWLEDGEMENTS

"Mathru Devo Bhava, Pithru Devo Bhava, Acharya Devo Bhava"
Mother is equivalent to God, Father is equivalent to God, and Teacher is equivalent to God.

I would like to express my undying and infinite gratitude towards my parents, who have given me everything I wished for, and have made me what I am. I owe everything to them.

This is the best opportunity to express my sincere thanks and immense gratitude to my teacher and guide – Prof. Carol Espy-Wilson. She has been my inspiration and Guru during my work in the Speech Lab. She has taught me the philosophy and method of research, and has helped me shape my research career.

I would like to express my thanks to the members of my thesis committee, Prof. Rama Chellappa and Prof.
Shihab Shamma, for sparing their invaluable time in reviewing the manuscript of the thesis, and for providing valuable comments and suggestions.

My thanks to all my colleagues of the Speech Communication Lab. Discussions with Om, Tarun, Xinhui, Sandeep and Daniel have always proved to be fruitful and insightful. I would especially like to thank Om for all his help and suggestions at various stages of my research, especially with the APP detector. Sandeep has been a very good friend & colleague, and has helped in acquainting me with the speaker ID setup. Daniel has provided valuable advice and comments on the speaker ID experiments. I would like to thank Olakunle for his help with verifying the manual transcription of creakiness in the TIMIT database. I would like to acknowledge the timely help and support from the ECE and ISR IT staff.

My thanks go to all my friends here at UMD, who have made the last two years really memorable and have helped and supported me at various stages. Abhinav, Anuj and Pavan will always find a place among the best of my friends, and they have my ever-lasting gratitude. Ramya & Swati – thanks for all the good times outside and at home. And everybody – thanks for all the participation in the innumerable games of AoM! Rahul, Sudarshan, Gunjan – thanks for the wonderful food at 143, Westway! Jishnu & Sravya – thanks for the late night rides. They've made a huge difference, I assure you! Bargava – thanks for all the philosophical (and all else!) discussions. Pooja – thanks for participating in the birthday bashes and the get-togethers. And for all the feedback!

Last but not the least – thanks to my family. My mom, dad and sister – I love you and miss you all.

TABLE OF CONTENTS

Acknowledgements ii
Table of Contents iv
List of Tables vi
List of Figures vii
1 Introduction & Background 1
1.1 The Speech Production Process & The Acoustic Parameters 4
1.2 Phonation & Voice Quality 7
1.3 Irregular Phonation & Its Types 12
1.4 Why This Thesis? 13
1.5 Literature Survey & Contributions of this Thesis 16
1.6 Acoustic Parameters for Speaker Identification 19
1.7 Outline of the Thesis 22
2 Modifications to the APP Detector 24
2.1 The APP Detector 24
2.2 Dip Profiles in the APP Detector 27
2.3 Recommended Specifications 33
3 Acoustic Characteristics for Detection of Irregular Phonation 35
3.1 Detecting Suspect Regions of Irregular Phonation 36
3.2 Energy Threshold 37
3.3 Zero Crossing Rate 38
3.4 Spectral Slope 43
3.5 Dip Profile – Eliminating Voiced Fricatives and Breathy Vowels 46
3.6 Pitch Estimation 48
3.7 Final Decision Process: Combining Current Pitch Information and Dip Profile Decisions 55
4 Experiments & Results 57
4.1 Database Used For The Experiments 57
4.2 Experiment 1 – Accuracy of Pitch Estimation 59
4.3 Experiment 2 – Detection of Irregular Phonation 62
4.4 Speaker Identification using Acoustic Parameters 67
4.5 Error Analysis and Discussion 70
5 Summary & Future Directions 74
5.1 Future Research Directions 74
Bibliography 78

LIST OF TABLES

4.1 Pitch Estimation Accuracy of the modified APP detector, as compared to the earlier version 60
4.2 Insertion of voiced decisions for unvoiced speech 61
4.3a Accuracy of detection of creakiness instances in the TIMIT database 64
4.3b Accuracy of detection of creakiness frames in the TIMIT database 64
4.4a Accuracy of detection of creakiness instances in the NIST 98 database 64
4.4b Accuracy of detection of creakiness frames in the NIST 98 database 65
4.5 Error Rates for Speaker ID without and with the creakiness parameter 69

LIST OF FIGURES

1.1 A Typical Pattern Recognition System for ASR / Speaker ID 2
1.2 The Speech Production Process modeled as an electrical system 5
1.3 A front view and top view of the vocal folds and the associated muscles 7
1.4 Modulation of the air flow by the vocal folds 8
1.5 Volume velocity of the air flow through the glottis during modal (normal) voicing 9
1.6 Glottal Configuration, Volume Velocity and Glottal Spectrum for a creaky, modal and breathy vowel 11
1.7 Illustrative example showing the behavior of the pitch, zero crossing rate and energy in a speech file. Shown are: the time signal, spectrogram with the formants overlaid, pitch track as calculated by the ESPS software, pitch track calculated by the modified APP detector, zero crossing rate, short-term energy. The red lines delineate the boundaries of a region of laughter in the speech file. The black lines separate the breathy voiced regions of laughter from the aspiration regions. 20
2.1 Block diagram of the Aperiodicity, Periodicity & Pitch detector 25
2.2 Sample AMDF and dip strengths for a strongly periodic (left) and aperiodic (right) frame 26
2.3 Spectro-temporal profiles & dip profiles for periodic or vowel, aperiodic or fricative, breathy vowel, voiced fricative and creaky vowel regions 28
2.4 A sample modal vowel and creaky vowel compared in terms of spectrogram, per/aper profile, time waveform, source or glottal signal exciting the vocal tract and AMDF of the signal and associated dip structure. The vowels being compared are the same (/e/) 31
2.5 Spectrogram, old APP detector and modified APP detector spectro-temporal profiles for a segment of speech contrasting a modal vowel, creaky vowel and aperiodic region 33
3.1 Flowchart showing the main stages of the algorithm for detection of irregular phonation 36
3.2 Illustration of the zero crossing rate in case of the voiced and unvoiced regions.
The zero crossing rate (ZCR) for the signal components > 2000 Hz, the signal components < 2000 Hz, and the signal with all its frequency components are compared. 39
3.3 Spectrogram and high-spectrum ZCR of a region of speech, showing a fricative, modal vowel, creaky vowel and laughter 41
3.4 Comparison of spectral tilt of a creaky vowel, voiced stop and fricative 44
3.5 Pitch tracking accuracy of three algorithms in creaky regions. Shown are the waveform, spectrogram, pitch tracking using the ESPS software, pitch track from the modified APP detector, and pitch track from the original version of the APP detector 51
3.6 Pitch tracking of the new APP detector algorithm in creaky regions. Shown are the waveform, spectrogram, pitch tracking using ESPS software, pitch track from the modified APP detector. The instantaneous pitch tracking accuracy of the algorithm may be noted. 52
3.9 Failure to correct pitch in certain regions, due to high pitch confidence. Shown are the waveform, spectrogram, pitch tracking using ESPS software, pitch track from the modified APP detector. The pitch period of 0.004 sec (highlighted region in yellow) may be compared to the estimated pitch frequency of 109.59 Hz, which is a pitch halving error. This is because of high pitch confidence at twice the pitch period. 54
4.1 Creakiness detection results. Shown are the spectrogram showing the regions of creakiness in the original signal and the same spectrogram as above, with the creaky regions identified by the APP detector overlaid in black 63
4.2 Improvement in speaker ID performance upon introduction of creakiness as the eighth parameter 69

Chapter 1
INTRODUCTION & BACKGROUND

Speech is the primary mode of communication in humans. Indeed, the most obvious trait that sets the human race apart from the rest of the living world is the ability of humans to communicate with each other with richness, efficiency and variety. The speech signal produced by humans contains much more information than purely the message to be passed: it also gives the listener an idea of the gender and emotional state of the speaker, the language of communication, as well as the speaker's identity in some cases. Each of these components of information is processed by the listener, and then put together in the brain to generate a complete picture of the particular conversation. Automatic methods have long been developed to capture these various levels of information, and researchers have contributed to different areas of speech processing by machines, like Automatic Speech Recognition (ASR), Language Identification, Natural Language Processing (NLP), Speaker Identification (Speaker ID), Emotion Recognition etc. In addition, research has also focused on giving machines the ability to speak, or Speech Synthesis. Further, there have also been major contributions to the area of Speech Enhancement, with the intention of rendering speech signals more usable for machines and humans. However, while there have been significant advances in all of these components of speech technology, each of them is still far from perfect or even near-human performance, and a lot needs to be done before the problem of "teaching machines to converse like humans" can be declared solved. Indeed, for example, in the technology of speech recognition, human speech recognition beats ASR by an order of magnitude in both clean and noisy speech [1] – a gap that has not been bridged over the past nearly ten years.
There are several reasons for the marked difference between the performance of humans and machines. In particular, for the problems of ASR and speaker ID, the performance difference is significantly high and, for most cases, unacceptable, and this might be attributed to the approach taken towards the problem [1]. Most ASR and speaker ID systems rely on the statistical pattern recognition framework, wherein the front-end contains the Feature Extractor, and the back-end consists of one of several class-discriminating modeling techniques. Figure 1.1 shows a typical statistical pattern recognition system akin to those used for ASR and speaker ID applications.

Figure 1.1 : A Typical Pattern Recognition System for ASR / Speaker ID. In the training phase, features extracted from the speech signal are used to build probability models; in the testing phase, features from the test signal are scored against the most likely model to produce a decision.

The speech signal is first processed in order to extract certain features that enhance the discrimination between the different classes that are required to be identified (phonemes for ASR and speaker characteristics for speaker ID). The features used in current state-of-the-art systems are the Mel-Frequency Cepstral Coefficients (MFCCs), spectral features of speech that have been processed by auditory-inspired techniques in order to mimic human auditory processing [2]. In order to minimize the effect of channel distortions, these features are smoothed using Relative Spectral Transform (RASTA) filtering [3], a technique that applies a band-pass filter to the energy in each frequency band in order to smooth short-term noise and to remove any constant offset in the band. To capture implicitly some of the temporal information in speech, these MFCCs are complemented by their first and second derivatives (ΔMFCCs and ΔΔMFCCs), which give the rate of change of these features with time.

Once the features are extracted, they are then fed into a modeling back-end, which characterizes their statistical and temporal behavior by means of marginal and conditional probability distributions, respectively. In state-of-the-art ASR systems, the back-end is a Hidden Markov Model (HMM) [4], which characterizes the temporal behavior of the features, as well as their statistical description, by their conditional and marginal probability distributions. For speaker ID applications, the Gaussian Mixture Model (GMM), which describes the statistical distribution of the extracted features in terms of a set of Gaussian distributions with varying means and variances, has been found to be most useful [5].

The reason why machine recognition has not lived up to expectations and still demands new developments is that the modeling techniques, as well as the feature extraction methods, have not been (i) accurate and (ii) appropriate to the tasks at hand. As an illustration, speaker ID technology uses a purely statistical mechanism to model speaker characteristics, without taking temporal information into account. It is a known fact [6] that speakers differ in the way they articulate the same phoneme, and therefore, the characteristics of the speech they produce should show different dynamic behavior. However, the GMMs used for Speaker ID do not explicitly capture this temporal variation of the features, and rely on the ability of the features themselves to model temporal behavior.
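As a rough illustration of how temporal behavior is pushed into the features themselves, the sketch below computes 13 MFCCs and appends their first and second differences to form the traditional 39-dimensional frame vector. This is not the thesis's implementation: the librosa library, the file name and the parameter values are assumptions for illustration, and the RASTA smoothing mentioned above is omitted.

# Hedged sketch of a standard 39-dimensional MFCC front-end (librosa assumed).
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=8000)        # placeholder file, telephone-band rate
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # 13 static coefficients per frame
d1 = librosa.feature.delta(mfcc)                      # ΔMFCCs: first time derivative
d2 = librosa.feature.delta(mfcc, order=2)             # ΔΔMFCCs: second time derivative
features = np.vstack([mfcc, d1, d2])                  # (39, n_frames) feature matrix
print(features.shape)

The Δ and ΔΔ rows are what give the otherwise frame-local MFCCs a coarse view of how the spectrum changes over time.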
In this respect the GMMs are unlike the HMMs, which characterize conditional distributions to capture temporal behavior, but which are not used for speaker ID because of the inherent complexities in modeling multimodal distributions, and the lack of enough training data for each of the marginal and conditional probabilities.

More importantly, the choice of features is also a bottleneck in pushing the limits of current speech technology. The MFCCs that have been traditionally used for speech and speaker recognition are not very intuitively motivated, and more significantly, do not provide a clear picture of why they should work or fail in certain circumstances. These features implicitly code the human vocal tract characteristics, and their temporal behavior does not clearly reflect the vocal tract activity that would have generated such an MFCC structure. Further, the traditional set of MFCCs used are low-time liftered, capturing only the vocal tract information, and are thus not fully efficient, since, for example, the source information of the speaker is also rich in speaker-specific information like voice quality and could be used for speaker ID. Thus, it is useful to explore beyond the MFCCs and look for efficient features that explicitly capture specific information that would be useful for the particular task at hand. For example, recent work has shown that for speaker ID applications, the performance of a set of eight acoustic parameters matches that of the traditional 39 MFCCs for varying population sizes, typically beating the MFCCs in the case of female speakers [6]. Thus, it is a promising enterprise to embark on the search for parameters that would explicitly capture specific information from the speech signal.

1.1 The Speech Production Process & The Acoustic Parameters

If it were possible to extract reliable features that explicitly capture the activities of the speech-producing mechanisms, then it would be possible to improve the performance of current speech technology. A brief look at the speech production process explains how.

Figure 1.2 : The Speech Production Process modeled as an electrical system. A voicing source and an aspiration source, selected by a switch, excite a vocal tract filter to produce the speech signal.

The speech production process can be approximated by an electrical system, where a source signal excites a filter [7]. The behavior of the glottis is modeled by the source – when the glottis opens and closes periodically, as in the case of vowels, the source is represented by a periodic or quasi-periodic signal. When the glottis remains partially open, allowing air to flow from the lungs, and there are no major constrictions in the vocal tract, aspiration noise created just above the glottis produces the sound /h/. When the glottis remains open and a major constriction is created along the vocal tract, a second source is produced, either by creating a close constriction as in the case of fricatives (/f/, /s/ etc.) or by allowing the articulators to excite the vocal tract with sudden impulses, as in the case of stops (/p/, /t/ etc.). In such cases, the source is represented by a noise source. In some phonemes, like the voiced fricatives (/z/), both the periodic and aperiodic sources are simultaneously active, in which case the periodic energy typically manifests itself in the lower frequencies, and the aperiodic energy in the higher frequencies.
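As a small numerical illustration of the source-filter model in Figure 1.2, the sketch below excites a simple all-pole "vocal tract" with a periodic impulse train (an idealized voicing source) and, separately, with white noise (a turbulence source). The formant and bandwidth values are illustrative assumptions, not measurements from this thesis.

# Minimal source-filter sketch: periodic vs. noise excitation of an all-pole filter.
import numpy as np
from scipy.signal import lfilter

fs = 8000                         # sampling rate (Hz)
dur, f0 = 0.5, 100                # 0.5 s of excitation at a 100 Hz pitch
n = int(fs * dur)

voiced = np.zeros(n)              # voiced source: impulse train at the pitch period
voiced[::fs // f0] = 1.0
unvoiced = 0.1 * np.random.randn(n)   # unvoiced source: white noise

# All-pole filter with resonances near assumed /e/-like formant frequencies.
formants, bandwidths = [500, 1800, 2500], [80, 120, 150]
a = np.array([1.0])
for f, bw in zip(formants, bandwidths):
    r = np.exp(-np.pi * bw / fs)
    a = np.convolve(a, [1.0, -2 * r * np.cos(2 * np.pi * f / fs), r * r])

vowel_like = lfilter([1.0], a, voiced)        # periodic source -> vowel-like output
fricative_like = lfilter([1.0], a, unvoiced)  # noise source -> fricative-like output

The same filter produces harmonically structured output when driven periodically and noise-like output when driven by turbulence, which is the distinction the APP detector is later designed to quantify.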
Each of the different vowels is produced when a periodic source is modified by a different configuration of the filter representing the vocal tract; these configurations are represented by the resonant frequencies of the filter. The harmonic spectrum of the periodic source thus gets modified by the vocal tract frequency response, giving the resultant speech a spectrum that has peaks at certain frequencies. These frequencies, called the formant frequencies, are different for different vowels (and are in fact characteristic of them).

In particular, taking the example of ASR for vowels, a simple way to identify vowels automatically would be to estimate the resonant frequencies from the speech spectrum, and then identify the vowel using the formants thus obtained. Though the variability in speech, both due to the medium as well as due to speaker identity, renders the problem not so simple to solve, the principle is more or less the same. Thus, it comes naturally to use the formants as parameters for recognition of vowels – a subset of the parameters motivated by the speech production process and the acoustics involved. These parameters are called Acoustic Parameters (APs) in the remainder of this thesis.

For the case of speaker ID, the motivation for using acoustic parameters is that each speaker has a characteristic pattern of movement of the articulators [6]. Thus, when acoustic parameters are used to characterize the speaker's speech, their behavior (both temporal and spectral) can help a lot in identifying the speaker from others. Further, the physical attributes of a speaker also limit the range of values that each of the acoustic parameters derived can take, and this can provide a valuable clue to speaker identity. Putting together this potpourri of acoustic parameters, which captures various characteristics of the speaker and the speech generated, in an adequately capable modeling system can give significant results for speech technology. For example, it was recently shown in Espy-Wilson et al. [6] that the speaker ID performance of a set of eight acoustic parameters either matches or beats that of the traditional 39 MFCCs.

While capturing features, it is necessary to characterize both the vocal tract filter as well as the source signal exciting the vocal tract. The set of acoustic parameters that have been used in Espy-Wilson et al. to characterize the vocal tract are the four formants. The degree of periodicity (and aperiodicity) in the source signal exciting the vocal tract, as well as the spectral tilt of the speech produced, are used to characterize the source information. The motivation for the selection of these source features is elaborated on in [6].

The aim of this thesis is to develop an acoustic parameter that characterizes a particular type of source information (i.e., voice quality). In particular, this work deals with automatically detecting regions in speech where the source signal exciting the vocal tract shows irregular behavior – the signal is neither periodic nor noise-like, but has a different structure as a consequence of irregularity in the pattern of vibration of the vocal folds [8].

1.2 Phonation & Voice Quality

While it is the vocal tract that determines the phoneme to be produced, it is the mechanism of production of the glottal pulses exciting the vocal tract that determines the quality and perceptual attributes of the speech signal produced [7,9]. In this section, a brief description of the source mechanism during voiced speech is given [7].
The glottal pulses exciting the vocal tract are actually produced by the building of air pressure just behind the glottis. When the air pressure (coming from the lungs) is adequately high, the vocal folds that close the glottal opening are pushed apart to allow the air to escape. The volume velocity of this air excites the vocal tract, causing it to resonate at certain frequencies according to its configuration.

Figure 1.3 : A front view (left) and top view (right) of the vocal folds and the associated muscles (adapted from [7])

Figure 1.3 shows the front view of the glottis and vocal folds. Air pressure from the lungs causes the glottis to open by pushing the vocal folds apart and leaking through the opening thus formed. The figure on the right shows a top view of the vocal folds and associated muscles. Because of the close (but, importantly, not exact) symmetry between the muscles controlling the two vocal folds, the vocal folds usually open and close in synchrony with each other, and thus the resultant volume velocity is a smooth function of time, with both the vocal folds modulating the air velocity equally and in a balanced way. It is noteworthy that the modulation by the vocal folds shapes the volume velocity that excites the vocal tract, and thus influences the nature of the speech produced.

The actual modulation of the air by the vocal folds is shown in Figure 1.4. This is a view of the glottis and vocal folds from the front. It can be seen that the lower part of the vocal folds first separates due to the build-up of sub-glottal pressure, and then there is a gust of air that pushes through the vocal folds. Thus, the upper muscles now open apart and the air escapes from the glottis to the vocal tract. As the gust of air moves through the glottis, the lower muscles of the vocal folds start pulling them back together and the vocal folds start closing, from the lower part to the upper part. Thus, a Bernoulli effect is seen in the opening and closure of the vocal folds.

Figure 1.4 : Modulation of the air flow by the vocal folds (adapted from [7])

Further, the closure of the vocal folds is faster than the opening. This is because the vocal folds, which are inherently closed, have to be pushed apart and therefore require a sufficiently large sub-glottal pressure to be opened. Once they are opened, there is a large volume of air traversing upwards and thus the vocal folds are held apart. In contrast, once the air rushes out, there is no pressure to keep the vocal fold muscles apart and they close shut more rapidly. These mechanisms are reflected in the structure of the volume velocity that excites the vocal tract. The source signal that excites the vocal tract is seen to have a slower and more gradual opening phase, and a faster, more rapid closing phase (Figure 1.5).

Figure 1.5 : Volume velocity of the air flow through the glottis during modal (normal) voicing (adapted from [7])

It is obvious that the physical characteristics of the speaker greatly determine the source signal modulating the vocal tract. Indeed, the typical durations of opening and closing of the glottis are a function of the acoustic mass of the upper and lower parts of the vocal folds, as well as their acoustic compliances and the coupling compliance between them (mathematical details can be found in [7]). Thus, different speakers are expected to have different excitation patterns for their vocal tracts.
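The asymmetry just described – a gradual opening phase followed by a faster closing phase – can be sketched with a simple Rosenberg-style glottal pulse. This is only an illustrative model under assumed parameter values; it is not the glottal waveform estimation used anywhere in this thesis.

# Illustrative Rosenberg-type glottal flow pulse: slow rise, faster fall.
import numpy as np

def rosenberg_pulse(fs=8000, f0=100, open_quotient=0.6, rise_fraction=0.7):
    """One glottal cycle of volume velocity (arbitrary amplitude units)."""
    period = int(fs / f0)
    n_open = int(open_quotient * period)    # samples for which the glottis is open
    n_rise = int(rise_fraction * n_open)    # gradual opening phase
    n_fall = n_open - n_rise                # shorter, faster closing phase
    pulse = np.zeros(period)
    t1 = np.arange(n_rise)
    pulse[:n_rise] = 0.5 * (1 - np.cos(np.pi * t1 / n_rise))    # slow rise to maximum flow
    t2 = np.arange(n_fall)
    pulse[n_rise:n_open] = np.cos(0.5 * np.pi * t2 / n_fall)    # rapid fall to closure
    return pulse

train = np.tile(rosenberg_pulse(), 10)      # a short stretch of modal-like voicing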
Further, speakers are not consistent in the way they excite their vocal tract – the source signal for the same vowel can significantly change within a period of 10-15 msec for the same speaker during spontaneous speech. To complicate matters, the two vocal folds are not perfectly symmetric in terms of their 9 properties like acoustic mass, hardness etc. and this causes an imbalance in the modulation of the volume velocity by the two vocal folds. All these phenomena result in the source signal exhibiting characteristics that differ so greatly from one phoneme to another, that it sometimes becomes difficult to define what can be called “normal” and “abnormal” for a specific speaker. Perceptually, however, one can broadly classify “what speech sounds like” (i.e., the perceptual quality of speech, or Voice Quality) into three classes – Modal, Breathy and Creaky. Modal voicing is simply the normal voicing of a speaker – perceptually, it can be said that modal voicing is what a speaker usually sounds like. In this case, the vocal folds vibrate in the mode as described above, with the glottis opening and closing at regularly intervals and the vocal folds behaving symmetrically offering no greater resistance than they usually do [9]. In the case of Breathy voicing, the vocal folds do not always close completely and there is always a leakage of air through the slightly-open glottis (figure 1.6). Consequently, the speech produced has a slightly whispery or breathy quality to it, in the sense that there is always an underlying noisy element even during the vowels. Because of the leakage of air, the volume velocity has a non-zero DC component. This constant leakage of air also causes the resistance of the vocal fold muscles to be less cumbersome to air-flow, and thus, the opening phase is usually as free and the closing phase – giving rise to a symmetric glottal pulse. This, along with the fact that there is coupling between the sub-glottal and supra-glottal regions, explains the more sinusoidal nature of the glottal source pulse. Spectrally, the glottal source signal has a steeper slope (-18 dB/octave) than that of the modal case (-12 dB/octave). This is because of two reasons – the DC offset given by the leakage of the air, and the sinusoidal nature of the signal (as a signal approaches a sinusoid, its spectrum approaches that of an impulse and thus becomes steeper). This also causes the first harmonic (H1) of the source 10 spectrum to be significantly higher than the second harmonic (H2). The H1-H2 difference for breathy vowels is usually about 6 dB more than that of the modal vowels. It must be noted that in the case of breathy voicing, the frequency of vocal fold vibration does not change and the pitch (F0) remains the same as it should in the modal case. Physiologically, the difference only exists in that the vocal folds allow a leakage of air during closure. Finally, this kind of voicing should also be noted as another example where a voiced source and aperiodic source act simultaneously, with the former dominating the lower frequencies and the latter, the higher frequencies. Figure 1.6 : Glottal Configuration (top pane), Volume Velocity (middle pane) and Glottal Spectrum (bottom pane) for a creaky (left column), modal (middle column) and breathy (right column) vowel. (adapted from [9]) In the third voice quality class, i.e., creaky (or laryngealized) voicing, the vocal folds do not open completely, and close more abruptly than in the modal case. 
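Before turning to the third voice quality class, the two spectral measures just mentioned – the H1-H2 difference and the overall spectral slope – can be estimated from a voiced frame roughly as sketched below. The frame windowing, the harmonic search band and the frequency range of the slope fit are assumptions for illustration, not the exact procedure used later in this thesis.

# Rough sketch: H1-H2 difference and spectral tilt for one voiced frame,
# assuming the frame's fundamental frequency f0 is already known.
import numpy as np

def h1_h2_and_tilt(frame, fs, f0):
    """Return (H1-H2 in dB, spectral tilt in dB/octave) for one frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)

    def harmonic_db(k):
        # Amplitude (dB) of the k-th harmonic: spectral peak within +/-20% of k*f0.
        band = (freqs > 0.8 * k * f0) & (freqs < 1.2 * k * f0)
        return 20 * np.log10(np.max(spectrum[band]) + 1e-12)

    h1_h2 = harmonic_db(1) - harmonic_db(2)

    # Spectral tilt: slope of a line fit to dB magnitude versus log2(frequency),
    # which directly yields dB per octave.
    keep = (freqs > 50) & (freqs < 4000)
    slope, _ = np.polyfit(np.log2(freqs[keep]),
                          20 * np.log10(spectrum[keep] + 1e-12), 1)
    return h1_h2, slope

Under this kind of measure, breathy frames tend toward a larger H1-H2 value and a steeper (more negative) tilt, consistent with the -18 versus -12 dB/octave figures quoted above.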
In the third voice quality class, i.e., creaky (or laryngealized) voicing, the vocal folds do not open completely, and close more abruptly than in the modal case. This could be due to one of several reasons – physically, creakiness is usually attributed to heavier or strained vocal folds [7]. The vocal folds do not open completely and therefore the volume of air flowing out is also usually less as compared to the other two voicing modes. The frequency of glottal opening and closure also reduces (i.e., the pitch of the source signal drops and the pitch period lengthens) because of the difficulty in opening the vocal folds. The opening and closing phases are significantly shorter than in the modal case, for the same reason. Spectrally, this translates to a spectrum that is flatter than the modal or breathy case. This is because of the abruptness in the opening and closing, which makes the source more impulse-like, giving it a flatter spectrum. The typical spectral roll-off is -12 dB/octave, and the H1-H2 difference is actually negative, and less than that of the modal or breathy case. In this thesis, this mode of voicing is also classified as being irregular because the glottal opening and closure is not as it normally should be, and the pitch also does not fall in the range it usually does.

1.3 Irregular Phonation & Its Types

In this study, we focus on one of the variations of voice quality, namely irregular phonation. This particular category is a super-class comprising sounds that various researchers from different disciplines have called creak, vocal fry [10], diplophonia, diplophonic double pulsing [10, 11], glottalization [12], laryngealization [9], pulse register phonation [10] and glottal squeak [10]. While it is clear that each of these terms is well-defined in the literature, there still remains some difficulty in defining the exact perceptual correlates of each of these phenomena. This is because of some common source mechanisms occurring in these various forms, which do not give a characteristic clue that distinguishes them from the other types. However, the main characteristic of all these various forms of phonation is that the vocal folds do not vibrate as they would in the modal case, and the difference lies either in asymmetry of the behavior of the vocal folds, or in a significant difficulty in opening the two vocal folds to the extent that they would open in the modal case.

As an example, in the case of diplophonia (or diplophonic double pulsing), the vocal folds do not vibrate simultaneously and are out of phase. Thus, the volume velocity is modulated in a two-fold way, and there are two glottal pulses in each cycle, one being delayed and damped with respect to the other. The corresponding speech signal exhibits a waveform that shows two distinct pulses exciting the vocal tract, and a spectrum that shows interharmonics arising due to the interaction of these phase-inconsistent source signals. Most pitch trackers fail to track the pitch correctly in such cases, as was verified by the author on a few samples of such speech. An extreme case of diplophonia is diplothongia, where the vocal folds do not vibrate in synchrony at all and thus two distinct tones are produced, one due to each vocal fold.

In this work, we therefore avoid the confusion due to the differing definitions in the literature, and instead rely on the work in [10], which has shown that in spite of their varying definitions, some of these phenomena are similar to each other and are classified together by listeners to fall in the same perceptual space, thus narrowing the variety to very few classes.
We thus clearly define our task as the automatic detection of all sounds that fall in the above perceptual category, namely irregular phonation. Irregular phonation is defined for the scope of this thesis as "those sounds which fall into one of the categories of creak, vocal fry, diplophonia, diplophonic double pulsing, glottalization, laryngealization, pulse register phonation, glottal stop or any mode of phonation that arises either from asymmetric vibration of the vocal folds or from a relatively abrupt closure of the vocal folds".

1.4 Why This Thesis?

Speakers have a certain characteristic quality to their voice, which is a consequence of their style of phonation, as well as their individual source properties. The variety in voice quality is a well-studied phenomenon, and researchers from various disciplines like speech processing, voice pathology, phonetics, linguistics and music have examined the various aspects of phonation. Detection of irregular phonation by visual inspection, and description of the characteristic features that identify it, as well as those that distinguish it from modal or breathy speech, has opened the doors to approaching one of the most important problems in speech processing – capturing the source information. This is especially true in present speaker identification systems, as studies have shown that different speakers can glottalize at different rates that are characteristic to them [12]. Traditional speaker identification and speech recognition systems have, however, focused on modeling the vocal tract, and there have been relatively few studies that try to incorporate source information into the system. One of the probable reasons for this is that there is still no set of parameters that one can use to explicitly and automatically arrive at the source information. The goal of research in the Speech Communication Lab at the University of Maryland is, in addition to other applications, to arrive at a set of parameters that can automatically identify the degree and type of irregular phonation of a speaker. Since voice quality is a characteristic of the speaker, such a set of parameters would certainly boost the performance of a speaker identification system. Identification of creaky regions has been one of the tasks in this long-term goal.

In addition to speaker identification, the task of automatic detection can contribute to ASR as well. Most ASR systems rely on the traditional MFCCs, but recently, research is also being pushed in the direction of landmark-based ASR using acoustic parameters [cf. 13]. Identifying the source information and removing it from the speech signal to arrive at the vocal tract behavior could give insight into the actual articulators involved during the production of that phoneme. This could well lead to the recognition of the phoneme being spoken. However, in order to remove the source information, it is important to get an idea of what kind of voicing is involved and what the glottis configuration has been – the irregular phonation detector could give insight into these issues and thus help in ASR.

Automatic identification of irregular phonation can also aid in the study of voice pathology. This is an important potential application, as it can lead to non-intrusive diagnosis of laryngeal problems. It may be possible to identify the exact problem by simply processing the speech file with an algorithm that can identify how the vocal folds are vibrating and what physical activity is going on inside the larynx.
While this thesis does not really focus on this particular problem, it is a step in that direction and gives sufficient thought to that possibility as a future application, in terms of the design of the pitch detection algorithm and other internal working details.

This work can contribute to phonetics by identifying regions of speech such as turn-taking etc., and also to the task of language identification, as there are a variety of languages that exploit creakiness and breathiness to articulate certain sounds [11]. Analyzing the rate of occurrence of certain creaky vowels can help in identifying the language of communication. Lastly, this work can also aid in automatic emotion recognition. Speakers under stress and those who have spoken for a long time, as well as physically exhausted speakers, show a tendency to go creaky in their speech. This fact can be exploited to detect if a certain speaker is under stress of some kind. It also seems possible to use this system to detect voice imitation, as certain speakers may have characteristic patterns of irregularity in their phonation, and it may not be possible for an imitator to mimic these. While a number of such potential applications have been listed, this work also supports one of them – speaker identification – numerically, as an illustration. Thus, it is indeed a worthwhile task to design a system that can automatically detect irregular phonation.

1.5 Literature Survey & Contributions of this Thesis

The problem of trying to identify certain source characteristics has long been investigated by various researchers for different potential applications, and the attempts are too numerous to list with just a few references. However, there has been relatively little work done on actually detecting irregular vocal fold activity during spontaneous speech. Further, developing this phenomenon as a means to characterize speakers is an idea that has not been investigated previously.

Published attempts at explicitly detecting creaky voicing by automatic methods seem to have been made in parallel by three other authors, all in the period of 2004-2005. In the first of these works, published by Ishi in 2004 [14], the author tries to classify segments of clean speech as belonging to either vocal fry or modal voicing. Both voiced and unvoiced regions of speech are investigated. The acoustic cues that have been proposed for the purpose are the variation of short-time power over consecutive glottal cycles, and features called intra-frame periodicity and inter-pulse similarity, based on the properties of the autocorrelation function of the voiced segments. The detection rate reported was 73.3%, with an insertion rate of 3.9%.

Independent and parallel work was published by Slifka [15] in 2005, where the task was to classify tokens of clean speech from the TIMIT corpus containing modal voicing from those containing creaky voicing. The acoustic features extracted from the voiced segments are the pitch, normalized root mean square amplitude, smoothed energy difference and shift-difference amplitude. These features have been used with a linear SVM as a back-end, and have yielded a detection accuracy of 91.25%, with an insertion error rate of 4.98%.

Parallel work was also published by Yoon et al. [16] in the same year, with the task being the same as in the other two, namely, classification of segments of speech into modal or creaky voicing.
However, the task was tougher in the sense that the corpus used was telephone speech, which exhibits significant channel distortion, and where pitch detection algorithms have been reported to fail even in voiced regions of speech [17]. Preprocessing steps were applied to boost the speech in order to equalize the channel effects, and then a robust pitch tracking algorithm was used to detect creakiness. Fundamental frequency (F0) and spectral cues were used with an SVM to perform classification, and the classification accuracy was reported to be 75%.

The work done in this thesis, independent from and parallel with the above works, is a significantly different task for various reasons. The first is the data that has been used to evaluate the algorithm. The idea of developing the Irregular Phonation Detector was to be able to detect and numerically characterize creakiness in spontaneous speech, which includes regions of speech with no voicing and where the speech need not be clean. Indeed, this algorithm has been tested on both clean and noisy speech corpora, and has been found to give results approaching the best of the above approaches ([15]) on clean speech (no results are reported in [15] for noisy telephone speech). Specifically, in the case of the noisy speech on which the algorithm was tested, various degrading elements like channel distortion, laughter, coughing, etc. have been dealt with. Also, while the above works have been tested on regions of voiced speech that were predetermined and extracted manually, this algorithm operates on speech and non-speech from all sources. While it is a trivial affair to identify regions of voicing in the case of clean speech like the TIMIT corpus, using either the transcription labels or a sufficiently good pitch detection algorithm, the same is not the case with spontaneous speech. In the latter case, transcriptions are not available for detecting sonorant (voiced) regions, and pitch detection algorithms have been reported to fail to detect voicing [17] and to detect voicing in regions where there is no voicing in reality (observed by the author on various occasions). Even other simple and well-established acoustic features like the zero crossing rate, energy thresholds etc. fail when they are used to separate voiced from unvoiced regions in spontaneous speech from telephone data. All these facts are illustrated in Figure 1.7, which demonstrates the temporal behavior of some of these features in telephone speech.

Figure 1.7 shows a region of laughter (/ehehe/, boundary marked by red lines), where the region is partitioned into five sub-regions: the first, third and fifth are voiced and breathy (/e/) and the second and fourth are aspirated. These two kinds of regions should be separated from each other by an appropriate voiced/unvoiced detector. Further, the detector must be able to separate unvoiced regions from creaky regions (which are voiced). It may be noted that the pitch detector of the ESPS software does not do a good job of detecting all the creaky regions, and thus cannot be used for the purposes of this thesis. Further, the zero crossing rate and the energy are also not very useful in separating out these regions. For example, in regions 1, 2 and 3, wherein 1 and 3 fall in the voiced category and 2 in the unvoiced, the zero crossing rate is seen to be in the same range. Further, even the short-time energy parameter in regions 1 and 2 is in the same range.
Thus, no statistical framework can be used in this case to separate the voiced regions from the unvoiced. Further, it may also be seen that the energy parameter in the unvoiced regions (which is typically expected to be low compared to voiced regions) is significantly higher than in the voiced regions towards the end, where creakiness sets in. This kind of behavior is very common in spontaneous speech, and it is simply impractical to use any of these measures in a simple way and expect good separation between the voiced and unvoiced regions.

In this work, not only has the creakiness detection algorithm been designed to handle noisy speech containing distortions, as well as non-voiced parts of speech that would be identified as voiced by most current voicing-detection approaches, but an accurate pitch detection algorithm has also been incorporated into the Aperiodicity, Periodicity and Pitch (APP) detector [18]. An illustration of its performance may also be seen in Figure 1.7. The ESPS pitch tracker clearly fails to track the correct pitch in creaky regions, and in regions with modal voicing followed by creaky voicing, it gives wrong pitch estimates for one of these two modes of voicing. The APP detector pitch tracker, however, is seen to do a noticeably better job at tracking the pitch, and fails to find only one voiced region – the breathy region 5. It does not give the correct pitch value in some frames in one of the creaky regions, and shows a doubling error, but still, it could be said that on an overall basis, this modified pitch tracker shows greater accuracy than the ESPS pitch tracker. This modified pitch detection algorithm has been found to give a pitch detection accuracy of 98.3%, compared to the detection accuracy of 95.4% of the earlier version of the APP detector. The irregular phonation detector gave a detection accuracy of 91.8% with an insertion rate of 17.4% in terms of instances, or 5.6% frame-wise. The irregular phonation detector was incorporated into the APP detector system. Further, as a demonstration of the speaker-distinguishing capability of the parameter thus developed, speaker identification experiments have been performed.

Figure 1.7 : Illustrative example showing the behavior of the pitch, zero crossing rate and energy in a speech file. (i) the time signal, (ii) spectrogram with the formants overlaid, (iii) pitch track from the ESPS software, (iv) pitch track from the modified APP detector, (v) zero crossing rate, (vi) short-term energy. The red lines delineate the boundaries of a region of laughter in the speech file. The black lines separate the breathy voiced regions of laughter from the aspiration regions.

1.6 Acoustic Parameters for Speaker Identification

While traditional speaker identification systems rely on the vocal tract dynamics and under-emphasize the significance of the source, more recent work has shown (cf. [19]) that the addition of source information can provide valuable speaker-specific information. The set of knowledge-based APs proposed for speaker ID includes four parameters related to source information (the amount of periodic energy, the amount of aperiodic energy, the spectral slope of the signal, and the amount of creaky energy in the speech signal), complemented by four parameters characterizing the vocal tract behavior, the four formants (F1, F2, F3, F4) [6].
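To make the modeling step concrete, the sketch below shows one way per-frame vectors of these eight APs could be modeled with one Gaussian mixture model per speaker. This is a hypothetical illustration: scikit-learn is an assumed dependency, and the training configuration of the actual system in [6] may differ.

# Hypothetical sketch: per-speaker GMMs over 8-dimensional AP vectors
# (periodic energy, aperiodic energy, spectral slope, creaky energy, F1-F4).
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(ap_frames_by_speaker, n_components=16):
    """ap_frames_by_speaker: dict speaker_id -> (n_frames, 8) array of AP vectors."""
    models = {}
    for speaker, frames in ap_frames_by_speaker.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        models[speaker] = gmm.fit(frames)
    return models

def identify(ap_frames, models):
    """Pick the speaker whose GMM gives the highest average log-likelihood."""
    scores = {spk: gmm.score(ap_frames) for spk, gmm in models.items()}
    return max(scores, key=scores.get)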
Text-independent speaker identification experiments were performed using this feature set and the standard MFCCs for populations varying from 50 to 250 speakers of the same gender.

The relative amounts of periodic and aperiodic energy differ considerably depending upon the voice quality. For modal voice, sonorant regions (vowels and sonorant consonants) are strongly periodic with little if any aperiodicity. For breathy voice, sonorant regions are periodic at low frequencies with some aperiodicity in the region of F3 and the higher formants [9]. Finally, for creaky voice, there is irregular vocal fold vibration, so that the irregular phonation detector (now incorporated into the APP detector) finds creaky energy. Speakers differ not only in the voice quality used to produce sonorant sounds, but also in the tradeoff between the supraglottal turbulent source and the glottal voicing source used to produce voiced fricatives. The differences in the amounts of periodic, aperiodic and creaky energy, due to various factors, are captured by the APP detector. Another measure used to capture information about the source is the spectral tilt of the speech signal, as explained in an earlier section.

In addition to the various measures used to characterize the source, the formant frequencies F1 through F4 are used to characterize the vocal tract. The frequency range over which these formants vary is an indication of vocal tract size and shape. F3 characterizes the length of the vocal tract of the speaker, and hence an average measure of F3 helps in capturing this speaker-specific information. F4 provides the higher frequency characterization of the vocal tract. Finally, it has been speculated that F4 (sometimes called the singer's formant) in vowel regions is attributed to the resonance of the laryngeal cavity [20]. Given the relative stability of F4 in running speech for a particular speaker and its variance in frequency across speakers, it may be an additional measure of voice quality [21].

1.7 Outline of the Thesis

Chapter 1 provides a brief introduction to speech technology and the speech production process, and motivates the use of the acoustic parameters for various speech technology applications. This chapter describes in detail the source mechanisms in voiced speech, as well as irregular phonation. The motivation behind this work is then described, and the contributions of this thesis are highlighted.

Chapter 2 briefly describes the APP detector system, and explains the failure modes of the pitch detection algorithm in regions of irregular phonation, which also hinder the detection of such regions. A solution to this problem is then motivated, and the chapter ends with a description of appropriate signal processing parameters for the APP system.

Chapter 3 details the acoustic characteristics that are essential to detect creakiness, and motivates their use. The correction of pitch, and the detection of creakiness, are then discussed.

Chapter 4 discusses the results of the creakiness detection and pitch detection experiments, and demonstrates the use of creakiness as an acoustic parameter for speaker identification. This chapter also discusses the reasons for errors and the influence that some of these errors may have on the other tasks.

Chapter 5 discusses ways to improve performance and to deal with channel variations. The interplay of the acoustic parameters, and possible extensions to the realms of ASR and other applications, are then discussed.
Chapter 2
MODIFICATIONS TO THE APP DETECTOR

The algorithm for the detection of irregular phonation can process an input speech file and automatically identify regions of the signal where the speaker exhibits irregular phonation. The algorithm is an extension of the Aperiodicity, Periodicity and Pitch (APP) Detector [18], a system that processes a speech file to give a spectro-temporal profile indicating the amount of aperiodicity and periodicity in different frequencies with time. The APP detector exploits the behavior of the Average Magnitude Difference Function (AMDF) to arrive at decisions of periodicity and aperiodicity for each frame. A complete description of the APP detector is given in [18]. In this chapter, the system is first briefly described for the sake of completeness, and then the reasons for the failure of the APP system to detect pitch in regions of irregular phonation are discussed. The output of the APP detector for different kinds of regions of speech is then discussed, and the current behavior of the APP detector for regions of irregular phonation is shown, highlighting the goals of this thesis. Finally, working parameters for using the APP system are recommended.

2.1 The APP Detector

The APP detector is a time domain algorithm that, at a broad level, makes a decision about how much voiced and aperiodic energy is present in a signal. To be more precise, this algorithm estimates 1) the proportion of periodic and aperiodic energy in a speech signal and 2) the pitch period of the periodic component. Most of the algorithms used to detect aperiodicity are passive, i.e., aperiodicity is treated as the inverse or lack of periodicity, and they are prone to errors in situations where the signal has simultaneous strong periodic and aperiodic components. The APP detector is thus particularly useful in situations where the speech signal contains simultaneous periodic and aperiodic energy, as in the case of some voiced fricatives, breathy vowels and some voiced obstruents. Figure 2.1 shows the block diagram of the APP detector.

Figure 2.1 : Block diagram of the Aperiodicity, Periodicity & Pitch detector

The first block in the APP Detector is an auditory gamma-tone filter-bank that splits the signal into 60 frequency bands, with the upper-most center frequency being defined by the sampling rate. The filter-bank is chosen in order to capture the perceptual importance of different frequencies. The outputs of the higher frequency channels (channels centered above 300 Hz) are then smoothed using the Hilbert transform to extract the envelope information and remove the finer structure. The next stage of the APP Detector incorporates frame-by-frame silence detection by applying a threshold to the signal energy in each channel. The energy is first normalized by the maximum energy in the speech signal, and then an empirically determined threshold of 0.0015 is used.
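A minimal sketch of this per-channel envelope extraction and silence test is given below. A Butterworth band-pass filter stands in for the gamma-tone filter of the real system, the frame length is an assumed value, and the energy normalization shown is one plausible reading of the description above rather than the APP detector's exact computation.

# Sketch of one APP-style channel: band-pass filtering (stand-in for a gamma-tone
# filter), Hilbert-envelope smoothing for high channels, and frame-wise silence
# detection against the 0.0015 normalized-energy threshold quoted in the text.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def channel_envelope(x, fs, lo, hi):
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    band = sosfiltfilt(sos, x)
    # Channels centered above 300 Hz are smoothed to their Hilbert envelope.
    return np.abs(hilbert(band)) if lo > 300 else band

def frame_energy(sig, frame):
    n = len(sig) // frame
    return np.array([np.sum(sig[i * frame:(i + 1) * frame] ** 2) for i in range(n)])

def silence_mask(channel, x, fs, frame_ms=20, threshold=0.0015):
    """True for frames whose channel energy, normalized by the maximum frame
    energy of the whole signal (an assumed normalization), is below threshold."""
    frame = int(fs * frame_ms / 1000)
    e = frame_energy(channel, frame)
    return e / (np.max(frame_energy(x, frame)) + 1e-12) < threshold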
Following this, the signal in each channel is processed to obtain the frame-wise Average Magnitude Difference Function (AMDF), in order to identify the amount of periodicity or aperiodicity in the signal. The signal is analyzed on a frame-by-frame basis. For each frame, a windowed portion of the signal, centered at the frame center, is used to compute the AMDF. The windowed signal is subtracted from a windowed portion of its neighbor located at a lag of k samples, and the magnitude of the resultant difference signal is summed to give a single number. For each such lag, the AMDF gives a number, and thus, for each frame, one obtains the AMDF as a function of the lag k. Mathematically, the AMDF γ_n[k] of a signal x[n] is defined as

γ_n[k] = Σ_{m=-∞}^{+∞} | x[n+m]·w[m] − x[n+m−k]·w[m−k] |

where w[n] represents a rectangular window centered at n and having the required width.

The conceptual idea behind the AMDF is the same as that of the Auto-Correlation Function (ACF): signals with periodic content will yield an ACF that shows peaks at lags spaced by the pitch period. However, the ACF involves multiplication, and is thus computationally expensive when calculated for a large number of lags. The AMDF is a computationally less expensive substitute for the ACF, in that it uses subtraction instead of multiplication. The AMDF behaves like the ACF, with one difference – with varying lag k, a periodic signal causes dips instead of peaks at lags corresponding to the pitch period and its multiples. This is quantified in the APP detector by calculating the dip strengths using the convex hull of the AMDF.

Figure 2.2 : Sample AMDF and dip strengths for a strongly periodic (left) and aperiodic (right) frame

Figure 2.2 contrasts a sample AMDF function from a single channel of a strongly periodic signal with that of a strongly aperiodic signal. The vertical lines superposed on the figure represent the strength of the dips. The noteworthy features are that the AMDF shows strong dips at lags equivalent to the pitch period and its multiples in the former case, while the dips are significantly weaker and randomly distributed in the latter case.

Decisions of periodicity and aperiodicity are made by summarizing this trend across all channels. For a periodic frame, it is expected that all non-silent channels will exhibit a similar trend in the AMDF dips, due to the underlying periodicity arising from the glottal source. Thus, when the dips are summarized across all channels by addition, the dips will all cluster tightly together at lags equaling the pitch period and its integer multiples, and give a significant dip strength. If this is the case for a particular frame, then that frame is classified as being periodic (per). Channels that contribute to this periodicity profile are then called periodic channels, and the other non-silent channels are called aperiodic (aper). For an aperiodic frame, on the other hand, the dips, when summarized, will display random behavior as a consequence of the individual channel behavior. The summary measure of periodic and aperiodic content is obtained by multiplying each channel's per/aper decision by its energy and then adding across channels. Thus, for each frame, a per/aper decision is made for the frame, its individual channel-wise per/aper profile is produced, and the amount of aperiodic and periodic energy is also obtained. The entire speech file is processed this way, to get frame-wise output.
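The sketch below implements the AMDF definition above for one frame and one channel, together with a simplified dip-strength measure (the depth of each local minimum below its surrounding maxima). The convex-hull dip computation of the actual APP detector is more elaborate, and the frame and lag ranges here are assumed values.

# Sketch: frame-wise AMDF with a rectangular window, plus a simplified
# dip-strength measure used only for illustration.
import numpy as np

def amdf(x, frame_center, frame_len, max_lag):
    """gamma_n[k] = sum_m |x[n+m]w[m] - x[n+m-k]w[m-k]| for a rectangular window.
    Requires frame_center - frame_len//2 - max_lag >= 0."""
    half = frame_len // 2
    seg = lambda start: x[start:start + frame_len]
    ref = seg(frame_center - half)
    return np.array([np.sum(np.abs(ref - seg(frame_center - half - k)))
                     for k in range(1, max_lag + 1)])

def dip_strengths(gamma):
    """Depth of each local minimum of the AMDF below its neighboring maxima."""
    dips = {}
    for k in range(1, len(gamma) - 1):
        if gamma[k] < gamma[k - 1] and gamma[k] < gamma[k + 1]:
            left = np.max(gamma[:k])
            right = np.max(gamma[k:])
            dips[k + 1] = min(left, right) - gamma[k]   # lag index is 1-based
    return dips

For a strongly periodic frame, the resulting dips line up at lags equal to the pitch period and its multiples; for an aperiodic frame they come out weak and scattered, which is exactly the contrast the cross-channel summary described above exploits.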
2.2 Dip Profiles in the APP Detector

Figure 2.3 : (left) Spectro-temporal profile of periodic energy (blue) and aperiodic energy (red) generated by the APP detector for a (a) periodic vowel, (b) aperiodic region, (c) breathy vowel, (d) voiced fricative and (e) creaky vowel region. The relevant regions are highlighted in the dotted box. (right) Corresponding dip-profiles for a sample frame of the respective regions.

The key to detecting regions of irregular phonation, and to pitch correction in such regions, is the dip summary or profile in those frames. Figure 2.3 demonstrates the dip profiles for a single frame of a vowel, a fricative, a breathy vowel, a voiced fricative and a creaky region. It may be noted that these profiles are obtained by adding together the dip strengths across all the channels for each frame. The vowel, being a periodic signal, shows strong dip clusters at multiples of the pitch period. The maximum strength of the dip profile is noteworthy, as is the distribution of the dips. Typically, this value is greater than 10 for periodic frames, because most channels are periodic in nature and thus show the AMDF dip in the same location, giving an added strength of at least 10. Because of the well-formed strong clusters, the decision is that the frame is periodic. The channels that contribute to this dip cluster are all called periodic, while the others are called aperiodic. Since most channels in periodic frames show a periodic AMDF, the APP detector's spectro-temporal output is seen to be periodic (blue) in most of the vowel regions.

The fricative, on the other hand, displays a random distribution of dips, which have a very low strength even upon summarization across channels – this is clearly because the dips do not cluster and add up together. Thus, the typical value of the maximum dip in the summary profile is less than 1.5. Due to the lack of strong clusters, the APP detector decision is that the frame is aperiodic, and since the channels do not show any periodic behavior contributing to clusters, the APP spectro-temporal profile shows mostly aperiodic (red) behavior.

In the case of the breathy vowel and the voiced fricative, the dips show mixed behavior – some dips cluster (due to the voiced (glottal) source at the low frequencies), while other dips are randomly distributed (due to the turbulence caused by the supra-glottal source, which dominates at the higher frequencies). Thus, the strength of the dips is less than that of the periodic frame, but greater than that of the aperiodic frame. Typical values of the dip maximum lie between 1 and 10. The corresponding decision made by the APP detector depends on the strength of the clustering of dips. If the dips are clustered strongly enough, the decision is that the frame has some periodicity, and the channels contributing to the periodicity are marked as periodic in the spectro-temporal profile. The channels that do not contribute to the periodicity are called aperiodic, and they are identified in red in the spectro-temporal profile. As expected, most of the periodic energy in the APP output profile is located at the lower frequencies, and most of the aperiodic energy is located at the higher frequencies.

Finally, the case of the creaky frame lies midway between that of the periodic frame and the aperiodic frame. It lacks the tightly packed dip structure of the periodic frame, but also lacks the randomness of the dip structure of the aperiodic frame, because of an inherent structure in the voicing source. In the case of breathy vowels and voiced fricatives, the clusters are not strongly prominent, but are still tight in the sense that the dips belonging to periodic channels lie in a very narrow cluster. This is why the APP detector identifies such regions as being periodic.
However, for the creaky vowel, the clusters formed by the dips are much broader than they would be for periodic frames (the reason for this is explained shortly), and thus the APP detector does not identify any “periodic” (or more precisely, phonation) structure in these dip profiles. The lack of tight clusters and the weakness of the maximum dip (which is usually around 1.5 and thus comparable to that of the aperiodic frame) cause the APP detector to make a decision of aperiodic for all creaky frames. As the corresponding frames' channels are also called aperiodic, the spectro-temporal profile in such cases shows primarily aperiodic energy (red), as shown in Figure 2.3.

Figure 2.4 : A sample modal vowel and creaky vowel compared in terms of (i) spectrogram, (ii) periodicity (blue) / aperiodicity (red) profile, (iii) time waveform, (iv) source or glottal signal exciting the vocal tract and (v) AMDF of the signal and associated dip structure. The vowels being compared are the same (/e/).

The root of the behavior of the APP Detector for frames with irregular phonation can be traced to the AMDF structure of such frames. Figure 2.4 shows the time waveform, the corresponding glottal waveform obtained by the inverse filtering technique [22], and the corresponding AMDF obtained by the APP Detector for one frame each of a modal and a creaky vowel. The most striking point here is that the AMDF closely captures the information in the glottal waveform – this is evident in the relative positions of the peak of the glottal waveform and the dips of the AMDF in the case of the modal vowel. This correspondence is not as exact for the creaky vowel, because the pitch shows some jitter, and thus the AMDF does not achieve exact alignment of the signal with its delayed version to give dips at exactly the pitch period. This causes the AMDF to have dips that do not cluster properly and are not in alignment with their counterparts from other channels. Further, these dips are not as strong as in the case of periodic frames, because the irregular vibration of the vocal folds results in a glottal pulse of lower amplitude than in modal phonation.

The dips may also not be of equal strength. For example, in certain cases like diplophonia, where the glottal pulse consists of a main pulse followed by a delayed pulse of reduced amplitude, the AMDF shows two kinds of dips corresponding to the two different pulses – one stronger than the other, and each repeating at intervals corresponding to its respective pulse. In all such cases of irregular phonation, the dips are of different strengths and at different locations. A summary measure of the dips across channels exhibits a dip profile significantly different from that of the modal vowel, due to the lack of periodicity and the varying dip strengths at different lags. Owing to these differences in the glottal pulse of modal and irregular phonation, the summarized dip profile of irregular phonation exhibits dip clusters that exist but are not well defined, and that lie at locations not equal to integral multiples of the pitch period. It is not surprising that the APP Detector calls such frames aperiodic – these frames do indeed exhibit a profile that does not show tight clustering and has an overall dip strength comparable to the dip-profiles of aperiodic frames.
The goal of this thesis is to modify the decision process of the APP detector in such a way that it accommodates the loose clusters of the creaky vowel, while still recognizing that tight clusters imply the presence of periodic (modal phonation) energy and that a lack of clusters implies aperiodic energy. Thus, the spectro-temporal profile of the APP detector must change from one like that shown in Figure 2.5(b) to one like that shown in Figure 2.5(c), where the green region represents a region of irregular phonation, as opposed to a blue region that represents periodic energy due to modal phonation and a red region that represents aperiodic energy due to turbulence.

Figure 2.5 : (a) Spectrogram of a segment of speech containing a creaky vowel, (b) Spectro-temporal profile of the old APP detector, which identifies the creaky vowel as being aperiodic, (c) Spectro-temporal profile of the modified APP detector, which identifies the creaky vowel as having irregular phonation and classifies it as being separate from aperiodicity due to turbulence

2.3 Recommended Specifications

The APP detector has been designed to operate at various frame rates and window sizes, as desired in potential applications. However, in order to detect creakiness, the specifications used should satisfy certain criteria, because of the inherent acoustic characteristics of this phenomenon. In particular, since the pitch falls relatively low during creaky regions, sometimes as low as 60 Hz, the window must be large enough to accommodate at least one pitch period – this ensures the formation of a loose cluster that can then be used to detect creakiness. In terms of samples, this means a longer window at a higher sampling frequency and a shorter window at a lower one. Typically, it is useful to have a window size of about 1.5 times the longest expected pitch period (i.e., the period corresponding to the minimum expected pitch frequency) – this ensures proper formation of a cluster.

Creakiness usually manifests itself for durations of less than 20 msec; the typical duration of creakiness in the TIMIT corpus is about 10 msec. Thus, if it is desired to identify the whole region of creakiness and not just a single instance, a high frame rate is necessary. If the frame is displaced by 2.5 msec, then for a 10 msec creaky region, four frames will be called creaky and will thus identify a region. If the frame were displaced by 10 msec (the typical frame rate for most speech processing applications), only one frame would be called creaky. This is undesirable, because a single creaky frame provides no means of eliminating false alarms. With a high frame rate, any spurious detections can be median-filtered and removed. In the approaches mentioned in the literature survey, features are extracted by moving the window one sample at a time, giving an extremely high frame rate. In the case of the APP system, a frame rate of 2.5 msec is recommended in order to reduce the computational load while still keeping frames close enough together to detect regions of creakiness and remove spurious false alarms.
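The sizing rule above can be made concrete with the small sketch below, which derives a window length and frame shift in samples from a minimum expected pitch. The 60 Hz floor, the factor of 1.5 and the 2.5 msec shift come from the text; the function name and the sampling rates in the usage example are illustrative assumptions.

```python
def app_frame_specs(fs_hz, min_pitch_hz=60.0, frame_shift_ms=2.5):
    """Window length (in samples) spanning about 1.5 of the longest expected
    pitch periods, plus the frame shift in samples, for a given sampling rate."""
    longest_period_s = 1.0 / min_pitch_hz                     # 60 Hz -> ~16.7 ms
    window_samples = int(round(1.5 * longest_period_s * fs_hz))
    shift_samples = int(round(frame_shift_ms * 1e-3 * fs_hz))
    return window_samples, shift_samples

# Toy usage: 16 kHz (TIMIT-like) and 8 kHz (telephone-like) sampling rates.
print(app_frame_specs(16000))  # (400, 40) -> 25 ms window, 2.5 ms shift
print(app_frame_specs(8000))   # (200, 20)
```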
Chapter 3

ACOUSTIC CHARACTERISTICS FOR DETECTION OF IRREGULAR PHONATION

Detection of regions of irregular phonation is not a trivial problem, and when potential applications include processing speech from real-world corpora, provision must be made either to counter the influence of corrupting factors, or to render the algorithm invariant to such factors. In this thesis, the algorithm has been designed to work on telephone speech, and no prior assumptions are made about knowledge of voicing regions or channel characteristics.

The general approach to the problem has been to first detect suspect regions of irregular phonation, and then to determine whether such regions do indeed exhibit voicing, and if so, whether the voicing is due to creakiness (the voicing might otherwise be due to voiced fricatives or breathy vowels, which show a more periodic structure). This is done by taking into account a number of acoustic characteristics that provide cues about voicing and creakiness. Conventional approaches, namely the zero crossing rate and an energy threshold, are used to separate the voiced regions from the unvoiced. These prove inadequate to make a clear distinction between the two, as demonstrated in Figure 1.7. Therefore, the dip profile is also used to make the voiced/unvoiced decision. At this stage, the dip profile is tested to determine whether it is due to creakiness or due to voicing in breathy vowels or voiced fricatives. Once creakiness is suspected, an estimate of the pitch is made from the dip profile. Following this, creakiness is reconfirmed by using the pitch estimate to compute the confidence of creakiness in that dip profile. At the end of this stage, the decision of whether the frame is creaky or not is made. This entire procedure is shown in the flowchart of Figure 3.1.

Figure 3.1 : Flowchart showing the main stages of the algorithm for detection of irregular phonation. The stages, in order, are: APP detector; pitch confidence < 25%?; ZCR > threshold?; spectral slope > threshold? (voicing detector); cluster the dips above 1000 Hz; does the dip profile show some (wide) clustering?; estimate the pitch from the maximum dip; are sufficient channels contributing to the dip profile?; creakiness decision: yes, with pitch estimate (creakiness detector). At every decision stage, a decision of "No" causes the algorithm to exit.

Each of the stages in this flowchart, and the motivation for its inclusion in the algorithm, is elaborated in this chapter.

3.1 Detecting Suspect Regions of Irregular Phonation

The APP detector contains a pitch detector that examines the dip profiles and gives voiced/unvoiced decisions and pitch estimates depending on the clustering of dips in each frame. If a frame is judged to be voiced, the main cluster is the cluster containing the maximum dip, which is used to estimate the pitch. The channels that contribute to the main cluster are then identified, and the (normalized) energies of all these channels are added together to obtain what is called the pitch confidence. Thus, the maximum possible pitch confidence equals the number of channels present. In periodic frames, the pitch confidence is typically greater than 50% of the maximum possible. In regions of creakiness, voiced fricatives and breathy vowels, where there is some voicing but the clustering is not as strong as in periodic frames, the pitch confidence lies between 10% and 40% of the maximum possible. In creaky frames specifically, the pitch confidence is rarely greater than 25% of the maximum. For purely aperiodic frames, the pitch confidence is usually near zero and does not exceed 10% of the maximum. Thus, regions having pitch confidence greater than zero and less than 25% of the maximum possible are identified as suspect regions where creakiness could be present.
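As a rough illustration of this stage, the sketch below computes a pitch-confidence value from per-channel normalized energies and a mask of channels contributing to the main cluster, and then applies the "greater than zero and less than 25% of the maximum" rule quoted above. The function and variable names are illustrative assumptions rather than the APP detector's internal interface.

```python
import numpy as np

def pitch_confidence(norm_channel_energy, in_main_cluster):
    """Sum of normalized channel energies over channels contributing to the
    main dip cluster. With unit-normalized energies, the maximum possible
    value equals the number of channels."""
    return float(np.sum(norm_channel_energy[in_main_cluster]))

def is_suspect_creaky(confidence, n_channels, upper_frac=0.25):
    """Suspect-region rule: confidence strictly between 0 and 25% of maximum."""
    return 0.0 < confidence < upper_frac * n_channels

# Toy usage: 60 channels, a handful contributing weakly to the main cluster.
energy = np.full(60, 1.0)            # unit-normalized channel energies
cluster_mask = np.zeros(60, dtype=bool)
cluster_mask[:9] = True              # 9 of 60 channels contribute (15%)
conf = pitch_confidence(energy, cluster_mask)
print(conf, is_suspect_creaky(conf, n_channels=60))  # 9.0 True
```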
3.2 Energy Threshold

Using an energy threshold is a common strategy for speech detection in many speech processing applications. In real, spontaneous speech data, as in telephone speech corpora, background noise and instrument noise from the telephone microphone and recording equipment cause a non-zero random signal to be present in the recorded speech. This random signal component can sometimes exhibit a dip profile that appears very similar to that of irregular phonation – this is purely a coincidence and is not due to voicing. Such instances must be eliminated from further analysis, since they might be called creaky if subjected to the dip profile analysis. A simple solution is to apply an energy threshold that suffices to separate speech from non-speech. In the original APP detector algorithm, a threshold of 0.015 was empirically found to be apt for such separation, and it appears to suffice as a threshold for creakiness detection as well. It was also seen that this threshold does an appreciably good job of separating some weak fricatives, glides and consonants from voiced regions.

3.3 Zero Crossing Rate

Another common strategy to separate fricatives from regions of voicing is to exploit the zero crossing rate (ZCR). Fricatives, having a more noise-like random nature, have a higher ZCR than voiced regions. Thus, in typical clean speech processing applications, the ZCR proves to be a good candidate for detection of voicing. However, in the case of telephone speech, the ZCR is not a good candidate. It is observed that fricatives can sometimes have the same ZCR as voiced regions. This is due to the channel and device noise that rides on the voiced regions, causing the ZCR to increase and approach that of the fricatives. Figure 3.2 demonstrates one such case, where the ZCR for a fricative is different from that of one voiced region, but is about the same as that of another voiced region. Thus, using the ZCR requires a more complex analysis than merely applying a fixed threshold.

It has been observed that in the case of laughter and coughing, unlike the case of fricatives, noise is spread over all frequencies and has an approximately flat spectrum, except at the very low frequencies (< 300 Hz). Therefore, if the ZCR is to be used, there should be a provision for accounting for the spectral spread of the various phenomena – the noise due to fricatives (which occurs at frequencies above 2000 Hz) should be treated on the same footing as that due to laughter (which occurs at almost all frequencies). However, if the ZCR were computed taking all frequencies into account, the fricatives would show a lower ZCR than laughter, because laughter consists of high frequency noise (equivalent to fricative noise) riding on low frequency noise. In order to treat these phenomena at the same level for ZCR processing, it is necessary to account for their frequency extent.

Figure 3.2 : Illustration of the zero crossing rate in the voiced and unvoiced regions. Here, the 3rd pane is the zero crossing rate (ZCR) for the signal components > 2000 Hz, and the 4th is for the signal components < 2000 Hz. The last pane shows the ZCR of the speech signal with all its frequency components. It may be seen that the ZCR for the fricative in the region 3.45 to 3.55 sec is the same as that of the vowels in the regions 3.55 to 3.6 sec and 4.15 to 4.25 sec. These, in turn, are significantly different from the ZCR in the vowel region between 3.85 and 3.95 sec.
A further problem is that the ZCR of creakiness is very close to that of fricatives. This can be explained as follows: due to the significant damping of the vocal tract impulse response in creakiness, the speech signal becomes weaker in amplitude, and thus more susceptible to noise, than the modal voiced signal. Since the pulses exciting the vocal tract are placed far apart in creaky voicing, the vocal tract response sometimes damps down completely, and the signal then has a speech component comparable in level to the noise component. Thus, the noisy characteristics of the channel and recording device give the creaky signal a ZCR that is nearer to that of the fricative.

With these factors determining the behavior of the ZCR, a possible solution seems to be to split the signal into separate frequency regions and compare the ZCR within these spectral bands, rather than across the whole spectrum. This has a two-pronged advantage. The first is that it gives a good way of comparing the ZCR due to frication with that due to laughter – for example, by looking at frequencies above 2000 Hz in both cases, these noise factors are brought onto a common platform for comparison with other phenomena such as voicing. The second is that if only the frequencies below 2000 Hz are used, the noise riding on the weak transients of the creaky voicing (noise which typically has substantial high-frequency content because of its “white” spectral nature) is filtered out significantly, and the resulting signal is much freer of the channel and device noise. Thus, noise due to laughter will be better discernible from the noise in creaky voicing, the latter having been considerably reduced after filtering. The choice of 2000 Hz is motivated by the fact that strident fricatives show spectral energy at frequencies above 2000 Hz, and thus the ZCR for fricatives can be attributed to spectral bands above this frequency.

The temporal behavior of the ZCR obtained after filtering the signal for frequencies above 2000 Hz (called herein the high-spectrum ZCR) for a fricative, modal vowel, creaky vowel and laughter is shown in Figure 3.3. It is seen that the high-spectrum ZCR of the vowel regions (both modal and creaky) differs from that of the fricative and laughter regions. In addition, the fricative and laughter regions have ZCR values that fall in the same range, while the creaky vowel and modal vowel have ZCR values falling in the same range. An additional point to note is that the speech samples in the two figures are taken from two different speakers – the creaky samples from the two speakers are seen to fall in the same range as well. It might therefore be the case that the high-spectrum ZCR is a speaker-independent parameter that can be used with reasonable success for the task of separating voiced regions from unvoiced regions.

Figure 3.3 : Spectrogram and high-spectrum ZCR of a region of speech, showing a fricative, modal vowel, creaky vowel and laughter. Approximate boundaries of these regions are marked with vertical lines. The speech files on the left and right are from two different speakers.

In the algorithm, motivated by the above reasons, three kinds of ZCR are used – the full-spectrum ZCR (which gives a gross-level separation from some undesirable sounds and hence sometimes proves useful), the ZCR for all frequencies above 2000 Hz (the high-channels ZCR) and the ZCR for all frequencies below 2000 Hz (the low-channels ZCR).
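The band-limited ZCR idea can be sketched as below, using simple Butterworth filters around the 2000 Hz cutoff quoted in the text. The filter order, the ZCR definition (fraction of adjacent samples that change sign) and the toy signals are illustrative assumptions, not the thesis implementation.

```python
import numpy as np
from scipy.signal import butter, lfilter

def zcr(frame):
    """Zero crossing rate: fraction of adjacent sample pairs that change sign."""
    return np.mean(np.abs(np.diff(np.signbit(frame).astype(int))))

def band_zcrs(frame, fs, cutoff_hz=2000.0, order=4):
    """Full-spectrum, high-channels (> cutoff) and low-channels (< cutoff) ZCR
    of one analysis frame."""
    b_hi, a_hi = butter(order, cutoff_hz / (fs / 2), btype="high")
    b_lo, a_lo = butter(order, cutoff_hz / (fs / 2), btype="low")
    return (zcr(frame),
            zcr(lfilter(b_hi, a_hi, frame)),
            zcr(lfilter(b_lo, a_lo, frame)))

# Toy usage: a 200 Hz tone (vowel-like) versus white noise (fricative-like).
fs = 8000
t = np.arange(int(0.02 * fs)) / fs                  # one 20 msec frame
print(band_zcrs(np.sin(2 * np.pi * 200 * t), fs))   # tone: low full/low-band ZCR
print(band_zcrs(np.random.randn(t.size), fs))       # noise: much higher ZCR
```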
A critical point here is determining the threshold that can be used to separate voiced regions from unvoiced and noisy regions. As observed from inspection of various speech files from different databases, no universally applicable threshold exists, because the noise (and hence the ZCR) characteristics vary across different channels, as does the susceptibility of the speech to such noise (which is a function of the speaker's speaking strategy). Thus, a better idea is to arrive at a proper threshold for each speech signal separately. A useful strategy is to store the three ZCR characteristics for the voiced and unvoiced regions and maintain their running averages over frames. For each current frame, the average ZCR characteristics are calculated over the voiced and unvoiced frames identified up to that point, and the ZCR threshold for that frame is set to the mid-point between these representative means. This procedure is followed for each of the suspected frames, using all the voiced and unvoiced regions obtained until then (including the purely periodic and aperiodic frames). If all three ZCR characteristics are classified as belonging to the voiced region, the frame is classified as voiced; if not, it is classified as unvoiced. It was observed that this apparently conservative decision rule does not, in fact, miss a significant number of creaky regions, and actually allows some leakage of unvoiced regions into the voiced class. In essence, the false alarm rate is high and the false rejection rate is low using this strategy.
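A minimal sketch of this adaptive threshold is given below, tracking running means of a single ZCR characteristic for voiced and unvoiced frames and classifying a new frame by the mid-point rule. In the actual algorithm this test is applied to all three ZCR characteristics jointly; the class name, initial values and priming data here are illustrative assumptions.

```python
class AdaptiveZcrThreshold:
    """Running-mean ZCR threshold: a frame is called voiced if its ZCR falls on
    the voiced side of the mid-point between the voiced and unvoiced means."""

    def __init__(self):
        self.voiced = []    # ZCR values of frames already judged voiced
        self.unvoiced = []  # ZCR values of frames already judged unvoiced

    def observe(self, zcr_value, is_voiced):
        (self.voiced if is_voiced else self.unvoiced).append(zcr_value)

    def classify(self, zcr_value):
        if not self.voiced or not self.unvoiced:
            return None  # no decision until both classes have been observed
        threshold = 0.5 * (sum(self.voiced) / len(self.voiced)
                           + sum(self.unvoiced) / len(self.unvoiced))
        return zcr_value < threshold  # voiced frames have the lower ZCR

# Toy usage: prime with a few clearly voiced / unvoiced frames, then classify.
adapt = AdaptiveZcrThreshold()
for z in (0.05, 0.08, 0.06):
    adapt.observe(z, is_voiced=True)
for z in (0.45, 0.50):
    adapt.observe(z, is_voiced=False)
print(adapt.classify(0.12))  # True  -> treated as voiced
print(adapt.classify(0.40))  # False -> treated as unvoiced
```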
3.4 Spectral Slope

While the above steps allow some separation between the voiced and unvoiced frames, they are still not adequate to ensure that the false alarm rate is low. Indeed, the above characteristics have been tuned to ensure that creaky regions are not rejected in the decision process, even if that allows certain unvoiced regions through to the next stages of processing. A further issue is that in spontaneous speech, where co-articulation effects are significant on the one hand and the articulation of phonemes may be inaccurate due to speaker tendencies on the other, there can exist other regions of speech where consonants take a form that closely matches that of irregular phonation. For example, there are many instances where stops show a phonation pattern (and therefore a dip profile) very similar to that of a creaky vowel [12], and they would therefore be allowed through by the ZCR constraints. Further, in the case of some voiced stops, the voicing pattern also appears very similar to that of creaky voicing, and could easily be mistaken for irregular phonation. (In fact, for certain cases, it is not clear whether such stops should be called regular or irregular voicing. Such cases are classified as false alarms in this thesis, though there is reason to question whether they are regular phonation; see chapter 4 for more discussion.) There is thus a need to prevent such consonants from being accepted into the following stages of the algorithm. To achieve this, a useful parameter is the spectral tilt.

Vowels are produced by a source signal whose spectrum has a falling slope, exciting a vocal tract whose spectrum contains several peaks (formants), which causes the speech spectrum to have an essentially falling slope. Consonants, however, are produced by forming constrictions in the front part of the vocal tract. This either raises the resonant frequencies of the vocal tract and causes the speech spectrum to have a rising slope (as in the case of fricatives), or causes the spectrum to be flat, as in the case of stops, which have an impulse-like nature. As an illustration, Figure 3.4 shows the spectral behavior of three different kinds of phonemes – a creaky vowel, a stop and a fricative. It can be seen that in the frequency region of 0 to 2000 Hz, the creaky vowel has a falling slope, while the stop and the fricative show relatively flatter slopes. At frequencies above 2000 Hz, the vowel still shows a falling slope, though not as pronounced, and the stop shows a more or less flat slope. The fricative, however, shows a rising slope in this frequency range. We can thus exploit the spectral slope to separate voiced sounds from unvoiced sounds. Indeed, all voice qualities – breathy, creaky or modal – exhibit a falling slope, albeit with different steepness, thus characterizing voicing.

Figure 3.4 : Comparison of the spectral tilt of a creaky vowel (top left), voiced stop (top right) and fricative (bottom left). The slope changes from negative to near flat to positive.

Important considerations for using this parameter are the range of frequencies over which to calculate the spectral slope, and the threshold value that can be used to separate the voiced and unvoiced regions. One important detail is that this parameter should be independent of, or at least robust to, the channel and recording device being used. The spectral characteristics of these devices can significantly influence the spectral tilt and thus render the parameter unusable. In order to make the parameter least susceptible to such effects, the range of frequencies over which the tilt is calculated is restricted to a small range – 100 to 2000 Hz. The motivation is that the range should include voicing information, and thus the lower frequencies, and yet span a sufficiently large spectral band to ensure consistency of the parameter across that range. The cutoff of 2000 Hz ensures that the spectral characteristics of the various phonemes remain consistent in that range. The operating bandwidth of most devices also includes this range of frequencies, although the lower cutoff frequency may sometimes be slightly higher than 100 Hz. Thus, it can be expected that this parameter will be sufficiently robust to channel variations.

The spectral tilt parameter is extracted by calculating the FFT of the windowed signal, and then calculating the slope of the spectrum between 100 and 2000 Hz by fitting a line using the Minimum Mean-Square Error (MMSE) criterion. The threshold for using the spectral tilt to separate voiced and unvoiced regions was found using a linear support vector machine (LSVM). The spectral tilt parameter was calculated for 1000 frames each of voiced and unvoiced speech, from different databases including both clean and telephone speech, and this parameter was then fed to an LSVM for classification into voiced and unvoiced. The entire set was used for both training and testing, and the minimum false rejection rate for the voicing class was obtained at an effective threshold of -16 dB/octave for the spectral tilt. Thus, -16 dB/octave is used as the voiced/unvoiced threshold to eliminate frames containing consonants.
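The sketch below estimates a spectral tilt in dB/octave by a least-squares line fit to the log-magnitude spectrum between 100 and 2000 Hz, and applies the -16 dB/octave threshold quoted above. The exact fit (dB versus log2 frequency), the Hamming window and the function names are illustrative assumptions rather than the thesis implementation.

```python
import numpy as np

def spectral_tilt_db_per_octave(frame, fs, f_lo=100.0, f_hi=2000.0):
    """Least-squares slope of the log-magnitude spectrum (dB) against
    log2(frequency), restricted to the band [f_lo, f_hi]."""
    windowed = frame * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    mag_db = 20.0 * np.log10(spectrum[band] + 1e-12)
    slope, _ = np.polyfit(np.log2(freqs[band]), mag_db, 1)
    return slope

def passes_voicing_tilt(frame, fs, threshold=-16.0):
    """Apply the -16 dB/octave rule: voiced frames have a steeper falling slope."""
    return spectral_tilt_db_per_octave(frame, fs) <= threshold

# Toy usage: a harmonic signal with decaying partials versus white noise.
fs = 8000
t = np.arange(int(0.02 * fs)) / fs
harmonic = sum(0.5 ** k * np.sin(2 * np.pi * 150 * k * t) for k in range(1, 12))
noise = np.random.randn(t.size)
print(spectral_tilt_db_per_octave(harmonic, fs))  # clearly negative (falling)
print(spectral_tilt_db_per_octave(noise, fs))     # near zero (roughly flat)
print(passes_voicing_tilt(noise, fs))             # False: noise fails the test
```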
3.5 Dip Profile – Eliminating Voiced Fricatives and Breathy Vowels

Once the suspected creaky frames have been identified and any possible unvoiced frames have been eliminated, the first test for creakiness is performed. In this stage, the dip profile of the frame is analyzed for the pattern that is characteristic of creaky regions – namely, loosely formed clusters that yield a long pitch period. It was observed that by this stage, the majority of the frames detected had irregular phonation, but some breathy vowels and voiced fricatives were also being detected. This can be explained by the fact that, like regions of irregular phonation, breathy vowels and voiced fricatives have some underlying periodic energy because of voicing, and this causes them to be identified as voiced, which is not an error in itself. However, as pointed out earlier, these frames have a dip profile that looks very similar to that of creaky voicing, because of the voicing accompanied by noise. If these frames were checked directly for their dip profile properties, they too would be called creaky.

To eliminate the breathy sounds and voiced fricatives, the fact that can be exploited is that the voicing in these two cases is largely confined to the frequencies 0 – 1000 Hz, while irregular phonation is expected to show its characteristic irregularity at all frequencies. This is because, as mentioned in Chapter 1, the spectral roll-off of breathy voice is high, and thus the source signal is not very strong at the higher frequencies. Even for modal voicing, the roll-off is higher than for creaky voicing. For creaky voicing, which typically shows distinct impulse-like glottal pulses, the glottal spectrum is usually flatter and has strong spectral harmonics even at the higher frequencies. This is also apparent from the spectrogram of a creaky vowel, where the vertical striations arising from the glottal pulses show a strong presence at the higher frequencies as well. Thus, the dip profiles are obtained by summarizing across only those channels that span frequencies above 1000 Hz.

The next step is to characterize the dip-profile pattern that is unique to irregular phonation. This is done by finding all the local maxima in the dip-profile and eliminating those maxima that lie too close to each other – this ensures that the corresponding pitch estimate remains below 150 Hz, as expected for irregular phonation. The remaining local maxima are defined as cluster centers, and clusters are formed around these centers that are loose, in the sense that they are wider than would be expected for periodic clusters. For periodic clusters, a typical width of five lags on either side of the cluster center ensures that more than 50% of the channels contribute to the cluster. In the case of irregular phonation, the width of the cluster is increased from 11 to 31, by allowing 10 lags on either side. Of the channels contributing to the dip profile, some will contribute to the clusters thus formed, while others have dips lying outside the clusters. A score is computed to identify the channels that have a majority of their dips contributing to the clusters – a channel is given a score of 1 if not more than one of its dips falls outside the clusters, and 0 otherwise. The cluster centers are then recalculated, using only those channels identified above as having a score of 1.
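The channel-scoring step just described can be sketched as below, scoring each channel 1 when at most one of its dips falls outside every loose cluster and requiring at least 50% of channels to score 1. The half-width of 10 lags and the 50% requirement are the values quoted in the text; the data structures, function names and toy dip lags are illustrative assumptions.

```python
import numpy as np

def channel_scores(channel_dip_lags, cluster_centers, half_width=10):
    """Score each channel 1 if at most one of its dips lies outside every
    cluster (center +/- half_width lags), else 0.

    channel_dip_lags : list of 1-D arrays, one array of dip lags per channel.
    """
    scores = []
    for lags in channel_dip_lags:
        lags = np.asarray(lags)
        inside = np.zeros(len(lags), dtype=bool)
        for c in cluster_centers:
            inside |= np.abs(lags - c) <= half_width
        scores.append(1 if np.sum(~inside) <= 1 else 0)
    return np.array(scores)

def passes_dip_profile_test(channel_dip_lags, cluster_centers,
                            half_width=10, min_fraction=0.5):
    """Frame shows the creaky dip-profile pattern if at least min_fraction of
    the channels have a score of 1."""
    scores = channel_scores(channel_dip_lags, cluster_centers, half_width)
    return scores.mean() >= min_fraction

# Toy usage: 4 channels, loose clusters around lags 120 and 240.
dips = [np.array([118, 242]), np.array([125, 233]),
        np.array([121, 60, 244]), np.array([40, 80, 300])]
print(channel_scores(dips, cluster_centers=[120, 240]))           # [1 1 1 0]
print(passes_dip_profile_test(dips, cluster_centers=[120, 240]))  # True
```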
The motivation for this is that by eliminating the dips lying outside the clusters, and allowing only channels that contribute to clusters, any potentially aperiodic channels (arising from voiced fricatives and breathy vowels) are prevented from contributing to the process. After redefining the cluster centers using only these channels, the channel scores of all the channels are calculated again. The channel scores are then added across frequency, and if at least 50% of the channels show a score of 1, the frame is declared to have the characteristic dip-profile. This process ensures that aperiodic frames are eliminated – if a frame is aperiodic, a large number of its channels will have dips lying outside the clusters, and the required channel threshold of 50% will not be crossed. The possibility of capturing turbulence is further reduced by requiring the ratio of the number of dips within the cluster to the number of non-zero dips to exceed a threshold of 40%. For frames with irregular phonation, the aperiodicity is not random, and therefore the number of non-zero dips lying inside the cluster will be higher than in the cases of breathiness and voiced fricatives, which show random aperiodicity and hence have a large number of dips outside the cluster. Thus, at the end of this stage, a frame is classified as being either irregular or aperiodic. If the frame has been classified as having irregular phonation, it is then passed on to pitch detection.

3.6 Pitch Estimation

The next step is to identify regions of voicing and then label these regions as creaky if certain pitch constraints are satisfied. One of the most obvious approaches to detecting creaky regions in speech is to identify regions with significantly low or irregular pitch. Thus, pitch correction is a very useful step in the detection of creaky regions. The APP detector has a pitch detection algorithm that relies on the AMDF dip clusters to estimate the pitch in voiced regions. However, since the APP detector does not see creaky regions as being periodic or voiced, it does not give any pitch estimates for such regions. This is not an uncommon problem – most pitch detection algorithms fail to detect the pitch correctly in regions of irregular phonation, due to various algorithmic constraints such as in-built pitch memory, a maximum allowed change in pitch, etc. If these constraints were not in place, the pitch detection algorithm would not perform well on spontaneous speech, due to noise and distortion, and would be susceptible to pitch halving and doubling errors. However, enforcing such constraints does not allow detection of creaky regions (since absence of pitch implies absence of voicing, detection of creakiness and estimation of pitch must go hand in hand). If it were possible to detect voicing by some other means and declare a frame as voiced, it would then be possible either to estimate the pitch first and then detect creakiness in that frame, or to call that frame creaky because of the presence of voicing and absence of pitch, and then proceed to find the pitch estimate. However, such a voicing-detection scheme must be robust to channel distortions, noise from various sources, and other phenomena encountered in spontaneous speech, like coughing, laughter, etc., in order to be of practical use.
Typical voicing detection techniques – pitch estimation, the zero crossing rate, the zero crossing rate in different channels (frequency bands), energy thresholding, the rate of energy change in different channels, etc. – do not perform up to expectations on spontaneous speech. Figure 1.7 illustrated this difficulty for a speech file from the NIST 98 corpus, which consists of telephone speech. The speech signal in question contains instances of modal voicing, laughter, breathy vowels and creaky vowels. It may be seen that no combination of the above-mentioned features leads to a separation of the voiced regions from the unvoiced regions. Furthermore, a comparison of the same features for the laughter and the creaky region proves these parameters to be inadequate for separating voiced and unvoiced regions. In fact, in spite of an extensive literature survey by the author, no methods were found that could efficiently separate voiced regions from unvoiced regions in such cases.

It thus becomes clear that pitch estimation is a very difficult problem and cannot be solved easily. Indeed, the nature of the problem is itself rather tricky – what is needed is a pitch detection algorithm for creakiness detection, but such pitch detection can be done only when voicing (i.e., creakiness) has been detected. To solve this problem, the approach taken in this work is to use the channels of the auditory filter-bank to vote on whether a particular frame is voiced or not. This is done by first considering the dip profile of the frame, forming loose clusters of dips, and identifying how many channels contribute to the cluster. If the number of channels exceeds a certain threshold (40% of the total channels contributing to the dip cluster), an initial guess is made that the frame is creaky. Then, conditioned on the satisfaction of some further acoustic properties detailed elsewhere in this chapter, the maximum of the cluster with the most dip contributions is taken to be the dip corresponding to the pitch period, and the pitch frequency is calculated accordingly. Thus, an estimate of the pitch is made in suspected creaky frames.

Once the pitch estimate is made, the confidence of creakiness is recalculated using this estimate. In this second pass, slightly narrower clusters are formed near this pitch period estimate, and the required threshold for the contributing channels is increased to 60%. If the frame exhibits a dip cluster confidence that exceeds this threshold, the frame is judged to be creaky and the pitch estimate is retained. If the dip cluster fails to satisfy the required confidence threshold, the pitch estimate is discarded and the frame is adjudged to be unvoiced. Thus, creakiness is used to first estimate the pitch, and this pitch is then used to recalculate the confidence of creakiness of the frame. While creakiness is a deciding factor in pitch estimation, it is not the only criterion; a significant number of other well-motivated acoustic characteristics are also used to confirm creakiness before the pitch is estimated. Indeed, pitch estimation is a consequence of those steps and is merely used to reconfirm the confidence of calling the current frame creaky.
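A minimal sketch of this two-pass idea is given below: a loose first pass (40% of channels) proposes a pitch-period lag from the maximum of the summary dip profile, and a narrower second pass (60% of channels) around that lag either confirms the creaky decision or discards the pitch estimate. The cluster-membership rule used here (all of a channel's dips inside the cluster) is a simplification, and all names and the dip-profile representation are illustrative assumptions.

```python
import numpy as np

def fraction_in_cluster(channel_dip_lags, center, half_width):
    """Fraction of channels whose dips all lie within center +/- half_width."""
    ok = [np.all(np.abs(np.asarray(lags) - center) <= half_width)
          for lags in channel_dip_lags]
    return np.mean(ok)

def two_pass_creaky_decision(channel_dip_lags, summary_dip_profile, fs,
                             loose_half_width=10, tight_half_width=8,
                             first_pass_frac=0.4, second_pass_frac=0.6):
    """First pass: loose clustering around the maximum of the summary dip
    profile proposes a pitch-period lag. Second pass: a tighter cluster around
    that lag must capture >= 60% of the channels for a creaky decision."""
    lag_candidate = int(np.argmax(summary_dip_profile))
    if fraction_in_cluster(channel_dip_lags, lag_candidate,
                           loose_half_width) < first_pass_frac:
        return False, None
    if fraction_in_cluster(channel_dip_lags, lag_candidate,
                           tight_half_width) < second_pass_frac:
        return False, None            # confirmation failed: discard pitch estimate
    return True, fs / lag_candidate   # creaky, with pitch estimate in Hz

# Toy usage at 8 kHz: most channels have a dip near lag 100 (80 Hz pitch).
profile = np.zeros(200)
profile[100] = 5.0
dips = [np.array([97]), np.array([104]), np.array([99]), np.array([160])]
print(two_pass_creaky_decision(dips, profile, fs=8000))  # (True, 80.0)
```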
Figure 3.5 shows the pitch estimates obtained for one such speech file.

Figure 3.5 : Pitch tracking accuracy of three algorithms in creaky regions (boundaries marked by vertical red lines). (i) waveform, (ii) spectrogram, (iii) pitch track from the ESPS software, (iv) pitch track from the modified APP detector, (v) pitch track from the original version of the APP detector.

It may be seen that the ESPS pitch tracker, as well as the original version of the APP detector, fails to get the correct estimate of the pitch in the creaky regions, while the modified APP detector gives the correct estimate. The ESPS pitch tracker fails in most creaky frames, and in one instance mistakes modal voice for creaky voice, yielding a low pitch. A comparison of the number of creaky frames whose pitch has been correctly identified by the original APP detector with that identified by the modified APP detector shows that the pitch detection accuracy has increased. Especially in the last creaky region, the original APP detector almost completely loses the pitch track, but the modified APP detector is able to find a pitch track in some of the frames. In addition, the number of frames where pitch values are given when there actually is no voicing is also very small. Further, it may be seen in the second creaky region of Figure 3.5 (the region between 1.2 and 1.3 sec) that the instantaneous changes in the pitch are tracked very accurately.

Figure 3.6 : Pitch tracking of the new APP detector algorithm in creaky regions (identified by vertical red lines). (i) waveform, (ii) spectrogram, (iii) pitch track from the ESPS software, (iv) pitch track from the modified APP detector. The instantaneous pitch tracking accuracy of the algorithm may be noted.

Figure 3.6 also shows that the pitch tracker is able to track the irregular changes in pitch instantaneously, while at the same time maintaining a consistent pitch track with no halving or doubling errors in other regions. The algorithm does not require a pitch memory, as most pitch detection algorithms do – this is what allows pitch tracking even in regions of irregular pitch. It is also noteworthy that pitch estimates in the voiced regions, and in unvoiced regions where there is no reason to suspect creakiness, are left unchanged. Thus, by modifying the pitch estimation algorithm, the pitch estimates can only be expected to improve and not be harmed in any way (in the voiced regions). This improvement is demonstrated in the figures above. However, a consequent limitation is that in regions where the APP detector makes the same pitch detection errors as other pitch trackers, no corrective measures can be taken, since the pitch confidence there is high enough that the pitch correction algorithm is not applied. An illustrative example is seen in Figure 3.7. Between 0.94 sec and 1.0 sec, it may be seen that the pitch period is 0.004 sec, and therefore the pitch estimate should be around 250 Hz. However, both the ESPS pitch tracker and the modified APP detector show a pitch track around 110 Hz. This kind of pitch halving error is probably due to modulation of the envelope of the speech signal in that region. Because the pitch confidence is high in that region (the pitch estimate lies at twice the actual pitch period, and since there are dips there as well, their strength may override that of the first cluster of dips), the modified version does not try to correct the pitch there.
The next section describes the final decision process, which combines the pitch estimate with the dip-profile decisions to confirm creakiness.

Figure 3.7 : Failure to correct the pitch in certain regions, due to high pitch confidence. (i) waveform, (ii) spectrogram, (iii) pitch track from the ESPS software, (iv) pitch track from the modified APP detector. The pitch period of 0.004 sec (highlighted region in yellow) may be compared to the estimated pitch frequency of 109.59 Hz, which is a pitch halving error. This is because of the high pitch confidence at twice the pitch period.

3.7 Final Decision Process : Combining Current Pitch Information and Dip Profile Decisions

When a frame has been determined to be creaky and the pitch has been estimated, the frame is rechecked for creakiness. This is done by first checking whether the pitch estimate obtained is below the “normal” or modal pitch estimate for that speaker. For this, a running average of the pitch values in modal voicing is maintained over all the frames processed so far: the pitch values of all regions identified as voiced but not creaky are stored in memory, and their mean is taken as a reference. If the current pitch estimate is less than some percentage (empirically set to 75%) of this “normal pitch”, then the current frame could possibly be creaky, since such low pitch values are characteristic of creakiness in particular and of irregular phonation in general. Further, in some cases of irregular phonation the pitch may not fall low, but the pitch track may be irregular and show a large difference (as high as 30 Hz) from its immediately previous value. Such changes are also tracked, by maintaining a running memory of the past 10 pitch values and detecting frames where the pitch has deviated significantly from the stored values. If the current pitch estimate exhibits such behavior, the frame is likewise classified as possibly creaky. Only if at least one of these two pitch conditions is satisfied is the next creakiness check made.

The dip profile is then checked again to verify that it shows the characteristic dip-profile behavior with the current pitch estimate. In this second pass, tighter clusters are formed (8 lags on either side of the pitch period estimate) and the channel scores for this cluster are calculated. If the total number of channels contributing to the cluster now exceeds 60% of the total number of channels, the frame is finally adjudged to be creaky, and the pitch estimate obtained above is declared to be the pitch for that frame. Once the frame is called creaky, the energies of all the channels that have a score of 1, i.e., which contribute to the creakiness profile, are added together to give what is called the creakiness energy of the frame. This is the measure used to numerically characterize creakiness in that frame.
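The two pitch conditions described above can be sketched as follows. The 75% rule against the running modal pitch and the 30 Hz jump rule with a 10-value memory are the thresholds quoted in the text; for brevity the sketch compares the jump only against the most recent stored value, and the class and method names are illustrative assumptions.

```python
from collections import deque

class PitchConditionChecker:
    """Decides whether a candidate pitch estimate warrants the second
    creakiness pass: either well below the running modal pitch, or a large
    jump relative to the recent pitch history."""

    def __init__(self, low_pitch_frac=0.75, jump_hz=30.0, history_len=10):
        self.low_pitch_frac = low_pitch_frac
        self.jump_hz = jump_hz
        self.modal_pitches = []                   # voiced, non-creaky frames
        self.recent = deque(maxlen=history_len)   # memory of recent pitch values

    def observe_modal(self, pitch_hz):
        self.modal_pitches.append(pitch_hz)
        self.recent.append(pitch_hz)

    def warrants_second_pass(self, pitch_hz):
        below_modal = False
        if self.modal_pitches:
            modal_mean = sum(self.modal_pitches) / len(self.modal_pitches)
            below_modal = pitch_hz < self.low_pitch_frac * modal_mean
        jumped = bool(self.recent) and abs(pitch_hz - self.recent[-1]) > self.jump_hz
        return below_modal or jumped

# Toy usage: modal pitch around 200 Hz, then a 90 Hz candidate frame.
checker = PitchConditionChecker()
for p in (198.0, 205.0, 201.0, 199.0):
    checker.observe_modal(p)
print(checker.warrants_second_pass(90.0))   # True: well below 75% of ~200 Hz
print(checker.warrants_second_pass(195.0))  # False: neither condition holds
```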
Thus, the creakiness detection algorithm is a combination of various knowledge-based acoustic features built upon the Aperiodicity, Periodicity and Pitch detector, rendering the APP system capable of distinguishing aperiodicity due to turbulence from that due to irregular phonation. The next chapter discusses the performance of the irregular phonation detector, some of its failure modes, and its application to speaker ID experiments.

Chapter 4

EXPERIMENTS & RESULTS

The algorithm for detection of irregular phonation has been described in the preceding chapters. This chapter details the experiments that were performed to validate the performance of the algorithm, and an application of the creakiness parameter to speaker ID experiments. The first set of experiments verifies the pitch detection accuracy of the modified APP detector; this is a first-step validation of creakiness detection. The next set of experiments deals with finding the detection rate and false alarm rate of the creakiness detection algorithm. The final set of experiments demonstrates the improvement in recognition rates in speaker ID experiments upon inclusion of the creakiness parameter.

4.1 Database Used For The Experiments

To find the pitch estimation accuracy, a reference can be obtained from standard pitch detection software. In this thesis, the reference pitch has been obtained from the ESPS Wavesurfer software, using the autocorrelation method. To find the creakiness detection accuracy, however, a reference or ground truth is needed against which performance can be evaluated, and there is no available standard database that describes voice quality and can be used to test the performance of the algorithm. Thus, ground truth needs to be marked by hand in one of the standard speech databases. In this thesis, the TIMIT database has been used for the first two experiments, as hand-marked transcriptions of irregular phonation in those speech files were available from Dr. Slifka at MIT. Since there is no clearly defined reference that marks irregular phonation, and since the boundaries and occurrences of such phonation are difficult to define, we have taken as the total number of instances of irregular phonation all those locations marked in the reference, together with those identified by our algorithm but missing from the reference (confirmed by visual inspection by the author). Specifically, the “test” subset of the TIMIT database was used. The number of files processed was 895, covering 65 male and 45 female speakers from the eight dialect regions (dr1 through dr8). The total number of instances of irregular phonation was 1,400.

Since the long-term goal of this work is to use the algorithm in real-world speech applications, the algorithm has also been tested on spontaneous speech from telephone data. The NIST 98 database was used for the speaker ID experiments. The first two experiments were also performed on the NIST 98 database, to ensure that the algorithm works well on spontaneous speech. The “test” subset was used for the first two experiments, and 100 files of the NIST 98 database were hand-marked for irregular phonation by the author, yielding 200 instances of irregular phonation, 100 for each gender. For both of these tasks, the test session of the NIST 98 data was used, and the first 5 sec of each file were extracted and treated as one speech file. For the task of speaker identification, a subset of the database containing 250 male speakers and 250 female speakers was chosen. The training utterances are taken from the train/s1a/ directory and the testing utterances are taken from the test/30/ directory of the database. There is no handset variation between the training and the test utterances. The length of each training utterance is approximately 1 minute, and the testing utterances are about 30 seconds in duration. An energy threshold was used to remove the silence portions (which sometimes contain low-amplitude background noise) from the speech.
This resulted in training utterances of about 30 to 40 seconds and testing utterances of about 10 to 20 seconds.

A short comparison of the two databases is in order here. The TIMIT database is clean speech sampled at 16 kHz, and each file contains speech from a single speaker. The speech is controlled, in that the phonemes are uttered clearly and, apart from silence, there is little other non-speech material in this database. On the other hand, the NIST 98 database is spontaneous speech sampled at 8 kHz. The speech is subject to telephone channel distortion that corrupts the amplitudes of the lower harmonics (up to around 300 Hz) and contains background noise. Further, effects like laughter, coughing, etc. are also seen in some files. Thus, the latter presents a tougher challenge than the former.

4.2 Experiment 1 – Accuracy of Pitch Estimation

This experiment was performed on both databases. As mentioned, the reference pitch is obtained from the ESPS Wavesurfer software using the autocorrelation method. The frame rate of both pitch tracks is the same, namely 2.5 msec. The window size for Wavesurfer was set to the default (optimal) value. Pitch detection and all experiments with the APP detector are performed with a window size of 20 msec. As mentioned in section 2.3, the APP detector makes some errors and detects pitch in unvoiced regions (even before the incorporation of the creakiness detection algorithm). Thus, the voiced and unvoiced regions have been compared separately.

Comparison in the Voiced Regions

The mode of comparison is as follows: in the case of voiced regions, only those regions where the reference pitch detector gives non-zero pitch estimates have been considered. The voiced regions include regions of both regular and irregular phonation. If, for a frame, the pitch values of the APP detector and the reference differ by more than 15 Hz or 10% of the current pitch, whichever is smaller, then the pitch track of the APP detector for that frame is declared to be in error. It may be noted that the reference pitch detection is not always accurate; it was shown in Figure 3.5 that the reference pitch tracker fails to track the correct pitch in regions of creakiness. However, such regions have not been separated out in this task, since there is no automatic way of determining whether the reference pitch track is indeed correct, and it is very difficult to verify this visually for each of the errors. Thus, the numbers given below give a representative idea of the general pitch tracking performance. The following table shows the results for this test:

Database   # of frames   # correct before modification   # correct after modification
NIST       20,000        19,080 (95.4%)                  19,660 (98.3%)
TIMIT      8,000         7,582 (94.8%)                   7,795 (97.4%)

Table 4.1 : Pitch estimation accuracy of the modified APP detector, compared with the earlier version

Approximately 50 sec of voiced speech, from both male and female speakers, was taken from the NIST database. In the case of the TIMIT database, about 20 sec of voiced speech was taken, from speakers of both genders. It may be seen that the pitch detection accuracy has increased by about 3% for both databases. Recall that the reference tracker sometimes fails to track the correct pitch in creaky regions. This means that some mismatches can occur where the APP detector tracks the pitch correctly but the reference is wrong; it could also mean that the APP detector pitch track is not accurate. Assuming that the reference pitch track is always accurate and correct, the performance of the modified APP detector has increased by about 3% for both databases. This in turn implies that there is at least a 3% improvement (and possibly more) in the actual pitch tracking accuracy.
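The frame-level error criterion used above can be written compactly as below; the 15 Hz and 10% values are those stated in the text, the relative tolerance is taken with respect to the reference pitch, and the function name is an illustrative assumption.

```python
def pitch_frame_in_error(estimated_hz, reference_hz,
                         abs_tol_hz=15.0, rel_tol=0.10):
    """A frame counts as a pitch error if the estimate deviates from the
    reference by more than 15 Hz or 10% of the pitch, whichever is smaller."""
    tolerance = min(abs_tol_hz, rel_tol * reference_hz)
    return abs(estimated_hz - reference_hz) > tolerance

# Toy usage: at a 120 Hz reference, 10% (12 Hz) is the tighter tolerance.
print(pitch_frame_in_error(134.0, 120.0))  # True: 14 Hz off, tolerance is 12 Hz
print(pitch_frame_in_error(125.0, 120.0))  # False: within tolerance
```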
Comparison in the Unvoiced Regions

The mode of comparison is as follows: in the case of unvoiced regions, it is required to find out how many times the modified APP detector has detected a region as voiced and given a pitch track, beyond what the earlier version of the APP detector already did. Using the reference pitch tracker to identify regions of unvoiced speech, about 50 sec of unvoiced speech in NIST and 20 sec in TIMIT were used for testing. The following table shows the results:

Database   # of frames   # unvoiced frames given a pitch value, before modification   after modification
NIST       20,000        1,360 (6.8%)                                                 1,440 (7.2%)
TIMIT      8,000         256 (3.2%)                                                   272 (3.4%)

Table 4.2 : Insertion of voiced decisions for unvoiced speech

It is seen that the number of insertion errors has increased by a very small amount (< 1%) over a large number of frames. This shows that the voiced/unvoiced decision made by the modified APP detector is almost as reliable as that of the original, and that the number of frames judged as voiced and given pitch estimates does not increase significantly. Once again, it is possible that a region marked as unvoiced by the reference pitch tracker is in fact creaky, but has been marked unvoiced because of the tracker's failure to give a pitch reading there. If that were the case for any frame, then the region is actually voiced and the insertion error is lower than reported above. Thus, it can be said that the maximum possible increase in the insertion error rate is about 0.4%. These experiments indicate that the pitch tracking algorithm of the APP detector has been improved by the modifications described in chapters 2 and 3, with few adverse effects. The next set of experiments details the creakiness detection accuracy.

4.3 Experiment 2 – Detection of Irregular Phonation

Figure 4.1 shows a sample run of the creakiness detector on a speech file. The areas identified as being creaky are marked in black on the spectrogram. It may be seen that the creaky areas have all been successfully detected.

Figure 4.1 : Creakiness detection results. (top pane) Spectrogram showing the regions of creakiness in the original signal; (bottom pane) Same spectrogram as above, with the creaky regions identified by the APP detector overlaid in black.

The creakiness detection performance is evaluated using two metrics: the detection accuracy and the false alarm rate. Each of these is presented below.

Detection Accuracy of Irregular Phonation

Two kinds of measures are used to quantify the performance of the creakiness detector: the percentage of instances of creakiness detected, and the relative number of frames that have been correctly identified. These are defined below:

Instance detection accuracy = (Total number of creakiness instances detected / Total number of hand-marked instances) × 100%

Frame detection accuracy = (Total number of frames of creakiness detected / Number of frames in the hand-marked instances) × 100%
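These two measures can be computed as below; the helper is only an illustrative restatement of the definitions, applied to the TIMIT totals reported in the tables that follow.

```python
def detection_accuracies(instances_detected, instances_marked,
                         frames_detected, frames_marked):
    """Instance and frame detection accuracies, as percentages."""
    instance_acc = 100.0 * instances_detected / instances_marked
    frame_acc = 100.0 * frames_detected / frames_marked
    return instance_acc, frame_acc

# Toy usage with the TIMIT totals reported below
# (1,285 of 1,400 instances; 23,564 of 26,408 frames).
print(detection_accuracies(1285, 1400, 23564, 26408))  # approx (91.8, 89.2)
```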
The following tables give a statistical summary of the performance of the creakiness detection algorithm for the TIMIT database:

               Total # of instances   # Identified   Percentage Identified
Male + Female  1,400                  1,285          91.8%
Female         584                    543            93.0%
Male           816                    742            90.9%

Table 4.3a : Accuracy of detection of creakiness instances in the TIMIT database

               Total # of frames   # Identified   Percentage Identified
Male + Female  26,408              23,564         89.2%
Female         13,454              12,243         91.0%
Male           12,954              11,321         87.4%

Table 4.3b : Accuracy of detection of creakiness frames in the TIMIT database

The following tables give a statistical summary of the performance of the creakiness detection algorithm for the NIST 98 database:

               Total # of instances   # Identified   Percentage Identified
Male + Female  200                    183            91.5%
Female         100                    94             94.0%
Male           100                    89             89.0%

Table 4.4a : Accuracy of detection of creakiness instances in the NIST 98 database

               Total # of frames   # Identified   Percentage Identified
Male + Female  4,291               3,746          87.3%
Female         2,337               2,106          90.1%
Male           1,954               1,649          84.4%

Table 4.4b : Accuracy of detection of creakiness frames in the NIST 98 database

The creakiness instance detection accuracy is seen to be slightly above 90% for each of the two datasets, and is about the same for both genders. Both female and male samples are handled equally well by the algorithm, which supports the view that irregular phonation possesses acoustic features that do not depend on gender. Further, the algorithm performs well even on the NIST 98 database, in spite of the adverse conditions. Errors were more frequent for male speakers than for female speakers, and this is attributed to the low pitch of male speakers, which can cause the dips to scatter away from clusters and thus make the sound appear similar to irregular phonation. Also, the inherently low pitch of male speakers causes some creaky regions to be missed, because creakiness detection is conditioned on the pitch falling below a certain threshold. In terms of the number of frames identified, the performance goes down slightly, since not all of the frames marked in the reference may be detected by the algorithm. Considering the possibility that some of the instances identified as creaky may be marked over a larger number of frames than in the reference, the actual frame detection accuracy may be even lower than the numbers reported above.

False Alarm Rate

The false alarm rate can also be measured in terms of two measures: the percentage of instances inserted, and the percentage of frames inserted. Each of these is defined below:

False instances rate = (Total number of instances of false alarms / Total number of actual creaky instances) × 100%

False frames rate = (Total number of frames of false alarms / (Total number of frames − Total number of creaky frames)) × 100%

In reality, it is the second term that defines the false alarm rate correctly. However, for this set of experiments, because the total number of frames is very large compared to the number of false alarm frames, this second measure becomes too small a number to be of significance.
Thus, the second measure is redefined in the following way:

False Frames Rate = (Total number of frames of false alarms / Total number of creaky frames) × 100%

The first measure and the redefined second measure have been used to describe the false alarms. The false instances rate has been found to be 12.8% for the NIST 98 database, and 17.4% for the TIMIT database. Surprisingly, there are more false alarms for the cleaner database (TIMIT) than for the one with noise and distortions.

Of all these false triggers, 35% were due to the fricative /sh/. About 40% of the false detections were due to stops being called irregular, while the remaining 25% were due to stops wherein the vowel preceding the stop was identified as creaky. Though detections in the latter category are currently counted as false detections, studies have shown [11, 12] that in both American English and other languages there exist cases where both voiced and unvoiced stops may be accompanied by irregular phonation. Further, it is possible that the ends of vowels preceding such stops may also exhibit irregular phonation. Thus, the algorithm may actually be capturing such instances of irregular phonation as well, in which case the false triggering rate would actually be considerably lower. In addition, it has also been observed that the algorithm identifies what we suspect are some glottal squeaks (4 instances). However, due to the lack of a standard reference, it is not possible to confirm these speculations at this juncture, and this count has been added to the false triggers.

Typically, each false alarm instance is about 3 to 4 frames long, whereas actual creaky regions are typically more than 10 frames long. However, a significant percentage (about 10%) of creaky regions are also short, of the same duration as these false alarm instances, and thus it is not possible to set a duration threshold for eliminating false alarms. The false instances rate of 12.8% translates to a false frame rate of 3.4% for the NIST 98 database, and the TIMIT false instances rate of 17.4% translates to a false frame rate of 5.6% (which confirms that false alarm instances are about one fourth of the duration of a creaky instance).

4.4 Speaker Identification using Acoustic Parameters

One of the potential applications of this parameter is in speaker identification tasks. The creakiness parameter characterizes source information of the speaker, and should hence be able to help in speaker identification. To verify whether this parameter can help in speaker identification and, if so, how it affects performance, it has been used in combination with the seven other acoustic parameters described in section 1.6. These parameters are the four formants, the amounts of periodic and aperiodic energy, and the spectral tilt. The speaker ID performance for the seven acoustic parameters was seen to be comparable with that of the standard MFCCs in [6]. The creakiness parameter is added as an eighth parameter to this set, and the speaker ID performance of the seven APs is compared with that of the eight APs.

The speaker models were constructed using Gaussian Mixture Models (GMMs) and were trained using maximum-likelihood parameter estimation [5]. The MIT-LL GMM system was used for constructing the speaker models. Various model orders were tested, and it was empirically determined that a 32-mixture GMM gave the best performance.
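As a rough illustration of this setup (and not the MIT-LL system itself), the sketch below trains one GMM per speaker by maximum likelihood and scores a test utterance against every model, as described in the identification rule that follows. It assumes scikit-learn's GaussianMixture as a stand-in, with diagonal covariances as an illustrative choice, and random arrays standing in for the per-frame acoustic parameter vectors.

# Minimal sketch of GMM-based speaker identification (illustrative only).
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(train_features, n_mixtures=32):
    """train_features: dict speaker_id -> (n_frames, n_params) array of APs."""
    models = {}
    for spk, feats in train_features.items():
        gmm = GaussianMixture(n_components=n_mixtures, covariance_type="diag",
                              max_iter=200, random_state=0)
        models[spk] = gmm.fit(feats)          # maximum-likelihood (EM) training
    return models

def identify(models, test_features):
    """Return the speaker whose GMM scores the test utterance highest."""
    # score() is the mean per-frame log-likelihood; its ranking matches that of
    # the total log-likelihood of the utterance.
    scores = {spk: gmm.score(test_features) for spk, gmm in models.items()}
    return max(scores, key=scores.get)

# Toy usage with random data standing in for the eight acoustic parameters:
rng = np.random.default_rng(0)
train = {"spk_a": rng.normal(0.0, 1.0, (500, 8)),
         "spk_b": rng.normal(1.5, 1.0, (500, 8))}
models = train_speaker_models(train, n_mixtures=4)   # small order for toy data
test = rng.normal(1.5, 1.0, (200, 8))
print(identify(models, test))                        # expected: "spk_b"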
Each test utterance was identified with the speaker whose model yields the highest likelihood for that utterance. The accuracy of the system was computed from the identification errors made by the system. To obtain accuracies for different population sizes, the 250 speakers of each gender were divided into groups, where the number of speakers in each group is the population size. The accuracy for a particular population size is the average of the accuracies over all the groups. The following table gives the error rates of the seven APs versus the eight APs when used for speaker identification:

Pop. size   Gender (# of test utt.)   Seven APs   Eight APs
50          Female (1379)             29.0        28.0
            Male (1308)               27.8        27.6
            Average                   28.4        27.8
100         Female (1104)             33.1        31.7
            Male (1093)               29.1        29.1
            Average                   31.1        30.4
125         Female (1379)             34.6        32.6
            Male (1308)               30.0        30.0
            Average                   32.3        31.3
250         Female (1379)             38.2        36.7
            Male (1308)               34.3        33.9
            Average                   36.3        35.3

Table 4.5 : Error rates for speaker ID without and with the creakiness parameter

The results in this table are plotted in Figure 4.2.

Figure 4.2 : Improvement in speaker ID performance upon introduction of creakiness as the eighth parameter: the error rate has gone down or remained the same in all cases, for all population sizes and both genders.

It is seen that, for the various populations, the speaker identification error rate goes down slightly upon introduction of the creakiness parameter. The improvement is typically only about 1% for female speakers and less than 1% for male speakers; this modest gain is because creakiness occurs very sparsely and hence offers very little training and testing data. Indeed, the creakiness parameter consists of zeros in most regions and hence, when modeled using a GMM, should show a prominent Gaussian at zero for the models of all speakers, rendering the parameter not significantly useful for speaker ID. However, if there existed a supervised approach that could exploit the creakiness parameter more appropriately, it would hold some promise as a good acoustic parameter. This is suggested by the fact that speaker ID performance improves more for female speakers than for male speakers, even though the creakiness detection accuracy does not differ as significantly between the genders. This might be explained by the fact that female speakers, having a higher pitch, are more distinctive whenever they exhibit creakiness. This information is probably captured by the creakiness parameter, which thus helps to characterize some of the female speakers effectively. Therefore, using a supervised approach to exploit the creakiness information could be one possible direction of future research in applications of this parameter.

4.5 Error Analysis and Discussion

Missed Detections

The creakiness detection algorithm does not give perfect detection accuracy. Almost 10% of creaky instances are not identified, and therefore there is still scope for improvement of the algorithm. The failure of the irregular phonation detector in such cases could be explained by several possible reasons. The first is when the irregular phonation lies at the very beginning of the file: in this case, the algorithm fails to detect creakiness because the AMDF can be computed only after a few initial frames. Also, even when creakiness is not at the very beginning but occurs within the first few seconds, it might still be missed on some occasions.
This is because the voiced/unvoiced thresholds, which are determined adaptively using the values stored in memory, are not yet adjusted to their proper values. If creakiness occurs before adequate voiced and unvoiced regions have been seen, representative values for both these regions are not yet available, and the calculated threshold is not accurate enough to make a correct decision. Thus, some false insertions or false rejections could occur. In one case, if creakiness occurs after voiced regions, with no unvoiced regions until that point, then the unvoiced threshold is biased towards the voiced threshold, and the creaky region is called voiced. This might also sometimes result in some weak fricatives or voiced fricatives being called voiced; they need not be called creaky at the subsequent stages, but they might be called voiced at this stage. In the opposite case, if creakiness follows an unvoiced region, the voiced threshold is not set appropriately. This causes the creaky region to be called unvoiced and results in a missed detection.

The next possible reason for missed detection is when the pitch falls so low that the analysis window cannot capture even one full cycle; when that happens, the AMDF cannot capture the characteristic dip profile. A solution for this is to adaptively change the analysis window size. A fourth reason for false rejection is instances of creakiness that are not significantly different from modal voicing, or that have most of their voicing energy in the lower frequencies. Because only the higher-frequency channels are used to cluster the dips, such creaky regions may be rejected, since they do not have much voicing energy in the higher regions.

False Alarms

False alarms are mainly due to stops, and about one third of them are due to fricatives. Typically, when these kinds of consonants are adjacent to vowel regions, their characteristics are influenced by the neighboring phonemes due to co-articulation. Thus, a possible reason for these false alarms is that the values of spectral tilt, ZCR, etc. for these regions are near what would be expected for periodic regions, and this is what allows such regions past the voiced/unvoiced decision stage. The effect of voicing in the co-articulated phonemes could persist in such frames, possibly giving them a dip profile similar to that of a creaky frame. Another possible reason for false alarms is, as mentioned above, that the voiced/unvoiced detector has not had enough samples to form representative thresholds for the voiced and unvoiced regions, so that some unvoiced sounds are called voiced.

While the reasons for missed detections have been elaborated upon, it is hard to find appropriate solutions for some of these causes. For example, voiced/unvoiced decisions are always hard for spontaneous speech, and much more so over distorting channels. Even current state-of-the-art systems make errors, and voicing detection is never 100% accurate. In such cases, the best that can be done is to seek a good trade-off between false alarms and missed detections. In this work, though an ROC curve has not been used explicitly to determine the optimal operating point, the parameters of the system were varied around their current values, and the missed detection and false alarm performance did not vary significantly.
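Although an ROC analysis was not carried out in this work, the trade-off described above could be made explicit by sweeping a decision threshold and tabulating the two error rates. The sketch below is purely illustrative and not part of the thesis system: the per-frame scores, reference labels, and threshold values are all hypothetical.

# Illustrative sketch only: sweep a decision threshold over hypothetical
# per-frame creakiness scores and tabulate missed-detection and false-alarm rates.
import numpy as np

def miss_fa_rates(scores, labels, threshold):
    """scores: per-frame scores; labels: 1 = creaky, 0 = not creaky (reference)."""
    decisions = scores >= threshold
    creaky = labels == 1
    non_creaky = ~creaky
    miss_rate = np.mean(~decisions[creaky]) if creaky.any() else 0.0
    fa_rate = np.mean(decisions[non_creaky]) if non_creaky.any() else 0.0
    return miss_rate, fa_rate

# Hypothetical data: creaky frames tend to score higher than non-creaky frames.
rng = np.random.default_rng(1)
labels = np.concatenate([np.ones(300, dtype=int), np.zeros(2700, dtype=int)])
scores = np.concatenate([rng.normal(0.7, 0.15, 300), rng.normal(0.3, 0.15, 2700)])

for thr in (0.4, 0.5, 0.6):
    miss, fa = miss_fa_rates(scores, labels, thr)
    print(f"threshold={thr:.1f}  miss={miss:.1%}  false alarm={fa:.1%}")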
Another consideration is the performance of the algorithm when multiple speakers are present. It may be recalled that the creakiness detection algorithm relies on a memory track of the pitch in voiced regions in order to decide whether a given region is creaky. When more than one speaker is present in a speech file and their pitches differ significantly, the average pitch threshold used for creakiness detection may become unsuitable. For example, if a conversation has a male speaker followed by a female speaker, then the pitch memory would contain a significant number of low (around 100 Hz) and high (around 200 Hz) pitch values. In such a case, the pitch threshold would be lower than it would be if the speech file had a single female speaker. When the female speaker then goes creaky, the region might not be identified by the algorithm, because the threshold has been pulled down by the low pitch values of the male voice. Similarly, a female speaker preceding a male speaker could cause false alarms during the male speaker's turn for the same reason. Although this problem is not handled in this work, it is important and will be addressed in a future version of the system.

Chapter 5

SUMMARY AND FUTURE DIRECTIONS

This thesis has focused on the development of an algorithm that automatically detects and numerically parameterizes regions of irregular phonation (creakiness) in speech. This work is significantly different from other research efforts in this direction in that it processes all regions of speech, not just the voiced regions. Further, it is meant to handle spontaneous speech with effects such as channel distortions and non-speech sounds like laughter. Thus, the scope and applications of this work are more far-reaching than those of other parallel efforts.

The algorithm is an extension of the Aperiodicity, Periodicity and Pitch (APP) detector. The APP detector now incorporates the irregular phonation detector, and gives periodicity, aperiodicity and creakiness profiles that identify the three corresponding kinds of regions in speech. The pitch estimation algorithm in the APP detector has also been improved to correctly identify the pitch frequency in regions of creakiness. The creakiness detection performance is high and matches the best performance of other algorithms, and the pitch detection accuracy has also shown improvement. The number of false insertions, both in creakiness detection and in pitch estimation, is low and, counted frame-wise, is less than 5%. These encouraging results show that the creakiness detection algorithm can be used in speech technology on real speech data.

5.1 Future Research Directions

Improving performance of the algorithm

The creakiness detection algorithm shows a missed detection rate of about 10%, which implies that there is still work to be done to improve the detection rate. However, the problem happens to be very difficult, as creakiness and irregular phonation are not discrete steps away from modal phonation, but are actually part of a continuous spectrum with breathy phonation at the other end. As such, it is difficult to define the occurrences of irregular phonation and the associated acoustic properties in various cases. Further, because of effects like co-articulation and channel distortions, some of the acoustic features that characterize creakiness do not manifest themselves in some instances.
In order to improve the detection accuracy, it is thus necessary to investigate further acoustic cues that would help identify creakiness, and to incorporate them into the APP detector. Further, some of the signal processing strategies of the APP detector should also be modified in order to catch instances that have been missed. This should, however, be done in a way that does not degrade the current performance of the APP detector, which is not an easy task considering that pitch values from 60 Hz to 350 Hz should all be detected and recognized. Thus, an adaptive signal processing scheme should be incorporated into the APP detector, one that toggles between various sets of signal processing parameters (such as window size and frame rate) to find the optimal set for a particular region of speech. In addition, the voiced/unvoiced detector can be improved by incorporating additional conditions, to reduce the false insertion rate. Finally, the degradation in performance that is expected when multiple speakers are present in the speech file should also be handled. One possible way to do this is to rely on creakiness (or other methods) to identify turn-taking in the speech signal, and then set a separate threshold for each speaker to identify creakiness in the regions that follow.

Application to speaker ID

In terms of applications, the creakiness parameter has been shown to be of use for speaker identification. The improvement in performance may not be marked, but it shows that the parameter helps speaker recognition by modeling some of the speakers' inherent voice qualities and speaking strategies. However, the creakiness parameter is zero for most frames, since creakiness is not a very common occurrence. When modeled in usual statistical frameworks like the Gaussian Mixture Model (GMM), this gives a large Gaussian centered at zero, which might override the effects of other peaks nearby. The GMM has no provision for marking data as non-applicable, and therefore the speaker ID performance is not optimal when creakiness is used as described above. Thus, there is a need to find a better scheme for modeling the data, and to use a semi-supervised approach wherein the creakiness parameter is used only when applicable (i.e., non-zero). A possible modeling scheme could be the Hidden Markov Model (HMM), or a more general approach like Dynamic Bayesian Networks (DBNs) [23]. In either case, this parameter should be modeled in a more useful and applicable way.

Application to ASR

The application of creakiness to speaker ID motivates its use for ASR as well. In landmark-based ASR [ref], the features used are acoustically motivated, like the ones that have been used in this thesis for the speaker identification experiments. During the extraction of such parameters, creakiness might cause some hindrance, because the difference in phonation generates slightly different acoustic cues than expected. Detection of creakiness before extraction of such features could help reduce such confusions, by providing alternative features for such regions or by compensating for the effect that creakiness has. Further, identifying regions of creakiness can also help in identifying turn-taking in spontaneous speech, as speakers often tend to become creaky when they pass the turn to the next speaker.
Application to diagnosis of voice disorders

An important application of the irregular phonation detector is to help identify voice disorders by non-intrusive methods. The algorithm runs purely on the speech signal produced by the speaker, by looking at specific acoustic cues that manifest themselves because of the activity at the vocal folds. This motivates the idea that vocal fold activity can be inferred implicitly, and that the specific cues found in the produced speech might indicate the particular physiological disorder present at the speaker's vocal folds.

BIBLIOGRAPHY

[1] R. P. Lippmann, "Speech recognition by machines and humans", Speech Communication, vol. 22, pp. 1-15, 1997.
[2] S. Davis and P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences", IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 28, pp. 357-366, 1980.
[3] H. Hermansky and N. Morgan, "RASTA processing of speech", IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, pp. 578-589, 1994.
[4] L. R. Rabiner, "A tutorial on Hidden Markov Models and selected applications in speech recognition", Proceedings of the IEEE, vol. 77, pp. 258-286, 1989.
[5] D. A. Reynolds and R. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models", IEEE Transactions on Speech and Audio Processing, vol. 3, 1995.
[6] C. Y. Espy-Wilson, S. Manocha, and S. Vishnubhotla, "A new set of features for text-independent speaker identification", in Proceedings of the 9th International Conference on Spoken Language Processing (ICSLP - INTERSPEECH), 2006.
[7] K. N. Stevens, "Acoustic Phonetics", MIT Press, Cambridge, Massachusetts, 1998.
[8] S. Vishnubhotla and C. Y. Espy-Wilson, "Automatic detection of irregular phonation in continuous speech", in Proceedings of the 9th International Conference on Spoken Language Processing (ICSLP - INTERSPEECH), 2006.
[9] D. Klatt and L. Klatt, "Analysis, synthesis and perception of voice quality variations among female and male talkers", J. Acoust. Soc. Am., vol. 87, pp. 820-857, 1990.
[10] B. R. Gerratt and J. Kreiman, "Toward a taxonomy of nonmodal phonation", Journal of Phonetics, vol. 29, pp. 365-381, 2001.
[11] M. Gordon and P. Ladefoged, "Phonation types: a cross-linguistic overview", Journal of Phonetics, vol. 29, pp. 383-406, 2001.
[12] L. Redi and S. Shattuck-Hufnagel, "Variation in the realization of glottalization in normal speakers", Journal of Phonetics, vol. 29, pp. 407-429, 2001.
[13] A. Juneja, "Speech recognition based on phonetic features and acoustic landmarks", Ph.D. thesis, University of Maryland, College Park, December 2004.
[14] C. T. Ishi, H. Ishiguro, and N. Hagita, "Proposal of acoustic measures for automatic detection of vocal fry", in Proc. Eurospeech 2005, pp. 481-484, 2005.
[15] K. Surana and J. Slifka, "Acoustic cues for the classification of regular and irregular phonation", in Proceedings of the 9th International Conference on Spoken Language Processing (ICSLP - INTERSPEECH), 2006.
[16] T. Yoon, J. Cole, M. Hasegawa-Johnson, and C. Shih, "Detecting Non-modal Phonation in Telephone Speech", UIUC ms.
[17] P. Taylor, "Analysis and synthesis of intonation using the tilt model", J. Acoust. Soc. Am., vol. 107, no. 3, pp. 1697-1714, 2000.
[18] O. Deshmukh, C. Y. Espy-Wilson, A. Salomon, and J. Singh, "Use of Temporal Information: Detection of Periodicity, Aperiodicity, and Pitch in Speech", IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 776-786, 2005.
[19] M. Plumpe, T. Quatieri, and D. Reynolds, "Modeling of the glottal flow derivative waveform with application to speaker identification", IEEE Trans. Speech and Audio Processing, vol. 7, no. 5, pp. 569-586, 1999.
[20] D. Lewis, "Vocal resonance", J. Acoust. Soc. Am., vol. 8, pp. 91-99, 1936.
[21] J. Sundberg, "Articulatory interpretation of the 'singing formant'", J. Acoust. Soc. Am., vol. 55, pp. 838-844, 1974.
[22] E. Moore and M. Clements, "Algorithm for Automatic Glottal Waveform Estimation Without the Reliance on Precise Glottal Information", in Proc. IEEE ICASSP 2004, pp. 101-104, 2004.
[23] K. Murphy, "Dynamic Bayesian Networks", draft, to appear in "Probabilistic Graphical Models", M. Jordan (ed.), 2002.