Speech Metadata in Broadcast News

by Rishi R. Roy

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degrees of Bachelor of Science in Electrical Engineering and Computer Science and Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology

May 21, 2003

Copyright 2003 Rishi R. Roy. All rights reserved.

The author hereby grants to M.I.T. permission to reproduce and distribute publicly paper and electronic copies of this thesis and to grant others the right to do so.

Author: Department of Electrical Engineering and Computer Science, May 21, 2003
Certified by: Douglas A. Reynolds, Thesis Co-supervisor
Certified by: Marc A. Zissman, Thesis Co-supervisor
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Theses

Speech Metadata in Broadcast News

by Rishi R. Roy

Submitted to the Department of Electrical Engineering and Computer Science on May 21, 2003, in Partial Fulfillment of the Requirements for the Degrees of Bachelor of Science in Electrical Engineering and Computer Science and Master of Engineering in Electrical Engineering and Computer Science

Abstract

With the dramatic increase in data volume, the automatic processing of this data becomes increasingly important. To process audio data, such as television and radio news broadcasts, speech recognizers have been used to obtain word transcriptions. More recently, new technologies have been developed to obtain speech metadata, such as speaker segmentation, emotion, and punctuation. This thesis presents the Massachusetts Institute of Technology Lincoln Laboratory (MITLL) unsupervised speaker segmentation system. The goal of the system is to produce a list of segments and speaker labels given an arbitrary broadcast news audio file. Each set of segments attributed to a speaker label is similar in both foreground speaker and background noise conditions. The system is made up of four components: the acoustic change detector, the segment labeler, the segment clusterer, and the resegmenter. Of these four, the segment labeler and the resegmenter are based on the MITLL Gaussian mixture model speaker identification system. Using the 1996 Hub4 data from the Linguistic Data Consortium for training, the unsupervised segmentation system is used to segment six 10-minute broadcast news audio files. Various aspects of the component systems are tested and trends are reported. A final speaker segmentation error rate of 20% is obtained, which appears promising for the applications described above. Finally, an analysis of system errors and proposals for improvements are presented.

Thesis Co-supervisor: Douglas A. Reynolds
Title: Senior Member of Technical Staff, Information Systems Technology Group, MIT Lincoln Laboratory

Thesis Co-supervisor: Marc A. Zissman
Title: Associate Group Leader, Information Systems Technology Group, MIT Lincoln Laboratory

Acknowledgments

The work for this thesis was conducted at the Massachusetts Institute of Technology Lincoln Laboratory. I would like to extend my gratitude to the members of the Speech Group for their help and support over the past year. I would like to thank my thesis co-advisors, Dr. Douglas A. Reynolds and Dr. Marc A. Zissman, for giving me the opportunity to work on this project. I would especially like to thank Dr. Reynolds for his guidance and expertise.
He has not only given me the freedom to explore my ideas, but also the feedback to keep me on track. He has also provided a great deal of help in the editing process. One of the best aspects of my time at MIT has been my friends. Most importantly, I would like to thank Pamela Bandyopadhyay for helping me survive. Mili, LYLM. I would also like to thank Gary Escudero for helping me to relax and Saumil Gandhi for helping me to study (Nitin and Kong, anyone?). Finally, I want to thank my family for its love and support. To my sister, thank you for being a friend. To my parents, I owe you everything. This is a result of the opportunities you have always given me. Thank you for stressing the importance of education and motivating me to achieve my goals. Jai Hanumanji Ki Jai!

Table of Contents

Abstract
Acknowledgments
List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 Applications
    1.2.1 Automatic Speech Recognition
    1.2.2 Readability
    1.2.3 Speaker Searching
  1.3 Overview
    1.3.1 System Overview
    1.3.2 Thesis Overview
2 System Description
  2.1 Acoustic Change Detector
    2.1.1 Single Point Change Detection
    2.1.2 Multiple Point Change Detection
  2.2 Segment Labeler
  2.3 Clusterer
  2.4 Resegmenter
  2.5 System Flow
3 The Speaker ID System
  3.1 General System Overview
  3.2 Likelihood Ratio Detector
  3.3 The MITLL GMM-UBM Speaker ID System
    3.3.1 Front-end Processing
    3.3.2 Gaussian Mixture Models
    3.3.3 The Post-processing
  3.4 The Program Flow
4 Creating Category Models
  4.1 The Data
    4.1.1 Terminology
    4.1.2 1996 Hub4 Data Corpus
    4.1.3 Generating Segments from Raw Data
  4.2 Models
    4.2.1 Three-Model System
    4.2.2 Three-Model Filtered System
    4.2.3 Five-Model System
5 Experiments, Results, and Discussion
  5.1 Data
  5.2 Metrics
  5.3 Acoustic Change Detector
  5.4 Segment Labeler
    5.4.1 Model Performance
    5.4.2 Filter Level
  5.5 Clusterer
  5.6 Resegmenter
6 Conclusion
  6.1 System Overview
  6.2 Performance Summary
  6.3 Future Experiments
A Transcript Description
  A.1 The SGML tags and their attributes
  A.2 The Transcription
  A.3 The Annotation format
  A.4 Show ID Letters
B Segmentation Example
  B.1 Transcript File
  B.2 Initial Segmentation Files
  B.3 Result Segmentation
References

List of Figures

Figure 1: System Flowchart
Figure 2: Example of Speaker Clustering
Figure 3: Category Breakdown
Figure 4: Segmentation Process

List of Tables

Table 1: Overlap Classification
Table 2: Data Breakdown
Table 3: Three Model Show Classification
Table 4: Three Model Labeling
Table 5: Three Model Segmenting
Table 6: Filtered Three Model Labeling Scored Against Filtered Segments
Table 7: Filtered Three Model Segmenting Scored Against Filtered Segments
Table 8: Filtered Three Model Labeling Scored Against Unfiltered Segments
Table 9: Filtered Three Model Segmenting Scored Against Unfiltered Segments
Table 10: Five Model Labeling
Table 11: Five Model Segmenting
Table 12: Change Detection Maximum Duration
Table 13: Change Detection BIC Weighting
Table 14: Five Model - Speech vs. Non-speech
Table 15: Segment Labeler Filter
Table 16: Clusterer Parameters 1
Table 17: Clusterer Parameters 2
Table 18: Passes Through the Resegmenter

CHAPTER 1
Introduction

1.1 Background

Language is the basic means of communication for civilizations worldwide, and speech is its most natural form of expression. It is often necessary to transform speech to another medium in order to enhance our understanding of language. Examples of such translation include turning speech into a varying voltage signal transmitted through telephone lines, or turning speech into a sequence of ones and zeros in digital recordings. The goal of speech-to-text transcription is to transform acoustic speech signals into a readable format. The technology to do this has been around for many years, and has historically emphasized the transcription of audio signals into words. Although this process is still a very active aspect of research, new emphasis has been placed on the process of metadata extraction. Unlike transcription, metadata extraction involves deriving non-lexical information about the speech, which includes (but is not limited to) determining punctuation, emotion, proper nouns, speaker identification, and diarization. This thesis addresses the latter two types of information. Namely, it seeks to address how one goes about creating a system that uses speaker ID techniques to conduct diarization. The goal of speaker identification is to determine who, among a given set of candidate speakers, has spoken a segment of audio. Diarization, also called segmentation, involves segmenting an audio file and associating each of these segments with a speaker.

The goal of this thesis is to present the unsupervised segmentation system developed at the Massachusetts Institute of Technology's Lincoln Laboratory (MITLL). The unsupervised segmentation system takes, as input, a broadcast news audio file and, through a number of processes, outputs a segmentation file, where a segmentation file is a list of segments and speakers. Since there is no a priori information about the speakers or content of the audio file, the task is unsupervised.

1.2 Applications

There are many possible applications for this type of technology. Three of the more common ones are described below. The first application is in automatic speech recognition. This is one of the initial areas for which diarization was intended, and will, therefore, be described extensively. The second is in improving the readability of speech recognizer output. The third is in enhancing audio searching via speaker tags.

1.2.1 Automatic Speech Recognition

Automatic speech recognition refers to the process of transcribing audio into text. There are many advanced systems that carry out this task, but they all have one thing in common - they make mistakes. These errors, measured by a metric called word error rate (WER), occur when the speech transcription system incorrectly transcribes a word. Mistakes can occur for a variety of different reasons, one of the most important being variations in audio signals. There are three main types of variations that occur: inter-personal, intra-personal, and "noise." Inter-personal variations are due to the uniqueness of a person's speech. People make use of the fact that a person's voice contains qualities that indicate his or her identity.
These differences in speech are caused by the physical speech apparatus and by pronunciation differences, and they can make it very difficult for an automated system to transcribe what is being said. Intra-personal variations are deviations within one speaker's speech. Speech is largely a product of what a person does, as opposed to fingerprints, DNA, or retinal patterns, which are what a person is [1]. This means that speech qualities do not remain consistent within the same speaker. These alterations can be the result of many different causes, such as illness. "Noise" variations are caused by any interfering or ambient acoustic signal that is not part of the speech of interest. Noise sources come in the forms of unintelligible background speech, foreign speech, music, white noise, and many other random sounds present in everyday life. Unsupervised segmentation addresses each of these three types of variation.

The problems that arise from the transcription of regions of pure noise variation are twofold. First, since no one is talking during these segments, attempting to transcribe them produces words where they should not exist. This creates word errors by default. Second, since many of the transcription systems used are causal, where what happens in the past affects the future, these regions can adversely affect later speech segments. Inter-personal and intra-personal variations cause problems because the process used for modeling the speakers is not robust enough. In the transcription process, it would be ideal to have one model that can be applied to an entire audio file and yield a perfectly accurate transcript. Currently, however, this does not occur: the transcription process cannot accurately accommodate different speakers and internal variance. In order to improve the WER, it becomes necessary to improve the modeling process so that the models used can accommodate these variations.

Unsupervised segmentation addresses both problems. As outlined in Section 1.3.1, the segmentation process includes four distinct steps. One step is the segment labeler. Its goal is to detect and then remove non-speech regions, or pure noise variations, from the audio file. By passing only regions of speech on to the rest of the process, only audio containing speech is diarized. Then, using this diarization, automatic speech recognizers can operate only on speech regions. For dealing with personal variations, one of the most common solutions is to adapt a speaker-independent model for each speaker. The ideal approach is to adapt a model for each homogeneous set of segments. Homogeneity refers to audio that sounds like the same speaker in the same acoustic environment. This delineation of segments is exactly what the unsupervised segmentation system does. The first step, acoustic change detection, marks continuous, homogeneous segments of audio. Then the clustering process groups non-contiguous homogeneous segments together.

1.2.2 Readability

One of the problems with the output of automatic speech recognizers is that it consists solely of a stream of words. The text is, for all practical purposes, unreadable. One of the potential uses of diarization is in helping to fix this problem. By parsing through text with the output segmentation file, all words transcribed from a given segment can be grouped together and labeled with the segment speaker. This process helps transform a stream of words into a pseudo-script.
Production of capitalization and punctuation in the text is addressed by other metadata processes [2].

1.2.3 Speaker Searching

Diarization can also be used for speaker searching. Once the audio file has been processed by the unsupervised segmentation system, each of the generated output speaker labels can be matched to a known speaker from a speaker database using speaker ID methods. Once this is accomplished, instead of the segments being attributed to generic speaker labels, they can be attributed to real speakers. These segmentation files can then be loaded into a database, and it becomes possible to search for clips of audio associated with a given speaker. This application can be extended even further. If not only the segment times but also the actual transcript of what was said in those segments were associated with each speaker, it would become possible to search for specific topics of conversation. Then, not only can one search for clips of Bill Clinton, one can also obtain clips of Bill Clinton talking about tax cuts. These types of experiments are detailed in [3].

1.3 Overview

A general overview of the unsupervised segmentation system is given first, followed by an overview of the thesis.

1.3.1 System Overview

The unsupervised segmentation system is comprised of four main parts, as shown in Figure 1. The first part of the system is the acoustic change detector. Its role is to provide the initial segments for the rest of the system. Since these are the bases of the segments that are to be clustered, it would be ideal for these segments to be homogeneous. The process for this, as well as for the other three components, is described in Chapter 2. The second piece of the system is the segment labeler. It performs two distinct tasks. The first is to extract regions of silence from the initial segments. The second is to use a speaker ID system (described in Chapter 3) to label the segments as being either speech, music, noise, speech and music overlap, or speech and noise overlap. The non-speech (music and noise) segments are then discarded and the filtered segments are passed on to the clusterer. The third component, the clusterer, groups similar segments together, associating each group with a single speaker. The final component is the resegmenter. Using the speakers and associated segments generated by the clusterer as a reference, the resegmenter uses a speaker ID system to create a new segmentation file. The idea is that, by training models for each speaker and then generating segments for each of them, segment boundaries can be refined over the initial segmentation.

Figure 1: System Flowchart. This is a diagram of the unsupervised segmentation system: the audio file passes through the acoustic change detector (producing the initial segments), the segment labeler (which, using the category models, sets aside silence, music, and noise segments and keeps the speech segments), the clusterer (producing speaker-labeled segments), and finally the resegmenter. As indicated, there are four main components to the process. Each will be described in detail.

1.3.2 Thesis Overview

Chapter 2 provides a detailed description of each of the components of the unsupervised segmentation system. First described is the acoustic change detector, then the segment labeler, the clusterer, and finally the resegmenter. Chapter 3 describes the speaker ID system that is used both in the segment labeler and the resegmenter. It describes the basic steps and principles on which the system is based.
Chapter 4 describes the evolution of the five category models of speech, music, noise, speech and music overlap, and speech and noise overlap used in the labeler. It begins with an explanation of, and results from, the basic three-model system, then continues to the filtered three-model system, and concludes with the five-model system. Chapter 5 presents the diarization experiments that were run, their associated results, and a discussion of those results. Results and analysis are presented for each system component. Finally, Chapter 6 presents a discussion of the general conclusions that can be drawn from the experiments. It also presents some future research that can be conducted to improve the performance of the unsupervised segmentation system.

CHAPTER 2
System Description

The following chapter provides an in-depth description of the unsupervised segmentation system. The discussion of the system is broken into four sections - one for each of the four main components outlined in Section 1.3.1. The first component is the acoustic change detector, which provides the initial segments. The second is the segment labeler, which removes silence and non-speech segments. The third is the clusterer, which groups like segments together. The final component described is the resegmenter, which refines the results.

2.1 Acoustic Change Detector

The goal of the acoustic change detector is to segment an audio file into a sequence of homogeneous regions. These segments provide the initial inputs to the rest of the system. Ideally, each segment should have the same conditions within it, including speaker, background noise, and sound levels. There are a number of ways to do change detection. In this work, a statistical change-point detection algorithm based on the Bayesian Information Criterion (BIC) [4] is used. This process of acoustic change detection will be discussed in two parts. The first part is a base case, in which an audio file is processed to determine whether there is a single acoustic change or not. If an acoustic change occurs, the base case will also determine its location. The process will be discussed first in general terms and then in the BIC context. The second part expands this to detect an unspecified number of change points.

2.1.1 Single Point Change Detection

For this section, let the audio file be of length N frames, where each frame, f_i, has an associated feature vector, x_i. The audio file can be represented by a sequence of these feature vectors, X = {x_1, x_2, ..., x_N}, where each vector x_i is taken at a discrete frame i in [1, 2, ..., N]. The base case scenario is viewed in the context of a likelihood-ratio detector. The two hypotheses that are being tested, H_0 and H_1, are as follows:

H_0: The sequence of feature vectors, X = {x_i in R^d, i = 1, ..., N}, is represented by a single model, M_X.

H_1: The sequences of feature vectors, X_1 = {x_i in R^d, i = 1, ..., t} and X_2 = {x_i in R^d, i = t+1, ..., N}, are represented by two models, M_{X_1} and M_{X_2}, respectively.

Combining these two produces the maximum likelihood ratio statistic, R(t) [5]:

R(t) = \sum_{i=1}^{t} \log p(x_i \mid M_{X_1}) + \sum_{i=t+1}^{N} \log p(x_i \mid M_{X_2}) - \sum_{i=1}^{N} \log p(x_i \mid M_X)    (1)

After computing this value for all t in (1, N), if R(t) <= 0 for all t, then no acoustic change is present. If, on the other hand, max_t R(t) > 0, then a change is marked at the t where this maximum occurs. Unfortunately, this process does not work as desired.
In almost every case, two models will represent the data better than one model, since twice as many parameters are available for modeling in the two-model case. To compensate for this phenomenon, a penalty for complexity is introduced. There are a number of ways to do this. The method for acoustic change detection chosen for this work is BIC.

BIC introduces a penalty that is correlated with a model's complexity. This penalty, P(t), equals (\lambda / 2) \cdot \#(M) \cdot \log C, where \lambda is the weight of the penalty (generally set to 1), \#(M) is the number of parameters in the model, and C is the number of frames used to create the model. The penalties for each model are added to Equation 1. They are then combined to create:

BIC(M_X ; M_{X_1} + M_{X_2}) = R(t) - (\lambda / 2) [\#(M_{X_1}) + \#(M_{X_2}) - \#(M_X)] \log N    (2)

For this work, Gaussians are used for modeling. Looking at the complete audio file represented by X, the mean vector, \mu_X, and full covariance matrix, \Sigma_X, are extracted. These parameters define a Gaussian model M_X = M(\mu_X, \Sigma_X). Extending this to the sequences X_1 and X_2 gives M_{X_1} = M(\mu_{X_1}, \Sigma_{X_1}) and M_{X_2} = M(\mu_{X_2}, \Sigma_{X_2}). Using these models, Equation 2 becomes [4]:

BIC(M_X ; M_{X_1} + M_{X_2}) = R(t) - P(t)    (3)

where:

R(t) = N \log |\Sigma_X| - t \log |\Sigma_{X_1}| - (N - t) \log |\Sigma_{X_2}|

P(t) = (\lambda / 2) [d + d(d+1)/2] \log N

and d is the dimension of the feature space. If Equation 3 is positive at frame t, then the model of two Gaussians is favored, with the change between the two models occurring at t. If there are no positive-valued frames, then there is no change.

2.1.2 Multiple Point Change Detection

The next step is to extend the detection of one change to the detection of many changes. The following algorithm is presented by Chen and Gopalakrishnan [4]:

(1) Initialize the interval [a, b]: a = 1; b = 2.
(2) Detect whether there is one change point in [a, b] via BIC.
(3) If there is no change in [a, b], let b = b + 1; else let i be the change point detected and set a = i + 1, b = a + 1.
(4) Go to (2).

Using this algorithm, the change points are marked. Segments are then created from these marks. These segments are the basis of the rest of the unsupervised segmentation task. Since each segment is homogeneous, if initial labeling is done (as described in the next section), regions of pure noise and music can be identified and eliminated. Then the remaining, homogeneous speech segments can be grouped together and attributed to speakers.
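To make the two pieces concrete, the sketch below implements the single-point BIC test of Equations 1-3 with full-covariance Gaussians, followed by the sequential scan of Section 2.1.2. It is only an illustration, not the MITLL detector: the guard margin, window-growth step, and default penalty weight are assumed values, and the function names are invented for this example.

```python
import numpy as np

def delta_bic(X, t, lam=1.0):
    """BIC comparison of Equations 1-3: one full-covariance Gaussian for the
    whole window X (N x d) versus a split into X[:t] and X[t:].
    Positive values favor two models, i.e. an acoustic change at frame t."""
    N, d = X.shape
    logdet = lambda A: np.linalg.slogdet(np.cov(A, rowvar=False, bias=True))[1]
    R = N * logdet(X) - t * logdet(X[:t]) - (N - t) * logdet(X[t:])
    P = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(N)
    return R - P

def single_change(X, margin, lam=1.0):
    """Return the best change frame in X, or None if no Delta-BIC is positive.
    `margin` keeps enough frames on each side for a stable covariance estimate
    (it should be larger than the feature dimension)."""
    N = len(X)
    if N < 2 * margin:
        return None
    scores = {t: delta_bic(X, t, lam) for t in range(margin, N - margin)}
    t_best = max(scores, key=scores.get)
    return t_best if scores[t_best] > 0 else None

def detect_changes(X, margin=30, grow=10, lam=1.0):
    """Sequential multi-point scan after Chen and Gopalakrishnan [4]: grow the
    window; when a change is found, restart just after it. The step `grow`
    plays the role of the b = b + 1 increment in the algorithm above."""
    changes, a, b = [], 0, 2 * margin
    while b <= len(X):
        t = single_change(X[a:b], margin, lam)
        if t is None:
            b += grow                 # no change found: widen the window
        else:
            changes.append(a + t)     # mark the change and move past it
            a, b = a + t + 1, a + t + 1 + 2 * margin
    return changes
```

The two tuning knobs exposed here, the maximum window duration and the BIC weight, correspond to the parameters examined later in Tables 12 and 13.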
2.2 Segment Labeler

The segment labeler is used to detect and then remove silence and non-speech audio segments. This is motivated by the way that clustering works. Simply stated, clustering groups similar segments together. Ideally, this would mean that segments of speech that sound like they have the same speaker are combined. In practice, however, noise, music, and silence (the three types of non-speech signal that are present) can become the point of similarity between segments. This means that, rather than segment A and segment B being paired together because they have the same-sounding speaker, they are paired together because they both contain music. Since the latter is undesirable, initial labeling is conducted to remove these sections.

Initial labeling can be further broken down into two processes. First is the detection and removal of silence, and second is the detection and removal of noise and music. An energy-based activity detector detects silence. This mechanism is described in detail in Section 4.1.3. It involves the generation of a frame-by-frame decision of silence, which is then processed to generate segments of silence. Finally, these segments are filtered against the initial segments: all segments of speech that fall into a given range of silence are spliced out, leaving only non-silence audio.

The process of removing noise and music is approached as a speaker ID task (described in Chapter 3). The categories of speech, music, noise, speech and music overlap, and speech and noise overlap are treated as individual classes. The motivation for using these five categories and the creation of their respective models is addressed in Section 4.2.3. Using the models for each of these, segments can be labeled based on their respective log-likelihood scores. All non-speech segments are removed, and the speech segments are passed to the clusterer. Combining these steps yields the following labeling procedure:

- Energy-based activity detection is used to generate a list of silence segments.
- The initial segments from the acoustic change detector are filtered against silence.
- The speaker ID system is run to test the audio file against the input category models of speech, music, noise, speech and music overlap, and speech and noise overlap.
- The speaker ID system output is processed by the labeling post-processing script (described in Section 3.3.3).
- The non-speech segments are extracted and saved to a file. The same is done for the speech segments. The speech file is passed on to the clusterer.

2.3 Clusterer

Clustering is a process by which single items can be arranged together in order to form groups that have similar characteristics. The clustering software used in this thesis is the software employed at MITLL, which is based on the agglomerative clustering algorithm [6]. Clustering begins with a number of single-element clusters called "singletons." For the case at hand, these are the segments generated in the aforementioned steps. A distance is computed between each of the possible cluster pairings. This distance is calculated in such a way as to represent the similarity between the candidate clusters. Then, an iterative process begins where, for each iteration, the two "closest" clusters are merged to form a new cluster. New distances are then computed; they must be calculated for all of the pairings that contained one of the two pre-merged clusters. There are a number of ways to determine the new distance between the merged cluster and every other cluster. For the MITLL clustering system, the distance between two clusters is equal to the distance between the two models used to represent each cluster. In the standard algorithm, this iterative joining continues until only one large cluster exists. The agglomeration does not proceed this far for the speaker clustering performed here. Rather, it is continued until a given set of circumstances is met, at which point the agglomeration stops. Figure 2 shows an example of agglomerative clustering.

At the base of the clustering "tree" are the singletons, or leaves. The earlier processes generated these segment singletons. Each leaf is its own cluster. As each iteration of the clustering algorithm proceeds, the two "nearest" clusters are joined. This would theoretically continue until one cluster, called the root, remains. But, in order to generate a meaningful set of speakers, clustering is continued only until a stopping criterion is met, be it a certain metric being reached or the proper number of speakers being generated. For the MITLL clustering system, the clustering stopping criterion is based on Delta-BIC [4].
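The agglomeration loop itself is simple; what characterizes the MITLL system is the model-based distance and the Delta-BIC stopping test, which appear only as placeholders (distance, stop_merge) in the sketch below. This is a schematic of the procedure just described, with invented names, and it naively recomputes all pairwise distances on every pass rather than caching them as a real implementation would.

```python
import numpy as np

def agglomerate(segments, distance, stop_merge):
    """Greedy agglomerative clustering of the speech segments.

    segments   : list of per-segment feature matrices (the singleton leaves).
    distance   : callable returning a dissimilarity between two clusters,
                 each represented here by the pooled frames of its members.
    stop_merge : callable that refuses a proposed merge, standing in for the
                 Delta-BIC stopping criterion used by the MITLL clusterer [4].
    Returns the clusters as lists of original segment indices.
    """
    clusters = [[i] for i in range(len(segments))]
    pooled = list(segments)
    while len(clusters) > 1:
        # find the closest pair of clusters
        best, best_d = None, np.inf
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(pooled[i], pooled[j])
                if d < best_d:
                    best, best_d = (i, j), d
        i, j = best
        if stop_merge(pooled[i], pooled[j]):
            break                                        # stopping criterion met
        pooled[i] = np.vstack([pooled[i], pooled[j]])    # merge cluster j into i
        clusters[i] = clusters[i] + clusters[j]
        del pooled[j], clusters[j]
    return clusters
```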
The point at which the algorithm stops determines the number of speakers. This results in a number of unique speakers, each of whom has a given set of segments attributed to him or her. Each segment appears under one and only one speaker, and should have similarities with the other segments in its cluster. Ideally, this similarity will be the sound of the speaker's voice, but it might not be. That is why the filtering of non-speech is likely to improve results.

Figure 2: Example of Speaker Clustering. The leaf nodes represent different segments in a given audio file (the figure shows leaf segments S1 through S8 along a time axis, merge iterations 1 through 7 up to the root, and a branch cut where the clustering is stopped). As the clustering algorithm proceeds, segments are combined based on the distance between them. The clustering is stopped before the theoretical root cluster is reached, when a given set of circumstances is met. This gives a set of speakers: in this case, four speakers.

2.4 Resegmenter

As mentioned earlier, the ideal outcome of speaker clustering is the generation of clusters whose member segments all have the same speaker. As much as removing non-speech elements may improve the results, impurities remain. In addition, overlaps of speech and music and of speech and noise remain in the segmentation, biasing the clustering. Both of these cases make misclassifications inevitable. Re-segmentation attempts to remedy this situation. First, a script turns the output of the speaker clustering (a list of segments and speakers) into a set of input files for the speaker ID system (described in Chapter 3). Then, the system trains models for each speaker. In the testing phase, these models are used to segment the audio file. After the generation of the input files, the whole process is nearly identical to that used in the segment labeler. Combining these steps yields the following re-segmentation procedure:

- New lists are generated from the speaker clustering output file. These include a list of the speakers present in the output and a set of files for each speaker.
- The speaker ID system is run to train models for each of the speakers.
- The speaker ID system is run to test the audio file against the newly created speaker models.
- The speaker ID system output is processed by the segmentation post-processing script.
  o The non-speech segments recorded in the segment labeler (Section 2.2) are used as a block filter for the segmentation file, so that none of the times that appear in the resulting segmentation contain non-speech regions.
- The output list is fed back into the re-segmentation script as many times as desired.

2.5 System Flow

Combining these steps yields the following final experimental outline for the unsupervised segmentation task:

- The input audio file is segmented using acoustic change detection. This generates a list of initial segments for the following steps.
- Energy-based speech activity detection is used to generate silence segments.
- The initial segments are filtered against the silence segments. This produces a list of segments that do not contain silence.
- The input audio file is then tested against the five category models that are used in the segment labeler to produce an output file.
- The non-silence segments and the above output file are passed to a script that labels each segment with one of the five category IDs.
- The non-speech labeled segments are removed, leaving a list of segments that contain neither pure music, pure noise, nor silence.
- The segments are passed to the clustering software. This produces a speaker label for each of the segments.
- The output is re-segmented multiple times, producing as output a list of segments and speaker labels.

CHAPTER 3
The Speaker ID System

Speaker ID is the task of deciding who, among many candidate speakers, has spoken a given sample of audio. As stated above, the tasks of labeling the initial segments of audio and re-segmenting the results will both be conducted from a speaker ID approach. Although the categories of speech, music, et cetera in the segment labeler are not speakers in the classic sense, this method will prove to be appropriate.

3.1 General System Overview

All speaker ID systems contain two phases: training and testing. In the training phase, the speaker models are created from input audio segments. The second step, testing, involves using these models to identify unknown audio. In the segment labeler, the training phase occurs only once, after which the generated models are used for all audio files. On the other hand, each iteration through the resegmenter requires training new models based on the input segmentation file, and testing the audio file with those new models.

The training is made up of two processes. The first is the front-end processing, where the feature vectors are extracted from the audio signal. The second is the use of these vectors to create the appropriate speaker models. The testing phase has three sub-processes. The first is the same front-end processing used for training. The second is the scoring of these feature vectors against the models created in training. The third and final step is post-processing. In this step, the scores are smoothed and normalized to increase stability and improve edge detection. The scores are then processed in one of two ways: labeling or segmenting. The nature of the experiment determines which of the two is conducted. The labeling or segmenting process yields the results that determine how well the experiment performed.

3.2 Likelihood Ratio Detector

In order to understand this, or nearly any, experimental apparatus used in speaker ID, it is necessary to discuss the basis of the system, the likelihood ratio detector. It will be examined in the context of matching an audio segment to a speaker. Given a segment of speech, Y, and a set of hypothesized speakers, S = {s_1, s_2, ..., s_N}, the task is to determine which speaker has spoken utterance Y. Let a speaker from S, s_r, be represented by a speaker model \lambda_r, where \lambda_r contains the characteristics that make s_r unique. Then, using Bayes' rule:

p(\lambda_r \mid Y) = \frac{p(Y \mid \lambda_r) \, p(\lambda_r)}{p(Y)}    (4)

where p(\lambda_r) is the a priori probability of speaker s_r being the unknown speaker and p(Y) is the unconditional probability density function (pdf) for an observation segment. Speaker s_r is then the hypothesized true speaker if it has the highest probability, shown by:

\frac{p(Y \mid \lambda_r) \, p(\lambda_r)}{p(Y)} > \frac{p(Y \mid \lambda_s) \, p(\lambda_s)}{p(Y)}, where s = 1, ..., N (s != r).

This rule is simplified by canceling p(Y) and assuming that p(\lambda_r) = 1/N, r = 1, ..., N [7]. This results in:

p(Y \mid \lambda_r) > p(Y \mid \lambda_s), where s = 1, ..., N (s != r).

This means that the speaker identification decision is reduced to evaluating each speaker's pdf for the observed speech segment Y and choosing the maximum value.
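In code, the simplified rule is just an argmax over the candidate models' log-likelihoods for the segment. A minimal sketch, assuming frame independence and a precomputed (n_frames x n_speakers) array of per-frame log-likelihoods (how those scores are produced is the subject of Section 3.3); the names here are illustrative.

```python
import numpy as np

def identify_speaker(frame_loglik, speaker_names):
    """Return the hypothesized speaker of segment Y: the model with the
    largest total log-likelihood, i.e. the max over r of log p(Y | lambda_r)."""
    totals = frame_loglik.sum(axis=0)   # frames assumed independent
    return speaker_names[int(np.argmax(totals))]

# e.g. identify_speaker(np.array([[-3.1, -2.8], [-2.9, -3.0]]), ["s1", "s2"])
# -> "s2"
```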
3.3 The MITLL GMM-UBM Speaker ID System

The differences between speaker ID systems lie in how the speakers are modeled and how these models are applied. The speaker identification system used in this thesis is called the Gaussian Mixture Model-Universal Background Model (GMM-UBM) Speaker Identification System. It is used at MITLL, where the research for this thesis has been conducted. This system provides the necessary computations for both phases of the speaker ID task. For the training phase, when passed a set of speaker names, appropriate segments for each speaker, and an audio file, the GMM-UBM system will create a Gaussian mixture model for each speaker containing all of its identifying characteristics. The generation of these speaker segments requires a significant amount of work, as discussed in Chapter 4. In the testing phase, when passed an audio file and a set of speaker models, the system will apply each model to each audio file. The output for each audio file will be a file of frame-by-frame log-likelihood scores for each model. The log-likelihood is log[p(Y | \lambda_r)], where the likelihood is p(Y | \lambda_r). The exact format of the output will be discussed in Section 3.3.3. What follows now is an overview of the GMM-UBM system.

3.3.1 Front-end Processing

The goal of front-end processing is to analyze the given speech signal and extract a salient sequence of features that convey the speaker-dependent information. The output from this stage is a series of feature vectors that characterize the test sequence, X = {x_1, x_2, ..., x_T}, where each vector x_t is taken at a discrete time t in [1, 2, ..., T]. In the training phase of the identification process, these feature vectors are used to create a model \lambda in the feature space of x that characterizes the model's respective speaker. In the testing phase (when a segment is matched with a speaker), these feature vectors are used, in conjunction with the models, to determine the conditional probabilities needed to compute the log-likelihoods.

The first step in producing the feature vectors is the segmentation of the complete speech signal into 20-ms frames every 10 ms. Next, mel-scale cepstral feature vectors are extracted from the speech frames. The mel-scale cepstrum is the discrete cosine transform of the log-spectral energies of the speech segment Y, where the spectral energies are calculated over logarithmically spaced filters with increasing bandwidths. The zeroth cepstral coefficient is discarded and the remaining coefficients are used for further processing. Next, delta cepstra are computed using a first-order orthogonal polynomial temporal fit over +/- 2 feature vectors (two to the left and two to the right in time) from the current vector. Finally, linear channel convolution effects are removed from the feature vectors using RASTA filtering. Because cepstral features are used, these effects are additive biases. The result is channel-normalized feature vectors [8].
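Two of the steps above are easy to show directly: the 20-ms/10-ms framing and the delta-cepstra computation, which for a first-order orthogonal-polynomial fit reduces to a least-squares slope over the +/- 2 neighbouring vectors. The mel-cepstrum extraction and RASTA filtering that sit between these steps are omitted, and the function names and edge handling are assumptions of this sketch rather than details of the MITLL front end.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=20, hop_ms=10):
    """Slice the waveform into 20-ms frames advanced every 10 ms."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])

def delta_cepstra(cepstra, width=2):
    """Delta features: least-squares slope of each coefficient over a window
    of +/- `width` frames around the current frame (edges are replicated)."""
    padded = np.pad(cepstra, ((width, width), (0, 0)), mode="edge")
    lags = np.arange(-width, width + 1)          # [-2, -1, 0, 1, 2]
    denom = float(np.sum(lags ** 2))
    return np.stack([
        np.sum(lags[:, None] * padded[t:t + 2 * width + 1], axis=0) / denom
        for t in range(len(cepstra))
    ])
```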
3.3.2 Gaussian Mixture Models

The next step is the creation of the models for the speakers. In addition to providing an accurate description of the vocal characteristics of the speaker, the model of choice also determines how the actual log-likelihoods will be calculated. For text-independent speaker recognition, where there is no prior knowledge of what the speaker will say, the most successful likelihood function has been the Gaussian mixture model. A Gaussian mixture is the weighted combination of several component Gaussian distributions. For each speaker, a Gaussian mixture is created from the feature vectors that represent his or her frames of speech. Given the complexity involved, the actual creation of these models is beyond the scope of this thesis. After the model is created, the likelihood becomes:

p(x \mid \lambda) = \sum_{i=1}^{M} p_i \, b_i(x)

where M is the number of mixtures, x is a D-dimensional random vector, b_i(x), i = 1, ..., M, are the component densities, and p_i, i = 1, ..., M, are the mixture weights satisfying the condition \sum_{i=1}^{M} p_i = 1. Each component density is a D-variate Gaussian function of the form:

b_i(x) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right)

where \mu_i is the mean vector and \Sigma_i is the covariance matrix. Reynolds provides an in-depth discussion of all aspects not only of GMMs, but also of the whole GMM-UBM system, in his thesis [7].

3.3.3 The Post-processing

In the testing phase, the GMM-UBM system processes the audio file on a frame-by-frame basis. For each of these frames, the feature vectors are evaluated against the speaker models, producing a log-likelihood ratio for each model for each frame. All of these scores are aggregated and stored in a file. The post-processing step involves going through the frame-by-frame output file and generating an output related to the speaker information contained in the test data. There are two methods by which to accomplish this processing. The first involves labeling segments, which is of obvious use in the segment labeler. The second involves generating segments, which is of use to the resegmenter. Both of these steps operate on the output from the GMM-UBM system.

The first method involves passing the pre-determined segments into a script. For each segment, the script extracts the appropriate frames (those that fall within the upper and lower time-bounds of the segment definition) and averages the log-likelihood ratios for each model across frames. The model with the maximum average log-likelihood ratio is determined to be the speaker of the segment.

The second method involves parsing through the complete dump-scr output to generate segments. The first step in the process is to average the frames over a given time window. This serves two purposes. The first is to remove random fluctuations: since there are periods of non-speech in any speech segment, this step ensures that these non-speech periods are not counted as the wrong speaker. The second reason for this smoothing window is to soften the edges between segments. This makes the transitions between categories more gradual and, again, has the purpose of delineating the transitions between speakers more precisely. The next step is to determine how these log-likelihood scores are used to decide who is speaking during each frame. The method used for this setup is a maximum-value method: the model with the highest log-likelihood score is determined to be the speaker. Using this rule, the post-processing script labels each frame with its associated speaker. Finally, these frame-by-frame speaker labels are coalesced to form segments.
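A compact sketch of the second (segmenting) path: smooth each model's score track, apply the maximum-value rule frame by frame, and coalesce runs of identical labels into segments. The 0.5-second window and the function name are illustrative choices, and the filtering of the result against the recorded non-speech regions (Section 2.4) is omitted here.

```python
import numpy as np

def scores_to_segments(frame_loglik, model_names, window=50, frame_step=0.01):
    """Turn the frame-by-frame GMM-UBM output into (start, end, model) segments.

    frame_loglik : (n_frames x n_models) array of log-likelihood ratios.
    window       : moving-average smoothing length in frames (50 => 0.5 s).
    frame_step   : frame advance in seconds (10 ms in this system).
    """
    # 1. smooth each model's score track to suppress short fluctuations
    kernel = np.ones(window) / window
    smoothed = np.apply_along_axis(
        lambda s: np.convolve(s, kernel, mode="same"), 0, frame_loglik)
    # 2. maximum-value rule: label every frame with its best-scoring model
    labels = np.argmax(smoothed, axis=1)
    # 3. coalesce runs of identical labels into segments
    segments, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((start * frame_step, t * frame_step,
                             model_names[labels[start]]))
            start = t
    return segments
```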
o If testing, the GMIM-UBM system runs its testing subroutines using the input models on the testing data to create an output file for each test file. - The testing output files are processed with the appropriate post-processing task: either labeling or segmenting, depending on the desired outcome. 31 CHAPTER 4 Creating Category Models The segment labeler takes five category models as input: a model for speech, music, noise, speech and music overlap, and speech and noise overlap. Although the models are created in a process independent of the actual unsupervised segmentation task, the process of creating good models is important and merits discussion. The GMM-UBM speaker ID system was used to both create and test the GMMs that were used to model each of the categories. The first step in creating the models was in processing the data that was available for use. The first section describes both the data and the process used to format it for use by the speaker ID system. The second part of the chapter discusses the three sets of models that were created, and the qualities of each set. 4.1 The Data Before the data is discussed, the terminology used to describe it will be presented. After that, the raw data will be described. Last comes the procedure for parsing and formatting the data. 4.1.1 Terminology The terminology that follows is taken from the Linguistic Data Consortium (LDC). The terms are, therefore, consistent with those used throughout the field. The following definitions apply when providing a macroscopic description of the data. A show is defined as a radio or television broadcast production. For instance, ABC 32 Nightline is a show. An episode refers to a particular taping of a show. An example is ABC Nightline for July 21, 1996. For the Hub4' 1996 data, each separate audio file recorded an episode of a show. The following definitions apply when discussing a specific episode. A segment is a contiguous section of audio that maintains a consistent set of pre-determined criteria. Speech marks the segments in which someone in the foreground is talking in an intelligible manner. Each distinct person speaking is referred to as a speaker. Background is acoustical activity that is neither speech nor music. It represents two subcategories. The first is unintelligible or background speech, which shall be referred to as background-speech. The second encompasses all other sounds that occur that are not speech or music - helicopters, gunshots, white noise, et cetera. This shall be called background-noise, or noise. Finally, the term overlap will describe the segments that contain more than one of the above categories (speech, music, or noise). Looking at the earlier definitions, a segment can be labeled with three types of labels - speech, music, or noise. By the methodology of tagging described in Appendix A.1, the tags music and noise cannot overlap. Therefore, there are only two types of overlap (it is assumed that label a overlapping with label b is the same as label b overlapping with label a). The regions that include an overlap of speech and music will be labeled "speech+music" and the segments of speech and noise will be called "speech+noise." The results can be seen below in Table 1. It is important to note that, for the majority of the time, this document will refer to background-speech as an overlap condition between speech and noise. 1Hub4 refers to a particular evaluation data set the LDC distributes for use. 
Labels   Speech         Music          Noise
Speech   -              Speech+Music   Speech+Noise
Music    Speech+Music   -
Noise    Speech+Noise                  -

Table 1: Overlap Classification. This table shows the possible overlap scenarios and the manner in which they are labeled in the processing of the raw data.

4.1.2 1996 Hub4 Data Corpus

The corpus is made up of 172 episodes that come from eleven shows. The 1996 Hub4 data represents over 104 hours of broadcasts from the ABC, CNN, and CSPAN television networks and the NPR and PRI radio networks. The LDC provides 172 audio files, one for each episode, and their respective transcripts.(2) Each transcript provides a manual recounting of the audio file associated with it. Each transcript is hand-marked to denote not only what is said by each speaker, but also speaker information including gender, phrasal-level time boundaries, boundaries between news stories, and background conditions. The full spectrum of tags (i.e., what is marked in the transcription process) can be viewed in Appendix A, along with the formatting of the transcript itself.

(2) Information about the Hub4 data set can be found at http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S44.

It is helpful to highlight an important part of the tagging/transcription process. All tags, except for the "Background" tag, contain a start time and an end time. This means that, in order to generate segments for any of these tags, only the opening tags must be examined, since they contain all of the important time information. The "Background" tag is unique because it is the only tag that provides a start time but no end time. This means that every time a "Background" tag is seen, the previous "Background" segment has ended and a new "Background" segment has begun. This results in a non-spanning behavior of the tag: there cannot be two types of background happening simultaneously. To this extent, there are three types of background categorized by the LDC: "music," "speech," and "other." This document refers to these categories as music, background-speech, and background-noise, respectively. In the case where a "Background" segment has ended and a new one has not begun, the transcript simply labels this as starting a new "Background" segment with "Level" = "Off."
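Because the tag is non-spanning, turning "Background" tags into segments only requires closing each tag at the start time of the next one. The sketch below shows that bookkeeping on an already-parsed, time-ordered list of events; the lowercase kind strings and the function name are inventions of this example, not the SGML attribute values documented in Appendix A.

```python
def background_events_to_segments(events, episode_end):
    """Close out non-spanning "Background" tags.

    events      : time-ordered list of (start_time, kind) pairs, where kind is
                  "music", "speech", "other", or "off" (the Level="Off" case).
    episode_end : time used to close the final open tag.
    Returns (start, end, kind) segments, skipping the "off" stretches.
    """
    closers = events[1:] + [(episode_end, None)]
    segments = []
    for (start, kind), (next_start, _) in zip(events, closers):
        if kind != "off":
            segments.append((start, next_start, kind))
    return segments

# e.g. background_events_to_segments(
#          [(0.0, "music"), (12.5, "off"), (300.0, "other")], 600.0)
# -> [(0.0, 12.5, "music"), (300.0, 600.0, "other")]
```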
The script used to process the raw data creates new sphere files that are symbolically linked to the original file. The script creates five such audio files - one for each category of interest. Each file is labeled root-category.sph, where category is speech, music, et cetera. When processed by the script, the original transcript generates a number of files that keep track of segments. There are two types of these files: segmentation files and mark files. Segmentation files are denoted by rootdescription.seg. "Description" indicates what types of segments are contained within the file. The segmentation files not only keep track of the final segments for each of the aforementioned five categories, but also keep track of a number of intermediate segments. The mark files are generated from the final set of five segmentation files. They are labeled root_category.sph.mark. The only difference between the mark and segmentation files is the format of each segment. In the segmentation files, each segment is of the type "starttime endtime Label". The mark files, in contrast, are formatted as "Label starttime duration". Each transcription file is parsed through, producing 21 files of data. Each set is associated with one audio file. The full process may be outlined as follows: - The transcript, root.txt, is read in and all "Segment" tags are extracted. - Each tag is translated into a label that contains the speaker name, the start time, and the end time. This information is stored as root-speech.seg. - The transcript is read in again and all "Background" tags are extracted. 36 The background tags are reformatted to contain a type, start time, and an end time. The type is the one of three provided by the LDC - music, speech, and other. The end time was determined as described at the end of Section 4.1.2. Each type of background is saved in a segmentation file: rootmusic.seg, rootback-speech.seg, and rootback-other.seg. None of these three files have any regions of overlap, since, as described, the "Background" tag is nonspanning. Commercials and regions of "Noscore" are extracted from root.txt, processed, and saved as rootcommercial.seg. Commercials maintain an inconsistent set of annotation among the transcripts. All transcripts flag commercials. Problems arise from the fact that, in some cases, commercials are transcribed like any other section, and in other cases they are not. In a pro-active step to remedy this potential problem, all commercials are marked and filtered out. Cross talk is then run against the audio file itself to determine regions of silence. Cross talk is a program written at Lincoln Laboratory that determines regions of low acoustical energy, when compared to an input threshold. The program breaks the audio signal into a number of 20-ms frames every 10 ms. For each frame, the total energy in the signal is computed and compared to an input threshold. The output is a number of ones and zeros - based on whether the energy was lower (0) or higher (1) than the threshold for each frame. A script is run which processes the output of the cross talk program. It translates the frame-by-frame ones and zeros into a segmentation file, where regions of silence are marked, and where silence is defined as those frames 37 with an energy level below the threshold. This file is saved as rootsilence.seg. Next, the regions of overlap are extracted from speech, music, and background-noise. 
- The three main categorical segmentation files, root_speech.seg, root_music.seg, and root_back_other.seg, are read in and the segments are aggregated into one list. As mentioned earlier, root_back_speech.seg is considered speech+noise overlap and is therefore not included in this step.
  o This list is filtered by root_silence.seg, root_commercial.seg, and root_back_speech.seg, where the time regions associated with these segments are removed from the list.
  o The list is then processed segment by segment so that all regions of overlapping acoustical descriptions are removed. For instance, the segments "0 10 speech", "2 3 music", "8 11 music" become "0 2 speech", "3 8 speech", and "10 11 music" (see the sketch following Table 2).
  o All segments with the same category label are extracted and saved in a file. Speech is saved in root_speech_only.seg, music in root_music_only.seg, and noise in root_noise_only.seg. Each of these files contains the start and end times of segments that are purely from that category.
  o The combined file is processed in the opposite direction and all regions of overlap are found. Speech+music overlap is saved in root_speech_music_overlap.seg and speech+noise overlap is saved in root_speech_noise_overlap.seg.
- A mark file is created from each of the five final segmentation files, as well as an associated sphere file, which is linked to the original sphere file. That is, for root_speech_only.seg, root_speech_only.sph and root_speech_only.sph.mark are created.
- All of the mark files generated from root.txt are trimmed so that the total amount of time in each category equals the amount of time in music. There must be roughly equal amounts of training data per model (in seconds) per file being used to train the system, in order to prevent over-training any one category.
  o In each mark file, the segments are sorted by segment duration.
  o The list is split in half at the median-length segment. Then, a segment is selected from above and below the median, and these segments are added to a list. This process of selection continues until the total amount of data in the list first exceeds the amount of music for root.

The final product is 21 files: five final segmentation files, five mark files, five sphere files, and six intermediate files. Figure 3 shows a flowchart of the breakdown of the categories.

[Figure 3: Category Breakdown. This shows the division, recombination, and output of the above outlined processes: the back-speech segments and the speech/back-other overlap form the speech+noise category, the speech/music overlap forms the speech+music category, and the remaining regions form pure speech, pure music, and pure noise.]

Appendix B contains an intuitive graphical representation of what occurs during the segmentation process. Even though there may be some microscopic discrepancies in the graphical version when compared to the script outline, it provides a fair description of the macroscopic steps involved in the processing of the transcript files.

Once the script is run, the total amount of data for each category is summed. Table 2 shows the breakdown by categories of the 104 hours of 1996 Hub4 data.

Type           Time in Sec    Time in Min
Speech         191541.27      3192.35
Music          8288.46        138.14
Speech+Music   14191.45       236.52
Noise          1893.91        31.57
Speech+Noise   60926.34       1015.44
Total          276841.43      4614.02

Table 2: Data Breakdown. Total amount of data in each of the five categories.
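The overlap-removal step above can be made concrete with a short sketch. This is a minimal illustration in Python (an assumed language; the actual Lincoln Laboratory scripts are not shown in this document) of how a combined list of labeled segments can be split into pure regions, reproducing the "0 10 speech", "2 3 music", "8 11 music" example from the outline.

    def pure_regions(segments):
        """segments: list of (start, end, label) tuples that may overlap.
        Returns the sub-regions covered by exactly one label, as (start, end, label)."""
        # Collect every boundary, then inspect each elementary interval between boundaries.
        bounds = sorted({t for s, e, _ in segments for t in (s, e)})
        out = []
        for lo, hi in zip(bounds, bounds[1:]):
            active = [lab for s, e, lab in segments if s < hi and e > lo]
            if len(active) == 1:  # region belongs to exactly one category
                if out and out[-1][2] == active[0] and out[-1][1] == lo:
                    out[-1] = (out[-1][0], hi, active[0])  # merge with the previous region
                else:
                    out.append((lo, hi, active[0]))
        return out

    print(pure_regions([(0, 10, "speech"), (2, 3, "music"), (8, 11, "music")]))
    # -> [(0, 2, 'speech'), (3, 8, 'speech'), (10, 11, 'music')]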
The data processing procedure described above generates segments for five categories of acoustical signal. For two sets of the speaker ID experiments, there are only three categories of interest - speech, music, and speech+music overlap. Although these three-model systems were developed earlier, it is easiest to think of the pre-processing step for them as a simplification of that used for the five-model system. The process is almost identical, with the exception that root_back_speech.seg and root_back_other.seg are combined into root_back.seg in the three-model system. Then, in the following steps, only speech and music are combined to produce the overlap and non-overlap segments. Finally, the filtering that is done on the combined speech and music list is against the segments from root_back.seg, in addition to the previously used root_silence.seg and root_commercial.seg, for one set of models, and against only the latter two for the other set.

4.2 Models

The experiments that were run in this section fall into three evolving sets of models, presented with the final version last. The first is a three-model system that attempts to detect "speech", "speech+music", and "music". The second is an extension of the original system, where background is filtered from the audio used to create the models. The third is the five-model system that detects "speech", "music", "noise", "speech+noise", and "speech+music". Each section that follows will provide the major characteristics of the experiment and then provide the results.

In all of these experiments, the breakdown was as follows. The models were trained on all of the episodes of two shows and tested against all eleven shows. This was done for two reasons. The first addresses the nature of music in broadcasts. Music, as mentioned before, is played as a transition between stories in a broadcast. Most notably, music marks the introduction of the episode itself. This music is the same within shows and is usually an identifying characteristic of the show. Therefore, if models were trained on all the shows, the possibility of over-training the models may arise. This means that instead of detecting music, which is what is wanted, the models would detect the theme songs of the eleven broadcasts. Also, if training occurred on only one show, the model might fall short due to a shortage of training data.

4.2.1 Three-Model System

For this experiment, the data was pre-processed with three models and no background filtering in mind. This experiment uses the simplest of the models and is meant to be a first-pass test. Only the segments marked as speech and music are collected and processed according to the description provided in Section 3.1.3. The two types were processed for overlapping and exclusive regions while being filtered against the commercials and silence. Again, as mentioned above, background is ignored.

Table 3 shows the results of the first experiment. One of the scoring methods that the GMM-UBM system provides, in addition to the frame-by-frame dump_scr option, is a segmentation-file-based output. When passed a file of segments, the GMM-UBM system scores the audio denoted by all of the given segments and provides a single score for each model. This option made sure that the created models had potential before further processing. All of the audio files were tested with each of their three respective speaker segmentation files.
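As an illustration of this segment-list scoring option, the following Python sketch (with hypothetical names; the GMM-UBM system's actual interface is not shown in this document) averages per-frame model log-likelihoods over a list of segments to produce one score per model.

    import numpy as np

    def score_segment_list(frame_scores, segments, frame_rate=100):
        """frame_scores: dict mapping model name -> 1-D array of per-frame log-likelihoods
        (one value per 10-ms frame, i.e. frame_rate = 100 frames per second).
        segments: list of (start_sec, end_sec) tuples to be scored.
        Returns a single averaged log-likelihood per model."""
        scores = {}
        for model, ll in frame_scores.items():
            picked = [ll[int(start * frame_rate):int(end * frame_rate)]
                      for start, end in segments]
            scores[model] = float(np.concatenate(picked).mean())
        return scores

    # The hypothesized category is then simply the best-scoring model:
    # best = max(scores, key=scores.get)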
truth\hypothesis    Speech             Speech+Music       Music
                    %       # Shows    %       # Shows    %       # Shows
Speech              100.0%  172        0.0%    0          0.0%    0
Speech+Music        0.0%    0          100.0%  172        0.0%    0
Music               0.0%    0          0.0%    0          100.0%  172

Table 3: Three Model Show Classification. This shows the results of the three model classification on the episode level.

Looking at Table 3, one can see that the results are as desired. All of the shows were categorized as expected. This leads to the next step: labeling segments. As mentioned earlier, this step involves the post-processing of the dump_scr output with a list of segments. The list used for this experiment is just the aggregation of all of the category segments for all of the episodes. The log-likelihoods for each model are averaged across the duration of the given segment, and the maximum score determines the hypothesized speaker. Table 4 shows the results of this process. There are a total of 19962 segments scored against.

truth\hypothesis    Speech            Speech+Music      Music
                    %      # Segs     %      # Segs     %      # Segs
Speech              95.6%  16284      3.9%   665        0.5%   84
Speech+Music        44.7%  815        47.6%  869        7.7%   141
Music               3.5%   39         4.4%   48         92.1%  1017

Table 4: Three Model Labeling. This shows the results of the three model labeling. Results show the number of segments attributed to each class and the percentage correct.

The final step is the segmentation of the dump_scr files. A 201-frame (two-second) averaging window removed spikes and smoothed the data. Then, on a frame-by-frame basis, each frame was labeled with the category producing the maximum log-likelihood score. Finally, all neighboring frames with the same label were collapsed together. A total of 278036.4 seconds were scored against. The results are shown in Table 5.

truth\hypothesis    Speech               Speech+Music        Music               Total
                    %      # Secs        %      # Secs       %      # Secs
Speech              96.5%  246578.30     2.9%   7453.66      0.6%   1489.21      255521.10
Speech+Music        48.0%  6881.02       42.3%  6064.40      9.7%   1394.48      14339.90
Music               3.7%   303.97        6.7%   545.70       89.6%  7325.75      8175.42

Table 5: Three Model Segmenting. This shows the results of the three model segmentation. Results are in the number of seconds attributed to each category.

Given the fact that the speech and music models were trained on only two shows, these results support the claim that music has distinct and unique identifying characteristics. They also show that a speaker ID setup can successfully flag these regions. There is, overall, an approximately 90% accuracy rate for classifying pure music. Since the purpose of these models is to flag regions of speech, the classification of speech as music is not desirable. Therefore, the less than 0.5% error of falsely identifying speech as music is promising.

The only problem that arises is in speech and music overlap. The results from the non-filtered three-model system are not good for the classification of speech+music. The quality of speech+music for nearly all cases is loud, foreground speech over soft, background music. If the speech model was adulterated with something that had dynamics similar to this, then these segments could pull speech+music towards speech, thereby leading to the misclassifications seen. Even though music and noise are not similar per se, they are alike in that they are not speech. Therefore, although there is no speech+music in the speech training data, that data could be contaminated with speech+noise, which has similar dynamics. This leads to the next set of models.
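The frame-level post-processing just described (smooth, take the per-frame maximum, collapse runs) can be sketched as follows. This is an illustrative Python version, assuming 10-ms frames; the actual post-processing scripts are not reproduced in this document.

    import numpy as np

    def frames_to_segments(frame_scores, labels, win=201, frame_sec=0.01):
        """frame_scores: (num_models, num_frames) array of per-frame log-likelihoods.
        labels: list of model names, one per row. Returns (start, end, label) segments."""
        num_models, num_frames = frame_scores.shape
        kernel = np.ones(win) / win
        # Smooth each model's score stream with a moving-average window (201 frames ~ 2 s).
        smoothed = np.vstack([np.convolve(row, kernel, mode="same") for row in frame_scores])
        best = smoothed.argmax(axis=0)  # winning model index per frame
        segments = []
        start = 0
        for i in range(1, num_frames + 1):
            if i == num_frames or best[i] != best[start]:
                # Collapse the run of identically labeled frames into one segment.
                segments.append((start * frame_sec, i * frame_sec, labels[best[start]]))
                start = i
        return segments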
4.2.2 Three-Model Filtered System

As mentioned earlier, the three-model system summarized above did not take background into account. The logical next step is to filter out these segments and see if any changes occur. The marked background segments are filtered out of the speech and music segments. After training, the models should be more tuned to their respective categories, since the training data is more pure.

The same two types of results as above are shown again below. For the first set of labeling and segmenting results, Table 6 and Table 7 respectively, the results were obtained by comparing the post-processing outputs to the truth determined from the above filtered segmentation files. As seen from the tables, there are 24521 segments being labeled and 275814.2 seconds being segmented. The second set, Table 8 and Table 9, is obtained by comparing the dump_scr output to the truth determined by the segmentation files created in Section 4.2.1. These are the non-filtered segments for speech, music, and their overlap. There are 19962 segments labeled and 273602.5 seconds segmented.

truth\hypothesis    Speech            Speech+Music      Music             Total
                    %      # Segs     %      # Segs     %      # Segs
Background          71.9%  3452       19.0%  914        9.1%   435        4801
Speech              94.8%  15925      4.6%   770        0.6%   99         16794
Speech+Music        32.1%  609        60.5%  1147       7.4%   140        1896
Music               2.1%   22         4.0%   41         93.9%  967        1030

Table 6: Filtered Three Model Labeling Scored Against Filtered Segments. This shows the results of the three model labeling using the models created with the filtered segmentation files. Results show the number of segments attributed to each class and the percentage correct. These are scored against the filtered segmentation files.

truth\hypothesis    Speech              Speech+Music        Music              Total
                    %      # Secs       %      # Secs       %      # Secs
Background          77.3%  48062.52     19.4%  12086.49     3.3%   2024.55     62173.56
Speech              96.6%  184998.80    3.2%   6061.51      0.2%   402.44      191462.80
Speech+Music        36.1%  5127.89      54.3%  7699.38      9.6%   1365.76     14193.03
Music               1.7%   132.15       7.6%   610.08       90.7%  7242.62     7984.85

Table 7: Filtered Three Model Segmenting Scored Against Filtered Segments. This shows the results of the three model segmentation using the models created with the filtered segmentation files. Results are in the number of seconds attributed to each category. These are scored against the filtered segmentation files.

truth\hypothesis    Speech            Speech+Music      Music             Total
                    %      # Segs     %      # Segs     %      # Segs
Speech              90.8%  15459      8.6%   1469       0.6%   105        17033
Speech+Music        31.5%  574        61.3%  1119       7.2%   132        1825
Music               2.9%   32         4.8%   53         92.3%  1019       1104

Table 8: Filtered Three Model Labeling Scored Against Unfiltered Segments. This shows the results of the three model labeling using the models created with the filtered segmentation files. Results show the number of segments attributed to each class and the percentage correct. These are scored against the non-filtered segmentation files.

truth\hypothesis    Speech              Speech+Music        Music              Total
                    %      # Secs       %      # Secs       %      # Secs
Speech              92.4%  232402.70    7.0%   17579.32     0.6%   1435.63     251417.70
Speech+Music        36.1%  5127.89      54.3%  7698.66      9.6%   1364.27     14190.82
Music               1.7%   132.15       7.7%   618.63       90.6%  7243.27     7994.05

Table 9: Filtered Three Model Segmenting Scored Against Unfiltered Segments. This shows the results of the three model segmentation using the models created with the filtered segmentation files. Results are in the number of seconds attributed to each category. These are scored against the non-filtered segmentation files.

The changes in times and numbers of segments are what is expected. First, the number of seconds segmented is smaller, since sections that were previously called speech are being extracted as background.
Also, the number of segments has increased, since the previous segments of speech are broken up into two or more segments when the background is removed.

The results from the first set of experiments confirm what is expected: background does seem to be throwing off the results. Speech+music classification improves by more than 10% when the models are created with background removed. The reason for this improvement is an improved speech model. As stated earlier, no background appears with music. With the removal of background from the segments, speech+noise is removed from the speech training data, and therefore the model represents more pure speech. One might be led to believe that the changed results are merely the result of the changed truth marks. This is tested by scoring the filtered models against the unfiltered truth; the same 10% improvement remains. Unfortunately, this improvement comes at the cost of missed speech, since speech+noise is filtered out of the models. This is not desirable. Therefore, the natural extension of the process is to create models for speech+noise and noise, test with them, and see how the results change.

4.2.3 Five-Model System

The last set of experiments deals with the five-model system. These five models are the final set, and are used in the segment labeler. The five models of speech, music, noise, speech+music, and speech+noise are created and scored against. The results are shown in Tables 10 and 11. The same two methods described above are applied, except expanded to five models. There are 24813 segments labeled and 229566.2 seconds segmented. The music and speech+music models are the same as those in the two three-model systems. The same speech model as in the filtered three-model system (different from the original three-model system) is used.

The results that are obtained are less than ideal. Speech+noise and speech have misclassifications, as do noise and music. Speech+music is misclassified as speech+noise (which further substantiates what was said earlier regarding the speech and speech+music confusion), and so is noise as speech+noise. What follows are some potential explanations for these inaccuracies.

Noise adds problems because of its non-uniformity. Noise/background does not have any characteristics that make it identifiable, except for the fact that it is not speech or music. One of the biggest problems is also the fact that noise volume varies significantly. It can be quite loud at times and barely audible at others. This variability helps explain the problems with speech and speech+noise. The speech+noise model is based on data that has this "quiet" noise. If the noise can barely be heard, then the segment is speech, for all practical purposes. On the other hand, speech has all types of background noises that are not marked as noise in the transcript. This means that there is a common type of audio segment modeled by both models. This, of course, leads to erroneous classifications.

The music and noise confusion may be due to the fact that some of what is called noise has characteristics similar to some of the music. This similarity could be based on similar frequency distributions or acoustical dynamics. Regardless of the reason for the similarity, it also explains the misclassification of speech+music as speech+noise. The errors between noise and speech+noise can be explained by the same reasoning that explains the music and speech+music overlap. These are simply problems with the distributions of the training data.
Aside from these confusions, the models are accurate for all of the classes.

truth\hypothesis   Speech           Speech+Music     Speech+Noise     Noise            Music            Total
                   %      # Segs    %      # Segs    %      # Segs    %      # Segs    %      # Segs
Speech             62.9%  9438      1.8%   271       31.8%  4771      3.2%   484       0.3%   44        15008
Speech+Music       14.8%  266       41.7%  749       34.9%  627       1.8%   32        6.8%   123       1797
Speech+Noise       23.1%  1396      6.1%   368       62.9%  3808      6.3%   381       1.6%   97        6050
Noise              6.6%   59        6.0%   54        16.3%  146       60.3%  540       10.8%  97        896
Music              1.2%   13        4.6%   49        3.1%   33        9.8%   104       81.3%  863       1062

Table 10: Five Model Labeling. This shows the results of the five model labeling. Results show the number of segments attributed to each class and the percentage correct.

truth\hypothesis   Speech              Speech+Music       Speech+Noise       Noise             Music             Total
                   %      # Secs       %      # Secs      %      # Secs      %      # Secs     %      # Secs
Speech             24.9%  38516.75     24.0%  37036.87    23.8%  36732.30    10.1%  15573.47   17.3%  26726.26   154585.70
Speech+Music       13.4%  1594.18      18.6%  2208.07     23.6%  2807.92     28.1%  3341.80    16.3%  1933.20    11885.17
Speech+Noise       17.5%  9580.38      21.3%  11680.61    38.7%  21184.61    9.3%   5077.39    13.2%  7207.99    54730.98
Noise              6.7%   106.16       51.4%  813.07      21.8%  344.93      7.2%   113.16     12.9%  204.44     1581.76
Music              10.3%  699.57       14.4%  974.95      12.7%  859.34      11.6%  788.56     51.0%  3460.25    6782.67

Table 11: Five Model Segmenting. This shows the results of the five model segmentation. Results are in the number of seconds attributed to each category.

CHAPTER 5
Experiments, Results, and Discussion

This chapter discusses the experiments run with the unsupervised segmentation system. First, a short description of the training and testing data used is given. Next, the measures of performance that appear in the result tables are described. Finally, the experiments and results are presented for each tested system component.

5.1 Data

There are two sets of data that are used. One set is the Hub4 1996 data from the LDC. This is described in detail in Chapter 4. These data files consist of audio files and corresponding transcript files. The transcript files were processed in order to create training data for the creation of the category GMM models. The second set of data is from the dry run experiments of the RT-03 speaker evaluation (a data corpus provided by NIST for their annual speaker evaluations). It contains a set of six 10-minute audio files. Also present in the data set is a list of reference segmentations. These are the six files that are each passed to the unsupervised segmentation system. After the system is run, the output segments are collected and compared to the provided reference segmentations.

5.2 Metrics

The output segmentation files are scored against the provided truth using a scoring script provided by the National Institute of Standards and Technology's (NIST's) Speech Group. When passed the truth and hypothesized segmentation files, the script outputs a score report for the hypothesized file. The numbers within the report include missed speech, false alarm speech, missed speaker time, false alarm speaker time, and speaker error time. The significance of each number is as follows. In general, missed speech is those regions that are marked as speech in the truth segmentation but are not marked as speech in the test segmentation. False alarm speech is the opposite: marked as speech in the test, but not in the truth. Each of these times in seconds is then normalized in order to produce the percentages seen in the tables. If these times are normalized by the total amount of scored audio, the results are "missed speech" and "falarm speech." If the normalization factor is the total amount of speech in the audio, then the results are "missed speaker time" and "falarm speaker time." The scoring software then finds the best mapping of hypothesized speakers to real speakers for each audio file. Once this mapping occurs, the amount of time in seconds misclassified is normalized by the total amount of speech to produce the "speaker error time." The three errors of missed speaker time, falarm speaker time, and speaker error time are summed to produce the total error.
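To make these error definitions concrete, here is a minimal Python sketch of the missed-speech and false-alarm computations under the two normalizations just described. It is only an illustration: the actual NIST scoring script is not reproduced here, and the speaker-error term (which requires the speaker-mapping step) is omitted. All names are hypothetical.

    def overlap(a, b):
        """Total time shared between two lists of (start, end) intervals, in seconds."""
        return sum(max(0.0, min(e1, e2) - max(s1, s2)) for s1, e1 in a for s2, e2 in b)

    def speech_detection_errors(truth_speech, hyp_speech, total_scored):
        """truth_speech, hyp_speech: non-overlapping (start, end) speech intervals.
        total_scored: total amount of scored audio in seconds."""
        truth_time = sum(e - s for s, e in truth_speech)
        hyp_time = sum(e - s for s, e in hyp_speech)
        agreed = overlap(truth_speech, hyp_speech)
        missed = truth_time - agreed   # speech in the truth but not in the hypothesis
        falarm = hyp_time - agreed     # speech in the hypothesis but not in the truth
        return {
            "missed speech (%)": 100.0 * missed / total_scored,
            "falarm speech (%)": 100.0 * falarm / total_scored,
            "missed speaker time (%)": 100.0 * missed / truth_time,
            "falarm speaker time (%)": 100.0 * falarm / truth_time,
        }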
5.3 Acoustic Change Detector

The acoustic change detector was not tested in isolation. Rather, the effects of changing various detector parameters on the clustering process are noted. None of these results contain re-segmentation. Two parameters were perturbed in isolation. The first, scd_max, dictates the maximum duration, in frames, of a created segment. Since clustering can only combine segments, the segmentation process can break down if the initial segments are too long in duration. Therefore, capping the maximum length of segments avoids this problem. The "baseline" column in Table 12 measures the performance of the initial set-up. In this case, scd_max is set to 3500 frames. Since one frame equals 10 ms, this means that no created segment can be longer than 35 seconds. Other values for scd_max are 1500 frames, 4500 frames, and 6000 frames.

                         baseline   scd_max = 1500   scd_max = 4500   scd_max = 6000
                         (%)        (%)              (%)              (%)
Missed Speech            0          0.1              0                3.1
Falarm Speech            4.8        4.8              4.8              4.7
Missed Speaker Time      0.4        0.4              0.4              6.1
Falarm Speaker Time      8.9        9.0              8.9              8.8
Speaker Error Time       10.6       11.2             10.0             9.8
Total Error              19.86      20.67            19.27            24.68

Table 12: Change Detection Maximum Duration. This table shows the results of changing the maximum length of a segment produced by the acoustic change detector.

The second parameter tested was the weighting of the BIC penalty (λ in Section 2.1.1). Although the strict definition of BIC states that λ = 1, altering this factor changes the potential for segmenting. The baseline result of λ = 1 is shown in Table 12. The first experiment set λ = 0.8; the results are shown in Table 13, column 1. Next, λ was set to 0.9, 1.1, 1.2, and 1.3. Table 13 shows the respective results.

                         λ = 0.8   λ = 0.9   λ = 1.1   λ = 1.2   λ = 1.3
                         (%)       (%)       (%)       (%)       (%)
Missed Speech            8.4       0.1       3.1       0.0       0.0
Falarm Speech            3.7       4.8       4.5       4.9       5.0
Missed Speaker Time      16.1      0.4       6.1       0.4       0.4
Falarm Speaker Time      7.0       9.0       8.4       9.2       9.4
Speaker Error Time       7.7       11.2      8.9       11.3      13.2
Total Error              30.8      20.61     23.47     20.87     23.05

Table 13: Change Detection BIC Weighting. This table shows the results of changing the penalty weight in the BIC penalty.

Table 12 shows that clustering performance is dependent upon the maximum duration of segments. As expected, if the initial segments become too long, performance drops. What was not expected was the drop in performance when the maximum duration was limited to 15 seconds and the increase in performance when the duration was increased to 45 seconds. Theoretically, the clustering system should be able to rejoin any of the segments that are broken up by a length restriction. Therefore, using more segments should not lower performance. Unfortunately, this does not occur. Most likely due to the increase in the number of segments, the clustering process did not succeed in recombining adjacent segments.

The alteration of the BIC penalty weight had various unexpected results. As λ decreased, the errors increased. By decreasing the BIC penalty weight, the detector placed less weight on complexity, thereby increasing the likelihood of a change. This meant that there were more segments of shorter duration. This is consistent with the decrease in maximum segment length, although the resulting placement of changes in the two processes was most likely different. As λ increased, the error increased for each value when compared to the baseline, though not monotonically from one value to the next. Given the above reasoning, it can be concluded that the clusterer did not handle fewer, longer segments well.
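For reference, the following is a small sketch of a penalized BIC change test in the commonly used full-covariance Gaussian form of Chen and Gopalakrishnan [5]. It is assumed here, rather than confirmed, that Section 2.1.1 follows this formulation; the function and variable names are illustrative only.

    import numpy as np

    def delta_bic(window, i, lam=1.0):
        """window: (N, d) array of feature frames; i: candidate change frame (0 < i < N).
        Returns the penalized BIC difference; a value > 0 favors a change at frame i."""
        N, d = window.shape
        x, y = window[:i], window[i:]
        def logdet(z):
            return np.linalg.slogdet(np.cov(z, rowvar=False, bias=True))[1]
        # Log-likelihood gain of modeling the two halves with separate full-covariance Gaussians.
        gain = 0.5 * (N * logdet(window) - i * logdet(x) - (N - i) * logdet(y))
        # Model-complexity penalty, scaled by the weight lambda.
        penalty = lam * 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
        return gain - penalty

    # A change point is hypothesized where the maximum of delta_bic over i exceeds zero.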
5.4 Segment Labeler

The experiments of the labeler can be broken into two groups. The first set of experiments is concerned with the accuracy of the models used to perform the labeling. The second set shows the value of adding various levels of post-label filtering to the diarization process.

5.4.1 Model Performance

The description and performance of these models are shown in Section 4.2.3. The data used to test this aspect of the labeler are different from that which was used to test the rest of the components. Although the tables in Section 4.2.3 show the results by category, it is more meaningful to see the performance of the models based on the speech versus non-speech distinction. Since this is the reason why the labeler is used, it makes sense to highlight this aspect of performance. Using the results of Table 10, Table 14 shows the results of speech and non-speech classification. Speech, in this case, is represented by any of the three speech-containing categories from above: speech, speech+noise, and speech+music. Non-speech refers to noise and music.

truth\hypothesis   Speech            Non-speech        Total
                   %      # Segs     %      # Segs
Speech             94.9%  21694      5.1%   1161       22855
Non-Speech         18.1%  354        81.9%  1604       1958

Table 14: Five Model - Speech vs. Non-speech. This table shows the results of speech and non-speech classification, using the five model labeling results of Table 10. Results show the number of segments attributed to each.

As seen in Table 14, the results are satisfactory. There is a 5% misclassification of speech as non-speech. There is also the problem of roughly 20% of the non-speech being missed.

5.4.2 Filter Level

The second set of labeler experiments considered the level of filtering that should be performed after the segments are labeled. All of these tests were run with the baseline parameters for clustering, as described in the following section. The first step taken was to run the minimal blind segmentation system on the data to see what the performance was like. The minimal system entails the initial segmentation and the clustering of those segments. The results of this process are shown in Table 15, in the column "no filter." This label is applied because the segments are not filtered prior to clustering. As mentioned earlier, one of the major reasons for which the labeler was developed is that it removes areas of non-speech, lowering the errors. Since there are two types of non-speech, the first step taken was to remove all of the segments that were labeled as music or noise, as described in Section 2.3. The results are shown in the column labeled "pure music and noise filter". The next step was to remove all of the silence from the above filtered segments. The process for doing this is also described in Section 2.3. The results for this process are shown in the column labeled "all non-speech filter." Finally, all of the initial segments that are not considered pure speech were removed.
This was done by changing the post-labeling step to reject all of the segments that were not labeled pure speech. The results are in the "all music and noise filter" column.

                         no filter   pure music and      all non-speech   all music and
                         (%)         noise filter (%)    filter (%)       noise filter (%)
Missed Speech            0           0                   1.6              21.1
Falarm Speech            6.2         4.8                 3.3              1.6
Missed Speaker Time      0.4         0.4                 3.3              39.8
Falarm Speaker Time      11.7        8.9                 6.2              2.9
Speaker Error Time       11.7        10.6                9.3              7.3
Total Error              23.68       19.86               18.73            49.97

Table 15: Segment Labeler Filter. This table shows the results of various types of filtering.

The results from this stage are mostly as expected, but with some surprises. The removal of all non-speech provided the best results, dropping the total error by nearly 5%, with the removal of noise and music coming in second, dropping the error by nearly 4%. There is an increase in the amount of missed speech when the filtering includes silence. This occurs because much of the silence occurs within a speaker's speech: whenever the speaker pauses to take a breath, think, et cetera, silence is created. This silence is not represented in the truth marks, and therefore, when these times of silence are not included, it shows up as missed speech. When only pure speech was kept, the models performed exceptionally well. Unfortunately, all of the overlapped speech was neglected, and therefore the missed speech was tremendously high. The surprise comes in the area of missed speech for the pure music and noise filter and, as a byproduct, the non-speech filter. If the results in Table 14 are an estimate of performance, a roughly 5% missed speech rate should appear for both experiments. This does not happen. Rather, there is no missed speech.

5.5 Clusterer

Various parameters dealing with the clusterer were altered in isolation, and their effects were recorded in Table 16 and Table 17. The baseline results (the same as in Table 12) are reported again in Table 16. The first experiment used a RASTA channel normalization process prior to clustering. Channel normalization attempts to remove background variations in the signal. The results are shown in the column "RASTA" in Table 16. The final two columns in Table 16 report the next set of experiments. The mixorder denotes the order of the mixture model used to help resolve the speakers. The baseline value of this is set to 128 mixtures. The experiments tested 256 and 64 mixtures.

                         RASTA   baseline   mixorder = 256   mixorder = 64
                         (%)     (%)        (%)              (%)
Missed Speech            0       0          0                0
Falarm Speech            4.8     4.8        4.8              4.8
Missed Speaker Time      0.4     0.4        0.4              0.4
Falarm Speaker Time      8.9     8.9        8.9              8.9
Speaker Error Time       12.2    10.6       12.6             17.5
Total Error              21.5    19.86      21.86            26.76

Table 16: Clusterer Parameters 1. This table shows the results of various parameter changes.

Table 17 shows the effects of altering some of the feature processing parameters. In the audio files, there is a mix of two types of speech: wideband (8 kHz) and narrowband (4 kHz) speech. These experiments look at the effect of bandlimiting the data to either 8 kHz or 4 kHz. The second aspect of these tests is whether a linear or mel-scale filterbank (described in Section 3.3.1) should be used. The baseline is a 4 kHz cutoff using a linear filterbank. The experiments tested the other three combinations: 4 kHz mel-scale, 8 kHz linear, and 8 kHz mel-scale.
                         4 kHz linear     4 kHz mel-scale   8 kHz linear   8 kHz mel-scale
                         (baseline) (%)   (%)               (%)            (%)
Missed Speech            0                0.0               0.0            0.0
Falarm Speech            4.8              4.8               4.8            4.8
Missed Speaker Time      0.4              0.4               0.4            0.4
Falarm Speaker Time      8.9              8.9               8.9            8.9
Speaker Error Time       10.6             14.2              14.4           14.2
Total Error              19.86            23.55             23.66          23.53

Table 17: Clusterer Parameters 2. This table shows the results of various parameter changes.

None of the parameter changes showed an improvement over the initial set-up. By attempting a RASTA channel normalization, the clusterer loses some of the information about ambient conditions that helps clustering. From the results, using 256 mixtures to model is too much and using 64 mixtures is too little. Finally, the use of the lowest common bandwidth, 4 kHz, with a linear filterbank performs the best.

5.6 Resegmenter

The final set of experiments dealt with the re-segmentation process described in Section 2.4. The output file from the run containing the original parameter values with all non-speech filtered out was used. This file was passed to the re-segmentation script, and the results are shown in the "1st reseg" column of Table 18. The output associated with the first pass is sent to the re-segmentation script to generate a new output list. The scores from that process are shown in the "2nd reseg" column. The process continues until the fifth pass through the re-segmentation script. The results are shown below.

                         base (non-speech   1st reseg   2nd reseg   3rd reseg   4th reseg   5th reseg
                         filtered) (%)      (%)         (%)         (%)         (%)         (%)
Missed Speech            1.6                0.1         0.1         0.1         0.1         0.1
Falarm Speech            3.3                4.8         4.8         4.8         4.8         4.8
Missed Speaker Time      3.3                0.5         0.5         0.5         0.5         0.5
Falarm Speaker Time      6.2                9.0         9.0         9.0         9.0         9.0
Speaker Error Time       9.3                9.4         9.5         9.4         9.4         9.4
Total Error              18.73              18.98       19.03       19          18.99       19.01

Table 18: Passes Through the Resegmenter. This table shows the results of resegmenting the original output of the blind segmentation system.

The results of the re-segmentation processes are somewhat different than expected. The resegmenter did not decrease the total error of the segmentations. This could have occurred for one of three reasons. One possibility is that there were no errors in the acoustic change detection and clustering processes. The second is that the re-segmentation process did not work as expected. The third is that the speaker models are tested on the same data used to train them. The first cannot be ruled out just because the system performance suggests errors in the aforementioned components. Although the output from the clusterer did have errors when compared to the truth, these errors probably have no correlation to the accuracy of the acoustic matches. The problem lies in the fact that the truth is segmented by individual speakers, whereas the hypothesis is segmented by acoustic speakers. Those segments that have completely different background conditions with the same foreground speaker are marked as the same speaker in the truth, but as different speakers in the hypothesis. Unfortunately, the resegmenter did not compensate for this discrepancy. The first reason may therefore explain the results, but in all likelihood it is the third. The lack of significant change is most likely due to the fact that the testing of the speaker models occurs on the same data that was used to train them. Since all of the feature vectors that are presented in testing are already in one of the speaker models, it becomes nearly impossible to reassociate speakers and segments.
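The iterative use of the resegmenter described above can be summarized by a small driver loop. The sketch below is illustrative Python with hypothetical function names (train_speaker_models, resegment); the actual MITLL scripts and the GMM-UBM interfaces are not reproduced in this document.

    def iterate_resegmentation(audio_file, initial_segmentation, passes=5,
                               train_speaker_models=None, resegment=None):
        """Repeatedly feed the resegmenter its own output, as in Table 18.
        train_speaker_models: builds one speaker model per label from the current segments.
        resegment: scores the audio with those models and returns a new segmentation."""
        segmentation = initial_segmentation
        history = [segmentation]
        for _ in range(passes):
            models = train_speaker_models(audio_file, segmentation)
            segmentation = resegment(audio_file, models)
            history.append(segmentation)
        return history  # one segmentation per pass, ready to be scored against the truth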
CHAPTER 6
Conclusion

6.1 System Overview

The MITLL unsupervised segmentation system used to perform the diarization task has four main components: the acoustic change detector, the segment labeler, the clusterer, and the resegmenter. Each performs a unique function in the diarization task. The change detector provides the initial list of homogeneous segments used. The labeler removes all of the silence regions; it then uses a speaker ID system to label all of the segments and, finally, it removes the non-speech ones from the list. Next, the clusterer groups the remaining segments together to provide a list of segments and speaker tags. Finally, the resegmenter smoothes the results by using a speaker ID system to train speaker models for each speaker and identify regions where the speakers are talking. This creates another segmentation file that can be fed into the resegmenter as many times as is desired.

The speaker ID system is called the GMM-UBM Speaker ID System. It performs two main functions: training models and testing audio files with models. For the training phase, the system needs a list of speakers and corresponding audio segments to be modeled. It produces a Gaussian mixture model for each speaker. These models can then be used in the testing phase, where an audio file is scored with each model on a frame-by-frame basis. One of two scripts is then used to process these frame-by-frame scores, producing either segment labels or a segmentation file.

In the case of the segment labeler, five models are created beforehand: one for each category of acoustic signal - pure speech, pure noise, pure music, speech and music overlap, and speech and noise overlap. These five category models are developed from the training data. When the labeler is run, these models are passed on to the speaker ID testing phase to develop the labels. The re-segmentation task, on the other hand, uses both aspects of the speaker ID system to first create speaker models and then use those speaker models to produce segments.

6.2 Performance Summary

At first glance, the results seem to show an adequate, but lacking, unsupervised segmentation system. The 20% error rate is much higher than is desired. However, this error rate is somewhat misleading. It measures the deviation of the hypothesized segmentation from the truth segmentation. As mentioned earlier, the hypothesized marks are created with the goal of homogeneity in mind. The truth, on the other hand, has no such standard. Therefore, higher error rates are inevitable, and very difficult to reduce, given the fact that there are so many acoustic variations (see Section 1.2.1). It also leads to the question: is matching the truth exactly what is desired?

The truth marks state that a speaker, Bill Clinton for instance, has spoken at specific times. Unfortunately, they say nothing about the quality of the audio signal contained within those times. On the other hand, the hypothesized speaker marks say that speaker A, matched to Bill Clinton in the scoring, has spoken at specific times. When what is marked as speaker A is compared to what is marked as Bill Clinton, it may be that speaker A is actually Bill Clinton with no background noises. It may also be that speaker B and speaker C in the hypothesized segments are actually Bill Clinton over music and Bill Clinton over a roaring crowd, respectively, but, in scoring, these speakers are not recognized as Bill Clinton.
Currently, there is no way of telling whether this is the case, but given the design of the diarization process, it is quite likely. It is, therefore, critical to keep in mind the specific purpose and context of the diarization process. If the objective is very speaker-identity specific, then there is much room for improvement. On the other hand, if the desired result is homogeneous speakers, then the system in place may be very accurate. Unfortunately, there is no method of testing the accuracy of this aspect.

6.3 Future Experiments

Future experiments will be broken up into four parts, one for each component of the system. In terms of the acoustic change detector, the means of penalizing model complexity may present opportunities to improve results. Currently, the prevalent BIC method is used. Other possible methods include, but are not limited to, the Akaike Information Criterion (AIC), Hurvich and Tsai's corrected AIC, and the minimum description length criterion (MDL). On the other hand, the use of statistics for change detection may not produce the best results.

The segment labeler can be improved in two ways. The first involves refining the models and the modeling process. The second involves using the log-likelihood scores for the models in a more meaningful way. In terms of the modeling, one of the first steps would be creating more models. The transcript annotation description explains how "Fidelity" and "Level" attributes are maintained to qualify the quality of the audio signal. For the five-model system, these attributes are ignored. If they are not ignored, and are instead used to create more categories, such as high-fidelity speech, low-fidelity speech, et cetera, many of the classification errors that occurred may be eliminated. This would mean that more speech was retained, more non-speech was dropped, and the confusion between speech and overlap regions could be eliminated. Along these lines, the other area of exploration would be developing a more elegant technique by which to use the log-likelihood scores produced by these models to determine the speaker. Currently, the model with the highest log-likelihood is declared the speaker. It might be possible to use various combinations of these scores and threshold them in order to determine speakers. For instance, the noise model score may be subtracted from both the music and speech scores. Then, if either score is greater than a given threshold, the segment is labeled as that category. If both scores are greater than the threshold, it is labeled as overlap; if neither is, it is labeled as noise.
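A minimal sketch of this proposed score-thresholding rule for the labeler follows (Python, illustrative only; the threshold value and function name are hypothetical and would have to be tuned experimentally).

    def threshold_label(speech_ll, music_ll, noise_ll, threshold=0.0):
        """Label a segment from averaged model log-likelihoods using the proposed rule:
        subtract the noise score from the speech and music scores, then threshold."""
        speech_rel = speech_ll - noise_ll
        music_rel = music_ll - noise_ll
        if speech_rel > threshold and music_rel > threshold:
            return "speech+music"   # both exceed the threshold -> overlap
        if speech_rel > threshold:
            return "speech"
        if music_rel > threshold:
            return "music"
        return "noise"              # neither exceeds the threshold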
The clusterer could profit from the use of a different clustering algorithm. Currently a bottom-up algorithm is used. This means the process begins with many small segments and, as the process proceeds, similar segments are combined. One option would be to change to a top-down algorithm, which starts with one large segment and breaks it up into smaller segments. If this proves to be less effective, then there are also some changes to the bottom-up algorithm that can be attempted. For instance, different methods of determining distances may be tried. Another set of experiments involving the clusterer could involve improving the RASTA channel normalization that was used in one set of experiments. This could have advantages in improving the accuracy of speaker-based evaluation (as opposed to homogeneity-based evaluation). Since background noises are normalized, the emphasis can be shifted away from matching acoustic conditions to matching speaker voices.

Finally, experiments with the resegmenter should involve developing a way to ensure that the data tested does not also show up in the training data. If it is assumed that the regions of overlap are where the errors happen and that the regions of pure speech are properly clustered, then one possible solution would be to train only on pure speech and test only on overlap. Unfortunately, this may prove to be difficult, given that it would entail the ability to detect speakers over music and noise, which has not been demonstrated. Another solution may be to use some of the previously mentioned techniques in the resegmenter. For instance, if all of the segments for a given speaker are combined to form one large segment, then a top-down clustering algorithm could be used to break it down. Then, certain parts of the resulting segments could be used to train models, and the remainder tested on.

APPENDIX A
Transcript Description

The following is a verbatim (minus some formatting) excerpt from a document provided by the LDC as an ancillary document to the corpus data. (The data corpus is available at the LDC website previously listed.) It explains the tags and formatting of the transcript files that are parsed in order to create the data files that are necessary for many of the steps in the experimental process.

A.1 The SGML tags and their attributes

Episode - <Episode> is a spanning tag, terminated by </Episode>. It spans all of the annotation and transcription information associated with a particular episode, and it may contain <Section>, <Background> and <Comment> tags within its span. The attributes associated with each Episode are:
Filename: The name of the file containing the episode's audio signal.
Scribe: The name of the transcriber who produced the annotation and the transcription.
Program: The name of the program that produced the episode. (E.g., "NPRMarketplace")
Date: The date and time of the episode broadcast, in "YYMMDD:HHMM" format. (E.g., "960815:1300".)
Version: The version number of the annotation of this episode, starting with "1". Each time the annotation is revised, the version number is incremented by 1.
VersionDate: The (last) date and time of annotation/transcription input to this annotation.

Section - <Section> is a spanning tag, terminated by </Section>. It spans all of the annotation and transcription information associated with a particular section of an episode, and it may contain <Segment>, <Background> and <Comment> tags within its span. It must be contained within the span of an <Episode>. The attributes associated with each Section are:
S_time: The start time of the Section, measured from the beginning of the Episode in seconds.
E_time: The end time of the Section, measured from the beginning of the Episode in seconds.
Type: One of the labels "Story", "Filler", "Commercial", "WeatherReport", "TrafficReport", "SportsReport", or "LocalNews". For the current Hub-4 effort, Commercials and SportsReports will not be transcribed and will therefore contain no Segments. Sections of all other Types will be transcribed and will be included in the evaluation.
Topic: An identification of the event or topic discussed in the Section. For example, "TWA flight 800 disaster". Topic is optional and is not currently being supplied by LDC. (Future use and value of Topic will require additional guidance on how to define it.)

Segment - <Segment> is a spanning tag, terminated by </Segment>.
It spans all of the annotation and transcription associated with a particular Segment, and it may contain <Sync>, <Background> and <Comment> tags within its span, as well as the transcription text. The <Segment> tag must be contained within the span of a <Section>. (Segment information is allowable only for the PE.) The attributes associated with each Segment are:
S_time: The start time of the Segment, measured from the beginning of the Episode in seconds.
E_time: The end time of the Segment, measured from the beginning of the Episode in seconds.
Speaker: The speaker's name.
Mode: One of the labels "Spontaneous" or "Planned".
Fidelity: One of the labels "High", "Medium" or "Low".

Sync - <Sync> is a non-spanning tag that provides transcription timing information within a Segment. It is positioned within the transcription and gives the time at that point. The <Sync> tag must be contained within the span of a <Segment>. Sync is a side-effect of the transcription process and is being provided for potential convenience. Sync has a single attribute, namely Time:
Time: The time at this point in the transcript, measured from the beginning of the Episode in seconds.

Comment - <Comment> is a spanning tag, terminated by </Comment>. It spans a free-form text comment by the transcriber, but no other SGML tags. The <Comment> tag must be contained within the span of an <Episode>.

Background - <Background> is a non-spanning tag that provides information about a particular (single) background signal, specifically regarding the onset and offset of the signal. This information is synchronized with the transcript by positioning the Background tag at the appropriate point in the transcription. (<Background> tag locations and times will be positioned at word boundaries so that the word within which the background noise starts or ends will be included in the span of the background noise.) The <Background> tag must be contained within the span of an <Episode>. The attributes associated with each Background tag are:
Time: The time at this point in the transcript, measured from the beginning of the Episode in seconds.
Type: One of the labels "Speech", "Music" or "Other".
Level: One of the labels "High", "Low" or "Off". This attribute indicates the level of the background signal after Time. Thus High or Low implies that the signal starts at Time, while Off implies that the signal ends at Time.
Foreign (non-English) speech will be labeled as background speech and not transcribed, even if it appears to be in the foreground. The exception to this is that occurrences of borrowed foreign words or phrases, when used within English speech, are transcribed.

Overlap - <Overlap> is a spanning tag, terminated by </Overlap>. It is used to indicate the presence of simultaneous speech from another foreground speaker. This information is synchronized with the transcript by positioning the Overlap tag at the appropriate point in the transcription. (<Overlap> tag locations and times will be positioned at word boundaries so that the word within which the overlap starts or ends will be included in the span of the overlap.) The <Overlap> tag must be contained within the span of a <Segment>. The attributes associated with each Overlap tag are:
S_time: The start time of the Overlap, measured from the beginning of the Episode in seconds.
E_time: The end time of the Overlap, measured from the beginning of the Episode in seconds.
For example:
Speaker A: ... It was a tough game <Overlap S_time=101.222 E_time=102.111> # but very exciting # </Overlap>
Speaker B: <Overlap S_time=101.230 E_time=102.309> # Yes it was # </Overlap>
In this example, Speaker B broke into Speaker A's turn. Note that the Overlap times don't coincide exactly because they have been time-aligned to the most inclusive word boundaries for each speaker Segment involved in the overlap.

Expand - <Expand> is a spanning tag, terminated by </Expand>. It is used to indicate the expansion of a transcribed representation, such as a contraction, to a full representation of the intended words that underlie the transcription. The <Expand> tag spans the word(s) to be expanded and must be contained within the span of a <Segment>. Expand has a single attribute, namely E_form:
E_form: The expanded form of that portion of the transcription spanned by the Expand tag.
To illustrate, here is a simple example:
I <Expand E_form="do not"> don't </Expand> think <Expand E_form="he is"> he's </Expand> lying.
Note that the transcribed words remain unchanged, while the attribute E_form indicates the correct expansion of the spanned words. Note also that E_form resolves potential ambiguity (such as whether "he's" should be expanded to "he is" or "he has").

Noscore - <Noscore> is a spanning tag, terminated by </Noscore>. It is used to explicitly exclude a portion of a transcription from scoring. The <Noscore> tag spans the word(s) to be excluded and must be contained within the span of a <Segment>. Noscore has 3 attributes:
Reason: Short free-form text string containing an explanation of why the tagged text has been excluded from scoring. The string must be bounded by double quotes.
S_time: The start time of the excluded portion, measured from the beginning of the Episode in seconds.
E_time: The end time of the excluded portion, measured from the beginning of the Episode in seconds.
For example:
<Noscore Reason="Mismatch between evaluation index and final transcript" S_time=1710.93 E_time=1711.71> ... text to be excluded ... </Noscore>

A.2 The Transcription

Character set and line formatting
The transcription text will consist of mixed-case ASCII characters. Only alphabetic characters and punctuation marks will be used, along with the bracketing characters listed below. Line breaks may be inserted within the text, to keep lines less than 80 characters wide and to separate the transcription text from SGML tags. (Transcription text will not be entered on the same line with any SGML tag.)

Numbers, Acronyms and Abbreviations
Numbers are transcribed as words (e.g. "ninety six" rather than "96"). Acronyms are transcribed as upper-case words when said as words (e.g., "AIDS"). When said as a sequence of letters, acronyms are transcribed as a sequence of space-separated uppercase letters with periods (e.g., "C. N. N."). Except for "Mr." and "Mrs.", abbreviations are not used. However, words that are spoken as abbreviated (e.g., "corp." rather than "corporation") are spelled that way.

Special Bracketing Conventions
Single word tokens may be bracketed to indicate particular conditions as follows:
**...** indicates a neologism - the speaker has coined a term. E.g., **Mediscare**.
+...+ indicates a mispronunciation. The intended word is transcribed, regardless of its pronunciation. E.g., +ask+ rather than +aks+. (Variant pronunciations that are intended by the speaker, such as "probly" for the word probably, are not bracketed.)
[...]
indicates (a one-word description of) a momentary intrusive acoustic event not made by the speaker. E.g., [gunshot].
{...} indicates (a one-word description of) a non-speech sound made by the speaker. E.g., {breath}.
Sequences of one or more word tokens may be bracketed to indicate particular conditions as follows:
((...)) indicates unclear speech, where what was said isn't clear. The parentheses may be empty or may include a best guess as to what was said.
# ... # indicates simultaneous speech. This occurs during interactions when the speech of two people who are being transcribed overlap. The words in both segments that are affected are bounded by # marks.

Other notations
@ indicates unsure spelling of a proper name. The transcriber makes a best guess and prefixes the name with the @ sign. E.g., Peter @Sprogus.
- indicates a word fragment. The transcriber truncates the word at the appropriate place and appends the - sign. E.g., bac-.

Punctuation
With the exception of periods ("."), normal punctuation is permitted, but not required. Periods are used only after spelled out letters (N. I. S. T.) and in the accepted CSR abbreviations (Mr., Mrs., Ms.). They may not be used to end sentences. Instead, sentences may be delimited with semicolons.

Non-English speech
Speech in a foreign language will not be transcribed. This speech will be indicated using the "((...))" notation. However, for foreign words and phrases that are generally understood and in common usage (such as "adios"), these words will be transcribed with customary English spelling and will be treated as English.

A.3 The Annotation format

With the exception of <Comment>, the beginning mark of each spanning tag will be presented alone and complete on one line. The corresponding ending mark will also appear alone on a subsequent line. The <Comment> units are often brief, but they are free to extend to multiple lines. Within the beginning marks of spanning tags, all attribute value assignments will be bounded by spaces (except the last, where a space isn't needed before the closing ">"). Attributes containing spaces or other non-alphanumeric characters must be enclosed in quotes. Here is an example of annotation:

<Episode Filename=f960531.sph Scribe=StephanieKudrac Program=CNNHeadlineNews Date="960531:1300" Version=1 VersionDate="960731:1730">
<Section S_time=0.28 E_time=105.32 Type=Commercial>
</Section>
<Background Time=111.27 Type=Music Level=High>
<Section S_time=116.55 E_time=124.92 Type=Filler Topic="lead-in">
<Segment S_time=117.61 E_time=121.06 Speaker=Announcer_01 Mode=Planned Fidelity=High>
Live from Atlanta with Judy Forton
</Segment>
<Segment S_time=121.95 E_time=124.92 Speaker=JudyForton Mode=Spontaneous Fidelity=High>
Lynn Vaughn is off today; Thanks for joining us;
</Segment>
</Section>
<Section S_time=124.92 E_time=299.79 Type=Story Topic="U.S. - Israeli politics">
<Segment S_time=124.92 E_time=139.20 Speaker=JudyForton Mode=Planned Fidelity=High>
President Clinton has congratulated Israel's next <Sync Time=127.74> leader <Background Time=128.30 Type=Music Level=Off> and has invited him to the White House to talk about Middle East <Sync Time=131.03> peace strategies {breath} President Clinton called Benjamin Netenyahu just minutes <Sync Time=135.04> after he was declared the winner over Prime Minister Shimon Peres {breath} Fred Saddler reports
</Segment>
<Background Time=139.65 Type=Other Level=Low>
<Comment> background noise and people </Comment>
<Segment S_time=141.32 E_time=154.88 Speaker=FredSaddler Mode=Planned Fidelity=Medium>
Never doubting that he would win, Benjamin Netenyahu came out on top
</Segment>
</Section>
</Episode>

A.4 Show ID Letters

The mapping of show-id letters to show titles is as follows:
a = ABC Nightline
b = ABC World Nightly News
c = ABC World News Tonight
d = CNN Early Edition
e = CNN Early Prime News
f = CNN Headline News
g = CNN Prime Time News
h = CNN The World Today
i = CSPAN Washington Journal
j = NPR All Things Considered
k = NPR Marketplace

APPENDIX B
Segmentation Example

The following appendix traces out the steps involved in converting a transcript data file into a usable format for the other systems in the experiments. See Section 3.3 for a step-by-step explanation of the process.

B.1 Transcript File

This is the transcription file that is provided by the LDC as part of the corpus data set. The transcription below is an example created to illustrate the segmentation process and therefore will neither contain the details involved in a real transcript nor the actual transcription of the words said. As mentioned earlier, all that is important are the tags. The transcript has been formatted to increase readability.

<<test.txt>>
<Episode Filename=test.sph Program="AppendixB" Scribe="rishi-roy" Date="030521:0330" Version=1 VersionDate=030521>
<Comment> This is a test case derived for Appendix B </Comment>
<Section S_time=0 E_time=6 Type=Filler>
<Segment S_time=0 E_time=3 Speaker=Speaker1 Mode=Planned Fidelity=High>
<Background Time=0 Type=Music Level=High>
<Background Time=1 Type=Other Level=High>
<Background Time=2 Type=Speech Level=Low>
<Background Time=4 Type=Speech Level=Off>
</Segment>
<Background Time=4 Type=Music Level=High>
<Background Time=5 Type=Music Level=Off>
</Section>
<Section S_time=6 E_time=9 Type=Commercial>
<Segment S_time=6 E_time=8 Speaker=Speaker2 Mode=Planned Fidelity=High>
</Segment>
</Section>
<Section S_time=9 E_time=25 Type=Filler>
<Background Time=9 Type=Other Level=High>
<Background Time=10 Type=Music Level=High>
<Segment S_time=12 E_time=17 Speaker=Speaker3 Mode=Planned Fidelity=High>
<Background Time=14 Type=Music Level=Off>
<Background Time=15 Type=Other Level=High>
<Background Time=16 Type=Other Level=Off>
</Segment>
<Segment S_time=18 E_time=19 Speaker=Speaker1 Mode=Planned Fidelity=High>
<Background Time=18 Type=Speech Level=Low>
</Segment>
<Background Time=19 Type=Other Level=High>
<Background Time=20 Type=Other Level=Off>
<Segment S_time=20 E_time=22 Speaker=Speaker2 Mode=Planned Fidelity=High>
</Segment>
<Background Time=23 Type=Other Level=High>
<Background Time=24 Type=Other Level=Off>
<Segment S_time=24 E_time=25 Speaker=Speaker1 Mode=Planned Fidelity=High>
</Segment>
</Section>

B.2 Initial Segmentation Files

The following are the initial segmentation files generated by the processing script for the transcript.
B.2 Initial Segmentation Files
The following are the initial segmentation files generated by the processing script for the transcript. The first five are all obtained by processing the SGML tags given in the transcript. The final segmentation file, test_silence.seg, is obtained by processing the cross-talk output, as described in Section 3.3.

<<test_speech.seg>>
0 3 Speech
6 8 Speech
12 17 Speech
18 19 Speech
20 22 Speech
24 25 Speech

<<test_music.seg>>
0 1 Music
4 5 Music
10 14 Music

<<test_back_other.seg>>
1 2 Other
9 10 Other
15 16 Other
19 20 Other
23 24 Other

<<test_back_speech.seg>>
2 4 BS
14 15 BS
18 19 BS

<<test_commercial.seg>>
6 9 Commercial

<<test_silence.seg>>
3 4 Silence
5 6 Silence
9 10 Silence
22 23 Silence

B.3 Result Segmentation
These are the output files from the overlap-removal and overlap-obtaining processes described in Section 3.3 and illustrated in Figure 4.

<<test_speech_only.seg>>
16 17 Speech
20 22 Speech
24 25 Speech

<<test_music_only.seg>>
4 5 Music
10 12 Music

<<test_noise_only.seg>>
19 20 Other

<<test_speech_music_overlap.seg>>
0 1 speech+music
12 14 speech+music

<<test_speech_noise_overlap.seg>>
1 2 speech+overlap
2 4 BS
14 15 BS
15 16 speech+overlap
18 19 BS

These files are then converted to mark format and saved according to the given convention. These marks can now be associated with their respective sphere files.

[Figure 4 appears here: a time-indexed chart with one row per one-second interval of the test transcript and columns for the speech (S), music (M), and noise (O) segmentations, the "combined" column, the "filtered" column, and the final "no overlap" labels (S, M, O, S+M, S+O).]

Figure 4: Segmentation Process. This is a graphical representation of what is occurring in the segmentation process. Although some of these columns do not appear as represented, and the steps involved are a bit more complicated than this, it is helpful to think of the segmentation process in this way. The first group of three columns contains the three main categorical segmentation files - speech, music, and noise. When these are appended together and sorted, column four ("combined") is the result. The second set of three columns contains the set of filtered files. When column four is filtered by these three, column five ("filtered") is the result. The last set of the group contains the final segmentation files for the five categories of interest. These are obtained by extracting the appropriate labels out of the filtered column. See the earlier sections of Appendix B for a more detailed, complete, and accurate description of the process.
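The overlap-removal and overlap-obtaining operations traced in this appendix come down to subtracting and intersecting lists of (start, end) intervals. The Python sketch below is an illustration only; the function names are assumptions and this is not the script used in the experiments.

# Illustrative sketch; not the processing script used in the experiments.

def intersect(a_segs, b_segs):
    """Return the intervals where segments from a_segs and b_segs overlap."""
    out = []
    for a_start, a_end in a_segs:
        for b_start, b_end in b_segs:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:
                out.append((start, end))
    return sorted(out)

def subtract(a_segs, b_segs):
    """Remove from a_segs every portion covered by b_segs (overlap removal)."""
    out = []
    for start, end in a_segs:
        pieces = [(start, end)]
        for b_start, b_end in b_segs:
            next_pieces = []
            for s, e in pieces:
                if b_end <= s or b_start >= e:      # no overlap with this piece
                    next_pieces.append((s, e))
                else:                               # clip out the overlapping part
                    if s < b_start:
                        next_pieces.append((s, b_start))
                    if b_end < e:
                        next_pieces.append((b_end, e))
            pieces = next_pieces
        out.extend(pieces)
    return sorted(out)

# With the Appendix B values:
# speech = [(0, 3), (6, 8), (12, 17), (18, 19), (20, 22), (24, 25)]
# music  = [(0, 1), (4, 5), (10, 14)]
# intersect(speech, music) -> [(0, 1), (12, 14)]   (the speech+music overlaps)
# subtract(music, speech)  -> [(4, 5), (10, 12)]   (the music-only regions)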