Speech Metadata in Broadcast News
by
Rishi R. Roy
Submitted to the Department of Electrical Engineering and Computer Science
in Partial Fulfillment of the Requirements for the Degrees of
Bachelor of Science in Electrical Engineering and Computer Science
and Master of Engineering in Electrical Engineering and Computer Science
at the Massachusetts Institute of Technology
May 21, 2003
Copyright 2003 Rishi R. Roy. All rights reserved.
The author hereby grants to M.I.T. permission to reproduce and distribute publicly paper and electronic copies of this thesis and to grant others the right to do so.
Author:
Department of Electrical Engineering and Computer Science
May 21, 2003

Certified by:
Douglas A. Reynolds
Thesis Co-supervisor

Certified by:
Marc A. Zissman
Thesis Co-supervisor

Accepted by:
Arthur C. Smith
Chairman, Department Committee on Graduate Theses
Speech Metadata in Broadcast News
by
Rishi R. Roy
Submitted to the Department of Electrical Engineering and Computer Science
May 21, 2003
In Partial Fulfillment of the Requirements for the Degree of
Bachelor of Science in Electrical Engineering and Computer Science
and Master of Engineering in Electrical Engineering and Computer Science
Abstract
With the dramatic increase in data volume, the automatic processing of this data becomes
increasingly important. To process audio data, such as television and radio news
broadcasts, speech recognizers have been used to obtain word transcriptions. Of late,
new technologies have been developed to obtain speech metadata information, such as
speaker segmentation, emotions, punctuation, et cetera.
This thesis presents the Massachusetts Institute of Technology Lincoln
Laboratory's (MITLL) unsupervised speaker segmentation system. The goal of the
system is to produce a list of segments and speaker labels given an arbitrary broadcast
news audio file. Each set of segments attributed to a speaker label is similar in both
foreground speaker and background noise conditions.
The system is made up of four components: the acoustic change detector, the
segment labeler, the segment clusterer, and the resegmenter. Of these four, the segment
labeler and the resegmenter are based on the MITLL Gaussian mixture model speaker
identification system.
Using the 1996 Hub4 data from the Linguistic Data Consortium for training, the
unsupervised segmentation system is used to segment six 10-minute broadcast news
audio files. Various aspects of the component systems are tested and trends are reported.
A final speaker segmentation error rate of 20% is obtained, which appears promising
for the above applications. Finally, an analysis of system errors and proposals
for improvements are presented.
Thesis Co-supervisor: Douglas A. Reynolds
Title: Senior Member of Technical Staff, Information Systems Technology Group, MIT
Lincoln Laboratory
Thesis Co-supervisor: Marc A. Zissman
Title: Associate Group Leader, Information Systems Technology Group, MIT Lincoln
Laboratory
Acknowledgments
The work for this thesis was conducted at the Massachusetts Institute of Technology
Lincoln Laboratory. I would like to extend my gratitude to the members of the Speech Group,
for their help and support over the past year.
I would like to thank my thesis co-advisors, Dr. Douglas A. Reynolds and Dr.
Marc A. Zissman, for giving me the opportunity to work on this project.
I would
especially like to thank Dr. Reynolds for his guidance and expertise. He has not only
given me the freedom to explore my ideas, but also the feedback to keep me on track. He
has also provided a great deal of help in the editing process.
One of the best aspects of my time at MIT has been my friends.
Most
importantly, I would like to thank Pamela Bandyopadhyay for helping me survive. Mili,
LYLM. I would also like to thank Gary Escudero for helping me to relax and Saumil
Gandhi for helping me to study (Nitin and Kong, anyone?).
Finally, I want to thank my family for its love and support. To my sister, thank
you for being a friend. To my parents, I owe you everything. This is a result of the
opportunities you have always given me.
Thank you for stressing the importance of
education and motivating me to achieve my goals.
Jai Hanumanji Ki Jai!
Table of Contents
Abstract
Acknowledgments
List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 Applications
    1.2.1 Automatic Speech Recognition
    1.2.2 Readability
    1.2.3 Speaker Searching
  1.3 Overview
    1.3.1 System Overview
    1.3.2 Thesis Overview
2 System Description
  2.1 Acoustic Change Detector
    2.1.1 Single Point Change Detection
    2.1.2 Multiple Point Change Detection
  2.2 Segment Labeler
  2.3 Clusterer
  2.4 Resegmenter
  2.5 System Flow
3 The Speaker ID System
  3.1 General System Overview
  3.2 Likelihood Ratio Detector
  3.3 The MITLL GMM-UBM Speaker ID System
    3.3.1 Front-end Processing
    3.3.2 Gaussian Mixture Models
    3.3.3 The Post-processing
  3.4 The Program Flow
4 Creating Category Models
  4.1 The Data
    4.1.1 Terminology
    4.1.2 1996 Hub4 Data Corpus
    4.1.3 Generating Segments from Raw Data
  4.2 Models
    4.2.1 Three-Model System
    4.2.2 Three-Model Filtered System
    4.2.3 Five-Model System
5 Experiments, Results, and Discussion
  5.1 Data
  5.2 Metrics
  5.3 Acoustic Change Detector
  5.4 Segment Labeler
    5.4.1 Model Performance
    5.4.2 Filter Level
  5.5 Clusterer
  5.6 Resegmenter
6 Conclusion
  6.1 System Overview
  6.2 Performance Summary
  6.3 Future Experiments
A Transcript Description
  A.1 The SGML tags and their attributes
  A.2 The Transcription
  A.3 The Annotation format
  A.4 Show ID Letters
B Segmentation Example
  B.1 Transcript File
  B.2 Initial Segmentation Files
  B.3 Result Segmentation
References
List of Figures
Figure 1: System Flowchart
Figure 2: Example of Speaker Clustering
Figure 3: Category Breakdown
Figure 4: Segmentation Process
List of Tables
Table 1: Overlap Classification
Table 2: Data Breakdown
Table 3: Three Model Show Classification
Table 4: Three Model Labeling
Table 5: Three Model Segmenting
Table 6: Filtered Three Model Labeling Scored Against Filtered Segments
Table 7: Filtered Three Model Segmenting Scored Against Filtered Segments
Table 8: Filtered Three Model Labeling Scored Against Unfiltered Segments
Table 9: Filtered Three Model Segmenting Scored Against Unfiltered Segments
Table 10: Five Model Labeling
Table 11: Five Model Segmenting
Table 12: Change Detection Maximum Duration
Table 13: Change Detection BIC Weighting
Table 14: Five Model - Speech vs. Non-speech
Table 15: Segment Labeler Filter
Table 16: Clusterer Parameters 1
Table 17: Clusterer Parameters 2
Table 18: Passes Through the Resegmenter
CHAPTER 1
Introduction
1.1 Background
Language is the basic means of communication for civilizations worldwide, and speech
is its most natural form of expression.
It is often necessary to transform speech to
another medium in order to enhance our understanding of language. Examples of such
translation include turning speech into a varying voltage signal transmitted through
telephone lines, or turning speech into a sequence of ones and zeros in digital recordings.
The goal of speech-to-text transcription is to transform acoustic speech signals
into a readable format. The technology to do this has been around for many years, and
has historically emphasized the transcription of audio signals into words. Although this
process is still a very active aspect of research, new emphasis has been placed on the
process of metadata extraction.
Unlike transcription, metadata extraction involves
deriving non-lexical information about the speech, which includes (but is not limited to)
determining punctuation, emotion, proper nouns, speaker identification, and diarization.
This thesis addresses the latter two types of information.
Namely, it seeks to
address how one goes about creating a system that uses speaker ID techniques to conduct
diarization. The goal of speaker identification is to determine who, among a given set of
candidate speakers, has spoken a segment of audio.
Diarization, also called
segmentation, involves segmenting an audio file and associating each of these segments
with a speaker.
The goal of this thesis is to present the unsupervised segmentation system
developed at the Massachusetts Institute of Technology's Lincoln Laboratory (MITLL).
The unsupervised segmentation system takes, as input, a broadcast news audio file and,
through a number of processes, outputs a segmentation file, where a segmentation file is a
list of segments and speakers. Since there is no a priori information about the speakers
or content of the audio file, the task is unsupervised.
1.2 Applications
There are many possible applications for this type of technology. Three of the more
common ones are described below.
The first application is in automatic speech
recognition. This is one of the initial areas for which diarization was intended, and will,
therefore, be described extensively. The second is in improving speech recognizer output
readability. The third is in enhancing audio searching via speaker tags.
1.2.1 Automatic Speech Recognition
Automatic speech recognition refers to the process of transcribing audio into text. There
are many advanced systems that carry out this task, but they all have one thing in
common - they make mistakes. These errors, measured by a metric called word error
rate (WER), occur when the speech transcription system incorrectly transcribes a word.
Mistakes can occur for a variety of different reasons, one of the most important being
variations in audio signals.
There are three main types of variations that occur: inter-personal, intra-personal,
and "noise."
Inter-personal variations are due to the uniqueness of a person's speech.
People make use of the fact that a person's voice contains qualities that indicate his or her
identity. These differences in speech are caused by the physical speech apparatus and by
pronunciation differences and can make it very difficult for an automated system to
transcribe what is being said. Intra-personal variations are deviations within one
speaker's speech. Speech is largely a product of what a person does as opposed to
fingerprints, DNA, or retinal patterns which are what a person is [1]. This means that
speech qualities do not remain consistent within the same speaker. These alterations can
be the result of many different causes, such as illness. "Noise" variations are caused by
any interfering or ambient acoustic signal that is not part of the speech of interest. Noise
sources come in the forms of unintelligible background speech, foreign speech, music,
white noise, and many other random sounds present in everyday life. Unsupervised
segmentation addresses each of these three types of variation.
The problems that arise from the transcription of regions of pure noise variation
are twofold.
First, since no one is talking during these segments, this attempted
transcription produces words where they should not exist. This creates word errors by
default. Second, since many of the transcription systems used are causal, where what
happens in the past affects the future, these regions can adversely affect later speech
segments.
Inter-personal and intra-personal variations cause problems because the process
used for modeling the speakers is not robust enough. In the transcription process, it
would be ideal to have one model that can be applied to an entire audio file, and yield a
perfectly accurate transcript result.
Currently, however, this does not occur: the
transcription process cannot accurately accommodate different speakers and internal
variance. In order to improve the WER, it becomes necessary to improve the modeling
process so that the models used can accommodate these variations.
Unsupervised segmentation addresses both problems.
As outlined in Section
1.3.1, the segmentation process includes four distinct steps.
One step is a segment
labeler. Its goal is to detect and then remove non-speech regions or pure noise variations
from the audio file. By only passing on regions of speech to the rest of the process, only
audio containing speech is diarized.
Then, using this diarization, automatic speech recognizers can operate only on speech regions.
For dealing with personal variations, one of the most common solutions is to
adapt a speaker-independent model for each speaker. The ideal approach is to adapt a
model for each homogeneous set of segments. Homogeneity refers to audio that sounds
like the same speaker in the same acoustic environment. This delineation of segments is
exactly what the unsupervised segmentation system does. The first step, acoustic change
detection, marks continuous, homogeneous segments of audio.
Then, the clustering
process, the fourth step, groups non-contiguous homogenous segments together.
1.2.2 Readability
One of the problems with the output of automatic speech recognizers is that it consists
solely of a stream of words. The text is, for all practical purposes, unreadable. One of
the potential uses of diarization is in helping to fix this problem. By parsing through text
with the output segmentation file, all words transcribed from a given segment can be
grouped together and labeled with the segment speaker. This process helps transform a
stream of words into a pseudo-script. Production of capitalization and punctuation in the
text is addressed by other metadata processes [2].
1.2.3 Speaker Searching
Diarization can also be used for speaker searching.
Once the audio file has been
processed by the unsupervised segmentation system, each of the generated output speaker
labels can be matched to a known speaker from a speaker database using speaker ID
methods. Once this is accomplished, instead of the segments being attributed to generic
speaker labels, they can be attributed to real speakers. These segmentation files can then
be translated to a database.
Then it becomes possible to search for clips of audio
associated with a given speaker. This application can be extended even further. If not
only the segment times but also the actual transcript of what was said in those segments
were associated with each speaker, it then becomes possible to search for specific topics
of conversation. Now, not only can one search for clips of Bill Clinton, but one can also
obtain clips of Bill Clinton talking about tax cuts.
These types of experiments are
detailed in [3].
1.3 Overview
A general overview of the unsupervised segmentation system will be given first.
Following this is an overview of the thesis.
1.3.1 System Overview
The unsupervised segmentation system is comprised of four main parts, as shown in
Figure 1.
The first part of the system is the acoustic change detector. Its role is to
provide the initial segments for the rest of the system. Since these are the bases of the
segments that are to be clustered, it would be ideal to have these segments be
homogeneous. The process for this, as well as the other three components, is described in
Chapter 2.
The second piece of the system is the segment labeler. It performs two distinct
tasks. The first is to extract regions of silence from the initial segments. The second is to
use a speaker ID system (described in Chapter 3) to label the segments as being either
speech, music, noise, speech and music overlap, or speech and noise overlap. Then, the
non-speech (music and noise) segments are discarded and the filtered segments are
passed onto the clusterer. The third component, the clusterer, groups similar segments
together, associating each with a single speaker. The final component is the resegmenter.
Using the speakers and associated segments generated from the clusterer as a reference,
the resegmenter uses a speaker ID system to create a new segmentation file. The idea is
that, by training models for each speaker and then generating segments for each of them,
segment boundaries can be refined over the initial segmentation.
Figure 1: System Flowchart. This is a diagram of the unsupervised segmentation system. As
indicated, there are four main components to the process. Each will be described in detail.
1.3.2 Thesis Overview
Chapter 2 provides a detailed description of each of the components of the unsupervised
segmentation system. First described is the acoustic change detector, then the segment
labeler, the clusterer, and finally the resegmenter.
Chapter 3 describes the speaker ID system that is used both in the segment labeler
and the resegmenter. It describes the basic steps and principles on which the system is
based.
Chapter 4 describes the evolution of the five category models of speech, music,
noise, speech and music overlap, and speech and noise overlap used in the labeler. It
begins with an explanation and results from the basic three-model system, then continues
to the filtered-three-model system, and concludes with the five-model system.
Chapter 5 presents the diarization experiments that were run, their associated
results, and a discussion of those results. Results and analysis are presented for each
system component.
Finally, Chapter 6 presents a discussion of the general conclusions that can be
drawn from the experiments. It also presents some future research that can be conducted
to improve the performance of the unsupervised segmentation system.
CHAPTER 2
System Description
The following chapter provides an in-depth description of the unsupervised segmentation
system. The discussion of the system is broken into four sections - one for each of the
four main components outlined in Section 1.3.1.
The first component is the acoustic
change detector, which provides the initial segments. The second is the segment labeler,
which removes silence and non-speech segments.
The third is the clusterer, which
groups like segments together. The final component described is the resegmenter, which
refines the results.
2.1 Acoustic Change Detector
The goal of the acoustic change detector is to segment an audio file into a sequence of
homogeneous regions. These segments provide the initial inputs to the rest of the system.
Ideally, each segment should have the same conditions within it, including speaker,
background noise, and sound levels. There are a number of ways to do change detection.
In this work, a statistical change point detection algorithm based on the Bayesian
Information Criterion (BIC) is used [4].
This process of acoustic change detection will be discussed in two parts. The first
part is a base case, in which an audio file is processed to determine whether there is a
single acoustic change or not. If an acoustic change occurs, the base case will also
determine its location. The process will be discussed first in general terms and then
in the BIC context. The second part expands this to detect an unspecified number of
change points.
2.1.1 Single Point Change Detection
For this section, let the audio file be of length N frames, where each frame, f_i, has an
associated feature vector, x_i. The audio file can be represented by a sequence of these
feature vectors, X = {x_1, x_2, ..., x_N}, where each vector, x_i, is taken at a discrete frame
i in [1, 2, ..., N]. The base case scenario is viewed in the context of a likelihood-ratio
detector. The two hypotheses that are being tested, H0 and H1, are as follows:

H0: The sequence of feature vectors, X = {x_i in R^d, i = 1, ..., N}, is represented by a
single model, M_X.

H1: The sequences of feature vectors, X_1 = {x_i in R^d, i = 1, ..., t} and X_2 = {x_i in R^d,
i = t+1, ..., N}, are represented by two models, M_{X_1} and M_{X_2}, respectively.

Combining these two produces the maximum likelihood ratio statistic, R(t) [5]:

R(t) = \sum_{i=1}^{t} \log p(x_i \mid M_{X_1}) + \sum_{i=t+1}^{N} \log p(x_i \mid M_{X_2}) - \sum_{i=1}^{N} \log p(x_i \mid M_X)    (1)

After computing this value for all t in (1, N), if R(t) <= 0 for all t, then no acoustic change
is present. If, on the other hand, max_t R(t) > 0, then a change is marked at the t where
this maximum occurs.
Unfortunately, this process does not work as desired. In almost every case, two
models will represent the data better than one model, since twice the number of
parameters is available for modeling in the two-model case.
To compensate for this
phenomenon, a penalty for complexity is introduced. There are a number of ways to do
this. The method for acoustic change detection chosen for this work is BIC.
BIC introduces a penalty that is correlated to a model's complexity. This penalty,
P(t), equals (\lambda/2) \#(M) \log C, where \lambda is the weight of the penalty (generally set to 1),
\#(M) is the number of parameters in the model, and C is the number of frames used to
create the model. The penalties for each model are added to Equation 1. They are then
combined to create:

BIC(M_{X_1} + M_{X_2}; M_X) = R(t) - \frac{\lambda}{2}\left[\#(M_{X_1}) + \#(M_{X_2}) - \#(M_X)\right]\log N    (2)
For this work, Gaussians are used for modeling. Looking at the complete audio
file represented by X, the mean vector, \mu_X, and full covariance matrix, \Sigma_X, are extracted.
These parameters define a Gaussian model M(\mu_X, \Sigma_X). Extending this to the sequences
X_1 and X_2 gives: M_X = M(\mu_X, \Sigma_X), M_{X_1} = M(\mu_{X_1}, \Sigma_{X_1}), and M_{X_2} = M(\mu_{X_2}, \Sigma_{X_2}). Using
these models, Equation 2 becomes [4]:

BIC(M_{X_1} + M_{X_2}; M_X) = R(t) - P(t)    (3)

where:

R(t) = N \log|\Sigma_X| - t \log|\Sigma_{X_1}| - (N - t) \log|\Sigma_{X_2}|

P(t) = \frac{\lambda}{2}\left[d + \frac{1}{2} d(d+1)\right]\log N

and d is the dimension of the feature space.
If Equation 3 is positive at frame t, then the model of two Gaussians is favored,
with the change between the two models occurring at t. If there are no positive-valued
frames, then there is no change.
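To make Equation 3 concrete, a minimal single-window sketch in Python is given below. This is an illustration only, not the thesis software: NumPy is assumed, and the function names, the margin guard, and the use of np.cov for the full-covariance Gaussians are my own choices.

import numpy as np

def gaussian_logdet(frames):
    # Log-determinant of the maximum-likelihood full covariance of a
    # (num_frames, d) block of feature vectors; None if it is degenerate.
    cov = np.cov(frames, rowvar=False, bias=True)
    sign, logdet = np.linalg.slogdet(cov)
    return logdet if sign > 0 else None

def bic_single_change(X, lam=1.0, margin=None):
    # Evaluate BIC(t) = R(t) - P(t) over one window X of shape (N, d), with
    #   R(t) = N*log|Sigma_X| - t*log|Sigma_X1| - (N - t)*log|Sigma_X2|
    #   P(t) = (lam/2) * (d + d*(d + 1)/2) * log(N)
    # Returns (best_t, best_bic); a change is declared only if best_bic > 0.
    N, d = X.shape
    margin = margin if margin is not None else d + 1  # keep covariances estimable
    logdet_full = gaussian_logdet(X)
    if logdet_full is None or N <= 2 * margin:
        return None, -np.inf
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(N)
    best_t, best_bic = None, -np.inf
    for t in range(margin, N - margin):
        ld1 = gaussian_logdet(X[:t])
        ld2 = gaussian_logdet(X[t:])
        if ld1 is None or ld2 is None:
            continue
        bic = N * logdet_full - t * ld1 - (N - t) * ld2 - penalty
        if bic > best_bic:
            best_t, best_bic = t, bic
    return best_t, best_bic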
2.1.2 Multiple Point Change Detection
The next step is to extend the detection of one change to the detection of many changes.
The following algorithm is presented by Chen and Gopalakrishnan [4].
(1) initialize the interval [a, b]: a = 1; b = 2.
(2) detect if there is one changing point in [a, b] via BIC.
(3) if (no change in [a, b])
        let b = b + 1;
    else
        let i be the changing point detected;
        set a = i + 1; b = a + 1;
    end
(4) go to (2).
Using this algorithm, the change points are marked. Segments are then created from
these marks. These segments are the basis of the rest of the unsupervised segmentation
task. Since each segment is homogeneous, if initial labeling is done (as described in the
next section), regions of pure noise and music can be identified and eliminated. Then the
remaining, homogeneous speech segments can be grouped together and attributed to
speakers.
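A direct, simplified rendering of this growing-window search is sketched below. It assumes the bic_single_change function from the previous sketch, operates on a pre-computed feature matrix rather than on the audio itself, and uses illustrative step and minimum-window values rather than the settings used in the thesis.

def detect_change_points(X, lam=1.0, step=100, min_window=300):
    # Grow a window until a BIC change is detected, then restart the search
    # just after the detected change point (cf. steps (1)-(4) above).
    N = X.shape[0]
    changes = []
    a = 0
    b = a + min_window
    while b <= N:
        t, bic = bic_single_change(X[a:b], lam=lam)
        if t is not None and bic > 0:
            change = a + t               # window-relative index -> absolute frame
            changes.append(change)
            a = change + 1               # restart the window after the change
            b = a + min_window
        else:
            b += step                    # no change found: widen the window
    return changes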
2.2 Segment Labeler
The segment labeler is used to detect and then remove silence and non-speech audio
segments. This is motivated by the way that clustering works. Simply stated, clustering
groups similar segments together. Ideally, this would mean that segments of speech that
sound like they have the same speaker are combined. In practice however, noise, music,
and silence (the three types of non-speech signal that are present) can become the point of
similarity between segments. This means that, rather than segment A and segment B
being paired together because they have the same sounding speaker, they are paired
together because they both contain music. Since the latter is undesirable, initial labeling
is conducted to remove these sections.
Initial labeling can be further broken down into two processes.
First is the
detection and removal of silence, and second is the detection and removal of noise and
music. An energy based activity detector detects silence. This mechanism is described in
detail in Section 4.1.3. It involves the generation of a frame-by-frame decision of
silence, which is then processed to generate segments of silence. Finally, these segments
are filtered against the initial segment. All segments of speech that fall into a given range
of silence are spliced out, leaving only non-silence audio.
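For illustration, an energy-based detector of this kind could be sketched as follows. This is not the cross talk program described in Section 4.1.3: NumPy is assumed, and the decibel threshold, measured relative to the loudest frame, is an arbitrary choice.

import numpy as np

def silence_segments(signal, sample_rate, threshold_db=-45.0,
                     frame_len=0.020, frame_step=0.010):
    # Mark 20-ms frames taken every 10 ms as silence when their log energy
    # falls below threshold_db relative to the loudest frame, then merge
    # consecutive silence frames into (start_sec, end_sec) segments.
    win = int(frame_len * sample_rate)
    hop = int(frame_step * sample_rate)
    energies = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win].astype(float)
        energies.append(10.0 * np.log10(np.sum(frame ** 2) + 1e-12))
    energies = np.array(energies)
    active = energies > (energies.max() + threshold_db)

    segments = []
    run_start = None
    for i, is_active in enumerate(active):
        if not is_active and run_start is None:
            run_start = i                               # a silence run begins
        elif is_active and run_start is not None:
            segments.append((run_start * frame_step, i * frame_step))
            run_start = None
    if run_start is not None:
        segments.append((run_start * frame_step, len(active) * frame_step))
    return segments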
The process of removing noise and music is approached as a speaker ID task
(described in Chapter 3). The categories of speech, music, noise, speech and music
overlap, and speech and noise overlap will be treated as individual classes.
The
motivation for using these five categories and the creation of their respective models is
addressed in Section 4.2.3. Using the models for each of these, segments can be labeled
based on the respective log-likelihood scores. All non-speech segments are removed, and
the speech segments are passed to the clusterer.
Combining these steps yields the following labeling procedure:
- Energy based activity detection is used to generate a list of silence segments.
- The initial segments from the acoustic change detector are filtered against silence.
- The speaker ID system is run to test the audio file against the input category models of speech, music, noise, speech and music overlap, and speech and noise overlap.
- The speaker ID system output is processed by the labeling post-processing script (described in Section 3.3.3).
- The non-speech segments are extracted and saved to a file. The same is done for the speech segments. The speech file is passed on to the clusterer.
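The silence-filtering step in the procedure above is essentially interval subtraction. A minimal sketch is shown below; the helper name and the minimum-duration cutoff are my assumptions, not part of the MITLL scripts.

def filter_against_silence(segments, silence, min_dur=0.3):
    # Splice silence intervals out of the initial segments. Both arguments are
    # lists of (start_sec, end_sec) tuples; fragments shorter than min_dur
    # seconds are dropped.
    filtered = []
    for seg_start, seg_end in segments:
        pieces = [(seg_start, seg_end)]
        for sil_start, sil_end in silence:
            next_pieces = []
            for start, end in pieces:
                if sil_end <= start or sil_start >= end:
                    next_pieces.append((start, end))        # no overlap
                    continue
                if sil_start > start:
                    next_pieces.append((start, sil_start))  # keep left part
                if sil_end < end:
                    next_pieces.append((sil_end, end))      # keep right part
            pieces = next_pieces
        filtered.extend(p for p in pieces if p[1] - p[0] >= min_dur)
    return filtered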
2.3 Clusterer
Clustering is a process by which single items can be arranged together in order to form
groups that have similar characteristics. The clustering software used in this thesis is that
which is employed at MITLL.
The MITLL clustering software is based on the
agglomerative clustering algorithm [6].
Clustering begins with a number of single-element clusters called "singletons."
For the case at hand, these are the segments generated in the aforementioned steps. A
distance is computed between each of the possible cluster pairings. This distance is
calculated in such a way as to represent the similarity between the candidate clusters.
Then, an iterative process begins where, for each iteration, the two "closest" clusters are
merged to form a new cluster.
New distances are then computed.
They must be
calculated for all of the pairings that contained one of the two pre-merged clusters. There
are a number of ways to determine the new distance between the new merged cluster and
every other cluster. For the MITLL clustering system, the distance between two clusters
is equal to the distance between the two models used to represent each cluster. In the
standard algorithm, this iterative joining continues until only one large cluster exists.
The agglomeration does not proceed this far for the speaker clustering that is
performed. Rather, it is continued until a given set of circumstances is met, at which
point the agglomeration stops. Figure 2 shows an example of agglomerative clustering.
At the base of the clustering "tree" are the singletons, or leaves. The earlier processes
generated these segment singletons. Each leaf is its own cluster. As each iteration of the
clustering algorithm proceeds, the two "nearest" clusters are joined.
This should
theoretically continue until one cluster, called the root, remains. But, in order to generate
a meaningful set of speakers, clustering is continued until a stopping criterion is met; be it
a certain metric obtained or the proper number of speakers generated. For the MITLL
clustering system, the clustering stopping criterion is based on ΔBIC [4]. The point at
which the algorithm stops determines the number of speakers. This results in a number
of unique speakers, each of whom has a given set of segments attributed to him or her.
Each segment appears under one and only one speaker, and should have similarities with
the other segments in its cluster. Ideally, this similarity will be the sound of the speaker's
voice, but it might not be. That is why the filtering of non-speech is likely to improve
results.
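A bare-bones rendering of this agglomerative loop is sketched below. It is not the MITLL clusterer: the cluster-to-cluster distance and the stopping rule (the thesis uses a criterion based on ΔBIC [4]) are supplied by the caller as assumptions.

def agglomerative_cluster(items, distance, should_stop):
    # items: the singleton objects (e.g., per-segment feature matrices).
    # distance(a, b): distance between two clusters (lists of items); smaller
    #     values mean the clusters are more similar.
    # should_stop(clusters): stopping criterion; True halts the merging.
    clusters = [[item] for item in items]          # start from singletons
    while len(clusters) > 1 and not should_stop(clusters):
        best = None
        for i in range(len(clusters)):             # find the closest pair
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]         # merge the closest pair
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters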
Figure 2: Example of Speaker Clustering. The leaf nodes represent different segments in a given
audio file. As the clustering algorithm proceeds, segments are combined based on the distance
between them. The clustering is stopped before the theoretical root cluster is reached, when a given
set of circumstances are met. This gives a set of speakers: in this case four speakers.
2.4 Resegmenter
As mentioned earlier, the ideal solution to speaker clustering is the generation of clusters
with member segments that have the same speaker. As much as removing non-speech
elements may improve the results, impurities remain.
In addition, the overlapping of
speech and music and speech and noise remain in the segmentation, biasing the
clustering. Both of these cases make misclassifications inevitable.
Re-segmentation attempts to remedy this situation.
First, a script makes the
output of the speaker clustering (a list of segments and speakers) into a set of input files
for the speaker ID system (described in Chapter 3). Then, the system trains models for
each speaker. In the testing phase, these models are used to segment the audio file. After
the generation of the input files, the whole process is nearly identical to that used in the
segment labeler.
Combining these steps yields the following re-segmentation procedure:
- New lists are generated from the speaker clustering output file. This list includes a list of speakers present in the output and a set of files for each speaker.
- The speaker ID system is run to train models for each of the speakers.
- The speaker ID system is run to test the audio file against the newly created speaker models.
- The speaker ID system output is processed by the segmentation post-processing script.
  o The non-speech segments recorded in the segment labeler (Section 2.2) are used as a block filter for the segmentation file, so that all of the times that appear in the resulting segmentation do not contain non-speech regions.
- The output list is fed back into the re-segmentation script as many times as is wanted.
2.5 System Flow
Combining these steps yields the following final experimental outline for the
unsupervised segmentation task:
- The input audio file is segmented using acoustic change detection. This generates a list of initial segments for the following steps.
- Energy based speech activity detection is used to generate silence segments.
- The initial segments are filtered against the silence segments. This produces a list of segments that do not contain silence.
- The input audio file is then tested against the five models that are used in the segment labeler to produce an output file.
- The non-silence segments and the above output file are passed onto a script that labels each segment with one of the five category IDs.
- The non-speech labeled segments are removed, leaving a list of segments that contain neither pure music, pure noise, nor silence.
- The segments are passed onto the clustering software. This produces a speaker label for each of the segments.
- The output is re-segmented multiple times, producing as output a list of segments and speaker labels.
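Read as a whole, the outline above is a linear pipeline with one feedback loop. The sketch below is schematic glue only: every stage is passed in as a callable, the five category labels are assumed to be the strings used in Section 2.2, it relies on the filter_against_silence helper sketched there, and none of these names are interfaces of the actual MITLL software.

def run_pipeline(audio, change_detector, silence_detector, labeler,
                 clusterer, resegmenter, n_passes=2):
    # Each argument after `audio` is a callable standing in for one stage.
    initial = change_detector(audio)                       # initial segments
    silence = silence_detector(audio)                      # silence segments
    candidates = filter_against_silence(initial, silence)  # non-silence segments

    labeled = labeler(audio, candidates)                   # [(segment, category), ...]
    # keep "speech", "speech+music", "speech+noise"; drop pure "music"/"noise"
    speech = [seg for seg, cat in labeled if cat.startswith("speech")]
    non_speech = [seg for seg, cat in labeled if not cat.startswith("speech")]

    segmentation = clusterer(speech)                       # speaker-labeled segments
    for _ in range(n_passes):                              # iterative re-segmentation
        segmentation = resegmenter(audio, segmentation, non_speech)
    return segmentation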
CHAPTER 3
The Speaker ID System
Speaker ID is the task of deciding who, among many candidate speakers, has spoken a
given sample of audio. As stated above, the tasks of labeling the initial segments of
audio and the re-segmentation of the results will both be conducted from a speaker ID
approach. Although the categories of speech, music, et cetera in the segment labeler are
not speakers in the classic sense, this method will prove to be appropriate.
3.1 General System Overview
All speaker ID systems contain two phases: training and testing. In the training phase,
the speaker models are created from input audio segments. The second step, the testing,
involves using these models to identify unknown audio.
In the segment labeler, the
training phase occurs only once, after which the generated models are used for all audio
files. On the other hand, each of the iterations through the resegmenter requires training
new models based on the input segmentation file, and testing the audio file with those
new models.
The training is made up of two processes. The first is the front-end processing,
where the feature vectors are extracted from the audio signal. The second is the use of
these vectors to create the appropriate speaker models.
The testing phase has three sub-processes.
The first is the same front-end
processing used for training. The second is the scoring of these feature vectors against
the models created in training. The third and final step is post-processing. In this step,
the scores are smoothed and normalized to increase stability and improve edge detection.
The scores are then processed in one of two ways: labeling or segmenting. The nature of
the experiment determines which of the two is conducted. The labeling or segmenting
process yields the results that determine how well the experiment performed.
3.2 Likelihood Ratio Detector
In order to understand this or nearly any experimental apparatus used in speaker ID, it is
necessary to discuss the basis of the system, the likelihood ratio detector. It will be
examined in the context of matching an audio segment to a speaker.
Given a segment of speech, Y, and a set of hypothesized speakers, S = {s_1, s_2, ..., s_N},
the task is to determine which speaker has spoken utterance Y. Let a speaker from S, s_r,
be represented by a speaker model λ_r, where λ_r contains the characteristics that make
s_r unique. Then, using Bayes' rule:

p(\lambda_r \mid Y) = \frac{p(Y \mid \lambda_r)\, p(\lambda_r)}{p(Y)}    (4)

where p(λ_r) is the a priori probability of speaker s_r being the unknown speaker and p(Y)
is the unconditional probability density function (pdf) for an observation segment.
Speaker s_r is then the hypothesized true speaker if it has the highest probability, shown by:

\frac{p(Y \mid \lambda_r)\, p(\lambda_r)}{p(Y)} > \frac{p(Y \mid \lambda_s)\, p(\lambda_s)}{p(Y)}, \quad s = 1, \ldots, N \ (s \neq r)

This rule is simplified by canceling p(Y) and assuming that p(λ_r) = 1/N, r = 1, ..., N [7].
This results in:

p(Y \mid \lambda_r) > p(Y \mid \lambda_s), \quad s = 1, \ldots, N \ (s \neq r).
This means that the speaker identification decision is reduced to evaluating each
speaker's pdf for the observed speech segment Y, and choosing the maximum value.
3.3 The MITLL GMM-UBM Speaker ID System
The differences between speaker ID systems are how the speakers are modeled and how
these models are applied. The speaker identification system used in this thesis is called
the Gaussian Mixture Model-Universal Background Model (GMM-UBM) Speaker
Identification System. It is used at MITLL, where the research for this thesis has been
conducted. This system provides the necessary computations for both phases of the
speaker ID task. For the training phase, when passed a set of speaker names, appropriate
segments for each speaker, and an audio file, the GMM-UBM system will create a
Gaussian Mixture Model for each speaker containing all of its identifying characteristics.
The generation of these speaker segments requires a significant amount of work, as
discussed in Chapter 4. In the testing phase, when passed an audio file and a set of
speaker models, the system will apply each model to each audio file. The output for each
audio file will be a file of frame-by-frame log-likelihood scores for each model. The
log-likelihood is log[p(Y | λ_r)], where the likelihood is p(Y | λ_r). The exact format of the
output will be discussed in Section 3.3.3. What follows now is an overview of the
GMM-UBM system.
3.3.1 Front-end Processing
The goal of front-end processing is to analyze the given speech signal and extract a
salient sequence of features that convey the speaker-dependent information. The output
from this stage is a series of feature vectors that characterize the test sequence,
X = {x_1, x_2, ..., x_T}, where each vector, x_t, is taken at a discrete time t in [1, 2, ..., T]. In the
training phase of the identification process, these feature vectors are used to create a
model A in the feature space of x that characterizes the model's respective speaker. In
the testing phase (when a segment is matched with a speaker), these feature vectors are
used, in conjunction with the models, to determine the conditional probabilities needed to
determine the log-likelihoods.
The first step in processing and producing the feature vectors is the segmentation
of the complete speech signal into 20-ms frames every 10 ms. Next, mel-scale cepstral
feature vectors are extracted from the speech frames. The mel-scale cepstrum is the
discrete cosine transform of the log-spectral energies of the speech segment Y. The
spectral energies are calculated over logarithmically spaced filters with increasing
bandwidths. Next, the zeroth value cepstral coefficient is discarded and the remaining
are used for further processing.
Next, delta cepstra are computed using a first order
orthogonal polynomial temporal fit over ± 2 feature vectors (two to the left and two to
the right over time) from the current vector. Finally, linear channel convolution effects
are removed from the feature vectors using RASTA filtering. These effects are additive
biases, since cepstral features are used. The result is channel normalized feature vectors
[8].
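For orientation, a broadly comparable front end can be assembled with the librosa library, as in the hedged sketch below. This is not the MITLL front end: the number of coefficients is an assumption, the exact filterbank differs, and the RASTA step is only marked by a comment because librosa does not provide it.

import numpy as np
import librosa

def front_end(wav_path, n_ceps=19):
    # Mel-cepstra plus first-order deltas on 20-ms windows every 10 ms,
    # with the zeroth cepstral coefficient discarded.
    signal, sr = librosa.load(wav_path, sr=None)
    win = int(0.020 * sr)
    hop = int(0.010 * sr)
    ceps = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_ceps + 1,
                                n_fft=win, hop_length=hop)
    ceps = ceps[1:]                                # drop the zeroth coefficient
    deltas = librosa.feature.delta(ceps, width=5)  # fit over +/- 2 frames
    feats = np.vstack([ceps, deltas]).T            # (num_frames, 2 * n_ceps)
    # RASTA channel normalization would be applied here in the system described
    return feats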
3.3.2 Gaussian Mixture Models
The next step is the creation of the models for the speakers. In addition to providing an
accurate description of the vocal characteristics of the speaker, the model of choice will
also determine how the actual log-likelihoods will be calculated. For text-independent
28
speaker recognition, where there is no prior knowledge of what the speaker will say, the
most successful likelihood function has been Gaussian mixture models.
A Gaussian mixture is the weighted combination of several component Gaussian
distributions. For each speaker, a Gaussian mixture is created from the feature vectors
that represent his or her frames of speech. Given the complexity involved, the actual
creation of these models is beyond the scope of this thesis.
After the model is created, the likelihood becomes:

p(x \mid \lambda) = \sum_{i=1}^{M} p_i \, b_i(x)

where M is the number of mixtures, x is a D-dimensional random vector, b_i(x), i =
1, ..., M, are the component densities, and p_i, i = 1, ..., M, are the mixture weights satisfying the
condition \sum_{i=1}^{M} p_i = 1. Each component density is a D-variate Gaussian function of the
form:

b_i(x) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right\}

where \mu_i is the mean vector and \Sigma_i is the covariance matrix. Reynolds provides an in-depth discussion not only of GMMs, but also of the whole GMM-UBM system, in his thesis [7].
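Evaluating the mixture likelihood is mechanical once the parameters λ = {p_i, μ_i, Σ_i} are known. The sketch below (NumPy/SciPy; it assumes already-trained parameters and is not the MITLL training or scoring code) computes the per-frame log-likelihood log p(x | λ).

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_frame_loglik(frames, weights, means, covs):
    # frames:  (T, D) feature vectors
    # weights: (M,) mixture weights summing to 1
    # means:   (M, D) component means
    # covs:    (M, D, D) covariance matrices
    # Returns the (T,) vector of log p(x_t | lambda).
    comp = np.stack([
        np.log(w) + multivariate_normal.logpdf(frames, mean=m, cov=c)
        for w, m, c in zip(weights, means, covs)
    ])                                  # (M, T) log of weighted component densities
    return logsumexp(comp, axis=0)      # log-sum-exp over components per frame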
3.3.3 The Post-processing
In the testing phase, the GMM-UBM system processes the audio file on a frame-by-frame
basis. For each of these frames, the feature vectors are evaluated against the speaker
models, producing a log-likelihood ratio for each model for each frame. All of these
scores are aggregated and stored in a file.
The post-processing step involves going
through the frame-by-frame output file and generating an output related to the speaker
information contained in the test data. There are two methods by which to accomplish
this processing.
The first involves labeling segments, which is of obvious use in the
segment labeler.
The second involves generating segments, which is of use to the
resegmenter. Both of these steps operate on the output from the GMM-UBM system.
The first method involves passing the pre-determined segments into a script. For
each segment, the script extracts the appropriate frames (those that fall within the upper
and lower time-bounds of the segment definition), and averages the log-likelihood ratios
for each model across frames.
The model with the maximum log-likelihood ratio is
determined to be the speaker of the segment.
The second method involves parsing through the complete frame-by-frame score file to generate
segments.
The first step in the process is to average the frames over a given time
window. This serves two purposes. The first purpose is to remove random fluctuations.
Since there are periods of non-speech in any speech segment, this step ensures that these
non-speech periods are not counted as the wrong speaker. The second reason for this
smoothing window is to soften the edges between segments. This makes the transitions
between categories more gradual, and, again, has the purpose of delineating more
precisely the transitions between speakers.
The next step is to determine how these log-likelihood scores are used to
determine who is speaking during each frame.
The method used for this setup is a
maximum-value method. The model with the highest log-likelihood score is determined
to be the speaker. Using this rule, the post-processing script labels each frame with its
associated speaker. Finally, these frame-by-frame speaker labels are coalesced to form
segments.
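Both post-processing modes are simple operations on the frame-by-frame score matrix. The sketch below shows one plausible rendering, not the MITLL scripts: the helper names, the moving-average smoother, and the one-second window are assumptions. The segmenting mode smooths, takes the maximum-value model per frame, and coalesces runs into segments; the labeling mode averages the scores inside each pre-determined segment.

import numpy as np

def segment_from_scores(scores, model_names, frame_step=0.01, window=100):
    # scores: (num_frames, num_models) log-likelihoods.
    # Returns a list of (start_sec, end_sec, model_name) segments.
    kernel = np.ones(window) / window
    smoothed = np.column_stack([
        np.convolve(scores[:, m], kernel, mode="same")  # smooth each model track
        for m in range(scores.shape[1])
    ])
    winners = smoothed.argmax(axis=1)                   # max-value model per frame

    segments = []
    run_start = 0
    for i in range(1, len(winners) + 1):                # coalesce runs of one label
        if i == len(winners) or winners[i] != winners[run_start]:
            segments.append((run_start * frame_step, i * frame_step,
                             model_names[winners[run_start]]))
            run_start = i
    return segments

def label_segments(scores, model_names, segments, frame_step=0.01):
    # Labeling mode: average the scores inside each (start_sec, end_sec) segment.
    labels = []
    for start_sec, end_sec in segments:
        lo, hi = int(start_sec / frame_step), int(end_sec / frame_step)
        labels.append(model_names[scores[lo:hi].mean(axis=0).argmax()])
    return labels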
3.4 The Program Flow
Combining these steps to obtain the final experimental outline yields the following speaker
ID task:
- The GMM-UBM system is given three inputs. The first is a list of speakers. The second is a list of audio files - either for training or testing, depending on which step is undertaken. The third is a list of segments for each speaker for each audio file (if training) or a set of speaker models (if testing).
  o If training, the GMM-UBM system runs its training subroutines and creates models for each speaker based on the training data.
  o If testing, the GMM-UBM system runs its testing subroutines using the input models on the testing data to create an output file for each test file.
- The testing output files are processed with the appropriate post-processing task: either labeling or segmenting, depending on the desired outcome.
CHAPTER 4
Creating Category Models
The segment labeler takes five category models as input: a model for speech, music,
noise, speech and music overlap, and speech and noise overlap. Although the models are
created in a process independent of the actual unsupervised segmentation task, the
process of creating good models is important and merits discussion. The GMM-UBM
speaker ID system was used to both create and test the GMMs that were used to model
each of the categories.
The first step in creating the models was processing the data that was available
for use. The first section describes both the data and the process used to format it for use
by the speaker ID system. The second part of the chapter discusses the three sets of
models that were created, and the qualities of each set.
4.1 The Data
Before the data is discussed, the terminology used to describe it will be presented. After
that, the raw data will be described. Last comes the procedure for parsing and formatting
the data.
4.1.1 Terminology
The terminology that follows is taken from the Linguistic Data Consortium (LDC). The
terms are, therefore, consistent with those used throughout the field.
The following definitions apply when providing a macroscopic description of the
data. A show is defined as a radio or television broadcast production. For instance, ABC
32
Nightline is a show. An episode refers to a particular taping of a show. An example is
ABC Nightline for July 21, 1996. For the Hub4¹ 1996 data, each separate audio file
recorded an episode of a show.
The following definitions apply when discussing a specific episode. A segment is
a contiguous section of audio that maintains a consistent set of pre-determined criteria.
Speech marks the segments in which someone in the foreground is talking in an
intelligible manner.
Each distinct person speaking is referred to as a speaker.
Background is acoustical activity that is neither speech nor music. It represents two subcategories. The first is unintelligible or background speech, which shall be referred to as
background-speech.
The second encompasses all other sounds that occur that are not
speech or music - helicopters, gunshots, white noise, et cetera.
This shall be called
background-noise, or noise. Finally, the term overlap will describe the segments that
contain more than one of the above categories (speech, music, or noise).
Looking at the earlier definitions, a segment can be labeled with three types of
labels - speech, music, or noise. By the methodology of tagging described in Appendix
A.1, the tags music and noise cannot overlap. Therefore, there are only two types of
overlap (it is assumed that label a overlapping with label b is the same as label b
overlapping with label a). The regions that include an overlap of speech and music will
be labeled "speech+music" and the segments of speech and noise will be called
"speech+noise." The results can be seen below in Table 1. It is important to note that,
for the majority of the time, this document will refer to background-speech as an overlap
condition between speech and noise.
¹ Hub4 refers to a particular evaluation data set the LDC distributes for use.
Labels  | Speech        | Music         | Noise
Speech  | -             | Speech+Music  | Speech+Noise
Music   | Speech+Music  | -             |
Noise   | Speech+Noise  |               | -
Table 1: Overlap Classification. This table shows the possible overlap scenarios and the manner in
which they are labeled in the processing of the raw data.
4.1.2 1996 Hub4 Data Corpus
The corpus is made up of 172 episodes that come from eleven shows. The 1996 Hub4
data represents over 104 hours of broadcasts from ABC, CNN and CSPAN television
networks and NPR and PRI radio networks. The LDC provides 172 audio files, one for
each episode, and their respective transcripts
2
Each transcript provides a manual recounting of the audio file associated with it.
Each transcript is hand-marked to denote not only what is said by each speaker, but also
speaker information including gender, phrasal level time boundaries, boundaries between
news stories, and background conditions. The full spectrum of tags (i.e. what is marked
in the transcription process) can be viewed in Appendix A, along with the formatting of
the transcript itself.
It is helpful to highlight an important part of the tagging/transcription process.
All tags, except for the "Background" tag, contain a start time and an end time. This
means that, in order to generate segments of any of these tags, only the opening tags must
be viewed, since they contain all of the important time information.
The "Background" tag is unique because it is the only tag that provides a start
time but no end time. This means that every time the "Background" tag is seen, the
previous "Background" segment has ended and a new "Background" segment has begun.
This results in a non-spanning behavior of the tag.
There cannot be two types of
background happening simultaneously. To this extent, there are three types of background
categorized by the LDC. They are "music," "speech," and "other." This document refers
to these categories as music, background-speech, and background-noise, respectively. In
the case where a "Background" segment has ended and a new one has not begun, the
transcript simply labels this as starting a new "Background" segment with "Level" =
"Off."
4.1.3 Generating Segments from Raw Data
Based on the categories of interest, segmentation data needs to be created for each
category - speech, music, noise, speech+music, and speech+noise.
These tags will
determine which segments to extract for training. They will also provide truth markings
for testing purposes for both labeling and segmenting.
Before outlining the data segmenting process, it is important to describe the file
naming convention that is used throughout the data sets. The original transcript is stored
as "SYYMMDDP.txt," where S is the show (one of eleven letters, a-k, see Appendix 3.4
for exact pairing). YY equals the year. Since this is the 1996 corpus, the YY value is
always 96. MM is the month of the episode and ranges from 04 (April) to 07 (July). DD
is the date of the broadcast and ranges from 01 to 31. P is the partition of the broadcast.
If the entire broadcast is contained in one audio file, then P equals "_". If the broadcast is
in multiple audio files, it is partitioned into a number of audio files (two to four), and
each part is labeled sequentially starting from "a". The string "SYYMMDDP" will be
called the root.
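As a small, purely illustrative aside, the naming convention can be decoded with a regular expression such as the one below; the example root is made up, and the letter-to-show mapping lives in Appendix A.4.

import re

ROOT_PATTERN = re.compile(r"^(?P<show>[a-k])"    # show ID letter (Appendix A.4)
                          r"(?P<year>\d{2})"     # always 96 for this corpus
                          r"(?P<month>\d{2})"    # 04 through 07
                          r"(?P<day>\d{2})"      # 01 through 31
                          r"(?P<part>[_a-z])$")  # '_' = whole broadcast in one file

def parse_root(root):
    # Split a root such as "a960721_" (an invented example) into its fields.
    match = ROOT_PATTERN.match(root)
    if match is None:
        raise ValueError(f"not a valid root: {root!r}")
    return match.groupdict()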
Sphere files are labeled as root.sph. They are the actual audio files containing the
broadcasts.
The script used to process the raw data creates new sphere files that are
symbolically linked to the original file. The script creates five such audio files - one for
each category of interest. Each file is labeled root_category.sph, where category is
speech, music, et cetera.
When processed by the script, the original transcript generates a number of files
that keep track of segments. There are two types of these files: segmentation files and
mark files.
Segmentation files are denoted by root_description.seg.
"Description"
indicates what types of segments are contained within the file. The segmentation files
not only keep track of the final segments for each of the aforementioned five categories,
but also keep track of a number of intermediate segments. The mark files are generated
from the final set of five segmentation files. They are labeled root_category.sph.mark.
The only difference between the mark and segmentation files is the format of each
segment. In the segmentation files, each segment is of the type "starttime endtime
Label". The mark files, in contrast, are formatted as "Label starttime duration".
Each transcription file is parsed through, producing 21 files of data. Each set is
associated with one audio file. The full process may be outlined as follows:
- The transcript, root.txt, is read in and all "Segment" tags are extracted.
- Each tag is translated into a label that contains the speaker name, the start
  time, and the end time. This information is stored as root_speech.seg.
- The transcript is read in again and all "Background" tags are extracted.
- The background tags are reformatted to contain a type, a start time, and an end
  time. The type is one of the three provided by the LDC: music, speech, and
  other. The end time was determined as described at the end of Section 4.1.2.
  Each type of background is saved in a segmentation file: root_music.seg,
  root_back_speech.seg, and root_back_other.seg. None of these three files has
  any regions of overlap since, as described, the "Background" tag is non-spanning.
- Commercials and regions of "Noscore" are extracted from root.txt, processed,
  and saved as root_commercial.seg. The annotation of commercials is inconsistent
  across the transcripts: all transcripts flag commercials, but in some cases the
  commercials are transcribed like any other section and in other cases they are
  not. As a proactive step to remedy this potential problem, all commercials are
  marked and filtered out.
- Cross talk is then run against the audio file itself to determine regions of
  silence. Cross talk is a program written at Lincoln Laboratory that determines
  regions of low acoustical energy when compared to an input threshold. The
  program breaks the audio signal into a number of 20-ms frames every 10 ms.
  For each frame, the total energy in the signal is computed and compared to the
  input threshold. The output is a series of ones and zeros, based on whether the
  energy was lower (0) or higher (1) than the threshold for each frame.
- A script is run which processes the output of the cross talk program. It
  translates the frame-by-frame ones and zeros into a segmentation file, where
  regions of silence are marked, and where silence is defined as those frames
  with an energy level below the threshold. This file is saved as root_silence.seg.
- Next, the regions of overlap are extracted from speech, music, and
  background-noise. The three main categorical segmentation files,
  root_speech.seg, root_music.seg, and root_back_other.seg, are read in and the
  segments are aggregated into one list. As mentioned earlier, root_back_speech.seg
  is considered speech+noise overlap and is therefore not included in this step.
  o This list is filtered by root_silence.seg, root_commercial.seg, and
    root_back_speech.seg, where each of the times associated with these
    segments is removed from the list.
  o The list is then processed segment by segment so that all regions of
    overlapping acoustical descriptions are removed (see the sketch following
    this outline). For instance, the segments "0 10 speech", "2 3 music", and
    "8 11 music" become "0 2 speech", "3 8 speech", and "10 11 music".
  o All segments with the same category label are extracted and saved in
    a file. Speech is saved in root_speech_only.seg, music in
    root_music_only.seg, and noise in root_noise_only.seg. Each of these
    files contains the start and end times of segments that are purely
    from that category.
- The combined file is processed in the opposite direction and all regions of
  overlap are found. Speech+music overlap is saved in root_speech_music_overlap.seg
  and speech+noise overlap is saved in root_speech_noise_overlap.seg.
- A mark file is created from each of the five final segmentation files, as well as
  an associated sphere file, which is linked to the original sphere file. That is,
  for root_speech_only.seg, root_speech_only.sph and root_speech_only.sph.mark
  are created.
- All of the mark files generated from root.txt are trimmed so that the total
  amount of time in each category equals the amount of time in music. There
  must be roughly equal amounts of training data per model (in seconds) per file
  being used to train the system, in order to prevent over-training any one
  category.
  o In each mark file, the segments are sorted by segment duration.
  o The list is split in half at the median-length segment. Then, a segment
    is selected from above and below the median. These segments are
    added to a list. This process of selection continues until the total
    amount of data in the list first exceeds the amount of music for root.
The final product is 21 files: five final segmentation files, five mark files, five
sphere files, and six intermediate files. Figure 3 shows a flowchart of the
breakdown of the categories.
[Figure 3 flowchart: the 1996 Hub4 transcription is divided into all Speech-, Music-,
Back-speech-, and Back-other-labelled segments, which are then combined and re-divided
to yield Pure Speech, Pure Music, Pure Noise, Speech-Music Overlap, and Speech-Noise
Overlap.]
Figure 3: Category Breakdown. This shows the division, recombination, and output of the above
outlined processes.
Appendix B contains an intuitive graphical representation of what occurs during
the segmentation process. Even though there may be some microscopic discrepancies in
the graphical version when compared to the script outline, it provides a fair description of
the macroscopic steps involved in the processing of the transcript files.
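The core overlap-removal step referenced in the outline above is illustrated by the
sketch below, which takes labeled (start, end, label) segments and returns only the
pure, single-category portions. It is a simplified re-implementation for illustration,
not the Lincoln Laboratory script, and the names are illustrative.

# Minimal sketch (illustrative, not the thesis script): given segments of the
# form (start, end, label) that may overlap across categories, keep only the
# regions covered by exactly one category, mirroring the example in the text:
# (0,10,speech), (2,3,music), (8,11,music) -> (0,2,speech), (3,8,speech), (10,11,music).

def pure_regions(segments):
    points = sorted({t for s, e, _ in segments for t in (s, e)})
    out = []
    for lo, hi in zip(points, points[1:]):
        covering = [lab for s, e, lab in segments if s <= lo and hi <= e]
        if len(covering) == 1:                      # purely one category here
            if out and out[-1][1] == lo and out[-1][2] == covering[0]:
                out[-1] = (out[-1][0], hi, covering[0])   # extend previous region
            else:
                out.append((lo, hi, covering[0]))
    return out

print(pure_regions([(0, 10, "speech"), (2, 3, "music"), (8, 11, "music")]))
# [(0, 2, 'speech'), (3, 8, 'speech'), (10, 11, 'music')]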
Once the script is run, the total amount of data for each category is summed.
Table 2 shows the breakdown by categories of the 104 hours of 1996 Hub 4 data.
Type            Time in Sec    Time in Min
Speech            191541.27        3192.35
Music               8288.46         138.14
Speech+Music       14191.45         236.52
Noise               1893.91          31.57
Speech+Noise       60926.34        1015.44
Total              276841.43        4614.02
Table 2: Data Breakdown. Total amount of data in each of the five categories.
The data processing procedure described above generates segments for five
categories of acoustical signal. For two sets of the speaker ID experiments, there are only
three categories of interest: speech, music, and speech+music overlap. Although these
three-model systems were developed earlier, it is easiest to think of their pre-processing
step as a simplification of that used for the five-model system. The process is almost
identical, with the exception that root_back_speech.seg and root_back_other.seg are
combined into root_back.seg in the three-model system. Then, in the following steps,
only speech and music are combined to produce the overlap and non-overlap segments.
Finally, the filtering of the combined speech and music list is done against the segments
from root_back.seg, in addition to the previously used root_silence.seg and
root_commercial.seg, for one set of models, and against only the latter two for the other
set.
4.2 Models
The experiments that were run in this section fall into three evolving sets of models.
They will be presented with the final version last. The first is a three-model system that
attempts to detect "speech", "speech+music", and "music". The second is an extension
of the original system, where background is filtered from the audio used to create the
models. The third is the five-model system that detects "speech", "music", "noise",
"speech+noise", and "speech+music". Each section that follows will provide the major
characteristics of the experiment and then provide the results.
In all of these experiments, the breakdown was as follows. The models were trained
on all of the episodes of two shows and tested against all eleven shows. This
was done for two reasons. The first addresses the nature of music in broadcasts. Music,
as mentioned before, is played as a transition between stories in a broadcast. Most
notably, music marks the introduction of the episode itself. This music is the same within
a show and is usually an identifying characteristic of the show. Therefore, if models were
trained on all the shows, the possibility of over-training the models may arise. This
means that instead of detecting music in general, which is what is wanted, the models would
detect the theme songs of the eleven broadcasts. The second reason is that if training
occurs on only one show, the model may fall short due to a shortage of training data.
4.2.1 Three-Model System
For this experiment, the data was pre-processed with three models and no background
filtering in mind. This experiment uses the simplest of the models and is meant to be a
first-pass test.
Only the segments marked as speech and music are collected and
processed according to the description provided in Section 4.1.3. The two types were
processed for overlapping and exclusive regions while being filtered against the
commercials and silence. Again, as mentioned above, background is ignored.
Table 3 shows the results of the first experiment. One of the scoring methods that
the GMM-UBM system provides, in addition to the frame-by-frame dump_scr option, is a
segmentation-file-based output. When passed a file of segments, the GMM-UBM system
scores the audio denoted by all of the given segments and provides a single score for each
model. This option was used to make sure that the created models had potential before further
processing. All of the audio files were tested with each of their three respective speaker
segmentation files.
truth\hypothesis     Speech            Speech+Music      Music
                     %       #Shows    %       #Shows    %       #Shows
Speech               100.0%     172      0.0%       0      0.0%       0
Speech+Music           0.0%       0    100.0%     172      0.0%       0
Music                  0.0%       0      0.0%       0    100.0%     172
Table 3: Three Model Show Classification. This shows the results of the three model classification on
the episode level.
Looking at Table 3, one can see that the results are as desired. All of the shows
were categorized as expected. This leads to the next step: labeling segments. As
mentioned earlier, this step involves the post-processing of the dump_scr output with a list of
segments. The list used for this experiment is just the aggregation of all of the category
segments for all of the episodes. The log-likelihoods for each model are averaged across
the duration of the given segment, and the maximum score determines the hypothesized
speaker. Table 4 shows the results of this process. A total of 19962 segments were
scored.
truth\hypothesis     Speech            Speech+Music      Music
                     %       # Segs    %       # Segs    %       # Segs
Speech               95.6%    16284      3.9%     665      0.5%      84
Speech+Music         44.7%      815     47.6%     869      7.7%     141
Music                 3.5%       39      4.4%      48     92.1%    1017
Table 4: Three Model Labeling. This shows the results of the three model labeling. Results show the
number of segments attributed to each class and the percentage correct.
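As a concrete illustration of the labeling step described above Table 4, the sketch below
averages the per-frame model log-likelihoods over a segment and picks the best-scoring
model. It is a simplified stand-in for the post-processing script, with illustrative names;
per-frame scores are assumed to be available as one array per model, at 10-ms frames.

# Simplified sketch of the segment-labeling step (illustrative, not the thesis
# script): per-frame log-likelihoods for each model are averaged over a
# segment's frames and the highest-scoring model becomes the segment's label.
# Frames are assumed to be 10 ms apart, as in the rest of the system.

def label_segment(frame_scores, start_sec, end_sec, frame_rate=100):
    """frame_scores: dict mapping model name -> list of per-frame log-likelihoods."""
    lo, hi = int(start_sec * frame_rate), int(end_sec * frame_rate)
    averages = {model: sum(scores[lo:hi]) / (hi - lo)
                for model, scores in frame_scores.items()}
    return max(averages, key=averages.get)

scores = {"speech": [-1.0] * 500, "music": [-3.0] * 500, "speech+music": [-2.0] * 500}
print(label_segment(scores, 1.0, 3.0))   # "speech"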
The final step is the segmentation of the dump_scr files. A 201-frame (two-second)
averaging window removed spikes and smoothed the data. Then, on a frame-by-frame
basis, each frame was labeled with the category producing the maximum log-likelihood
score. Finally, all neighboring frames with the same label were collapsed together. A
total of 278036.4 seconds were scored. The results are shown in Table 5.
truth\hypothesis     Speech               Speech+Music         Music                Total
                     %       # Secs       %       # Secs       %       # Secs
Speech               96.5%   246578.30     2.9%     7453.66     0.6%     1489.21   255521.10
Speech+Music         48.0%     6881.02    42.3%     6064.40     9.7%     1394.48    14339.90
Music                 3.7%      303.97     6.7%      545.70    89.6%     7325.75     8175.42
Table 5: Three Model Segmenting. This shows the results of the three model segmentation. Results
are in the number of seconds attributed to each category.
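The smoothing-and-collapse procedure described before Table 5 is illustrated below: a
moving average over each model's frame scores, a per-frame argmax, and a merge of
consecutive frames with the same label. It is an illustrative re-implementation under the
assumption of 10-ms frames; none of the names come from the thesis scripts.

# Illustrative sketch of the frame-level segmentation step: smooth each model's
# frame scores with a 201-frame window, label every frame by the best model,
# then collapse runs of identical labels into segments. Assumes 10-ms frames.
import numpy as np

def segment_frames(frame_scores, window=201, frame_sec=0.01):
    """frame_scores: dict of model -> 1-D numpy array of per-frame log-likelihoods."""
    kernel = np.ones(window) / window
    smoothed = {m: np.convolve(s, kernel, mode="same") for m, s in frame_scores.items()}
    models = list(smoothed)
    stacked = np.vstack([smoothed[m] for m in models])
    best = np.argmax(stacked, axis=0)            # winning model index per frame
    segments, start = [], 0
    for i in range(1, len(best) + 1):
        if i == len(best) or best[i] != best[start]:
            segments.append((start * frame_sec, i * frame_sec, models[best[start]]))
            start = i
    return segments

With 10-ms frames, the 201-frame window corresponds to the roughly two-second smoothing
described in the text.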
Given the fact that the speech and music models were trained on only two shows,
these results support the claim that music has distinct and unique identifying
characteristics. It also shows that a speaker ID setup can successfully flag these regions.
There is, overall, an approximately 90% accuracy rate for classifying pure music. Since
the purpose of these models is to flag regions of speech, the classification of speech as
music is not desirable. Therefore, the error of roughly 0.5% for falsely identifying speech
as music is promising.
The only problem that arises is in speech and music overlap. The results from the
non-filtered three-model system are not good for the classification of speech+music. The
quality of speech+music in nearly all cases is loud, foreground speech over soft,
background music. If the speech model was adulterated with something that had
dynamics similar to this, then these segments could pull speech+music towards speech,
thereby leading to the misclassifications that are seen.
Even though music and noise are not similar per se, they are alike in the fact that
they are not speech. Therefore, since there is no speech+music in the speech model, the
speech training data could be contaminated with speech+noise. This leads to the next set
of models.
4.2.2 Three-Model Filtered System
As mentioned earlier, the three-model system summarized above did not take background
into account. The logical next step is to filter out these segments, and see if any changes
occur.
The marked background segments are filtered out of the speech and music
segments. After training, the models should be more tuned to their respective category,
since the training data is more pure.
The same two types of results as above are shown again below. For the first set of
labeling and segmenting results, Table 6 and Table 7 respectively, the results were
obtained by comparing the post-processing outputs to the truth determined from the
filtered segmentation files described above. As seen from the tables, there are 24521 segments
being labeled and 275814.2 seconds being segmented.
The second set, Table 8 and Table 9, was obtained by comparing the dump_scr output to
the truth determined by the segmentation files used in Section 4.2.1. These are the
non-filtered segments for speech, music, and their overlap. There are 19962 segments
labeled and 273602.5 seconds segmented.
truth\hypothesis     Speech            Speech+Music      Music             Total
                     %       # Segs    %       # Segs    %       # Segs
Background           71.9%     3452     19.0%     914      9.1%     435      4801
Speech               94.8%    15925      4.6%     770      0.6%      99     16794
Speech+Music         32.1%      609     60.5%    1147      7.4%     140      1896
Music                 2.1%       22      4.0%      41     93.9%     967      1030
Table 6: Filtered Three Model Labeling Scored Against Filtered Segments. This shows the results of
the three model labeling using the models created with the filtered segmentation files. Results show
the number of segments attributed to each class and the percentage correct. These are scored against
the filtered segmentation files.
truth\hypothesis     Speech               Speech+Music         Music                Total
                     %       # Secs       %       # Secs       %       # Secs
Background           77.3%    48062.52    19.4%    12086.49     3.3%     2024.55    62173.56
Speech               96.6%   184998.80     3.2%     6061.51     0.2%      402.44   191462.80
Speech+Music         36.1%     5127.89    54.3%     7699.38     9.6%     1365.76    14193.03
Music                 1.7%      132.15     7.6%      610.08    90.7%     7242.62     7984.85
Table 7: Filtered Three Model Segmenting Scored Against Filtered Segments. This shows the results
of the three model segmentation using the models created with the filtered segmentation files.
Results are in the number of seconds attributed to each category. These are scored against the
filtered segmentation files.
truth\hypothesis     Speech            Speech+Music      Music             Total
                     %       # Segs    %       # Segs    %       # Segs
Speech               90.8%    15459      8.6%    1469      0.6%     105     17033
Speech+Music         31.5%      574     61.3%    1119      7.2%     132      1825
Music                 2.9%       32      4.8%      53     92.3%    1019      1104
Table 8: Filtered Three Model Labeling Scored Against Unfiltered Segments. This shows the results
of the three model labeling using the models created with the filtered segmentation files. Results
show the number of segments attributed to each class and the percentage correct. These are scored
against the non-filtered segmentation files.
truth\hypothesis     Speech               Speech+Music         Music                Total
                     %       # Secs       %       # Secs       %       # Secs
Speech               92.4%   232402.70     7.0%    17579.32     0.6%     1435.63   251417.70
Speech+Music         36.1%     5127.89    54.3%     7698.66     9.6%     1364.27    14190.82
Music                 1.7%      132.15     7.7%      618.63    90.6%     7243.27     7994.05
Table 9: Filtered Three Model Segmenting Scored Against Unfiltered Segments. This shows the
results of the three model segmentation using the models created with the filtered segmentation files.
Results are in the number of seconds attributed to each category. These are scored against the
non-filtered segmentation files.
The changes in times and numbers of segments are as expected. First, the
number of seconds segmented is smaller, since sections that were previously called
speech are being extracted as background. Also, the number of segments has increased,
since the previous segments of speech are broken up into two or more segments when the
background is removed.
The results from the first set of experiments confirm what is expected:
background does seem to be throwing off the results. Speech+music classification
improves by more than 10% when the models are created with background removed.
The reason for this improvement is an improved speech model. As stated earlier, no
background appears with music. With the removal of background from the segments,
speech+noise is removed from the speech training data, and therefore the model
represents purer speech. One might suspect that the changed results are simply an
artifact of the changed truth marks. This is tested by scoring the filtered models against the
unfiltered truth. The same 10% improvement remains.
Unfortunately, this improvement comes at the cost of missed speech, since
speech+noise is filtered out of the models. This is not desirable. Therefore, the natural
extension of the process is to create models for speech+noise and noise, test with them,
and see how the results change.
4.2.3 Five-Model System
The last set of experiments deals with the five-model system. These five models are the
final set and are used in the segment labeler. The five models of speech, music, noise,
speech+music, and speech+noise are created and scored against. The results are shown
in Tables 10 and 11. The same two methods described above are applied, except
expanded to five models. There are 24813 segments labeled and 229566.2 seconds
segmented.
The models for music and speech+music are the same as those in the two
three-model systems. The speech model is the same as in the filtered three-model system
(different from the original three-model system).
The results that are obtained are less than ideal. Speech+noise and speech have
misclassifications, as well as noise and music.
Speech+music is misclassified as
speech+noise (which further substantiates what was said earlier regarding the speech and
speech+music confusion), and so is noise as speech+noise.
What follows are some
potential explanations for these inaccuracies.
Noise adds problems because of its non-uniformity. Noise/background does not
have any characteristics that make it identifiable, except for the fact that it is not speech
or music.
One of the biggest problems is also the fact that noise volume varies
significantly. It can be quite loud at times and barely audible at others. This variability
helps explain the problems with speech and speech+noise. The speech+noise model is
based on data that has this "quiet" noise. If the noise can barely be heard, then the
segment is speech, for all practical purposes. On the other hand, speech has all types of
background noises that are not marked as noise in the transcript. This means that there is
a common type of audio segment modeled by both models. This, of course, leads to
erroneous classifications.
The music and noise confusion may be due to the fact that some of what is called
noise has characteristics similar to some of the music. This similarity could be based on
similar frequency distributions or acoustical dynamics. Regardless, whatever the reason
for the similarity, it also explains the misclassification of speech+music as speech+noise.
The errors between noise and speech+noise can be explained by the same reasoning that
explains the music and speech+music overlap. These are simply distribution problems.
In this regard, the models are accurate for all of the classes.
truth\hypothesis   Speech            Speech+Music      Speech+Noise      Noise             Music             Total
                   %       # Segs    %       # Segs    %       # Segs    %       # Segs    %       # Segs
Speech             62.9%     9438      1.8%     271     31.8%    4771      3.2%     484      0.3%      44     15008
Speech+Music       14.8%      266     41.7%     749     34.9%     627      1.8%      32      6.8%     123      1797
Speech+Noise       23.1%     1396      6.1%     368     62.9%    3808      6.3%     381      1.6%      97      6050
Noise               6.6%       59      6.0%      54     16.3%     146     60.3%     540     10.8%      97       896
Music               1.2%       13      4.6%      49      3.1%      33      9.8%     104     81.3%     863      1062
Table 10: Five Model Labeling. This shows the results of the five model labeling. Results show the number of segments attributed to each class and the
percentage correct.
truth\hypothesis   Speech               Speech+Music         Speech+Noise         Noise                Music                Total
                   %       # Secs       %       # Secs       %       # Secs       %       # Secs       %       # Secs
Speech             24.9%    38516.75    24.0%    37036.87    23.8%    36732.30    10.1%    15573.47    17.3%    26726.26   154585.70
Speech+Music       13.4%     1594.18    18.6%     2208.07    23.6%     2807.92    28.1%     3341.80    16.3%     1933.20    11885.17
Speech+Noise       17.5%     9580.38    21.3%    11680.61    38.7%    21184.61     9.3%     5077.39    13.2%     7207.99    54730.98
Noise               6.7%      106.16    51.4%      813.07    21.8%      344.93     7.2%      113.16    12.9%      204.44     1581.76
Music              10.3%      699.57    14.4%      974.95    12.7%      859.34    11.6%      788.56    51.0%     3460.25     6782.67
Table 11: Five Model Segmenting. This shows the results of five model segmentation. Results are in the number of seconds attributed to each category.
CHAPTER 5
Experiments, Results, and Discussion
This chapter discusses the experiments run with the unsupervised segmentation system.
First, a short description of the training and testing data used is given. Next, the five
measures of performance that appear in the result tables are described. Finally, the
experiments and results are presented for each tested system component.
5.1 Data
There are two sets of data that are used. One set is the Hub 4 1996 data from the LDC.
This is described in detail in Chapter 4. These data files consist of audio files and
corresponding transcript files.
The transcript files were processed in order to create
training data for the creation of the category GMM models.
The second set of data is from the dry run experiments of the RT-03³ speaker
evaluation. It contains a set of six 10-minute audio files. Also present in the data set is a
list of reference segmentations. These are the six files that are each passed to the
unsupervised segmentation system. After the system is run, the output segments are
collected and compared to the provided reference segmentations.
3 This is a data corpus provided by NIST for their annual speaker evaluations.
5.2 Metrics
The output segmentation files are scored against the provided truth using a scoring script
provided by the National Institute of Standards and Technology's (NIST's) Speech Group.
When passed the truth and hypothesized segmentation files, the script outputs a score
report for the hypothesized file. The five numbers within the report are missed speech,
false alarm speech, missed speaker time, false alarm speaker time, and speaker error time.
The significance of each number is as follows. In general, missed speech is those
regions that are marked as speech in the truth segmentations, but are not marked as
speech in the test segments. False alarm speech is the opposite: marked speech in test,
but not in truth. Each of these times in seconds is then normalized in order to produce the
percentages seen in the tables.
If these times are normalized by the total amount of
scored audio, the results are "missed speech" and "falarm speech." If the normalization
factor is the total amount of speech in the audio, then the results are "missed speaker
time" and "falarm speaker time." The scoring software then finds the best mapping of
hypothesized speakers to real speakers for each audio file. Once this mapping occurs, the
amount of time in seconds misclassified is normalized by the total amount of speech to
produce the "speaker error time." The three errors of missed speaker, falarm speaker,
and speaker error are summed to produce the total error.
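To make the normalizations concrete, the sketch below computes the three speaker-time
errors and their sum from per-file totals. It is an illustrative calculation only, not the
NIST scoring script; the function name and the example numbers are hypothetical.

# Illustrative sketch of the error normalizations described above (not the NIST
# scoring script). Times are in seconds; the example numbers are hypothetical.

def diarization_errors(missed_spkr_sec, falarm_spkr_sec, spkr_err_sec, total_speech_sec):
    missed = 100.0 * missed_spkr_sec / total_speech_sec
    falarm = 100.0 * falarm_spkr_sec / total_speech_sec
    spkr_err = 100.0 * spkr_err_sec / total_speech_sec
    return missed, falarm, spkr_err, missed + falarm + spkr_err   # total error

m, f, s, total = diarization_errors(2.4, 53.4, 63.6, 600.0)
print("missed %.1f%%  falarm %.1f%%  speaker error %.1f%%  total %.2f%%" % (m, f, s, total))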
5.3 Acoustic Change Detector
The acoustic change detector was not tested in isolation. Rather, the effects of changing
various detector parameters on the clustering process are noted. None of these results
contain re-segmentation.
Two parameters were perturbed in isolation.
The first, scd_max, dictates the
maximum duration of a created segment in a number of frames. Since clustering can
only combine segments, the segmentation process can break down if the initial segments
are too long in duration. Therefore, capping the maximum length of segments avoids this
problem. Column one in Table 12, "baseline," measures the performance of the initial
set-up. In this case, scd_max is set to 3500 frames. Since one frame equals 10 ms, this
means that no created segment can be longer than 35 seconds. Other values for scd_max
are 1500 frames, 4500 frames, and 6000 frames.
                      baseline   scd_max =    scd_max =    scd_max =
                      (%)        1500 (%)     4500 (%)     6000 (%)
Missed Speech          0          0.1          0            3.1
Falarm Speech          4.8        4.8          4.8          4.7
Missed Speaker Time    0.4        0.4          0.4          6.1
Falarm Speaker Time    8.9        9.0          8.9          8.8
Speaker Error Time    10.6       11.2         10.0          9.8
Total Error           19.86      20.67        19.27        24.68
Table 12: Change Detection Maximum Duration. This table shows the results of changing the
maximum length of a segment produced by the acoustic change detector.
The second parameter tested was the weighting of the BIC penalty (λ in Section
2.1.1). Although the strict definition of BIC states that λ = 1, altering this factor changes
the potential for segmentation. The baseline result of λ = 1 is shown in Table 12. The
first experiment set λ = 0.8. The results are shown in Table 13, column 1. Next, λ was
set to 0.9, 1.1, 1.2, and 1.3. Table 13 shows the respective results.
                      λ=0.8   λ=0.9   λ=1.1   λ=1.2   λ=1.3
                      (%)     (%)     (%)     (%)     (%)
Missed Speech          8.4     0.1     3.1     0.0     0.0
Falarm Speech          3.7     4.8     4.5     4.9     5.0
Missed Speaker Time   16.1     0.4     6.1     0.4     0.4
Falarm Speaker Time    7.0     9.0     8.4     9.2     9.4
Speaker Error Time     7.7    11.2     8.9    11.3    13.2
Total Error           30.8    20.61   23.47   20.87   23.05
Table 13: Change Detection BIC Weighting. This table shows the results of changing the penalty
weight in the BIC penalty.
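For reference, a common form of the BIC-based change criterion with this penalty weight
is sketched below. It follows the standard Chen and Gopalakrishnan formulation and is an
assumed form; the exact notation of Section 2.1.1 may differ.

% Standard BIC change-detection criterion with penalty weight \lambda
% (assumed form; Section 2.1.1 may differ in notation).
\Delta \mathrm{BIC}(i) = \frac{N}{2}\log|\Sigma| - \frac{N_1}{2}\log|\Sigma_1|
    - \frac{N_2}{2}\log|\Sigma_2|
    - \lambda \cdot \frac{1}{2}\left(d + \frac{d(d+1)}{2}\right)\log N

Here N is the number of frames in the analysis window, N1 and N2 are the frames on either
side of candidate change point i, Σ, Σ1, and Σ2 are the corresponding covariance matrices,
and d is the feature dimension. A change is hypothesized where ΔBIC(i) > 0, so lowering λ
makes change points more likely, consistent with the discussion that follows.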
Table 12 shows that clustering performance is dependent upon the maximum
duration of segments. As expected, if the initial segments become too long,
performance drops. What was not expected was the drop in performance when the
maximum duration was limited to 15 seconds and the increase in performance when the
duration was increased to 45 seconds. Theoretically, the clustering system should be able
to rejoin any of the segments that are broken up by a length restriction. Therefore, using
more segments should not lower performance. Unfortunately, this does not occur. Most
likely due to the increase in the number of segments, the clustering process did not
succeed in recombining adjacent segments.
The alteration of the BIC penalty weight had various unexpected results. As λ
decreased, the errors increased. By decreasing the BIC penalty weight, the detector
placed less weight on model complexity, thereby increasing the likelihood of hypothesizing
a change. This meant that there were more segments of shorter duration. This is consistent
with the decrease in maximum segment length, although the resulting placement of changes
in the two processes was most likely different. As λ increased, each setting's error
increased relative to the baseline, though not consistently from one setting to the next.
Given the above reasoning, it can be concluded that the clusterer does not perform well
with fewer, longer segments.
5.4 Segment Labeler
The experiments on the labeler can be broken into two groups. The first set of
experiments is concerned with the accuracy of the models used to perform the labeling.
The second set shows the value of adding various levels of post-label filtering to the
diarization process.
5.4.1 Model Performance
The description and performance of these models are shown in Section 4.2.3. The data
used to test this aspect of the labeler are different from those used to test the rest
of the components. Although the tables in Section 4.2.3 show the results by category, it
is more meaningful to see the performance of the models based on the speech versus
non-speech distinction. Since this is the reason the labeler is used, it makes sense to
highlight this aspect of performance. Using the results of Table 10, Table 14 shows the
results of speech and non-speech classification. Speech, in this case, is represented by
any of the three speech-containing categories from above: speech, speech+noise, and
speech+music. Non-speech refers to noise and music.
truth\hypothesis     Speech            Non-speech        Total
                     %       # Segs    %       # Segs
Speech               94.9%    21694      5.1%    1161     22855
Non-Speech           18.1%      354     81.9%    1604      1958
Table 14: Five Model - Speech vs. Non-speech. This table shows the results of speech and non-speech
classification, using the five model labeling results. Results show the number of segments attributed
to each.
As seen in Table 14, the results are satisfactory. There is a 5% misclassification
of speech as non-speech. There is also the problem of roughly 20% of the non-speech
being missed.
5.4.2 Filter Level
The second aspect of the labeler experiments determined the level of filtering
that should be applied after the segments are labeled. All of these tests were run with
the baseline parameters for clustering, as described in the following section. The first
step taken was to run the minimal blind segmentation system on the data to see what the
performance was like. The minimal system entails the initial segmentation and the
clustering of those segments. The results of this process are shown in Table 15, in the
column "no filter." This label is applied because the segments are not filtered prior to
clustering.
As mentioned earlier, one of the major reasons for which the labeler was
developed is that it removed areas of non-speech, causing lower errors. Since there are
two types of non-speech, the first step taken was to remove all of the segments that were
labeled as music or noise, as described in Section 2.3. The results are shown in the
column labeled "pure music and noise filter".
The next step was to remove all of the silence from the above filtered segments.
The process for doing this is also described in Section 2.3. The results for this process
are shown in the column labeled "all non-speech filter."
Finally, all of the initial segments that are not considered pure speech were
removed. This was done by changing the post-labeling step to reject all of the segments
that were not labeled pure speech. The results are in the "All music and noise filter"
column.
                      no filter   pure music and     all non-speech   all music and
                      (%)         noise filter (%)   filter (%)       noise filter (%)
Missed Speech          0           0                  1.6             21.1
Falarm Speech          6.2         4.8                3.3              1.6
Missed Speaker Time    0.4         0.4                3.3             39.8
Falarm Speaker Time   11.7         8.9                6.2              2.9
Speaker Error Time    11.7        10.6                9.3              7.3
Total Error           23.68       19.86              18.73            49.97
Table 15: Segment Labeler Filter. This table shows the results of various types of filtering.
The results from this stage are mostly as expected, but with some surprises. The
removal of all non-speech provided the best results, dropping the total error by nearly
5%, with the removal of noise and music coming in second, dropping the error by nearly
4%. There is an increase in the amount of missed speech, as the evaluation includes
filtering silence. This occurs because much of the silence occurs in a speaker's speech.
Whenever the speaker pauses to take a breath, think, et cetera, silence is created. This
silence is not represented in the truth marks, and therefore, when these times of silence
are not included, it shows up as missed speech. When only pure speech was kept, the
models performed exceptionally well. Unfortunately, all of the overlapped speech was
neglected, and therefore the missed speech was tremendously high.
The surprise comes in the area of missed speech for the pure music and noise
filter and, as a byproduct, the non-speech filter. If the results in Table 14 are an estimate
of performance, a roughly 5% missed speech rate should appear for both experiments.
This does not happen. Rather, there is no missed speech.
5.5 Clusterer
Various parameters dealing with the clusterer were altered in isolation, and their effects
were recorded in Table 16 and Table 17. The baseline results (the same as in Table 12),
are reported again in Table 16.
The first experiment used a RASTA channel
normalization process prior to clustering. Channel normalization attempts to remove
background variations in the signal. The results are shown in column "RASTA" in Table
16. The final two columns in Table 16 report the next set of experiments. The mixorder
denotes the order of the mixture model used to help resolve the speakers. The baseline
value of this is set to 128 mixtures. The experiments tested 256 and 64 mixtures.
                      RASTA   baseline   mixorder     mixorder
                      (%)     (%)        = 256 (%)    = 64 (%)
Missed Speech          0       0          0            0
Falarm Speech          4.8     4.8        4.8          4.8
Missed Speaker Time    0.4     0.4        0.4          0.4
Falarm Speaker Time    8.9     8.9        8.9          8.9
Speaker Error Time    12.2    10.6       12.6         17.5
Total Error           21.5    19.86      21.86        26.76
Table 16: Clusterer Parameters 1. This table shows the results of various parameter changes.
Table 17 shows the effects of altering some of the feature processing parameters.
In the audio file, there is a mix of two types of speech: wideband (8 kHz) and narrowband
(4 kHz). These experiments look at the effect of bandlimiting the data to
either 8 kHz or 4 kHz. The second aspect of these tests is whether a linear or mel-scale
filterbank (described in Section 3.3.1) should be used. The baseline is a 4 kHz
cutoff using a linear filterbank. The experiments tested the other three combinations: 4
kHz mel-scale, 8 kHz linear, and 8 kHz mel-scale.
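For reference, the sketch below shows the conventional mel-scale mapping that
distinguishes a mel-scale filterbank from a linearly spaced one. The exact warping used
by the thesis front end is assumed, not quoted, from Section 3.3.1.

# Conventional mel-scale warping (an assumed, standard form; Section 3.3.1 of
# the thesis may use a slightly different constant or spacing).
import math

def hz_to_mel(freq_hz):
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

# A mel-scale filterbank spaces filter centers evenly in mel; a linear
# filterbank spaces them evenly in Hz.
print(hz_to_mel(1000.0))   # roughly 1000 mel
print(hz_to_mel(4000.0))   # roughly 2146 mel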
                      4 kHz linear     4 kHz mel-   8 kHz        8 kHz mel-
                      (baseline) (%)   scale (%)    linear (%)   scale (%)
Missed Speech          0                0.0          0.0          0.0
Falarm Speech          4.8              4.8          4.8          4.8
Missed Speaker Time    0.4              0.4          0.4          0.4
Falarm Speaker Time    8.9              8.9          8.9          8.9
Speaker Error Time    10.6             14.2         14.4         14.2
Total Error           19.86            23.55        23.66        23.53
Table 17: Clusterer Parameters 2. This table shows the results of various parameter changes.
None of the parameter changes showed an improvement over the initial set-up.
By attempting a RASTA channel normalization, the clusterer loses some of the
information about ambient conditions that helps clustering. From the results, using 256
mixtures per model is too many and using 64 mixtures is too few. Finally, the use of the
lowest common bandwidth, 4 kHz, with a linear filterbank performs the best.
5.6 Resegmenter
The final set of experiments dealt with the re-segmentation process described in Section
2.4. The output file from the run containing the original parameter values, with all
non-speech filtered out, was used. This file was passed to the re-segmentation script, and the
results are shown in the "1st reseg" column of Table 18. The output associated with the
first pass is sent to the re-segmentation script again to generate a new output list. The scores
from that process are shown in the "2nd reseg" column. The process continues until the
fifth pass through the re-segmentation script. The results are shown below.
                      base (non-speech   1st reseg   2nd reseg   3rd reseg   4th reseg   5th reseg
                      filtered) (%)      (%)         (%)         (%)         (%)         (%)
Missed Speech          1.6                0.1         0.1         0.1         0.1         0.1
Falarm Speech          3.3                4.8         4.8         4.8         4.8         4.8
Missed Speaker Time    3.3                0.5         0.5         0.5         0.5         0.5
Falarm Speaker Time    6.2                9.0         9.0         9.0         9.0         9.0
Speaker Error Time     9.3                9.4         9.5         9.4         9.4         9.4
Total Error           18.73              18.98       19.03       19.00       18.99       19.01
Table 18: Passes Through the Resegmenter. This table shows the results of resegmenting the original
output of the blind segmentation system.
The results of the re-segmentation processes are somewhat different than
expected. The resegmenter did not decrease the total error of the segmentations. This
could have occurred for one of three reasons. The first possibility is that there were no
errors in the acoustic change detection and clustering processes. The second is that the
re-segmentation process did not work as expected. The third is that the speaker models
are being tested on the same data that was used to train them. The first cannot be ruled
out just because the system performance suggests errors in the aforementioned components.
Although the output from the clusterer did have errors when compared to the truth, these
errors probably have no correlation to the accuracy of the acoustic matches. The problem
lies in the fact that the truth is segmented by individual speakers, whereas the
hypothesis is segmented by acoustic speakers. Those segments that have completely
different background conditions but the same foreground speaker are marked as the
same speaker in truth, but different speakers in the hypothesis. Unfortunately, the
resegmenter did not compensate for this discrepancy.
The first reason may explain the results, but in all likelihood the cause is the third. The
lack of significant results is most likely due to the fact that the speaker models are tested
on the same data that was used to train them. Since all of the
feature vectors presented in testing are already in one of the speaker models, it
becomes nearly impossible to reassociate speakers and segments.
CHAPTER 6
Conclusion
6.1 System Overview
The MITLL unsupervised segmentation system used to perform the diarization task has
four main components: the acoustic change detector, the segment labeler, the clusterer,
and the resegmenter.
Each performs a unique function in the diarization task. The
change detector provides the initial list of homogeneous segments used. The labeler
removes all of the silence regions. Then it uses a speaker ID system to label all of the
segments and, finally, it removes the non-speech ones from the list. Next, the clusterer
groups the remaining segments together to provide a list of segments and speaker tags.
Finally, the resegmenter smoothes the results by using a speaker ID system to train
speaker models for each speaker and identify regions where the speakers are talking. This
creates another segmentation file that can be fed into the resegmenter as many times as is
desired.
The speaker ID system is called the GMM-UBM Speaker ID System. It performs
two main functions: training models and testing audio files with models. For the training
phase, the system needs a list of speaker and corresponding audio segments to be
modeled. It produces a Gaussian Mixture Model for each speaker. These models can
then be used in the testing phase, where an audio file is scored with each model on a
frame-by-frame basis. One of two scripts is then used to process these frame-by-frame
scores, producing either segment labels or a segmentation file.
In the case of the segment labeler, five models are created beforehand: one for
each category of acoustic signal - pure speech, pure noise, pure music, speech and music
overlap, and speech and noise overlap. These five category models are developed from
the training data. When the labeler is run, these models are passed on to the speaker ID
testing phase to develop the labels. On the other hand, the re-segmentation task uses both
aspects of the speaker ID system to first create speaker models, and then use those
speaker models to produce segments.
6.2 Performance Summary
At first glance, the results seem to show an adequate, but lacking, unsupervised
segmentation system. The 20% error rate is much higher than is desired. But, this error
rate is somewhat misleading. It measures the deviation of the hypothesized segmentation
from the truth segmentation. As mentioned earlier, the hypothesized marks are created
with the goal of homogeneity in mind.
The truth, on the other hand, has no such
standard. Therefore, higher error rates are inevitable, and very difficult to reduce, given
the fact that there are so many acoustic variations (see Section 1.2.1). It also leads to the
question: is matching the truth exactly what is desired?
The truth marks state that a speaker, Bill Clinton for instance, has spoken at
specific times. Unfortunately, it says nothing about the quality of audio signal contained
within those times. On the other hand, the hypothesized speaker marks say that speaker
A, matched to Bill Clinton in the scoring, has spoken at specific times. When what is
marked as speaker A is compared to what is marked as Bill Clinton, it may be that
speaker A is actually Bill Clinton with no background noises. It may also be that speaker
B and speaker C in the hypothesized segments are actually Bill Clinton over music and
Bill Clinton over a roaring crowd, respectively, but, in scoring, these speakers are not
recognized as Bill Clinton. Currently, there is no way of telling if this is the case, but
given the design of the diarization process, it could be extremely likely.
It is, therefore, critical to keep in mind the specific purpose and context of the
diarization process. If the objective is very speaker-identity specific, then there is much
room for improvement. On the other hand, if the desired result is homogeneous speakers,
then the system in place may be very accurate. Unfortunately, there is no method of
testing the accuracy of this aspect.
6.3 Future Experiments
Future experiments will be broken up into four parts, one for each component of the
system.
In terms of the acoustic change detector, the means of penalizing model
complexity may present opportunities to improve results. Currently, the prevalent BIC
method is used.
Other possible methods include, but are not limited to, the Akaike
Information Criterion (AIC), Hurvich and Tsai's corrected AIC, and the minimum
description length criterion (MDL). On the other hand, the use of statistics for change
detection may not yield the best results.
The segment labeler can be improved in two ways. The first involves refining the
models and the modeling process. The second involves using the log-likelihood scores
for the models in a more meaningful way. In terms of the modeling, one of the first steps
would be creating more models. The transcript annotation description explains
how "Fidelity" and "Level" attributes are maintained to qualify the quality of the audio signal.
For the five-model system, these attributes are ignored. If they are instead used to
create more categories, such as high fidelity speech, low fidelity speech, et cetera, many
of the classification errors that occurred may be eliminated. This would mean that more
speech was retained, more non-speech was dropped, and the confusion between speech
and overlap regions could be eliminated.
Along these lines, the other area of exploration would be developing a more
elegant technique by which to use the log-likelihood scores produced by these models to
determine the speaker. Currently, the model with the highest log-likelihood is declared
the speaker. It might be possible to use various combinations of these scores and
threshold them in order to determine speakers. For instance, the noise model's score may be
subtracted from both the music and speech scores. Then, if either difference is greater than
a given threshold, the segment is labeled as that category; if both are greater than the
threshold, it is labeled as overlap; and if neither is, it is labeled as noise.
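A minimal sketch of this proposed thresholding scheme is given below; the threshold value,
the function name, and the example scores are hypothetical, since the thesis only outlines
the idea in prose.

# Minimal sketch of the proposed score-difference thresholding (hypothetical
# names, threshold, and example; the thesis only sketches the idea).

def threshold_label(speech_llk, music_llk, noise_llk, threshold=0.5):
    speech_margin = speech_llk - noise_llk
    music_margin = music_llk - noise_llk
    speech_hit = speech_margin > threshold
    music_hit = music_margin > threshold
    if speech_hit and music_hit:
        return "overlap"
    if speech_hit:
        return "speech"
    if music_hit:
        return "music"
    return "noise"

print(threshold_label(-1.0, -3.0, -2.5))   # "speech": only the speech margin clears 0.5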
The clusterer could profit from the use of a different clustering algorithm.
Currently a bottom-up algorithm is used. This means the process begins with many small
segments, and, as the process proceeds, similar segments are combined.
One option
would be to change to a top-down algorithm. This starts with one large segment, and
breaks it up into smaller segments. If this proves to be less effective, then there are also
some changes to the bottom-up algorithm that can be attempted. For instance, different
methods of determining distances may be tried.
Another set of experiments involving the clusterer could involve improving the
RASTA channel normalization that was used in one set of experiments. This could have
advantages in improving the accuracy of speaker-based evaluation (as opposed to
homogeneity-based evaluation). Since background noises are normalized, the emphasis
can be shifted away from matching acoustic conditions to matching speaker voices.
Finally, experiments with the resegmenter should involve developing a way so
that all of the data tested does not also show up in the training data. If it is assumed
that the regions of overlap are where the errors happen and that the regions of pure speech are
properly clustered, then one possible solution would be to train only on pure speech and
test only on overlap. Unfortunately, this may prove to be difficult, given the fact that this
would entail the ability to detect speakers over music and noise, which has not been
demonstrated.
Another solution may be in using some of the previously mentioned
techniques in the resegmenter. For instance, if all of the segments for a given speaker are
combined to form one large segment, then a top-down clustering algorithm could be used
to break it down. Then, certain parts of the remaining segments could be used to train
models, and the remaining tested on.
APPENDIX A
Transcript Description
The following is a verbatim (minus some formatting) excerpt from a document provided
by the LDC as an ancillary document to the corpus data. It explains the tags and
formatting of the transcript files that are parsed in order to create the data files that are
necessary for many of the steps in the experimental process.
A.1 The SGML tags and their attributes
Episode - <Episode> is a spanning tag, terminated by </Episode>. It spans all of the
annotation and transcription information associated with a particular episode, and it
may contain <Section>, <Background> and <Comment> tags within its span.
The attributes associated with each Episode are:
Filename: The name of the file containing the episode's audio signal.
Scribe: The name of the transcriber who produced the annotation and the transcription.
Program: The name of the program that produced the episode. (E.g., "NPRMarketplace")
Date: The date and time of the episode broadcast, in "YYMMDD:HHMM" format. (E.g., "960815:1300".)
Version: The version number of the annotation of this episode, starting with "1". Each time the
annotation is revised, the version number is incremented by 1.
VersionDate: The (last) date and time of annotation/transcription input to this annotation.
Section - <Section> is a spanning tag, terminated by </Section>. It spans all of the
annotation and transcription information associated with a particular section of an
episode, and it may contain <Segment>, <Background> and <Comment> tags
within its span. It must be contained within the span of an <Episode>. The
attributes associated with each Section are:
S_time: The start time of the Section, measured from the beginning of the Episode in seconds.
E_time: The end time of the Section, measured from the beginning of the Episode in seconds.
Type: One of the labels "Story", "Filler", "Commercial", "WeatherReport", "TrafficReport",
"SportsReport", or "LocalNews". For the current Hub-4 effort, Commercials and
SportsReports will not be transcribed and will therefore contain no Segments. Sections
of all other Types will be transcribed and will be included in the evaluation.
Topic: An identification of the event or topic discussed in the Section. For example, "TWA flight
800 disaster". Topic is optional and is not currently being supplied by LDC. (Future use
and value of Topic will require additional guidance on how to define it.)
4 The data corpus is available at the LDC website previously listed.
Segment - <Segment> is a spanning tag, terminated by </Segment>. It spans all of the annotation
and transcription associated with a particular Segment, and it may contain <Sync>, <Background>
and <Comment> tags within its span, as well as the transcription text. The <Segment> tag must be
contained within the span of a <Section>. (Segment information is allowable only for the PE.) The
attributes associated with each Segment are:
S_time: The start time of the Segment, measured from the beginning of the Episode in seconds.
E_time: The end time of the Segment, measured from the beginning of the Episode in seconds.
Speaker: The speaker's name.
Mode: One of the labels "Spontaneous" or "Planned".
Fidelity: One of the labels "High", "Medium" or "Low".
Sync - <Sync> is a non-spanning tag that provides transcription timing information
within a Segment. It is positioned within the transcription and gives the time at that
point. The <Sync> tag must be contained within the span of a <Segment>. Sync is
a side-effect of the transcription process and is being provided for potential
convenience. Sync has a single attribute, namely Time:
Time: The time at this point in the transcript, measured from the beginning of the Episode in
seconds.
Comment - <Comment> is a spanning tag, terminated by </Comment>. It spans a free-form text
comment by the transcriber, but no other SGML tags. The <Comment> tag must be contained
within the span of an <Episode>.
Background - <Background> is a non-spanning tag that provides information about a
particular (single) background signal, specifically regarding the onset and offset of
the signal. This information is synchronized with the transcript by positioning the
Background tag at the appropriate point in the transcription. (<Background> tag
locations and times will be positioned at word boundaries so that the word within
which the background noise starts or ends will be included in the span of the
background noise.) The <Background> tag must be contained within the span of
an <Episode>. The attributes associated with each Background tag are:
Time: The time at this point in the transcript, measured from the beginning of the Episode in
seconds.
Type: One of the labels "Speech", "Music" or "Other".
Level: One of the labels "High", "Low" or "Off". This attribute indicates the level of the
background signal after Time. Thus High or Low implies that the signal starts at Time,
while Off implies that the signal ends at Time.
Foreign (non-English) speech will be labeled as background speech and not
transcribed, even if it appears to be in the foreground. The exception to this is that
occurrences of borrowed foreign words or phrases, when used within English
speech, are transcribed.
Overlap - <Overlap> is a spanning tag, terminated by </Overlap>. It is used to
indicate the presence of simultaneous speech from another foreground speaker.5
This information is synchronized with the transcript by positioning the Overlap tag
at the appropriate point in the transcription. (<Overlap> tag locations and times
will be positioned at word boundaries so that the word within which the overlap
starts or ends will be included in the span of the overlap.) The <Overlap> tag must
be contained within the span of a <Segment>. The attributes associated with each
Overlap tag are:
S_time: The start time of the Overlap, measured from the beginning of the Episode in seconds.
E_time: The end time of the Overlap, measured from the beginning of the Episode in seconds.
For example:
Speaker A: ... It was a tough game <Overlap S_time=101.222 E_time=102.111> # but very
exciting # </Overlap>
Speaker B: <Overlap S_time=101.230 E_time=102.309> # Yes it was # </Overlap>
In this example, Speaker B broke into Speaker A's turn. Note that the Overlap times don't coincide
exactly because they have been time-aligned to the most inclusive word boundaries for each speaker
Segment involved in the overlap.
Expand - <Expand> is a spanning tag, terminated by </Expand>. It is used to
indicate the expansion of a transcribed representation, such as a contraction, to a
full representation of the intended words that underlie the transcription. The
<Expand> tag spans the word(s) to be expanded and must be contained within the
span of a <Segment>. Expand has a single attribute, namely E_form:
E_form: The expanded form of that portion of the transcription spanned by the
Expand tag.
To illustrate, here is a simple example: I <Expand E_form="do not"> don't
</Expand> think <Expand E_form="he is"> he's </Expand> lying. Note that
the transcribed words remain unchanged, while the attribute E_form indicates the
correct expansion of the spanned words. Note also that E_form resolves potential
ambiguity (such as whether "he's" should be expanded to "he is" or "he has").
Noscore - <Noscore> is a spanning tag, terminated by </Noscore>. It is used to
explicitly exclude a portion of a transcription from scoring. The <Noscore> tag
spans the word(s) to be excluded and must be contained within the span of a
<Segment>. Noscore has 3 attributes:
Reason: Short free-form text string containing an explanation of why the tagged
text has been excluded from scoring. The string must be bounded by
double quotes.
S_time: The start time of the excluded portion, measured from the beginning of the Episode in
seconds.
E_time: The end time of the excluded portion, measured from the beginning of the Episode in
seconds.
For example:
<Noscore Reason="Mismatch between evaluation index and final transcript"
S_time=1710.93 E_time=1711.71> ... text to be excluded ... </Noscore>
A.2 The Transcription
Character set and line formatting
The transcription text will consist of mixed-case ASCII characters. Only alphabetic
characters and punctuation marks will be used, along with the bracketing characters
listed below. Line breaks may be inserted within the text, to keep lines less than 80
characters wide and to separate the transcription text from SGML tags.
(Transcription text will not be entered on the same line with any SGML tag.)
Numbers, Acronyms and Abbreviations
Numbers are transcribed as words (e.g. "ninety six" rather than "96"). Acronyms are
transcribed as upper-case words when said as words (e.g., "AIDS"). When said as a
sequence of letters, acronyms are transcribed as a sequence of space-separated uppercase letters with periods (e.g., "C. N. N."). Except for "Mr." and "Mrs.",
abbreviations are not used. However, words that are spoken as abbreviated (e.g.,
"corp." rather than "corporation") are spelled that way.
Special Bracketing Conventions
Single word tokens may be bracketed to indicate particular conditions as follows:
**     indicates a neologism - the speaker has coined a term. E.g., **Mediscare**.
+...+  indicates a mispronunciation. The intended word is transcribed, regardless
       of its pronunciation. E.g., +ask+ rather than +aks+. (Variant
       pronunciations that are intended by the speaker, such as "probly" for the
       word probably, are not bracketed.)
[...] indicates (a one-word description of) a momentary intrusive acoustic event
not made by the speaker. E.g., [gunshot].
{...} indicates (a one-word description of) a non-speech sound made by the
speaker. E.g., {breath}.
Sequences of one or more word tokens may be bracketed to indicate particular
conditions as follows:
((...)) indicates unclear speech, where what was said isn't clear. The parentheses
may be empty or may include a best guess as to what was said.
# ... # indicates simultaneous speech. This occurs during interactions when the
speech of two people who are being transcribed overlap. The words in
both segments that are affected are bounded by # marks.
Other notations
@ indicates unsure spelling of a proper name. The transcriber makes a best guess
and prefixes the name with the @ sign. E.g., Peter @Sprogus.
- indicates a word fragment. The transcriber truncates the word at the appropriate
place and appends the - sign. E.g., bac-.
Punctuation
With the exception of periods ("."), normal punctuation is permitted, but not required.
Periods are used only after spelled out letters (N. I. S. T.) and in the accepted CSR
abbreviations (Mr., Mrs., Ms.). They may not be used to end sentences. Instead,
sentences may be delimited with semicolons.
Non-English speech
Speech in a foreign language will not be transcribed. This speech will be indicated
using the "((...))" notation.
However, for foreign words and phrases that are
generally understood and in common usage (such as "adios"), these words will be
transcribed with customary English spelling and will be treated as English.
A.3 The Annotation format
With the exception of <Comment>, the beginning mark of each spanning tag will be presented alone and
complete on one line. The corresponding ending mark will also appear alone on a subsequent line. The
<Comment> units are often brief, but they are free to extend to multiple lines. Within the beginning marks
of spanning tags, all attribute value assignments will be bounded by spaces (except the last, where a space
isn't needed before the closing ">"). Attributes containing spaces or other non-alphanumeric characters
must be enclosed in quotes. Here is an example of annotation:
<Episode Filename=f960531.sph Scribe=StephanieKudrac
Program=CNNHeadlineNews Date="960531:1300" Version=1
VersionDate="960731:1730">
<Section Stime=0.28 Etime=105.32 Type=Commercial>
</Section>
<Background Time=1 11.27 Type=Music Level=High>
<Section Stime= 116.55 Etime=124.92 Type=Filler Topic="lead-in">
<Segment Sjtime= 117.61 Etime=121.06 Speaker=Announcer_01 Mode=Planned
Fidelity=High>
Live from Atlanta with Judy Forton
</Segment>
<Segment Stime=121.95 Etime=124.92 Speaker=JudyForton Mode=Spontaneous
Fidelity=High>
Lynn Vaughn is off today; Thanks for joining us;
</Segment>
</Section>
<Section Stime=124.92 Etime=299.79 Type=Story Topic="U.S. - Israeli politics">
<Segment Stime=124.92 Etime=139.20 Speaker=JudyForton Mode=Planned
Fidelity=High>
President Clinton has congratulated Israel's next
<Sync Time=127.74>
leader
<Background Time=128.30 Type=Music Level=Off>
and has invited him to the White House to talk about Middle East
<Sync Time=131.03>
peace strategies {breath} President Clinton called Benjamin Netenyahu just minutes
<Sync Time=135.04>
after he was declared the winner over Prime Minister Shimon Peres {breath} Fred
Saddler reports
</Segment>
<Background Time=139.65 Type=Other Level=Low>
<Comment> background noise and people </Comment>
<Segment Stime=141.32 Etime=154.88 Speaker=FredSaddler Mode=Planned
Fidelity=Medium>
Never doubting that he would win, Benjamin Netenyahu came out on top
</Segment>
</Section>
</Episode>
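Since each beginning mark is presented alone and complete on one line (the printed example above appears to wrap long lines for display only), the annotation can be parsed line by line. The following is a minimal, illustrative sketch, not the thesis processing script; the function names are assumptions.

import re

# Match the beginning mark of an annotation tag and capture its attribute string,
# then split that string into attribute/value pairs (quoted values may hold spaces).
TAG_RE = re.compile(r"<(Episode|Section|Segment|Background|Sync)\s+([^>]*)>")
ATTR_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')

def parse_tags(lines):
    """Yield (tag_name, {attribute: value}) for every beginning mark."""
    for line in lines:
        m = TAG_RE.match(line.strip())
        if not m:
            continue
        name, attr_text = m.groups()
        yield name, {k: v.strip('"') for k, v in ATTR_RE.findall(attr_text)}

sample = [
    '<Segment Stime=117.61 Etime=121.06 Speaker=Announcer_01 Mode=Planned Fidelity=High>',
    'Live from Atlanta with Judy Forton',
    '</Segment>',
]
for name, attrs in parse_tags(sample):
    print(name, attrs["Speaker"], attrs["Stime"], attrs["Etime"])
# -> Segment Announcer_01 117.61 121.06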
A.4 Show ID Letters
The mapping of show-id letters to show titles is as follows:
a = ABC Nightline
b = ABC World Nightly News
c = ABC World News Tonight
d = CNN Early Edition
e = CNN Early Prime News
f = CNN Headline News
g = CNN Prime Time News
h = CNN The World Today
i = CSPAN Washington Journal
j = NPR All Things Considered
k = NPR Marketplace"
APPENDIX B
Segmentation Example
The following appendix traces the steps involved in converting a transcript data
file into a format usable by the other systems in the experiments. See Section 3.3
for a step-by-step explanation of the process.
B.1 Transcript File
This is the transcription file that is provided by the LDC as part of the corpus data set.
The transcription below is an example created to illustrate the segmentation process;
it therefore contains neither the detail of a real transcript nor an actual
transcription of the words said. As mentioned earlier, only the tags are important.
The transcript has been formatted to increase readability.
<<test.txt>>
<Episode Filename=test.sph Program="AppendixB" Scribe="rishi-roy"
Date="030521:0330" Version=1 VersionDate=030521>
<Comment> This is a test case derived for Appendix B </Comment>
<Section Stime=0 Etime=6 Type=Filler>
<Segment Stime=0 Etime=3 Speaker=Speaker1 Mode=Planned
Fidelity=High>
<Background Time=0 Type=Music Level=High>
<Background Time=1 Type=Other Level=High>
<Background Time=2 Type=Speech Level=Low>
<Background Time=4 Type=Speech Level=Off>
</Segment>
<Background Time=4 Type=Music Level=High>
<Background Time=5 Type=Music Level=Off>
</Section>
<Section Stime=6 Etime=9 Type=Commercial>
<Segment Stime=6 Etime=8 Speaker=Speaker2 Mode=Planned
Fidelity=High>
</Segment>
</Section>
<Section Stime=9 Etime=25 Type=Filler>
<Background Time=9 Type=Other Level=High>
<Background Time=10 Type=Music Level=High>
<Segment Stime=12 Etime=17 Speaker=Speaker3 Mode=Planned
Fidelity=High>
<Background Time=14 Type=Music Level=Off>
<Background Time=15 Type=Other Level=High>
<Background Time=16 Type=Other Level=Off>
</Segment>
<Segment Stime=18 Etime=19 Speaker=Speaker1 Mode=Planned
Fidelity=High>
<Background Time=18 Type=Speech Level=Low>
</Segment>
<Background Time=19 Type=Other Level=High>
<Background Time=20 Type=Other Level=Off>
<Segment Stime=20 Etime=22 Speaker=Speaker2 Mode=Planned
Fidelity=High>
</Segment>
<Background Time=23 Type=Other Level=High>
<Background Time=24 Type=Other Level=Off>
<Segment Stime=24 Etime=25 Speaker=Speaker1 Mode=Planned
Fidelity=High>
</Segment>
</Section>
B.2 Initial Segmentation Files
The following are the initial segmentation files generated by the processing script
for the transcript. The first five are all obtained by processing the SGML tags given in
the transcript. The final segmentation file, test_silence.seg, is obtained by processing
the cross-talk output, as described in Section 3.3.
<<test_speech.seg>>
0 3 Speech
6 8 Speech
12 17 Speech
18 19 Speech
20 22 Speech
24 25 Speech
<<test_music.seg>>
0 1 Music
4 5 Music
10 14 Music
<<test_back_other.seg>>
1 2 Other
9 10 Other
15 16 Other
19 20 Other
23 24 Other
<<test_back_speech.seg>>
2 4 BS
14 15 BS
18 19 BS
<<test_commercial.seg>>
6 9 Commercial
<<test_silence.seg>>
3 4 Silence
5 6 Silence
9 10 Silence
22 23 Silence
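To make the derivation of these files concrete, the sketch below shows how the foreground speech file test_speech.seg could be produced from the Segment tags of test.txt. It is an assumed, simplified stand-in for the actual processing script of Section 3.3; the Background-derived files additionally require tracking when each background condition turns on and off, and the silence file comes from the cross-talk output rather than from the tags.

import re

# Simplified stand-in (not the Section 3.3 script): pull Stime/Etime from every
# Segment beginning mark in test.txt and write "start end Speech" lines.
SEG_RE = re.compile(r"<Segment\s+Stime=(\S+)\s+Etime=(\S+)")

def speech_segments(lines):
    for line in lines:
        m = SEG_RE.search(line)
        if m:
            yield float(m.group(1)), float(m.group(2))

with open("test.txt") as fin, open("test_speech.seg", "w") as fout:
    for start, end in speech_segments(fin):
        fout.write(f"{start:g} {end:g} Speech\n")
# test_speech.seg now contains the six lines listed above (0 3, 6 8, 12 17, ...).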
B.3 Result Segmentation
These are the output files from the overlap-removal and overlap-obtaining
processes, as described in Section 3.3 and illustrated in Figure 4.
<<test_speech_only.seg>>
16 17 Speech
20 22 Speech
24 25 Speech
<<test_music_only.seg>>
4 5 Music
10 12 Music
<<test_noise_only.seg>>
19 20 Other
<<test_speech_music_overlap.seg>>
0 1 speech+music
12 14 speech+music
<<test_speech_noise_overlap.seg>>
1 2 speech+overlap
2 4 BS
14 15 BS
15 16 speech+overlap
18 19 BS
These files are then converted to mark format and saved according to the given
convention. These marks can now be associated with their respective sphere files.
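The overlap-removal and overlap-obtaining steps of Section 3.3 amount to set operations on the time intervals in these files. The following is a minimal sketch under that interpretation, not the thesis code; the interval lists are copied from the Appendix B listings above, and the printed results match test_speech_music_overlap.seg, test_music_only.seg, and test_speech_only.seg.

def intersect(a, b):
    """Pairwise overlap of two interval lists, e.g. speech overlapped with music."""
    out = []
    for s1, e1 in a:
        for s2, e2 in b:
            lo, hi = max(s1, s2), min(e1, e2)
            if lo < hi:
                out.append((lo, hi))
    return sorted(out)

def subtract(a, b):
    """Remove every interval in b from the intervals in a."""
    out = []
    for s1, e1 in a:
        pieces = [(s1, e1)]
        for s2, e2 in b:
            next_pieces = []
            for s, e in pieces:
                if e <= s2 or s >= e2:        # no overlap with this interval
                    next_pieces.append((s, e))
                else:                         # clip out the overlapped part
                    if s < s2:
                        next_pieces.append((s, s2))
                    if e > e2:
                        next_pieces.append((e2, e))
            pieces = next_pieces
        out.extend(pieces)
    return sorted(out)

speech = [(0, 3), (6, 8), (12, 17), (18, 19), (20, 22), (24, 25)]
music = [(0, 1), (4, 5), (10, 14)]
other = [(1, 2), (9, 10), (15, 16), (19, 20), (23, 24)]
bs = [(2, 4), (14, 15), (18, 19)]
commercial = [(6, 9)]

print(intersect(speech, music))                          # [(0, 1), (12, 14)]
print(subtract(music, speech))                           # [(4, 5), (10, 12)]
print(subtract(speech, music + other + bs + commercial)) # [(16, 17), (20, 22), (24, 25)]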
[Figure 4 appears here: a timeline from 0 to 25 seconds showing the three categorical segmentations (speech S, music M, other O), a combined column, a filtered column, and the final no-overlap and overlap labels (S, M, O, S+M, S+O).]
Figure 4: Segmentation Process. This is a graphical representation of what occurs in the segmentation process. Although some of these columns do not appear exactly as represented, and the steps involved are somewhat more complicated than this, it is helpful to think of the segmentation process in this way. The first group of three columns contains the three main categorical segmentation files - speech, music, and noise. When these are appended together and sorted, column four - "combined" - is the result. The second set of three columns contains the set of filtered files. When column four is filtered by these three, column five - "filtered" - is the result. The last set of columns contains the final segmentation files for the five categories of interest. These are obtained by extracting the appropriate labels out of the filtered column. See the earlier sections of Appendix B for a more detailed, complete, and accurate description of the process.